LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] powerpc/powernv: reduce multi-hit of iommu_add_device()
From: Gavin Shan @ 2014-04-30  0:28 UTC (permalink / raw)
  To: Wei Yang; +Cc: linuxppc-dev, Alexey Kardashevskiy, gwshan
In-Reply-To: <20140429064955.GA5066@richard>

On Tue, Apr 29, 2014 at 02:49:55PM +0800, Wei Yang wrote:
>On Mon, Apr 28, 2014 at 11:35:32PM +1000, Alexey Kardashevskiy wrote:
>>On 04/23/2014 12:26 PM, Wei Yang wrote:

.../...

>Generally, when kernel enumerate on the pci device, following functions will
>be invoked.
>
>     pci_device_add
>        pcibios_setup_bus_device
>        ...
>        set_iommu_table_base_and_group   
>     device_add
>        ...
>        tce_iommu_bus_notifier           
>     pcibios_fixup_bus
>        pcibios_add_pci_devices
>        ...
>        pcibios_setup_bus_devices        
>
>From the call flow, we see for a normall pci device, the
>pcibios_setup_bus_device() will be invoked twice.
>
>At the bootup time, none of them succeed to setup the dma, since the PE is not
>assigned or the tbl is not set. The iommu tbl and group is setup in
>pnv_pci_ioda_setup_DMA().
>

Yes, we don't assign PE# for PCI devices until ppc_md.pcibios_fixup().
We gets IOMMU group and IOMMU group device registered in ppc_md.pcibios_fixup().

As Alexy already pointed out, "tce_iommu_bus_notifier" doesn't take effect
during system boot stage.

>This call flow maintains the same when EEH error happens on Bus PE, while the
>behavior is a little different. 
>
>     pci_device_add
>        pcibios_setup_bus_device
>        ...
>        set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized
>     device_add
>        ...
>        tce_iommu_bus_notifier           <- succeed
>     pcibios_fixp_bus
>        pcibios_add_pci_devices
>        ...
>        pcibios_setup_bus_devices        <- warning, re-attach
>
>While this call flow will change a little on a VF. For a VF,
>pcibios_fixp_bus() will not be invoked. Current behavior is this.
>
>     pci_device_add
>        pcibios_setup_bus_device
>        ...
>        set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized
>     device_add
>        ...
>        tce_iommu_bus_notifier           <- succeed
>

It seems that we have 2 problems here:

- For non-SRIOV case, pcibios_setup_device() is called for towice. That
  seems incorrect. We could simply remove pcibios_setup_bus_devices()
  from pcibios_fixup_bus().

- It's too early to register IOMMU group/device in pnv_pci_ioda_dma_dev_setup()
  because the sysfs entries of the PCI device aren't finalized yet. So we could
  remove all logic we have in pnv_pci_ioda_dma_dev_setup() and just purely rely
  on "tce_iommu_bus_notifier".

By the way, I never tried EEH on SRIOV PF/VFs. However, I never hit similar
issue in non-SRIOV cases.

Thanks,
Gavin

^ permalink raw reply

* [PATCH 2/2] powerpc: fix smp_processor_id() in preemptible splat in set_breakpoint
From: Paul Gortmaker @ 2014-04-29 19:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: Paul Gortmaker, linuxppc-dev, Steven Rostedt
In-Reply-To: <1398799517-47879-1-git-send-email-paul.gortmaker@windriver.com>

Currently, on 8641D, which doesn't set CONFIG_HAVE_HW_BREAKPOINT
we get the following splat:

BUG: using smp_processor_id() in preemptible [00000000] code: login/1382
caller is set_breakpoint+0x1c/0xa0
CPU: 0 PID: 1382 Comm: login Not tainted 3.15.0-rc3-00041-g2aafe1a4d451 #1
Call Trace:
[decd5d80] [c0008dc4] show_stack+0x50/0x158 (unreliable)
[decd5dc0] [c03c6fa0] dump_stack+0x7c/0xdc
[decd5de0] [c01f8818] check_preemption_disabled+0xf4/0x104
[decd5e00] [c00086b8] set_breakpoint+0x1c/0xa0
[decd5e10] [c00d4530] flush_old_exec+0x2bc/0x588
[decd5e40] [c011c468] load_elf_binary+0x2ac/0x1164
[decd5ec0] [c00d35f8] search_binary_handler+0xc4/0x1f8
[decd5ef0] [c00d4ee8] do_execve+0x3d8/0x4b8
[decd5f40] [c001185c] ret_from_syscall+0x0/0x38
 --- Exception: c01 at 0xfeee554
    LR = 0xfeee7d4

The call path in this case is:

	flush_thread
	   --> set_debug_reg_defaults
	     --> set_breakpoint
	       --> __get_cpu_var

Since preemption is enabled in the cleanup of flush thread, and
there is no need to disable it, introduce the distinction between
set_breakpoint and __set_breakpoint, leaving only the flush_thread
instance as the current user of set_breakpoint.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 arch/powerpc/include/asm/debug.h         |  1 +
 arch/powerpc/include/asm/hw_breakpoint.h |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c      |  8 ++++----
 arch/powerpc/kernel/process.c            | 11 +++++++++--
 arch/powerpc/kernel/signal.c             |  2 +-
 arch/powerpc/xmon/xmon.c                 |  2 +-
 6 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 1d7f966d3b18..a954e4975049 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -47,6 +47,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
 void set_breakpoint(struct arch_hw_breakpoint *brk);
+void __set_breakpoint(struct arch_hw_breakpoint *brk);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
 			 unsigned long error_code, int signal_code, int brkpt);
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
index eb0f4ac75c4c..ac6432d9be46 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -79,7 +79,7 @@ static inline void hw_breakpoint_disable(void)
 	brk.address = 0;
 	brk.type = 0;
 	brk.len = 0;
-	set_breakpoint(&brk);
+	__set_breakpoint(&brk);
 }
 extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs);
 
diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index b0a1792279bb..0bb5918faaaf 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -72,7 +72,7 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
 	 * If so, DABR will be populated in single_step_dabr_instruction().
 	 */
 	if (current->thread.last_hit_ubp != bp)
-		set_breakpoint(info);
+		__set_breakpoint(info);
 
 	return 0;
 }
@@ -198,7 +198,7 @@ void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs)
 
 	info = counter_arch_bp(tsk->thread.last_hit_ubp);
 	regs->msr &= ~MSR_SE;
-	set_breakpoint(info);
+	__set_breakpoint(info);
 	tsk->thread.last_hit_ubp = NULL;
 }
 
@@ -284,7 +284,7 @@ int __kprobes hw_breakpoint_handler(struct die_args *args)
 	if (!(info->type & HW_BRK_TYPE_EXTRANEOUS_IRQ))
 		perf_bp_event(bp, regs);
 
-	set_breakpoint(info);
+	__set_breakpoint(info);
 out:
 	rcu_read_unlock();
 	return rc;
@@ -316,7 +316,7 @@ int __kprobes single_step_dabr_instruction(struct die_args *args)
 	if (!(info->type & HW_BRK_TYPE_EXTRANEOUS_IRQ))
 		perf_bp_event(bp, regs);
 
-	set_breakpoint(info);
+	__set_breakpoint(info);
 	current->thread.last_hit_ubp = NULL;
 
 	/*
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index c4e3d50e86f8..bc2d668b68ed 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -495,7 +495,7 @@ static inline int set_dawr(struct arch_hw_breakpoint *brk)
 	return 0;
 }
 
-void set_breakpoint(struct arch_hw_breakpoint *brk)
+void __set_breakpoint(struct arch_hw_breakpoint *brk)
 {
 	__get_cpu_var(current_brk) = *brk;
 
@@ -505,6 +505,13 @@ void set_breakpoint(struct arch_hw_breakpoint *brk)
 		set_dabr(brk);
 }
 
+void set_breakpoint(struct arch_hw_breakpoint *brk)
+{
+	preempt_disable();
+	__set_breakpoint(brk);
+	preempt_enable();
+}
+
 #ifdef CONFIG_PPC64
 DEFINE_PER_CPU(struct cpu_usage, cpu_usage_array);
 #endif
@@ -834,7 +841,7 @@ struct task_struct *__switch_to(struct task_struct *prev,
  */
 #ifndef CONFIG_HAVE_HW_BREAKPOINT
 	if (unlikely(!hw_brk_match(&__get_cpu_var(current_brk), &new->thread.hw_brk)))
-		set_breakpoint(&new->thread.hw_brk);
+		__set_breakpoint(&new->thread.hw_brk);
 #endif /* CONFIG_HAVE_HW_BREAKPOINT */
 #endif
 
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 8fc4177ed65a..1c794cef2883 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -134,7 +134,7 @@ static int do_signal(struct pt_regs *regs)
 	 */
 	if (current->thread.hw_brk.address &&
 		current->thread.hw_brk.type)
-		set_breakpoint(&current->thread.hw_brk);
+		__set_breakpoint(&current->thread.hw_brk);
 #endif
 	/* Re-enable the breakpoints for the signal stack */
 	thread_change_pc(current, regs);
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 08504e75b2c7..d3759b7a5535 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -759,7 +759,7 @@ static void insert_cpu_bpts(void)
 		brk.address = dabr.address;
 		brk.type = (dabr.enabled & HW_BRK_TYPE_DABR) | HW_BRK_TYPE_PRIV_ALL;
 		brk.len = 8;
-		set_breakpoint(&brk);
+		__set_breakpoint(&brk);
 	}
 	if (iabr && cpu_has_feature(CPU_FTR_IABR))
 		mtspr(SPRN_IABR, iabr->address
-- 
1.9.0

^ permalink raw reply related

* [PATCH 0/2] Fix smp_processor_id() in preemptible splat
From: Paul Gortmaker @ 2014-04-29 19:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: Paul Gortmaker, linuxppc-dev, Steven Rostedt

This seems to have been around for a while; I tripped over it while
working on testing 3.10-rt with Steven, and then confirmed it still
was an issue on today's mainline. (didn't check -next however...)

I didn't bisect, but it looks like it may have come in from the
changes in v3.9-rc1~100^2~57.  It probably only impacts the older
cores like the 74xx that don't have CONFIG_HAVE_HW_BREAKPOINT,
and I know I haven't booted my sbc8641D in over a year, so no
surprise this went undetected for a while.  Maybe adding a Cc:
stable is warranted?

It is pretty hard to ignore the splats since you get one per process,
so it must be just the older platform never getting much testing.

The 1st patch is just a small cleanup that makes the 2nd patch
a bit smaller.

If the __set_breakpoint() addition isn't liked, then we could
alternatively just wrap the one call site with preempt_dis/en/able.

Paul.
---

Paul Gortmaker (2):
  powerpc: drop return value from set_breakpoint as it is unused
  powerpc: fix smp_processor_id() in preemptible splat in set_breakpoint

 arch/powerpc/include/asm/debug.h         |  3 ++-
 arch/powerpc/include/asm/hw_breakpoint.h |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c      |  8 ++++----
 arch/powerpc/kernel/process.c            | 15 +++++++++++----
 arch/powerpc/kernel/signal.c             |  2 +-
 arch/powerpc/xmon/xmon.c                 |  2 +-
 6 files changed, 20 insertions(+), 12 deletions(-)

-- 
1.9.0

^ permalink raw reply

* [PATCH] powerpc: memcpy optimization for 64bit LE
From: Anton Blanchard @ 2014-04-29 23:12 UTC (permalink / raw)
  To: benh, paulus, felix; +Cc: linuxppc-dev
In-Reply-To: <20140430091054.4de84c9b@kryten>

From: Philippe Bergheaud <felix@linux.vnet.ibm.com>

Unaligned stores take alignment exceptions on POWER7 running in little-endian.
This is a dumb little-endian base memcpy that prevents unaligned stores.
Once booted the feature fixup code switches over to the VMX copy loops
(which are already endian safe).

The question is what we do before that switch over. The base 64bit
memcpy takes alignment exceptions on POWER7 so we can't use it as is.
Fixing the causes of alignment exception would slow it down, because
we'd need to ensure all loads and stores are aligned either through
rotate tricks or bytewise loads and stores. Either would be bad for
all other 64bit platforms.

[ I simplified the loop a bit - Anton ]

Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
---
 arch/powerpc/include/asm/string.h |  4 ----
 arch/powerpc/kernel/ppc_ksyms.c   |  2 --
 arch/powerpc/lib/Makefile         |  2 --
 arch/powerpc/lib/memcpy_64.S      | 16 ++++++++++++++++
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
index 0dffad6..e40010a 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -10,9 +10,7 @@
 #define __HAVE_ARCH_STRNCMP
 #define __HAVE_ARCH_STRCAT
 #define __HAVE_ARCH_MEMSET
-#ifdef __BIG_ENDIAN__
 #define __HAVE_ARCH_MEMCPY
-#endif
 #define __HAVE_ARCH_MEMMOVE
 #define __HAVE_ARCH_MEMCMP
 #define __HAVE_ARCH_MEMCHR
@@ -24,9 +22,7 @@ extern int strcmp(const char *,const char *);
 extern int strncmp(const char *, const char *, __kernel_size_t);
 extern char * strcat(char *, const char *);
 extern void * memset(void *,int,__kernel_size_t);
-#ifdef __BIG_ENDIAN__
 extern void * memcpy(void *,const void *,__kernel_size_t);
-#endif
 extern void * memmove(void *,const void *,__kernel_size_t);
 extern int memcmp(const void *,const void *,__kernel_size_t);
 extern void * memchr(const void *,int,__kernel_size_t);
diff --git a/arch/powerpc/kernel/ppc_ksyms.c b/arch/powerpc/kernel/ppc_ksyms.c
index 3bd77ed..8de1c48 100644
--- a/arch/powerpc/kernel/ppc_ksyms.c
+++ b/arch/powerpc/kernel/ppc_ksyms.c
@@ -154,9 +154,7 @@ EXPORT_SYMBOL(__cmpdi2);
 #endif
 long long __bswapdi2(long long);
 EXPORT_SYMBOL(__bswapdi2);
-#ifdef __BIG_ENDIAN__
 EXPORT_SYMBOL(memcpy);
-#endif
 EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(memcmp);
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 95a20e1..59fa2de 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -23,9 +23,7 @@ obj-y			+= checksum_$(CONFIG_WORD_SIZE).o
 obj-$(CONFIG_PPC64)	+= checksum_wrappers_64.o
 endif
 
-ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),)
 obj-$(CONFIG_PPC64)		+= memcpy_power7.o memcpy_64.o 
-endif
 
 obj-$(CONFIG_PPC_EMULATE_SSTEP)	+= sstep.o ldstfp.o
 
diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S
index 72ad055..dc4ba79 100644
--- a/arch/powerpc/lib/memcpy_64.S
+++ b/arch/powerpc/lib/memcpy_64.S
@@ -12,12 +12,27 @@
 	.align	7
 _GLOBAL(memcpy)
 BEGIN_FTR_SECTION
+#ifdef __LITTLE_ENDIAN__
+	cmpdi	cr7,r5,0
+#else
 	std	r3,48(r1)	/* save destination pointer for return value */
+#endif
 FTR_SECTION_ELSE
 #ifndef SELFTEST
 	b	memcpy_power7
 #endif
 ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
+#ifdef __LITTLE_ENDIAN__
+	/* dumb little-endian memcpy that will get replaced at runtime */
+	addi r9,r3,-1
+	addi r4,r4,-1
+	beqlr cr7
+	mtctr r5
+1:	lbzu r10,1(r4)
+	stbu r10,1(r9)
+	bdnz 1b
+	blr
+#else
 	PPC_MTOCRF(0x01,r5)
 	cmpldi	cr1,r5,16
 	neg	r6,r3		# LS 3 bits = # bytes to 8-byte dest bdry
@@ -203,3 +218,4 @@ END_FTR_SECTION_IFCLR(CPU_FTR_UNALIGNED_LD_STD)
 	stb	r0,0(r3)
 4:	ld	r3,48(r1)	/* return dest pointer */
 	blr
+#endif
-- 
1.9.1

^ permalink raw reply related

* [PATCH] powerpc: memcpy optimization for 64bit LE
From: Anton Blanchard @ 2014-04-29 23:10 UTC (permalink / raw)
  To: benh, paulus, felix, linuxppc-dev

Unaligned stores take alignment exceptions on POWER7 running in little-endian.
This is a dumb little-endian base memcpy that prevents unaligned stores.
Once booted the feature fixup code switches over to the VMX copy loops
(which are already endian safe).

The question is what we do before that switch over. The base 64bit
memcpy takes alignment exceptions on POWER7 so we can't use it as is.
Fixing the causes of alignment exception would slow it down, because
we'd need to ensure all loads and stores are aligned either through
rotate tricks or bytewise loads and stores. Either would be bad for
all other 64bit platforms.

[ I simplified the loop a bit - Anton ]

Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
---
 arch/powerpc/include/asm/string.h |  4 ----
 arch/powerpc/kernel/ppc_ksyms.c   |  2 --
 arch/powerpc/lib/Makefile         |  2 --
 arch/powerpc/lib/memcpy_64.S      | 16 ++++++++++++++++
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
index 0dffad6..e40010a 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -10,9 +10,7 @@
 #define __HAVE_ARCH_STRNCMP
 #define __HAVE_ARCH_STRCAT
 #define __HAVE_ARCH_MEMSET
-#ifdef __BIG_ENDIAN__
 #define __HAVE_ARCH_MEMCPY
-#endif
 #define __HAVE_ARCH_MEMMOVE
 #define __HAVE_ARCH_MEMCMP
 #define __HAVE_ARCH_MEMCHR
@@ -24,9 +22,7 @@ extern int strcmp(const char *,const char *);
 extern int strncmp(const char *, const char *, __kernel_size_t);
 extern char * strcat(char *, const char *);
 extern void * memset(void *,int,__kernel_size_t);
-#ifdef __BIG_ENDIAN__
 extern void * memcpy(void *,const void *,__kernel_size_t);
-#endif
 extern void * memmove(void *,const void *,__kernel_size_t);
 extern int memcmp(const void *,const void *,__kernel_size_t);
 extern void * memchr(const void *,int,__kernel_size_t);
diff --git a/arch/powerpc/kernel/ppc_ksyms.c b/arch/powerpc/kernel/ppc_ksyms.c
index 3bd77ed..8de1c48 100644
--- a/arch/powerpc/kernel/ppc_ksyms.c
+++ b/arch/powerpc/kernel/ppc_ksyms.c
@@ -154,9 +154,7 @@ EXPORT_SYMBOL(__cmpdi2);
 #endif
 long long __bswapdi2(long long);
 EXPORT_SYMBOL(__bswapdi2);
-#ifdef __BIG_ENDIAN__
 EXPORT_SYMBOL(memcpy);
-#endif
 EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(memcmp);
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 95a20e1..59fa2de 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -23,9 +23,7 @@ obj-y			+= checksum_$(CONFIG_WORD_SIZE).o
 obj-$(CONFIG_PPC64)	+= checksum_wrappers_64.o
 endif
 
-ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),)
 obj-$(CONFIG_PPC64)		+= memcpy_power7.o memcpy_64.o 
-endif
 
 obj-$(CONFIG_PPC_EMULATE_SSTEP)	+= sstep.o ldstfp.o
 
diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S
index 72ad055..dc4ba79 100644
--- a/arch/powerpc/lib/memcpy_64.S
+++ b/arch/powerpc/lib/memcpy_64.S
@@ -12,12 +12,27 @@
 	.align	7
 _GLOBAL(memcpy)
 BEGIN_FTR_SECTION
+#ifdef __LITTLE_ENDIAN__
+	cmpdi	cr7,r5,0
+#else
 	std	r3,48(r1)	/* save destination pointer for return value */
+#endif
 FTR_SECTION_ELSE
 #ifndef SELFTEST
 	b	memcpy_power7
 #endif
 ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
+#ifdef __LITTLE_ENDIAN__
+	/* dumb little-endian memcpy that will get replaced at runtime */
+	addi r9,r3,-1
+	addi r4,r4,-1
+	beqlr cr7
+	mtctr r5
+1:	lbzu r10,1(r4)
+	stbu r10,1(r9)
+	bdnz 1b
+	blr
+#else
 	PPC_MTOCRF(0x01,r5)
 	cmpldi	cr1,r5,16
 	neg	r6,r3		# LS 3 bits = # bytes to 8-byte dest bdry
@@ -203,3 +218,4 @@ END_FTR_SECTION_IFCLR(CPU_FTR_UNALIGNED_LD_STD)
 	stb	r0,0(r3)
 4:	ld	r3,48(r1)	/* return dest pointer */
 	blr
+#endif
-- 
1.9.1

^ permalink raw reply related

* [PATCH 1/2] powerpc: drop return value from set_breakpoint as it is unused
From: Paul Gortmaker @ 2014-04-29 19:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: Paul Gortmaker, linuxppc-dev, Steven Rostedt
In-Reply-To: <1398799517-47879-1-git-send-email-paul.gortmaker@windriver.com>

None of the callers check the return value, so it might as
well not have one at all.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 arch/powerpc/include/asm/debug.h | 2 +-
 arch/powerpc/kernel/process.c    | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index d2516308ed1e..1d7f966d3b18 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,7 +46,7 @@ static inline int debugger_break_match(struct pt_regs *regs) { return 0; }
 static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
-int set_breakpoint(struct arch_hw_breakpoint *brk);
+void set_breakpoint(struct arch_hw_breakpoint *brk);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
 			 unsigned long error_code, int signal_code, int brkpt);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 31d021506d21..c4e3d50e86f8 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -495,14 +495,14 @@ static inline int set_dawr(struct arch_hw_breakpoint *brk)
 	return 0;
 }
 
-int set_breakpoint(struct arch_hw_breakpoint *brk)
+void set_breakpoint(struct arch_hw_breakpoint *brk)
 {
 	__get_cpu_var(current_brk) = *brk;
 
 	if (cpu_has_feature(CPU_FTR_DAWR))
-		return set_dawr(brk);
-
-	return set_dabr(brk);
+		set_dawr(brk);
+	else
+		set_dabr(brk);
 }
 
 #ifdef CONFIG_PPC64
-- 
1.9.0

^ permalink raw reply related

* Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM
From: Rafael J. Wysocki @ 2014-04-29 23:07 UTC (permalink / raw)
  To: Scott Wood
  Cc: Zhao Chenhui, linux-pm, Dongsheng Wang, Leo Li,
	正雄 金, linuxppc-dev
In-Reply-To: <1398811632.24575.98.camel@snotra.buserror.net>

On Tuesday, April 29, 2014 05:47:12 PM Scott Wood wrote:
> On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote:
> > On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood <scottwood@freescale.com> wrote:
> > > On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote:
> > >> From: Wang Dongsheng <dongsheng.wang@freescale.com>
> > >>
> > >> Add set_pm_suspend_state & pm_suspend_state functions to set/get
> > >> suspend state. When system going to sleep or deep sleep, devices
> > >> can get the system suspend state(STANDBY/MEM) through pm_suspend_state
> > >> function and to handle different situations.
> > >>
> > >> Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
> > >> ---
> > >> *v2*
> > >> Move pm api from fsl platform to powerpc general framework.
> > >
> > > What is powerpc-specific about this?
> > 
> > Generally I agree with you.  But I had the discussion about this topic
> > a while ago with the PM maintainer.  He suggestion to go with the
> > platform way.
> > 
> > https://lkml.org/lkml/2013/8/16/505
> 
> If what he meant was whether you could do what this patch does, then you
> can answer him with, "No, because it got nacked as not being platform or
> arch specific."  Oh, and you're still using .valid as the hook to set
> the platform state, which is awful -- I think .begin is what you want to
> use.
> 
> If we did it in powerpc code, then what would we do on ARM?  Copy the
> code?  No.
> 
> Now, a more legitimate objection to putting it in generic code might be
> that "standby" and "mem" are loosely defined and the knowledge of how a
> driver should react to each is platform specific -- but your patch
> doesn't address that.  You still have the driver itself interpret what
> "standby" and "mem" mean.
> 
> As for "in analogy with ACPI suspend operations", can someone familiar
> with ACPI explain what is meant?

ACPI defines sleep states S3 and S1 which are mappend to "mem" and
"standby" currently, but that may change in the future.

Generally speaking the meaning of "mem" and "standby" is platform-specific
except that "mem" should be a deeper (lower-power) sleep state than "standby".

Thanks!

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM
From: Scott Wood @ 2014-04-29 22:47 UTC (permalink / raw)
  To: Leo Li
  Cc: Zhao Chenhui, linux-pm, Rafael J. Wysocki, Dongsheng Wang,
	正雄 金, linuxppc-dev
In-Reply-To: <CADRPPNTGRyLSSjcdpdZL95NVnRepR_ZqvVk1zrMU3ZOL0_bbNQ@mail.gmail.com>

On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote:
> On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood <scottwood@freescale.com> wrote:
> > On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote:
> >> From: Wang Dongsheng <dongsheng.wang@freescale.com>
> >>
> >> Add set_pm_suspend_state & pm_suspend_state functions to set/get
> >> suspend state. When system going to sleep or deep sleep, devices
> >> can get the system suspend state(STANDBY/MEM) through pm_suspend_state
> >> function and to handle different situations.
> >>
> >> Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
> >> ---
> >> *v2*
> >> Move pm api from fsl platform to powerpc general framework.
> >
> > What is powerpc-specific about this?
> 
> Generally I agree with you.  But I had the discussion about this topic
> a while ago with the PM maintainer.  He suggestion to go with the
> platform way.
> 
> https://lkml.org/lkml/2013/8/16/505

If what he meant was whether you could do what this patch does, then you
can answer him with, "No, because it got nacked as not being platform or
arch specific."  Oh, and you're still using .valid as the hook to set
the platform state, which is awful -- I think .begin is what you want to
use.

If we did it in powerpc code, then what would we do on ARM?  Copy the
code?  No.

Now, a more legitimate objection to putting it in generic code might be
that "standby" and "mem" are loosely defined and the knowledge of how a
driver should react to each is platform specific -- but your patch
doesn't address that.  You still have the driver itself interpret what
"standby" and "mem" mean.

As for "in analogy with ACPI suspend operations", can someone familiar
with ACPI explain what is meant?

-Scott

^ permalink raw reply

* Re: [PATCH] powerpc/fsl-rio: Fix fsl_rio_setup error paths and use-after-unmap
From: Scott Wood @ 2014-04-29 17:12 UTC (permalink / raw)
  To: Liu Gang-B34182; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <dd70c20406d242b694b16e898cc2cddf@BY2PR03MB412.namprd03.prod.outlook.com>

On Mon, 2014-04-28 at 23:04 -0500, Liu Gang-B34182 wrote:
> > -----Original Message-----
> > From: Wood Scott-B07421
> > Sent: Tuesday, April 29, 2014 9:32 AM
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: Wood Scott-B07421; Liu Gang-B34182
> > Subject: [PATCH] powerpc/fsl-rio: Fix fsl_rio_setup error paths and use-
> > after-unmap
> > 
> > Several of the error paths from fsl_rio_setup are missing error messages.
> > 
> > Worse, fsl_rio_setup initializes several global pointers and does not
> > NULL them out after freeing/unmapping on error.  This caused
> > fsl_rio_mcheck_exception() to crash when accessing rio_regs_win which was
> > non-NULL but had been unmapped.
> > 
> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> > Cc: Liu Gang <Gang.Liu@freescale.com>
> > ---
> > Liu Gang, are you sure all of these error conditions are fatal?  Why does
> > the rio driver fail if rmu is not present (e.g.  on t4240)?
> 
> Hi Scott, I think the errors you modified in the patch are serious and should
> be fixed. Thanks very much!
> And in fact, the rio driver can be used just for the submodule of the SRIO: RMU.
> It should be used with arch/powerpc/sysdev/fsl_rmu.c and there should have the
> RMU module.
> The fsl_rio.c implements some basic and needed works to support the RMU running well.

I don't quite follow -- is it expected that rio can work without rmu, or
not?  As is, fsl_rio_setup() will error out if it doesn't find an
fsl,srio-rmu-handle property.

> In addition, could you please help to apply the following patch:
> 
> http://patchwork.ozlabs.org/patch/337810/

Yes, I'll apply it when I do my next batch of patch applications.

-Scott

^ permalink raw reply

* Re: [PATCH 1/1] powerpc: crtsaveres.o needed only when -Os flag is enabled
From: Brian W Hart @ 2014-04-29 15:38 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1398729908-15787-1-git-send-email-linuxram@us.ibm.com>

On Mon, Apr 28, 2014 at 05:05:08PM -0700, Ram Pai wrote:
>     powerpc: crtsaveres.o needed only when -Os flag is enabled
>     
>     Currently on powerpc arch, out-of-tree module fails to build without
>     crtsaveres.o, even when the module has no dependency on the symbols
>     provided by the file; when built without the -Os flag.
>     
>     BTW: '-Os' flag is enabled when CONFIG_CC_OPTIMIZE_FOR_SIZE is
>     configured.
>     
>     This patch fixes that problem.
>     
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> 
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 4c0cedf..cf12f38 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -157,7 +157,10 @@ CPP		= $(CC) -E $(KBUILD_CFLAGS)
>  
>  CHECKFLAGS	+= -m$(CONFIG_WORD_SIZE) -D__powerpc__ -D__powerpc$(CONFIG_WORD_SIZE)__
>  
> +ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
>  KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
> +endif
> +
>  
>  # No AltiVec or VSX instructions when building kernel
>  KBUILD_CFLAGS += $(call cc-option,-mno-altivec)

I didn't try building a kernel or in-tree modules, but I confirmed
that it allows building of out-of-tree modules when crtsavres.o is
not present (e.g. as for a distro install where the kernel headers
are provided by package, rather than being manually prepared from
the sources).

Tested-by: Brian W Hart <hartb@linux.vnet.ibm.com>

^ permalink raw reply

* Re: [PATCH v2 00/21] FDT clean-ups and libfdt support
From: Grant Likely @ 2014-04-29 14:00 UTC (permalink / raw)
  To: Rob Herring, linux-kernel, devicetree
  Cc: linux-mips, Aurelien Jacquiot, H. Peter Anvin, Max Filippov,
	Paul Mackerras, linux, Jonas Bonn, Rob Herring, Russell King,
	linux-c6x-dev, x86, Ingo Molnar, Mark Salter, linux-xtensa,
	James Hogan, Thomas Gleixner, linux-metag, linux-arm-kernel,
	Chris Zankel, Michal Simek, Vineet Gupta, Ralf Baechle,
	linuxppc-dev
In-Reply-To: <1398215901-25609-1-git-send-email-robherring2@gmail.com>

On Tue, 22 Apr 2014 20:18:00 -0500, Rob Herring <robherring2@gmail.com> wrote:
> From: Rob Herring <robh@kernel.org>
> 
> This is a series of clean-ups of architecture FDT code and converts the
> core FDT code over to using libfdt functions. This is in preparation
> to add FDT based address translation parsing functions for early
> console support. This series removes direct access to FDT data from all
> arches except powerpc.
> 
> The current MIPS lantiq and xlp DT code is buggy as built-in DTBs need
> to be copied out of init section. Patches 2 and 3 should be applied to
> 3.15.
> 
> Changes in v2 are relatively minor. There was a bug in the unflattening
> code where walking up the tree was not being handled correctly (thanks
> to Michal Simek). I re-worked things a bit to avoid globally adding
> libfdt include paths.
> 
> A branch is available here[1], and I plan to put into linux-next in a few
> days. Please test! I've compiled on arm, arm64, mips, microblaze, xtensa,
> and powerpc and booted on arm and arm64.

This is pretty great work. I'll read through them again and I may have a
comment or two, but in general you can add my tested by tag:

Tested-by: Grant Likely <grant.likely@linaro.org>

>  40 files changed, 280 insertions(+), 598 deletions(-)

I love the diffstat!

g.

^ permalink raw reply

* Re: [PATCH 1/2] powerpc/powernv: reduce multi-hit of iommu_add_device()
From: Alexey Kardashevskiy @ 2014-04-29 13:11 UTC (permalink / raw)
  To: Wei Yang, Alexey Kardashevskiy; +Cc: linuxppc-dev, gwshan
In-Reply-To: <20140429093719.GB8028@richard>

On 04/29/2014 07:37 PM, Wei Yang wrote:
> On Tue, Apr 29, 2014 at 05:55:48PM +1000, Alexey Kardashevskiy wrote:
>> On 04/29/2014 04:49 PM, Wei Yang wrote:
>>> On Mon, Apr 28, 2014 at 11:35:32PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/23/2014 12:26 PM, Wei Yang wrote:
>>>>> During the EEH hotplug event, iommu_add_device() will be invoked three times
>>>>> and two of them will trigger warning or error.
>>>>>
>>>>> The three times to invoke the iommu_add_device() are:
>>>>>
>>>>>     pci_device_add
>>>>>        ...
>>>>>        set_iommu_table_base_and_group   <- 1st time, fail
>>>>>     device_add
>>>>>        ...
>>>>>        tce_iommu_bus_notifier           <- 2nd time, succees
>>>>>     pcibios_add_pci_devices
>>>>>        ...
>>>>>        pcibios_setup_bus_devices        <- 3rd time, re-attach
>>>>>
>>>>> The first time fails, since the dev->kobj->sd is not initialized. The
>>>>> dev->kobj->sd is initialized in device_add().
>>>>> The third time's warning is triggered by the re-attach of the iommu_group.
>>>>>
>>>>> After applying this patch, the error
>>>>
>>>> Nack.
>>>>
>>>> The patch still seems incorrect and we actually need to remove
>>>> tce_iommu_bus_notifier completely as pcibios_setup_bus_devices is called
>>> >from another notifier anyway. Could you please test it?
>>>>
>>>>
>>>
>>> Hi, Alexey,
>>>
>>> Nice to see your comment. Let me show what I got fist.
>>>
>>> Generally, when kernel enumerate on the pci device, following functions will
>>> be invoked.
>>>
>>>      pci_device_add
>>>         pcibios_setup_bus_device
>>>         ...
>>>         set_iommu_table_base_and_group   
>>>      device_add
>>>         ...
>>>         tce_iommu_bus_notifier           
>>>      pcibios_fixp_bus
>>>         pcibios_add_pci_devices
>>>         ...
>>>         pcibios_setup_bus_devices        
>>>
>>> >From the call flow, we see for a normall pci device, the
>>> pcibios_setup_bus_device() will be invoked twice.
>>
>>
>> tce_iommu_bus_notifier should not be called for devices during boot-time
>> PCI discovery as the discovery is done from subsys_initcall handler and
>> tce_iommu_bus_notifier is registered as subsys_initcall_sync. Are you sure
>> this is happening? You should see warnings as for PF's EEH but you do not
>> say that.
>>
> 
> Let me make it clearer. I list the call flow for general purpose to show the
> sequence of the call flow.
> 
> The tce_iommu_bus_notifier is not invoked at the boot time. And none of them
> do real thing at boot time.
> 
> I don't understand your last sentence. I see warning and error during EEH
> hotplug. If necessary, I will add this in the commit log.
> 
>>
>>> At the bootup time, none of them succeed to setup the dma, since the PE is not
>>> assigned or the tbl is not set. The iommu tbl and group is setup in
>>> pnv_pci_ioda_setup_DMA().
>>>
>>> This call flow maintains the same when EEH error happens on Bus PE, while the
>>> behavior is a little different. 
>>>
>>>      pci_device_add
>>>         pcibios_setup_bus_device
>>>         ...
>>>         set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized
>>
>>
>> Sorry, what is uninitialized? The only kobj here is the one in iommu_group
>> and it must have been taken care of before adding a device.
> 
> pci_dev->dev->kobj->sd.
> 
> I have explained this in previous discussion. This kobj->sd is initialized in
> device_add().
> 
>>
>>
>>>      device_add
>>>         ...
>>>         tce_iommu_bus_notifier           <- succeed
>>>      pcibios_fixp_bus
>>>         pcibios_add_pci_devices
>>>         ...
>>>         pcibios_setup_bus_devices        <- warning, re-attach
>>
>>
>> This is why I am suggesting getting rid of tce_iommu_bus_notifier - we will
>> avoid the warning.
> 
> Then how about the first time's error?


What do you call "first time error" here?


>>> While this call flow will change a little on a VF. For a VF,
>>> pcibios_fixp_bus() will not be invoked. Current behavior is this.
>>
>>
>> You meant pcibios_fixup_bus? I would expect it to get called (from
>> pci_rescan_bus() or something like that?) and this seems to be the problem
>> here.
> 
> When EEH happens and hotplug the pci bus, we need to do the pci_scan_bridge()
> which will do the pcibios_fixup_bus().

Are you talking now about EEH on PF (physical function) or EEH on VF
(virtual function)?

Are you calling pci_scan_bridge()


> So you suggest to remove it? This is the generic code.


I suggest removing tce_iommu_bus_notifier only. It must go. It was my
mistake at the first place.


>> How are VFs added in terms of code flow? Is there any git tree to look at?
>>
> 
> VF add code flow is a generic code in drivers/pci/iov.c, virtfn_add().

virtfn_add -> pci_device_add -> pcibios_add_device -> pcibios_setup_device
-> ppc_md.pci_dma_dev_setup -> pnv_pci_dma_dev_setup ->
pnv_pci_ioda_dma_dev_setup ->
set_iommu_table_base.


What part of this is missing?


> 
>>
>>
>>>      pci_device_add
>>>         pcibios_setup_bus_device
>>>         ...
>>>         set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized
>>>      device_add
>>>         ...
>>>         tce_iommu_bus_notifier           <- succeed
>>>
>>> And if an EEH error happens just on a VF, I believe the same call flow should
>>> go for recovery. (This is not set down, still under discussion with Gavin.)
>>>
>>> My conclusion is:
>>> 1. remove the tce_iommu_bus_notifier completely will make the VF not work.
>>>    So I choose to revert the code and attach make the iommu group attachment
>>>    just happens in one place.
>>>
>>> BTW, I know my patch is not a perfect one. For a PF, the tbl will still be set
>>> twice. I am not sure why we need to invoke pcibios_setup_device() twice on a
>>> PF, maybe some platform need to fixup some thing after the pci_bus is added.
>>> So I don't remove one of them to solve the problem.
>>>
>>> If you have a better idea, I am glad to take it.



-- 
Alexey

^ permalink raw reply

* Re: [PATCH 0/3] Add new ptrace request macros on PowerPC
From: Anshuman Khandual @ 2014-04-29 12:22 UTC (permalink / raw)
  To: Michael Neuling; +Cc: Linux PPC dev, linux-kernel, avagin, roland, oleg
In-Reply-To: <CAEjGV6y8MRiCfRBtWjNfrwKMccROEAJw2rPVc_u+r=tAo2EcHQ@mail.gmail.com>

On 04/29/2014 01:52 PM, Michael Neuling wrote:
> That's not what that patch does. It shouldn't make any user visible changes
> to DSCR or PPR.

It may not when it runs uninterrupted but after the tracee process has stopped,
thread.dscr reflects the default DSCR value as mentioned before. This can be
proved by changing the "dscr_default" value in arch/powerpc/sysfs.c file.

> 
> Over syscall PPR and DSCR may change. Depending on your test case, that may
> be your problem.

I would guess when the tracee process stops for ptrace analysis, tm_reclaim or
tm_recheckpoint path might be crossed which is causing this dscr_default value
to go into thread_struct.

^ permalink raw reply

* Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries
From: Madhavan Srinivasan @ 2014-04-29 10:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-arch, riel, rusty, dave.hansen, peterz, x86, linux-kernel,
	linux-mm, ak, paulus, mgorman, Linus Torvalds, akpm, linuxppc-dev,
	kirill.shutemov
In-Reply-To: <20140429070632.GB27951@gmail.com>

On Tuesday 29 April 2014 12:36 PM, Ingo Molnar wrote:
> 
> * Madhavan Srinivasan <maddy@linux.vnet.ibm.com> wrote:
> 
>> Performance data for different FAULT_AROUND_ORDER values from 4 socket
>> Power7 system (128 Threads and 128GB memory). perf stat with repeat of 5
>> is used to get the stddev values. Test ran in v3.14 kernel (Baseline) and
>> v3.15-rc1 for different fault around order values.
>>
>> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
>>
>> Linux build (make -j64)
>> minor-faults            47,437,359      35,279,286      25,425,347      23,461,275      22,002,189      21,435,836
>> times in seconds        347.302528420   344.061588460   340.974022391   348.193508116   348.673900158   350.986543618
>>  stddev for time        ( +-  1.50% )   ( +-  0.73% )   ( +-  1.13% )   ( +-  1.01% )   ( +-  1.89% )   ( +-  1.55% )
>>  %chg time to baseline                  -0.9%           -1.8%           0.2%            0.39%           1.06%
> 
> Probably too noisy.

Ok. I should have added the formula used for %change to clarify the data
presented. My bad.

Just to clarify, %change here is calculated based on this formula.

((new value - baseline)/baseline)

And in this case, negative %change says it a drop in time and
positive value has increase the time when compared to baseline.

With regards
Maddy

> 
>> Linux rebuild (make -j64)
>> minor-faults            941,552         718,319         486,625         440,124         410,510         397,416
>> times in seconds        30.569834718    31.219637539    31.319370649    31.434285472    31.972367174    31.443043580
>>  stddev for time        ( +-  1.07% )   ( +-  0.13% )   ( +-  0.43% )   ( +-  0.18% )   ( +-  0.95% )   ( +-  0.58% )
>>  %chg time to baseline                  2.1%            2.4%            2.8%            4.58%           2.85%
> 
> Here it looks like a speedup. Optimal value: 5+.
> 
>> Binutils build (make all -j64 )
>> minor-faults            474,821         371,380         269,463         247,715         235,255         228,337
>> times in seconds        53.882492432    53.584289348    53.882773216    53.755816431    53.607824348    53.423759642
>>  stddev for time        ( +-  0.08% )   ( +-  0.56% )   ( +-  0.17% )   ( +-  0.11% )   ( +-  0.60% )   ( +-  0.69% )
>>  %chg time to baseline                  -0.55%          0.0%            -0.23%          -0.51%          -0.85%
> 
> Probably too noisy, but looks like a potential slowdown?
> 
>> Two synthetic tests: access every word in file in sequential/random order.
>>
>> Sequential access 16GiB file
>> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
>> 1 thread
>>        minor-faults     263,148         131,166         32,908          16,514          8,260           1,093
>>        times in seconds 53.091138345    53.113191672    53.188776177    53.233017218    53.206841347    53.429979442
>>        stddev for time  ( +-  0.06% )   ( +-  0.07% )   ( +-  0.08% )   ( +-  0.09% )   ( +-  0.03% )   ( +-  0.03% )
>>        %chg time to baseline            0.04%           0.18%           0.26%           0.21%           0.63%
> 
> Speedup, optimal value: 8+.
> 
>> 8 threads
>>        minor-faults     2,097,267       1,048,753       262,237         131,397         65,621          8,274
>>        times in seconds 55.173790028    54.591880790    54.824623287    54.802162211    54.969680503    54.790387715
>>        stddev for time  ( +-  0.78% )   ( +-  0.09% )   ( +-  0.08% )   ( +-  0.07% )   ( +-  0.28% )   ( +-  0.05% )
>>        %chg time to baseline            -1.05%          -0.63%          -0.67%          -0.36%          -0.69%
> 
> Looks like a regression?
> 
>> 32 threads
>>        minor-faults     8,388,751       4,195,621       1,049,664       525,461         262,535         32,924
>>        times in seconds 60.431573046    60.669110744    60.485336388    60.697789706    60.077959564    60.588855032
>>        stddev for time  ( +-  0.44% )   ( +-  0.27% )   ( +-  0.46% )   ( +-  0.67% )   ( +-  0.31% )   ( +-  0.49% )
>>        %chg time to baseline            0.39%           0.08%           0.44%           -0.58%          0.25%
> 
> Probably too noisy.
> 
>> 64 threads
>>        minor-faults     16,777,409      8,607,527       2,289,766       1,202,264       598,405         67,587
>>        times in seconds 96.932617720    100.675418760   102.109880836   103.881733383   102.580199555   105.751194041
>>        stddev for time  ( +-  1.39% )   ( +-  1.06% )   ( +-  0.99% )   ( +-  0.76% )   ( +-  1.65% )   ( +-  1.60% )
>>        %chg time to baseline            3.86%           5.34%           7.16%           5.82%           9.09%
> 
> Speedup, optimal value: 4+
> 
>> 128 threads
>>        minor-faults     33,554,705      17,375,375      4,682,462       2,337,245       1,179,007       134,819
>>        times in seconds 128.766704495   115.659225437   120.353046307   115.291871270   115.450886036   113.991902150
>>        stddev for time  ( +-  2.93% )   ( +-  0.30% )   ( +-  2.93% )   ( +-  1.24% )   ( +-  1.03% )   ( +-  0.70% )
>>        %chg time to baseline            -10.17%         -6.53%          -10.46%         -10.34%         -11.47%
> 
> Rather significant regression at order 1 already.
> 
>> Random access 1GiB file
>> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
>> 1 thread
>>        minor-faults     17,155          8,678           2,126           1,097           581             134
>>        times in seconds 51.904430523    51.658017987    51.919270792    51.560531738    52.354431597    51.976469502
>>        stddev for time  ( +-  3.19% )   ( +-  1.35% )   ( +-  1.56% )   ( +-  0.91% )   ( +-  1.70% )   ( +-  2.02% )
>>        %chg time to baseline            -0.47%          0.02%           -0.66%          0.86%           0.13%
> 
> Probably too noisy.
> 
>> 8 threads
>>        minor-faults     131,844         70,705          17,457          8,505           4,251           598
>>        times in seconds 58.162813956    54.991706305    54.952675791    55.323057492    54.755587379    53.376722828
>>        stddev for time  ( +-  1.44% )   ( +-  0.69% )   ( +-  1.23% )   ( +-  2.78% )   ( +-  1.90% )   ( +-  2.91% )
>>        %chg time to baseline            -5.45%          -5.52%          -4.88%          -5.86%          -8.22%
> 
> Regression.
> 
>> 32 threads
>>        minor-faults     524,437         270,760         67,069          33,414          16,641          2,204
>>        times in seconds 69.981777072    76.539570015    79.753578505    76.245943618    77.254258344    79.072596831
>>        stddev for time  ( +-  2.81% )   ( +-  1.95% )   ( +-  2.66% )   ( +-  0.99% )   ( +-  2.35% )   ( +-  3.22% )
>>        %chg time to baseline            9.37%           13.96%          8.95%           10.39%          12.98%
> 
> Speedup, optimal value hard to tell due to noise - 3+ or 8+.
> 
>> 64 threads
>>        minor-faults     1,049,117       527,451         134,016         66,638          33,391          4,559
>>        times in seconds 108.024517536   117.575067996   115.322659914   111.943998437   115.049450815   119.218450840
>>        stddev for time  ( +-  2.40% )   ( +-  1.77% )   ( +-  1.19% )   ( +-  3.29% )   ( +-  2.32% )   ( +-  1.42% )
>>        %chg time to baseline            8.84%           6.75%           3.62%           6.5%            10.3%
> 
> Speedup, optimal value again hard to tell due to noise.
> 
>> 128 threads
>>        minor-faults     2,097,440       1,054,360       267,042         133,328         66,532          8,652
>>        times in seconds 155.055861167   153.059625968   152.449492156   151.024005282   150.844647770   155.954366718
>>        stddev for time  ( +-  1.32% )   ( +-  1.14% )   ( +-  1.32% )   ( +-  0.81% )   ( +-  0.75% )   ( +-  0.72% )
>>        %chg time to baseline            -1.28%          -1.68%          -2.59%          -2.71%          0.57%
> 
> Slowdown for most orders.
> 
>> Incase of Kernel compilation, fault around order (fao) of 1 and 3 
>> provides fast compilation time when compared to a value of 4. On 
>> closer look, fao of 3 has higher agains. Incase of Sequential access 
>> synthetic tests fao of 1 has higher gains and in Random access test, 
>> fao of 3 has marginal gains. Going by compilation time, fao value of 
>> 3 is suggested in this patch for pseries platform.
> 
> So I'm really at loss to understand where you get the optimal value of 
> '3' from. The data does not seem to match your claim that '1 and 3 
> provides fast compilation time when compared to a value of 4':
> 



>> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
>>
>> Linux rebuild (make -j64)
>> minor-faults            941,552         718,319         486,625         440,124         410,510         397,416
>> times in seconds        30.569834718    31.219637539    31.319370649    31.434285472    31.972367174    31.443043580
>>  stddev for time        ( +-  1.07% )   ( +-  0.13% )   ( +-  0.43% )   ( +-  0.18% )   ( +-  0.95% )   ( +-  0.58% )
>>  %chg time to baseline                  2.1%            2.4%            2.8%            4.58%           2.85%
> 
> 5 and 8, and probably 6, 7 are better than 4.
> 
> 3 is probably _slower_ than the current default - but it's hard to 
> tell due to inherent noise.
> 
> But the other two build tests were too noisy, and if then they showed 
> a slowdown.
> 
>> Worst case scenario: we touch one page every 16M to demonstrate overhead.
>>
>> Touch only one page in page table in 16GiB file
>> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
>> 1 thread
>>        minor-faults     1,104           1,090           1,071           1,068           1,065           1,063
>>        times in seconds 0.006583298     0.008531502     0.019733795     0.036033763     0.062300553     0.406857086
>>        stddev for time  ( +-  2.79% )   ( +-  2.42% )   ( +-  3.47% )   ( +-  2.81% )   ( +-  2.01% )   ( +-  1.33% )
>> 8 threads
>>        minor-faults     8,279           8,264           8,245           8,243           8,239           8,240
>>        times in seconds 0.044572398     0.057211811     0.107606306     0.205626815     0.381679120     2.647979955
>>        stddev for time  ( +-  1.95% )   ( +-  2.98% )   ( +-  1.74% )   ( +-  2.80% )   ( +-  2.01% )   ( +-  1.86% )
>> 32 threads
>>        minor-faults     32,879          32,864          32,849          32,845          32,839          32,843
>>        times in seconds 0.197659343     0.218486087     0.445116407     0.694235883     1.296894038     9.127517045
>>        stddev for time  ( +-  3.05% )   ( +-  3.05% )   ( +-  4.33% )   ( +-  3.08% )   ( +-  3.75% )   ( +-  0.56% )
>> 64 threads
>>        minor-faults     65,680          65,664          65,646          65,645          65,640          65,647
>>        times in seconds 0.455537304     0.489688780     0.866490093     1.427393118     2.379628982     17.059295051
>>        stddev for time  ( +-  4.01% )   ( +-  4.13% )   ( +-  2.92% )   ( +-  1.68% )   ( +-  1.79% )   ( +-  0.48% )
>> 128 threads
>>        minor-faults     131,279         131,265         131,250         131,245         131,241         131,254
>>        times in seconds 1.026880651     1.095327536     1.721728274     2.808233068     4.662729948     31.732848290
>>        stddev for time  ( +-  6.85% )   ( +-  4.09% )   ( +-  1.71% )   ( +-  3.45% )   ( +-  2.40% )   ( +-  0.68% )
> 
> There's no '%change' values shown, but the slowdown looks significant, 
> it's the worst case: for example with 1 thread order 3 looks about 
> 300% slower (3x slowdown) compared to order 0.
> 
> All in one, looking at your latest data I don't think the conclusion 
> from your first version of this optimization patch from a month ago is 
> true anymore:
> 
>> +	/* Measured on a 4 socket Power7 system (128 Threads and 128GB memory) */
>> +	fault_around_order = 3;
> 
> As the data is rather conflicting and inconclusive, and if it shows a 
> sweet spot it's not at order 3. New data should in general trigger 
> reanalysis of your first optimization value.
>
> I'm starting to suspect that maybe workloads ought to be given a 
> choice in this matter, via madvise() or such.
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply

* Re: [git pull] Please pull abiv2 branch
From: Philippe Bergheaud @ 2014-04-29  9:38 UTC (permalink / raw)
  To: Rusty Russell
  Cc: philippe.bergheaud, mikey, amodra, rostedt, ulrich.weigand, mjw,
	Anton Blanchard, paulus, linuxppc-dev
In-Reply-To: <877g69t1my.fsf@rustcorp.com.au>

Rusty Russell wrote:
> Philippe Bergheaud <felix@linux.vnet.ibm.com> writes:
> 
>>Anton Blanchard wrote:
>>
>>>Here are the ABIv2 patches rebased against 3.15-rc2.
>>
>>After recompiling 3.15-rc2 with the ABIv2 patches,
>>I see the following line in Modules.symvers:
>>
>>     0x00000000 TOC. vmlinux EXPORT_SYMBOL
>>
>>Kernel will not load modules because TOC. has no CRC.
>>Is this expected ? Shouldn't TOC. have a CRC ?
> 
> 
> What happens when you try to load a module?  It should work...

My mistake, sorry: kernel 3.15-rc2 crashes at boot, in the SLES12
Beta5 environment that I am using, before any module load attempt.

The problem happens with the SLES12 kernel of the day 3.12.17,
plus the backported ABIv2 patch set.  Boot fails with:
     kernel: ibmveth: no symbol version for TOC.
     kernel: ibmveth: Unknown symbol TOC. (err -22)
     kernel: scsi_mod: no symbol version for TOC.
     kernel: scsi_mod: Unknown symbol TOC. (err -22)

In the rescue shell, repeating a plain modprobe fails again:
     :/# modprobe scsi_mod
     modprobe: ERROR: could not insert 'scsi_mod': Invalid argument

And finally, modprobe succeeds with --force:
     :/# modprobe --force scsi_mod
     scsi_mod: module has bad taint, not creating trace events

Philippe

^ permalink raw reply

* Re: [PATCH 1/2] powerpc/powernv: reduce multi-hit of iommu_add_device()
From: Wei Yang @ 2014-04-29  9:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Wei Yang, linuxppc-dev, gwshan
In-Reply-To: <535F5B04.5000102@au1.ibm.com>

On Tue, Apr 29, 2014 at 05:55:48PM +1000, Alexey Kardashevskiy wrote:
>On 04/29/2014 04:49 PM, Wei Yang wrote:
>> On Mon, Apr 28, 2014 at 11:35:32PM +1000, Alexey Kardashevskiy wrote:
>>> On 04/23/2014 12:26 PM, Wei Yang wrote:
>>>> During the EEH hotplug event, iommu_add_device() will be invoked three times
>>>> and two of them will trigger warning or error.
>>>>
>>>> The three times to invoke the iommu_add_device() are:
>>>>
>>>>     pci_device_add
>>>>        ...
>>>>        set_iommu_table_base_and_group   <- 1st time, fail
>>>>     device_add
>>>>        ...
>>>>        tce_iommu_bus_notifier           <- 2nd time, succees
>>>>     pcibios_add_pci_devices
>>>>        ...
>>>>        pcibios_setup_bus_devices        <- 3rd time, re-attach
>>>>
>>>> The first time fails, since the dev->kobj->sd is not initialized. The
>>>> dev->kobj->sd is initialized in device_add().
>>>> The third time's warning is triggered by the re-attach of the iommu_group.
>>>>
>>>> After applying this patch, the error
>>>
>>> Nack.
>>>
>>> The patch still seems incorrect and we actually need to remove
>>> tce_iommu_bus_notifier completely as pcibios_setup_bus_devices is called
>>>from another notifier anyway. Could you please test it?
>>>
>>>
>> 
>> Hi, Alexey,
>> 
>> Nice to see your comment. Let me show what I got fist.
>> 
>> Generally, when kernel enumerate on the pci device, following functions will
>> be invoked.
>> 
>>      pci_device_add
>>         pcibios_setup_bus_device
>>         ...
>>         set_iommu_table_base_and_group   
>>      device_add
>>         ...
>>         tce_iommu_bus_notifier           
>>      pcibios_fixp_bus
>>         pcibios_add_pci_devices
>>         ...
>>         pcibios_setup_bus_devices        
>> 
>>>From the call flow, we see for a normall pci device, the
>> pcibios_setup_bus_device() will be invoked twice.
>
>
>tce_iommu_bus_notifier should not be called for devices during boot-time
>PCI discovery as the discovery is done from subsys_initcall handler and
>tce_iommu_bus_notifier is registered as subsys_initcall_sync. Are you sure
>this is happening? You should see warnings as for PF's EEH but you do not
>say that.
>

Let me make it clearer. I list the call flow for general purpose to show the
sequence of the call flow.

The tce_iommu_bus_notifier is not invoked at the boot time. And none of them
do real thing at boot time.

I don't understand your last sentence. I see warning and error during EEH
hotplug. If necessary, I will add this in the commit log.

>
>> At the bootup time, none of them succeed to setup the dma, since the PE is not
>> assigned or the tbl is not set. The iommu tbl and group is setup in
>> pnv_pci_ioda_setup_DMA().
>> 
>> This call flow maintains the same when EEH error happens on Bus PE, while the
>> behavior is a little different. 
>> 
>>      pci_device_add
>>         pcibios_setup_bus_device
>>         ...
>>         set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized
>
>
>Sorry, what is uninitialized? The only kobj here is the one in iommu_group
>and it must have been taken care of before adding a device.

pci_dev->dev->kobj->sd.

I have explained this in previous discussion. This kobj->sd is initialized in
device_add().

>
>
>>      device_add
>>         ...
>>         tce_iommu_bus_notifier           <- succeed
>>      pcibios_fixp_bus
>>         pcibios_add_pci_devices
>>         ...
>>         pcibios_setup_bus_devices        <- warning, re-attach
>
>
>This is why I am suggesting getting rid of tce_iommu_bus_notifier - we will
>avoid the warning.

Then how about the first time's error?

>
>
>> While this call flow will change a little on a VF. For a VF,
>> pcibios_fixp_bus() will not be invoked. Current behavior is this.
>
>
>You meant pcibios_fixup_bus? I would expect it to get called (from
>pci_rescan_bus() or something like that?) and this seems to be the problem
>here.

When EEH happens and hotplug the pci bus, we need to do the pci_scan_bridge()
which will do the pcibios_fixup_bus().

So you suggest to remove it? This is the generic code.

>
>How are VFs added in terms of code flow? Is there any git tree to look at?
>

VF add code flow is a generic code in drivers/pci/iov.c, virtfn_add().

>
>
>>      pci_device_add
>>         pcibios_setup_bus_device
>>         ...
>>         set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized
>>      device_add
>>         ...
>>         tce_iommu_bus_notifier           <- succeed
>> 
>> And if an EEH error happens just on a VF, I believe the same call flow should
>> go for recovery. (This is not set down, still under discussion with Gavin.)
>> 
>> My conclusion is:
>> 1. remove the tce_iommu_bus_notifier completely will make the VF not work.
>>    So I choose to revert the code and attach make the iommu group attachment
>>    just happens in one place.
>> 
>> BTW, I know my patch is not a perfect one. For a PF, the tbl will still be set
>> twice. I am not sure why we need to invoke pcibios_setup_device() twice on a
>> PF, maybe some platform need to fixup some thing after the pci_bus is added.
>> So I don't remove one of them to solve the problem.
>> 
>> If you have a better idea, I am glad to take it.
>> 
>
>
>-- 
>Alexey Kardashevskiy
>IBM OzLabs, LTC Team
>
>e-mail: aik@au1.ibm.com
>notes: Alexey Kardashevskiy/Australia/IBM

-- 
Richard Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries
From: Madhavan Srinivasan @ 2014-04-29  9:36 UTC (permalink / raw)
  To: Rusty Russell, linux-kernel, linuxppc-dev, linux-mm, linux-arch,
	x86
  Cc: riel, ak, peterz, dave.hansen, paulus, mgorman, akpm, mingo,
	kirill.shutemov
In-Reply-To: <877g686fpb.fsf@rustcorp.com.au>

On Tuesday 29 April 2014 07:48 AM, Rusty Russell wrote:
> Madhavan Srinivasan <maddy@linux.vnet.ibm.com> writes:
>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>> index 2db8cc6..c87e6b6 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -74,6 +74,8 @@ int CMO_SecPSP = -1;
>>  unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K);
>>  EXPORT_SYMBOL(CMO_PageSize);
>>  
>> +extern unsigned int fault_around_order;
>> +
> 
> It's considered bad form to do this.  Put the declaration in linux/mm.h.
> 

ok. Will change it.

Thanks for review
With regards
Maddy

> Thanks,
> Rusty.
> PS.  But we're getting there! :)
> 

^ permalink raw reply

* Re: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/
From: Madhavan Srinivasan @ 2014-04-29  9:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-arch, riel, rusty, dave.hansen, peterz, x86, linux-kernel,
	linux-mm, ak, paulus, mgorman, akpm, linuxppc-dev, mingo
In-Reply-To: <20140428093606.95B90E009B@blue.fi.intel.com>

On Monday 28 April 2014 03:06 PM, Kirill A. Shutemov wrote:
> Madhavan Srinivasan wrote:
>> Kirill A. Shutemov with 8c6e50b029 commit introduced
>> vm_ops->map_pages() for mapping easy accessible pages around
>> fault address in hope to reduce number of minor page faults.
>>
>> This patch creates infrastructure to modify the FAULT_AROUND_ORDER
>> value using mm/Kconfig. This will enable architecture maintainers
>> to decide on suitable FAULT_AROUND_ORDER value based on
>> performance data for that architecture. Patch also defaults
>> FAULT_AROUND_ORDER Kconfig element to 4.
>>
>> Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
>> ---
>>  mm/Kconfig  |    8 ++++++++
>>  mm/memory.c |   11 ++++-------
>>  2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ebe5880..c7fc4f1 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -176,6 +176,14 @@ config MOVABLE_NODE
>>  config HAVE_BOOTMEM_INFO_NODE
>>  	def_bool n
>>  
>> +#
>> +# Fault around order is a control knob to decide the fault around pages.
>> +# Default value is set to 4 , but the arch can override it as desired.
>> +#
>> +config FAULT_AROUND_ORDER
>> +	int
>> +	default	4
>> +
>>  # eventually, we can have this option just 'select SPARSEMEM'
>>  config MEMORY_HOTPLUG
>>  	bool "Allow for memory hot-add"
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d0f0bef..457436d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3382,11 +3382,9 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>>  	update_mmu_cache(vma, address, pte);
>>  }
>>  
>> -#define FAULT_AROUND_ORDER 4
>> +unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;
>>  
>>  #ifdef CONFIG_DEBUG_FS
>> -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
>> -
>>  static int fault_around_order_get(void *data, u64 *val)
>>  {
>>  	*val = fault_around_order;
>> @@ -3395,7 +3393,6 @@ static int fault_around_order_get(void *data, u64 *val)
>>  
>>  static int fault_around_order_set(void *data, u64 val)
>>  {
>> -	BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
>>  	if (1UL << val > PTRS_PER_PTE)
>>  		return -EINVAL;
>>  	fault_around_order = val;
>> @@ -3430,14 +3427,14 @@ static inline unsigned long fault_around_pages(void)
>>  {
>>  	unsigned long nr_pages;
>>  
>> -	nr_pages = 1UL << FAULT_AROUND_ORDER;
>> +	nr_pages = 1UL << fault_around_order;
>>  	BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
> 
> This BUILD_BUG_ON() doesn't make sense any more since compiler usually can't
> prove anything about extern variable.
> 
> Drop it or change to VM_BUG_ON() or something.
> 

Ok.

>>  	return nr_pages;
>>  }
>>  
>>  static inline unsigned long fault_around_mask(void)
>>  {
>> -	return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
>> +	return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
> 
> fault_around_pages() and fault_around_mask() should be moved outside of
> #ifdef CONFIG_DEBUG_FS and consolidated since they are practically identical
> both branches after the changes.
> 

Agreed. Will change it.

>>  }
>>  #endif
>>  
>> @@ -3495,7 +3492,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>  	 * if page by the offset is not ready to be mapped (cold cache or
>>  	 * something).
>>  	 */
>> -	if (vma->vm_ops->map_pages) {
>> +	if ((vma->vm_ops->map_pages) && (fault_around_order)) {
> 
> Way too many parentheses.
> 

Sure. Will modify it.

Thanks for review
with regards
Maddy
>>  		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>  		do_fault_around(vma, address, pte, pgoff, flags);
>>  		if (!pte_same(*pte, orig_pte))

^ permalink raw reply

* Re: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/
From: Madhavan Srinivasan @ 2014-04-29  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, riel, rusty, dave.hansen, x86, linux-kernel, linux-mm,
	ak, paulus, mgorman, akpm, linuxppc-dev, mingo, kirill.shutemov
In-Reply-To: <20140428090649.GD27561@twins.programming.kicks-ass.net>

On Monday 28 April 2014 02:36 PM, Peter Zijlstra wrote:
> On Mon, Apr 28, 2014 at 02:31:29PM +0530, Madhavan Srinivasan wrote:
>> +unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;
> 
> __read_mostly?
> 
Agreed. Will add it.

Thanks for review.
With regards
Maddy

^ permalink raw reply

* Re: [PATCH 0/3] Add new ptrace request macros on PowerPC
From: Michael Neuling @ 2014-04-29  8:22 UTC (permalink / raw)
  To: Anshuman Khandual; +Cc: Linux PPC dev, linux-kernel, avagin, roland, oleg
In-Reply-To: <535F5BDE.2030309@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 882 bytes --]

That's not what that patch does. It shouldn't make any user visible changes
to DSCR or PPR.

Over syscall PPR and DSCR may change. Depending on your test case, that may
be your problem.

Mikey
On 29 Apr 2014 18:02, "Anshuman Khandual" <khandual@linux.vnet.ibm.com>
wrote:

> On 04/29/2014 12:36 PM, Michael Neuling wrote:
> > How is it causing the problem?
>
> As mentioned before, what I thought to be a problem is
> something expected behaviour. So it's not a problem any
> more. DSCR value inside the transaction will fall back
> to default as kernel wont let user specified value to
> remain applied for a long time.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
>

[-- Attachment #2: Type: text/html, Size: 1418 bytes --]

^ permalink raw reply

* Re: [PATCH 0/3] Add new ptrace request macros on PowerPC
From: Anshuman Khandual @ 2014-04-29  7:59 UTC (permalink / raw)
  To: Michael Neuling; +Cc: Linux PPC dev, linux-kernel, avagin, roland, oleg
In-Reply-To: <CAEjGV6yi5YY95tM-HY4akN4BGXGn4chA8_sCaqvWx3HPSu2V3w@mail.gmail.com>

On 04/29/2014 12:36 PM, Michael Neuling wrote:
> How is it causing the problem?

As mentioned before, what I thought to be a problem is
something expected behaviour. So it's not a problem any
more. DSCR value inside the transaction will fall back
to default as kernel wont let user specified value to
remain applied for a long time.

^ permalink raw reply

* [PATCH] powerpc/eeh: Dump PE location code
From: Gavin Shan @ 2014-04-29  8:00 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, Gavin Shan

As Ben suggested, it's meaningful to dump PE's location code
for site engineers when hitting EEH errors. The patch introduces
function eeh_pe_loc_get() to retireve the location code from
dev-tree so that we can output it when hitting EEH errors.

If primary PE bus is root bus, the PHB's dev-node would be tried
prior to root port's dev-node. Otherwise, the upstream bridge's
dev-node of the primary PE bus will be check for the location code
directly.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h            |  1 +
 arch/powerpc/kernel/eeh.c                 |  9 ++---
 arch/powerpc/kernel/eeh_pe.c              | 60 +++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/eeh-ioda.c | 27 ++++++++------
 4 files changed, 82 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index d12529f..7782056 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -255,6 +255,7 @@ void *eeh_pe_traverse(struct eeh_pe *root,
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void eeh_pe_restore_bars(struct eeh_pe *pe);
+const char *eeh_pe_loc_get(struct eeh_pe *pe);
 struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
 
 void *eeh_dev_init(struct device_node *dn, void *data);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 3764fb7..9f8de75 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -329,8 +329,8 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
 	eeh_pe_state_mark(phb_pe, EEH_PE_ISOLATED);
 	eeh_serialize_unlock(flags);
 
-	pr_err("EEH: PHB#%x failure detected\n",
-		phb_pe->phb->global_number);
+	pr_err("EEH: PHB#%x (%s) failure detected\n",
+		phb_pe->phb->global_number, eeh_pe_loc_get(phb_pe));
 	dump_stack();
 	eeh_send_failure_event(phb_pe);
 
@@ -459,8 +459,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	 * a stack trace will help the device-driver authors figure
 	 * out what happened.  So print that out.
 	 */
-	pr_err("EEH: Frozen PE#%x detected on PHB#%x\n",
-		pe->addr, pe->phb->global_number);
+	pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
+		pe->phb->global_number, pe->addr,
+		eeh_pe_loc_get(pe));
 	dump_stack();
 
 	eeh_send_failure_event(pe);
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 995c2a2..fbd01eb 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -792,6 +792,66 @@ void eeh_pe_restore_bars(struct eeh_pe *pe)
 }
 
 /**
+ * eeh_pe_loc_get - Retrieve location code binding to the given PE
+ * @pe: EEH PE
+ *
+ * Retrieve the location code of the given PE. If the primary PE bus
+ * is root bus, we will grab location code from PHB device tree node
+ * or root port. Otherwise, the upstream bridge's device tree node
+ * of the primary PE bus will be checked for the location code.
+ */
+const char *eeh_pe_loc_get(struct eeh_pe *pe)
+{
+	struct pci_controller *hose;
+	struct pci_bus *bus = eeh_pe_bus_get(pe);
+	struct pci_dev *pdev;
+	struct device_node *dn;
+	const char *loc;
+
+	if (!bus)
+		return "N/A";
+
+	/* PHB PE or root PE ? */
+	if (pci_is_root_bus(bus)) {
+		hose = pci_bus_to_host(bus);
+		loc = of_get_property(hose->dn,
+				"ibm,loc-code", NULL);
+		if (loc)
+			return loc;
+		loc = of_get_property(hose->dn,
+				"ibm,io-base-loc-code", NULL);
+		if (loc)
+			return loc;
+
+		pdev = pci_get_slot(bus, 0x0);
+	} else {
+		pdev = bus->self;
+	}
+
+	if (!pdev) {
+		loc = "N/A";
+		goto out;
+	}
+
+	dn = pci_device_to_OF_node(pdev);
+	if (!dn) {
+		loc = "N/A";
+		goto out;
+	}
+
+	loc = of_get_property(dn, "ibm,loc-code", NULL);
+	if (!loc)
+		loc = of_get_property(dn, "ibm,slot-location-code", NULL);
+	if (!loc)
+		loc = "N/A";
+
+out:
+	if (pci_is_root_bus(bus) && pdev)
+		pci_dev_put(pdev);
+	return loc;
+}
+
+/**
  * eeh_pe_bus_get - Retrieve PCI bus according to the given PE
  * @pe: EEH PE
  *
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 79d0cdf..e5b88c5 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -851,18 +851,21 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 		case OPAL_EEH_PHB_ERROR:
 			if (severity == OPAL_EEH_SEV_PHB_DEAD) {
 				*pe = phb_pe;
-				pr_err("EEH: dead PHB#%x detected\n",
-					hose->global_number);
+				pr_err("EEH: dead PHB#%x (%s) detected\n",
+					hose->global_number,
+					eeh_pe_loc_get(phb_pe));
 				ret = EEH_NEXT_ERR_DEAD_PHB;
 			} else if (severity == OPAL_EEH_SEV_PHB_FENCED) {
 				*pe = phb_pe;
-				pr_err("EEH: fenced PHB#%x detected\n",
-					hose->global_number);
+				pr_err("EEH: Fenced PHB#%x (%s) detected\n",
+					hose->global_number,
+					eeh_pe_loc_get(phb_pe));
 				ret = EEH_NEXT_ERR_FENCED_PHB;
 			} else if (severity == OPAL_EEH_SEV_INF) {
-				pr_info("EEH: PHB#%x informative error "
+				pr_info("EEH: PHB#%x (%s) informative error "
 					"detected\n",
-					hose->global_number);
+					hose->global_number,
+					eeh_pe_loc_get(phb_pe));
 				ioda_eeh_phb_diag(hose);
 				ret = EEH_NEXT_ERR_NONE;
 			}
@@ -881,16 +884,18 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 			 */
 			if (ioda_eeh_get_pe(hose, frozen_pe_no, pe)) {
 				*pe = phb_pe;
-				pr_err("EEH: Escalated fenced PHB#%x "
-				       "detected for PE#%llx\n",
+				pr_err("EEH: Escalated frozen PHB#%x-"
+				       "PE#%llx (%s) detected\n",
 					hose->global_number,
-					frozen_pe_no);
+					frozen_pe_no,
+					eeh_pe_loc_get(phb_pe));
 				ret = EEH_NEXT_ERR_FENCED_PHB;
 			} else if ((*pe)->state & EEH_PE_ISOLATED) {
 				ret = EEH_NEXT_ERR_NONE;
 			} else {
-				pr_err("EEH: Frozen PE#%x on PHB#%x detected\n",
-					(*pe)->addr, (*pe)->phb->global_number);
+				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
+					(*pe)->phb->global_number, (*pe)->addr,
+					eeh_pe_loc_get(*pe));
 				ret = EEH_NEXT_ERR_FROZEN_PE;
 			}
 
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH 1/2] powerpc/powernv: reduce multi-hit of iommu_add_device()
From: Alexey Kardashevskiy @ 2014-04-29  7:55 UTC (permalink / raw)
  To: Wei Yang; +Cc: linuxppc-dev, gwshan
In-Reply-To: <20140429064955.GA5066@richard>

On 04/29/2014 04:49 PM, Wei Yang wrote:
> On Mon, Apr 28, 2014 at 11:35:32PM +1000, Alexey Kardashevskiy wrote:
>> On 04/23/2014 12:26 PM, Wei Yang wrote:
>>> During the EEH hotplug event, iommu_add_device() will be invoked three times
>>> and two of them will trigger warning or error.
>>>
>>> The three times to invoke the iommu_add_device() are:
>>>
>>>     pci_device_add
>>>        ...
>>>        set_iommu_table_base_and_group   <- 1st time, fail
>>>     device_add
>>>        ...
>>>        tce_iommu_bus_notifier           <- 2nd time, succees
>>>     pcibios_add_pci_devices
>>>        ...
>>>        pcibios_setup_bus_devices        <- 3rd time, re-attach
>>>
>>> The first time fails, since the dev->kobj->sd is not initialized. The
>>> dev->kobj->sd is initialized in device_add().
>>> The third time's warning is triggered by the re-attach of the iommu_group.
>>>
>>> After applying this patch, the error
>>
>> Nack.
>>
>> The patch still seems incorrect and we actually need to remove
>> tce_iommu_bus_notifier completely as pcibios_setup_bus_devices is called
>>from another notifier anyway. Could you please test it?
>>
>>
> 
> Hi, Alexey,
> 
> Nice to see your comment. Let me show what I got fist.
> 
> Generally, when kernel enumerate on the pci device, following functions will
> be invoked.
> 
>      pci_device_add
>         pcibios_setup_bus_device
>         ...
>         set_iommu_table_base_and_group   
>      device_add
>         ...
>         tce_iommu_bus_notifier           
>      pcibios_fixp_bus
>         pcibios_add_pci_devices
>         ...
>         pcibios_setup_bus_devices        
> 
>>From the call flow, we see for a normall pci device, the
> pcibios_setup_bus_device() will be invoked twice.


tce_iommu_bus_notifier should not be called for devices during boot-time
PCI discovery as the discovery is done from subsys_initcall handler and
tce_iommu_bus_notifier is registered as subsys_initcall_sync. Are you sure
this is happening? You should see warnings as for PF's EEH but you do not
say that.


> At the bootup time, none of them succeed to setup the dma, since the PE is not
> assigned or the tbl is not set. The iommu tbl and group is setup in
> pnv_pci_ioda_setup_DMA().
> 
> This call flow maintains the same when EEH error happens on Bus PE, while the
> behavior is a little different. 
> 
>      pci_device_add
>         pcibios_setup_bus_device
>         ...
>         set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized


Sorry, what is uninitialized? The only kobj here is the one in iommu_group
and it must have been taken care of before adding a device.


>      device_add
>         ...
>         tce_iommu_bus_notifier           <- succeed
>      pcibios_fixp_bus
>         pcibios_add_pci_devices
>         ...
>         pcibios_setup_bus_devices        <- warning, re-attach


This is why I am suggesting getting rid of tce_iommu_bus_notifier - we will
avoid the warning.


> While this call flow will change a little on a VF. For a VF,
> pcibios_fixp_bus() will not be invoked. Current behavior is this.


You meant pcibios_fixup_bus? I would expect it to get called (from
pci_rescan_bus() or something like that?) and this seems to be the problem
here.

How are VFs added in terms of code flow? Is there any git tree to look at?



>      pci_device_add
>         pcibios_setup_bus_device
>         ...
>         set_iommu_table_base_and_group   <- fail, kobj->sd is not initialized
>      device_add
>         ...
>         tce_iommu_bus_notifier           <- succeed
> 
> And if an EEH error happens just on a VF, I believe the same call flow should
> go for recovery. (This is not set down, still under discussion with Gavin.)
> 
> My conclusion is:
> 1. remove the tce_iommu_bus_notifier completely will make the VF not work.
>    So I choose to revert the code and attach make the iommu group attachment
>    just happens in one place.
> 
> BTW, I know my patch is not a perfect one. For a PF, the tbl will still be set
> twice. I am not sure why we need to invoke pcibios_setup_device() twice on a
> PF, maybe some platform need to fixup some thing after the pci_bus is added.
> So I don't remove one of them to solve the problem.
> 
> If you have a better idea, I am glad to take it.
> 


-- 
Alexey Kardashevskiy
IBM OzLabs, LTC Team

e-mail: aik@au1.ibm.com
notes: Alexey Kardashevskiy/Australia/IBM

^ permalink raw reply

* [PATCH 1/2] ftrace: PPC, fix obsolete comment
From: Jiri Slaby @ 2014-04-29  7:24 UTC (permalink / raw)
  To: rostedt
  Cc: jirislaby, linux-kernel, Ingo Molnar, Paul Mackerras,
	Frederic Weisbecker, Jiri Slaby, linuxppc-dev

CONFIG_MCOUNT is not defined anymore, the corresponding #ifdef there
is CONFIG_FUNCTION_TRACER.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: <linuxppc-dev@lists.ozlabs.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
---
 arch/powerpc/kernel/entry_32.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 22b45a4955cd..f1f652a8f395 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -1457,4 +1457,4 @@ _GLOBAL(return_to_handler)
 	blr
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
-#endif /* CONFIG_MCOUNT */
+#endif /* CONFIG_FUNCTION_TRACER */
-- 
1.9.2

^ permalink raw reply related

* Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries
From: Ingo Molnar @ 2014-04-29  7:06 UTC (permalink / raw)
  To: Madhavan Srinivasan
  Cc: linux-arch, riel, rusty, dave.hansen, peterz, x86, linux-kernel,
	linux-mm, ak, paulus, mgorman, Linus Torvalds, akpm, linuxppc-dev,
	kirill.shutemov
In-Reply-To: <1398675690-16186-3-git-send-email-maddy@linux.vnet.ibm.com>


* Madhavan Srinivasan <maddy@linux.vnet.ibm.com> wrote:

> Performance data for different FAULT_AROUND_ORDER values from 4 socket
> Power7 system (128 Threads and 128GB memory). perf stat with repeat of 5
> is used to get the stddev values. Test ran in v3.14 kernel (Baseline) and
> v3.15-rc1 for different fault around order values.
> 
> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
> 
> Linux build (make -j64)
> minor-faults            47,437,359      35,279,286      25,425,347      23,461,275      22,002,189      21,435,836
> times in seconds        347.302528420   344.061588460   340.974022391   348.193508116   348.673900158   350.986543618
>  stddev for time        ( +-  1.50% )   ( +-  0.73% )   ( +-  1.13% )   ( +-  1.01% )   ( +-  1.89% )   ( +-  1.55% )
>  %chg time to baseline                  -0.9%           -1.8%           0.2%            0.39%           1.06%

Probably too noisy.

> Linux rebuild (make -j64)
> minor-faults            941,552         718,319         486,625         440,124         410,510         397,416
> times in seconds        30.569834718    31.219637539    31.319370649    31.434285472    31.972367174    31.443043580
>  stddev for time        ( +-  1.07% )   ( +-  0.13% )   ( +-  0.43% )   ( +-  0.18% )   ( +-  0.95% )   ( +-  0.58% )
>  %chg time to baseline                  2.1%            2.4%            2.8%            4.58%           2.85%

Here it looks like a speedup. Optimal value: 5+.

> Binutils build (make all -j64 )
> minor-faults            474,821         371,380         269,463         247,715         235,255         228,337
> times in seconds        53.882492432    53.584289348    53.882773216    53.755816431    53.607824348    53.423759642
>  stddev for time        ( +-  0.08% )   ( +-  0.56% )   ( +-  0.17% )   ( +-  0.11% )   ( +-  0.60% )   ( +-  0.69% )
>  %chg time to baseline                  -0.55%          0.0%            -0.23%          -0.51%          -0.85%

Probably too noisy, but looks like a potential slowdown?

> Two synthetic tests: access every word in file in sequential/random order.
> 
> Sequential access 16GiB file
> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
> 1 thread
>        minor-faults     263,148         131,166         32,908          16,514          8,260           1,093
>        times in seconds 53.091138345    53.113191672    53.188776177    53.233017218    53.206841347    53.429979442
>        stddev for time  ( +-  0.06% )   ( +-  0.07% )   ( +-  0.08% )   ( +-  0.09% )   ( +-  0.03% )   ( +-  0.03% )
>        %chg time to baseline            0.04%           0.18%           0.26%           0.21%           0.63%

Speedup, optimal value: 8+.

> 8 threads
>        minor-faults     2,097,267       1,048,753       262,237         131,397         65,621          8,274
>        times in seconds 55.173790028    54.591880790    54.824623287    54.802162211    54.969680503    54.790387715
>        stddev for time  ( +-  0.78% )   ( +-  0.09% )   ( +-  0.08% )   ( +-  0.07% )   ( +-  0.28% )   ( +-  0.05% )
>        %chg time to baseline            -1.05%          -0.63%          -0.67%          -0.36%          -0.69%

Looks like a regression?

> 32 threads
>        minor-faults     8,388,751       4,195,621       1,049,664       525,461         262,535         32,924
>        times in seconds 60.431573046    60.669110744    60.485336388    60.697789706    60.077959564    60.588855032
>        stddev for time  ( +-  0.44% )   ( +-  0.27% )   ( +-  0.46% )   ( +-  0.67% )   ( +-  0.31% )   ( +-  0.49% )
>        %chg time to baseline            0.39%           0.08%           0.44%           -0.58%          0.25%

Probably too noisy.

> 64 threads
>        minor-faults     16,777,409      8,607,527       2,289,766       1,202,264       598,405         67,587
>        times in seconds 96.932617720    100.675418760   102.109880836   103.881733383   102.580199555   105.751194041
>        stddev for time  ( +-  1.39% )   ( +-  1.06% )   ( +-  0.99% )   ( +-  0.76% )   ( +-  1.65% )   ( +-  1.60% )
>        %chg time to baseline            3.86%           5.34%           7.16%           5.82%           9.09%

Speedup, optimal value: 4+

> 128 threads
>        minor-faults     33,554,705      17,375,375      4,682,462       2,337,245       1,179,007       134,819
>        times in seconds 128.766704495   115.659225437   120.353046307   115.291871270   115.450886036   113.991902150
>        stddev for time  ( +-  2.93% )   ( +-  0.30% )   ( +-  2.93% )   ( +-  1.24% )   ( +-  1.03% )   ( +-  0.70% )
>        %chg time to baseline            -10.17%         -6.53%          -10.46%         -10.34%         -11.47%

Rather significant regression at order 1 already.

> Random access 1GiB file
> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
> 1 thread
>        minor-faults     17,155          8,678           2,126           1,097           581             134
>        times in seconds 51.904430523    51.658017987    51.919270792    51.560531738    52.354431597    51.976469502
>        stddev for time  ( +-  3.19% )   ( +-  1.35% )   ( +-  1.56% )   ( +-  0.91% )   ( +-  1.70% )   ( +-  2.02% )
>        %chg time to baseline            -0.47%          0.02%           -0.66%          0.86%           0.13%

Probably too noisy.

> 8 threads
>        minor-faults     131,844         70,705          17,457          8,505           4,251           598
>        times in seconds 58.162813956    54.991706305    54.952675791    55.323057492    54.755587379    53.376722828
>        stddev for time  ( +-  1.44% )   ( +-  0.69% )   ( +-  1.23% )   ( +-  2.78% )   ( +-  1.90% )   ( +-  2.91% )
>        %chg time to baseline            -5.45%          -5.52%          -4.88%          -5.86%          -8.22%

Regression.

> 32 threads
>        minor-faults     524,437         270,760         67,069          33,414          16,641          2,204
>        times in seconds 69.981777072    76.539570015    79.753578505    76.245943618    77.254258344    79.072596831
>        stddev for time  ( +-  2.81% )   ( +-  1.95% )   ( +-  2.66% )   ( +-  0.99% )   ( +-  2.35% )   ( +-  3.22% )
>        %chg time to baseline            9.37%           13.96%          8.95%           10.39%          12.98%

Speedup, optimal value hard to tell due to noise - 3+ or 8+.

> 64 threads
>        minor-faults     1,049,117       527,451         134,016         66,638          33,391          4,559
>        times in seconds 108.024517536   117.575067996   115.322659914   111.943998437   115.049450815   119.218450840
>        stddev for time  ( +-  2.40% )   ( +-  1.77% )   ( +-  1.19% )   ( +-  3.29% )   ( +-  2.32% )   ( +-  1.42% )
>        %chg time to baseline            8.84%           6.75%           3.62%           6.5%            10.3%

Speedup, optimal value again hard to tell due to noise.

> 128 threads
>        minor-faults     2,097,440       1,054,360       267,042         133,328         66,532          8,652
>        times in seconds 155.055861167   153.059625968   152.449492156   151.024005282   150.844647770   155.954366718
>        stddev for time  ( +-  1.32% )   ( +-  1.14% )   ( +-  1.32% )   ( +-  0.81% )   ( +-  0.75% )   ( +-  0.72% )
>        %chg time to baseline            -1.28%          -1.68%          -2.59%          -2.71%          0.57%

Slowdown for most orders.

> Incase of Kernel compilation, fault around order (fao) of 1 and 3 
> provides fast compilation time when compared to a value of 4. On 
> closer look, fao of 3 has higher agains. Incase of Sequential access 
> synthetic tests fao of 1 has higher gains and in Random access test, 
> fao of 3 has marginal gains. Going by compilation time, fao value of 
> 3 is suggested in this patch for pseries platform.

So I'm really at loss to understand where you get the optimal value of 
'3' from. The data does not seem to match your claim that '1 and 3 
provides fast compilation time when compared to a value of 4':

> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
> 
> Linux rebuild (make -j64)
> minor-faults            941,552         718,319         486,625         440,124         410,510         397,416
> times in seconds        30.569834718    31.219637539    31.319370649    31.434285472    31.972367174    31.443043580
>  stddev for time        ( +-  1.07% )   ( +-  0.13% )   ( +-  0.43% )   ( +-  0.18% )   ( +-  0.95% )   ( +-  0.58% )
>  %chg time to baseline                  2.1%            2.4%            2.8%            4.58%           2.85%

5 and 8, and probably 6, 7 are better than 4.

3 is probably _slower_ than the current default - but it's hard to 
tell due to inherent noise.

But the other two build tests were too noisy, and if then they showed 
a slowdown.

> Worst case scenario: we touch one page every 16M to demonstrate overhead.
> 
> Touch only one page in page table in 16GiB file
> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
> 1 thread
>        minor-faults     1,104           1,090           1,071           1,068           1,065           1,063
>        times in seconds 0.006583298     0.008531502     0.019733795     0.036033763     0.062300553     0.406857086
>        stddev for time  ( +-  2.79% )   ( +-  2.42% )   ( +-  3.47% )   ( +-  2.81% )   ( +-  2.01% )   ( +-  1.33% )
> 8 threads
>        minor-faults     8,279           8,264           8,245           8,243           8,239           8,240
>        times in seconds 0.044572398     0.057211811     0.107606306     0.205626815     0.381679120     2.647979955
>        stddev for time  ( +-  1.95% )   ( +-  2.98% )   ( +-  1.74% )   ( +-  2.80% )   ( +-  2.01% )   ( +-  1.86% )
> 32 threads
>        minor-faults     32,879          32,864          32,849          32,845          32,839          32,843
>        times in seconds 0.197659343     0.218486087     0.445116407     0.694235883     1.296894038     9.127517045
>        stddev for time  ( +-  3.05% )   ( +-  3.05% )   ( +-  4.33% )   ( +-  3.08% )   ( +-  3.75% )   ( +-  0.56% )
> 64 threads
>        minor-faults     65,680          65,664          65,646          65,645          65,640          65,647
>        times in seconds 0.455537304     0.489688780     0.866490093     1.427393118     2.379628982     17.059295051
>        stddev for time  ( +-  4.01% )   ( +-  4.13% )   ( +-  2.92% )   ( +-  1.68% )   ( +-  1.79% )   ( +-  0.48% )
> 128 threads
>        minor-faults     131,279         131,265         131,250         131,245         131,241         131,254
>        times in seconds 1.026880651     1.095327536     1.721728274     2.808233068     4.662729948     31.732848290
>        stddev for time  ( +-  6.85% )   ( +-  4.09% )   ( +-  1.71% )   ( +-  3.45% )   ( +-  2.40% )   ( +-  0.68% )

There's no '%change' values shown, but the slowdown looks significant, 
it's the worst case: for example with 1 thread order 3 looks about 
300% slower (3x slowdown) compared to order 0.

All in one, looking at your latest data I don't think the conclusion 
from your first version of this optimization patch from a month ago is 
true anymore:

> +	/* Measured on a 4 socket Power7 system (128 Threads and 128GB memory) */
> +	fault_around_order = 3;

As the data is rather conflicting and inconclusive, and if it shows a 
sweet spot it's not at order 3. New data should in general trigger 
reanalysis of your first optimization value.

I'm starting to suspect that maybe workloads ought to be given a 
choice in this matter, via madvise() or such.

Thanks,

	Ingo

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox