LinuxPPC-Dev Archive on lore.kernel.org
* Re: Panic on ppc64 with numa_balancing and !sparsemem_vmemmap
From: Mel Gorman @ 2014-03-03 17:26 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: riel, Peter Zijlstra, linux-mm, paulus, Aneesh Kumar,
	linuxppc-dev
In-Reply-To: <20140219180200.GA29257@linux.vnet.ibm.com>

On Wed, Feb 19, 2014 at 11:32:00PM +0530, Srikar Dronamraju wrote:
> 
> On a powerpc machine with CONFIG_NUMA_BALANCING=y and CONFIG_SPARSEMEM_VMEMMAP
> not enabled,  kernel panics.
> 

This?

---8<---
sched: numa: Do not group tasks if last cpu is not set

On configurations with vmemmap disabled, the following partial oops is observed

[  299.268623] CPU: 47 PID: 4366 Comm: numa01 Tainted: G      D      3.14.0-rc5-vanilla #4
[  299.278295] Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
[  299.287452] task: ffff880c670bc110 ti: ffff880c66db6000 task.ti: ffff880c66db6000
[  299.296642] RIP: 0010:[<ffffffff8109013f>]  [<ffffffff8109013f>] task_numa_fault+0x50f/0x8b0
[  299.306778] RSP: 0000:ffff880c66db7670  EFLAGS: 00010282
[  299.313769] RAX: 00000000000033ee RBX: ffff880c670bc110 RCX: 0000000000000001
[  299.322590] RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000ffffffff
[  299.331394] RBP: ffff880c66db76c8 R08: 0000000000000000 R09: 00000000000166b0
[  299.340203] R10: ffff880c7ffecd80 R11: 0000000000000000 R12: 00000000000001ff
[  299.348989] R13: 00000000000000ff R14: 00000000ffffffff R15: 0000000000000003
[  299.357763] FS:  00007f5a60a3f700(0000) GS:ffff88106f2c0000(0000) knlGS:0000000000000000
[  299.367510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  299.374913] CR2: 00000000000037da CR3: 0000000868ed4000 CR4: 00000000000007e0
[  299.383726] Stack:
[  299.387414]  0000000000000003 0000000000000000 0000000100000003 0000000100000003
[  299.396564]  ffffffff811888f4 ffff880c66db7698 0000000000000003 ffff880c7f9b3ac0
[  299.405730]  ffff880c66ccebd8 00000000ffffffff 0000000000000003 ffff880c66db7718
[  299.414907] Call Trace:
[  299.419095]  [<ffffffff811888f4>] ? migrate_misplaced_page+0xb4/0x140
[  299.427301]  [<ffffffff8115950c>] do_numa_page+0x18c/0x1f0
[  299.434554]  [<ffffffff8115a6f7>] handle_mm_fault+0x617/0xf70
[  ..........]  SNIPPED

The oops occurs in task_numa_group looking up cpu_rq(LAST__CPU_MASK). The
bug exists for all configurations but will manifest differently. On vmemmap
configurations, it looks up garbage and on !vmemmap configurations it
will oops. This patch adds the necessary check and also fixes the type
for LAST__PID_MASK and LAST__CPU_MASK which are currently signed instead
of unsigned integers.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org

diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index da52366..6f661d9 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -63,10 +63,10 @@
 
 #ifdef CONFIG_NUMA_BALANCING
 #define LAST__PID_SHIFT 8
-#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+#define LAST__PID_MASK  ((1UL << LAST__PID_SHIFT)-1)
 
 #define LAST__CPU_SHIFT NR_CPUS_BITS
-#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)
+#define LAST__CPU_MASK  ((1UL << LAST__CPU_SHIFT)-1)
 
 #define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
 #else
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7815709..b44a8b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1463,6 +1463,9 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 	int cpu = cpupid_to_cpu(cpupid);
 	int i;
 
+	if (unlikely(cpu == LAST__CPU_MASK && !cpu_online(cpu)))
+		return;
+
 	if (unlikely(!p->numa_group)) {
 		unsigned int size = sizeof(struct numa_group) +
 				    2*nr_node_ids*sizeof(unsigned long);
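
A minimal userspace sketch of the cpupid packing that the commit message
above refers to, assuming a hypothetical 6-bit CPU field in place of
NR_CPUS_BITS. The SKETCH_* macros and sketch_* helpers are illustrative
names, not the kernel's. It only shows why an all-ones CPU field has to be
treated as "last cpu not set" rather than used as a runqueue index.

```c
#include <stdbool.h>

/* Hypothetical field widths; the kernel derives the CPU width from NR_CPUS_BITS. */
#define SKETCH_PID_SHIFT 8
#define SKETCH_PID_MASK  ((1UL << SKETCH_PID_SHIFT) - 1)
#define SKETCH_CPU_SHIFT 6
#define SKETCH_CPU_MASK  ((1UL << SKETCH_CPU_SHIFT) - 1)

/* Pack a cpu and the low bits of a pid into one int, cpu above pid. */
static int sketch_pack_cpupid(int cpu, int pid)
{
	unsigned long c = (unsigned long)cpu & SKETCH_CPU_MASK;
	unsigned long p = (unsigned long)pid & SKETCH_PID_MASK;

	return (int)((c << SKETCH_PID_SHIFT) | p);
}

static int sketch_cpupid_to_cpu(int cpupid)
{
	return (int)(((unsigned int)cpupid >> SKETCH_PID_SHIFT) & SKETCH_CPU_MASK);
}

static int sketch_cpupid_to_pid(int cpupid)
{
	return (int)((unsigned int)cpupid & SKETCH_PID_MASK);
}

/* An all-ones CPU field is the "unset" marker, so it is never a valid index. */
static bool sketch_cpu_unset(int cpupid)
{
	return sketch_cpupid_to_cpu(cpupid) == (int)SKETCH_CPU_MASK;
}
```

With these widths a cpupid of -1 decodes to cpu 63, which is exactly the
value a caller would need to reject before indexing per-cpu data.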


* Re: Panic on ppc64 with numa_balancing and !sparsemem_vmemmap
From: Aneesh Kumar K.V @ 2014-03-03 19:15 UTC (permalink / raw)
  To: Mel Gorman, Srikar Dronamraju
  Cc: riel, Peter Zijlstra, linux-mm, paulus, linuxppc-dev
In-Reply-To: <20140303172649.GU6732@suse.de>

Mel Gorman <mgorman@suse.de> writes:

> On Wed, Feb 19, 2014 at 11:32:00PM +0530, Srikar Dronamraju wrote:
>> 
>> On a powerpc machine with CONFIG_NUMA_BALANCING=y and CONFIG_SPARSEMEM_VMEMMAP
>> not enabled,  kernel panics.
>> 
>
> This?

This one fixed that crash on ppc64

http://mid.gmane.org/1393578122-6500-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com

-aneesh


* Re: Panic on ppc64 with numa_balancing and !sparsemem_vmemmap
From: Mel Gorman @ 2014-03-03 20:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: riel, Srikar Dronamraju, Peter Zijlstra, linux-mm, paulus,
	linuxppc-dev
In-Reply-To: <874n3fxfeg.fsf@linux.vnet.ibm.com>

On Tue, Mar 04, 2014 at 12:45:19AM +0530, Aneesh Kumar K.V wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > On Wed, Feb 19, 2014 at 11:32:00PM +0530, Srikar Dronamraju wrote:
> >> 
> >> On a powerpc machine with CONFIG_NUMA_BALANCING=y and CONFIG_SPARSEMEM_VMEMMAP
> >> not enabled,  kernel panics.
> >> 
> >
> > This?
> 
> This one fixed that crash on ppc64
> 
> http://mid.gmane.org/1393578122-6500-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
> 

Thanks.

-- 
Mel Gorman
SUSE Labs


* [PATCH] powerpc: Align p_dyn, p_rela and p_st symbols
From: Anton Blanchard @ 2014-03-03 21:31 UTC (permalink / raw)
  To: benh, paulus, ldufour; +Cc: linuxppc-dev


The 64bit relocation code places a few symbols in the text segment.
These symbols are only 4 byte aligned where they need to be 8 byte
aligned. Add an explicit alignment.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
---

diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S
index 1482327..d88736f 100644
--- a/arch/powerpc/kernel/reloc_64.S
+++ b/arch/powerpc/kernel/reloc_64.S
@@ -81,6 +81,7 @@ _GLOBAL(relocate)
 
 6:	blr
 
+.balign 8
 p_dyn:	.llong	__dynamic_start - 0b
 p_rela:	.llong	__rela_dyn_start - 0b
 p_st:	.llong	_stext - 0b


* [PATCH] powerpc: Check that all cpu features are in the possible map
From: Michael Ellerman @ 2014-03-04  2:44 UTC (permalink / raw)
  To: linuxppc-dev

cpu_has_feature() has an optimisation where it maintains a map of
possible cpu features. This allows the compiler to determine at compile
time that some cpu_has_feature() checks will always return 0, and
therefore the code guarded by the check can be elided.

However we have no logic to check whether the set of cpu features in the
current cpu spec are in the possible map. Although that should never
happen, if it does things are likely to go badly. So add a check and
print a warning.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/kernel/cputable.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 6c8dd5d..eb31713 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -2255,6 +2255,15 @@ static struct cpu_spec * __init setup_cpu_spec(unsigned long offset,
 	}
 #endif /* CONFIG_PPC64 || CONFIG_BOOKE */
 
+	/*
+	 * Check that all CPU features are in the possible mask. We don't want
+	 * to WARN() because we're called very early, and doing so will kill
+	 * the machine, so just printk() instead.
+	 */
+	if (t->cpu_features != (t->cpu_features & CPU_FTRS_POSSIBLE))
+		printk("WARNING: cpu spec contains impossible features! 0x%lx\n",
+			(t->cpu_features & ~CPU_FTRS_POSSIBLE));
+
 	return t;
 }
 
-- 
1.8.3.2
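
A simplified sketch of the idea, using made-up feature bits (SKETCH_FTR_*)
and a made-up possible mask rather than the real cputable definitions. It
assumes the feature test is ANDed with a constant possible mask, as the
changelog describes, and shows both the elision and what goes wrong when a
CPU spec carries a bit outside that mask.

```c
/* Hypothetical feature bits and possible mask, for illustration only. */
#define SKETCH_FTR_A 0x1UL
#define SKETCH_FTR_B 0x2UL
#define SKETCH_FTR_C 0x4UL
#define SKETCH_FTRS_POSSIBLE (SKETCH_FTR_A | SKETCH_FTR_B)

/*
 * The possible mask is a constant, so when `feature` is a constant outside
 * it, the whole expression is 0 and the compiler can drop the guarded code.
 */
static int sketch_has_feature(unsigned long cur, unsigned long feature)
{
	return (SKETCH_FTRS_POSSIBLE & cur & feature) != 0;
}

/* Bits set in the CPU spec that the possible mask does not cover. */
static unsigned long sketch_impossible_features(unsigned long cur)
{
	return cur & ~SKETCH_FTRS_POSSIBLE;
}
```

A spec of 0x7 has SKETCH_FTR_C set, yet sketch_has_feature() reports it as
absent; sketch_impossible_features() is the check that would flag it.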


* Re: [PATCH] powerpc: Check that all cpu features are in the possible map
From: Michael Ellerman @ 2014-03-04  3:38 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1393901079-25319-1-git-send-email-mpe@ellerman.id.au>

On Tue, 2014-03-04 at 13:44 +1100, Michael Ellerman wrote:
> diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
> index 6c8dd5d..eb31713 100644
> --- a/arch/powerpc/kernel/cputable.c
> +++ b/arch/powerpc/kernel/cputable.c
> @@ -2255,6 +2255,15 @@ static struct cpu_spec * __init setup_cpu_spec(unsigned long offset,
>  	}
>  #endif /* CONFIG_PPC64 || CONFIG_BOOKE */
>  
> +	/*
> +	 * Check that all CPU features are in the possible mask. We don't want
> +	 * to WARN() because we're called very early, and doing so will kill
> +	 * the machine, so just printk() instead.
> +	 */
> +	if (t->cpu_features != (t->cpu_features & CPU_FTRS_POSSIBLE))
> +		printk("WARNING: cpu spec contains impossible features! 0x%lx\n",
> +			(t->cpu_features & ~CPU_FTRS_POSSIBLE));


Actually even this is not safe.

printk() takes the logbuf_lock, but we don't have a paca yet, so our LOCK_TOKEN
will be some random guff in low memory. And although that's OK, we're single
threaded, it's a bit fishy.

What's worse is printk() does local_irq_save(), which will _store_ to our paca
(soft_enabled), and so we risk flipping a value somewhere.

Interestingly lockdep appears to be OK, even though we haven't initialised it
yet. printk() helpfully turns lockdep off before doing any locking.

cheers


* Re: [PATCH v3 02/11] perf: add PMU_FORMAT_RANGE() helper for use by sw-like pmus
From: Michael Ellerman @ 2014-03-04  5:19 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra
  Cc: Peter Zijlstra, scottwood, Cody P Schafer, LKML
In-Reply-To: <1393535105-7528-3-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-27-02 at 21:04:55 UTC, Cody P Schafer wrote:
> Add PMU_FORMAT_RANGE() and PMU_FORMAT_RANGE_RESERVED() (for reserved
> areas) which generate functions to extract the relevant bits from
> event->attr.config{,1,2} for use by sw-like pmus where the
> 'config{,1,2}' values don't map directly to hardware registers.
> 
> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
> ---
>  include/linux/perf_event.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e56b07f..3da5081 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -871,4 +871,21 @@ _name##_show(struct device *dev,					\
>  									\
>  static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
>  
> +#define PMU_FORMAT_RANGE(name, attr_var, bit_start, bit_end)		\
> +PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end);		\
> +PMU_FORMAT_RANGE_RESERVED(name, attr_var, bit_start, bit_end)

I really think these should have event in the name.

Someone looking at the code is going to see event_get_foo() and wonder where
that is defined. Grep won't find a definition, tags won't find a definition,
the least you can do is have the macro name give some hint.

> +#define PMU_FORMAT_RANGE_RESERVED(name, attr_var, bit_start, bit_end)	\

It doesn't generate a format attribute.

> +static u64 event_get_##name##_max(void)					\
> +{									\
> +	int bits = (bit_end) - (bit_start) + 1;				\
> +	return ((0x1ULL << (bits - 1ULL)) - 1ULL) |			\
> +		(0xFULL << (bits - 4ULL));				\

What's wrong with:

	(0x1ULL << ((bit_end) - (bit_start) + 1)) - 1ULL;

cheers


* Re: [PATCH v3 03/11] perf: provide a common perf_event_nop_0() for use with .event_idx
From: Michael Ellerman @ 2014-03-04  5:19 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra
  Cc: Peter Zijlstra, scottwood, Cody P Schafer, LKML
In-Reply-To: <1393535105-7528-4-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-27-02 at 21:04:56 UTC, Cody P Schafer wrote:
> Rather than having every pmu that needs a function that just returns 0 for
> .event_idx define their own copy, reuse the one in kernel/events/core.c.
> 
> Rename from perf_swevent_event_idx() because we're no longer using it
> for just software events. Naming is based on the perf_pmu_nop_*()
> functions.

You could just use perf_pmu_nop_int() directly.

Peterz, OK by you?

cheers

> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
> ---
>  include/linux/perf_event.h |  1 +
>  kernel/events/core.c       | 10 +++++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3da5081..24a7b45 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -560,6 +560,7 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
>  extern u64 perf_event_read_value(struct perf_event *event,
>  				 u64 *enabled, u64 *running);
>  
> +extern int perf_event_nop_0(struct perf_event *event);
>  
>  struct perf_sample_data {
>  	u64				type;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 56003c6..2938a77 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5816,7 +5816,7 @@ static int perf_swevent_init(struct perf_event *event)
>  	return 0;
>  }
>  
> -static int perf_swevent_event_idx(struct perf_event *event)
> +int perf_event_nop_0(struct perf_event *event)
>  {
>  	return 0;
>  }
> @@ -5831,7 +5831,7 @@ static struct pmu perf_swevent = {
>  	.stop		= perf_swevent_stop,
>  	.read		= perf_swevent_read,
>  
> -	.event_idx	= perf_swevent_event_idx,
> +	.event_idx	= perf_event_nop_0,
>  };
>  
>  #ifdef CONFIG_EVENT_TRACING
> @@ -5950,7 +5950,7 @@ static struct pmu perf_tracepoint = {
>  	.stop		= perf_swevent_stop,
>  	.read		= perf_swevent_read,
>  
> -	.event_idx	= perf_swevent_event_idx,
> +	.event_idx	= perf_event_nop_0,
>  };
>  
>  static inline void perf_tp_register(void)
> @@ -6177,7 +6177,7 @@ static struct pmu perf_cpu_clock = {
>  	.stop		= cpu_clock_event_stop,
>  	.read		= cpu_clock_event_read,
>  
> -	.event_idx	= perf_swevent_event_idx,
> +	.event_idx	= perf_event_nop_0,
>  };
>  
>  /*
> @@ -6257,7 +6257,7 @@ static struct pmu perf_task_clock = {
>  	.stop		= task_clock_event_stop,
>  	.read		= task_clock_event_read,
>  
> -	.event_idx	= perf_swevent_event_idx,
> +	.event_idx	= perf_event_nop_0,
>  };
>  
>  static void perf_pmu_nop_void(struct pmu *pmu)
> -- 
> 1.9.0
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev


* [PATCH v3 1/2] powerpc: Add a cpu feature CPU_FTR_PMAO_BUG
From: Michael Ellerman @ 2014-03-04  5:31 UTC (permalink / raw)
  To: linuxppc-dev

Some power8 revisions have a hardware bug where we can lose a
Performance Monitor (PMU) exception under certain circumstances.

We will be adding a workaround for this case, see the next commit for
details. The observed behaviour is that writing PMAO doesn't cause an
exception as we would expect, hence the name of the feature.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/cputable.h | 6 ++++--
 arch/powerpc/kernel/cputable.c      | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

v3: Add CPU_FTRS_POWER8E to the possible map!
v2: Set the bit directly via the cputable entry for POWER8E.

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 617cc76..bc23477 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -189,6 +189,7 @@ extern const char *powerpc_base_platform;
 #define	CPU_FTR_HAS_PPR			LONG_ASM_CONST(0x0200000000000000)
 #define CPU_FTR_DAWR			LONG_ASM_CONST(0x0400000000000000)
 #define CPU_FTR_DABRX			LONG_ASM_CONST(0x0800000000000000)
+#define CPU_FTR_PMAO_BUG		LONG_ASM_CONST(0x1000000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -445,6 +446,7 @@ extern const char *powerpc_base_platform;
 	    CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
 	    CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
 	    CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP)
+#define CPU_FTRS_POWER8E (CPU_FTRS_POWER8 | CPU_FTR_PMAO_BUG)
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -466,8 +468,8 @@ extern const char *powerpc_base_platform;
 #define CPU_FTRS_POSSIBLE	\
 	    (CPU_FTRS_POWER3 | CPU_FTRS_RS64 | CPU_FTRS_POWER4 |	\
 	    CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | CPU_FTRS_POWER6 |	\
-	    CPU_FTRS_POWER7 | CPU_FTRS_POWER8 | CPU_FTRS_CELL |		\
-	    CPU_FTRS_PA6T | CPU_FTR_VSX)
+	    CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 |	\
+	    CPU_FTRS_CELL | CPU_FTRS_PA6T | CPU_FTR_VSX)
 #endif
 #else
 enum {
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 6c8dd5d..c1faade 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -510,7 +510,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
 		.pvr_mask		= 0xffff0000,
 		.pvr_value		= 0x004b0000,
 		.cpu_name		= "POWER8E (raw)",
-		.cpu_features		= CPU_FTRS_POWER8,
+		.cpu_features		= CPU_FTRS_POWER8E,
 		.cpu_user_features	= COMMON_USER_POWER8,
 		.cpu_user_features2	= COMMON_USER2_POWER8,
 		.mmu_features		= MMU_FTRS_POWER8,
-- 
1.8.3.2


* [PATCH v3 2/2] powerpc/perf: Add lost exception workaround
From: Michael Ellerman @ 2014-03-04  5:31 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1393911093-8248-1-git-send-email-mpe@ellerman.id.au>

Some power8 revisions have a hardware bug where we can lose a PMU
exception, this commit adds a workaround to detect the bad condition and
rectify the situation.

See the comment in the commit for a full description.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/reg.h  |   2 +
 arch/powerpc/perf/core-book3s.c | 100 +++++++++++++++++++++++++++++++++++++++-
 arch/powerpc/perf/power8-pmu.c  |   5 ++
 3 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..3003472 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -670,6 +670,7 @@
 #define   MMCR0_PMC1CE	0x00008000UL /* PMC1 count enable*/
 #define   MMCR0_PMCjCE	0x00004000UL /* PMCj count enable*/
 #define   MMCR0_TRIGGER	0x00002000UL /* TRIGGER enable */
+#define   MMCR0_PMAO_SYNC 0x00000800UL /* PMU interrupt is synchronous */
 #define   MMCR0_PMAO	0x00000080UL /* performance monitor alert has occurred, set to 0 after handling exception */
 #define   MMCR0_SHRFC	0x00000040UL /* SHRre freeze conditions between threads */
 #define   MMCR0_FC56	0x00000010UL /* freeze counters 5 and 6 */
@@ -703,6 +704,7 @@
 #define SPRN_EBBHR	804	/* Event based branch handler register */
 #define SPRN_EBBRR	805	/* Event based branch return register */
 #define SPRN_BESCR	806	/* Branch event status and control register */
+#define   BESCR_GE	0x8000000000000000ULL /* Global Enable */
 #define SPRN_WORT	895	/* Workload optimization register - thread */
 
 #define SPRN_PMC1	787
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 67cf220..9b3065d 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -120,6 +120,7 @@ static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
 static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
 void power_pmu_flush_branch_stack(void) {}
 static inline void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw) {}
+static void pmao_restore_workaround(bool ebb) { }
 #endif /* CONFIG_PPC32 */
 
 static bool regs_use_siar(struct pt_regs *regs)
@@ -545,10 +546,18 @@ static unsigned long ebb_switch_in(bool ebb, unsigned long mmcr0)
 	/* Enable EBB and read/write to all 6 PMCs for userspace */
 	mmcr0 |= MMCR0_EBE | MMCR0_PMCC_U6;
 
-	/* Add any bits from the user reg, FC or PMAO */
+	/*
+	 * Add any bits from the user MMCR0, FC or PMAO. This is compatible
+	 * with pmao_restore_workaround() because we may add PMAO but we never
+	 * clear it here.
+	 */
 	mmcr0 |= current->thread.mmcr0;
 
-	/* Be careful not to set PMXE if userspace had it cleared */
+	/*
+	 * Be careful not to set PMXE if userspace had it cleared. This is also
+	 * compatible with pmao_restore_workaround() because it has already
+	 * cleared PMXE and we leave PMAO alone.
+	 */
 	if (!(current->thread.mmcr0 & MMCR0_PMXE))
 		mmcr0 &= ~MMCR0_PMXE;
 
@@ -559,6 +568,91 @@ static unsigned long ebb_switch_in(bool ebb, unsigned long mmcr0)
 out:
 	return mmcr0;
 }
+
+static void pmao_restore_workaround(bool ebb)
+{
+	unsigned pmcs[6];
+
+	if (!cpu_has_feature(CPU_FTR_PMAO_BUG))
+		return;
+
+	/*
+	 * On POWER8E there is a hardware defect which affects the PMU context
+	 * switch logic, ie. power_pmu_disable/enable().
+	 *
+	 * When a counter overflows PMXE is cleared and FC/PMAO is set in MMCR0
+	 * by the hardware. Sometime later the actual PMU exception is
+	 * delivered.
+	 *
+	 * If we context switch, or simply disable/enable, the PMU prior to the
+	 * exception arriving, the exception will be lost when we clear PMAO.
+	 *
+	 * When we reenable the PMU, we will write the saved MMCR0 with PMAO
+	 * set, and this _should_ generate an exception. However because of the
+	 * defect no exception is generated when we write PMAO, and we get
+	 * stuck with no counters counting but no exception delivered.
+	 *
+	 * The workaround is to detect this case and tweak the hardware to
+	 * create another pending PMU exception.
+	 *
+	 * We do that by setting up PMC6 (cycles) for an imminent overflow and
+	 * enabling the PMU. That causes a new exception to be generated in the
+	 * chip, but we don't take it yet because we have interrupts hard
+	 * disabled. We then write back the PMU state as we want it to be seen
+	 * by the exception handler. When we reenable interrupts the exception
+	 * handler will be called and see the correct state.
+	 *
+	 * The logic is the same for EBB, except that the exception is gated by
+	 * us having interrupts hard disabled as well as the fact that we are
+	 * not in userspace. The exception is finally delivered when we return
+	 * to userspace.
+	 */
+
+	/* Only if PMAO is set and PMAO_SYNC is clear */
+	if ((current->thread.mmcr0 & (MMCR0_PMAO | MMCR0_PMAO_SYNC)) != MMCR0_PMAO)
+		return;
+
+	/* If we're doing EBB, only if BESCR[GE] is set */
+	if (ebb && !(current->thread.bescr & BESCR_GE))
+		return;
+
+	/*
+	 * We are already soft-disabled in power_pmu_enable(). We need to hard
+	 * disable to actually prevent the PMU exception from firing.
+	 */
+	hard_irq_disable();
+
+	/*
+	 * This is a bit gross, but we know we're on POWER8E and have 6 PMCs.
+	 * Using read/write_pmc() in a for loop adds 12 function calls and
+	 * almost doubles our code size.
+	 */
+	pmcs[0] = mfspr(SPRN_PMC1);
+	pmcs[1] = mfspr(SPRN_PMC2);
+	pmcs[2] = mfspr(SPRN_PMC3);
+	pmcs[3] = mfspr(SPRN_PMC4);
+	pmcs[4] = mfspr(SPRN_PMC5);
+	pmcs[5] = mfspr(SPRN_PMC6);
+
+	/* Ensure all freeze bits are unset */
+	mtspr(SPRN_MMCR2, 0);
+
+	/* Set up PMC6 to overflow in one cycle */
+	mtspr(SPRN_PMC6, 0x7FFFFFFE);
+
+	/* Enable exceptions and unfreeze PMC6 */
+	mtspr(SPRN_MMCR0, MMCR0_PMXE | MMCR0_PMCjCE | MMCR0_PMAO);
+
+	/* Now we need to refreeze and restore the PMCs */
+	mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMAO);
+
+	mtspr(SPRN_PMC1, pmcs[0]);
+	mtspr(SPRN_PMC2, pmcs[1]);
+	mtspr(SPRN_PMC3, pmcs[2]);
+	mtspr(SPRN_PMC4, pmcs[3]);
+	mtspr(SPRN_PMC5, pmcs[4]);
+	mtspr(SPRN_PMC6, pmcs[5]);
+}
 #endif /* CONFIG_PPC64 */
 
 static void perf_event_interrupt(struct pt_regs *regs);
@@ -1144,6 +1238,8 @@ static void power_pmu_enable(struct pmu *pmu)
 	cpuhw->mmcr[0] |= MMCR0_PMXE | MMCR0_FCECE;
 
  out_enable:
+	pmao_restore_workaround(ebb);
+
 	mmcr0 = ebb_switch_in(ebb, cpuhw->mmcr[0]);
 
 	mb();
diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 96cee20..64f04cf 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -10,6 +10,8 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#define pr_fmt(fmt)	"power8-pmu: " fmt
+
 #include <linux/kernel.h>
 #include <linux/perf_event.h>
 #include <asm/firmware.h>
@@ -774,6 +776,9 @@ static int __init init_power8_pmu(void)
 	/* Tell userspace that EBB is supported */
 	cur_cpu_spec->cpu_user_features2 |= PPC_FEATURE2_EBB;
 
+	if (cpu_has_feature(CPU_FTR_PMAO_BUG))
+		pr_info("PMAO restore workaround active.\n");
+
 	return 0;
 }
 early_initcall(init_power8_pmu);
-- 
1.8.3.2
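
The two early-return checks in pmao_restore_workaround() can be read as a
single predicate. The sketch below restates them in standalone C, reusing
the bit values from the reg.h hunk above under SKETCH_* names. The function
name and its parameters are hypothetical, and the feature check and the
hardware access are left out.

```c
#include <stdbool.h>

/* Bit values copied from the patch; the SKETCH_ prefix marks them as local copies. */
#define SKETCH_MMCR0_PMAO_SYNC 0x00000800UL
#define SKETCH_MMCR0_PMAO      0x00000080UL
#define SKETCH_BESCR_GE        0x8000000000000000ULL

/*
 * Returns true when the saved state says an exception is pending (PMAO set),
 * it is not a synchronous one (PMAO_SYNC clear), and, for EBB, the global
 * enable bit is set.
 */
static bool sketch_pmao_workaround_needed(unsigned long mmcr0, bool ebb,
					  unsigned long long bescr)
{
	if ((mmcr0 & (SKETCH_MMCR0_PMAO | SKETCH_MMCR0_PMAO_SYNC)) != SKETCH_MMCR0_PMAO)
		return false;

	if (ebb && !(bescr & SKETCH_BESCR_GE))
		return false;

	return true;
}
```

Masking with both bits and comparing against PMAO alone is a compact way
to test "PMAO set and PMAO_SYNC clear" in one comparison.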


* Re: [PATCH v3 03/11] perf: provide a common perf_event_nop_0() for use with .event_idx
From: Cody P Schafer @ 2014-03-04  7:01 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra
  Cc: Peter Zijlstra, scottwood, LKML
In-Reply-To: <20140304051936.97CBF2C020A@ozlabs.org>

On 03/03/2014 09:19 PM, Michael Ellerman wrote:
> On Thu, 2014-27-02 at 21:04:56 UTC, Cody P Schafer wrote:
>> Rather than having every pmu that needs a function that just returns 0 for
>> .event_idx define their own copy, reuse the one in kernel/events/core.c.
>>
>> Rename from perf_swevent_event_idx() because we're no longer using it
>> for just software events. Naming is based on the perf_pmu_nop_*()
>> functions.
>
> You could just use perf_pmu_nop_int() directly.

No, .event_idx needs something that takes a (struct perf_event *), 
perf_pmu_nop_int() takes a (struct pmu *).
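
The type mismatch can be sketched with a trimmed-down callback table. The
sketch_* structures and functions below are hypothetical stand-ins, not
the real perf definitions. They only show that two "return 0" helpers with
different parameter types are not interchangeable as function pointers.

```c
#include <stddef.h>

/* Opaque stand-ins for the kernel structures; only the pointer types matter here. */
struct sketch_perf_event;
struct sketch_pmu;

/* Hypothetical callback table with one callback of each flavour. */
struct sketch_pmu_ops {
	int (*commit_txn)(struct sketch_pmu *pmu);
	int (*event_idx)(struct sketch_perf_event *event);
};

static int sketch_pmu_nop_int(struct sketch_pmu *pmu)
{
	(void)pmu;
	return 0;
}

static int sketch_event_nop_0(struct sketch_perf_event *event)
{
	(void)event;
	return 0;
}

static const struct sketch_pmu_ops sketch_ops = {
	.commit_txn = sketch_pmu_nop_int,
	/* .event_idx = sketch_pmu_nop_int would be an incompatible pointer type. */
	.event_idx  = sketch_event_nop_0,
};
```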


* [PATCH] net/mlx4: Support shutdown() interface
From: Gavin Shan @ 2014-03-04  7:35 UTC (permalink / raw)
  To: netdev, linuxppc-dev; +Cc: weiyang, amirv, davem, Gavin Shan

In the kexec scenario, we failed to load the mlx4 driver in the
second kernel because the ownership bit was still held by the first
kernel, which had not released it correctly.

The patch adds a shutdown() interface so that the ownership can
be released correctly in the first kernel. It also helps avoid
EEH errors during the boot stage of the second kernel caused by
undesired traffic, which the hardware can't handle during
that stage on the Power platform.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index d711158..5a6105f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2684,6 +2684,7 @@ static struct pci_driver mlx4_driver = {
 	.name		= DRV_NAME,
 	.id_table	= mlx4_pci_table,
 	.probe		= mlx4_init_one,
+	.shutdown	= mlx4_remove_one,
 	.remove		= mlx4_remove_one,
 	.err_handler    = &mlx4_err_handler,
 };
-- 
1.7.10.4
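
A toy model of the failure described in the changelog, assuming a device
with a single ownership bit. Everything named sketch_* is hypothetical and
is not taken from the mlx4 driver. It only contrasts a driver table with
and without a shutdown hook when the device is probed a second time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device with a single ownership bit. */
struct sketch_dev {
	bool owned;
};

struct sketch_driver {
	int  (*probe)(struct sketch_dev *dev);
	void (*remove)(struct sketch_dev *dev);
	void (*shutdown)(struct sketch_dev *dev);	/* may be NULL */
};

static int sketch_probe(struct sketch_dev *dev)
{
	if (dev->owned)
		return -1;	/* ownership was never released */
	dev->owned = true;
	return 0;
}

static void sketch_remove(struct sketch_dev *dev)
{
	dev->owned = false;
}

/* Probe, run the shutdown hook if there is one, then probe again as a second kernel would. */
static int sketch_kexec_reprobe(const struct sketch_driver *drv)
{
	struct sketch_dev dev = { .owned = false };

	if (drv->probe(&dev))
		return -1;
	if (drv->shutdown)
		drv->shutdown(&dev);
	return drv->probe(&dev);
}

static const struct sketch_driver sketch_without_shutdown = {
	.probe = sketch_probe, .remove = sketch_remove,
};

static const struct sketch_driver sketch_with_shutdown = {
	.probe = sketch_probe, .remove = sketch_remove, .shutdown = sketch_remove,
};
```

In this model the second probe fails for sketch_without_shutdown and
succeeds for sketch_with_shutdown, which reuses the remove path the same
way the patch does.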


* UART Fifo mode
From: Poonam.Aggrwal @ 2014-03-04  7:52 UTC (permalink / raw)
  To: linuxppc-dev@lists.ozlabs.org

Hello All

I am debugging a UART issue on a Freescale PowerPC SOC.

My observations are:

When FIFO mode is enabled (FCR[FEN]=1) and the FIFO size is 16 bytes, console output does not come out properly: a lot of characters are missing.

But when I change tx_loadsz to 8, the UART works fine.

        [PORT_16550A] = {
                .name           = "16550A",
                .fifo_size      = 16,
-               .tx_loadsz      = 16,
+               .tx_loadsz      = 8,
                .fcr            = UART_FCR_ENABLE_FIFO | UART_FCR_R_TRIG_10,
                .flags          = UART_CAP_FIFO,
        },

Can anybody help me understand the significance of tx_loadsz? From the code (drivers/tty/serial/8250/8250.c), it looks like this parameter controls how many bytes we write to the UART Tx register at one time.
So ideally this should be equal to the FIFO size.

I also see certain UARTs where tx_loadsz is less than fifo_size.

Please help me understand how tx_loadsz should be determined.

Many Thanks
Poonam
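
A simplified model of the behaviour the message describes, assuming that
each "transmitter empty" event writes at most tx_loadsz bytes. The sketch_*
functions are hypothetical and leave out the real driver's register access
and locking.

```c
/* Write at most tx_loadsz bytes for one event; returns how many were written. */
static unsigned int sketch_tx_burst(unsigned int *pending, unsigned int tx_loadsz)
{
	unsigned int n = *pending < tx_loadsz ? *pending : tx_loadsz;

	*pending -= n;
	return n;
}

/* Number of events needed to drain the queue (tx_loadsz must be non-zero). */
static unsigned int sketch_events_to_drain(unsigned int pending, unsigned int tx_loadsz)
{
	unsigned int events = 0;

	while (pending) {
		sketch_tx_burst(&pending, tx_loadsz);
		events++;
	}
	return events;
}
```

Under this model a smaller tx_loadsz only means more, shorter bursts. If
the hardware signals "empty" before the FIFO has really drained, a full
fifo_size burst could overrun it. That would be one possible explanation
for the observation above, but it is an assumption and not something the
message confirms.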


* Re: [PATCH v3 02/11] perf: add PMU_FORMAT_RANGE() helper for use by sw-like pmus
From: Cody P Schafer @ 2014-03-04  8:09 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra
  Cc: Peter Zijlstra, scottwood, LKML
In-Reply-To: <20140304051936.33A712C01AB@ozlabs.org>

On 03/03/2014 09:19 PM, Michael Ellerman wrote:
> On Thu, 2014-27-02 at 21:04:55 UTC, Cody P Schafer wrote:
>> Add PMU_FORMAT_RANGE() and PMU_FORMAT_RANGE_RESERVED() (for reserved
>> areas) which generate functions to extract the relevant bits from
>> event->attr.config{,1,2} for use by sw-like pmus where the
>> 'config{,1,2}' values don't map directly to hardware registers.
>>
>> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
>> ---
>>   include/linux/perf_event.h | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index e56b07f..3da5081 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -871,4 +871,21 @@ _name##_show(struct device *dev,					\
>>   									\
>>   static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
>>
>> +#define PMU_FORMAT_RANGE(name, attr_var, bit_start, bit_end)		\
>> +PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end);		\
>> +PMU_FORMAT_RANGE_RESERVED(name, attr_var, bit_start, bit_end)
>
> I really think these should have event in the name.
>
> Someone looking at the code is going to see event_get_foo() and wonder where
> that is defined. Grep won't find a definition, tags won't find a definition,
> the least you can do is have the macro name give some hint.
>

That is a good point (grep-ability). Let me think about this. There is 
also the possibility that I could adjust the event_get_*() naming to 
something else. format_get_*()? event_get_format_*()? (these names keep 
growing...)

>> +#define PMU_FORMAT_RANGE_RESERVED(name, attr_var, bit_start, bit_end)	\
>
> It doesn't generate a format attribute.

This was done with the idea that the term "format" didn't just refer to 
the attribute exposed in sysfs, it referred to "some subset of bits 
extractable from attr.config{,1,2}". Which is also the reasoning for the 
above naming.

>> +static u64 event_get_##name##_max(void)					\
>> +{									\
>> +	int bits = (bit_end) - (bit_start) + 1;				\
>> +	return ((0x1ULL << (bits - 1ULL)) - 1ULL) |			\
>> +		(0xFULL << (bits - 4ULL));				\
>
> What's wrong with:
>
> 	(0x1ULL << ((bit_end) - (bit_start) + 1)) - 1ULL;

With bit_end = 63 and bit_start = 0 that shifts by 64, which overflows the
<< (undefined in C for a 64-bit type), so the max for (0, 63) can come out as 0.
That said, the current implementation is wrong when (bits < 4). Here's 
one that actually works (without overflowing):

         return (((1ull << (bit_end - bit_start)) - 1) << 1) + 1;

And an examination of the problematic case:

         #if 0
         typedef unsigned long long ull;
         ull a = bits - 1; /* 63 */
         ull b = 1ull << a; /* 0x8000000000000000 */
         ull c = b - 1;    /* 0x7fffffffffffffff */
         ull d = c << 1;   /* 0xfffffffffffffffe */
         ull e = d + 1;    /* 0xffffffffffffffff */
         return e;
         #endif

Small number of valid inputs, so I also tested it for all of them using

	unsigned bits = (bit_end) - (bit_start) + 1;
	return (bits < (sizeof(0ULL) * CHAR_BIT))
			? ((1ULL << bits) - 1ULL)
			: ~0ULL;

As the baseline correct one.
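
For reference, the comparison described above can be reproduced in
userspace roughly as follows. The sketch_* names are hypothetical; the two
formulas are the ones quoted in this message, checked against each other
for every range that fits in a 64-bit word.

```c
#include <limits.h>

typedef unsigned long long sketch_u64;

/* Shift by (width - 1) and rebuild the low bit, so the shift count never reaches 64. */
static sketch_u64 sketch_range_max(unsigned bit_start, unsigned bit_end)
{
	return (((1ULL << (bit_end - bit_start)) - 1) << 1) + 1;
}

/* Baseline with an explicit full-width special case. */
static sketch_u64 sketch_range_max_baseline(unsigned bit_start, unsigned bit_end)
{
	unsigned bits = bit_end - bit_start + 1;

	return (bits < sizeof(0ULL) * CHAR_BIT) ? ((1ULL << bits) - 1ULL) : ~0ULL;
}

/* Returns 1 when both formulas agree for every range inside a 64-bit word. */
static int sketch_formulas_agree(void)
{
	unsigned start, end;

	for (start = 0; start < 64; start++)
		for (end = start; end < 64; end++)
			if (sketch_range_max(start, end) !=
			    sketch_range_max_baseline(start, end))
				return 0;
	return 1;
}
```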


* RE: [PATCH] Corenet: Add QE platform support for Corenet
From: qiang.zhao @ 2014-03-04  9:09 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Scott Wood, linuxppc-dev@lists.ozlabs.org, Xiaobo Xie
In-Reply-To: <62FC0C4F-26FD-44B4-BC07-BDF1904AE637@kernel.crashing.org>

On Mar 3, 2014, at 11:51 PM, Kumar Gala [galak@kernel.crashing.org] wrote:



> -----Original Message-----
> From: Kumar Gala [mailto:galak@kernel.crashing.org]
> Sent: Monday, March 03, 2014 11:51 PM
> To: Zhao Qiang-B45475
> Cc: linuxppc-dev@lists.ozlabs.org; Wood Scott-B07421; Xie Xiaobo-R63061
> Subject: Re: [PATCH] Corenet: Add QE platform support for Corenet
>
>
> On Feb 28, 2014, at 2:48 AM, Zhao Qiang <B45475@freescale.com> wrote:
>
> > There is QE on platform T104x, add support.
> > Call funcs qe_ic_init and qe_init if CONFIG_QUICC_ENGINE is defined.
> >
> > Signed-off-by: Zhao Qiang <B45475@freescale.com>
> > ---
> > arch/powerpc/platforms/85xx/corenet_generic.c | 32
> > +++++++++++++++++++++++++++
> > 1 file changed, 32 insertions(+)
>
> Can you use mpc85xx_qe_init() instead?


mpc85xx_qe_init() is for the old QE, which differs from the new QE.
The new QE has no par_io, so it is not correct to initialize par_io
for it (mpc85xx_qe_init() calls par_io_init()).

>
> >
> > diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c
> > b/arch/powerpc/platforms/85xx/corenet_generic.c
> > index fbd871e..f8c8e0c 100644
> > --- a/arch/powerpc/platforms/85xx/corenet_generic.c
> > +++ b/arch/powerpc/platforms/85xx/corenet_generic.c
> >
> > /*
> > @@ -52,11 +68,24 @@ void __init corenet_gen_pic_init(void)
> >  */
> > void __init corenet_gen_setup_arch(void)
> > {
> > +#ifdef CONFIG_QUICC_ENGINE
> > +	struct device_node *np;
> > +#endif
> > 	mpc85xx_smp_init();
> >
> > 	swiotlb_detect_4g();
> >
> > 	pr_info("%s board from Freescale Semiconductor\n", ppc_md.name);
> > +
> > +#ifdef CONFIG_QUICC_ENGINE
> > +	np = of_find_compatible_node(NULL, NULL, "fsl,qe");
> > +	if (!np) {
> > +		pr_err("%s: Could not find Quicc Engine node\n", __func__);
> > +		return;
>
> This doesn't seem like a reasonable error message for a common corenet
> platform.  It seems reasonable to build QE support but boot on a chip w/o
> QE.
>
> > +	}
> > +	qe_reset();
> > +	of_node_put(np);
> > +#endif
> > }


Regards,
Zhao Qiang

^ permalink raw reply

* Re: [PATCH v3] powerpc/powernv Platform dump interface
From: Vasant Hegde @ 2014-03-04 12:22 UTC (permalink / raw)
  To: Stewart Smith, benh, linuxppc-dev
In-Reply-To: <1393802742-3891-1-git-send-email-stewart@linux.vnet.ibm.com>

On 03/03/2014 04:55 AM, Stewart Smith wrote:
> This enables support for userspace to fetch and initiate FSP and
> Platform dumps from the service processor (via firmware) through sysfs.
>
> Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
>
> Flow:
>    - We register for OPAL notification events.
>    - OPAL sends new dump available notification.
>    - We make information on dump available via sysfs
>    - Userspace requests dump contents
>    - We retrieve the dump via OPAL interface
>    - User copies the dump data
>    - userspace sends ack for dump
>    - We send ACK to OPAL.
>
> sysfs files:
>    - We add the /sys/firmware/opal/dump directory
>    - echoing 1 (well, anything, but in future we may support
>      different dump types) to /sys/firmware/opal/dump/initiate_dump
>      will initiate a dump.

Stewart,

s/initiate/initiate FSP/

>    - Each dump that we've been notified of gets a directory
>      in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex,
>      as this is what's used elsewhere to identify the dump).
>    - Each dump has files: id, type, dump and acknowledge
>      dump is binary and is the dump itself.
>      echoing 'ack' to acknowledge (currently any string will do) will
>      acknowledge the dump and it will soon after disappear from sysfs.
>
> OPAL APIs:
>    - opal_dump_init()
>    - opal_dump_info()

opal_dump_info2()

>    - opal_dump_read()
>    - opal_dump_ack()
>    - opal_dump_resend_notification()
>
> Currently we are only ever notified for one dump at a time (until
> the user explicitly acks the current dump, then we get a notification
> of the next dump), but this kernel code should "just work" when OPAL
> starts notifying us of all the dumps present.
>
> Changes since v2:
>   - fix bug where we would free the dump buffer after userspace read it,
>     refetching if needed. Refetching doesn't currently work, so we must
>     keep the dump around for subsequent reads.
>
> Changes since v1:
>   - Add support for getting dump type from OPAL through new OPAL call
>     (falling back to old OPAL_DUMP_INFO call if OPAL_DUMP_INFO2 isn't
>      supported)
>   - use dump type in directory name for dump
>
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
> ---
>   Documentation/ABI/stable/sysfs-firmware-opal-dump |   41 ++
>   arch/powerpc/include/asm/opal.h                   |   14 +
>   arch/powerpc/platforms/powernv/Makefile           |    2 +-
>   arch/powerpc/platforms/powernv/opal-dump.c        |  525 +++++++++++++++++++++
>   arch/powerpc/platforms/powernv/opal-wrappers.S    |    6 +
>   arch/powerpc/platforms/powernv/opal.c             |    2 +
>   6 files changed, 589 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/ABI/stable/sysfs-firmware-opal-dump
>   create mode 100644 arch/powerpc/platforms/powernv/opal-dump.c
>
> diff --git a/Documentation/ABI/stable/sysfs-firmware-opal-dump b/Documentation/ABI/stable/sysfs-firmware-opal-dump
> new file mode 100644
> index 0000000..32fe7f5
> --- /dev/null
> +++ b/Documentation/ABI/stable/sysfs-firmware-opal-dump
> @@ -0,0 +1,41 @@
> +What:		/sys/firmware/opal/dump
> +Date:		Feb 2014
> +Contact:	Stewart Smith <stewart@linux.vnet.ibm.com>
> +Description:
> +		This directory exposes interfaces for interacting with
> +		the FSP and platform dumps through OPAL firmware interface.
> +
> +		This is only for the powerpc/powernv platform.
> +
> +		initiate_dump:	When '1' is written to it,
> +				we will initiate a dump.

initiate FSP dump

> +				Read this file for supported commands.
> +
> +		0xXX-0xYYYY:	A directory for dump of type 0xXX and
> +				id 0xYYYY (in hex). The name of this
> +				directory should not be relied upon to
> +				be in this format, only that it's unique
> +				among all dumps. For determining the type
> +				and ID of the dump, use the id and type files.
> +				Do not rely on any particular size of dump
> +				type or dump id.
> +
> +		Each dump has the following files:
> +		id:		An ASCII representation of the dump ID
> +				in hex (e.g. '0x01')
> +		type:		An ASCII representation of the type of
> +				dump in the format "0x%x %s" with the ID

Better 0x%x - %s ?

> +				in hex and a description of the dump type
> +				(or 'unknown').
> +				Type '0xffffffff unknown' is used when
> +				we could not get the type from firmware.
> +				e.g. '0x02 System/Platform Dump'
> +		dump:		A binary file containing the dump.
> +				The size of the dump is the size of this file.
> +		acknowledge:	When 'ack' is written to this, we will
> +				acknowledge that we've retrieved the
> +				dump to the service processor. It will
> +				then remove it, making the dump
> +				inaccessible.
> +				Reading this file will get a list of
> +				supported actions.
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 40157e2..89c840c 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -154,9 +154,15 @@ extern int opal_enter_rtas(struct rtas_args *args,
>   #define OPAL_FLASH_VALIDATE			76
>   #define OPAL_FLASH_MANAGE			77
>   #define OPAL_FLASH_UPDATE			78
> +#define OPAL_DUMP_INIT				81
> +#define OPAL_DUMP_INFO				82
> +#define OPAL_DUMP_READ				83
> +#define OPAL_DUMP_ACK				84
>   #define OPAL_GET_MSG				85
>   #define OPAL_CHECK_ASYNC_COMPLETION		86
> +#define OPAL_DUMP_RESEND			91
>   #define OPAL_SYNC_HOST_REBOOT			87
> +#define OPAL_DUMP_INFO2				94
>
>   #ifndef __ASSEMBLY__
>
> @@ -237,6 +243,7 @@ enum OpalPendingState {
>   	OPAL_EVENT_EPOW			= 0x80,
>   	OPAL_EVENT_LED_STATUS		= 0x100,
>   	OPAL_EVENT_PCI_ERROR		= 0x200,
> +	OPAL_EVENT_DUMP_AVAIL		= 0x400,
>   	OPAL_EVENT_MSG_PENDING		= 0x800,
>   };
>
> @@ -826,6 +833,12 @@ int64_t opal_lpc_read(uint32_t chip_id, enum OpalLPCAddressType addr_type,
>   int64_t opal_validate_flash(uint64_t buffer, uint32_t *size, uint32_t *result);
>   int64_t opal_manage_flash(uint8_t op);
>   int64_t opal_update_flash(uint64_t blk_list);
> +int64_t opal_dump_init(uint8_t dump_type);
> +int64_t opal_dump_info(uint32_t *dump_id, uint32_t *dump_size);
> +int64_t opal_dump_info2(uint32_t *dump_id, uint32_t *dump_size, uint32_t *dump_type);
> +int64_t opal_dump_read(uint32_t dump_id, uint64_t buffer);
> +int64_t opal_dump_ack(uint32_t dump_id);
> +int64_t opal_dump_resend_notification(void);
>
>   int64_t opal_get_msg(uint64_t buffer, size_t size);
>   int64_t opal_check_completion(uint64_t buffer, size_t size, uint64_t token);
> @@ -861,6 +874,7 @@ extern void opal_get_rtc_time(struct rtc_time *tm);
>   extern unsigned long opal_get_boot_time(void);
>   extern void opal_nvram_init(void);
>   extern void opal_flash_init(void);
> +extern void opal_platform_dump_init(void);
>
>   extern int opal_machine_check(struct pt_regs *regs);
>
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index 8d767fd..3528c11 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -1,6 +1,6 @@
>   obj-y			+= setup.o opal-takeover.o opal-wrappers.o opal.o
>   obj-y			+= opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
> -obj-y			+= rng.o
> +obj-y			+= rng.o opal-dump.o
>
>   obj-$(CONFIG_SMP)	+= smp.o
>   obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
> diff --git a/arch/powerpc/platforms/powernv/opal-dump.c b/arch/powerpc/platforms/powernv/opal-dump.c
> new file mode 100644
> index 0000000..0c767c5
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/opal-dump.c
> @@ -0,0 +1,525 @@
> +/*
> + * PowerNV OPAL Dump Interface
> + *
> + * Copyright 2013,2014 IBM Corp.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/kobject.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
> +#include <linux/delay.h>
> +
> +#include <asm/opal.h>
> +
> +#define DUMP_TYPE_FSP	0x01

Better to define the other dump types (sysdump etc.) here and use them below?
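
Something along these lines, as a rough standalone sketch. Only
DUMP_TYPE_FSP exists in the patch; DUMP_TYPE_SYS and DUMP_TYPE_SMA are
hypothetical names, with values taken from the switch in
dump_type_to_string():

```c
#include <stdint.h>

#define DUMP_TYPE_FSP	0x01
#define DUMP_TYPE_SYS	0x02	/* hypothetical name, not in the patch */
#define DUMP_TYPE_SMA	0x03	/* hypothetical name, not in the patch */

/* Same mapping as in the patch, keyed by named constants instead of literals. */
static const char *dump_type_to_string(uint32_t type)
{
	switch (type) {
	case DUMP_TYPE_FSP: return "SP Dump";
	case DUMP_TYPE_SYS: return "System/Platform Dump";
	case DUMP_TYPE_SMA: return "SMA Dump";
	default: return "unknown";
	}
}
```
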

> +
> +struct dump_obj {
> +	struct kobject  kobj;
> +	struct bin_attribute dump_attr;
> +	uint32_t	id;  /* becomes object name */
> +	uint32_t	type;
> +	uint32_t	size;
> +	char		*buffer;
> +};
> +#define to_dump_obj(x) container_of(x, struct dump_obj, kobj)
> +
> +struct dump_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct dump_obj *dump, struct dump_attribute *attr,
> +			char *buf);
> +	ssize_t (*store)(struct dump_obj *dump, struct dump_attribute *attr,
> +			 const char *buf, size_t count);
> +};
> +#define to_dump_attr(x) container_of(x, struct dump_attribute, attr)
> +
> +static ssize_t dump_id_show(struct dump_obj *dump_obj,
> +			    struct dump_attribute *attr,
> +			    char *buf)
> +{
> +	return sprintf(buf, "0x%x\n", dump_obj->id);
> +}
> +
> +static const char* dump_type_to_string(uint32_t type)
> +{
> +	switch (type) {
> +	case 0x01: return "SP Dump";
> +	case 0x02: return "System/Platform Dump";
> +	case 0x03: return "SMA Dump";
> +	default: return "unknown";
> +	}
> +}
> +
> +static ssize_t dump_type_show(struct dump_obj *dump_obj,
> +			      struct dump_attribute *attr,
> +			      char *buf)
> +{
> +	
> +	return sprintf(buf, "0x%x %s\n", dump_obj->type,

Better 0x%x - %s ?

> +		       dump_type_to_string(dump_obj->type));
> +}
> +
> +static ssize_t dump_ack_show(struct dump_obj *dump_obj,
> +			     struct dump_attribute *attr,
> +			     char *buf)
> +{
> +	return sprintf(buf, "ack - acknowledge dump\n");

Not sure but better "echo ack - acknowledge dump"?

> +}
> +
> +/*
> + * Send acknowledgement to OPAL
> + */
> +static int64_t dump_send_ack(uint32_t dump_id)
> +{
> +	int rc;
> +
> +	rc = opal_dump_ack(dump_id);
> +	if (rc)
> +		pr_warn("%s: Failed to send ack to Dump ID 0x%x (%d)\n",
> +			__func__, dump_id, rc);
> +	return rc;
> +}
> +
> +static void delay_release_kobj(void *kobj)
> +{
> +	kobject_put((struct kobject *)kobj);
> +}
> +
> +static ssize_t dump_ack_store(struct dump_obj *dump_obj,
> +			      struct dump_attribute *attr,
> +			      const char *buf,
> +			      size_t count)
> +{
> +	dump_send_ack(dump_obj->id);
> +	sysfs_schedule_callback(&dump_obj->kobj, delay_release_kobj,
> +				&dump_obj->kobj, THIS_MODULE);
> +	return count;
> +}
> +
> +/* Attributes of a dump
> + * The binary attribute of the dump itself is dynamic
> + * due to the dynamic size of the dump
> + */
> +static struct dump_attribute id_attribute =
> +	__ATTR(id, 0666, dump_id_show, NULL);
> +static struct dump_attribute type_attribute =
> +	__ATTR(type, 0666, dump_type_show, NULL);
> +static struct dump_attribute ack_attribute =
> +	__ATTR(acknowledge, 0660, dump_ack_show, dump_ack_store);
> +
> +static ssize_t init_dump_show(struct dump_obj *dump_obj,
> +			      struct dump_attribute *attr,
> +			      char *buf)
> +{
> +	return sprintf(buf, "1 - initiate dump\n");

initiate FSP dump

> +}
> +
> +static int64_t dump_fips_init(uint8_t type)
> +{
> +	int rc;
> +
> +	rc = opal_dump_init(type);
> +	if (rc)
> +		pr_warn("%s: Failed to initiate FipS dump (%d)\n",
> +			__func__, rc);
> +	return rc;
> +}
> +
> +static ssize_t init_dump_store(struct dump_obj *dump_obj,
> +			       struct dump_attribute *attr,
> +			       const char *buf,
> +			       size_t count)
> +{
> +	dump_fips_init(DUMP_TYPE_FSP);
> +	pr_info("%s: Initiated FSP dump\n", __func__);

This message is misleading if OPAL fails to initiate the FSP dump. Better to
move it into dump_fips_init()?
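
Roughly like this, shown as a userspace sketch and not as kernel code.
opal_dump_init() is replaced by a stub whose return value is set through
the hypothetical variable stub_rc, and fprintf() stands in for
pr_warn()/pr_info():

```c
#include <stdint.h>
#include <stdio.h>

#define DUMP_TYPE_FSP	0x01

/* Stand-in for the firmware call; the test harness sets stub_rc. */
static int64_t stub_rc;

static int64_t opal_dump_init(uint8_t dump_type)
{
	(void)dump_type;
	return stub_rc;
}

/* Report success or failure in one place, next to the call that decides it. */
static int64_t dump_fips_init(uint8_t type)
{
	int64_t rc = opal_dump_init(type);

	if (rc)
		fprintf(stderr, "%s: Failed to initiate FipS dump (%lld)\n",
			__func__, (long long)rc);
	else
		fprintf(stderr, "%s: Initiated FSP dump\n", __func__);
	return rc;
}
```

With that, init_dump_store() would only call dump_fips_init() and return
count, and the "Initiated" message could no longer appear after a failed
call.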

> +	return count;
> +}
> +
> +static struct dump_attribute initiate_attribute =
> +	__ATTR(initiate_dump, 0600, init_dump_show, init_dump_store);
> +
> +static struct attribute *initiate_attrs[] = {
> +	&initiate_attribute.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group initiate_attr_group = {
> +	.attrs = initiate_attrs,
> +};
> +
> +static struct kset *dump_kset;
> +
> +static ssize_t dump_attr_show(struct kobject *kobj,
> +			      struct attribute *attr,
> +			      char *buf)
> +{
> +	struct dump_attribute *attribute;
> +	struct dump_obj *dump;
> +
> +	attribute = to_dump_attr(attr);
> +	dump = to_dump_obj(kobj);
> +
> +	if (!attribute->show)
> +		return -EIO;
> +
> +	return attribute->show(dump, attribute, buf);
> +}
> +
> +static ssize_t dump_attr_store(struct kobject *kobj,
> +			       struct attribute *attr,
> +			       const char *buf, size_t len)
> +{
> +	struct dump_attribute *attribute;
> +	struct dump_obj *dump;
> +
> +	attribute = to_dump_attr(attr);
> +	dump = to_dump_obj(kobj);
> +
> +	if (!attribute->store)
> +		return -EIO;
> +
> +	return attribute->store(dump, attribute, buf, len);
> +}
> +
> +static const struct sysfs_ops dump_sysfs_ops = {
> +	.show = dump_attr_show,
> +	.store = dump_attr_store,
> +};
> +
> +static void dump_release(struct kobject *kobj)
> +{
> +	struct dump_obj *dump;
> +
> +	dump = to_dump_obj(kobj);
> +	vfree(dump->buffer);
> +	kfree(dump);
> +}
> +
> +static struct attribute *dump_default_attrs[] = {
> +	&id_attribute.attr,
> +	&type_attribute.attr,
> +	&ack_attribute.attr,
> +	NULL,
> +};
> +
> +static struct kobj_type dump_ktype = {
> +	.sysfs_ops = &dump_sysfs_ops,
> +	.release = &dump_release,
> +	.default_attrs = dump_default_attrs,
> +};
> +
> +static void free_dump_sg_list(struct opal_sg_list *list)
> +{
> +	struct opal_sg_list *sg1;
> +	while (list) {
> +		sg1 = list->next;
> +		kfree(list);
> +		list = sg1;
> +	}
> +	list = NULL;
> +}
> +
> +static struct opal_sg_list *dump_data_to_sglist(struct dump_obj *dump)
> +{
> +	struct opal_sg_list *sg1, *list = NULL;
> +	void *addr;
> +	int64_t size;
> +
> +	addr = dump->buffer;
> +	size = dump->size;
> +
> +	sg1 = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!sg1)
> +		goto nomem;
> +
> +	list = sg1;
> +	sg1->num_entries = 0;
> +	while (size > 0) {
> +		/* Translate virtual address to physical address */
> +		sg1->entry[sg1->num_entries].data =
> +			(void *)(vmalloc_to_pfn(addr) << PAGE_SHIFT);
> +
> +		if (size > PAGE_SIZE)
> +			sg1->entry[sg1->num_entries].length = PAGE_SIZE;
> +		else
> +			sg1->entry[sg1->num_entries].length = size;
> +
> +		sg1->num_entries++;
> +		if (sg1->num_entries >= SG_ENTRIES_PER_NODE) {
> +			sg1->next = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +			if (!sg1->next)
> +				goto nomem;
> +
> +			sg1 = sg1->next;
> +			sg1->num_entries = 0;
> +		}
> +		addr += PAGE_SIZE;
> +		size -= PAGE_SIZE;
> +	}
> +	return list;
> +
> +nomem:
> +	pr_err("%s : Failed to allocate memory\n", __func__);
> +	free_dump_sg_list(list);
> +	return NULL;
> +}
> +
> +static void sglist_to_phy_addr(struct opal_sg_list *list)
> +{
> +	struct opal_sg_list *sg, *next;
> +
> +	for (sg = list; sg; sg = next) {
> +		next = sg->next;
> +		/* Don't translate NULL pointer for last entry */
> +		if (sg->next)
> +			sg->next = (struct opal_sg_list *)__pa(sg->next);
> +		else
> +			sg->next = NULL;
> +
> +		/* Convert num_entries to length */
> +		sg->num_entries =
> +			sg->num_entries * sizeof(struct opal_sg_entry) + 16;
> +	}
> +}
> +
> +static int64_t dump_read_info(uint32_t *id, uint32_t *size, uint32_t *type)
> +{
> +	int rc;
> +	*type = 0xffffffff;
> +
> +	rc = opal_dump_info2(id, size, type);
> +
> +	if (rc == OPAL_PARAMETER)
> +		rc = opal_dump_info(id, size);
> +
> +	if (rc)
> +		pr_warn("%s: Failed to get dump info (%d)\n",
> +			__func__, rc);
> +	return rc;
> +}
> +
> +static int64_t dump_read_data(struct dump_obj *dump)
> +{
> +	struct opal_sg_list *list;
> +	uint64_t addr;
> +	int64_t rc;
> +
> +	/* Allocate memory */
> +	dump->buffer = vzalloc(PAGE_ALIGN(dump->size));
> +	if (!dump->buffer) {
> +		pr_err("%s : Failed to allocate memory\n", __func__);
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Generate SG list */
> +	list = dump_data_to_sglist(dump);
> +	if (!list) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Translate sg list addr to real address */
> +	sglist_to_phy_addr(list);
> +
> +	/* First entry address */
> +	addr = __pa(list);
> +
> +	/* Fetch data */
> +	rc = OPAL_BUSY_EVENT;
> +	while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
> +		rc = opal_dump_read(dump->id, addr);
> +		if (rc == OPAL_BUSY_EVENT) {
> +			opal_poll_events(NULL);
> +			msleep(20);
> +		}
> +	}
> +
> +	if (rc != OPAL_SUCCESS && rc != OPAL_PARTIAL)
> +		pr_warn("%s: Extract dump failed for ID 0x%x\n",
> +			__func__, dump->id);
> +
> +	/* Free SG list */
> +	free_dump_sg_list(list);
> +
> +out:
> +	return rc;
> +}
> +
> +static ssize_t dump_attr_read(struct file *filep, struct kobject *kobj,
> +			      struct bin_attribute *bin_attr,
> +			      char *buffer, loff_t pos, size_t count)
> +{
> +	ssize_t rc;
> +
> +	struct dump_obj *dump = to_dump_obj(kobj);
> +
> +	if (!dump->buffer) {
> +		rc = dump_read_data(dump);
> +
> +		if (rc != OPAL_SUCCESS && rc != OPAL_PARTIAL) {
> +			vfree(dump->buffer);
> +			dump->buffer = NULL;
> +
> +			return -EIO;
> +		}
> +		if (rc == OPAL_PARTIAL) {
> +			/* On a partial read, we just return EIO
> +			 * and rely on userspace to ask us to try
> +			 * again.
> +			 */
> +			pr_info("%s: Platform dump partially read.ID = 0x%x\n",

s/read.ID/read. ID/

Rest looks good.

-Vasant

> +				__func__, dump->id);
> +			return -EIO;
> +		}
> +	}
> +
> +	memcpy(buffer, dump->buffer + pos, count);
> +
> +	/* You may think we could free the dump buffer now and retrieve
> +	 * it again later if needed, but due to current firmware limitation,
> +	 * that's not the case. So, once read into userspace once,
> +	 * we keep the dump around until it's acknowledged by userspace.
> +	 */
> +
> +	return count;
> +}
> +
> +static struct dump_obj *create_dump_obj(uint32_t id, size_t size,
> +					uint32_t type)
> +{
> +	struct dump_obj *dump;
> +	int rc;
> +
> +	dump = kzalloc(sizeof(*dump), GFP_KERNEL);
> +	if (!dump)
> +		return NULL;
> +
> +	dump->kobj.kset = dump_kset;
> +
> +	kobject_init(&dump->kobj, &dump_ktype);
> +
> +	sysfs_bin_attr_init(&dump->dump_attr);
> +
> +	dump->dump_attr.attr.name = "dump";
> +	dump->dump_attr.attr.mode = 0400;
> +	dump->dump_attr.size = size;
> +	dump->dump_attr.read = dump_attr_read;
> +
> +	dump->id = id;
> +	dump->size = size;
> +	dump->type = type;
> +
> +	rc = kobject_add(&dump->kobj, NULL, "0x%x-0x%x", type, id);
> +	if (rc) {
> +		kobject_put(&dump->kobj);
> +		return NULL;
> +	}
> +
> +	rc = sysfs_create_bin_file(&dump->kobj, &dump->dump_attr);
> +	if (rc) {
> +		kobject_put(&dump->kobj);
> +		return NULL;
> +	}
> +
> +	pr_info("%s: New platform dump. ID = 0x%x Size %u\n",
> +		__func__, dump->id, dump->size);
> +
> +	kobject_uevent(&dump->kobj, KOBJ_ADD);
> +
> +	return dump;
> +}
> +
> +static int process_dump(void)
> +{
> +	int rc;
> +	uint32_t dump_id, dump_size, dump_type;
> +	struct dump_obj *dump;
> +	char name[22];
> +
> +	rc = dump_read_info(&dump_id, &dump_size, &dump_type);
> +	if (rc != OPAL_SUCCESS)
> +		return rc;
> +
> +	sprintf(name, "0x%x-0x%x", dump_type, dump_id);
> +
> +	/* we may get notified twice, let's handle
> +	 * that gracefully and not create two conflicting
> +	 * entries.
> +	 */
> +	if (kset_find_obj(dump_kset, name))
> +		return 0;
> +
> +	dump = create_dump_obj(dump_id, dump_size, dump_type);
> +	if (!dump)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static void dump_work_fn(struct work_struct *work)
> +{
> +	process_dump();
> +}
> +
> +static DECLARE_WORK(dump_work, dump_work_fn);
> +
> +static void schedule_process_dump(void)
> +{
> +	schedule_work(&dump_work);
> +}
> +
> +/*
> + * New dump available notification
> + *
> + * Once we get notification, we add sysfs entries for it.
> + * We only fetch the dump on demand, and create sysfs asynchronously.
> + */
> +static int dump_event(struct notifier_block *nb,
> +		      unsigned long events, void *change)
> +{
> +	if (events & OPAL_EVENT_DUMP_AVAIL)
> +		schedule_process_dump();
> +
> +	return 0;
> +}
> +
> +static struct notifier_block dump_nb = {
> +	.notifier_call  = dump_event,
> +	.next           = NULL,
> +	.priority       = 0
> +};
> +
> +void __init opal_platform_dump_init(void)
> +{
> +	int rc;
> +
> +	dump_kset = kset_create_and_add("dump", NULL, opal_kobj);
> +	if (!dump_kset) {
> +		pr_warn("%s: Failed to create dump kset\n", __func__);
> +		return;
> +	}
> +
> +	rc = sysfs_create_group(&dump_kset->kobj, &initiate_attr_group);
> +	if (rc) {
> +		pr_warn("%s: Failed to create initiate dump attr group\n",
> +			__func__);
> +		kobject_put(&dump_kset->kobj);
> +		return;
> +	}
> +
> +	rc = opal_notifier_register(&dump_nb);
> +	if (rc) {
> +		pr_warn("%s: Can't register OPAL event notifier (%d)\n",
> +			__func__, rc);
> +		return;
> +	}
> +
> +	opal_dump_resend_notification();
> +}
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index 3e8829c..3e02783 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -126,6 +126,12 @@ OPAL_CALL(opal_return_cpu,			OPAL_RETURN_CPU);
>   OPAL_CALL(opal_validate_flash,			OPAL_FLASH_VALIDATE);
>   OPAL_CALL(opal_manage_flash,			OPAL_FLASH_MANAGE);
>   OPAL_CALL(opal_update_flash,			OPAL_FLASH_UPDATE);
> +OPAL_CALL(opal_dump_init,			OPAL_DUMP_INIT);
> +OPAL_CALL(opal_dump_info,			OPAL_DUMP_INFO);
> +OPAL_CALL(opal_dump_info2,			OPAL_DUMP_INFO2);
> +OPAL_CALL(opal_dump_read,			OPAL_DUMP_READ);
> +OPAL_CALL(opal_dump_ack,			OPAL_DUMP_ACK);
>   OPAL_CALL(opal_get_msg,				OPAL_GET_MSG);
>   OPAL_CALL(opal_check_completion,		OPAL_CHECK_ASYNC_COMPLETION);
> +OPAL_CALL(opal_dump_resend_notification,	OPAL_DUMP_RESEND);
>   OPAL_CALL(opal_sync_host_reboot,		OPAL_SYNC_HOST_REBOOT);
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 65499ad..262cd1a 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -474,6 +474,8 @@ static int __init opal_init(void)
>   	if (rc == 0) {
>   		/* Setup code update interface */
>   		opal_flash_init();
> +		/* Setup platform dump extract interface */
> +		opal_platform_dump_init();
>   	}
>
>   	return 0;
>

^ permalink raw reply

* Re: [PATCH] powerpc/powernv: Read OPAL error log and export it through sysfs
From: Vasant Hegde @ 2014-03-04 12:31 UTC (permalink / raw)
  To: Stewart Smith, Mahesh J Salgaonkar, benh, linuxppc-dev
In-Reply-To: <1393549112-6101-1-git-send-email-stewart@linux.vnet.ibm.com>

On 02/28/2014 06:28 AM, Stewart Smith wrote:
> Based on a patch by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> This patch adds support to read error logs from OPAL and export
> them to userspace through a sysfs interface.
>
> We export each log entry as a directory in /sys/firmware/opal/elog/
>
> Currently, OPAL will buffer up to 128 error log records, we don't
> need to have any knowledge of this limit on the Linux side as that
> is actually largely transparent to us.
>
> Each error log entry has the following files: id, type, acknowledge, raw.
> Currently we just export the raw binary error log in the 'raw' attribute.
> In a future patch, we may parse more of the error log to make it a bit
> easier for userspace (e.g. to be able to display a brief summary in
> petitboot without having to have a full parser).
>
> If we have >128 logs from OPAL, we'll only be notified of 128 until
> userspace starts acknowledging them. This limitation may be lifted in
> the future and with this patch, that should "just work" from the linux side.
>
> A userspace daemon should:
> - wait for error log entries using normal mechanisms (we announce creation)
> - read error log entry
> - save error log entry safely to disk
> - acknowledge the error log entry
> - rinse, repeat.
>
> On the Linux side, we read the error log when we're notified of it. This
> possibly isn't ideal as it would be better to only read them on-demand.
> However, this doesn't really work with current OPAL interface, so we
> read the error log immediately when notified at the moment.
>
> I've tested this pretty extensively and am rather confident that the
> linux side of things works rather well. There is currently an issue with
> the service processor side of things for >128 error logs though.
>
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
> ---
>   Documentation/ABI/stable/sysfs-firmware-opal-elog |   60 ++++
>   arch/powerpc/include/asm/opal.h                   |   13 +
>   arch/powerpc/platforms/powernv/Makefile           |    2 +-
>   arch/powerpc/platforms/powernv/opal-elog.c        |  312 +++++++++++++++++++++
>   arch/powerpc/platforms/powernv/opal-wrappers.S    |    5 +
>   arch/powerpc/platforms/powernv/opal.c             |    2 +
>   6 files changed, 393 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/ABI/stable/sysfs-firmware-opal-elog
>   create mode 100644 arch/powerpc/platforms/powernv/opal-elog.c
>
> diff --git a/Documentation/ABI/stable/sysfs-firmware-opal-elog b/Documentation/ABI/stable/sysfs-firmware-opal-elog
> new file mode 100644
> index 0000000..e1f3058
> --- /dev/null
> +++ b/Documentation/ABI/stable/sysfs-firmware-opal-elog
> @@ -0,0 +1,60 @@
> +What:		/sys/firmware/opal/elog
> +Date:		Feb 2014
> +Contact:	Stewart Smith <stewart@linux.vnet.ibm.com>
> +Description:
> +		This directory exposes error log entries retrieved
> +		through the OPAL firmware interface.
> +
> +		Each error log is identified by a unique ID and will
> +		exist until explicitly acknowledged to firmware.
> +
> +		Each log entry has a directory in /sys/firmware/opal/elog.
> +
> +		Log entries may be purged by the service processor
> +		before retrieved by firmware or retrieved/acknowledged by
> +		Linux if there is no room for more log entries.
> +
> +		In the event that Linux has retrieved the log entries
> +		but not explicitly acknowledged them to firmware and
> +		the service processor needs more room for log entries,
> +		the only remaining copy of a log message may be in
> +		Linux.
> +
> +		Typically, a user space daemon will monitor for new
> +		entries, read them out and acknowledge them.
> +
> +		The service processor may be able to store more log
> +		entries than firmware can, so after you acknowledge
> +		an event from Linux you may instantly get another one
> +		from the queue that was generated some time in the past.
> +
> +		The raw log format is a binary format. We currently
> +		do not parse this at all in kernel, leaving it up to
> +		user space to solve the problem. In future, we may
> +		do more parsing in kernel and add more files to make
> +		it easier for simple user space processes to extract
> +		more information.
> +
> +		For each log entry (directory), there are the following
> +		files:
> +
> +		id:		An ASCII representation of the ID of the
> +				error log, in hex - e.g. "0x01".
> +
> +		type:		An ASCII representation of the type id and
> +				description of the type of error log.
> +				Currently just "0x00 PEL" - platform error log.
> +				In the future there may be additional types.
> +
> +		raw:		A read-only binary file that can be read
> +				to get the raw log entry. These are
> +				<16kb, often just hundreds of bytes and
> +				"average" 2kb.
> +
> +		acknowledge:	Writing 'ack' to this file will acknowledge
> +				the error log to firmware (and in turn
> +				the service processor, if applicable).
> +				Shortly after acknowledging it, the log
> +				entry will be removed from sysfs.
> +				Reading this file will list the supported
> +				operations (currently just acknowledge).
> \ No newline at end of file
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 40157e2..b404545 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -151,6 +151,11 @@ extern int opal_enter_rtas(struct rtas_args *args,
>   #define OPAL_LPC_READ				67
>   #define OPAL_LPC_WRITE				68
>   #define OPAL_RETURN_CPU				69
> +#define OPAL_ELOG_READ				71
> +#define OPAL_ELOG_WRITE				72
> +#define OPAL_ELOG_ACK				73
> +#define OPAL_ELOG_RESEND			74
> +#define OPAL_ELOG_SIZE				75
>   #define OPAL_FLASH_VALIDATE			76
>   #define OPAL_FLASH_MANAGE			77
>   #define OPAL_FLASH_UPDATE			78
> @@ -823,6 +828,13 @@ int64_t opal_lpc_write(uint32_t chip_id, enum OpalLPCAddressType addr_type,
>   		       uint32_t addr, uint32_t data, uint32_t sz);
>   int64_t opal_lpc_read(uint32_t chip_id, enum OpalLPCAddressType addr_type,
>   		      uint32_t addr, __be32 *data, uint32_t sz);
> +
> +int64_t opal_read_elog(uint64_t buffer, size_t size, uint64_t log_id);
> +int64_t opal_get_elog_size(uint64_t *log_id, size_t *size, uint64_t *elog_type);
> +int64_t opal_write_elog(uint64_t buffer, uint64_t size, uint64_t offset);
> +int64_t opal_send_ack_elog(uint64_t log_id);

Stewart,

Why are you creating a 64-bit log ID when the actual ID is 32-bit?

Rest looks good.

-Vasant


> +void opal_resend_pending_logs(void);
> +
>   int64_t opal_validate_flash(uint64_t buffer, uint32_t *size, uint32_t *result);
>   int64_t opal_manage_flash(uint8_t op);
>   int64_t opal_update_flash(uint64_t blk_list);
> @@ -861,6 +873,7 @@ extern void opal_get_rtc_time(struct rtc_time *tm);
>   extern unsigned long opal_get_boot_time(void);
>   extern void opal_nvram_init(void);
>   extern void opal_flash_init(void);
> +extern int opal_elog_init(void);
>
>   extern int opal_machine_check(struct pt_regs *regs);
>
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index 8d767fd..189fd45 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -1,6 +1,6 @@
>   obj-y			+= setup.o opal-takeover.o opal-wrappers.o opal.o
>   obj-y			+= opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
> -obj-y			+= rng.o
> +obj-y			+= rng.o opal-elog.o
>
>   obj-$(CONFIG_SMP)	+= smp.o
>   obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
> diff --git a/arch/powerpc/platforms/powernv/opal-elog.c b/arch/powerpc/platforms/powernv/opal-elog.c
> new file mode 100644
> index 0000000..61e2ef3
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/opal-elog.c
> @@ -0,0 +1,312 @@
> +/*
> + * Error log support on PowerNV.
> + *
> + * Copyright 2013,2014 IBM Corp.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/of.h>
> +#include <linux/slab.h>
> +#include <linux/sysfs.h>
> +#include <linux/fs.h>
> +#include <linux/vmalloc.h>
> +#include <linux/fcntl.h>
> +#include <asm/uaccess.h>
> +#include <asm/opal.h>
> +
> +struct elog_obj {
> +	struct kobject kobj;
> +	struct bin_attribute raw_attr;
> +	uint64_t id;
> +	uint64_t type;
> +	size_t size;
> +	char *buffer;
> +};
> +#define to_elog_obj(x) container_of(x, struct elog_obj, kobj)
> +
> +struct elog_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct elog_obj *elog, struct elog_attribute *attr,
> +			char *buf);
> +	ssize_t (*store)(struct elog_obj *elog, struct elog_attribute *attr,
> +			 const char *buf, size_t count);
> +};
> +#define to_elog_attr(x) container_of(x, struct elog_attribute, attr)
> +
> +static ssize_t elog_id_show(struct elog_obj *elog_obj,
> +			    struct elog_attribute *attr,
> +			    char *buf)
> +{
> +	return sprintf(buf, "0x%llx\n", elog_obj->id);
> +}
> +
> +static const char *elog_type_to_string(uint64_t type)
> +{
> +	switch (type) {
> +	case 0: return "PEL";
> +	default: return "unknown";
> +	}
> +}
> +
> +static ssize_t elog_type_show(struct elog_obj *elog_obj,
> +			      struct elog_attribute *attr,
> +			      char *buf)
> +{
> +	return sprintf(buf, "0x%llx %s\n",
> +		       elog_obj->type,
> +		       elog_type_to_string(elog_obj->type));
> +}
> +
> +static ssize_t elog_ack_show(struct elog_obj *elog_obj,
> +			     struct elog_attribute *attr,
> +			     char *buf)
> +{
> +	return sprintf(buf, "ack - acknowledge log message\n");
> +}
> +
> +static void delay_release_kobj(void *kobj)
> +{
> +	kobject_put((struct kobject *)kobj);
> +}
> +
> +static ssize_t elog_ack_store(struct elog_obj *elog_obj,
> +			      struct elog_attribute *attr,
> +			      const char *buf,
> +			      size_t count)
> +{
> +	opal_send_ack_elog(elog_obj->id);
> +	sysfs_schedule_callback(&elog_obj->kobj, delay_release_kobj,
> +				&elog_obj->kobj, THIS_MODULE);
> +	return count;
> +}
> +
> +static struct elog_attribute id_attribute =
> +	__ATTR(id, 0666, elog_id_show, NULL);
> +static struct elog_attribute type_attribute =
> +	__ATTR(type, 0666, elog_type_show, NULL);
> +static struct elog_attribute ack_attribute =
> +	__ATTR(acknowledge, 0660, elog_ack_show, elog_ack_store);
> +
> +static struct kset *elog_kset;
> +
> +static ssize_t elog_attr_show(struct kobject *kobj,
> +			      struct attribute *attr,
> +			      char *buf)
> +{
> +	struct elog_attribute *attribute;
> +	struct elog_obj *elog;
> +
> +	attribute = to_elog_attr(attr);
> +	elog = to_elog_obj(kobj);
> +
> +	if (!attribute->show)
> +		return -EIO;
> +
> +	return attribute->show(elog, attribute, buf);
> +}
> +
> +static ssize_t elog_attr_store(struct kobject *kobj,
> +			       struct attribute *attr,
> +			       const char *buf, size_t len)
> +{
> +	struct elog_attribute *attribute;
> +	struct elog_obj *elog;
> +
> +	attribute = to_elog_attr(attr);
> +	elog = to_elog_obj(kobj);
> +
> +	if (!attribute->store)
> +		return -EIO;
> +
> +	return attribute->store(elog, attribute, buf, len);
> +}
> +
> +static const struct sysfs_ops elog_sysfs_ops = {
> +	.show = elog_attr_show,
> +	.store = elog_attr_store,
> +};
> +
> +static void elog_release(struct kobject *kobj)
> +{
> +	struct elog_obj *elog;
> +
> +	elog = to_elog_obj(kobj);
> +	kfree(elog->buffer);
> +	kfree(elog);
> +}
> +
> +static struct attribute *elog_default_attrs[] = {
> +	&id_attribute.attr,
> +	&type_attribute.attr,
> +	&ack_attribute.attr,
> +	NULL,
> +};
> +
> +static struct kobj_type elog_ktype = {
> +	.sysfs_ops = &elog_sysfs_ops,
> +	.release = &elog_release,
> +	.default_attrs = elog_default_attrs,
> +};
> +
> +/* Maximum size of a single log on FSP is 16KB */
> +#define OPAL_MAX_ERRLOG_SIZE	16384
> +
> +static ssize_t raw_attr_read(struct file *filep, struct kobject *kobj,
> +			     struct bin_attribute *bin_attr,
> +			     char *buffer, loff_t pos, size_t count)
> +{
> +	int opal_rc;
> +
> +	struct elog_obj *elog = to_elog_obj(kobj);
> +
> +	/* We may have had an error reading before, so let's retry */
> +	if (!elog->buffer) {
> +		elog->buffer = kzalloc(elog->size, GFP_KERNEL);
> +		if (!elog->buffer)
> +			return -EIO;
> +
> +		opal_rc = opal_read_elog(__pa(elog->buffer),
> +					 elog->size, elog->id);
> +		if (opal_rc != OPAL_SUCCESS) {
> +			pr_err("ELOG: log read failed for log-id=%llx\n",
> +			       elog->id);
> +			kfree(elog->buffer);
> +			elog->buffer = NULL;
> +			return -EIO;
> +		}
> +	}
> +
> +	memcpy(buffer, elog->buffer + pos, count);
> +
> +	return count;
> +}
> +
> +static struct elog_obj *create_elog_obj(uint64_t id, size_t size, uint64_t type)
> +{
> +	struct elog_obj *elog;
> +	int rc;
> +
> +	elog = kzalloc(sizeof(*elog), GFP_KERNEL);
> +	if (!elog)
> +		return NULL;
> +
> +	elog->kobj.kset = elog_kset;
> +
> +	kobject_init(&elog->kobj, &elog_ktype);
> +
> +	sysfs_bin_attr_init(&elog->raw_attr);
> +
> +	elog->raw_attr.attr.name = "raw";
> +	elog->raw_attr.attr.mode = 0400;
> +	elog->raw_attr.size = size;
> +	elog->raw_attr.read = raw_attr_read;
> +
> +	elog->id = id;
> +	elog->size = size;
> +	elog->type = type;
> +
> +	elog->buffer = kzalloc(elog->size, GFP_KERNEL);
> +
> +	if (elog->buffer) {
> +		rc = opal_read_elog(__pa(elog->buffer),
> +					 elog->size, elog->id);
> +		if (rc != OPAL_SUCCESS) {
> +			pr_err("ELOG: log read failed for log-id=%llx\n",
> +			       elog->id);
> +			kfree(elog->buffer);
> +			elog->buffer = NULL;
> +		}
> +	}
> +
> +	rc = kobject_add(&elog->kobj, NULL, "0x%llx", id);
> +	if (rc) {
> +		kobject_put(&elog->kobj);
> +		return NULL;
> +	}
> +
> +	rc = sysfs_create_bin_file(&elog->kobj, &elog->raw_attr);
> +	if (rc) {
> +		kobject_put(&elog->kobj);
> +		return NULL;
> +	}
> +
> +	kobject_uevent(&elog->kobj, KOBJ_ADD);
> +
> +	return elog;
> +}
> +
> +static void elog_work_fn(struct work_struct *work)
> +{
> +	size_t elog_size;
> +	uint64_t log_id;
> +	uint64_t elog_type;
> +	int rc;
> +	char name[2+16+1];
> +
> +	rc = opal_get_elog_size(&log_id, &elog_size, &elog_type);
> +	if (rc != OPAL_SUCCESS) {
> +		pr_err("ELOG: Opal log read failed\n");
> +		return;
> +	}
> +
> +	BUG_ON(elog_size > OPAL_MAX_ERRLOG_SIZE);
> +
> +	if (elog_size >= OPAL_MAX_ERRLOG_SIZE)
> +		elog_size  =  OPAL_MAX_ERRLOG_SIZE;
> +
> +	sprintf(name, "0x%llx", log_id);
> +
> +	/* we may get notified twice, let's handle
> +	 * that gracefully and not create two conflicting
> +	 * entries.
> +	 */
> +	if (kset_find_obj(elog_kset, name))
> +		return;
> +
> +	create_elog_obj(log_id, elog_size, elog_type);
> +}
> +
> +static DECLARE_WORK(elog_work, elog_work_fn);
> +
> +static int elog_event(struct notifier_block *nb,
> +				unsigned long events, void *change)
> +{
> +	/* check for error log event */
> +	if (events & OPAL_EVENT_ERROR_LOG_AVAIL)
> +		schedule_work(&elog_work);
> +	return 0;
> +}
> +
> +static struct notifier_block elog_nb = {
> +	.notifier_call  = elog_event,
> +	.next           = NULL,
> +	.priority       = 0
> +};
> +
> +int __init opal_elog_init(void)
> +{
> +	int rc = 0;
> +
> +	elog_kset = kset_create_and_add("elog", NULL, opal_kobj);
> +	if (!elog_kset) {
> +		pr_warn("%s: failed to create elog kset\n", __func__);
> +		return -1;
> +	}
> +
> +	rc = opal_notifier_register(&elog_nb);
> +	if (rc) {
> +		pr_err("%s: Can't register OPAL event notifier (%d)\n",
> +		__func__, rc);
> +		return rc;
> +	}
> +
> +	/* We are now ready to pull error logs from opal. */
> +	opal_resend_pending_logs();
> +
> +	return 0;
> +}
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index 3e8829c..5fcbf25 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -123,6 +123,11 @@ OPAL_CALL(opal_xscom_write,			OPAL_XSCOM_WRITE);
>   OPAL_CALL(opal_lpc_read,			OPAL_LPC_READ);
>   OPAL_CALL(opal_lpc_write,			OPAL_LPC_WRITE);
>   OPAL_CALL(opal_return_cpu,			OPAL_RETURN_CPU);
> +OPAL_CALL(opal_read_elog,			OPAL_ELOG_READ);
> +OPAL_CALL(opal_send_ack_elog,			OPAL_ELOG_ACK);
> +OPAL_CALL(opal_get_elog_size,			OPAL_ELOG_SIZE);
> +OPAL_CALL(opal_resend_pending_logs,		OPAL_ELOG_RESEND);
> +OPAL_CALL(opal_write_elog,			OPAL_ELOG_WRITE);
>   OPAL_CALL(opal_validate_flash,			OPAL_FLASH_VALIDATE);
>   OPAL_CALL(opal_manage_flash,			OPAL_FLASH_MANAGE);
>   OPAL_CALL(opal_update_flash,			OPAL_FLASH_UPDATE);
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 65499ad..fb77302 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -472,6 +472,8 @@ static int __init opal_init(void)
>   	/* Create "opal" kobject under /sys/firmware */
>   	rc = opal_sysfs_init();
>   	if (rc == 0) {
> +		/* Setup error log interface */
> +		rc = opal_elog_init();
>   		/* Setup code update interface */
>   		opal_flash_init();
>   	}
>


* Re: [PATCH] powerpc: Align p_dyn, p_rela and p_st symbols
From: Laurent Dufour @ 2014-03-04 15:45 UTC (permalink / raw)
  To: Anton Blanchard, benh, paulus; +Cc: linuxppc-dev
In-Reply-To: <20140304083124.0c7c29a2@kryten>

On 03/03/2014 22:31, Anton Blanchard wrote:
> 
> The 64bit relocation code places a few symbols in the text segment.
> These symbols are only 4 byte aligned where they need to be 8 byte
> aligned. Add an explicit alignment.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> Cc: stable@vger.kernel.org

This fixes the issue I was seeing when booting an LE kernel in a KVM
guest on my P7 box.

Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

> ---
> 
> diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S
> index 1482327..d88736f 100644
> --- a/arch/powerpc/kernel/reloc_64.S
> +++ b/arch/powerpc/kernel/reloc_64.S
> @@ -81,6 +81,7 @@ _GLOBAL(relocate)
>  
>  6:	blr
>  
> +.balign 8
>  p_dyn:	.llong	__dynamic_start - 0b
>  p_rela:	.llong	__rela_dyn_start - 0b
>  p_st:	.llong	_stext - 0b


* [tip:irq/core] powerpc: Irq: Use generic_handle_irq
From: tip-bot for Thomas Gleixner @ 2014-03-04 16:39 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, linux-kernel, hpa, tglx, linuxppc-dev, mingo
In-Reply-To: <20140223212736.333718121@linutronix.de>

Commit-ID:  a4e04c9f219d2c00764ffa7ba45500411815879d
Gitweb:     http://git.kernel.org/tip/a4e04c9f219d2c00764ffa7ba45500411815879d
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Sun, 23 Feb 2014 21:40:08 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Mar 2014 17:37:52 +0100

powerpc: Irq: Use generic_handle_irq

No functional change

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: ppc <linuxppc-dev@lists.ozlabs.org>
Link: http://lkml.kernel.org/r/20140223212736.333718121@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/kernel/irq.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)
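
The "no functional change" claim rests on generic_handle_irq() doing the
same lookup-and-dispatch that the removed block open-coded. A minimal
user-space sketch of that equivalence follows. It assumes a simplified
descriptor table; every mock_* name is hypothetical and stands in for the
kernel's irq_desc machinery rather than reproducing it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for irq_desc and irq_to_desc(). */
#define NR_MOCK_IRQS 4

struct mock_irq_desc {
	void (*handle_irq)(unsigned int irq, struct mock_irq_desc *desc);
	unsigned int count;	/* number of times the flow handler ran */
};

static struct mock_irq_desc *mock_descs[NR_MOCK_IRQS];

static struct mock_irq_desc *mock_irq_to_desc(unsigned int irq)
{
	return irq < NR_MOCK_IRQS ? mock_descs[irq] : NULL;
}

/*
 * Same shape as the block removed from __do_irq(): look up the
 * descriptor and call its flow handler only if one exists.
 */
static int mock_generic_handle_irq(unsigned int irq)
{
	struct mock_irq_desc *desc = mock_irq_to_desc(irq);

	if (!desc)
		return -EINVAL;
	desc->handle_irq(irq, desc);
	return 0;
}

/* Trivial flow handler used to observe that dispatch happened. */
static void mock_count_handler(unsigned int irq, struct mock_irq_desc *desc)
{
	(void)irq;
	desc->count++;
}
```

The only visible difference in this sketch is the error return for a
missing descriptor, which the powerpc caller ignores.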

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 1d0848b..ca1cd74 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -465,7 +465,6 @@ static inline void check_stack_overflow(void)
 
 void __do_irq(struct pt_regs *regs)
 {
-	struct irq_desc *desc;
 	unsigned int irq;
 
 	irq_enter();
@@ -487,11 +486,8 @@ void __do_irq(struct pt_regs *regs)
 	/* And finally process it */
 	if (unlikely(irq == NO_IRQ))
 		__get_cpu_var(irq_stat).spurious_irqs++;
-	else {
-		desc = irq_to_desc(irq);
-		if (likely(desc))
-			desc->handle_irq(irq, desc);
-	}
+	else
+		generic_handle_irq(irq);
 
 	trace_irq_exit(regs);
 


* [tip:irq/core] powerpc:eVh_pic: Kill irq_desc abuse
From: tip-bot for Thomas Gleixner @ 2014-03-04 16:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: timur, peterz, ashish.kalra, linux-kernel, hpa, tglx,
	linuxppc-dev, mingo
In-Reply-To: <20140223212736.451970660@linutronix.de>

Commit-ID:  c866cda47f2c6c8abb929933b7794e9a92d7c924
Gitweb:     http://git.kernel.org/tip/c866cda47f2c6c8abb929933b7794e9a92d7c924
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Sun, 23 Feb 2014 21:40:08 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Mar 2014 17:37:51 +0100

powerpc:eVh_pic: Kill irq_desc abuse

I'm really grumpy about this one. The line:

#include "../../../kernel/irq/settings.h"

should have been an alarm sign for all people who added their SOB to
this trainwreck.

When I cleaned up the mess people made with interrupt descriptors a
few years ago, I warned that I'm going to hunt down new offenders and
treat them with stinking trouts. In this case I'll use frozen shark
for a better educational value.

The whole idiocy which was done there could have been avoided with two
lines of perfectly fine code. And do not complain about the lack of
correct examples in tree.

The solution is simple:

  Remove the brainfart and use the proper functions, which should
  have been used in the first place

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ashish Kalra <ashish.kalra@freescale.com>
Cc: Timur Tabi <timur@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: ppc <linuxppc-dev@lists.ozlabs.org>
Link: http://lkml.kernel.org/r/20140223212736.451970660@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/sysdev/ehv_pic.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/sysdev/ehv_pic.c b/arch/powerpc/sysdev/ehv_pic.c
index b74085c..2d20f10 100644
--- a/arch/powerpc/sysdev/ehv_pic.c
+++ b/arch/powerpc/sysdev/ehv_pic.c
@@ -28,8 +28,6 @@
 #include <asm/ehv_pic.h>
 #include <asm/fsl_hcalls.h>
 
-#include "../../../kernel/irq/settings.h"
-
 static struct ehv_pic *global_ehv_pic;
 static DEFINE_SPINLOCK(ehv_pic_lock);
 
@@ -113,17 +111,13 @@ static unsigned int ehv_pic_type_to_vecpri(unsigned int type)
 int ehv_pic_set_irq_type(struct irq_data *d, unsigned int flow_type)
 {
 	unsigned int src = virq_to_hw(d->irq);
-	struct irq_desc *desc = irq_to_desc(d->irq);
 	unsigned int vecpri, vold, vnew, prio, cpu_dest;
 	unsigned long flags;
 
 	if (flow_type == IRQ_TYPE_NONE)
 		flow_type = IRQ_TYPE_LEVEL_LOW;
 
-	irq_settings_clr_level(desc);
-	irq_settings_set_trigger_mask(desc, flow_type);
-	if (flow_type & (IRQ_TYPE_LEVEL_HIGH | IRQ_TYPE_LEVEL_LOW))
-		irq_settings_set_level(desc);
+	irqd_set_trigger_type(d, flow_type);
 
 	vecpri = ehv_pic_type_to_vecpri(flow_type);
 
@@ -144,7 +138,7 @@ int ehv_pic_set_irq_type(struct irq_data *d, unsigned int flow_type)
 	ev_int_set_config(src, vecpri, prio, cpu_dest);
 
 	spin_unlock_irqrestore(&ehv_pic_lock, flags);
-	return 0;
+	return IRQ_SET_MASK_OK_NOCOPY;
 }
 
 static struct irq_chip ehv_pic_irq_chip = {
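
The switch from "return 0" to IRQ_SET_MASK_OK_NOCOPY matters because the
callback may substitute IRQ_TYPE_LEVEL_LOW for IRQ_TYPE_NONE. As far as I
can tell, the core copies the requested flags into irq_data on a plain OK
return and skips that copy on NOCOPY, then derives the level setting
from irq_data either way. A small user-space sketch of that hand-off,
with hypothetical mock_* names and simplified structures (this is not the
kernel's code):

```c
/* Trigger-type bits; the values are meant to mirror include/linux/irq.h. */
enum {
	TYPE_NONE	  = 0x0,
	TYPE_EDGE_RISING  = 0x1,
	TYPE_EDGE_FALLING = 0x2,
	TYPE_LEVEL_HIGH	  = 0x4,
	TYPE_LEVEL_LOW	  = 0x8,
	TYPE_SENSE_MASK	  = 0xf,
	TYPE_LEVEL_MASK	  = TYPE_LEVEL_HIGH | TYPE_LEVEL_LOW,
};

enum { SET_MASK_OK = 0, SET_MASK_OK_NOCOPY = 1 };

struct mock_irq_data {
	unsigned int trigger;		/* what irqd_set_trigger_type() stores */
};

struct mock_desc {
	struct mock_irq_data data;
	unsigned int settings_trigger;	/* core-private copy */
	int level;			/* core-private level flag */
};

/* Chip callback in the style of the patched ehv_pic_set_irq_type(). */
static int mock_chip_set_type(struct mock_irq_data *d, unsigned int flow_type)
{
	if (flow_type == TYPE_NONE)
		flow_type = TYPE_LEVEL_LOW;
	d->trigger = flow_type;		/* stands in for irqd_set_trigger_type() */
	return SET_MASK_OK_NOCOPY;	/* tell the core not to overwrite it */
}

/* Core side: only the core touches the descriptor-private fields. */
static int mock_core_set_trigger(struct mock_desc *desc, unsigned int flags)
{
	int ret = mock_chip_set_type(&desc->data, flags & TYPE_SENSE_MASK);

	switch (ret) {
	case SET_MASK_OK:
		desc->data.trigger = flags & TYPE_SENSE_MASK;
		/* fall through */
	case SET_MASK_OK_NOCOPY:
		desc->settings_trigger = desc->data.trigger;
		desc->level = !!(desc->data.trigger & TYPE_LEVEL_MASK);
		ret = 0;
		break;
	}
	return ret;
}
```

With a plain OK return, a request for TYPE_NONE would overwrite the
LEVEL_LOW default chosen by the chip, which is presumably why the patch
changes the return value along with the helper.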


* [tip:irq/core] powerpc: Eeh: Kill another abuse of irq_desc
From: tip-bot for Thomas Gleixner @ 2014-03-04 16:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: shangw, peterz, linux-kernel, hpa, tglx, linuxppc-dev, mingo
In-Reply-To: <20140223212736.562906212@linutronix.de>

Commit-ID:  b8a9a11b976810ba12a43c4fe699a14892c97e52
Gitweb:     http://git.kernel.org/tip/b8a9a11b976810ba12a43c4fe699a14892c97e52
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Sun, 23 Feb 2014 21:40:09 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Mar 2014 17:37:52 +0100

powerpc: Eeh: Kill another abuse of irq_desc

commit 91150af3a (powerpc/eeh: Fix unbalanced enable for IRQ) is
another brilliant example of trainwreck engineering.

The patch "fixes" the issue of an unbalanced call to irq_enable()
which causes a prominent warning by checking the disabled state of the
interrupt line and calling conditionally into the core code.

This is wrong in two aspects:

1) The warning is there to tell users that they need to fix their
   asymmetric enable/disable patterns by finding the root cause and
   solving it there.

   It's definitely not meant to be worked around by conditionally
   calling into the core code depending on the random state of the irq
   line.

   Asymmetric irq_disable/enable calls are a clear sign of wrong usage
   of the interfaces which have to be cured at the root and not by
   somehow hacking around it.

2) The abuse of a core internal data structure instead of using the
   proper interfaces for retrieving the information for the 'hack
   around'.

   irq_desc is core internal and that is stated clearly enough.

Replace at least the irq_desc abuse with the proper functions and add
a big fat comment why this is absurd and completely wrong.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: ppc <linuxppc-dev@lists.ozlabs.org>
Link: http://lkml.kernel.org/r/20140223212736.562906212@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/kernel/eeh_driver.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index fdc679d..3e1d7de 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -143,15 +143,32 @@ static void eeh_disable_irq(struct pci_dev *dev)
 static void eeh_enable_irq(struct pci_dev *dev)
 {
 	struct eeh_dev *edev = pci_dev_to_eeh_dev(dev);
-	struct irq_desc *desc;
 
 	if ((edev->mode) & EEH_DEV_IRQ_DISABLED) {
 		edev->mode &= ~EEH_DEV_IRQ_DISABLED;
-
-		desc = irq_to_desc(dev->irq);
-		if (desc && desc->depth > 0)
+		/*
+		 * FIXME !!!!!
+		 *
+		 * This is just ass backwards. This maze has
+		 * unbalanced irq_enable/disable calls. So instead of
+		 * finding the root cause it works around the warning
+		 * in the irq_enable code by conditionally calling
+		 * into it.
+		 *
+		 * That's just wrong. The warning in the core code is
+		 * there to tell people to fix their asymmetries in
+		 * their own code, not by abusing the core information
+		 * to avoid it.
+		 *
+		 * I so wish that the asymmetry would be the other way
+		 * round and a few more irq_disable calls render that
+		 * shit unusable forever.
+		 *
+		 *	tglx
+		 */
+		if (irqd_irq_disabled(irq_get_irq_data(dev->irq)))
 			enable_irq(dev->irq);
 	}
 }
 
 /**
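
The warning discussed in this commit comes from the nesting counter
behind disable_irq()/enable_irq(): each disable bumps a depth, each enable
drops it, and an enable at depth zero is reported as unbalanced. A
user-space sketch of that bookkeeping, and of how a guarded enable hides
the report, is below. The mock_* names and the warnings field are
hypothetical; the real core is considerably more involved.

```c
/* Hypothetical model of one interrupt line's disable nesting. */
struct mock_irq_line {
	unsigned int depth;	/* outstanding disable calls */
	unsigned int warnings;	/* "unbalanced enable" reports */
};

static void mock_disable_irq(struct mock_irq_line *line)
{
	line->depth++;
}

static void mock_enable_irq(struct mock_irq_line *line)
{
	if (line->depth == 0) {
		/* Nothing to undo: the caller's pairing is broken. */
		line->warnings++;
		return;
	}
	line->depth--;
}

/*
 * The pattern criticised above: peek at the state first so the core
 * never gets the chance to report the imbalance.
 */
static void mock_guarded_enable_irq(struct mock_irq_line *line)
{
	if (line->depth > 0)
		mock_enable_irq(line);
}
```

In this model the guarded variant never increments warnings, so a caller
with mismatched calls keeps running with its bug unreported.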


* Re: [PATCH] Corenet: Add QE platform support for Corenet
From: Scott Wood @ 2014-03-04 18:53 UTC (permalink / raw)
  To: Zhao Qiang-B45475; +Cc: linuxppc-dev@lists.ozlabs.org, Xie Xiaobo-R63061
In-Reply-To: <ef9ee88dd9e548f380abb7512dc1814c@BLUPR03MB341.namprd03.prod.outlook.com>

On Tue, 2014-03-04 at 03:09 -0600, Zhao Qiang-B45475 wrote:
> On Mar 3, 2014, at 11:51 PM, Kumar Gala [galak@kernel.crashing.org] wrote:
> 
> 
> 
> > -----Original Message-----
> > From: Kumar Gala [mailto:galak@kernel.crashing.org]
> > Sent: Monday, March 03, 2014 11:51 PM
> > To: Zhao Qiang-B45475
> > Cc: linuxppc-dev@lists.ozlabs.org; Wood Scott-B07421; Xie Xiaobo-R63061
> > Subject: Re: [PATCH] Corenet: Add QE platform support for Corenet
> > 
> > 
> > On Feb 28, 2014, at 2:48 AM, Zhao Qiang <B45475@freescale.com> wrote:
> > 
> > > There is QE on platform T104x, add support.
> > > Call funcs qe_ic_init and qe_init if CONFIG_QUICC_ENGINE is defined.
> > >
> > > Signed-off-by: Zhao Qiang <B45475@freescale.com>
> > > ---
> > > arch/powerpc/platforms/85xx/corenet_generic.c | 32
> > > +++++++++++++++++++++++++++
> > > 1 file changed, 32 insertions(+)
> > 
> > Can you use mpc85xx_qe_init() instead?
> 
> 
> mpc85xx_qe_init() is for old QE which is different from new QE.
> New QE has no par_io, and it is not correct to init 
> par_io(par_io_init() called in mpc85xx_qe_init()) for new QE. 

So split that function into mpc85xx_qe_init() and
mpc85xx_qe_par_io_init().

> > >
> > > diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c
> > > b/arch/powerpc/platforms/85xx/corenet_generic.c
> > > index fbd871e..f8c8e0c 100644
> > > --- a/arch/powerpc/platforms/85xx/corenet_generic.c
> > > +++ b/arch/powerpc/platforms/85xx/corenet_generic.c
> > >
> > > /*
> > > @@ -52,11 +68,24 @@ void __init corenet_gen_pic_init(void)  */ void
> > > __init corenet_gen_setup_arch(void) {
> > > +#ifdef CONFIG_QUICC_ENGINE
> > > +	struct device_node *np;
> > > +#endif
> > > 	mpc85xx_smp_init();
> > >
> > > 	swiotlb_detect_4g();
> > >
> > > 	pr_info("%s board from Freescale Semiconductor\n", ppc_md.name);
> > > +
> > > +#ifdef CONFIG_QUICC_ENGINE
> > > +	np = of_find_compatible_node(NULL, NULL, "fsl,qe");
> > > +	if (!np) {
> > > +		pr_err("%s: Could not find Quicc Engine node\n", __func__);
> > > +		return;
> > 
> > > This doesn't seem like a reasonable error message for a common corenet
> > > platform.  It seems reasonable to build QE support but boot on a chip w/o
> > > QE.

mpc85xx_qe_init() has a similar problem regarding the error message, but
the above is worse because it does an early return from
corenet_gen_setup_arch() rather than just from mpc85xx_qe_init() -- what
if someone added non-QE things after this point?

-Scott
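
A rough user-space sketch of the split suggested above, under the
assumption that the QE core setup and the par_io setup can be separated
cleanly. The struct and every mock_* function are hypothetical
illustrations, not the actual mpc85xx code:

```c
#include <stdbool.h>

/* Hypothetical description of what a board provides. */
struct mock_board {
	bool has_qe;		/* device tree has a QE node */
	bool has_par_io;	/* old-style QE with par_io */
	bool qe_ready;
	bool par_io_ready;
	bool later_setup_done;	/* anything added after the QE step */
};

/* QE core setup only; a missing QE node is not treated as an error. */
static void mock_qe_init(struct mock_board *b)
{
	if (!b->has_qe)
		return;
	b->qe_ready = true;
}

/* par_io setup, called only by platforms whose QE has par_io. */
static void mock_qe_par_io_init(struct mock_board *b)
{
	if (!b->has_qe || !b->has_par_io)
		return;
	b->par_io_ready = true;
}

/* Platform with an old QE: calls both halves. */
static void mock_old_qe_setup_arch(struct mock_board *b)
{
	mock_qe_init(b);
	mock_qe_par_io_init(b);
	b->later_setup_done = true;
}

/* corenet-style platform: new QE, so the par_io half is skipped. */
static void mock_corenet_setup_arch(struct mock_board *b)
{
	mock_qe_init(b);
	/* Returning from the helper, not from here, keeps this reachable. */
	b->later_setup_done = true;
}
```

Because the early return lives inside the helper, anything added to the
setup_arch function after the QE step still runs on chips without QE.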


* [RFC PATCH] Remove CONFIG_DCACHE_WORD_ACCESS
From: Joe Perches @ 2014-03-04 20:23 UTC (permalink / raw)
  To: linux-arch, Benjamin Herrenschmidt, Paul Mackerras
  Cc: Russell King, Catalin Marinas, x86, Will Deacon, linux-kernel,
	Ingo Molnar, Alexander Viro, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, linuxppc-dev, linux-arm-kernel

It seems to duplicate CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
so use that instead.

This changes the !CPU_LITTLE_ENDIAN powerpc arch to use unaligned
accesses in fs/dcache.c and fs/namei.c as
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is enabled for that arch.

Remove the now unused DCACHE_WORD_ACCESS defines & uses.

Signed-off-by: Joe Perches <joe@perches.com>
---
 arch/arm/Kconfig                      | 1 -
 arch/arm/include/asm/word-at-a-time.h | 4 ++--
 arch/arm64/Kconfig                    | 1 -
 arch/x86/Kconfig                      | 1 -
 fs/Kconfig                            | 4 ----
 fs/dcache.c                           | 2 +-
 fs/namei.c                            | 2 +-
 7 files changed, 4 insertions(+), 11 deletions(-)
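
For context, the code guarded by this option compares names one machine
word at a time instead of byte by byte, which only pays off where
unaligned loads are cheap. A portable user-space sketch of the idea
follows. It uses memcpy() for the unaligned loads and a byte loop for the
tail, whereas the kernel, as far as I know, uses load_unaligned_zeropad()
and masks the final word; wordwise_cmp is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Compare two byte strings of the same length.
 * Returns 0 if they match and 1 otherwise.
 */
static int wordwise_cmp(const unsigned char *a, const unsigned char *b,
			size_t count)
{
	unsigned long wa, wb;
	size_t i;

	/* Whole words first; memcpy() tolerates any alignment. */
	while (count >= sizeof(unsigned long)) {
		memcpy(&wa, a, sizeof(wa));
		memcpy(&wb, b, sizeof(wb));
		if (wa != wb)
			return 1;
		a += sizeof(wa);
		b += sizeof(wb);
		count -= sizeof(wa);
	}

	/* Remaining tail bytes, fewer than one word. */
	for (i = 0; i < count; i++) {
		if (a[i] != b[i])
			return 1;
	}
	return 0;
}
```

On architectures where an unaligned word load traps or is emulated, the
word loop can be slower than the plain byte loop, which is the property
the Kconfig symbol is meant to capture.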

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 623a272..d5a2e60 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -12,7 +12,6 @@ config ARM
 	select BUILDTIME_EXTABLE_SORT if MMU
 	select CLONE_BACKWARDS
 	select CPU_PM if (SUSPEND || CPU_IDLE)
-	select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select GENERIC_ATOMIC64 if (CPU_V7M || CPU_V6 || !CPU_32v6K || !AEABI)
 	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
 	select GENERIC_IDLE_POLL_SETUP
diff --git a/arch/arm/include/asm/word-at-a-time.h b/arch/arm/include/asm/word-at-a-time.h
index a6d0a29..778b2ad 100644
--- a/arch/arm/include/asm/word-at-a-time.h
+++ b/arch/arm/include/asm/word-at-a-time.h
@@ -54,7 +54,7 @@ static inline unsigned long find_zero(unsigned long mask)
 #include <asm-generic/word-at-a-time.h>
 #endif
 
-#ifdef CONFIG_DCACHE_WORD_ACCESS
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 
 /*
  * Load an unaligned word from kernel space.
@@ -94,5 +94,5 @@ static inline unsigned long load_unaligned_zeropad(const void *addr)
 	return ret;
 }
 
-#endif	/* DCACHE_WORD_ACCESS */
+#endif	/* HAVE_EFFICIENT_UNALIGNED_ACCESS */
 #endif /* __ASM_ARM_WORD_AT_A_TIME_H */
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 764d682..2d6978c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -13,7 +13,6 @@ config ARM64
 	select CLONE_BACKWARDS
 	select COMMON_CLK
 	select CPU_PM if (SUSPEND || CPU_IDLE)
-	select DCACHE_WORD_ACCESS
 	select GENERIC_CLOCKEVENTS
 	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
 	select GENERIC_IOMAP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index abb261e..60cfa073 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -98,7 +98,6 @@ config X86
 	select CLKEVT_I8253
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_IOMAP
-	select DCACHE_WORD_ACCESS
 	select GENERIC_SMP_IDLE_THREAD
 	select ARCH_WANT_IPC_PARSE_VERSION if X86_32
 	select HAVE_ARCH_SECCOMP_FILTER
diff --git a/fs/Kconfig b/fs/Kconfig
index 312393f..7511271 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -4,10 +4,6 @@
 
 menu "File systems"
 
-# Use unaligned word dcache accesses
-config DCACHE_WORD_ACCESS
-       bool
-
 if BLOCK
 
 source "fs/ext2/Kconfig"
diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce..4e3c195 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -163,7 +163,7 @@ int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
  * Compare 2 name strings, return 0 if they match, otherwise non-zero.
  * The strings are both count bytes long, and count is non-zero.
  */
-#ifdef CONFIG_DCACHE_WORD_ACCESS
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 
 #include <asm/word-at-a-time.h>
 /*
diff --git a/fs/namei.c b/fs/namei.c
index 385f781..1ee33ca 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1618,7 +1618,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
  *   the final mask". Again, that could be replaced with a
  *   efficient population count instruction or similar.
  */
-#ifdef CONFIG_DCACHE_WORD_ACCESS
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 
 #include <asm/word-at-a-time.h>
 


* [PATCH 0/2] sched: Removed unused mc_capable() and smt_capable()
From: Bjorn Helgaas @ 2014-03-04 21:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mips, linux-ia64, x86, linux-kernel, sparclinux,
	linuxppc-dev, linux-arm-kernel

This is just cleanup of a couple of unused interfaces and (for sparc64) a
supporting variable.

---

Bjorn Helgaas (2):
      sched: Remove unused mc_capable() and smt_capable()
      sparc64: Remove unused sparc64_multi_core


 arch/arm/include/asm/topology.h      |    3 ---
 arch/ia64/include/asm/topology.h     |    1 -
 arch/mips/include/asm/topology.h     |    4 ----
 arch/powerpc/include/asm/topology.h  |    1 -
 arch/sparc/include/asm/smp_64.h      |    1 -
 arch/sparc/include/asm/topology_64.h |    2 --
 arch/sparc/kernel/mdesc.c            |    4 ----
 arch/sparc/kernel/prom_64.c          |    3 ---
 arch/sparc/kernel/smp_64.c           |    2 --
 arch/x86/include/asm/topology.h      |    6 ------
 10 files changed, 27 deletions(-)


* [PATCH 1/2] sched: Remove unused mc_capable() and smt_capable()
From: Bjorn Helgaas @ 2014-03-04 21:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mips, linux-ia64, x86, linux-kernel, sparclinux,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <20140304210621.16893.8772.stgit@bhelgaas-glaptop.roam.corp.google.com>

Remove mc_capable() and smt_capable().  Neither is used.

Both were added by 5c45bf279d37 ("sched: mc/smt power savings sched
policy").  Uses of both were removed by 8e7fbcbc22c1 ("sched: Remove stale
power aware scheduling remnants and dysfunctional knobs").

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/arm/include/asm/topology.h      |    3 ---
 arch/ia64/include/asm/topology.h     |    1 -
 arch/mips/include/asm/topology.h     |    4 ----
 arch/powerpc/include/asm/topology.h  |    1 -
 arch/sparc/include/asm/topology_64.h |    2 --
 arch/x86/include/asm/topology.h      |    6 ------
 6 files changed, 17 deletions(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 58b8b84adcd2..2fe85fff5cca 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -20,9 +20,6 @@ extern struct cputopo_arm cpu_topology[NR_CPUS];
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_thread_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
 
-#define mc_capable()	(cpu_topology[0].socket_id != -1)
-#define smt_capable()	(cpu_topology[0].thread_id != -1)
-
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e449b75..5cb55a1e606b 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -77,7 +77,6 @@ void build_cpu_to_node_map(void);
 #define topology_core_id(cpu)			(cpu_data(cpu)->core_id)
 #define topology_core_cpumask(cpu)		(&cpu_core_map[cpu])
 #define topology_thread_cpumask(cpu)		(&per_cpu(cpu_sibling_map, cpu))
-#define smt_capable() 				(smp_num_siblings > 1)
 #endif
 
 extern void arch_fix_phys_package_id(int num, u32 slot);
diff --git a/arch/mips/include/asm/topology.h b/arch/mips/include/asm/topology.h
index 12609a17dc8b..20ea4859c822 100644
--- a/arch/mips/include/asm/topology.h
+++ b/arch/mips/include/asm/topology.h
@@ -10,8 +10,4 @@
 
 #include <topology.h>
 
-#ifdef CONFIG_SMP
-#define smt_capable()	(smp_num_siblings > 1)
-#endif
-
 #endif /* __ASM_TOPOLOGY_H */
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index d0b5fca6b077..c9202151079f 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -99,7 +99,6 @@ static inline int prrn_is_enabled(void)
 
 #ifdef CONFIG_SMP
 #include <asm/cputable.h>
-#define smt_capable()		(cpu_has_feature(CPU_FTR_SMT))
 
 #ifdef CONFIG_PPC64
 #include <asm/smp.h>
diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 1754390a426f..a2d10fc64faf 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -42,8 +42,6 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 #define topology_core_id(cpu)			(cpu_data(cpu).core_id)
 #define topology_core_cpumask(cpu)		(&cpu_core_map[cpu])
 #define topology_thread_cpumask(cpu)		(&per_cpu(cpu_sibling_map, cpu))
-#define mc_capable()				(sparc64_multi_core)
-#define smt_capable()				(sparc64_multi_core)
 #endif /* CONFIG_SMP */
 
 extern cpumask_t cpu_core_map[NR_CPUS];
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index d35f24e231cd..9bcc724cafdd 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -133,12 +133,6 @@ static inline void arch_fix_phys_package_id(int num, u32 slot)
 struct pci_bus;
 void x86_pci_root_bus_resources(int bus, struct list_head *resources);
 
-#ifdef CONFIG_SMP
-#define mc_capable()	((boot_cpu_data.x86_max_cores > 1) && \
-			(cpumask_weight(cpu_core_mask(0)) != nr_cpu_ids))
-#define smt_capable()			(smp_num_siblings > 1)
-#endif
-
 #ifdef CONFIG_NUMA
 extern int get_mp_bus_to_node(int busnum);
 extern void set_mp_bus_to_node(int busnum, int node);


