LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [V5 07/11] powerpc, perf: Change the name of HW PMU branch filter tracking variable
From: Anshuman Khandual @ 2014-02-05  8:06 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: mikey, ak, eranian, michael, acme, sukadev, mingo
In-Reply-To: <1391587582-32028-1-git-send-email-khandual@linux.vnet.ibm.com>

This patch simply changes the name of the variable from 'bhrb_filter' to
'bhrb_hw_filter' in order to add one more variable which will track SW
filters in generic powerpc book3s code which will be implemented in the
subsequent patch. This patch does not change any functionality.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/perf/core-book3s.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 8a44dc1..468f58c 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -47,7 +47,7 @@ struct cpu_hw_events {
 	int n_txn_start;
 
 	/* BHRB bits */
-	u64				bhrb_filter;	/* BHRB HW branch filter */
+	u64				bhrb_hw_filter;	/* BHRB HW branch filter */
 	int				bhrb_users;
 	void				*bhrb_context;
 	struct	perf_branch_stack	bhrb_stack;
@@ -1161,7 +1161,7 @@ static void power_pmu_enable(struct pmu *pmu)
 
  out:
 	if (cpuhw->bhrb_users)
-		ppmu->config_bhrb(cpuhw->bhrb_filter);
+		ppmu->config_bhrb(cpuhw->bhrb_hw_filter);
 
 	local_irq_restore(flags);
 }
@@ -1256,7 +1256,7 @@ nocheck:
  out:
 	if (has_branch_stack(event)) {
 		power_pmu_bhrb_enable(event);
-		cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
+		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map(
 					event->attr.branch_sample_type);
 	}
 
@@ -1639,10 +1639,10 @@ static int power_pmu_event_init(struct perf_event *event)
 	err = power_check_constraints(cpuhw, events, cflags, n + 1);
 
 	if (has_branch_stack(event)) {
-		cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
+		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map(
 					event->attr.branch_sample_type);
 
-		if(cpuhw->bhrb_filter == -1)
+		if(cpuhw->bhrb_hw_filter == -1)
 			return -EOPNOTSUPP;
 	}
 
-- 
1.7.11.7

^ permalink raw reply related

* [V5 08/11] powerpc, lib: Add new branch analysis support functions
From: Anshuman Khandual @ 2014-02-05  8:06 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: mikey, ak, eranian, michael, acme, sukadev, mingo
In-Reply-To: <1391587582-32028-1-git-send-email-khandual@linux.vnet.ibm.com>

Generic powerpc branch analysis support added in the code patching
library which will help the subsequent patch on SW based filtering
of branch records in perf.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/code-patching.h | 16 +++++++
 arch/powerpc/lib/code-patching.c         | 80 ++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index 97e02f9..39919d4 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -22,6 +22,16 @@
 #define BRANCH_SET_LINK	0x1
 #define BRANCH_ABSOLUTE	0x2
 
+#define XL_FORM_LR  0x4C000020
+#define XL_FORM_CTR 0x4C000420
+#define XL_FORM_TAR 0x4C000460
+
+#define BO_ALWAYS    0x02800000
+#define BO_CTR       0x02000000
+#define BO_CRBI_OFF  0x00800000
+#define BO_CRBI_ON   0x01800000
+#define BO_CRBI_HINT 0x00400000
+
 unsigned int create_branch(const unsigned int *addr,
 			   unsigned long target, int flags);
 unsigned int create_cond_branch(const unsigned int *addr,
@@ -56,4 +66,10 @@ static inline unsigned long ppc_function_entry(void *func)
 #endif
 }
 
+/* Perf branch filters */
+bool instr_is_return_branch(unsigned int instr);
+bool instr_is_conditional_branch(unsigned int instr);
+bool instr_is_func_call(unsigned int instr);
+bool instr_is_indirect_func_call(unsigned int instr);
+
 #endif /* _ASM_POWERPC_CODE_PATCHING_H */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index d5edbeb..a06f8b3 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -77,6 +77,7 @@ static unsigned int branch_opcode(unsigned int instr)
 	return (instr >> 26) & 0x3F;
 }
 
+/* Forms of branch instruction */
 static int instr_is_branch_iform(unsigned int instr)
 {
 	return branch_opcode(instr) == 18;
@@ -87,6 +88,85 @@ static int instr_is_branch_bform(unsigned int instr)
 	return branch_opcode(instr) == 16;
 }
 
+static int instr_is_branch_xlform(unsigned int instr)
+{
+	return branch_opcode(instr) == 19;
+}
+
+/* Classification of XL-form instruction */
+static int is_xlform_lr(unsigned int instr)
+{
+	return (instr & XL_FORM_LR) == XL_FORM_LR;
+}
+
+/* BO field analysis (B-form or XL-form) */
+static int is_bo_always(unsigned int instr)
+{
+	return (instr & BO_ALWAYS) == BO_ALWAYS;
+}
+
+/* Link bit is set */
+static int is_branch_link_set(unsigned int instr)
+{
+	return (instr & BRANCH_SET_LINK) == BRANCH_SET_LINK;
+}
+
+/* 
+ * Generic software implemented branch filters used
+ * by perf branch stack sampling when PMU does not
+ * process them for some reason.
+ */
+
+/* PERF_SAMPLE_BRANCH_ANY_RETURN */
+bool instr_is_return_branch(unsigned int instr)
+{
+	/*
+	 * Conditional and unconditional branch to LR register
+	 * without seting the link register.
+	 */
+	if (is_xlform_lr(instr) && !is_branch_link_set(instr))
+		return true;
+
+	return false;
+}
+
+/* PERF_SAMPLE_BRANCH_COND */
+bool instr_is_conditional_branch(unsigned int instr)
+{
+	/* I-form instruction - excluded */
+	if (instr_is_branch_iform(instr))
+		return false;
+
+	/* B-form or XL-form instruction */
+	if (instr_is_branch_bform(instr) || instr_is_branch_xlform(instr))  {
+
+		/* Not branch always */
+		if (!is_bo_always(instr))
+			return true;
+	}
+	return false;
+}
+
+/* PERF_SAMPLE_BRANCH_ANY_CALL */
+bool instr_is_func_call(unsigned int instr)
+{
+	/* LR should be set */
+	if (is_branch_link_set(instr))
+		return true;
+
+	return false;
+}
+
+/* PERF_SAMPLE_BRANCH_IND_CALL */
+bool instr_is_indirect_func_call(unsigned int instr)
+{
+	/* XL-form instruction with LR set */
+	if (instr_is_branch_xlform(instr) && is_branch_link_set(instr))
+		return true;
+
+	return false;
+}
+
 int instr_is_relative_branch(unsigned int instr)
 {
 	if (instr & BRANCH_ABSOLUTE)
-- 
1.7.11.7

^ permalink raw reply related

* [V5 09/11] powerpc, perf: Enable SW filtering in branch stack sampling framework
From: Anshuman Khandual @ 2014-02-05  8:06 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: mikey, ak, eranian, michael, acme, sukadev, mingo
In-Reply-To: <1391587582-32028-1-git-send-email-khandual@linux.vnet.ibm.com>

This patch enables SW based post processing of BHRB captured branches
to be able to meet more user defined branch filtration criteria in perf
branch stack sampling framework. These changes increase the number of
branch filters and their valid combinations on any powerpc64 server
platform with BHRB support. Find the summary of code changes here.

(1) struct cpu_hw_events

	Introduced two new variables track various filter values and mask

	(a) bhrb_sw_filter	Tracks SW implemented branch filter flags
	(b) bhrb_filter		Tracks both (SW and HW) branch filter flags

(2) Event creation

	Kernel will figure out supported BHRB branch filters through a PMU call
	back 'bhrb_filter_map'. This function will find out how many of the
	requested branch filters can be supported in the PMU HW. It will not
	try to invalidate any branch filter combinations. Event creation will not
	error out because of lack of HW based branch filters. Meanwhile it will
	track the overall supported branch filters in the 'bhrb_filter' variable.

	Once the PMU call back returns kernel will process the user branch filter
	request against available SW filters (bhrb_sw_filter_map) while looking
	at the 'bhrb_filter'. During this phase all the branch filters which are
	still pending from the user requested list will have to be supported in
	SW failing which the event creation will error out.

(3) SW branch filter

	During the BHRB data capture inside the PMU interrupt context, each
	of the captured 'perf_branch_entry.from' will be checked for compliance
	with applicable SW branch filters. If the entry does not conform to the
	filter requirements, it will be discarded from the final perf branch
	stack buffer.

(4) Supported SW based branch filters

	(a) PERF_SAMPLE_BRANCH_ANY_RETURN
	(b) PERF_SAMPLE_BRANCH_IND_CALL
	(c) PERF_SAMPLE_BRANCH_ANY_CALL
	(d) PERF_SAMPLE_BRANCH_COND

	Please refer the patch to understand the classification of instructions
	into these branch filter categories.

(5) Multiple branch filter semantics

	Book3 sever implementation follows the same OR semantics (as implemented in
	x86) while dealing with multiple branch filters at any point of time. SW
	branch filter analysis is carried on the data set captured in the PMU HW.
	So the resulting set of data (after applying the SW filters) will inherently
	be an AND with the HW captured set. Hence any combination of HW and SW branch
	filters will be invalid. HW based branch filters are more efficient and faster
	compared to SW implemented branch filters. So at first the PMU should decide
	whether it can support all the requested branch filters itself or not. In case
	it can support all the branch filters in an OR manner, we dont apply any SW
	branch filter on top of the HW captured set (which is the final set). This
	preserves the OR semantic of multiple branch filters as required. But in case
	where the PMU cannot support all the requested branch filters in an OR manner,
	it should not apply any it's filters and leave it upto the SW to handle them
	all. Its the PMU code's responsibility to uphold this protocol to be able to
	conform to the overall OR semantic of perf branch stack sampling framework.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/perf_event_server.h |   6 +-
 arch/powerpc/perf/core-book3s.c              | 188 ++++++++++++++++++++++++++-
 arch/powerpc/perf/power8-pmu.c               |   2 +-
 3 files changed, 187 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 3fd2f1b..f791bab 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -18,6 +18,10 @@
 #define MAX_EVENT_ALTERNATIVES	8
 #define MAX_LIMITED_HWCOUNTERS	2
 
+#define for_each_branch_sample_type(x) \
+        for ((x) = PERF_SAMPLE_BRANCH_USER; \
+             (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
+
 /*
  * This struct provides the constants and functions needed to
  * describe the PMU on a particular POWER-family CPU.
@@ -34,7 +38,7 @@ struct power_pmu {
 				unsigned long *valp);
 	int		(*get_alternatives)(u64 event_id, unsigned int flags,
 				u64 alt[]);
-	u64             (*bhrb_filter_map)(u64 branch_sample_type);
+	u64             (*bhrb_filter_map)(u64 branch_sample_type, u64 *bhrb_filter);
 	void            (*config_bhrb)(u64 pmu_bhrb_filter);
 	void		(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
 	int		(*limited_pmc_event)(u64 event_id);
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 468f58c..05b7546 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -46,8 +46,9 @@ struct cpu_hw_events {
 	unsigned int group_flag;
 	int n_txn_start;
 
-	/* BHRB bits */
 	u64				bhrb_hw_filter;	/* BHRB HW branch filter */
+	u64				bhrb_sw_filter;	/* BHRB SW branch filter */
+	u64				bhrb_filter;	/* Branch filter mask */
 	int				bhrb_users;
 	void				*bhrb_context;
 	struct	perf_branch_stack	bhrb_stack;
@@ -410,6 +411,152 @@ void update_branch_entry(struct cpu_hw_events *cpuhw, int u_index, u64 from, u64
 	return;
 }
 
+/*
+ * Instruction opcode analysis
+ *
+ * Analyse instruction opcodes and classify them
+ * into various branch filter options available.
+ * This follows the standard semantics of OR which
+ * means that instructions which conforms to `any`
+ * of the requested branch filters get picked up.
+ */
+static bool check_instruction(unsigned int *addr, u64 sw_filter)
+{
+	if (sw_filter & PERF_SAMPLE_BRANCH_ANY_RETURN) {
+		if (instr_is_return_branch(*addr))
+			return true;
+	}
+
+	if (sw_filter & PERF_SAMPLE_BRANCH_IND_CALL) {
+		if (instr_is_indirect_func_call(*addr))
+			return true;
+	}
+
+	if (sw_filter & PERF_SAMPLE_BRANCH_ANY_CALL) {
+		if (instr_is_func_call(*addr))
+			return true;
+	}
+
+	if (sw_filter & PERF_SAMPLE_BRANCH_COND) {
+		if (instr_is_conditional_branch(*addr))
+			return true;
+	}
+	return false;
+}
+
+/* 
+ * Access the instruction contained in the address and check
+ * whether it complies with the applicable SW branch filters.
+ */
+static bool keep_branch(u64 from, u64 sw_filter)
+{
+	unsigned int instr;
+	bool ret;
+
+	/*
+	 * The "from" branch for every branch record has to go
+	 * through this filter verification. So this quick check
+	 * here for no SW filters will improve performance.
+	 */
+	if (sw_filter == 0)
+		return true;
+
+	if (is_kernel_addr(from)) {
+		return check_instruction((unsigned int *) from, sw_filter);
+	} else {
+		/*
+		 * Userspace address needs to be
+		 * copied first before analysis.
+		 */
+		pagefault_disable();
+		ret =  __get_user_inatomic(instr, (unsigned int __user *) from);
+
+		/*
+		 * If the instruction could not be accessible
+		 * from user space, we still 'okay' the entry.
+		 */
+		if (ret) {
+			pagefault_enable();
+			return true;
+		}
+		pagefault_enable();
+		return check_instruction(&instr, sw_filter);
+	}
+}
+
+/*
+ * Validate whether all the requested branch filters
+ * are getting processed either in the PMU or in SW.
+ */
+static int all_filters_covered(u64 branch_sample_type, u64 bhrb_filter)
+{
+	u64 x;
+
+	if (bhrb_filter == PERF_SAMPLE_BRANCH_ANY)
+		return true;
+
+	for_each_branch_sample_type(x) {
+		if (!(branch_sample_type & x))
+			continue;
+		/*
+		 * Privilege filter requests have been already
+		 * taken care during the base PMU configuration.
+		 */
+		if ((x == PERF_SAMPLE_BRANCH_USER)
+			|| (x == PERF_SAMPLE_BRANCH_KERNEL)
+				|| (x == PERF_SAMPLE_BRANCH_HV))
+			continue;
+
+		/*
+		 * Requested filter not available either
+		 * in PMU or in SW.
+		 */
+		if (!(bhrb_filter & x))
+			return false;
+	}
+	return true;
+}
+
+/* SW implemented branch filters */
+static unsigned int power_sw_filter[] = { PERF_SAMPLE_BRANCH_ANY_CALL,
+					  PERF_SAMPLE_BRANCH_COND,
+					  PERF_SAMPLE_BRANCH_ANY_RETURN,
+					  PERF_SAMPLE_BRANCH_IND_CALL };
+
+/*
+ * Required SW based branch filters
+ *
+ * This is called after figuring out what all branch filters the
+ * PMU HW supports for the requested branch filter set. Here we
+ * will go through all the SW implemented branch filters one by
+ * one and pick them up if its not already supported in the PMU.
+ */
+static u64 bhrb_sw_filter_map(u64 branch_sample_type, u64 *bhrb_filter)
+{
+	u64 branch_sw_filter = 0;
+	unsigned int i;
+
+	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) {
+		WARN_ON(*bhrb_filter != PERF_SAMPLE_BRANCH_ANY);
+		return branch_sw_filter;
+	}
+
+	/*
+	 * PMU supported branch filters must be implemented in SW
+	 * when the PMU is unable to process them for some reason.
+	 */
+	for (i = 0; i < ARRAY_SIZE(power_sw_filter); i++) {
+		if (branch_sample_type & power_sw_filter[i]) {
+			if (!(*bhrb_filter & power_sw_filter[i])) {
+				branch_sw_filter |= power_sw_filter[i];
+				*bhrb_filter |= power_sw_filter[i];
+			}
+		}
+	}
+
+	return branch_sw_filter;
+}
+
 /* Processing BHRB entries */
 void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
 {
@@ -472,6 +619,11 @@ void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
 			tmp = power_pmu_bhrb_to(addr);
 			update_branch_entry(cpuhw, u_index, addr, tmp, pred);
 		}
+
+		/* Apply SW branch filters and drop the entry if required */
+		if (!keep_branch(cpuhw->bhrb_entries[u_index].from,
+						cpuhw->bhrb_sw_filter))
+			u_index--;
 		u_index++;
 	}
 	cpuhw->bhrb_stack.nr = u_index;
@@ -1160,6 +1312,8 @@ static void power_pmu_enable(struct pmu *pmu)
 	}
 
  out:
+
+	/* Enable PMU based branch filters */
 	if (cpuhw->bhrb_users)
 		ppmu->config_bhrb(cpuhw->bhrb_hw_filter);
 
@@ -1256,8 +1410,12 @@ nocheck:
  out:
 	if (has_branch_stack(event)) {
 		power_pmu_bhrb_enable(event);
-		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map(
-					event->attr.branch_sample_type);
+		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map
+					(event->attr.branch_sample_type,
+							&cpuhw->bhrb_filter);
+		cpuhw->bhrb_sw_filter = bhrb_sw_filter_map
+					(event->attr.branch_sample_type,
+							&cpuhw->bhrb_filter);
 	}
 
 	perf_pmu_enable(event->pmu);
@@ -1638,11 +1796,27 @@ static int power_pmu_event_init(struct perf_event *event)
 	cpuhw = &get_cpu_var(cpu_hw_events);
 	err = power_check_constraints(cpuhw, events, cflags, n + 1);
 
+	/*
+	 * BHRB branch filters implemented in PMU will take
+	 * effect when we enable the event and data set
+	 * collected thereafter will be compliant with those
+	 * branch filters. Where as the SW branch filters will
+	 * be applied during the post processing of BHRB data.
+	 */
 	if (has_branch_stack(event)) {
-		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map(
-					event->attr.branch_sample_type);
-
-		if(cpuhw->bhrb_hw_filter == -1)
+		/* Query available PMU branch filter support */
+		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map
+				(event->attr.branch_sample_type,
+						&cpuhw->bhrb_filter);
+
+		/* Query available SW branch filter support */
+		cpuhw->bhrb_sw_filter = bhrb_sw_filter_map
+				(event->attr.branch_sample_type,
+						&cpuhw->bhrb_filter);
+
+		/* Check overall coverage of branch filter request */
+		if(!all_filters_covered(event->attr.branch_sample_type,
+						cpuhw->bhrb_filter))
 			return -EOPNOTSUPP;
 	}
 
diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 8551f85..ced0d90 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -559,7 +559,7 @@ static int power8_generic_events[] = {
 	[PERF_COUNT_HW_BRANCH_MISSES] =			PM_BR_MPRED_CMPL,
 };
 
-static u64 power8_bhrb_filter_map(u64 branch_sample_type)
+static u64 power8_bhrb_filter_map(u64 branch_sample_type, u64 *bhrb_filter)
 {
 	/* BHRB and regular PMU events share the same privilege state
 	 * filter configuration. BHRB is always recorded along with a
-- 
1.7.11.7

^ permalink raw reply related

* [V5 10/11] power8, perf: Adapt BHRB PMU configuration to work with SW filters
From: Anshuman Khandual @ 2014-02-05  8:06 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: mikey, ak, eranian, michael, acme, sukadev, mingo
In-Reply-To: <1391587582-32028-1-git-send-email-khandual@linux.vnet.ibm.com>

Powerpc kernel now supports SW based branch filters for book3s systems with some
specifc requirements while dealing with HW supported branch filters in order to
achieve overall OR semantics prevailing in perf branch stack sampling framework.
This patch adapts the BHRB branch filter configuration to meet those protocols.
POWER8 PMU can only handle one HW based branch filter request at any point of time.
For all other combinations PMU will pass it on to the SW.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/perf/power8-pmu.c | 50 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index ced0d90..e5e4529 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -561,6 +561,16 @@ static int power8_generic_events[] = {
 
 static u64 power8_bhrb_filter_map(u64 branch_sample_type, u64 *bhrb_filter)
 {
+	u64 x, pmu_bhrb_filter;
+	pmu_bhrb_filter = 0;
+	*bhrb_filter = 0;
+
+	/* No branch filter requested */
+	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) {
+		*bhrb_filter = PERF_SAMPLE_BRANCH_ANY;
+		return pmu_bhrb_filter;
+	}
+
 	/* BHRB and regular PMU events share the same privilege state
 	 * filter configuration. BHRB is always recorded along with a
 	 * regular PMU event. As the privilege state filter is handled
@@ -571,16 +581,42 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type, u64 *bhrb_filter)
 	/* Ignore user, kernel, hv bits */
 	branch_sample_type &= ~PERF_SAMPLE_BRANCH_PLM_ALL;
 
-	/* No branch filter requested */
-	if (branch_sample_type == PERF_SAMPLE_BRANCH_ANY)
-		return 0;
+	/*
+	 * P8 does not support oring of PMU HW branch filters. Hence
+	 * if multiple branch filters are requested which includes filters
+	 * supported in PMU, still go ahead and clear the PMU based HW branch
+	 * filter component as in this case all the filters will be processed
+	 * in SW.
+	 */
 
-	if (branch_sample_type == PERF_SAMPLE_BRANCH_ANY_CALL) {
-		return POWER8_MMCRA_IFM1;
+	for_each_branch_sample_type(x) {
+		/* Ignore privilege branch filters */
+		if ((x == PERF_SAMPLE_BRANCH_USER)
+			|| (x == PERF_SAMPLE_BRANCH_KERNEL)
+				|| (x == PERF_SAMPLE_BRANCH_HV))
+			continue;
+
+		if (!(branch_sample_type & x))
+			continue;
+
+		/* Supported individual PMU branch filters */
+		if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+			branch_sample_type &= ~PERF_SAMPLE_BRANCH_ANY_CALL;
+			if (branch_sample_type) {
+				/* Multiple branch filters will be processed in SW */
+				pmu_bhrb_filter = 0;
+				*bhrb_filter = 0;
+				return pmu_bhrb_filter;
+			} else {
+				/* Individual branch filter will be processed in PMU */
+				pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
+				*bhrb_filter    |= PERF_SAMPLE_BRANCH_ANY_CALL;
+				return pmu_bhrb_filter;
+			}
+		}
 	}
 
-	/* Every thing else is unsupported */
-	return -1;
+	return pmu_bhrb_filter;
 }
 
 static void power8_config_bhrb(u64 pmu_bhrb_filter)
-- 
1.7.11.7

^ permalink raw reply related

* [V5 11/11] powerpc, perf: Enable privilege mode SW branch filters
From: Anshuman Khandual @ 2014-02-05  8:06 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: mikey, ak, eranian, michael, acme, sukadev, mingo
In-Reply-To: <1391587582-32028-1-git-send-email-khandual@linux.vnet.ibm.com>

This patch enables privilege mode SW branch filters. Also modifies
POWER8 PMU branch filter configuration so that the privilege mode
branch filter implemented as part of base PMU event configuration
is reflected in bhrb filter mask. As a result, the SW will skip and
not try to process the privilege mode branch filters itself.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/perf/core-book3s.c | 53 +++++++++++++++++++++++++++++++----------
 arch/powerpc/perf/power8-pmu.c  | 13 ++++++++--
 2 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 05b7546..54a9fa1 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -26,6 +26,9 @@
 #define BHRB_PREDICTION		0x0000000000000001
 #define BHRB_EA			0xFFFFFFFFFFFFFFFCUL
 
+#define POWER_ADDR_USER		0
+#define POWER_ADDR_KERNEL	1
+
 struct cpu_hw_events {
 	int n_events;
 	int n_percpu;
@@ -448,10 +451,10 @@ static bool check_instruction(unsigned int *addr, u64 sw_filter)
  * Access the instruction contained in the address and check
  * whether it complies with the applicable SW branch filters.
  */
-static bool keep_branch(u64 from, u64 sw_filter)
+static bool keep_branch(u64 from, u64 to, u64 sw_filter)
 {
 	unsigned int instr;
-	bool ret;
+	bool to_plm, ret, flag;
 
 	/*
 	 * The "from" branch for every branch record has to go
@@ -461,6 +464,37 @@ static bool keep_branch(u64 from, u64 sw_filter)
 	if (sw_filter == 0)
 		return true;
 
+	to_plm = is_kernel_addr(to) ? POWER_ADDR_KERNEL : POWER_ADDR_USER;
+
+	/*
+	 * Applying privilege mode SW branch filters first on the
+	 * 'to' address makes an AND semantic with the SW generic
+	 * branch filters (OR with each other) being applied on the
+	 * from address there after.
+	 */
+
+	/* Ignore PERF_SAMPLE_BRANCH_HV */
+	sw_filter &= ~PERF_SAMPLE_BRANCH_HV;
+
+	/* Privilege mode branch filters for "TO" address */
+	if (sw_filter & PERF_SAMPLE_BRANCH_PLM_ALL) {
+		flag = false;
+
+		if (sw_filter & PERF_SAMPLE_BRANCH_USER) {
+			if(to_plm == POWER_ADDR_USER)
+				flag = true;
+		}
+
+		if (sw_filter & PERF_SAMPLE_BRANCH_KERNEL) {
+			if(to_plm == POWER_ADDR_KERNEL)
+				flag = true;
+		}
+
+		if (!flag)
+			return false;
+	}
+
+	/* Generic branch filters for "FROM" address */
 	if (is_kernel_addr(from)) {
 		return check_instruction((unsigned int *) from, sw_filter);
 	} else {
@@ -499,15 +533,6 @@ static int all_filters_covered(u64 branch_sample_type, u64 bhrb_filter)
 		if (!(branch_sample_type & x))
 			continue;
 		/*
-		 * Privilege filter requests have been already
-		 * taken care during the base PMU configuration.
-		 */
-		if ((x == PERF_SAMPLE_BRANCH_USER)
-			|| (x == PERF_SAMPLE_BRANCH_KERNEL)
-				|| (x == PERF_SAMPLE_BRANCH_HV))
-			continue;
-
-		/*
 		 * Requested filter not available either
 		 * in PMU or in SW.
 		 */
@@ -518,7 +543,10 @@ static int all_filters_covered(u64 branch_sample_type, u64 bhrb_filter)
 }
 
 /* SW implemented branch filters */
-static unsigned int power_sw_filter[] = { PERF_SAMPLE_BRANCH_ANY_CALL,
+static unsigned int power_sw_filter[] = { PERF_SAMPLE_BRANCH_USER,
+					  PERF_SAMPLE_BRANCH_KERNEL,
+					  PERF_SAMPLE_BRANCH_HV,
+					  PERF_SAMPLE_BRANCH_ANY_CALL,
 					  PERF_SAMPLE_BRANCH_COND,
 					  PERF_SAMPLE_BRANCH_ANY_RETURN,
 					  PERF_SAMPLE_BRANCH_IND_CALL };
@@ -622,6 +650,7 @@ void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
 
 		/* Apply SW branch filters and drop the entry if required */
 		if (!keep_branch(cpuhw->bhrb_entries[u_index].from,
+					cpuhw->bhrb_entries[u_index].to,
 						cpuhw->bhrb_sw_filter))
 			u_index--;
 		u_index++;
diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index e5e4529..e084937 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -575,9 +575,19 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type, u64 *bhrb_filter)
 	 * filter configuration. BHRB is always recorded along with a
 	 * regular PMU event. As the privilege state filter is handled
 	 * in the basic PMC configuration of the accompanying regular
-	 * PMU event, we ignore any separate BHRB specific request.
+	 * PMU event, we ignore any separate BHRB specific request. But
+	 * this needs to be communicated with the branch filter mask.
 	 */
 
+	if (branch_sample_type & PERF_SAMPLE_BRANCH_USER)
+		*bhrb_filter |= PERF_SAMPLE_BRANCH_USER;
+
+	if (branch_sample_type & PERF_SAMPLE_BRANCH_KERNEL)
+		*bhrb_filter |= PERF_SAMPLE_BRANCH_KERNEL;
+
+	if (branch_sample_type & PERF_SAMPLE_BRANCH_HV)
+		*bhrb_filter |= PERF_SAMPLE_BRANCH_HV;
+
 	/* Ignore user, kernel, hv bits */
 	branch_sample_type &= ~PERF_SAMPLE_BRANCH_PLM_ALL;
 
@@ -605,7 +615,6 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type, u64 *bhrb_filter)
 			if (branch_sample_type) {
 				/* Multiple branch filters will be processed in SW */
 				pmu_bhrb_filter = 0;
-				*bhrb_filter = 0;
 				return pmu_bhrb_filter;
 			} else {
 				/* Individual branch filter will be processed in PMU */
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH V2] power7, perf: Make some new raw event codes available in sysfs
From: Anshuman Khandual @ 2014-02-05  8:51 UTC (permalink / raw)
  To: Linux PPC dev; +Cc: Michael Ellerman
In-Reply-To: <52C4F303.80007@linux.vnet.ibm.com>

On 01/02/2014 10:32 AM, Anshuman Khandual wrote:
> This patchset adds some missing event list for POWER7 PMU raw
> events which are exported through sysfs interface. Also updates
> the ABI documentation to add all the sysfs exported raw events.
> 
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  .../testing/sysfs-bus-event_source-devices-events  | 517 +++++++++++++++++++++
>  arch/powerpc/perf/power7-events-list.h             |  10 +
>  2 files changed, 527 insertions(+)

Hey Michael,

Any updates on this patch ? This does not seem to be merged yet.


Regards
Anshuman

^ permalink raw reply

* Re: [RFC PATCH 01/10] KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
From: Alexander Graf @ 2014-02-05  9:15 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Aneesh Kumar K.V, kvm-ppc, kvm-devel
In-Reply-To: <20140131221743.GA12086@iris.ozlabs.ibm.com>


On 31.01.2014, at 23:17, Paul Mackerras <paulus@samba.org> wrote:

> On Fri, Jan 31, 2014 at 11:47:44AM +0100, Alexander Graf wrote:
>>=20
>> On 31.01.2014, at 11:38, Aneesh Kumar K.V =
<aneesh.kumar@linux.vnet.ibm.com> wrote:
>>=20
>>> Alexander Graf <agraf@suse.de> writes:
>>>=20
>>>> On 01/28/2014 05:44 PM, Aneesh Kumar K.V wrote:
>>>>> We definitely don't need to emulate mtspr, because both the =
registers
>>>>> are hypervisor resource.
>>>>=20
>>>> This patch description doesn't cover what the patch actually does. =
It=20
>>>> changes the implementation from "always tell the guest it uses =
100%" to=20
>>>> "give the guest an accurate amount of cpu time spent inside guest
>>>> context".
>>>=20
>>> Will fix that
>>>=20
>>>>=20
>>>> Also, I think we either go with full hyp semantics which means we =
also=20
>>>> emulate the offset or we go with no hyp awareness in the guest at =
all=20
>>>> which means we also don't emulate SPURR which is a hyp privileged
>>>> register.
>>>=20
>>> Can you clarify this ?
>>=20
>> In the 2.06 ISA SPURR is hypervisor privileged. That changed for 2.07 =
where it became supervisor privileged. So I suppose your patch is ok. =
When reviewing those patches I only had 2.06 around because power.org =
was broken.
>=20
> It's always been supervisor privilege for reading and hypervisor
> privilege for writing, ever since it was introduced in 2.05, and that
> hasn't changed.  So I think what Aneesh is doing is correct.

This is what ISA 2.06B says:

308	SPURR	hypv		hypv		64	S
309	PURR	hypv		yes		64	S

And this is ISA 2.07:

308	SPURR	hypv		yes		64	S
309	PURR	hypv		yes		64	S

So as you can see, from 2.06 to 2.07 SPURR became supervisor readable. =
Either the spec is wrong, the respective POWER CPUs don't implement the =
spec correctly or "hypv" doesn't mean "hypv" but means "may be hypv or =
yes".

I think in the context of this patch it's perfectly reasonable to treat =
SPURR as supervisor readable.


Alex

^ permalink raw reply

* Re: [GIT PULL] tree-wide: clean up no longer required #include <linux/init.h>
From: Paul Gortmaker @ 2014-02-05 14:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-arch, Stephen Rothwell, linux-m68k, rusty, linux-ia64, kvm,
	linux-mips, linuxppc-dev, x86, linux-alpha, netdev, gregkh,
	sparclinux, akpm, Linus Torvalds, linux-arm-kernel, linux-s390
In-Reply-To: <20140205064150.GA31568@gmail.com>

[Re: [GIT PULL] tree-wide: clean up no longer required #include <linux/init.h>] On 05/02/2014 (Wed 07:41) Ingo Molnar wrote:

> 
> * Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> > Hi Ingo,
> > 
> > On Wed, 5 Feb 2014 07:06:33 +0100 Ingo Molnar <mingo@kernel.org> wrote:
> > > 
> > > So, if you meant Linus to pull it, you probably want to cite a real 
> > > Git URI along the lines of:
> > > 
> > >    git://git.kernel.org/pub/scm/linux/kernel/git/paulg/init.git
> > 
> > Paul provided the proper git url further down in the mail along with the
> > usual pull request message (I guess he should have put that bit at the
> > top).
> 
> Yeah, indeed, and it even comes with a signed tag, which is an extra 
> nice touch:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tags/init-cleanup
> 
> (I guess the https was mentioned first to lower expectations.)

Just to clarify, the init.git was the repo of raw commits+series file
that was used for testing on linux next; now useless, except for showing
the last several weeks of history (hence the visual http link).  The
signed tag [separate repo] is the application of those commits against
the 3.14-rc1 tag, which was the end goal from the beginning.

Does history matter?  In the case of a cleanup like this, it does only
in the immediate context of this pull request; to help distinguish this
work from some short lived half baked idea that also had its testing
invalidated by arbitrarily rebasing onto the latest shiny tag.

I wouldn't have even mentioned the patch repo, except for the fact that
I know how Linus loves arbitrary rebases [and malformed pull requests]  :)

Thanks,
Paul.
--

> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply

* [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
From: Thadeu Lima de Souza Cascardo @ 2014-02-05 18:20 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: paulus, shangw, Thadeu Lima de Souza Cascardo

Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
which may grab a reference to a driver, but not release it.

That prevents a driver from being removed after it has gone through EEH
recovery.

This patch drops the reference if it was taken.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 7bb30dc..fdc679d 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
 	 */
 	if (!dev || (dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
 		return NULL;
+
 	driver = eeh_pcid_get(dev);
-	if (driver && driver->err_handler)
-		return NULL;
+	if (driver) {
+		eeh_pcid_put(dev);
+		if (driver->err_handler)
+			return NULL;
+	}
 
 	/* Remove it from PCI subsystem */
 	pr_debug("EEH: Removing %s without EEH sensitive driver\n",
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
From: Nishanth Aravamudan @ 2014-02-05 18:43 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo; +Cc: paulus, linuxppc-dev, shangw
In-Reply-To: <1391624445-9095-1-git-send-email-cascardo@linux.vnet.ibm.com>

On 05.02.2014 [16:20:45 -0200], Thadeu Lima de Souza Cascardo wrote:
> Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
> partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
> which may grab a reference to a driver, but not release it.
> 
> That prevents a driver from being removed after it has gone through EEH
> recovery.
> 
> This patch drops the reference if it was taken.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
> Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/eeh_driver.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 7bb30dc..fdc679d 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
>  	 */
>  	if (!dev || (dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
>  		return NULL;
> +

This appears to be unnecessary whitespace change?

-Nish

>  	driver = eeh_pcid_get(dev);
> -	if (driver && driver->err_handler)
> -		return NULL;
> +	if (driver) {
> +		eeh_pcid_put(dev);
> +		if (driver->err_handler)
> +			return NULL;
> +	}
> 
>  	/* Remove it from PCI subsystem */
>  	pr_debug("EEH: Removing %s without EEH sensitive driver\n",
> -- 
> 1.7.1
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply

* Re: [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
From: Thadeu Lima de Souza Cascardo @ 2014-02-05 19:08 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linuxppc-dev, paulus, shangw
In-Reply-To: <20140205184338.GA5433@linux.vnet.ibm.com>

On Wed, Feb 05, 2014 at 10:43:38AM -0800, Nishanth Aravamudan wrote:
> On 05.02.2014 [16:20:45 -0200], Thadeu Lima de Souza Cascardo wrote:
> > Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
> > partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
> > which may grab a reference to a driver, but not release it.
> > 
> > That prevents a driver from being removed after it has gone through EEH
> > recovery.
> > 
> > This patch drops the reference if it was taken.
> > 
> > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
> > Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/eeh_driver.c |    8 ++++++--
> >  1 files changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> > index 7bb30dc..fdc679d 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
> >  	 */
> >  	if (!dev || (dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
> >  		return NULL;
> > +
> 
> This appears to be unnecessary whitespace change?
> 
> -Nish
> 

Hi, Nish.

I originally add it there for readability, giving both more evidence to
the code below, where a driver reference is get and put, and to the code
above and its respective comment. Leaving those together could give the
impression the comment also applies to the code below.

But I have no strong feelings about that.

Ben, do you want me to send a new version?

Regards.
Cascardo.

> >  	driver = eeh_pcid_get(dev);
> > -	if (driver && driver->err_handler)
> > -		return NULL;
> > +	if (driver) {
> > +		eeh_pcid_put(dev);
> > +		if (driver->err_handler)
> > +			return NULL;
> > +	}
> > 
> >  	/* Remove it from PCI subsystem */
> >  	pr_debug("EEH: Removing %s without EEH sensitive driver\n",
> > -- 
> > 1.7.1
> > 
> > _______________________________________________
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Christoph Lameter @ 2014-02-05 19:28 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, David Rientjes, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140205001352.GC10101@linux.vnet.ibm.com>

On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

> > If the target node allocation fails (for whatever reason) then I would
> > recommend for simplicities sake to change the target node to
> > NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> > complex solution would be to look through partial lists in increasing
> > distance to find a partially used slab that is reasonable close to the
> > current node. Slab has logic like that in fallback_alloc(). Slubs
> > get_any_partial() function does something close to what you want.
>
> I apologize for my own ignorance, but I'm having trouble following.
> Anton's original patch did fallback to the current cpu slab, but I'm not
> sure any NUMA_NO_NODE change is necessary there. At the point we're
> deactivating the slab (in the current code, in __slab_alloc()), we have
> successfully allocated from somewhere, it's just not on the node we
> expected to be on.

Right so if we are ignoring the node then the simplest thing to do is to
not deactivate the current cpu slab but to take an object from it.

> So perhaps you are saying to make a change lower in the code? I'm not
> sure where it makes sense to change the target node in that case. I'd
> appreciate any guidance you can give.

This not an easy thing to do. If the current slab is not the right node
but would be the node from which the page allocator would be returning
memory then the current slab can still be allocated from. If the fallback
is to another node then the current cpu slab needs to be deactivated and
the allocation from that node needs to proceeed. Have a look at
fallback_alloc() in the slab allocator.

A allocation attempt from the page allocator can be restricted to a
specific node through GFP_THIS_NODE.

^ permalink raw reply

* Re: [PATCH 2/2] clocksource: Make clocksource register functions void
From: Thomas Gleixner @ 2014-02-05 20:39 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-mips, Kevin Hilman, linux, Sekhar Nori, Paul Mackerras,
	H. Peter Anvin, Yijing Wang, Daniel Walker,
	Hans-Christian Egtvedt, Jonas Bonn, Kukjin Kim, Russell King,
	Richard Weinberger, x86, Tony Lindgren, Ingo Molnar,
	linux-arm-msm, David Brown, Haavard Skinnemoen, Mike Frysinger,
	user-mode-linux-devel, Nicolas Ferre, Jeff Dike, Barry Song,
	linux-samsung-soc, John Stultz, user-mode-linux-user, linux-omap,
	linux-arm-kernel, davinci-linux-open-source, Michal Simek,
	Jim Cromie, microblaze-uclinux, Hanjun Guo, linux-kernel,
	Ralf Baechle, Tony Prisk, Bryan Huntsman, uclinux-dist-devel,
	linuxppc-dev
In-Reply-To: <52E0D575.5050702@linaro.org>

On Thu, 23 Jan 2014, Daniel Lezcano wrote:
> On 01/23/2014 08:12 AM, Yijing Wang wrote:
> > Currently, clocksource_register() and __clocksource_register_scale()
> > functions always return 0, it's pointless, make functions void.
> > And remove the dead code that check the clocksource_register_hz()
> > return value.
> > 
> > Signed-off-by: Yijing Wang <wangyijing@huawei.com>
> 
> Well, do we really want to change all these files to not take care of a return
> value ? What about is we have to check it again later ?
> 
> I would recommend to investigate __clocksource_register_scale and the
> underneath functions if there is not an error to be returned in the call stack
> somewhere which is ignored today.
> 
> The same applies for clocksource_register.

There is really no point in making it fail. It's so low level that
anything more than a proper printk/BUG/WARN is overkill.

Thanks,

	tglx

^ permalink raw reply

* RE: [PATCH 2/2] clocksource: Make clocksource register functions void
From: Thomas Gleixner @ 2014-02-05 20:40 UTC (permalink / raw)
  To: David Laight
  Cc: linux-mips@linux-mips.org, x86@kernel.org, Kevin Hilman,
	linux@lists.openrisc.net, Sekhar Nori, Michal Simek,
	Paul Mackerras, Ralf Baechle, H. Peter Anvin, Yijing Wang,
	Daniel Walker, Hans-Christian Egtvedt, Jonas Bonn, Kukjin Kim,
	Russell King, Richard Weinberger, Daniel Lezcano, Tony Lindgren,
	Ingo Molnar, microblaze-uclinux@itee.uq.edu.au, David Brown,
	Haavard Skinnemoen, Mike Frysinger,
	user-mode-linux-devel@lists.sourceforge.net,
	linux-arm-msm@vger.kernel.org, Jeff Dike,
	davinci-linux-open-source@linux.davincidsp.com,
	linux-samsung-soc@vger.kernel.org, John Stultz,
	user-mode-linux-user@lists.sourceforge.net,
	linux-omap@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Barry Song, Jim Cromie, Hanjun Guo, linux-kernel@vger.kernel.org,
	Nicolas Ferre, 'Tony Prisk', Bryan Huntsman,
	uclinux-dist-devel@blackfin.uclinux.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D46489C@AcuExch.aculab.com>

Yijing,

On Thu, 23 Jan 2014, David Laight wrote:

> From: Linuxppc-dev Tony Prisk
> > On 23/01/14 20:12, Yijing Wang wrote:
> > > Currently, clocksource_register() and __clocksource_register_scale()
> > > functions always return 0, it's pointless, make functions void.
> > > And remove the dead code that check the clocksource_register_hz()
> > > return value.
> > ......
> > > -static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
> > > +static inline void clocksource_register_hz(struct clocksource *cs, u32 hz)
> > >   {
> > >   	return __clocksource_register_scale(cs, 1, hz);
> > >   }
> > 
> > This doesn't make sense - you are still returning a value on a function
> > declared void, and the return is now from a function that doesn't return
> > anything either ?!?!
> > Doesn't this throw a compile-time warning??
> 
> It depends on the compiler.
> Recent gcc allow it.
> I don't know if it is actually valid C though.
> 
> There is no excuse for it on lines like the above though.

Can you please resend with that fixed against 3.14-rc1 ?

Thanks,

	tglx

^ permalink raw reply

* [PATCH 13/51] powerpc, sysfs: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-02-05 22:06 UTC (permalink / raw)
  To: paulus, oleg, rusty, peterz, tglx, akpm
  Cc: ego, walken, linux, linux-kernel, Wang Dongsheng, Olof Johansson,
	Madhavan Srinivasan, Paul Mackerras, Srivatsa S. Bhat, tj,
	paulmck, linuxppc-dev, mingo
In-Reply-To: <20140205220251.19080.92336.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_maps_update_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_maps_update_done();


Fix the sysfs code in powerpc by using this latter form of callback
registration.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Wang Dongsheng <dongsheng.wang@freescale.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/sysfs.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 97e1dc9..c29ad44 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -975,7 +975,8 @@ static int __init topology_init(void)
 	int cpu;
 
 	register_nodes();
-	register_cpu_notifier(&sysfs_cpu_nb);
+
+	cpu_maps_update_begin();
 
 	for_each_possible_cpu(cpu) {
 		struct cpu *c = &per_cpu(cpu_devices, cpu);
@@ -999,6 +1000,11 @@ static int __init topology_init(void)
 		if (cpu_online(cpu))
 			register_cpu_online(cpu);
 	}
+
+	__register_cpu_notifier(&sysfs_cpu_nb);
+
+	cpu_maps_update_done();
+
 #ifdef CONFIG_PPC64
 	sysfs_create_dscr_default();
 #endif /* CONFIG_PPC64 */

^ permalink raw reply related

* Re: [PATCH 4/8] powerpc: add hv_gpci interface header
From: Cody P Schafer @ 2014-02-05 23:14 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC
  Cc: Ingo Molnar, Paul Mackerras, Peter Zijlstra,
	Arnaldo Carvalho de Melo, LKML
In-Reply-To: <20140201055807.267F12C00BA@ozlabs.org>

>> diff --git a/arch/powerpc/include/asm/hv_gpci.h b/arch/powerpc/include/asm/hv_gpci.h
>
> Any reason this can't just live in arch/powerpc/perf ?
>

Nope, it should be able to keep the header there for now. As this 
interface allows determination of the HW topology, we may have some code 
that exposes that (in sysfs) at some point, which doesn't really belong 
in arch/powerpc/perf (though we could just put it there anyhow).

^ permalink raw reply

* [PATCH-next] powerpc: delete old PrPMC 280/2800 support
From: Paul Gortmaker @ 2014-02-06  1:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Gortmaker, Paul Mackerras, linuxppc-dev

This processor/memory module was mostly used on ATCA blades and
before that, on cPCI blades.  It wasn't really user friendly, with
custom non u-boot bootloaders (powerboot/motload) and no real way
to recover corrupted boot flash (which was a common problem).

As such, it had its day back before the big ppc --> powerpc move
to device trees, and that was largely through commercial BSPs that
started to dry up around 2007.

Systems using one were largely in a "deploy and sustain" mode,
so interest in upgrading to new kernels in the field was nil.
Also, requiring 50A, 48V power supplies and a 2'x2'x2' ATCA
chassis largely rules out any hobbyist/enthusiast interest.

The point of all this, is that we might as well delete the in
kernel files relating to this platform.  No point in continuing
to build it via walking the defconfigs or via linux-next testing.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---

[The above description is not idle speculation based on 2nd hand
info -- I personally suffered with these platforms, and when faced
with a chance to take some for free vs. letting them go to e-waste,
I happily chose the latter option.   So lets get this out early in
the 3.15 cycle so any objections have a full window to appear.  ]

 arch/powerpc/boot/Makefile                     |   5 +-
 arch/powerpc/configs/prpmc2800_defconfig       | 108 -----------------
 arch/powerpc/platforms/embedded6xx/Kconfig     |   8 --
 arch/powerpc/platforms/embedded6xx/Makefile    |   1 -
 arch/powerpc/platforms/embedded6xx/prpmc2800.c | 156 -------------------------
 5 files changed, 2 insertions(+), 276 deletions(-)
 delete mode 100644 arch/powerpc/configs/prpmc2800_defconfig
 delete mode 100644 arch/powerpc/platforms/embedded6xx/prpmc2800.c

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 90e9d9548660..a1f8c7f1ec60 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -54,7 +54,7 @@ zlib       := inffast.c inflate.c inftrees.c
 zlibheader := inffast.h inffixed.h inflate.h inftrees.h infutil.h
 zliblinuxheader := zlib.h zconf.h zutil.h
 
-$(addprefix $(obj)/,$(zlib) cuboot-c2k.o gunzip_util.o main.o prpmc2800.o): \
+$(addprefix $(obj)/,$(zlib) cuboot-c2k.o gunzip_util.o main.o): \
 	$(addprefix $(obj)/,$(zliblinuxheader)) $(addprefix $(obj)/,$(zlibheader))
 
 libfdt       := fdt.c fdt_ro.c fdt_wip.c fdt_sw.c fdt_rw.c fdt_strerror.c
@@ -95,7 +95,7 @@ src-plat-$(CONFIG_FSL_SOC_BOOKE) += cuboot-85xx.c cuboot-85xx-cpm2.c
 src-plat-$(CONFIG_EMBEDDED6xx) += cuboot-pq2.c cuboot-mpc7448hpc2.c \
 					cuboot-c2k.c gamecube-head.S \
 					gamecube.c wii-head.S wii.c holly.c \
-					prpmc2800.c fixed-head.S mvme5100.c
+					fixed-head.S mvme5100.c
 src-plat-$(CONFIG_AMIGAONE) += cuboot-amigaone.c
 src-plat-$(CONFIG_PPC_PS3) += ps3-head.S ps3-hvcall.S ps3.c
 src-plat-$(CONFIG_EPAPR_BOOT) += epapr.c epapr-wrapper.c
@@ -204,7 +204,6 @@ image-$(CONFIG_PPC_CHRP)		+= zImage.chrp
 image-$(CONFIG_PPC_EFIKA)		+= zImage.chrp
 image-$(CONFIG_PPC_PMAC)		+= zImage.pmac
 image-$(CONFIG_PPC_HOLLY)		+= dtbImage.holly
-image-$(CONFIG_PPC_PRPMC2800)		+= dtbImage.prpmc2800
 image-$(CONFIG_DEFAULT_UIMAGE)		+= uImage
 image-$(CONFIG_EPAPR_BOOT)		+= zImage.epapr
 
diff --git a/arch/powerpc/configs/prpmc2800_defconfig b/arch/powerpc/configs/prpmc2800_defconfig
deleted file mode 100644
index cd80fb615d34..000000000000
--- a/arch/powerpc/configs/prpmc2800_defconfig
+++ /dev/null
@@ -1,108 +0,0 @@
-CONFIG_ALTIVEC=y
-CONFIG_EXPERIMENTAL=y
-CONFIG_SYSVIPC=y
-CONFIG_POSIX_MQUEUE=y
-CONFIG_LOG_BUF_SHIFT=14
-CONFIG_BLK_DEV_INITRD=y
-# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
-# CONFIG_PPC_CHRP is not set
-# CONFIG_PPC_PMAC is not set
-CONFIG_EMBEDDED6xx=y
-CONFIG_PPC_PRPMC2800=y
-CONFIG_HIGHMEM=y
-CONFIG_NO_HZ=y
-CONFIG_HIGH_RES_TIMERS=y
-CONFIG_BINFMT_MISC=y
-CONFIG_SPARSE_IRQ=y
-# CONFIG_SECCOMP is not set
-CONFIG_NET=y
-CONFIG_PACKET=y
-CONFIG_UNIX=y
-CONFIG_XFRM_USER=y
-CONFIG_INET=y
-CONFIG_IP_MULTICAST=y
-CONFIG_IP_PNP=y
-CONFIG_IP_PNP_DHCP=y
-CONFIG_IP_PNP_BOOTP=y
-CONFIG_SYN_COOKIES=y
-# CONFIG_IPV6 is not set
-CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
-CONFIG_MTD=y
-CONFIG_MTD_CONCAT=y
-CONFIG_MTD_PARTITIONS=y
-CONFIG_MTD_CHAR=y
-CONFIG_MTD_BLOCK=y
-CONFIG_MTD_CFI=y
-CONFIG_MTD_JEDECPROBE=y
-CONFIG_MTD_CFI_INTELEXT=y
-CONFIG_MTD_PHYSMAP_OF=y
-CONFIG_PROC_DEVICETREE=y
-CONFIG_BLK_DEV_LOOP=y
-CONFIG_BLK_DEV_RAM=y
-CONFIG_BLK_DEV_RAM_SIZE=131072
-CONFIG_IDE=y
-CONFIG_BLK_DEV_GENERIC=y
-CONFIG_BLK_DEV_PDC202XX_NEW=y
-CONFIG_BLK_DEV_SD=y
-CONFIG_ATA=y
-CONFIG_SATA_MV=y
-CONFIG_MACINTOSH_DRIVERS=y
-CONFIG_NETDEVICES=y
-CONFIG_NET_ETHERNET=y
-CONFIG_NET_PCI=y
-CONFIG_E100=y
-CONFIG_8139TOO=y
-# CONFIG_8139TOO_PIO is not set
-CONFIG_E1000=y
-CONFIG_MV643XX_ETH=y
-# CONFIG_INPUT_KEYBOARD is not set
-# CONFIG_INPUT_MOUSE is not set
-# CONFIG_SERIO is not set
-CONFIG_SERIAL_MPSC=y
-CONFIG_SERIAL_MPSC_CONSOLE=y
-# CONFIG_HW_RANDOM is not set
-CONFIG_I2C=y
-CONFIG_I2C_CHARDEV=y
-CONFIG_I2C_MV64XXX=y
-CONFIG_VIDEO_OUTPUT_CONTROL=y
-CONFIG_HID_DRAGONRISE=y
-CONFIG_HID_GYRATION=y
-CONFIG_HID_TWINHAN=y
-CONFIG_HID_NTRIG=y
-CONFIG_HID_ORTEK=y
-CONFIG_HID_PANTHERLORD=y
-CONFIG_HID_PETALYNX=y
-CONFIG_HID_SAMSUNG=y
-CONFIG_HID_SONY=y
-CONFIG_HID_SUNPLUS=y
-CONFIG_HID_GREENASIA=y
-CONFIG_HID_SMARTJOYPLUS=y
-CONFIG_HID_TOPSEED=y
-CONFIG_HID_THRUSTMASTER=y
-CONFIG_THRUSTMASTER_FF=y
-CONFIG_HID_ZEROPLUS=y
-CONFIG_ZEROPLUS_FF=y
-CONFIG_USB=y
-CONFIG_USB_DEVICEFS=y
-# CONFIG_USB_DEVICE_CLASS is not set
-CONFIG_USB_MON=y
-CONFIG_USB_EHCI_HCD=y
-CONFIG_USB_OHCI_HCD=y
-CONFIG_RTC_CLASS=y
-CONFIG_RTC_DRV_MAX6900=y
-CONFIG_EXT2_FS=y
-CONFIG_EXT3_FS=y
-# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
-CONFIG_INOTIFY=y
-CONFIG_PROC_KCORE=y
-CONFIG_TMPFS=y
-CONFIG_NFS_FS=y
-CONFIG_ROOT_NFS=y
-CONFIG_PARTITION_ADVANCED=y
-CONFIG_CRC_T10DIF=y
-# CONFIG_RCU_CPU_STALL_DETECTOR is not set
-CONFIG_SYSCTL_SYSCALL_CHECK=y
-# CONFIG_CRYPTO_ANSI_CPRNG is not set
diff --git a/arch/powerpc/platforms/embedded6xx/Kconfig b/arch/powerpc/platforms/embedded6xx/Kconfig
index 6d3c7a9fd047..3fdc8bc6258f 100644
--- a/arch/powerpc/platforms/embedded6xx/Kconfig
+++ b/arch/powerpc/platforms/embedded6xx/Kconfig
@@ -49,14 +49,6 @@ config PPC_HOLLY
 	  Select PPC_HOLLY if configuring for an IBM 750GX/CL Eval
 	  Board with TSI108/9 bridge (Hickory/Holly)
 
-config PPC_PRPMC2800
-	bool "Motorola-PrPMC2800"
-	depends on EMBEDDED6xx
-	select MV64X60
-	select NOT_COHERENT_CACHE
-	help
-	  This option enables support for the Motorola PrPMC2800 board
-
 config PPC_C2K
 	bool "SBS/GEFanuc C2K board"
 	depends on EMBEDDED6xx
diff --git a/arch/powerpc/platforms/embedded6xx/Makefile b/arch/powerpc/platforms/embedded6xx/Makefile
index cdd48d402b93..f126a2a09981 100644
--- a/arch/powerpc/platforms/embedded6xx/Makefile
+++ b/arch/powerpc/platforms/embedded6xx/Makefile
@@ -5,7 +5,6 @@ obj-$(CONFIG_MPC7448HPC2)	+= mpc7448_hpc2.o
 obj-$(CONFIG_LINKSTATION)	+= linkstation.o ls_uart.o
 obj-$(CONFIG_STORCENTER)	+= storcenter.o
 obj-$(CONFIG_PPC_HOLLY)		+= holly.o
-obj-$(CONFIG_PPC_PRPMC2800)	+= prpmc2800.o
 obj-$(CONFIG_PPC_C2K)		+= c2k.o
 obj-$(CONFIG_USBGECKO_UDBG)	+= usbgecko_udbg.o
 obj-$(CONFIG_GAMECUBE_COMMON)	+= flipper-pic.o
diff --git a/arch/powerpc/platforms/embedded6xx/prpmc2800.c b/arch/powerpc/platforms/embedded6xx/prpmc2800.c
deleted file mode 100644
index d455f08bea53..000000000000
--- a/arch/powerpc/platforms/embedded6xx/prpmc2800.c
+++ /dev/null
@@ -1,156 +0,0 @@
-/*
- * Board setup routines for the Motorola PrPMC2800
- *
- * Author: Dale Farnsworth <dale@farnsworth.org>
- *
- * 2007 (c) MontaVista, Software, Inc.  This file is licensed under
- * the terms of the GNU General Public License version 2.  This program
- * is licensed "as is" without any warranty of any kind, whether express
- * or implied.
- */
-
-#include <linux/stddef.h>
-#include <linux/kernel.h>
-#include <linux/delay.h>
-#include <linux/interrupt.h>
-#include <linux/seq_file.h>
-
-#include <asm/machdep.h>
-#include <asm/prom.h>
-#include <asm/time.h>
-
-#include <mm/mmu_decl.h>
-
-#include <sysdev/mv64x60.h>
-
-#define MV64x60_MPP_CNTL_0	0x0000
-#define MV64x60_MPP_CNTL_2	0x0008
-
-#define MV64x60_GPP_IO_CNTL	0x0000
-#define MV64x60_GPP_LEVEL_CNTL	0x0010
-#define MV64x60_GPP_VALUE_SET	0x0018
-
-#define PLATFORM_NAME_MAX	32
-
-static char prpmc2800_platform_name[PLATFORM_NAME_MAX];
-
-static void __iomem *mv64x60_mpp_reg_base;
-static void __iomem *mv64x60_gpp_reg_base;
-
-static void __init prpmc2800_setup_arch(void)
-{
-	struct device_node *np;
-	phys_addr_t paddr;
-	const unsigned int *reg;
-
-	/*
-	 * ioremap mpp and gpp registers in case they are later
-	 * needed by prpmc2800_reset_board().
-	 */
-	np = of_find_compatible_node(NULL, NULL, "marvell,mv64360-mpp");
-	reg = of_get_property(np, "reg", NULL);
-	paddr = of_translate_address(np, reg);
-	of_node_put(np);
-	mv64x60_mpp_reg_base = ioremap(paddr, reg[1]);
-
-	np = of_find_compatible_node(NULL, NULL, "marvell,mv64360-gpp");
-	reg = of_get_property(np, "reg", NULL);
-	paddr = of_translate_address(np, reg);
-	of_node_put(np);
-	mv64x60_gpp_reg_base = ioremap(paddr, reg[1]);
-
-#ifdef CONFIG_PCI
-	mv64x60_pci_init();
-#endif
-
-	printk("Motorola %s\n", prpmc2800_platform_name);
-}
-
-static void prpmc2800_reset_board(void)
-{
-	u32 temp;
-
-	local_irq_disable();
-
-	temp = in_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_0);
-	temp &= 0xFFFF0FFF;
-	out_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_0, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL);
-	temp |= 0x00000004;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL);
-	temp |= 0x00000004;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL, temp);
-
-	temp = in_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_2);
-	temp &= 0xFFFF0FFF;
-	out_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_2, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL);
-	temp |= 0x00080000;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL);
-	temp |= 0x00080000;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL, temp);
-
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_VALUE_SET, 0x00080004);
-}
-
-static void prpmc2800_restart(char *cmd)
-{
-	volatile ulong i = 10000000;
-
-	prpmc2800_reset_board();
-
-	while (i-- > 0);
-	panic("restart failed\n");
-}
-
-#ifdef CONFIG_NOT_COHERENT_CACHE
-#define PPRPM2800_COHERENCY_SETTING "off"
-#else
-#define PPRPM2800_COHERENCY_SETTING "on"
-#endif
-
-void prpmc2800_show_cpuinfo(struct seq_file *m)
-{
-	seq_printf(m, "Vendor\t\t: Motorola\n");
-	seq_printf(m, "coherency\t: %s\n", PPRPM2800_COHERENCY_SETTING);
-}
-
-/*
- * Called very early, device-tree isn't unflattened
- */
-static int __init prpmc2800_probe(void)
-{
-	unsigned long root = of_get_flat_dt_root();
-	unsigned long len = PLATFORM_NAME_MAX;
-	void *m;
-
-	if (!of_flat_dt_is_compatible(root, "motorola,PrPMC2800"))
-		return 0;
-
-	/* Update ppc_md.name with name from dt */
-	m = of_get_flat_dt_prop(root, "model", &len);
-	if (m)
-		strncpy(prpmc2800_platform_name, m,
-			min((int)len, PLATFORM_NAME_MAX - 1));
-
-	_set_L2CR(_get_L2CR() | L2CR_L2E);
-	return 1;
-}
-
-define_machine(prpmc2800){
-	.name			= prpmc2800_platform_name,
-	.probe			= prpmc2800_probe,
-	.setup_arch		= prpmc2800_setup_arch,
-	.init_early		= mv64x60_init_early,
-	.show_cpuinfo		= prpmc2800_show_cpuinfo,
-	.init_IRQ		= mv64x60_init_irq,
-	.get_irq		= mv64x60_get_irq,
-	.restart		= prpmc2800_restart,
-	.calibrate_decr		= generic_calibrate_decr,
-};
-- 
1.8.5.2

^ permalink raw reply related

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Nishanth Aravamudan @ 2014-02-06  2:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, penberg, linux-mm, paulus, Anton Blanchard, mpm,
	Christoph Lameter, linuxppc-dev, Joonsoo Kim, Wanpeng Li
In-Reply-To: <alpine.DEB.2.02.1401241618500.20466@chino.kir.corp.google.com>

On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> 
> > Thank you for clarifying and providing  a test patch. I ran with this on
> > the system showing the original problem, configured to have 15GB of
> > memory.
> > 
> > With your patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:         8768192 kB
> > Slab:            3882560 kB
> > SReclaimable:     105408 kB
> > SUnreclaim:      3777152 kB
> > 
> > With Anton's patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:        11195008 kB
> > Slab:            1427968 kB
> > SReclaimable:     109184 kB
> > SUnreclaim:      1318784 kB
> > 
> > 
> > I know that's fairly unscientific, but the numbers are reproducible. 
> > 
> 
> I don't think the goal of the discussion is to reduce the amount of slab 
> allocated, but rather get the most local slab memory possible by use of 
> kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> which is probably cpu_to_node() for a cpu bound to a node without memory, 
> my patch is allocating it on the most local node; Anton's patch is 
> allocating it on whatever happened to be the cpu slab.
> 
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2278,10 +2278,14 @@ redo:
> > > 
> > >  	if (unlikely(!node_match(page, node))) {
> > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > -		deactivate_slab(s, page, c->freelist);
> > > -		c->page = NULL;
> > > -		c->freelist = NULL;
> > > -		goto new_slab;
> > > +		if (unlikely(!node_present_pages(node)))
> > > +			node = numa_mem_id();
> > > +		if (!node_match(page, node)) {
> > > +			deactivate_slab(s, page, c->freelist);
> > > +			c->page = NULL;
> > > +			c->freelist = NULL;
> > > +			goto new_slab;
> > > +		}
> > 
> > Semantically, and please correct me if I'm wrong, this patch is saying
> > if we have a memoryless node, we expect the page's locality to be that
> > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > Just wanting to make sure I understand the intent.
> > 
> 
> Yeah, the default policy should be to fallback to local memory if the node 
> passed is memoryless.
> 
> > What I find odd is that there are only 2 nodes on this system, node 0
> > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > should be coming from node 1 (thus node_match() should always be true?)
> > 
> 
> The nice thing about slub is its debugging ability, what is 
> /sys/kernel/slab/cache/objects showing in comparison between the two 
> patches?

Ok, I finally got around to writing a script that compares the objects
output from both kernels.

log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Joonsoo's patch.

log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Anton's patch.

slab                           objects    objects   percent
                               log1       log2      change
-----------------------------------------------------------
:t-0000104                     71190      85680      20.353982 %
UDP                            4352       3392       22.058824 %
inode_cache                    54302      41923      22.796582 %
fscache_cookie_jar             3276       2457       25.000000 %
:t-0000896                     438        292        33.333333 %
:t-0000080                     310401     195323     37.073978 %
ext4_inode_cache               335        201        40.000000 %
:t-0000192                     89408      128898     44.168307 %
:t-0000184                     151300     81880      45.882353 %
:t-0000512                     49698      73648      48.191074 %
:at-0000192                    242867     120948     50.199904 %
xfs_inode                      34350      15221      55.688501 %
:t-0016384                     11005      17257      56.810541 %
proc_inode_cache               103868     34717      66.575846 %
tw_sock_TCP                    768        256        66.666667 %
:t-0004096                     15240      25672      68.451444 %
nfs_inode_cache                1008       315        68.750000 %
:t-0001024                     14528      24720      70.154185 %
:t-0032768                     655        1312       100.305344%
:t-0002048                     14242      30720      115.700042%
:t-0000640                     1020       2550       150.000000%
:t-0008192                     10005      27905      178.910545%

FWIW, the configuration of this LPAR has slightly changed. It is now configured
for maximally 400 CPUs, of which 200 are present. The result is that even with
Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
script reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-512                        1182 MB    2.03%  100.00%
kmalloc-192                        1182 MB    1.38%  100.00%
kmalloc-16384                       966 MB   17.66%  100.00%
kmalloc-4096                        353 MB   15.92%  100.00%
kmalloc-8192                        259 MB   27.28%  100.00%
kmalloc-32768                       207 MB    9.86%  100.00%

In comparison (log2 above):

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       273 MB   98.76%  100.00%
kmalloc-8192                        225 MB   98.67%  100.00%
pgtable-2^11                        114 MB  100.00%  100.00%
pgtable-2^12                        109 MB  100.00%  100.00%
kmalloc-4096                        104 MB   98.59%  100.00%

I appreciate all the help so far, if anyone has any ideas how best to
proceed further, or what they'd like debugged more, I'm happy to get
this fixed. We're hitting this on a couple of different systems and I'd
like to find a good resolution to the problem.

Thanks,
Nish

^ permalink raw reply

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Nishanth Aravamudan @ 2014-02-06  2:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, David Rientjes, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402051312430.21661@nuc>

On 05.02.2014 [13:28:03 -0600], Christoph Lameter wrote:
> On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:
> 
> > > If the target node allocation fails (for whatever reason) then I would
> > > recommend for simplicities sake to change the target node to
> > > NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> > > complex solution would be to look through partial lists in increasing
> > > distance to find a partially used slab that is reasonable close to the
> > > current node. Slab has logic like that in fallback_alloc(). Slubs
> > > get_any_partial() function does something close to what you want.
> >
> > I apologize for my own ignorance, but I'm having trouble following.
> > Anton's original patch did fallback to the current cpu slab, but I'm not
> > sure any NUMA_NO_NODE change is necessary there. At the point we're
> > deactivating the slab (in the current code, in __slab_alloc()), we have
> > successfully allocated from somewhere, it's just not on the node we
> > expected to be on.
> 
> Right so if we are ignoring the node then the simplest thing to do is to
> not deactivate the current cpu slab but to take an object from it.

Ok, that's what Anton's patch does, I believe. Are you ok with that
patch as it is?

> > So perhaps you are saying to make a change lower in the code? I'm not
> > sure where it makes sense to change the target node in that case. I'd
> > appreciate any guidance you can give.
> 
> This not an easy thing to do. If the current slab is not the right node
> but would be the node from which the page allocator would be returning
> memory then the current slab can still be allocated from. If the fallback
> is to another node then the current cpu slab needs to be deactivated and
> the allocation from that node needs to proceeed. Have a look at
> fallback_alloc() in the slab allocator.
> 
> A allocation attempt from the page allocator can be restricted to a
> specific node through GFP_THIS_NODE.

Thanks for the pointers, I will try and take a look.

Thanks,
Nish

^ permalink raw reply

* arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Stephen N Chivers @ 2014-02-06  2:09 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Chris Proctor

I have a MPC8548e based board and an application that makes
extensive use of floating point including numerous calls to cos.
In the same program there is the use of an sqlite database.

The kernel is derived from 2.6.31 and is compiled with math emulation.

At some point after the reading of the SQLITE database, the
return result from cos goes from an in range value to an out
of range value.

This is as a result of the FP rounding mode mutating from "round to 
nearest"
to "round toward zero".

The cos in the glibc version being used is known to be sensitive to 
rounding
direction and Joseph Myers has previously fixed glibc.

The failure does not occur on a machine that has a hardware floating
point unit (a MPC7410 processor).

I have traced the mutation to the following series of instructions:

        mffs            f0
        mtfsb1          4*cr7+so
        mtfsb0          4*cr7+eq
        fadd            f13,f1,f2
        mtfsf           1, f0

The instructions are part of the stream emitted by gcc for the conversion
of a 128 bit floating point value into an integer in the sqlite database 
read.

Immediately before the execution of the mffs instruction the "rounding
mode" is "round to nearest".

On the MPC8548 board, the execution of the mtfsf instruction does not
restore the rounding mode to "round to nearest".

I believe that the mask computation in mtfsf.c is incorrect and is 
reversed.

In the latest version of the file (linux-3.14-rc1), the mask is computed 
by:

                 mask = 0;
                 if (FM & (1 << 0))
                        mask |= 0x90000000;
                 if (FM & (1 << 1))
                        mask |= 0x0f000000;
                 if (FM & (1 << 2))
                        mask |= 0x00f00000;
                 if (FM & (1 << 3))
                        mask |= 0x000f0000;
                 if (FM & (1 << 4))
                        mask |= 0x0000f000;
                 if (FM & (1 << 5))
                        mask |= 0x00000f00;
                 if (FM & (1 << 6))
                        mask |= 0x000000f0;
                 if (FM & (1 << 7))
                        mask |= 0x0000000f;

I think it should be:

                mask = 0;
                if (FM & (1 << 0))
                        mask |= 0x0000000f;
                if (FM & (1 << 1))
                        mask |= 0x000000f0;
                if (FM & (1 << 2))
                        mask |= 0x00000f00;
                if (FM & (1 << 3))
                        mask |= 0x0000f000;
                if (FM & (1 << 4))
                        mask |= 0x000f0000;
                if (FM & (1 << 5))
                        mask |= 0x00f00000;
                if (FM & (1 << 6))
                        mask |= 0x0f000000;
                if (FM & (1 << 7))
                        mask |= 0x90000000;

With the above mask computation I get consistent results for both the 
MPC8548
and MPC7410 boards.

Am I missing something subtle?

Stephen Chivers,
CSC Australia Pty. Ltd.

^ permalink raw reply

* [PATCH V3 1/3] time: Change the return type of clockevents_notify() to integer
From: Preeti U Murthy @ 2014-02-06  5:49 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat
In-Reply-To: <20140206054158.4595.17827.stgit@preeti>

The broadcast framework can potentially be made use of by archs which do not have an
external clock device as well. Then, it is required that one of the CPUs need
to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a
result its local timers should remain functional all the time. For such
a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock
device cannot be shutdown. To make way for this support, change the return
type of tick_broadcast_oneshot_control() and hence clockevents_notify() to
indicate such scenarios.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 include/linux/clockchips.h   |    6 +++---
 kernel/time/clockevents.c    |    8 +++++---
 kernel/time/tick-broadcast.c |    6 ++++--
 kernel/time/tick-internal.h  |    6 +++---
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..e0c5a6c 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
@@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {}
 static inline void clockevents_suspend(void) {}
 static inline void clockevents_resume(void) {}
 
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
 
 #endif
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..79b8685 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns 0 on success, any other value on error
  */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
@@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
 
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 43780ab..ddf2ac2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -633,14 +633,15 @@ again:
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
+ * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
  */
-void tick_broadcast_oneshot_control(unsigned long reason)
+int tick_broadcast_oneshot_control(unsigned long reason)
 {
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	unsigned long flags;
 	ktime_t now;
-	int cpu;
+	int cpu, ret = 0;
 
 	/*
 	 * Periodic mode does not care about the enter/exit of power
@@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 	}
 out:
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	return ret;
 }
 
 /*
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 8329669..f0dc03c 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
 extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
-extern void tick_broadcast_oneshot_control(unsigned long reason);
+extern int tick_broadcast_oneshot_control(unsigned long reason);
 extern void tick_broadcast_switch_to_oneshot(void);
 extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
 extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
@@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { return 0; }
 static inline void tick_broadcast_switch_to_oneshot(void) { }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_broadcast_oneshot_active(void) { return 0; }
@@ -87,7 +87,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { return 0; }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {

^ permalink raw reply related

* [PATCH V3 0/3] time/cpuidle: Support in tick broadcast framework in absence of external clock device
From: Preeti U Murthy @ 2014-02-06  5:49 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat

On some architectures, the local timers of CPUs stop in deep idle states.
They will need to depend on an external clock device to wake them up. However
certain implementations of archs do not have an external clock device.

This patchset provides support in the tick broadcast framework for such
architectures so as to enable the CPUs to get into deep idle.

Presently we are in need of this support on certain implementations of
PowerPC. This patchset has thus been tested on the same.

V1: https://lkml.org/lkml/2013/12/12/687.
V2: https://lkml.org/lkml/2014/1/24/28

Changes in V3:
1. Modified comments and code around programming of the broadcast hrtimer.

---

Preeti U Murthy (2):
      time: Change the return type of clockevents_notify() to integer
      time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set

Thomas Gleixner (1):
      tick/cpuidle: Initialize hrtimer mode of broadcast

 drivers/cpuidle/cpuidle.c            |   38 +++++++-----
 include/linux/clockchips.h           |   15 ++++-
 kernel/time/Makefile                 |    2 -
 kernel/time/clockevents.c            |    8 ++-
 kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
 kernel/time/tick-broadcast.c         |   51 ++++++++++++++++-
 kernel/time/tick-internal.h          |    6 +-
 7 files changed, 197 insertions(+), 28 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

-- 

^ permalink raw reply

* [PATCH V3 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
From: Preeti U Murthy @ 2014-02-06  5:50 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat
In-Reply-To: <20140206054158.4595.17827.stgit@preeti>

Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
local timers stop. The cpuidle_idle_call() currently handles such idle states
by calling into the broadcast framework so as to wakeup CPUs at their next
wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
into the broadcast frameowork can fail for archs that do not have an external
clock device to handle wakeups and the CPU in question has to thus be made
the stand by CPU. This patch handles such cases by failing the call into
cpuidle so that the arch can take some default action. The arch will certainly
not enter a similar idle state because a failed cpuidle call will also implicitly
indicate that the broadcast framework has not registered this CPU to be woken up.
Hence we are safe if we fail the cpuidle call.

In the process move the functions that trace idle statistics just before and
after the entry and exit into idle states respectively. In other
scenarios where the call to cpuidle fails, we end up not tracing idle
entry and exit since a decision on an idle state could not be taken. Similarly
when the call to broadcast framework fails, we skip tracing idle statistics
because we are in no further position to take a decision on an alternative
idle state to enter into.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/cpuidle.c |   38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..8f42033 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -117,15 +117,19 @@ int cpuidle_idle_call(void)
 {
 	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
 	struct cpuidle_driver *drv;
-	int next_state, entered_state;
-	bool broadcast;
+	int next_state, entered_state, ret = 0;
+	bool broadcast = false;

-	if (off || !initialized)
-		return -ENODEV;
+	if (off || !initialized) {
+		ret = -ENODEV;
+		goto out;
+	}

 	/* check if the device is ready */
-	if (!dev || !dev->enabled)
-		return -EBUSY;
+	if (!dev || !dev->enabled) {
+		ret = -EBUSY;
+		goto out;
+	}

 	drv = cpuidle_get_cpu_driver(dev);

@@ -137,15 +141,18 @@ int cpuidle_idle_call(void)
 		if (cpuidle_curr_governor->reflect)
 			cpuidle_curr_governor->reflect(dev, next_state);
 		local_irq_enable();
-		return 0;
+		goto out;
 	}

-	trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
 	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);

-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+	if (broadcast) {
+		ret = clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+		if (ret)
+			goto out;
+	}
+
+	trace_cpu_idle_rcuidle(next_state, dev->cpu);

 	if (cpuidle_state_is_coupled(dev, drv, next_state))
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
@@ -153,16 +160,17 @@ int cpuidle_idle_call(void)
 	else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);

-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
-
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);

 	/* give the governor an opportunity to reflect on the outcome */
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev, entered_state);

-	return 0;
+out:	if (broadcast)
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
+
+
+	return ret;
 }

 /**

^ permalink raw reply related

* [PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Preeti U Murthy @ 2014-02-06  5:50 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat
In-Reply-To: <20140206054158.4595.17827.stgit@preeti>

From: Thomas Gleixner <tglx@linutronix.de>

On some architectures, in certain CPU deep idle states the local timers stop.
An external clock device is used to wakeup these CPUs. The kernel support for the
wakeup of these CPUs is provided by the tick broadcast framework by using the
external clock device as the wakeup source.

However not all implementations of architectures provide such an external
clock device. This patch includes support in the broadcast framework to handle
the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer
on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states.

This patchset introduces a pseudo clock device which can be registered by the
archs as tick_broadcast_device in the absence of a real external clock
device. Once registered, the broadcast framework will work as is for these
architectures as long as the archs take care of the BROADCAST_ENTER
notification failing for one of the CPUs. This CPU is made the stand by CPU to
handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*.

The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the
stand by CPU dynamically moves around and so does the hrtimer which is queued
to trigger at the next earliest wakeup time. This is consistent with the case where
an external clock device is present. The smp affinity of this clock device is
set to the CPU with the earliest wakeup. This patchset handles the hotplug of
the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD
notification.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
[Added Changelog and code to handle reprogramming of hrtimer]
---

 include/linux/clockchips.h           |    9 +++
 kernel/time/Makefile                 |    2 -
 kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
 kernel/time/tick-broadcast.c         |   45 ++++++++++++++-
 4 files changed, 159 insertions(+), 2 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index e0c5a6c..dbe9e14 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -62,6 +62,11 @@ enum clock_event_mode {
 #define CLOCK_EVT_FEAT_DYNIRQ		0x000020
 #define CLOCK_EVT_FEAT_PERCPU		0x000040
 
+/*
+ * Clockevent device is based on a hrtimer for broadcast
+ */
+#define CLOCK_EVT_FEAT_HRTIMER		0x000080
+
 /**
  * struct clock_event_device - clock event device descriptor
  * @event_handler:	Assigned by the framework to be called by the low
@@ -83,6 +88,7 @@ enum clock_event_mode {
  * @name:		ptr to clock event name
  * @rating:		variable to rate clock event devices
  * @irq:		IRQ number (only for non CPU local devices)
+ * @bound_on:		Bound on CPU
  * @cpumask:		cpumask to indicate for which CPUs this device works
  * @list:		list head for the management code
  * @owner:		module reference
@@ -113,6 +119,7 @@ struct clock_event_device {
 	const char		*name;
 	int			rating;
 	int			irq;
+	int			bound_on;
 	const struct cpumask	*cpumask;
 	struct list_head	list;
 	struct module		*owner;
@@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
+extern void tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static void tick_setup_hrtimer_broadcast(void) {};
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 9250130..06151ef 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
new file mode 100644
index 0000000..af1e119
--- /dev/null
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -0,0 +1,105 @@
+/*
+ * linux/kernel/time/tick-broadcast-hrtimer.c
+ * This file emulates a local clock event device
+ * via a pseudo clock device.
+ */
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/percpu.h>
+#include <linux/profile.h>
+#include <linux/clockchips.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/module.h>
+
+#include "tick-internal.h"
+
+static struct hrtimer bctimer;
+
+static void bc_set_mode(enum clock_event_mode mode,
+			struct clock_event_device *bc)
+{
+	switch (mode) {
+	case CLOCK_EVT_MODE_SHUTDOWN:
+		/*
+		 * Note, we cannot cancel the timer here as we might
+		 * run into the following live lock scenario:
+		 *
+		 * cpu 0		cpu1
+		 * lock(broadcast_lock);
+		 *			hrtimer_interrupt()
+		 *			bc_handler()
+		 *			   tick_handle_oneshot_broadcast();
+		 *			    lock(broadcast_lock);
+		 * hrtimer_cancel()
+		 *  wait_for_callback()
+		 */
+		hrtimer_try_to_cancel(&bctimer);
+		break;
+	default:
+		break;
+	}
+}
+
+/*
+ * This is called from the guts of the broadcast code when the cpu
+ * which is about to enter idle has the earliest broadcast timer event.
+ */
+static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
+{
+	/*
+	 * We try to cancel the timer first. If the callback is on
+	 * flight on some other cpu then we let it handle it. If we
+	 * were able to cancel the timer nothing can rearm it as we
+	 * own broadcast_lock.
+	 *
+	 * However we can also be called from the event handler of
+	 * ce_broadcast_hrtimer itself when it expires. We cannot therefore
+	 * restart the timer since it is on flight on the same CPU. But
+	 * due to the same reason we can reset it.
+	 */
+	if (hrtimer_try_to_cancel(&bctimer) >= 0) {
+		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED);
+		/* Bind the "device" to the cpu */
+		bc->bound_on = smp_processor_id();
+	} else if (bc->bound_on == smp_processor_id()) {
+		hrtimer_set_expires(&bctimer, expires);
+	}
+	return 0;
+}
+
+static struct clock_event_device ce_broadcast_hrtimer = {
+	.set_mode		= bc_set_mode,
+	.set_next_ktime		= bc_set_next,
+	.features		= CLOCK_EVT_FEAT_ONESHOT |
+				  CLOCK_EVT_FEAT_KTIME |
+				  CLOCK_EVT_FEAT_HRTIMER,
+	.rating			= 0,
+	.bound_on		= -1,
+	.min_delta_ns		= 1,
+	.max_delta_ns		= KTIME_MAX,
+	.min_delta_ticks	= 1,
+	.max_delta_ticks	= KTIME_MAX,
+	.mult			= 1,
+	.shift			= 0,
+	.cpumask		= cpu_all_mask,
+};
+
+static enum hrtimer_restart bc_handler(struct hrtimer *t)
+{
+	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
+
+	if (ce_broadcast_hrtimer.next_event.tv64 == KTIME_MAX)
+		return HRTIMER_NORESTART;
+
+	return HRTIMER_RESTART;
+}
+
+void tick_setup_hrtimer_broadcast(void)
+{
+	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	bctimer.function = bc_handler;
+	clockevents_register_device(&ce_broadcast_hrtimer);
+}
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index ddf2ac2..1e46e62f 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -630,6 +630,42 @@ again:
 	raw_spin_unlock(&tick_broadcast_lock);
 }
 
+static int broadcast_needs_cpu(struct clock_event_device *bc, int cpu)
+{
+	if (!(bc->features & CLOCK_EVT_FEAT_HRTIMER))
+		return 0;
+	if (bc->next_event.tv64 == KTIME_MAX)
+		return 0;
+	return bc->bound_on == cpu ? -EBUSY : 0;
+}
+
+static void broadcast_shutdown_local(struct clock_event_device *bc,
+				     struct clock_event_device *dev)
+{
+	/*
+	 * For hrtimer based broadcasting we cannot shutdown the cpu
+	 * local device if our own event is the first one to expire or
+	 * if we own the broadcast timer.
+	 */
+	if (bc->features & CLOCK_EVT_FEAT_HRTIMER) {
+		if (broadcast_needs_cpu(bc, smp_processor_id()))
+			return;
+		if (dev->next_event.tv64 < bc->next_event.tv64)
+			return;
+	}
+	clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+}
+
+static void broadcast_move_bc(int deadcpu)
+{
+	struct clock_event_device *bc = tick_broadcast_device.evtdev;
+
+	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
+		return;
+	/* This moves the broadcast assignment to this cpu */
+	clockevents_program_event(bc, bc->next_event, 1);
+}
+
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
@@ -667,7 +703,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
 		if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			WARN_ON_ONCE(cpumask_test_cpu(cpu, tick_broadcast_pending_mask));
-			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+			broadcast_shutdown_local(bc, dev);
 			/*
 			 * We only reprogram the broadcast timer if we
 			 * did not mark ourself in the force mask and
@@ -680,6 +716,11 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 			    dev->next_event.tv64 < bc->next_event.tv64)
 				tick_broadcast_set_event(bc, cpu, dev->next_event, 1);
 		}
+		/*
+		 * If the current CPU owns the hrtimer broadcast
+		 * mechanism, it cannot go deep idle.
+		 */
+		ret = broadcast_needs_cpu(bc, cpu);
 	} else {
 		if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
@@ -853,6 +894,8 @@ void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
 	cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);
 	cpumask_clear_cpu(cpu, tick_broadcast_force_mask);
 
+	broadcast_move_bc(cpu);
+
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 

^ permalink raw reply related

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Joonsoo Kim @ 2014-02-06  8:04 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140206020757.GC5433@linux.vnet.ibm.com>

On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > 
> > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > the system showing the original problem, configured to have 15GB of
> > > memory.
> > > 
> > > With your patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:         8768192 kB
> > > Slab:            3882560 kB
> > > SReclaimable:     105408 kB
> > > SUnreclaim:      3777152 kB
> > > 
> > > With Anton's patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:        11195008 kB
> > > Slab:            1427968 kB
> > > SReclaimable:     109184 kB
> > > SUnreclaim:      1318784 kB
> > > 
> > > 
> > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > 
> > 
> > I don't think the goal of the discussion is to reduce the amount of slab 
> > allocated, but rather get the most local slab memory possible by use of 
> > kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> > which is probably cpu_to_node() for a cpu bound to a node without memory, 
> > my patch is allocating it on the most local node; Anton's patch is 
> > allocating it on whatever happened to be the cpu slab.
> > 
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -2278,10 +2278,14 @@ redo:
> > > > 
> > > >  	if (unlikely(!node_match(page, node))) {
> > > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > > -		deactivate_slab(s, page, c->freelist);
> > > > -		c->page = NULL;
> > > > -		c->freelist = NULL;
> > > > -		goto new_slab;
> > > > +		if (unlikely(!node_present_pages(node)))
> > > > +			node = numa_mem_id();
> > > > +		if (!node_match(page, node)) {
> > > > +			deactivate_slab(s, page, c->freelist);
> > > > +			c->page = NULL;
> > > > +			c->freelist = NULL;
> > > > +			goto new_slab;
> > > > +		}
> > > 
> > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > if we have a memoryless node, we expect the page's locality to be that
> > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > Just wanting to make sure I understand the intent.
> > > 
> > 
> > Yeah, the default policy should be to fallback to local memory if the node 
> > passed is memoryless.
> > 
> > > What I find odd is that there are only 2 nodes on this system, node 0
> > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > should be coming from node 1 (thus node_match() should always be true?)
> > > 
> > 
> > The nice thing about slub is its debugging ability, what is 
> > /sys/kernel/slab/cache/objects showing in comparison between the two 
> > patches?
> 
> Ok, I finally got around to writing a script that compares the objects
> output from both kernels.
> 
> log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> and Joonsoo's patch.
> 
> log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> and Anton's patch.
> 
> slab                           objects    objects   percent
>                                log1       log2      change
> -----------------------------------------------------------
> :t-0000104                     71190      85680      20.353982 %
> UDP                            4352       3392       22.058824 %
> inode_cache                    54302      41923      22.796582 %
> fscache_cookie_jar             3276       2457       25.000000 %
> :t-0000896                     438        292        33.333333 %
> :t-0000080                     310401     195323     37.073978 %
> ext4_inode_cache               335        201        40.000000 %
> :t-0000192                     89408      128898     44.168307 %
> :t-0000184                     151300     81880      45.882353 %
> :t-0000512                     49698      73648      48.191074 %
> :at-0000192                    242867     120948     50.199904 %
> xfs_inode                      34350      15221      55.688501 %
> :t-0016384                     11005      17257      56.810541 %
> proc_inode_cache               103868     34717      66.575846 %
> tw_sock_TCP                    768        256        66.666667 %
> :t-0004096                     15240      25672      68.451444 %
> nfs_inode_cache                1008       315        68.750000 %
> :t-0001024                     14528      24720      70.154185 %
> :t-0032768                     655        1312       100.305344%
> :t-0002048                     14242      30720      115.700042%
> :t-0000640                     1020       2550       150.000000%
> :t-0008192                     10005      27905      178.910545%
> 
> FWIW, the configuration of this LPAR has slightly changed. It is now configured
> for maximally 400 CPUs, of which 200 are present. The result is that even with
> Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> script reports:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-512                        1182 MB    2.03%  100.00%
> kmalloc-192                        1182 MB    1.38%  100.00%
> kmalloc-16384                       966 MB   17.66%  100.00%
> kmalloc-4096                        353 MB   15.92%  100.00%
> kmalloc-8192                        259 MB   27.28%  100.00%
> kmalloc-32768                       207 MB    9.86%  100.00%
> 
> In comparison (log2 above):
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       273 MB   98.76%  100.00%
> kmalloc-8192                        225 MB   98.67%  100.00%
> pgtable-2^11                        114 MB  100.00%  100.00%
> pgtable-2^12                        109 MB  100.00%  100.00%
> kmalloc-4096                        104 MB   98.59%  100.00%
> 
> I appreciate all the help so far, if anyone has any ideas how best to
> proceed further, or what they'd like debugged more, I'm happy to get
> this fixed. We're hitting this on a couple of different systems and I'd
> like to find a good resolution to the problem.

Hello,

I have no memoryless system, so, to debug it, I need your help. :)
First, please let me know node information on your system.

I'm preparing 3 another patches which are nearly same with previous patch,
but slightly different approach. Could you test them on your system?
I will send them soon.

And I think that same problem exists if CONFIG_SLAB is enabled. Could you
confirm that?

And, could you confirm that your system's numa_mem_id() is properly set?
And, could you confirm that node_present_pages() test works properly?
And, with my patches, could you give me more information on slub stat?
For this, you need to enable CONFIG_SLUB_STATS. Then please send me all the
slub stat on /proc/sys/kernel/debug/slab.

Sorry for too many request.
If it bothers you too much, please ignore it :)

Thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox