LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-11  7:42 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140210191321.GD1558@linux.vnet.ibm.com>

On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> Hi Christoph,
> 
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> > 
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if its there.
> > 
> > If there is no current slab then a regular allocation occurs with the
> > memoryless node. The page allocator will fallback to a possible node and
> > that will become the current slab. Next alloc from a memoryless node
> > will then use that slab.
> > 
> > For that we also add some tracking of allocations on nodes that were not
> > satisfied using the empty_node[] array. A successful alloc on a node
> > clears that flag.
> > 
> > I would rather avoid the empty_node[] array since its global and there may
> > be thread specific allocation restrictions but it would be expensive to do
> > an allocation attempt via the page allocator to make sure that there is
> > really no page available from the page allocator.
> 
> With this patch on our test system (I pulled out the numa_mem_id()
> change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> patch 1):
> 
> MemTotal:        8264704 kB
> MemFree:         5924608 kB
> ...
> Slab:            1402496 kB
> SReclaimable:     102848 kB
> SUnreclaim:      1299648 kB
> 
> And Anton's slabusage reports:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       207 MB   98.60%  100.00%
> task_struct                         134 MB   97.82%  100.00%
> kmalloc-8192                        117 MB  100.00%  100.00%
> pgtable-2^12                        111 MB  100.00%  100.00%
> pgtable-2^10                        104 MB  100.00%  100.00%
> 
> For comparison, Anton's patch applied at the same point in the series:
> 
> meminfo:
> 
> MemTotal:        8264704 kB
> MemFree:         4150464 kB
> ...
> Slab:            1590336 kB
> SReclaimable:     208768 kB
> SUnreclaim:      1381568 kB
> 
> slabusage:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       227 MB   98.63%  100.00%
> kmalloc-8192                        130 MB  100.00%  100.00%
> task_struct                         129 MB   97.73%  100.00%
> pgtable-2^12                        112 MB  100.00%  100.00%
> pgtable-2^10                        106 MB  100.00%  100.00%
> 
> 
> Consider this patch:
> 
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

Hello,

I still think that there is another problem.
Your report about CONFIG_SLAB said that SLAB uses just 200MB.
Below is your previous report.

  Ok, with your patches applied and CONFIG_SLAB enabled:

  MemTotal:        8264640 kB
  MemFree:         7119680 kB
  Slab:             207232 kB
  SReclaimable:      32896 kB
  SUnreclaim:       174336 kB

The number on CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
There is large difference on slab usage.

And, I should note that number of active objects on slabinfo can be wrong
on some situation, since it doesn't consider cpu slab (and cpu partial slab).

I recommend to confirm page_to_nid() and other things as I mentioned earlier.

Thanks.

^ permalink raw reply

* [PATCH v1 1/2] powernv: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan @ 2014-02-11  7:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Anton Blanchard
  Cc: Preeti U Murthy, linuxppc-dev, Srivatsa S. Bhat
In-Reply-To: <20140211065757.21159.49689.stgit@drishya>

Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
---
 arch/powerpc/include/asm/reg.h    |    4 +
 drivers/cpufreq/Kconfig.powerpc   |    9 +
 drivers/cpufreq/Makefile          |    1 
 drivers/cpufreq/powernv-cpufreq.c |  275 +++++++++++++++++++++++++++++++++++++
 4 files changed, 289 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1	0x13B	/* Hypervisor Save/Restore 1 */
 #define SPRN_IC		0x350	/* Virtual Instruction Count */
 #define SPRN_VTB	0x351	/* Virtual Time Base */
+#define SPRN_PMICR	0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR	0x355   /* Power Management Status Reg */
+#define SPRN_PMCR	0x374	/* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG	8	/* Enable Target Address Register */
 #define FSCR_EBB_LG	7	/* Enable Event Based Branching */
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..4a91ab1 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,12 @@ config PPC_PASEMI_CPUFREQ
 	help
 	  This adds the support for frequency switching on PA Semi
 	  PWRficient processors.
+
+config POWERNV_CPUFREQ
+       tristate "CPU frequency scaling for IBM POWERNV platform"
+       depends on PPC_POWERNV
+       select CPU_FREQ_TABLE
+       default y
+       help
+	 This adds support for CPU frequency switching on IBM POWERNV
+	 platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)		+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)		+= pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)	+= pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)		+= powernv-cpufreq.o
 
 ##################################################################################
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 0000000..ea3b630
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,275 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Author: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define pr_fmt(fmt)	"powernv-cpufreq: " fmt
+
+#include <linux/module.h>
+#include <linux/cpufreq.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+
+/* FIXME: Make this per-core */
+static DEFINE_MUTEX(freq_switch_mutex);
+
+#define POWERNV_MAX_PSTATES	256
+
+static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+
+/*
+ * Initialize the freq table based on data obtained
+ * from the firmware passed via device-tree
+ */
+
+static int init_powernv_pstates(void)
+{
+	struct device_node *power_mgt;
+	int nr_pstates = 0;
+	int pstate_min, pstate_max, pstate_nominal;
+	const __be32 *pstate_ids, *pstate_freqs;
+	int i;
+	u32 len_ids, len_freqs;
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("power-mgt node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-min", &pstate_min)) {
+		pr_warn("ibm,pstate-min node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-max", &pstate_max)) {
+		pr_warn("ibm,pstate-max node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-nominal",
+				 &pstate_nominal)) {
+		pr_warn("ibm,pstate-nominal not found\n");
+		return -ENODEV;
+	}
+	pr_info("cpufreq pstate min %d nominal %d max %d\n", pstate_min,
+		pstate_nominal, pstate_max);
+
+	pstate_ids = of_get_property(power_mgt, "ibm,pstate-ids", &len_ids);
+	if (!pstate_ids) {
+		pr_warn("ibm,pstate-ids not found\n");
+		return -ENODEV;
+	}
+
+	pstate_freqs = of_get_property(power_mgt, "ibm,pstate-frequencies-mhz",
+				      &len_freqs);
+	if (!pstate_freqs) {
+		pr_warn("ibm,pstate-frequencies-mhz not found\n");
+		return -ENODEV;
+	}
+
+	WARN_ON(len_ids != len_freqs);
+	nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
+	WARN_ON(!nr_pstates);
+
+	pr_debug("NR PStates %d\n", nr_pstates);
+	for (i = 0; i < nr_pstates; i++) {
+		u32 id = be32_to_cpu(pstate_ids[i]);
+		u32 freq = be32_to_cpu(pstate_freqs[i]);
+
+		pr_debug("PState id %d freq %d MHz\n", id, freq);
+		powernv_freqs[i].driver_data = id;
+		powernv_freqs[i].frequency = freq * 1000; /* kHz */
+	}
+	/* End of list marker entry */
+	powernv_freqs[i].driver_data = 0;
+	powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
+
+	return 0;
+}
+
+static struct freq_attr *powernv_cpu_freq_attr[] = {
+	&cpufreq_freq_attr_scaling_available_freqs,
+	NULL,
+};
+
+/* Helper routines */
+
+/* Access helpers to power mgt SPR */
+
+static inline unsigned long get_pmspr(unsigned long sprn)
+{
+	switch (sprn) {
+	case SPRN_PMCR:
+		return mfspr(SPRN_PMCR);
+
+	case SPRN_PMICR:
+		return mfspr(SPRN_PMICR);
+
+	case SPRN_PMSR:
+		return mfspr(SPRN_PMSR);
+	}
+	BUG();
+}
+
+static inline void set_pmspr(unsigned long sprn, unsigned long val)
+{
+	switch (sprn) {
+	case SPRN_PMCR:
+		mtspr(SPRN_PMCR, val);
+		return;
+
+	case SPRN_PMICR:
+		mtspr(SPRN_PMICR, val);
+		return;
+
+	case SPRN_PMSR:
+		mtspr(SPRN_PMSR, val);
+		return;
+	}
+	BUG();
+}
+
+static void set_pstate(void *pstate)
+{
+	unsigned long val;
+	unsigned long pstate_ul = *(unsigned long *) pstate;
+
+	val = get_pmspr(SPRN_PMCR);
+	val = val & 0x0000ffffffffffffULL;
+	/* Set both local and global PStates */
+	val = val | (pstate_ul << 56) | (pstate_ul << 48);
+	pr_debug("Setting cpu %d pmcr to %016lX\n", smp_processor_id(), val);
+	set_pmspr(SPRN_PMCR, val);
+}
+
+static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index)
+{
+	unsigned long val = (unsigned long)powernv_freqs[new_index].driver_data;
+
+	/*
+	 * Use smp_call_function to send IPI and execute the
+	 * mtspr on target cpu.  We could do that without IPI
+	 * if current CPU is within policy->cpus (core)
+	 */
+
+	val = val & 0xFF;
+	smp_call_function_any(cpus, set_pstate, &val, 1);
+	return 0;
+}
+
+static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+	int base, i;
+
+#ifdef CONFIG_SMP
+	base = cpu_first_thread_sibling(policy->cpu);
+
+	for (i = 0; i < threads_per_core; i++)
+		cpumask_set_cpu(base + i, policy->cpus);
+#endif
+	policy->cpuinfo.transition_latency = 25000;
+
+	/* Print frequency table */
+	for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
+		pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
+
+	policy->cur = powernv_freqs[0].frequency;
+	cpufreq_frequency_table_get_attr(powernv_freqs, policy->cpu);
+	return cpufreq_frequency_table_cpuinfo(policy, powernv_freqs);
+}
+
+static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+	cpufreq_frequency_table_put_attr(policy->cpu);
+	return 0;
+}
+
+static int powernv_cpufreq_verify(struct cpufreq_policy *policy)
+{
+	return cpufreq_frequency_table_verify(policy, powernv_freqs);
+}
+
+static int powernv_cpufreq_target(struct cpufreq_policy *policy,
+			      unsigned int target_freq,
+			      unsigned int relation)
+{
+	int rc;
+	struct cpufreq_freqs freqs;
+	unsigned int new_index;
+
+	cpufreq_frequency_table_target(policy, powernv_freqs, target_freq,
+				       relation, &new_index);
+
+	freqs.old = policy->cur;
+	freqs.new = powernv_freqs[new_index].frequency;
+	freqs.cpu = policy->cpu;
+
+	mutex_lock(&freq_switch_mutex);
+	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
+
+	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
+		 policy->cpu,
+		 powernv_freqs[new_index].frequency,
+		 new_index,
+		 powernv_freqs[new_index].driver_data);
+
+	rc = powernv_set_freq(policy->cpus, new_index);
+
+	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
+	mutex_unlock(&freq_switch_mutex);
+
+	return rc;
+}
+
+static struct cpufreq_driver powernv_cpufreq_driver = {
+	.verify		= powernv_cpufreq_verify,
+	.target		= powernv_cpufreq_target,
+	.init		= powernv_cpufreq_cpu_init,
+	.exit		= powernv_cpufreq_cpu_exit,
+	.name		= "powernv-cpufreq",
+	.flags		= CPUFREQ_CONST_LOOPS,
+	.attr		= powernv_cpu_freq_attr,
+};
+
+static int __init powernv_cpufreq_init(void)
+{
+	int rc = 0;
+
+	/* Discover pstates from device tree and init */
+
+	rc = init_powernv_pstates();
+
+	if (rc) {
+		pr_info("powernv-cpufreq disabled\n");
+		return rc;
+	}
+
+	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
+	return rc;
+}
+
+static void __exit powernv_cpufreq_exit(void)
+{
+	cpufreq_unregister_driver(&powernv_cpufreq_driver);
+}
+
+module_init(powernv_cpufreq_init);
+module_exit(powernv_cpufreq_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>");

^ permalink raw reply related

* [PATCH v1 2/2] powernv, cpufreq: Add per-core locking to serialize frequency transitions
From: Vaidyanathan Srinivasan @ 2014-02-11  7:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Anton Blanchard
  Cc: Preeti U Murthy, linuxppc-dev, Srivatsa S. Bhat
In-Reply-To: <20140211065757.21159.49689.stgit@drishya>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

On POWER systems, the CPU frequency is controlled at a core-level and
hence we need to serialize so that only one of the threads in the core
switches the core's frequency at a time.

Using a global mutex lock would needlessly serialize _all_ frequency
transitions in the system (across all cores). So introduce per-core
locking to enable finer-grained synchronization and thereby enhance
the speed and responsiveness of the cpufreq driver to varying workload
demands.

The design of per-core locking is very simple and straight-forward: we
first define a Per-CPU lock and use the ones that belongs to the first
thread sibling of the core.

cpu_first_thread_sibling() macro is used to find the *common* lock for
all thread siblings belonging to a core.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
 drivers/cpufreq/powernv-cpufreq.c |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index ea3b630..8240e90 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include <linux/of.h>
 #include <asm/cputhreads.h>

-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-Core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)				\
+			mutex_lock(&per_cpu(freq_switch_lock,\
+				cpu_first_thread_sibling(cpu)));
+#define unlock_core_freq(cpu)				\
+			mutex_unlock(&per_cpu(freq_switch_lock,\
+				cpu_first_thread_sibling(cpu)));

 #define POWERNV_MAX_PSTATES	256

@@ -219,7 +226,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	freqs.new = powernv_freqs[new_index].frequency;
 	freqs.cpu = policy->cpu;

-	mutex_lock(&freq_switch_mutex);
+	lock_core_freq(policy->cpu);
 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);

 	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -231,7 +238,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	rc = powernv_set_freq(policy->cpus, new_index);

 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-	mutex_unlock(&freq_switch_mutex);
+	unlock_core_freq(policy->cpu);

 	return rc;
 }
@@ -248,7 +255,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {

 static int __init powernv_cpufreq_init(void)
 {
-	int rc = 0;
+	int cpu, rc = 0;

 	/* Discover pstates from device tree and init */

@@ -258,6 +265,10 @@ static int __init powernv_cpufreq_init(void)
 		pr_info("powernv-cpufreq disabled\n");
 		return rc;
 	}
+	/* Init per-core mutex */
+	for_each_possible_cpu(cpu) {
+		mutex_init(&per_cpu(freq_switch_lock, cpu));
+	}

 	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
 	return rc;

^ permalink raw reply related

* [PATCH v1 0/2] powernv: cpufreq support for IBM POWERNV platform
From: Vaidyanathan Srinivasan @ 2014-02-11  7:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Anton Blanchard
  Cc: Preeti U Murthy, linuxppc-dev, Srivatsa S. Bhat

Hi,

The following patch series implements the platform driver to support
dynamic CPU frequency scaling on IBM POWERNV platforms.

This patch series is based on Linux kernel 3.14-rc2 and tested on
OPAL v3 based IBM POWERNV platform and IBM POWER8 processor.

--Vaidy

---

Srivatsa S. Bhat (1):
      powernv, cpufreq: Add per-core locking to serialize frequency transitions

Vaidyanathan Srinivasan (1):
      powernv: cpufreq driver for powernv platform

 arch/powerpc/include/asm/reg.h    |    4 +
 drivers/cpufreq/Kconfig.powerpc   |    9 +
 drivers/cpufreq/Makefile          |    1 
 drivers/cpufreq/powernv-cpufreq.c |  286 +++++++++++++++++++++++++++++++++++++
 4 files changed, 300 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 

^ permalink raw reply

* [PATCH] powerpc/spufs: Remove MAX_USER_PRIO define
From: Jeremy Kerr @ 2014-02-11  6:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Dongsheng Yang, linuxppc-dev, linux-kernel

Current ppc64_defconfig fails with:

 arch/powerpc/platforms/cell/spufs/sched.c:86:0: error: "MAX_USER_PRIO" redefined [-Werror]
 cc1: all warnings being treated as errors

6b6350f1 introduced a generic MAX_USER_PRIO macro to sched/prio.h, which
is causing the conflit. Use that one instead of our own.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
Ingo: 6b6350f1 is currently in tip; this fixes a build breakage for spufs

---
 arch/powerpc/platforms/cell/spufs/sched.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 4931838..4a0a64f 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -83,7 +83,6 @@ static struct timer_list spuloadavg_timer;
 #define MIN_SPU_TIMESLICE	max(5 * HZ / (1000 * SPUSCHED_TICK), 1)
 #define DEF_SPU_TIMESLICE	(100 * HZ / (1000 * SPUSCHED_TICK))
 
-#define MAX_USER_PRIO		(MAX_PRIO - MAX_RT_PRIO)
 #define SCALE_PRIO(x, prio) \
 	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
 

^ permalink raw reply related

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-11  3:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Torsten Duwe,
	Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140211025645.GJ18016@ZenIV.linux.org.uk>

On Tue, 2014-02-11 at 02:56 +0000, Al Viro wrote:
> > So the question is, is it reasonable to have the ref smaller than
> > 32-bit...
> 
> Every time you open a file, you bump dentry refcount.  Something like
> libc or ld.so will be opened on just about every execve(), so I'd say
> that 16 bits is far too low.  If nothing else, 32 bits might be too
> low on 64bit boxen...

So back to square 1 ... we can't implement together lockref, ticket
locks, and our lock confer mechanism within 64-bit.

I see two options at this stage. Both require a custom implementation
of lockref for powerpc, so some ifdef's such that we can replace the
generic implementation completely.

 - We can use a small ref, and when it's too big, overflow into a larger
one, falling back to the "old style" lock + ref (an overflow bit or a
compare with ffff)

 - We can have lockref "build" it's own lock out of the ticketpair and
ref, keeping the owner in a separate word. The owner doesn't strictly
need to be atomic.

Both are gross though :(

Anybody has a better idea ?

Ben.

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Al Viro @ 2014-02-11  2:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Torsten Duwe,
	Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <1392086660.3996.50.camel@pasglop>

On Tue, Feb 11, 2014 at 01:44:20PM +1100, Benjamin Herrenschmidt wrote:
> That leaves us with 32 bits to put the ref and the owner. The question
> is how big the ref really has to be and can we have a reasonable failure
> mode if it overflows ?
> 
> If we limit ourselves to, for example, 16-bit for the ref in lockref,
> then we can have the second 32-bit split between the owner and the ref.
> 
> If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...
> 
> So the question is, is it reasonable to have the ref smaller than
> 32-bit...

Every time you open a file, you bump dentry refcount.  Something like
libc or ld.so will be opened on just about every execve(), so I'd say
that 16 bits is far too low.  If nothing else, 32 bits might be too
low on 64bit boxen...

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-11  2:44 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar, Al Viro
In-Reply-To: <20140210155217.GF2107@lst.de>

(Linus, Al, a question for you down there about lockref "ref" size)

On Mon, 2014-02-10 at 16:52 +0100, Torsten Duwe wrote:

> What if I squeeze the bits a little?
> 4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
> At the cost that unlock may become an ll/sc operation again.
> I could think about a trick against that.
> But alas, hw_cpu_id is 16 bit, which makes a lookup table neccessary :-/
> 
> Doing another round of yields for lockrefs now doesn't
> sound so bad any more.

So, the ticketpair has to be 16:16 so we can avoid the atomic on unlock

That leaves us with 32 bits to put the ref and the owner. The question
is how big the ref really has to be and can we have a reasonable failure
mode if it overflows ?

If we limit ourselves to, for example, 16-bit for the ref in lockref,
then we can have the second 32-bit split between the owner and the ref.

If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...

So the question is, is it reasonable to have the ref smaller than
32-bit...

Cheers,
Ben.

^ permalink raw reply

* [PATCH] powerpc/powernv: Add iommu DMA bypass support for IODA2
From: Benjamin Herrenschmidt @ 2014-02-11  0:38 UTC (permalink / raw)
  To: linuxppc-dev list

this patch adds the support for to create a direct iommu "bypass"
window on IODA2 bridges (such as Power8) allowing to bypass iommu
page translation completely for 64-bit DMA capable devices, thus
significantly improving DMA performances.

Additionally, this adds a hook to the struct iommu_table so that
the IOMMU API / VFIO can disable the bypass when external ownership
is requested, since in that case, the device will be used by an
environment such as userspace or a KVM guest which must not be
allowed to bypass translations.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

This is pretty much identical to the old version I posted a while ago,
except that it does have the iommu calls to enable/disable which I had
forgotten to git add before posting the previous one.

 arch/powerpc/include/asm/dma-mapping.h    |  1 +
 arch/powerpc/include/asm/iommu.h          |  1 +
 arch/powerpc/kernel/dma.c                 | 10 ++--
 arch/powerpc/kernel/iommu.c               | 12 +++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 84 +++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci.c      | 10 ++++
 arch/powerpc/platforms/powernv/pci.h      |  6 ++-
 arch/powerpc/platforms/powernv/powernv.h  |  4 ++
 arch/powerpc/platforms/powernv/setup.c    |  9 ++++
 9 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index e27e9ad..150866b 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -134,6 +134,7 @@ static inline int dma_supported(struct device *dev, u64 mask)
 }
 
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
+extern int __dma_set_mask(struct device *dev, u64 dma_mask);
 
 #define dma_alloc_coherent(d,s,h,f)	dma_alloc_attrs(d,s,h,f,NULL)
 
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f7a8036..42632c7 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,6 +77,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 8032b97..ee78f6e 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -191,12 +191,10 @@ EXPORT_SYMBOL(dma_direct_ops);
 
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 
-int dma_set_mask(struct device *dev, u64 dma_mask)
+int __dma_set_mask(struct device *dev, u64 dma_mask)
 {
 	struct dma_map_ops *dma_ops = get_dma_ops(dev);
 
-	if (ppc_md.dma_set_mask)
-		return ppc_md.dma_set_mask(dev, dma_mask);
 	if ((dma_ops != NULL) && (dma_ops->set_dma_mask != NULL))
 		return dma_ops->set_dma_mask(dev, dma_mask);
 	if (!dev->dma_mask || !dma_supported(dev, dma_mask))
@@ -204,6 +202,12 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 	*dev->dma_mask = dma_mask;
 	return 0;
 }
+int dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (ppc_md.dma_set_mask)
+		return ppc_md.dma_set_mask(dev, dma_mask);
+	return __dma_set_mask(dev, dma_mask);
+}
 EXPORT_SYMBOL(dma_set_mask);
 
 u64 dma_get_required_mask(struct device *dev)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d773dd4..88e3ec6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,6 +1088,14 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	memset(tbl->it_map, 0xff, sz);
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
+	/*
+	 * Disable iommu bypass, otherwise the user can DMA to all of
+	 * our physical memory via the bypass window instead of just
+	 * the pages that has been explicitly mapped into the iommu
+	 */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, false);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1102,6 +1110,10 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	/* The kernel owns the device now, we can restore the iommu bypass */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7d6dcc6..3b2b4fb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -21,6 +21,7 @@
 #include <linux/irq.h>
 #include <linux/io.h>
 #include <linux/msi.h>
+#include <linux/memblock.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -460,9 +461,39 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 		return;
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
+	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
 	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
 }
 
+static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
+				     struct pci_dev *pdev, u64 dma_mask)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	struct pnv_ioda_pe *pe;
+	uint64_t top;
+	bool bypass = false;
+
+	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+		return -ENODEV;;
+
+	pe = &phb->ioda.pe_array[pdn->pe_number];
+	if (pe->tce_bypass_enabled) {
+		top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
+		bypass = (dma_mask >= top);
+	}
+
+	if (bypass) {
+		dev_info(&pdev->dev, "Using 64-bit DMA iommu bypass\n");
+		set_dma_ops(&pdev->dev, &dma_direct_ops);
+		set_dma_offset(&pdev->dev, pe->tce_bypass_base);
+	} else {
+		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
+		set_dma_ops(&pdev->dev, &dma_iommu_ops);
+		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+	}
+	return 0;
+}
+
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
 	struct pci_dev *dev;
@@ -657,6 +688,56 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
+static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+{
+	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
+					      tce32_table);
+	uint16_t window_id = (pe->pe_number << 1 ) + 1;
+	int64_t rc;
+
+	pe_info(pe, "%sabling 64-bit DMA bypass\n", enable ? "En" : "Dis");
+	if (enable) {
+		phys_addr_t top = memblock_end_of_DRAM();
+
+		top = roundup_pow_of_two(top);
+		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+						     pe->pe_number,
+						     window_id,
+						     pe->tce_bypass_base,
+						     top);
+	} else {
+		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+						     pe->pe_number,
+						     window_id,
+						     pe->tce_bypass_base,
+						     0);
+
+		/*
+		 * We might want to reset the DMA ops of all devices on
+		 * this PE. However in theory, that shouldn't be necessary
+		 * as this is used for VFIO/KVM pass-through and the device
+		 * hasn't yet been returned to its kernel driver
+		 */
+	}
+	if (rc)
+		pe_err(pe, "OPAL error %lld configuring bypass window\n", rc);
+	else
+		pe->tce_bypass_enabled = enable;
+}
+
+static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
+					  struct pnv_ioda_pe *pe)
+{
+	/* TVE #1 is selected by PCI address bit 59 */
+	pe->tce_bypass_base = 1ull << 59;
+
+	/* Install set_bypass callback for VFIO */
+	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+
+	/* Enable bypass by default */
+	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+}
+
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
@@ -727,6 +808,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	else
 		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
+	/* Also create a bypass window */
+	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
 	return;
 fail:
 	if (pe->tce32_seg >= 0)
@@ -1286,6 +1369,7 @@ void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	/* Setup TCEs */
 	phb->dma_dev_setup = pnv_pci_ioda_dma_dev_setup;
+	phb->dma_set_mask = pnv_pci_ioda_dma_set_mask;
 
 	/* Setup shutdown function for kexec */
 	phb->shutdown = pnv_pci_ioda_shutdown;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b555ebc..95633d7 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -634,6 +634,16 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 		pnv_pci_dma_fallback_setup(hose, pdev);
 }
 
+int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
+{
+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+	struct pnv_phb *phb = hose->private_data;
+
+	if (phb && phb->dma_set_mask)
+		return phb->dma_set_mask(phb, pdev, dma_mask);
+	return __dma_set_mask(&pdev->dev, dma_mask);
+}
+
 void pnv_pci_shutdown(void)
 {
 	struct pci_controller *hose;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 13f1942..cde1694 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -54,7 +54,9 @@ struct pnv_ioda_pe {
 	struct iommu_table	tce32_table;
 	phys_addr_t		tce_inval_reg_phys;
 
-	/* XXX TODO: Add support for additional 64-bit iommus */
+	/* 64-bit TCE bypass region */
+	bool			tce_bypass_enabled;
+	uint64_t		tce_bypass_base;
 
 	/* MSIs. MVE index is identical for for 32 and 64 bit MSI
 	 * and -1 if not supported. (It's actually identical to the
@@ -113,6 +115,8 @@ struct pnv_phb {
 			 unsigned int hwirq, unsigned int virq,
 			 unsigned int is_64, struct msi_msg *msg);
 	void (*dma_dev_setup)(struct pnv_phb *phb, struct pci_dev *pdev);
+	int (*dma_set_mask)(struct pnv_phb *phb, struct pci_dev *pdev,
+			    u64 dma_mask);
 	void (*fixup_phb)(struct pci_controller *hose);
 	u32 (*bdfn_to_pe)(struct pnv_phb *phb, struct pci_bus *bus, u32 devfn);
 	void (*shutdown)(struct pnv_phb *phb);
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index de6819b..213887a 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -7,12 +7,16 @@ extern void pnv_smp_init(void);
 static inline void pnv_smp_init(void) { }
 #endif
 
+struct pci_dev;
+
 #ifdef CONFIG_PCI
 extern void pnv_pci_init(void);
 extern void pnv_pci_shutdown(void);
+extern int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask);
 #else
 static inline void pnv_pci_init(void) { }
 static inline void pnv_pci_shutdown(void) { }
+static inline int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask) { }
 #endif
 
 extern void pnv_lpc_init(void);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 21166f6..110f4fb 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -27,6 +27,7 @@
 #include <linux/interrupt.h>
 #include <linux/bug.h>
 #include <linux/cpuidle.h>
+#include <linux/pci.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
@@ -141,6 +142,13 @@ static void pnv_progress(char *s, unsigned short hex)
 {
 }
 
+static int pnv_dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (dev_is_pci(dev))
+		return pnv_pci_dma_set_mask(to_pci_dev(dev), dma_mask);
+	return __dma_set_mask(dev, dma_mask);
+}
+
 static void pnv_shutdown(void)
 {
 	/* Let the PCI code clear up IODA tables */
@@ -238,6 +246,7 @@ define_machine(powernv) {
 	.machine_shutdown	= pnv_shutdown,
 	.power_save             = powernv_idle,
 	.calibrate_decr		= generic_calibrate_decr,
+	.dma_set_mask		= pnv_dma_set_mask,
 #ifdef CONFIG_KEXEC
 	.kexec_cpu_down		= pnv_kexec_cpu_down,
 #endif

^ permalink raw reply related

* Re: [PATCH] Handle vmalloc addresses
From: Benjamin Herrenschmidt @ 2014-02-11  0:19 UTC (permalink / raw)
  To: Nathan Fontenot; +Cc: Rong Song Shen, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <52E92DB0.1050902@linux.vnet.ibm.com>

Hi Nathan !

Please do a better submission :-)

Your subject is to be honest, crap. Something like

[PATCH] crypto/nx/nx-842: Fix handling of vmalloc addresses

Would have been much more informative.

Additionally, this is a patch for drivers/crypto, and while that driver
is powerpc-specific meaning I *could* take that patch, it should at
least be CCed to the crypto list/maintainer since that would
be the normal path for such a patch to be applied.

I'm taking it this time around but I know you can do better !

Cheers,
Ben.

On Wed, 2014-01-29 at 10:34 -0600, Nathan Fontenot wrote:
> The nx-842 compression driver does not currently handle getting
> a physical address for vmalloc addresses. The current driver
> uses __pa() for all addresses which does not properly handle
> vmalloc addresses and thus causes a failure since we do not pass
> a proper physical address to phyp.
> 
> This patch adds a routine to convert an address to a physical
> address by checking for vmalloc addresses and handling them properly.
> 
> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
>  ---
>  drivers/crypto/nx/nx-842.c |   29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> Index: linux/drivers/crypto/nx/nx-842.c
> ===================================================================
> --- linux.orig/drivers/crypto/nx/nx-842.c	2014-01-22 08:52:55.000000000 -0600
> +++ linux/drivers/crypto/nx/nx-842.c	2014-01-29 08:25:33.000000000 -0600
> @@ -158,6 +158,15 @@
>  	return sl->entry_nr * sizeof(struct nx842_slentry);
>  }
>  
> +static inline unsigned long nx842_get_pa(void *addr)
> +{
> +	if (is_vmalloc_addr(addr))
> +		return page_to_phys(vmalloc_to_page(addr))
> +		       + offset_in_page(addr);
> +	else
> +		return __pa(addr);
> +}
> +
>  static int nx842_build_scatterlist(unsigned long buf, int len,
>  			struct nx842_scatterlist *sl)
>  {
> @@ -168,7 +177,7 @@
>  
>  	entry = sl->entries;
>  	while (len) {
> -		entry->ptr = __pa(buf);
> +		entry->ptr = nx842_get_pa((void *)buf);
>  		nextpage = ALIGN(buf + 1, NX842_HW_PAGE_SIZE);
>  		if (nextpage < buf + len) {
>  			/* we aren't at the end yet */
> @@ -370,8 +379,8 @@
>  	op.flags = NX842_OP_COMPRESS;
>  	csbcpb = &workmem->csbcpb;
>  	memset(csbcpb, 0, sizeof(*csbcpb));
> -	op.csbcpb = __pa(csbcpb);
> -	op.out = __pa(slout.entries);
> +	op.csbcpb = nx842_get_pa(csbcpb);
> +	op.out = nx842_get_pa(slout.entries);
>  
>  	for (i = 0; i < hdr->blocks_nr; i++) {
>  		/*
> @@ -401,13 +410,13 @@
>  		 */
>  		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
>  			/* Create direct DDE */
> -			op.in = __pa(inbuf);
> +			op.in = nx842_get_pa((void *)inbuf);
>  			op.inlen = max_sync_size;
>  
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(inbuf, max_sync_size, &slin);
> -			op.in = __pa(slin.entries);
> +			op.in = nx842_get_pa(slin.entries);
>  			op.inlen = -nx842_get_scatterlist_size(&slin);
>  		}
>  
> @@ -565,7 +574,7 @@
>  	op.flags = NX842_OP_DECOMPRESS;
>  	csbcpb = &workmem->csbcpb;
>  	memset(csbcpb, 0, sizeof(*csbcpb));
> -	op.csbcpb = __pa(csbcpb);
> +	op.csbcpb = nx842_get_pa(csbcpb);
>  
>  	/*
>  	 * max_sync_size may have changed since compression,
> @@ -597,12 +606,12 @@
>  		if (likely((inbuf & NX842_HW_PAGE_MASK) ==
>  			((inbuf + hdr->sizes[i] - 1) & NX842_HW_PAGE_MASK))) {
>  			/* Create direct DDE */
> -			op.in = __pa(inbuf);
> +			op.in = nx842_get_pa((void *)inbuf);
>  			op.inlen = hdr->sizes[i];
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(inbuf, hdr->sizes[i] , &slin);
> -			op.in = __pa(slin.entries);
> +			op.in = nx842_get_pa(slin.entries);
>  			op.inlen = -nx842_get_scatterlist_size(&slin);
>  		}
>  
> @@ -613,12 +622,12 @@
>  		 */
>  		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
>  			/* Create direct DDE */
> -			op.out = __pa(outbuf);
> +			op.out = nx842_get_pa((void *)outbuf);
>  			op.outlen = max_sync_size;
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(outbuf, max_sync_size, &slout);
> -			op.out = __pa(slout.entries);
> +			op.out = nx842_get_pa(slout.entries);
>  			op.outlen = -nx842_get_scatterlist_size(&slout);
>  		}
>  
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* Re: [PATCH v2] mtd: m25p80: Make the name of mtd_info fixed
From: Brian Norris @ 2014-02-10 19:39 UTC (permalink / raw)
  To: Hou Zhiqiang; +Cc: scottwood, linuxppc-dev, mingkai.hu, linux-mtd, linux-spi
In-Reply-To: <1390717003-28699-1-git-send-email-b48286@freescale.com>

On Sun, Jan 26, 2014 at 02:16:43PM +0800, Hou Zhiqiang wrote:
> To give spi flash layout using "mtdparts=..." in cmdline, we must
> give mtd_info a fixed name,because the cmdlinepart's parser will
> match the name given in cmdline with the mtd_info.
> 
> Now, if use OF node, mtd_info's name will be spi->dev->name. It
> consists of spi_master->bus_num, and the spi_master->bus_num maybe
> dynamically fetched.
> So, give the mtd_info a new fiexd name "name.cs", "name" is name of
> spi_device_id and "cs" is chip-select in spi_dev.
> 
> Signed-off-by: Hou Zhiqiang <b48286@freescale.com>
> ---
> v2:
>  - add check for return value of function kasprintf.
>  - whether the spi_master->bus_num is dynamical is determined by spi
>    controller driver, and it can't be check in this driver. So, we can
>    not initial the mtd_info's name by distinguishing the spi_master
>    bus_num dynamically-allocated or not.

How about spi->master->bus_num < 0 ?

>  drivers/mtd/devices/m25p80.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mtd/devices/m25p80.c b/drivers/mtd/devices/m25p80.c
> index eb558e8..1f494d2 100644
> --- a/drivers/mtd/devices/m25p80.c
> +++ b/drivers/mtd/devices/m25p80.c
> @@ -1011,8 +1011,12 @@ static int m25p_probe(struct spi_device *spi)
>  
>  	if (data && data->name)
>  		flash->mtd.name = data->name;
> -	else
> -		flash->mtd.name = dev_name(&spi->dev);
> +	else {
> +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
> +				id->name, spi->chip_select);

I don't think this name is specific enough. What if there are more than
one SPI controller? Then there could be one chip with the same
chip-select. You probably still need to incorporate the SPI master
somehow, even if it's not by using the bus number directly (because it's
dynamic).

> +		if (!flash->mtd.name)
> +			return -ENOMEM;
> +	}
>  
>  	flash->mtd.type = MTD_NORFLASH;
>  	flash->mtd.writesize = 1;

Brian

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-10 19:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.
> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1):

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       207 MB   98.60%  100.00%
task_struct                         134 MB   97.82%  100.00%
kmalloc-8192                        117 MB  100.00%  100.00%
pgtable-2^12                        111 MB  100.00%  100.00%
pgtable-2^10                        104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       227 MB   98.63%  100.00%
kmalloc-8192                        130 MB  100.00%  100.00%
task_struct                         129 MB   97.73%  100.00%
pgtable-2^12                        112 MB  100.00%  100.00%
pgtable-2^10                        106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

I was thinking about your concerns about empty_node[]. Would it make
sense to use a helper function, rather than direct access to
direct_node, such as:

	bool is_node_empty(int nid)

	void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and noop
respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?

Thanks,
Nish

> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +	empty_node[alloc_node] = 0;
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	/* No data means no match */
> +	if (!page)
>  		return 0;
> +
> +	/* Node does not matter. Therefore anything is a match */
> +	if (node == NUMA_NO_NODE)
> +		return 1;
> +
> +	/* Did we hit the requested node ? */
> +	page_node = page_to_nid(page);
> +	if (page_node == node)
> +		return 1;
> +
> +	/* If the node has available data then we can use it. Mismatch */
> +	return !empty_node[page_node];
> +
> +	/* Target node empty so just take anything */
>  #endif
>  	return 1;
>  }
> 

^ permalink raw reply

* Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
From: Brian Norris @ 2014-02-10 18:47 UTC (permalink / raw)
  To: B48286@freescale.com
  Cc: Scott Wood, linuxppc-dev@ozlabs.org, Mingkai.Hu@freescale.com,
	linux-mtd@lists.infradead.org, Ezequiel Garcia
In-Reply-To: <66941b677b7e4b8d9ce4cc44304b0955@DM2PR03MB301.namprd03.prod.outlook.com>

Hi Hou,

On Thu, Jan 23, 2014 at 03:29:41AM +0000, B48286@freescale.com wrote:
> > -----Original Message-----
> > From: Brian Norris [mailto:computersforpeace@gmail.com]
> > Sent: Thursday, January 23, 2014 10:12 AM
> > To: Hou Zhiqiang-B48286
> > Cc: linux-mtd@lists.infradead.org; linuxppc-dev@ozlabs.org; Wood Scott-
> > B07421; Hu Mingkai-B21284; Ezequiel Garcia
> > Subject: Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
> > 
> > On Mon, Jan 06, 2014 at 02:34:29PM +0800, Hou Zhiqiang wrote:
> > > --- a/drivers/mtd/devices/m25p80.c
> > > +++ b/drivers/mtd/devices/m25p80.c
> > > @@ -1012,7 +1012,8 @@ static int m25p_probe(struct spi_device *spi)
> > >  	if (data && data->name)
> > >  		flash->mtd.name = data->name;
> > >  	else
> > > -		flash->mtd.name = dev_name(&spi->dev);
> > > +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
> > > +				id->name, spi->chip_select);
> > 
> > Changing the mtd.name may have far-reaching consequences for users who
> > already have mtdparts= command lines. But your concern is probably valid
> > for dynamically-determined bus numbers. Perhaps you can edit this patch
> > to only change the name when the busnum is dynamically-allocated?
> >
> It's a good idea, but in the case of mtd_info's name dynamically-allocated
> using "mtdparts=..." in command lines is illegal obviously.

I agree that users should never have relied on the dynamically-allocated
name. But changing the name for non-dynamic schemes (e.g., where the SPI
busnum is a fixed value) is not pleasant for users.

> Would you tell
> me what side-effect will be brought by the change of mtd_info's name.

I can only think of the mtdparts= command line. Otherwise, I don't think
the name is very important.

Brian

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Peter Zijlstra @ 2014-02-10 17:53 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140210155217.GF2107@lst.de>

On Mon, Feb 10, 2014 at 04:52:17PM +0100, Torsten Duwe wrote:
> Opinions, anyone?

Since the holder thing is a performance thing, not a correctness thing;
one thing you could do is something like:

static const int OWNER_HASH_SIZE = CONFIG_NR_CPUS * 4;
static const int OWNER_HASH_BITS = ilog2(OWNER_HASH_SIZE);

u16 lock_owner_array[OWNER_HASH_SIZE] = { 0, };

void set_owner(struct arch_spinlock_t *lock, int owner)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	lock_owner_array[hash] = owner;
}

void yield_to_owner(struct arch_spinlock_t *lock)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	int owner = lock_owner_array[hash];
	yield_to_cpu(owner);
}

And call set_owner() after the ticket lock is acquired, and don't bother
clearing it again; a new acquire will overwrite, a collision we have to
live with.

It should on average get you the right yield and does away with having
to track the owner field in place.

It does however get you an extra cacheline miss on acquire :/

One could consider patching it out when you know your kernel is not
running on an overloaded partition.

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-10 17:03 UTC (permalink / raw)
  To: Gabriel Paubert
  Cc: James Yang, Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <20140210110342.GA15806@visitor2.iram.es>

On Mon, 10 Feb 2014, Gabriel Paubert wrote:

> On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
> > On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> > 
> > > 	Hi Stephen,
> > > 
> > > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > > 
> > > > > From: Gabriel Paubert <paubert@iram.es>
> > > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > > > Date: 02/06/2014 07:26 PM
> > > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > > 
> > > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:


> > 
> > > > > > 
> > > > > >                 mask = 0;
> > > > > >                 if (FM & (1 << 0))
> > > > > >                         mask |= 0x0000000f;
> > > > > >                 if (FM & (1 << 1))
> > > > > >                         mask |= 0x000000f0;
> > > > > >                 if (FM & (1 << 2))
> > > > > >                         mask |= 0x00000f00;
> > > > > >                 if (FM & (1 << 3))
> > > > > >                         mask |= 0x0000f000;
> > > > > >                 if (FM & (1 << 4))
> > > > > >                         mask |= 0x000f0000;
> > > > > >                 if (FM & (1 << 5))
> > > > > >                         mask |= 0x00f00000;
> > > > > >                 if (FM & (1 << 6))
> > > > > >                         mask |= 0x0f000000;
> > > > > >                 if (FM & (1 << 7))
> > > > > >                         mask |= 0x90000000;
> > > > > > 
> > > > > > With the above mask computation I get consistent results for 
> > > > > > both the MPC8548 and MPC7410 boards.

> Ok, if you have measured that method1 is faster than method2, let us go for it.
> I believe method2 would be faster if you had a large out-of-order execution
> window, because more parallelism can be extracted from it, but this is probably
> only true for high end cores, which do not need FPU emulation in the first place.

Yeah, 8548 can issue 2 SFX instructions per cycle which is what the 
compiler generated.   

 
> The other place where we can optimize is the generation of FEX. Here is 
> my current patch:
> 
> 
> diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
> index dbce92e..b57b3fa8 100644
> --- a/arch/powerpc/math-emu/mtfsf.c
> +++ b/arch/powerpc/math-emu/mtfsf.c
> @@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
>  	u32 mask;
>  	u32 fpscr;
>  
> -	if (FM == 0)
> +	if (likely(FM == 0xff))
> +		mask = 0xffffffff;
> +	else if (unlikely(FM == 0))
>  		return 0;
> -
> -	if (FM == 0xff)
> -		mask = 0x9fffffff;
>  	else {
> -		mask = 0;
> -		if (FM & (1 << 0))
> -			mask |= 0x90000000;
> -		if (FM & (1 << 1))
> -			mask |= 0x0f000000;
> -		if (FM & (1 << 2))
> -			mask |= 0x00f00000;
> -		if (FM & (1 << 3))
> -			mask |= 0x000f0000;
> -		if (FM & (1 << 4))
> -			mask |= 0x0000f000;
> -		if (FM & (1 << 5))
> -			mask |= 0x00000f00;
> -		if (FM & (1 << 6))
> -			mask |= 0x000000f0;
> -		if (FM & (1 << 7))
> -			mask |= 0x0000000f;
> +		mask = (FM & 1);
> +		mask |= (FM << 3) & 0x10;
> +		mask |= (FM << 6) & 0x100;
> +		mask |= (FM << 9) & 0x1000;
> +		mask |= (FM << 12) & 0x10000;
> +		mask |= (FM << 15) & 0x100000;
> +		mask |= (FM << 18) & 0x1000000;
> +		mask |= (FM << 21) & 0x10000000;
> +		mask *= 15;


Needs to also mask out bits 1 and 2, they aren't to be set from frB.

		mask &= 0x9FFFFFFF;




>  	}
>  
> -	__FPU_FPSCR &= ~(mask);
> -	__FPU_FPSCR |= (frB[1] & mask);
> +	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
> +		~(FPSCR_VX | FPSCR_FEX);
>  
> -	__FPU_FPSCR &= ~(FPSCR_VX);
> -	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
> +	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
>  		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
>  		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
> -		__FPU_FPSCR |= FPSCR_VX;
> -
> -	fpscr = __FPU_FPSCR;
> -	fpscr &= ~(FPSCR_FEX);
> -	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
> -	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
> -	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
> -	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
> -	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
> -		fpscr |= FPSCR_FEX;
> +		fpscr |= FPSCR_VX;
> +
> +	/* The bit order of exception enables and exception status
> +	 * is the same. Simply shift and mask to check for enabled
> +	 * exceptions.
> +	 */
> +	if (fpscr & (fpscr >> 22) &  0xf8) fpscr |= FPSCR_FEX;
>  	__FPU_FPSCR = fpscr;
>  
>  #ifdef DEBUG
>  mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
>  1 file changed, 22 insertions(+), 35 deletions(-)
> 
> 
> Notes: 
> 
> 1) I'm really unsure on whether 0xff is frequent or not. So the likely()
> statement at the beginning may be wrong. Actually, if it is not very likely,
> it might be better to remove the special casef for FM==0xff. A look at 
> GCC sources shows that it never generates a mask of 0xff. From glibc
> sources, there vast majority of cases uses 0x1, only isnan() uses 0xff.

Can't handle all cases here.  
 
> 2) it may be better to remove the check for FM==0, after all, the instruction
> efectively becomes a nop, and generating the instruction in the first place
> would be too stupid for words.

Hmm a heavy no-op.  I wonder if it is heavier than a sync.
 
> 3) if might be better to #define the magic constants (22 and 0xf8) used 
> in the last if statement.
> 
> 	Gabriel

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-10 16:50 UTC (permalink / raw)
  To: Stephen N Chivers; +Cc: James Yang, Chris Proctor, linuxppc-dev
In-Reply-To: <OF80E982EC.BE15A034-ONCA257C7A.006B58B5-CA257C7A.006C3EA8@csc.com>

On Sun, 9 Feb 2014, Stephen N Chivers wrote:

> James Yang <James.Yang@freescale.com> wrote on 02/08/2014 07:49:40 AM:
> 
> > From: James Yang <James.Yang@freescale.com>
> > To: Gabriel Paubert <paubert@iram.es>
> > Cc: Stephen N Chivers <schivers@csc.com.au>, Chris Proctor 
> > <cproctor@csc.com.au>, <linuxppc-dev@lists.ozlabs.org>
> > Date: 02/08/2014 07:49 AM
> > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > 
> > On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> > 
> > >    Hi Stephen,
> > > 
> > > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > > 
> > > > > From: Gabriel Paubert <paubert@iram.es>
> > > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor 
> <cproctor@csc.com.au>
> > > > > Date: 02/06/2014 07:26 PM
> > > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > > 
> > > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> > > > > > With the above mask computation I get consistent results for 
> > > > > > both the MPC8548 and MPC7410 boards.
> > > > > > 
> > > > > > Am I missing something subtle?
> > > > > 
> > > > > No I think you are correct. This said, this code may probably be 
> > > > optimized 
> > > > > to eliminate a lot of the conditional branches. I think that:
> > 
> > 
> > If the compiler is enabled to generate isel instructions, it would not 
> > use a conditional branch for this code. (ignore the andi's values, 
> > this is an old compile)
> > 
> From limited research, the 440GP is a processor
> that doesn't implement the isel instruction and it does
> not implement floating point.
> 
> The kernel emulates isel and so using that instruction
> for the 440GP would have a double trap penalty.

Are you writing about something outside the scope of this thread?  OP 
was using MPC8548 not a 440GP.  The compiler should not be using or 
targeting 8548 for a 440GP so having to emulate isel shouldn't be an 
issue because there wouldn't be any.  (The assembly listing I posted 
was generated by gcc targeting 8548.)  Anyway, I had measured the 
non-isel routines to be faster and that works without illop traps.

 
> Correct me if I am wrong, the isel instruction first appears
> in PowerPC ISA v2.04 around mid 2007. 


isel appeared in 2003 in the e500 (v1) core that is in the MPC8540.  
The instruction is Power ISA 2.03 (9/2006).

^ permalink raw reply

* [RFT][PATCH 00/12] change drivers power management to use dev_pm_ops
From: Shuah Khan @ 2014-02-10 16:12 UTC (permalink / raw)
  To: rjw
  Cc: Shuah Khan, heiko.carstens, chris, shuahkhan, dwalker,
	ingo.tuchscherer, sonic.zhang, linux-s390, linux, davidb,
	linux-pm, linux-arm-msm, linux-pcmcia, adi-buildroot-devel,
	mirq-linux, ian, manuel.lauss, linux-arm-kernel, gregkh,
	linux-mmc, linux-kernel, schwidefsky, linux390, linuxppc-dev

Change drivers to register pm ops using dev_pm_ops instead of legacy pm_ops.
.pm hooks call existing legacy suspend and resume interfaces by passing in
the right pm state. Bus drivers suspend and resume routines call .pm driver
hooks if found.

Shuah Khan (12):
  arm: change locomo platform and bus power management to use
    dev_pm_ops
  arm: change sa1111 platform and bus power management to use
    dev_pm_ops
  arm: change scoop platform power management to use dev_pm_ops
  drivers/macintosh/adb: change platform power managemnet to use
    dev_pm_ops
  mmc: change au1xmmc platform power management to use dev_pm_ops
  mmc: change bfin_sdh platform power management to use dev_pm_ops
  isa: change isa bus power managemnet to use dev_pm_ops
  mmc: change cb710-mmc platform power management to use dev_pm_ops
  mmc: change msm_sdcc platform power management to use dev_pm_ops
  mmc: change tmio_mmc platform power management to use dev_pm_ops
  drivers/pcmcia: change ds driver power management to use dev_pm_ops
  drivers/s390/crypto: change ap_bus driver power management to use
    dev_pm_ops

 arch/arm/common/locomo.c     |   93 +++++++++++++++++++++++++++++++++++-------
 arch/arm/common/sa1111.c     |   88 +++++++++++++++++++++++++++++++--------
 arch/arm/common/scoop.c      |   44 ++++++++++++++++----
 drivers/base/isa.c           |   30 ++++++++++++--
 drivers/macintosh/adb.c      |   41 ++++++++++++++++---
 drivers/mmc/host/au1xmmc.c   |   43 +++++++++++++++----
 drivers/mmc/host/bfin_sdh.c  |   40 +++++++++++++++---
 drivers/mmc/host/cb710-mmc.c |   37 +++++++++++++++--
 drivers/mmc/host/msm_sdcc.c  |   42 +++++++++++++++----
 drivers/mmc/host/tmio_mmc.c  |   42 +++++++++++++++----
 drivers/pcmcia/ds.c          |   36 +++++++++++++---
 drivers/s390/crypto/ap_bus.c |   30 ++++++++++++--
 12 files changed, 481 insertions(+), 85 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* [RFT][PATCH 04/12] drivers/macintosh/adb: change platform power management to use dev_pm_ops
From: Shuah Khan @ 2014-02-10 16:12 UTC (permalink / raw)
  To: rjw, benh; +Cc: shuahkhan, Shuah Khan, linuxppc-dev, linux-kernel, linux-pm
In-Reply-To: <cover.1392040064.git.shuah.kh@samsung.com>

Change adb platform driver to register pm ops using dev_pm_ops instead of
legacy pm_ops. .pm hooks call existing legacy suspend and resume interfaces
by passing in the right pm state.

Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
---
 drivers/macintosh/adb.c |   41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 04a5049..df6b201 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -263,7 +263,7 @@ adb_reset_bus(void)
 /*
  * notify clients before sleep
  */
-static int adb_suspend(struct platform_device *dev, pm_message_t state)
+static int __adb_suspend(struct platform_device *dev, pm_message_t state)
 {
 	adb_got_sleep = 1;
 	/* We need to get a lock on the probe thread */
@@ -276,10 +276,25 @@ static int adb_suspend(struct platform_device *dev, pm_message_t state)
 	return 0;
 }
 
+static int adb_suspend(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_SUSPEND);
+}
+
+static int adb_freeze(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_FREEZE);
+}
+
+static int adb_poweroff(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_HIBERNATE);
+}
+
 /*
  * reset bus after sleep
  */
-static int adb_resume(struct platform_device *dev)
+static int __adb_resume(struct platform_device *dev)
 {
 	adb_got_sleep = 0;
 	up(&adb_probe_mutex);
@@ -287,6 +302,11 @@ static int adb_resume(struct platform_device *dev)
 
 	return 0;
 }
+
+static int adb_resume(struct device *dev)
+{
+	return __adb_resume(to_platform_device(dev));
+}
 #endif /* CONFIG_PM */
 
 static int __init adb_init(void)
@@ -831,14 +851,25 @@ static const struct file_operations adb_fops = {
 	.release	= adb_release,
 };
 
+#ifdef CONFIG_PM
+static const struct dev_pm_ops adb_dev_pm_ops = {
+	.suspend = adb_suspend,
+	.resume = adb_resume,
+	/* Hibernate hooks */
+	.freeze = adb_freeze,
+	.thaw = adb_resume,
+	.poweroff = adb_poweroff,
+	.restore = adb_resume,
+};
+#endif
+
 static struct platform_driver adb_pfdrv = {
 	.driver = {
 		.name = "adb",
-	},
 #ifdef CONFIG_PM
-	.suspend = adb_suspend,
-	.resume = adb_resume,
+		.pm = &adb_dev_pm_ops,
 #endif
+	},
 };
 
 static struct platform_device adb_pfdev = {
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH v2] powerpc ticket locks
From: Torsten Duwe @ 2014-02-10 15:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <1392001823.3996.21.camel@pasglop>

On Mon, Feb 10, 2014 at 02:10:23PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
> >  typedef struct {
> > -       volatile unsigned int slock;
> > -} arch_spinlock_t;
> > +       union {
> > +               __ticketpair_t head_tail;
> > +               struct __raw_tickets {
> > +#ifdef __BIG_ENDIAN__          /* The "tail" part should be in the MSBs */
> > +                       __ticket_t tail, head;
> > +#else
> > +                       __ticket_t head, tail;
> > +#endif
> > +               } tickets;
> > +       };
> > +#if defined(CONFIG_PPC_SPLPAR)
> > +       u32 holder;
> > +#endif
> > +} arch_spinlock_t __aligned(4);
> 
> That's still broken with lockref (which we just merged).
> 
> We must have the arch_spinlock_t and the ref in the same 64-bit word
> otherwise it will break.

Well, as far as I can see you'll just not be able to
USE_CMPXCHG_LOCKREF -- with the appropriate performance hit --
the code just falls back into lock&ref on pSeries.
What again was the intention of directed yield in the first place...?

> We can make it work in theory since the holder doesn't have to be
> accessed atomically, but the practicals are a complete mess ...
> lockref would essentially have to re-implement the holder handling
> of the spinlocks and use lower level ticket stuff.
> 
> Unless you can find a sneaky trick ... :-(

What if I squeeze the bits a little?
4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
At the cost that unlock may become an ll/sc operation again.
I could think about a trick against that.
But alas, hw_cpu_id is 16 bit, which makes a lookup table neccessary :-/

Doing another round of yields for lockrefs now doesn't
sound so bad any more.

Opinions, anyone?

	Torsten

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 13:00 UTC (permalink / raw)
  To: David Laight
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BBDA6@AcuExch.aculab.com>

On Mon, Feb 10, 2014 at 12:32:18PM +0000, David Laight wrote:
> > I disagree, perhaps mostly because the compiler is not clever enough, but right
> > now the code for solution 1 is (actually I have rewritten the code
> > and it reads:
> > 
> > 	mask = (FM & 1)
> > 			| ((FM << 3) & 0x10)
> > 			| ((FM << 6) & 0x100)
> > 			| ((FM << 9) & 0x1000)
> > 			| ((FM << 12) & 0x10000)
> > 			| ((FM << 15) & 0x100000)
> > 			| ((FM << 18) & 0x1000000)
> > 			| ((FM << 21) & 0x10000000);
> > to avoid sequence point in case it hampers the compiler)
> > 
> > and the output is:
> > 
> >         rlwinm 10,3,3,27,27      # D.11621, FM,,
> >         rlwinm 9,3,6,23,23       # D.11621, FM,,
> >         or 9,10,9        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,0,31,31      # D.11621, FM,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,9,19,19      # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,12,15,15     # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,15,11,11     # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,18,7,7       # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 3,3,21,3,3        # D.11621, FM,,
> >         or 9,9,3         #, mask, D.11621, D.11621
> >         mulli 9,9,15     # mask, mask,
> > 
> > see that r9 is used 7 times as both input and output operand, plus
> > once for rlwinm. This gives a dependency length of 8 at least.
> > 
> > In the other case (I've deleted the code) the dependency length
> > was significantly shorter. In any case that one is fewer instructions,
> > which is good for occasional use.
> 
> Hmmm... I hand-counted a dependency length of 8 for the other version.
> Maybe there are some ppc instructions that reduce it.

Either I misread the generated code or I got somewhat less.

What helps for method1 is the rotate and mask instructions of PPC. Each of
left shift and mask becomes a single rlwinm. 
> 
> Stupid compiler :-)

Indeed. I've trying to coerce it into generating rlwimi instructions
(in which case the whole building of the mask reduces to 8 assembly
instructions) and failed. It seems that the compiler lacks some patterns
some patterns that would directly map to rlwimi.

> Trouble is, I bet that even if you code it as:
>  	mask1 = (FM & 1) | ((FM << 3) & 0x10);
> 	mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
> 	mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
> 	mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
> 	mask1 |= mask2;
> 	mask3 |= mask4;
> 	mask = mask1 | mask3;
> the compiler will 'optimise' it to the above before code generation.

Indeed it's what it does :-(

I believe that the current suggestion is good enough.

	Gabriel

^ permalink raw reply

* RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: David Laight @ 2014-02-10 12:32 UTC (permalink / raw)
  To: 'Gabriel Paubert'
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140210122138.GA30356@visitor2.iram.es>

> I disagree, perhaps mostly because the compiler is not clever enough, but=
 right
> now the code for solution 1 is (actually I have rewritten the code
> and it reads:
>=20
> 	mask =3D (FM & 1)
> 			| ((FM << 3) & 0x10)
> 			| ((FM << 6) & 0x100)
> 			| ((FM << 9) & 0x1000)
> 			| ((FM << 12) & 0x10000)
> 			| ((FM << 15) & 0x100000)
> 			| ((FM << 18) & 0x1000000)
> 			| ((FM << 21) & 0x10000000);
> to avoid sequence point in case it hampers the compiler)
>=20
> and the output is:
>=20
>         rlwinm 10,3,3,27,27      # D.11621, FM,,
>         rlwinm 9,3,6,23,23       # D.11621, FM,,
>         or 9,10,9        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,0,31,31      # D.11621, FM,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,9,19,19      # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,12,15,15     # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,15,11,11     # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,18,7,7       # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 3,3,21,3,3        # D.11621, FM,,
>         or 9,9,3         #, mask, D.11621, D.11621
>         mulli 9,9,15     # mask, mask,
>=20
> see that r9 is used 7 times as both input and output operand, plus
> once for rlwinm. This gives a dependency length of 8 at least.
>=20
> In the other case (I've deleted the code) the dependency length
> was significantly shorter. In any case that one is fewer instructions,
> which is good for occasional use.

Hmmm... I hand-counted a dependency length of 8 for the other version.
Maybe there are some ppc instructions that reduce it.

Stupid compiler :-)
Trouble is, I bet that even if you code it as:
 	mask1 =3D (FM & 1) | ((FM << 3) & 0x10);
	mask2 =3D ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
	mask3 =3D ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
	mask4 =3D ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
	mask1 |=3D mask2;
	mask3 |=3D mask4;
	mask =3D mask1 | mask3;
the compiler will 'optimise' it to the above before code generation.
If it doesn't adding () to pair the | might be enough.
Then a new version of the compiler will change the behaviour again.

	David

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 12:21 UTC (permalink / raw)
  To: David Laight
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BBCE9@AcuExch.aculab.com>

On Mon, Feb 10, 2014 at 11:17:38AM +0000, David Laight wrote:
> > > However, your other solutions are better.
> > >
> > >
> > > > > >
> > > > > > mask = (FM & 1);
> > > > > > mask |= (FM << 3) & 0x10;
> > > > > > mask |= (FM << 6) & 0x100;
> > > > > > mask |= (FM << 9) & 0x1000;
> > > > > > mask |= (FM << 12) & 0x10000;
> > > > > > mask |= (FM << 15) & 0x100000;
> > > > > > mask |= (FM << 18) & 0x1000000;
> > > > > > mask |= (FM << 21) & 0x10000000;
> > > > > > mask *= 15;
> > > > > >
> > > > > > should do the job, in less code space and without a single branch.
> ...
> > > > > > Another way of optomizing this could be:
> > > > > >
> > > > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > > mask *= 15;
> ...
> > Ok, if you have measured that method1 is faster than method2, let us go for it.
> > I believe method2 would be faster if you had a large out-of-order execution
> > window, because more parallelism can be extracted from it, but this is probably
> > only true for high end cores, which do not need FPU emulation in the first place.
> 
> FWIW the second has a long dependency chain on 'mask', whereas the first can execute
> the shift/and in any order and then merge the results.
> So on most superscalar cpu, or one with result delays for arithmetic, the first
> is likely to be faster.

I disagree, perhaps mostly because the compiler is not clever enough, but right
now the code for solution 1 is (actually I have rewritten the code
and it reads:

	mask = (FM & 1)
			| ((FM << 3) & 0x10)
			| ((FM << 6) & 0x100)
			| ((FM << 9) & 0x1000)
			| ((FM << 12) & 0x10000)
			| ((FM << 15) & 0x100000)
			| ((FM << 18) & 0x1000000)
			| ((FM << 21) & 0x10000000);
to avoid sequence point in case it hampers the compiler)

and the output is:

        rlwinm 10,3,3,27,27      # D.11621, FM,,
        rlwinm 9,3,6,23,23       # D.11621, FM,,
        or 9,10,9        #, D.11621, D.11621, D.11621
        rlwinm 10,3,0,31,31      # D.11621, FM,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,9,19,19      # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,12,15,15     # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,15,11,11     # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,18,7,7       # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 3,3,21,3,3        # D.11621, FM,,
        or 9,9,3         #, mask, D.11621, D.11621
        mulli 9,9,15     # mask, mask,

see that r9 is used 7 times as both input and output operand, plus
once for rlwinm. This gives a dependency length of 8 at least.

In the other case (I've deleted the code) the dependency length
was significantly shorter. In any case that one is fewer instructions, 
which is good for occasional use. 

	Gabriel

^ permalink raw reply

* RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: David Laight @ 2014-02-10 11:17 UTC (permalink / raw)
  To: 'Gabriel Paubert', James Yang
  Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140210110342.GA15806@visitor2.iram.es>

> > However, your other solutions are better.
> >
> >
> > > > >
> > > > > mask =3D (FM & 1);
> > > > > mask |=3D (FM << 3) & 0x10;
> > > > > mask |=3D (FM << 6) & 0x100;
> > > > > mask |=3D (FM << 9) & 0x1000;
> > > > > mask |=3D (FM << 12) & 0x10000;
> > > > > mask |=3D (FM << 15) & 0x100000;
> > > > > mask |=3D (FM << 18) & 0x1000000;
> > > > > mask |=3D (FM << 21) & 0x10000000;
> > > > > mask *=3D 15;
> > > > >
> > > > > should do the job, in less code space and without a single branch=
.
...
> > > > > Another way of optomizing this could be:
> > > > >
> > > > > mask =3D (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > mask =3D (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > mask =3D (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > mask *=3D 15;
...
> Ok, if you have measured that method1 is faster than method2, let us go f=
or it.
> I believe method2 would be faster if you had a large out-of-order executi=
on
> window, because more parallelism can be extracted from it, but this is pr=
obably
> only true for high end cores, which do not need FPU emulation in the firs=
t place.

FWIW the second has a long dependency chain on 'mask', whereas the first ca=
n execute
the shift/and in any order and then merge the results.
So on most superscalar cpu, or one with result delays for arithmetic, the f=
irst
is likely to be faster.

	David

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 11:03 UTC (permalink / raw)
  To: James Yang; +Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <alpine.LRH.2.00.1402071348380.10318@ra8135-ec1.am.freescale.net>

On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
> On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> 
> > 	Hi Stephen,
> > 
> > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > 
> > > > From: Gabriel Paubert <paubert@iram.es>
> > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > > Date: 02/06/2014 07:26 PM
> > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > 
> > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> 
> > > > > 
> > > > >                 mask = 0;
> > > > >                 if (FM & (1 << 0))
> > > > >                         mask |= 0x0000000f;
> > > > >                 if (FM & (1 << 1))
> > > > >                         mask |= 0x000000f0;
> > > > >                 if (FM & (1 << 2))
> > > > >                         mask |= 0x00000f00;
> > > > >                 if (FM & (1 << 3))
> > > > >                         mask |= 0x0000f000;
> > > > >                 if (FM & (1 << 4))
> > > > >                         mask |= 0x000f0000;
> > > > >                 if (FM & (1 << 5))
> > > > >                         mask |= 0x00f00000;
> > > > >                 if (FM & (1 << 6))
> > > > >                         mask |= 0x0f000000;
> > > > >                 if (FM & (1 << 7))
> > > > >                         mask |= 0x90000000;
> > > > > 
> > > > > With the above mask computation I get consistent results for 
> > > > > both the MPC8548 and MPC7410 boards.
> > > > > 
> > > > > Am I missing something subtle?
> > > > 
> > > > No I think you are correct. This said, this code may probably be 
> > > optimized 
> > > > to eliminate a lot of the conditional branches. I think that:
> 
> 
> If the compiler is enabled to generate isel instructions, it would not 
> use a conditional branch for this code. (ignore the andi's values, 
> this is an old compile)
> 
> c0037c2c <mtfsf>:
> c0037c2c:       2c 03 00 00     cmpwi   r3,0
> c0037c30:       41 82 01 1c     beq-    c0037d4c <mtfsf+0x120>
> c0037c34:       2f 83 00 ff     cmpwi   cr7,r3,255
> c0037c38:       41 9e 01 28     beq-    cr7,c0037d60 <mtfsf+0x134>
> c0037c3c:       70 66 00 01     andi.   r6,r3,1
> c0037c40:       3d 00 90 00     lis     r8,-28672
> c0037c44:       7d 20 40 9e     iseleq  r9,r0,r8
> c0037c48:       70 6a 00 02     andi.   r10,r3,2
> c0037c4c:       65 28 0f 00     oris    r8,r9,3840
> c0037c50:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c54:       70 66 00 04     andi.   r6,r3,4
> c0037c58:       65 28 00 f0     oris    r8,r9,240
> c0037c5c:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c60:       70 6a 00 08     andi.   r10,r3,8
> c0037c64:       65 28 00 0f     oris    r8,r9,15
> c0037c68:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c6c:       70 66 00 10     andi.   r6,r3,16
> c0037c70:       61 28 f0 00     ori     r8,r9,61440
> c0037c74:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c78:       70 6a 00 20     andi.   r10,r3,32
> c0037c7c:       61 28 0f 00     ori     r8,r9,3840
> c0037c80:       54 6a cf fe     rlwinm  r10,r3,25,31,31
> c0037c84:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c88:       2f 8a 00 00     cmpwi   cr7,r10,0
> c0037c8c:       70 66 00 40     andi.   r6,r3,64
> c0037c90:       61 28 00 f0     ori     r8,r9,240
> c0037c94:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c98:       41 9e 00 08     beq-    cr7,c0037ca0 <mtfsf+0x74>
> c0037c9c:       61 29 00 0f     ori     r9,r9,15
> 	...
>  
> However, your other solutions are better.
> 
> 
> > > > 
> > > > mask = (FM & 1);
> > > > mask |= (FM << 3) & 0x10;
> > > > mask |= (FM << 6) & 0x100;
> > > > mask |= (FM << 9) & 0x1000;
> > > > mask |= (FM << 12) & 0x10000;
> > > > mask |= (FM << 15) & 0x100000;
> > > > mask |= (FM << 18) & 0x1000000;
> > > > mask |= (FM << 21) & 0x10000000;
> > > > mask *= 15;
> > > > 
> > > > should do the job, in less code space and without a single branch.
> > > > 
> > > > Each one of the "mask |=" lines should be translated into an
> > > > rlwinm instruction followed by an "or". Actually it should be possible
> > > > to transform each of these lines into a single "rlwimi" instruction
> > > > but I don't know how to coerce gcc to reach this level of optimization.
> > > > 
> > > > Another way of optomizing this could be:
> > > > 
> > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > mask *= 15;
> > > > 
> 
> 
> > Ok, I finally edited my sources and test compiled the suggestions
> > I gave. I'd say that method2 is the best overall indeed. You can
> > actually save one more instruction by setting mask to all ones in 
> > the case FM=0xff, but that's about all in this area.
> 
> 
> My measurements show method1 to be smaller and faster than method2 due 
> to the number of instructions needed to generate the constant masks in 
> method2, but it may depend upon your compiler and hardware.  Both are 
> faster than the original with isel.

Ok, if you have measured that method1 is faster than method2, let us go for it.
I believe method2 would be faster if you had a large out-of-order execution
window, because more parallelism can be extracted from it, but this is probably
only true for high end cores, which do not need FPU emulation in the first place.

The other place where we can optimize is the generation of FEX. Here is 
my current patch:


diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
index dbce92e..b57b3fa8 100644
--- a/arch/powerpc/math-emu/mtfsf.c
+++ b/arch/powerpc/math-emu/mtfsf.c
@@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
 	u32 mask;
 	u32 fpscr;
 
-	if (FM == 0)
+	if (likely(FM == 0xff))
+		mask = 0xffffffff;
+	else if (unlikely(FM == 0))
 		return 0;
-
-	if (FM == 0xff)
-		mask = 0x9fffffff;
 	else {
-		mask = 0;
-		if (FM & (1 << 0))
-			mask |= 0x90000000;
-		if (FM & (1 << 1))
-			mask |= 0x0f000000;
-		if (FM & (1 << 2))
-			mask |= 0x00f00000;
-		if (FM & (1 << 3))
-			mask |= 0x000f0000;
-		if (FM & (1 << 4))
-			mask |= 0x0000f000;
-		if (FM & (1 << 5))
-			mask |= 0x00000f00;
-		if (FM & (1 << 6))
-			mask |= 0x000000f0;
-		if (FM & (1 << 7))
-			mask |= 0x0000000f;
+		mask = (FM & 1);
+		mask |= (FM << 3) & 0x10;
+		mask |= (FM << 6) & 0x100;
+		mask |= (FM << 9) & 0x1000;
+		mask |= (FM << 12) & 0x10000;
+		mask |= (FM << 15) & 0x100000;
+		mask |= (FM << 18) & 0x1000000;
+		mask |= (FM << 21) & 0x10000000;
+		mask *= 15;
 	}
 
-	__FPU_FPSCR &= ~(mask);
-	__FPU_FPSCR |= (frB[1] & mask);
+	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
+		~(FPSCR_VX | FPSCR_FEX);
 
-	__FPU_FPSCR &= ~(FPSCR_VX);
-	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
+	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
 		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
 		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
-		__FPU_FPSCR |= FPSCR_VX;
-
-	fpscr = __FPU_FPSCR;
-	fpscr &= ~(FPSCR_FEX);
-	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
-	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
-	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
-	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
-	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
-		fpscr |= FPSCR_FEX;
+		fpscr |= FPSCR_VX;
+
+	/* The bit order of exception enables and exception status
+	 * is the same. Simply shift and mask to check for enabled
+	 * exceptions.
+	 */
+	if (fpscr & (fpscr >> 22) &  0xf8) fpscr |= FPSCR_FEX;
 	__FPU_FPSCR = fpscr;
 
 #ifdef DEBUG
 mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
 1 file changed, 22 insertions(+), 35 deletions(-)


Notes: 

1) I'm really unsure on whether 0xff is frequent or not. So the likely()
statement at the beginning may be wrong. Actually, if it is not very likely,
it might be better to remove the special casef for FM==0xff. A look at 
GCC sources shows that it never generates a mask of 0xff. From glibc
sources, there vast majority of cases uses 0x1, only isnan() uses 0xff.

2) it may be better to remove the check for FM==0, after all, the instruction
efectively becomes a nop, and generating the instruction in the first place
would be too stupid for words.

3) if might be better to #define the magic constants (22 and 0xf8) used 
in the last if statement.

	Gabriel
> 

^ permalink raw reply related

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-10  3:45 UTC (permalink / raw)
  To: Peter Zijlstra, Nicolas Pitre
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Daniel Lezcano,
	Rafael J. Wysocki, LKML, Ingo Molnar, Thomas Gleixner,
	linuxppc-dev
In-Reply-To: <20140207124140.GB9987@twins.programming.kicks-ass.net>

Hi Peter,

On 02/07/2014 06:11 PM, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 05:11:26PM +0530, Preeti U Murthy wrote:
>> But observe the idle state "snooze" on powerpc. The power that this idle
>> state saves is through the lowering of the thread priority of the CPU.
>> After it lowers the thread priority, it is done. It cannot
>> "wait_for_interrupts". It will exit my_idle(). It is now upto the
>> generic idle loop to increase the thread priority if the need_resched
>> flag is set. Only an interrupt routine can increase the thread priority.
>> Else we will need to do it explicitly. And in such states which have a
>> polling nature, the cpu will not receive a reschedule IPI.
>>
>> That is why in the snooze_loop() we poll on need_resched. If it is set
>> we up the priority of the thread using HMT_MEDIUM() and then exit the
>> my_idle() loop. In case of interrupts, the priority gets automatically
>> increased.
> 
> You can poll without setting TS_POLLING/TIF_POLLING_NRFLAGS just fine
> and get the IPI if that is what you want.
> 
> Depending on how horribly unprovisioned the thread gets at the lowest
> priority, that might actually be faster than polling and raising the
> prio whenever it does get ran.

So I am assuming you mean something like the below:

my_idle()
{
   local_irq_enable();
   /* Remove the setting of the polling flag */
   HMT_low();
   return index;
}

And then exit into the generic idle loop. But the issue I see here is
that the TS_POLLING/TIF_POLLING_NRFLAGS gets set immediately. So, if on
testing need_resched() immediately after this returns that the
TIF_NEED_RESCHED flag is set, the thread will exit at low priority right?
 We could raise the priority of the thread in arch_cpu_idle_exit() soon
after setting the polling flag but that would mean for cases where the
TIF_NEED_RESCHED flag is not set we unnecessarily raise the priority of
the thread.

Thanks

Regards
Preeti U Murthy

> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox