* [PATCH v6 02/46] percpu_rwlock: Introduce per-CPU variables for the reader and the writer
From: Srivatsa S. Bhat @ 2013-02-18 12:38 UTC
To: tglx, peterz, tj, oleg, paulmck, rusty, mingo, akpm, namhyung
Cc: linux-arch, linux, nikunj, linux-pm, fweisbec, linux-doc,
linux-kernel, rostedt, xiaoguangrong, rjw, sbw, wangyun,
srivatsa.bhat, netdev, vincent.guittot, walken, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <20130218123714.26245.61816.stgit@srivatsabhat.in.ibm.com>
Per-CPU rwlocks ought to give better performance than global rwlocks; that
is where the "per-CPU" component comes in. So introduce the per-CPU variables
needed at the reader and the writer sides, and add support for dynamically
initializing (and freeing) per-CPU rwlocks.
These per-CPU variables will be used subsequently to implement the core
algorithm behind per-CPU rwlocks.
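As a rough userspace illustration of the init/free lifecycle described above
(not kernel code): the sketch below uses calloc() and a fixed SKETCH_NR_CPUS
as stand-ins for alloc_percpu() and the real set of CPUs. All sketch_* names
are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS 4	/* assumption: fixed CPU count for the sketch */

/* Same two fields as struct rw_state in the patch. */
struct sketch_rw_state {
	unsigned long reader_refcnt;
	bool writer_signal;
};

struct sketch_percpu_rwlock {
	struct sketch_rw_state *rw_state;	/* one slot per CPU */
};

/* Allocate zeroed per-CPU state; returns 0 or -ENOMEM, like the patch. */
int sketch_init_rwlock(struct sketch_percpu_rwlock *lock)
{
	assert(lock != NULL);

	/* calloc() stands in for alloc_percpu() here. */
	lock->rw_state = calloc(SKETCH_NR_CPUS, sizeof(*lock->rw_state));
	if (!lock->rw_state)
		return -ENOMEM;

	return 0;
}

/* Release the per-CPU state and poison the pointer. */
void sketch_free_rwlock(struct sketch_percpu_rwlock *lock)
{
	assert(lock != NULL);

	free(lock->rw_state);

	/* Catch use-after-free bugs, as percpu_free_rwlock() does. */
	lock->rw_state = NULL;
}
```

A caller would pair sketch_init_rwlock() with sketch_free_rwlock() and check
the return value of the former before touching ->rw_state.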
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
include/linux/percpu-rwlock.h | 8 ++++++++
lib/percpu-rwlock.c | 12 ++++++++++++
2 files changed, 20 insertions(+)
diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
index 0caf81f..74eaf4d 100644
--- a/include/linux/percpu-rwlock.h
+++ b/include/linux/percpu-rwlock.h
@@ -28,7 +28,13 @@
#include <linux/lockdep.h>
#include <linux/spinlock.h>
+struct rw_state {
+ unsigned long reader_refcnt;
+ bool writer_signal;
+};
+
struct percpu_rwlock {
+ struct rw_state __percpu *rw_state;
rwlock_t global_rwlock;
};
@@ -41,6 +47,8 @@ extern void percpu_write_unlock(struct percpu_rwlock *);
extern int __percpu_init_rwlock(struct percpu_rwlock *,
const char *, struct lock_class_key *);
+extern void percpu_free_rwlock(struct percpu_rwlock *);
+
#define percpu_init_rwlock(pcpu_rwlock) \
({ static struct lock_class_key rwlock_key; \
__percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key); \
diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
index 111a238..f938096 100644
--- a/lib/percpu-rwlock.c
+++ b/lib/percpu-rwlock.c
@@ -31,6 +31,10 @@
int __percpu_init_rwlock(struct percpu_rwlock *pcpu_rwlock,
const char *name, struct lock_class_key *rwlock_key)
{
+ pcpu_rwlock->rw_state = alloc_percpu(struct rw_state);
+ if (unlikely(!pcpu_rwlock->rw_state))
+ return -ENOMEM;
+
/* ->global_rwlock represents the whole percpu_rwlock for lockdep */
#ifdef CONFIG_DEBUG_SPINLOCK
__rwlock_init(&pcpu_rwlock->global_rwlock, name, rwlock_key);
@@ -41,6 +45,14 @@ int __percpu_init_rwlock(struct percpu_rwlock *pcpu_rwlock,
return 0;
}
+void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
+{
+ free_percpu(pcpu_rwlock->rw_state);
+
+ /* Catch use-after-free bugs */
+ pcpu_rwlock->rw_state = NULL;
+}
+
void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
{
read_lock(&pcpu_rwlock->global_rwlock);
* [PATCH v6 01/46] percpu_rwlock: Introduce the global reader-writer lock backend
From: Srivatsa S. Bhat @ 2013-02-18 12:38 UTC
To: tglx, peterz, tj, oleg, paulmck, rusty, mingo, akpm, namhyung
Cc: linux-arch, linux, nikunj, linux-pm, fweisbec, linux-doc,
linux-kernel, rostedt, xiaoguangrong, rjw, sbw, wangyun,
srivatsa.bhat, netdev, vincent.guittot, walken, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <20130218123714.26245.61816.stgit@srivatsabhat.in.ibm.com>
A straightforward (and obvious) algorithm for implementing Per-CPU
Reader-Writer locks can lead to so many deadlock possibilities that the locks
become very hard, if not impossible, to use. The example below explains this
and helps justify the need for a different algorithm to implement flexible
Per-CPU Reader-Writer locks.
We can use global rwlocks as shown below safely, without fear of deadlocks:
Readers:

         CPU 0                                CPU 1
         ------                               ------

1.    spin_lock(&random_lock);             read_lock(&my_rwlock);

2.    read_lock(&my_rwlock);               spin_lock(&random_lock);

Writer:

         CPU 2:
         ------

       write_lock(&my_rwlock);

We can observe that there is no possibility of deadlocks or circular locking
dependencies here. It's perfectly safe.
Now consider a blind/straightforward conversion of global rwlocks to per-CPU
rwlocks like this:
The reader locks its own per-CPU rwlock for read, and proceeds.
Something like: read_lock(per-cpu rwlock of this cpu);
The writer acquires all per-CPU rwlocks for write and only then proceeds.
Something like:
for_each_online_cpu(cpu)
write_lock(per-cpu rwlock of 'cpu');
Now let's say that for performance reasons, the above scenario (which was
perfectly safe when using global rwlocks) was converted to use per-CPU rwlocks.

         CPU 0                                CPU 1
         ------                               ------

1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);

2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);

Writer:

         CPU 2:
         ------

      for_each_online_cpu(cpu)
        write_lock(my_rwlock of 'cpu');

Consider what happens if the writer begins its operation between steps 1
and 2 on the reader side. We end up in a (previously non-existent) deadlock
due to a circular locking dependency between the three entities, like this:

(holds              Waiting for
 random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
                                             for write)
               ^                   |
               |                   |
        Waiting|                   | Waiting
          for  |                   |  for
               |                   V
                ------ CPU 1 <------

                (holds my_rwlock of
                 CPU 1 for read)

So this "straightforward" way of implementing percpu rwlocks is clearly
deadlock-prone. One simple measure for (or characteristic of) a safe percpu
rwlock is that a user who replaces global rwlocks with per-CPU rwlocks
(for performance reasons) shouldn't suddenly face numerous deadlock
possibilities which never existed before. The replacement should remain
safe, and perhaps improve performance.
Observing the robustness of global rwlocks in providing a fair amount of
deadlock safety, we implement per-CPU rwlocks as nothing but global rwlocks,
as a first step.
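The two scenarios above can also be checked mechanically. The following is a
small userspace sketch, not part of the patch: it encodes "who waits for whom"
as an array and walks the edges looking for a cycle. The names has_wait_cycle,
global_rwlock_waits and naive_percpu_waits are hypothetical, and the model
assumes each CPU waits for at most one other CPU.

```c
#include <assert.h>
#include <stdbool.h>

enum { NR_ACTORS = 3, NOBODY = -1 };

/*
 * waits_for[i] is the CPU that CPU i is blocked on, or NOBODY.
 *
 * Global rwlock case: CPU 1 holds my_rwlock for read and waits for
 * random_lock (held by CPU 0); the writer on CPU 2 waits for CPU 1 to
 * drop the read lock; CPU 0 can take the read lock, so it waits for nobody.
 */
const int global_rwlock_waits[NR_ACTORS] = { NOBODY, 0, 1 };

/*
 * Naive per-CPU case: the writer on CPU 2 already holds CPU 0's rwlock
 * for write, so CPU 0 waits for CPU 2; CPU 2 waits for CPU 1's rwlock;
 * CPU 1 waits for random_lock held by CPU 0.
 */
const int naive_percpu_waits[NR_ACTORS] = { 2, 0, 1 };

/* Return true if following the wait edges from 'start' leads back to it. */
bool has_wait_cycle(const int waits_for[], int n, int start)
{
	int cur, steps;

	assert(start >= 0 && start < n);

	cur = waits_for[start];
	for (steps = 0; steps < n && cur != NOBODY; steps++) {
		if (cur == start)
			return true;
		cur = waits_for[cur];
	}
	return false;
}
```

Calling has_wait_cycle() on naive_percpu_waits reports the circular dependency
drawn in the diagram, while global_rwlock_waits has none from any starting CPU.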
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
include/linux/percpu-rwlock.h | 49 ++++++++++++++++++++++++++++++++
lib/Kconfig | 3 ++
lib/Makefile | 1 +
lib/percpu-rwlock.c | 63 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 116 insertions(+)
create mode 100644 include/linux/percpu-rwlock.h
create mode 100644 lib/percpu-rwlock.c
diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
new file mode 100644
index 0000000..0caf81f
--- /dev/null
+++ b/include/linux/percpu-rwlock.h
@@ -0,0 +1,49 @@
+/*
+ * Flexible Per-CPU Reader-Writer Locks
+ * (with relaxed locking rules and reduced deadlock-possibilities)
+ *
+ * Copyright (C) IBM Corporation, 2012-2013
+ * Author: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
+ *
+ * With lots of invaluable suggestions from:
+ * Oleg Nesterov <oleg@redhat.com>
+ * Tejun Heo <tj@kernel.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _LINUX_PERCPU_RWLOCK_H
+#define _LINUX_PERCPU_RWLOCK_H
+
+#include <linux/percpu.h>
+#include <linux/lockdep.h>
+#include <linux/spinlock.h>
+
+struct percpu_rwlock {
+ rwlock_t global_rwlock;
+};
+
+extern void percpu_read_lock(struct percpu_rwlock *);
+extern void percpu_read_unlock(struct percpu_rwlock *);
+
+extern void percpu_write_lock(struct percpu_rwlock *);
+extern void percpu_write_unlock(struct percpu_rwlock *);
+
+extern int __percpu_init_rwlock(struct percpu_rwlock *,
+ const char *, struct lock_class_key *);
+
+#define percpu_init_rwlock(pcpu_rwlock) \
+({ static struct lock_class_key rwlock_key; \
+ __percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key); \
+})
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 75cdb77..32fb0b9 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -45,6 +45,9 @@ config STMP_DEVICE
config PERCPU_RWSEM
boolean
+config PERCPU_RWLOCK
+ boolean
+
config CRC_CCITT
tristate "CRC-CCITT functions"
help
diff --git a/lib/Makefile b/lib/Makefile
index 02ed6c0..1854b5e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+lib-$(CONFIG_PERCPU_RWLOCK) += percpu-rwlock.o
CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
new file mode 100644
index 0000000..111a238
--- /dev/null
+++ b/lib/percpu-rwlock.c
@@ -0,0 +1,63 @@
+/*
+ * Flexible Per-CPU Reader-Writer Locks
+ * (with relaxed locking rules and reduced deadlock-possibilities)
+ *
+ * Copyright (C) IBM Corporation, 2012-2013
+ * Author: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
+ *
+ * With lots of invaluable suggestions from:
+ * Oleg Nesterov <oleg@redhat.com>
+ * Tejun Heo <tj@kernel.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/lockdep.h>
+#include <linux/percpu-rwlock.h>
+#include <linux/errno.h>
+
+
+int __percpu_init_rwlock(struct percpu_rwlock *pcpu_rwlock,
+ const char *name, struct lock_class_key *rwlock_key)
+{
+ /* ->global_rwlock represents the whole percpu_rwlock for lockdep */
+#ifdef CONFIG_DEBUG_SPINLOCK
+ __rwlock_init(&pcpu_rwlock->global_rwlock, name, rwlock_key);
+#else
+ pcpu_rwlock->global_rwlock =
+ __RW_LOCK_UNLOCKED(&pcpu_rwlock->global_rwlock);
+#endif
+ return 0;
+}
+
+void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
+{
+ read_lock(&pcpu_rwlock->global_rwlock);
+}
+
+void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
+{
+ read_unlock(&pcpu_rwlock->global_rwlock);
+}
+
+void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
+{
+ write_lock(&pcpu_rwlock->global_rwlock);
+}
+
+void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
+{
+ write_unlock(&pcpu_rwlock->global_rwlock);
+}
+
* [PATCH v6 00/46] CPU hotplug: stop_machine()-free CPU hotplug
From: Srivatsa S. Bhat @ 2013-02-18 12:38 UTC
To: tglx, peterz, tj, oleg, paulmck, rusty, mingo, akpm, namhyung
Cc: linux-arch, linux, nikunj, linux-pm, fweisbec, linux-doc,
linux-kernel, rostedt, xiaoguangrong, rjw, sbw, wangyun,
srivatsa.bhat, netdev, vincent.guittot, walken, linuxppc-dev,
linux-arm-kernel
Hi,
This patchset removes CPU hotplug's dependence on stop_machine() from the CPU
offline path and provides an alternative (set of APIs) to preempt_disable() to
prevent CPUs from going offline, which can be invoked from atomic context.
The motivation behind the removal of stop_machine() is to avoid its ill-effects
and thus improve the design of CPU hotplug. (More description regarding this
is available in the patches).
All users of preempt_disable()/local_irq_disable() that relied on it to
prevent CPU offline have been converted to the new primitives introduced in
this patchset. Also, the CPU_DYING notifiers have been audited to check whether
they can cope with the removal of stop_machine() or whether they need new
locks for synchronization (all CPU_DYING notifiers looked OK, without the need
for any new locks).
Applies on current mainline (v3.8-rc7+).
This patchset is available in the following git branch:
git://github.com/srivatsabhat/linux.git stop-machine-free-cpu-hotplug-v6
Overview of the patches:
-----------------------
Patches 1 to 7 introduce a generic, flexible Per-CPU Reader-Writer Locking
scheme.
Patch 8 uses this synchronization mechanism to build the
get/put_online_cpus_atomic() APIs which can be used from atomic context, to
prevent CPUs from going offline.
Patch 9 is a cleanup; it converts preprocessor macros to static inline
functions.
Patches 10 to 43 convert various call-sites to use the new APIs.
Patch 44 is the one which actually removes stop_machine() from the CPU
offline path.
Patch 45 decouples stop_machine() and CPU hotplug from Kconfig.
Patch 46 updates the documentation to reflect the new APIs.
Changes in v6:
--------------
* Fixed issues related to memory barriers, as pointed out by Paul and Oleg.
* Fixed the locking issue related to clockevents_lock, which was being
triggered when cpu idle was enabled.
* Some code restructuring to improve readability and to enhance some fastpath
optimizations.
* Randconfig build-fixes, reported by Fengguang Wu.
Changes in v5:
--------------
Exposed a new generic locking scheme: Flexible Per-CPU Reader-Writer locks,
based on the synchronization schemes already discussed in the previous
versions, and used it in CPU hotplug, to implement the new APIs.
Audited the CPU_DYING notifiers in the kernel source tree and replaced
usages of preempt_disable() with the new get/put_online_cpus_atomic() APIs
where necessary.
Changes in v4:
--------------
The synchronization scheme has been simplified quite a bit, which makes it
look a lot less complex than before. Some highlights:
* Implicit ACKs:
The earlier design required the readers to explicitly ACK the writer's
signal. The new design uses implicit ACKs instead. The reader switching
over to rwlock implicitly tells the writer to stop waiting for that reader.
* No atomic operations:
Since we got rid of explicit ACKs, we no longer have the need for a reader
and a writer to update the same counter. So we can get rid of atomic ops
too.
Changes in v3:
--------------
* Dropped the _light() and _full() variants of the APIs. Provided a single
interface: get/put_online_cpus_atomic().
* Completely redesigned the synchronization mechanism again, to make it
fast and scalable at the reader-side in the fast-path (when no hotplug
writers are active). This new scheme also ensures that there is no
possibility of deadlocks due to circular locking dependency.
In summary, this provides the scalability and speed of per-cpu rwlocks
(without actually using them), while avoiding the downside (deadlock
possibilities) which is inherent in any per-cpu locking scheme that is
meant to compete with preempt_disable()/enable() in terms of flexibility.
The problem with using per-cpu locking to replace preempt_disable()/enable()
was explained here:
https://lkml.org/lkml/2012/12/6/290
Basically we use per-cpu counters (for scalability) when no writers are
active, and then switch to global rwlocks (for lock-safety) when a writer
becomes active. It is a slightly complex scheme, but it is based on
standard principles of distributed algorithms.
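To make the "per-cpu counters when no writer is active, global rwlock
otherwise" idea concrete, here is a single-threaded userspace model. It is
only a sketch under stated assumptions: it is not the code from the patches,
it omits memory barriers, IRQ safety and reader nesting, and it returns a
"would spin" result where a real lock would wait. All model_* names and
MODEL_NR_CPUS are hypothetical; only the two rw_state fields come from
patch 02.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4		/* assumption: fixed CPU count for the model */

/* Mirrors the two per-CPU fields introduced in patch 02. */
struct model_rw_state {
	unsigned long reader_refcnt;
	bool writer_signal;
};

struct model_pcpu_rwlock {
	struct model_rw_state rw_state[MODEL_NR_CPUS];
	unsigned long global_readers;	/* readers holding the global rwlock */
	bool global_write_held;		/* writer holds the global rwlock */
};

enum model_read_result { READ_FASTPATH, READ_SLOWPATH, READ_WOULD_SPIN };

/* Reader: use the per-CPU counter unless a writer has signalled this CPU. */
enum model_read_result model_read_lock(struct model_pcpu_rwlock *l, int cpu)
{
	struct model_rw_state *s;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	s = &l->rw_state[cpu];

	if (!s->writer_signal) {
		s->reader_refcnt++;
		return READ_FASTPATH;
	}
	if (l->global_write_held)
		return READ_WOULD_SPIN;	/* a real reader would wait here */

	l->global_readers++;		/* switch over to the global rwlock */
	return READ_SLOWPATH;
}

void model_read_unlock(struct model_pcpu_rwlock *l, int cpu,
		       enum model_read_result how)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	if (how == READ_FASTPATH) {
		assert(l->rw_state[cpu].reader_refcnt > 0);
		l->rw_state[cpu].reader_refcnt--;
	} else if (how == READ_SLOWPATH) {
		assert(l->global_readers > 0);
		l->global_readers--;
	}
}

/* Writer: signal every CPU, then succeed only once all readers are gone. */
bool model_write_trylock(struct model_pcpu_rwlock *l)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		l->rw_state[cpu].writer_signal = true;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (l->rw_state[cpu].reader_refcnt)
			return false;	/* a real writer would keep waiting */

	if (l->global_readers || l->global_write_held)
		return false;

	l->global_write_held = true;
	return true;
}

void model_write_unlock(struct model_pcpu_rwlock *l)
{
	int cpu;

	assert(l->global_write_held);
	l->global_write_held = false;
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		l->rw_state[cpu].writer_signal = false;
}
```

In this model a reader that arrives after the writer's signal never touches
its per-CPU counter, so the writer only has to wait for the counters of
readers that were already inside, which loosely corresponds to the implicit
ACK described under the v4 changes.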
Changes in v2:
-------------
* Completely redesigned the synchronization scheme to avoid using any extra
cpumasks.
* Provided APIs for 2 types of atomic hotplug readers: "light" (for
light-weight) and "full". We wish to have more "light" readers than
the "full" ones, to avoid indirectly inducing the "stop_machine effect"
without even actually using stop_machine().
And the patches show that it _is_ generally true: 5 patches deal with
"light" readers, whereas only 1 patch deals with a "full" reader.
Also, the "light" readers happen to be in very hot paths. So it makes a
lot of sense to have such a distinction and a corresponding light-weight
API.
Links to previous versions:
v5: http://lwn.net/Articles/533553/
v4: https://lkml.org/lkml/2012/12/11/209
v3: https://lkml.org/lkml/2012/12/7/287
v2: https://lkml.org/lkml/2012/12/5/322
v1: https://lkml.org/lkml/2012/12/4/88
--
Paul E. McKenney (1):
cpu: No more __stop_machine() in _cpu_down()
Srivatsa S. Bhat (45):
percpu_rwlock: Introduce the global reader-writer lock backend
percpu_rwlock: Introduce per-CPU variables for the reader and the writer
percpu_rwlock: Provide a way to define and init percpu-rwlocks at compile time
percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
percpu_rwlock: Make percpu-rwlocks IRQ-safe, optimally
percpu_rwlock: Rearrange the read-lock code to fastpath nested percpu readers
percpu_rwlock: Allow writers to be readers, and add lockdep annotations
CPU hotplug: Provide APIs to prevent CPU offline from atomic context
CPU hotplug: Convert preprocessor macros to static inline functions
smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly
smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly
sched/timer: Use get/put_online_cpus_atomic() to prevent CPU offline
sched/migration: Use raw_spin_lock/unlock since interrupts are already disabled
sched/rt: Use get/put_online_cpus_atomic() to prevent CPU offline
tick: Use get/put_online_cpus_atomic() to prevent CPU offline
time/clocksource: Use get/put_online_cpus_atomic() to prevent CPU offline
clockevents: Use get/put_online_cpus_atomic() in clockevents_notify()
softirq: Use get/put_online_cpus_atomic() to prevent CPU offline
irq: Use get/put_online_cpus_atomic() to prevent CPU offline
net: Use get/put_online_cpus_atomic() to prevent CPU offline
block: Use get/put_online_cpus_atomic() to prevent CPU offline
crypto: pcrypt - Protect access to cpu_online_mask with get/put_online_cpus()
infiniband: ehca: Use get/put_online_cpus_atomic() to prevent CPU offline
[SCSI] fcoe: Use get/put_online_cpus_atomic() to prevent CPU offline
staging: octeon: Use get/put_online_cpus_atomic() to prevent CPU offline
x86: Use get/put_online_cpus_atomic() to prevent CPU offline
perf/x86: Use get/put_online_cpus_atomic() to prevent CPU offline
KVM: Use get/put_online_cpus_atomic() to prevent CPU offline from atomic context
kvm/vmx: Use get/put_online_cpus_atomic() to prevent CPU offline
x86/xen: Use get/put_online_cpus_atomic() to prevent CPU offline
alpha/smp: Use get/put_online_cpus_atomic() to prevent CPU offline
blackfin/smp: Use get/put_online_cpus_atomic() to prevent CPU offline
cris/smp: Use get/put_online_cpus_atomic() to prevent CPU offline
hexagon/smp: Use get/put_online_cpus_atomic() to prevent CPU offline
ia64: Use get/put_online_cpus_atomic() to prevent CPU offline
m32r: Use get/put_online_cpus_atomic() to prevent CPU offline
MIPS: Use get/put_online_cpus_atomic() to prevent CPU offline
mn10300: Use get/put_online_cpus_atomic() to prevent CPU offline
parisc: Use get/put_online_cpus_atomic() to prevent CPU offline
powerpc: Use get/put_online_cpus_atomic() to prevent CPU offline
sh: Use get/put_online_cpus_atomic() to prevent CPU offline
sparc: Use get/put_online_cpus_atomic() to prevent CPU offline
tile: Use get/put_online_cpus_atomic() to prevent CPU offline
CPU hotplug, stop_machine: Decouple CPU hotplug from stop_machine() in Kconfig
Documentation/cpu-hotplug: Remove references to stop_machine()
Documentation/cpu-hotplug.txt | 17 +-
arch/alpha/kernel/smp.c | 19 +-
arch/arm/Kconfig | 1
arch/blackfin/Kconfig | 1
arch/blackfin/mach-common/smp.c | 6 -
arch/cris/arch-v32/kernel/smp.c | 8 +
arch/hexagon/kernel/smp.c | 5
arch/ia64/Kconfig | 1
arch/ia64/kernel/irq_ia64.c | 13 +
arch/ia64/kernel/perfmon.c | 6 +
arch/ia64/kernel/smp.c | 23 ++
arch/ia64/mm/tlb.c | 6 -
arch/m32r/kernel/smp.c | 12 +
arch/mips/Kconfig | 1
arch/mips/kernel/cevt-smtc.c | 8 +
arch/mips/kernel/smp.c | 16 +-
arch/mips/kernel/smtc.c | 3
arch/mips/mm/c-octeon.c | 4
arch/mn10300/Kconfig | 1
arch/mn10300/kernel/smp.c | 2
arch/mn10300/mm/cache-smp.c | 5
arch/mn10300/mm/tlb-smp.c | 15 +
arch/parisc/Kconfig | 1
arch/parisc/kernel/smp.c | 4
arch/powerpc/Kconfig | 1
arch/powerpc/mm/mmu_context_nohash.c | 2
arch/s390/Kconfig | 1
arch/sh/Kconfig | 1
arch/sh/kernel/smp.c | 12 +
arch/sparc/Kconfig | 1
arch/sparc/kernel/leon_smp.c | 2
arch/sparc/kernel/smp_64.c | 9 -
arch/sparc/kernel/sun4d_smp.c | 2
arch/sparc/kernel/sun4m_smp.c | 3
arch/tile/kernel/smp.c | 4
arch/x86/Kconfig | 1
arch/x86/include/asm/ipi.h | 5
arch/x86/kernel/apic/apic_flat_64.c | 10 +
arch/x86/kernel/apic/apic_numachip.c | 5
arch/x86/kernel/apic/es7000_32.c | 5
arch/x86/kernel/apic/io_apic.c | 7 -
arch/x86/kernel/apic/ipi.c | 10 +
arch/x86/kernel/apic/x2apic_cluster.c | 4
arch/x86/kernel/apic/x2apic_uv_x.c | 4
arch/x86/kernel/cpu/mcheck/therm_throt.c | 4
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 5
arch/x86/kvm/vmx.c | 8 +
arch/x86/mm/tlb.c | 14 +
arch/x86/xen/mmu.c | 11 +
arch/x86/xen/smp.c | 9 +
block/blk-softirq.c | 4
crypto/pcrypt.c | 4
drivers/infiniband/hw/ehca/ehca_irq.c | 8 +
drivers/scsi/fcoe/fcoe.c | 7 +
drivers/staging/octeon/ethernet-rx.c | 3
include/linux/cpu.h | 8 +
include/linux/percpu-rwlock.h | 74 +++++++
include/linux/stop_machine.h | 2
init/Kconfig | 2
kernel/cpu.c | 59 +++++-
kernel/irq/manage.c | 7 +
kernel/sched/core.c | 36 +++-
kernel/sched/fair.c | 5
kernel/sched/rt.c | 3
kernel/smp.c | 65 ++++--
kernel/softirq.c | 3
kernel/time/clockevents.c | 3
kernel/time/clocksource.c | 5
kernel/time/tick-broadcast.c | 2
kernel/timer.c | 2
lib/Kconfig | 3
lib/Makefile | 1
lib/percpu-rwlock.c | 256 +++++++++++++++++++++++++
net/core/dev.c | 9 +
virt/kvm/kvm_main.c | 10 +
75 files changed, 776 insertions(+), 123 deletions(-)
create mode 100644 include/linux/percpu-rwlock.h
create mode 100644 lib/percpu-rwlock.c
Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
* Re: [PATCH] i2c: Remove unneeded xxx_set_drvdata(..., NULL) calls
From: Marek Vasut @ 2013-02-18 12:17 UTC
To: Doug Anderson
Cc: Wolfram Sang, Tony Lindgren, Linus Walleij, Thierry Reding,
Sekhar Nori, linux-i2c, Guan Xuetao, Kevin Hilman, Sonic Zhang,
linux-arm-kernel, Deepak Sikri, Havard Skinnemoen, Pawel Moll,
Stephen Warren, Sascha Hauer, Uwe Kleine-König, Rob Herring,
uclinux-dist-devel, Jean Delvare, Lars-Peter Clausen,
Ben Dooks (embedded platforms), Barry Song, linux-omap,
Mika Westerberg, Oskar Schirmer, Fabio Estevam,
davinci-linux-open-source, Shawn Guo, Jim Cromie,
Greg Kroah-Hartman, Tomoya MORINAGA, linux-kernel, Kyungmin Park,
Viresh Kumar, Karol Lewandowski, Jiri Kosina, STEricsson,
Joe Perches, Andrew Morton, Alessandro Rubini, linuxppc-dev,
Alexander Stein
In-Reply-To: <1360970315-32116-1-git-send-email-dianders@chromium.org>
Dear Doug Anderson,
> There is simply no reason to be manually setting the private driver
> data to NULL in the remove/fail to probe cases. This is just extra
> cruft code that can be removed.
>
> A few notes:
> * Nothing relies on drvdata being set to NULL.
> * The __device_release_driver() function eventually calls
> dev_set_drvdata(dev, NULL) anyway, so there's no need to do it
> twice.
> * I verified that there were no cases where xxx_get_drvdata() was
> being called in these drivers and checking for / relying on the NULL
> return value.
>
> This could be cleaned up kernel-wide but for now just take the baby
> step and remove from the i2c subsystem.
>
> Reported-by: Wolfram Sang <wsa@the-dreams.de>
> Reported-by: Stephen Warren <swarren@wwwdotorg.org>
> Signed-off-by: Doug Anderson <dianders@chromium.org>
For
> drivers/i2c/busses/i2c-mxs.c | 2 --
[...]
> diff --git a/drivers/i2c/busses/i2c-mxs.c b/drivers/i2c/busses/i2c-mxs.c
> index 22d8ad3..120f246 100644
> --- a/drivers/i2c/busses/i2c-mxs.c
> +++ b/drivers/i2c/busses/i2c-mxs.c
> @@ -697,8 +697,6 @@ static int mxs_i2c_remove(struct platform_device *pdev)
>
> writel(MXS_I2C_CTRL0_SFTRST, i2c->regs + MXS_I2C_CTRL0_SET);
>
> - platform_set_drvdata(pdev, NULL);
> -
> return 0;
> }
[...]
Add my:
Reviewed-by: Marek Vasut <marex@denx.de>
* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Srivatsa S. Bhat @ 2013-02-18 10:57 UTC
To: Thomas Gleixner
Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
linux-arch, Russell King - ARM Linux, xiaoguangrong, wangyun,
paulmck, nikunj, linux-pm, Rusty Russell, rostedt, rjw, namhyung,
linux-arm-kernel, netdev, oleg, Vincent Guittot, sbw, tj, akpm,
linuxppc-dev
In-Reply-To: <alpine.LFD.2.02.1302181151040.22263@ionos>
On 02/18/2013 04:24 PM, Thomas Gleixner wrote:
> On Mon, 18 Feb 2013, Srivatsa S. Bhat wrote:
>> Lockup observed while running this patchset, with CPU_IDLE and INTEL_IDLE turned
>> on in the .config:
>>
>> smpboot: CPU 1 is now offline
>> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 11
>> Pid: 0, comm: swapper/11 Not tainted 3.8.0-rc7+stpmch13-1 #8
>> Call Trace:
>> [<ffffffff812aba1e>] do_raw_spin_lock+0x7e/0x150
>> [<ffffffff815a64c1>] _raw_spin_lock_irqsave+0x61/0x70
>> [<ffffffff810c0758>] ? clockevents_notify+0x28/0x150
>> [<ffffffff815a6d37>] ? _raw_spin_unlock_irqrestore+0x77/0x80
>> [<ffffffff810c0758>] clockevents_notify+0x28/0x150
>> [<ffffffff8130459f>] intel_idle+0xaf/0xe0
>> [<ffffffff81472ee0>] ? disable_cpuidle+0x20/0x20
>> [<ffffffff81472ef9>] cpuidle_enter+0x19/0x20
>> [<ffffffff814734c1>] cpuidle_wrap_enter+0x41/0xa0
>> [<ffffffff81473530>] cpuidle_enter_tk+0x10/0x20
>> [<ffffffff81472f17>] cpuidle_enter_state+0x17/0x50
>> [<ffffffff81473899>] cpuidle_idle_call+0xd9/0x290
>> [<ffffffff810203d5>] cpu_idle+0xe5/0x140
>> [<ffffffff8159c603>] start_secondary+0xdd/0xdf
>
>> BUG: spinlock lockup suspected on CPU#2, migration/2/19
>> lock: clockevents_lock+0x0/0x40, .magic: dead4ead, .owner: swapper/8/0, .owner_cpu: 8
>
> Unfortunately there is no back trace for cpu8.
Yes :-(
I had run this several times hoping to get a backtrace on the lock-holder,
expecting trigger_all_cpu_backtrace() to get it right at least once. But I
hadn't succeeded even once.
> That's probably caused
> by the watchdog -> panic setting.
>
Oh, ok..
> So we have no idea why cpu2 and 11 get stuck on the clockevents_lock
> and without that information it's impossible to decode.
>
But thankfully, the issue seems to have been resolved by the diff I posted
in my previous mail, along with the fixes related to memory barriers.
Regards,
Srivatsa S. Bhat
* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Vincent Guittot @ 2013-02-18 10:58 UTC
To: Srivatsa S. Bhat
Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
linux-arch, Russell King - ARM Linux, xiaoguangrong, wangyun,
paulmck, nikunj, linux-pm, Rusty Russell, rostedt, rjw, namhyung,
tglx, linux-arm-kernel, netdev, oleg, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <512207A6.4000402@linux.vnet.ibm.com>
On 18 February 2013 11:51, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 02/18/2013 04:04 PM, Srivatsa S. Bhat wrote:
>> On 02/18/2013 03:54 PM, Vincent Guittot wrote:
>>> On 15 February 2013 20:40, Srivatsa S. Bhat
>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>> Hi Vincent,
>>>>
>>>> On 02/15/2013 06:58 PM, Vincent Guittot wrote:
>>>>> Hi Srivatsa,
>>>>>
>>>>> I have run some tests with you branch (thanks Paul for the git tree)
>>>>> and you will find results below.
>>>>>
>>>>
>>>> Thank you very much for testing this patchset!
>>>>
>>>>> The tests condition are:
>>>>> - 5 CPUs system in 2 clusters
>>>>> - The test plugs/unplugs CPU2 and it increases the system load each 20
>>>>> plug/unplug sequence with either more cyclictests threads
>>>>> - The test is done with all CPUs online and with only CPU0 and CPU2
>>>>>
>>>>> The main conclusion is that there is no differences with and without
>>>>> your patches with my stress tests. I'm not sure that it was the
>>>>> expected results but the cpu_down is already quite low : 4-5ms in
>>>>> average
>>>>>
>>>>
>>>> Atleast my patchset doesn't perform _worse_ than mainline, with respect
>>>> to cpu_down duration :-)
>>>
>>> yes exactly and it has pass more than 400 consecutive plug/unplug on
>>> an ARM platform
>>>
>>
>> Great! However, did you turn on CPU_IDLE during your tests?
>>
>> In my tests, I had turned off cpu idle in the .config, like I had mentioned in
>> the cover letter. I'm struggling to get it working with CPU_IDLE/INTEL_IDLE
>> turned on, because it gets into a lockup almost immediately. It appears that
>> the lock-holder of clockevents_lock never releases it, for some reason..
>> See below for the full log. Lockdep has not been useful in debugging this,
>> unfortunately :-(
>>
>
> Ah, nevermind, the following diff fixes it :-) I had applied this fix on v5
> and tested but it still had races where I used to hit the lockups. Now after
> I fixed all the memory barrier issues that Paul and Oleg pointed out in v5,
> I applied this fix again and tested it just now - it works beautifully! :-)
My tests have been done without cpuidle because I have some issues
with the function tracer and cpuidle.
But CPU hotplug and cpuidle work well when I run the tests without
enabling the function tracer.
Vincent
>
> I'll include this fix and post a v6 soon.
>
> Regards,
> Srivatsa S. Bhat
>
> --------------------------------------------------------------------------->
>
>
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 30b6de0..ca340fd 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -17,6 +17,7 @@
> #include <linux/module.h>
> #include <linux/notifier.h>
> #include <linux/smp.h>
> +#include <linux/cpu.h>
>
> #include "tick-internal.h"
>
> @@ -431,6 +432,7 @@ void clockevents_notify(unsigned long reason, void *arg)
> unsigned long flags;
> int cpu;
>
> + get_online_cpus_atomic();
> raw_spin_lock_irqsave(&clockevents_lock, flags);
> clockevents_do_notify(reason, arg);
>
> @@ -459,6 +461,7 @@ void clockevents_notify(unsigned long reason, void *arg)
> break;
> }
> raw_spin_unlock_irqrestore(&clockevents_lock, flags);
> + put_online_cpus_atomic();
> }
> EXPORT_SYMBOL_GPL(clockevents_notify);
> #endif
>
* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Thomas Gleixner @ 2013-02-18 10:54 UTC
To: Srivatsa S. Bhat
Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
linux-arch, Russell King - ARM Linux, xiaoguangrong, wangyun,
paulmck, nikunj, linux-pm, Rusty Russell, rostedt, rjw, namhyung,
linux-arm-kernel, netdev, oleg, Vincent Guittot, sbw, tj, akpm,
linuxppc-dev
In-Reply-To: <512203B3.7090002@linux.vnet.ibm.com>
On Mon, 18 Feb 2013, Srivatsa S. Bhat wrote:
> Lockup observed while running this patchset, with CPU_IDLE and INTEL_IDLE turned
> on in the .config:
>
> smpboot: CPU 1 is now offline
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 11
> Pid: 0, comm: swapper/11 Not tainted 3.8.0-rc7+stpmch13-1 #8
> Call Trace:
> [<ffffffff812aba1e>] do_raw_spin_lock+0x7e/0x150
> [<ffffffff815a64c1>] _raw_spin_lock_irqsave+0x61/0x70
> [<ffffffff810c0758>] ? clockevents_notify+0x28/0x150
> [<ffffffff815a6d37>] ? _raw_spin_unlock_irqrestore+0x77/0x80
> [<ffffffff810c0758>] clockevents_notify+0x28/0x150
> [<ffffffff8130459f>] intel_idle+0xaf/0xe0
> [<ffffffff81472ee0>] ? disable_cpuidle+0x20/0x20
> [<ffffffff81472ef9>] cpuidle_enter+0x19/0x20
> [<ffffffff814734c1>] cpuidle_wrap_enter+0x41/0xa0
> [<ffffffff81473530>] cpuidle_enter_tk+0x10/0x20
> [<ffffffff81472f17>] cpuidle_enter_state+0x17/0x50
> [<ffffffff81473899>] cpuidle_idle_call+0xd9/0x290
> [<ffffffff810203d5>] cpu_idle+0xe5/0x140
> [<ffffffff8159c603>] start_secondary+0xdd/0xdf
> BUG: spinlock lockup suspected on CPU#2, migration/2/19
> lock: clockevents_lock+0x0/0x40, .magic: dead4ead, .owner: swapper/8/0, .owner_cpu: 8
Unfortunately there is no back trace for cpu8. That's probably caused
by the watchdog -> panic setting.
So we have no idea why cpu2 and 11 get stuck on the clockevents_lock
and without that information it's impossible to decode.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Srivatsa S. Bhat @ 2013-02-18 10:51 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
linux-arch, Russell King - ARM Linux, xiaoguangrong, wangyun,
paulmck, nikunj, linux-pm, Rusty Russell, rostedt, rjw, namhyung,
tglx, linux-arm-kernel, netdev, oleg, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <512203B3.7090002@linux.vnet.ibm.com>
On 02/18/2013 04:04 PM, Srivatsa S. Bhat wrote:
> On 02/18/2013 03:54 PM, Vincent Guittot wrote:
>> On 15 February 2013 20:40, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>> Hi Vincent,
>>>
>>> On 02/15/2013 06:58 PM, Vincent Guittot wrote:
>>>> Hi Srivatsa,
>>>>
>>>> I have run some tests with your branch (thanks Paul for the git tree)
>>>> and you will find results below.
>>>>
>>>
>>> Thank you very much for testing this patchset!
>>>
>>>> The tests condition are:
>>>> - 5 CPUs system in 2 clusters
>>>> - The test plugs/unplugs CPU2 and it increases the system load each 20
>>>> plug/unplug sequence with either more cyclictests threads
>>>> - The test is done with all CPUs online and with only CPU0 and CPU2
>>>>
>>>> The main conclusion is that there is no differences with and without
>>>> your patches with my stress tests. I'm not sure that it was the
>>>> expected results but the cpu_down is already quite low : 4-5ms in
>>>> average
>>>>
>>>
>>> At least my patchset doesn't perform _worse_ than mainline, with respect
>>> to cpu_down duration :-)
>>
>> yes exactly and it has pass more than 400 consecutive plug/unplug on
>> an ARM platform
>>
>
> Great! However, did you turn on CPU_IDLE during your tests?
>
> In my tests, I had turned off cpu idle in the .config, like I had mentioned in
> the cover letter. I'm struggling to get it working with CPU_IDLE/INTEL_IDLE
> turned on, because it gets into a lockup almost immediately. It appears that
> the lock-holder of clockevents_lock never releases it, for some reason..
> See below for the full log. Lockdep has not been useful in debugging this,
> unfortunately :-(
>
Ah, never mind, the following diff fixes it :-) I had applied this fix on v5
and tested it, but v5 still had races that caused the lockups. Now that
I have fixed all the memory barrier issues that Paul and Oleg pointed out in v5,
I applied this fix again and tested it just now - it works beautifully! :-)
I'll include this fix and post a v6 soon.
Regards,
Srivatsa S. Bhat
--------------------------------------------------------------------------->
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 30b6de0..ca340fd 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -17,6 +17,7 @@
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/cpu.h>
#include "tick-internal.h"
@@ -431,6 +432,7 @@ void clockevents_notify(unsigned long reason, void *arg)
unsigned long flags;
int cpu;
+ get_online_cpus_atomic();
raw_spin_lock_irqsave(&clockevents_lock, flags);
clockevents_do_notify(reason, arg);
@@ -459,6 +461,7 @@ void clockevents_notify(unsigned long reason, void *arg)
break;
}
raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+ put_online_cpus_atomic();
}
EXPORT_SYMBOL_GPL(clockevents_notify);
#endif
^ permalink raw reply related
* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Srivatsa S. Bhat @ 2013-02-18 10:34 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
linux-arch, Russell King - ARM Linux, xiaoguangrong, wangyun,
paulmck, nikunj, linux-pm, Rusty Russell, rostedt, rjw, namhyung,
tglx, linux-arm-kernel, netdev, oleg, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <CAKfTPtD=2jn1AVjuhVP1ot7_8x9-W4==hNGZET5N9tRE7gtyMw@mail.gmail.com>
On 02/18/2013 03:54 PM, Vincent Guittot wrote:
> On 15 February 2013 20:40, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> On 02/15/2013 06:58 PM, Vincent Guittot wrote:
>>> Hi Srivatsa,
>>>
>>> I have run some tests with your branch (thanks Paul for the git tree)
>>> and you will find results below.
>>>
>>
>> Thank you very much for testing this patchset!
>>
>>> The tests condition are:
>>> - 5 CPUs system in 2 clusters
>>> - The test plugs/unplugs CPU2 and it increases the system load each 20
>>> plug/unplug sequence with either more cyclictests threads
>>> - The test is done with all CPUs online and with only CPU0 and CPU2
>>>
>>> The main conclusion is that there is no differences with and without
>>> your patches with my stress tests. I'm not sure that it was the
>>> expected results but the cpu_down is already quite low : 4-5ms in
>>> average
>>>
>>
>> At least my patchset doesn't perform _worse_ than mainline, with respect
>> to cpu_down duration :-)
>
> yes exactly and it has pass more than 400 consecutive plug/unplug on
> an ARM platform
>
Great! However, did you turn on CPU_IDLE during your tests?
In my tests, I had turned off cpu idle in the .config, like I had mentioned in
the cover letter. I'm struggling to get it working with CPU_IDLE/INTEL_IDLE
turned on, because it gets into a lockup almost immediately. It appears that
the lock-holder of clockevents_lock never releases it, for some reason..
See below for the full log. Lockdep has not been useful in debugging this,
unfortunately :-(
>>
>> So, here is the analysis:
>> Stop-machine() doesn't really slow down CPU-down operation, if the rest
>> of the CPUs are mostly running in userspace all the time. Because, the
>> CPUs running userspace workloads cooperate very eagerly with the stop-machine
>> dance - they receive the resched IPI, and allow the per-cpu cpu-stopper
>> thread to monopolize the CPU, almost immediately.
>>
>> The scenario where stop-machine() takes longer to take effect is when
>> most of the online CPUs are running in kernelspace, because, then the
>> probability that they call preempt_disable() frequently (and hence inhibit
>> stop-machine) is higher. That's why, in my tests, I ran genload from LTP
>> which generated a lot of system-time (system-time in 'top' indicates activity
>> in kernelspace). Hence my patchset showed significant improvement over
>> mainline in my tests.
>>
>
> ok, I hadn't noticed this important point for the test
>
>> However, your test is very useful too, if we measure a different parameter:
>> the latency impact on the workloads running on the system (cyclic test).
>> One other important aim of this patchset is to make hotplug as less intrusive
>> as possible, for other workloads running on the system. So if you measure
>> the cyclictest numbers, I would expect my patchset to show better numbers
>> than mainline, when you do cpu-hotplug in parallel (same test that you did).
>> Mainline would run stop-machine and hence interrupt the cyclic test tasks
>> too often. My patchset wouldn't do that, and hence cyclic test should
>> ideally show better numbers.
>
> In fact, I haven't looked at the results as i was more interested by
> the load that was generated
>
>>
>> I'd really appreciate if you could try that out and let me know how it
>> goes.. :-) Thank you very much!
>
> ok, I'm going to try to run a test series
>
Great! Thank you :-)
Regards,
Srivatsa S. Bhat
--------------------------------------------------------------------------------
Lockup observed while running this patchset, with CPU_IDLE and INTEL_IDLE turned
on in the .config:
smpboot: CPU 1 is now offline
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 11
Pid: 0, comm: swapper/11 Not tainted 3.8.0-rc7+stpmch13-1 #8
Call Trace:
<NMI> [<ffffffff815a319e>] panic+0xc9/0x1ee
[<ffffffff810fdd41>] watchdog_overflow_callback+0xb1/0xc0
[<ffffffff8113ab5c>] __perf_event_overflow+0x9c/0x330
[<ffffffff81028a88>] ? x86_perf_event_set_period+0xd8/0x160
[<ffffffff8113b514>] perf_event_overflow+0x14/0x20
[<ffffffff8102ee54>] intel_pmu_handle_irq+0x1c4/0x360
[<ffffffff815a8ef1>] perf_event_nmi_handler+0x21/0x30
[<ffffffff815a8366>] nmi_handle+0xb6/0x200
[<ffffffff815a82b0>] ? oops_begin+0xd0/0xd0
[<ffffffff815a85c8>] default_do_nmi+0x68/0x220
[<ffffffff815a8840>] do_nmi+0xc0/0x110
[<ffffffff815a7911>] end_repeat_nmi+0x1e/0x2e
[<ffffffff812a3f98>] ? delay_tsc+0x38/0xb0
[<ffffffff812a3f98>] ? delay_tsc+0x38/0xb0
[<ffffffff812a3f98>] ? delay_tsc+0x38/0xb0
<<EOE>> [<ffffffff812a3f1f>] __delay+0xf/0x20
[<ffffffff812aba1e>] do_raw_spin_lock+0x7e/0x150
[<ffffffff815a64c1>] _raw_spin_lock_irqsave+0x61/0x70
[<ffffffff810c0758>] ? clockevents_notify+0x28/0x150
[<ffffffff815a6d37>] ? _raw_spin_unlock_irqrestore+0x77/0x80
[<ffffffff810c0758>] clockevents_notify+0x28/0x150
[<ffffffff8130459f>] intel_idle+0xaf/0xe0
[<ffffffff81472ee0>] ? disable_cpuidle+0x20/0x20
[<ffffffff81472ef9>] cpuidle_enter+0x19/0x20
[<ffffffff814734c1>] cpuidle_wrap_enter+0x41/0xa0
[<ffffffff81473530>] cpuidle_enter_tk+0x10/0x20
[<ffffffff81472f17>] cpuidle_enter_state+0x17/0x50
[<ffffffff81473899>] cpuidle_idle_call+0xd9/0x290
[<ffffffff810203d5>] cpu_idle+0xe5/0x140
[<ffffffff8159c603>] start_secondary+0xdd/0xdf
BUG: spinlock lockup suspected on CPU#2, migration/2/19
lock: clockevents_lock+0x0/0x40, .magic: dead4ead, .owner: swapper/8/0, .owner_cpu: 8
Pid: 19, comm: migration/2 Not tainted 3.8.0-rc7+stpmch13-1 #8
Call Trace:
[<ffffffff812ab878>] spin_dump+0x78/0xc0
[<ffffffff812abac6>] do_raw_spin_lock+0x126/0x150
[<ffffffff815a64c1>] _raw_spin_lock_irqsave+0x61/0x70
[<ffffffff810c0758>] ? clockevents_notify+0x28/0x150
[<ffffffff810c0758>] clockevents_notify+0x28/0x150
[<ffffffff8159e08f>] hrtimer_cpu_notify+0xe3/0x107
[<ffffffff815ab5ec>] notifier_call_chain+0x5c/0x120
[<ffffffff8108d7de>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff8105e540>] __cpu_notify+0x20/0x40
[<ffffffff81592003>] take_cpu_down+0x53/0x80
[<ffffffff810ed3ba>] cpu_stopper_thread+0xfa/0x1e0
[<ffffffff81591fb0>] ? enable_nonboot_cpus+0xf0/0xf0
[<ffffffff815a5039>] ? __schedule+0x469/0x890
[<ffffffff810ed2c0>] ? res_counter_init+0x60/0x60
[<ffffffff810ed2c0>] ? res_counter_init+0x60/0x60
[<ffffffff8108576e>] kthread+0xee/0x100
[<ffffffff81085680>] ? __init_kthread_worker+0x70/0x70
[<ffffffff815b092c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085680>] ? __init_kthread_worker+0x70/0x70
sending NMI to all CPUs:
NMI backtrace for cpu 3
CPU 3
Pid: 0, comm: swapper/3 Not tainted 3.8.0-rc7+stpmch13-1 #8 IBM IBM System x -[7870C4Q]-/68Y8033
RIP: 0010:[<ffffffff81304589>] [<ffffffff81304589>] intel_idle+0x99/0xe0
RSP: 0018:ffff8808db57bdd8 EFLAGS: 00000046
RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8808db57bfd8 RDI: ffffffff815a6d37
RBP: ffff8808db57be08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
R13: 0000000000000020 R14: ffffffff81a9c000 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff8808ffcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffbbb902f8 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/3 (pid: 0, threadinfo ffff8808db57a000, task ffff8808db578380)
Stack:
0000000000000000 0000000300000001 ffff8808db57be18 ffff8808ffcda070
000000bd1ad2c5de ffffffff81472ee0 ffff8808db57be18 ffffffff81472ef9
ffff8808db57be78 ffffffff814734c1 ffff8808ffcc0000 0000000100000046
Call Trace:
[<ffffffff81472ee0>] ? disable_cpuidle+0x20/0x20
[<ffffffff81472ef9>] cpuidle_enter+0x19/0x20
[<ffffffff814734c1>] cpuidle_wrap_enter+0x41/0xa0
[<ffffffff81473530>] cpuidle_enter_tk+0x10/0x20
[<ffffffff81472f17>] cpuidle_enter_state+0x17/0x50
[<ffffffff81473899>] cpuidle_idle_call+0xd9/0x290
[<ffffffff810203d5>] cpu_idle+0xe5/0x140
[<ffffffff8159c603>] start_secondary+0xdd/0xdf
Code: ff 48 8d 86 38 e0 ff ff 83 e2 08 75 1e 31 d2 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <85> 1d 69 7d 79 00 75 0e 48 8d 75 dc bf 05 00 00 00 e8 91 c1 db
NMI backtrace for cpu 5
CPU 5
Pid: 0, comm: swapper/5 Not tainted 3.8.0-rc7+stpmch13-1 #8 IBM IBM System x -[7870C4Q]-/68Y8033
RIP: 0010:[<ffffffff81304589>] [<ffffffff81304589>] intel_idle+0x99/0xe0
RSP: 0018:ffff8808db583dd8 EFLAGS: 00000046
RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8808db583fd8 RDI: ffffffff815a6d37
RBP: ffff8808db583e08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
R13: 0000000000000020 R14: ffffffff81a9c000 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff88117fc40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffff600400 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/5 (pid: 0, threadinfo ffff8808db582000, task ffff8808db580400)
Stack:
0000000000000000 0000000500000001 ffff8808db583e18 ffff88117fc5a070
000000bd1af68ce5 ffffffff81472ee0 ffff8808db583e18 ffffffff81472ef9
ffff8808db583e78 ffffffff814734c1 ffff88117fc40000 0000000100000046
Call Trace:
[<ffffffff81472ee0>] ? disable_cpuidle+0x20/0x20
[<ffffffff81472ef9>] cpuidle_enter+0x19/0x20
[<ffffffff814734c1>] cpuidle_wrap_enter+0x41/0xa0
[<ffffffff81473530>] cpuidle_enter_tk+0x10/0x20
[<ffffffff81472f17>] cpuidle_enter_state+0x17/0x50
[<ffffffff81473899>] cpuidle_idle_call+0xd9/0x290
[<ffffffff810203d5>] cpu_idle+0xe5/0x140
[<ffffffff8159c603>] start_secondary+0xdd/0xdf
Code: ff 48 8d 86 38 e0 ff ff 83 e2 08 75 1e 31 d2 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <85> 1d 69 7d 79 00 75 0e 48 8d 75 dc bf 05 00 00 00 e8 91 c1 db
NMI backtrace for cpu 6
CPU 6
Pid: 0, comm: swapper/6 Not tainted 3.8.0-rc7+stpmch13-1 #8 IBM IBM System x -[7870C4Q]-/68Y8033
RIP: 0010:[<ffffffff81304589>] [<ffffffff81304589>] intel_idle+0x99/0xe0
RSP: 0018:ffff8808db589dd8 EFLAGS: 00000046
RAX: 0000000000000010 RBX: 0000000000000004 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8808db589fd8 RDI: ffffffff815a6d37
RBP: ffff8808db589e08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
R13: 0000000000000010 R14: ffffffff81a9c000 R15: 0000000000000002
FS: 0000000000000000(0000) GS:ffff88117fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fde168df000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/6 (pid: 0, threadinfo ffff8808db588000, task ffff8808db586440)
Stack:
0000000000000000 0000000600000001 ffff8808db589e18 ffff88117fc9a070
000000bd1af4a55f ffffffff81472ee0 ffff8808db589e18 ffffffff81472ef9
ffff8808db589e78 ffffffff814734c1 ffff88117fc80000 0000000b00000046
Call Trace:
[<ffffffff81472ee0>] ? disable_cpuidle+0x20/0x20
[<ffffffff81472ef9>] cpuidle_enter+0x19/0x20
[<ffffffff814734c1>] cpuidle_wrap_enter+0x41/0xa0
[<ffffffff81473530>] cpuidle_enter_tk+0x10/0x20
[<ffffffff81472f17>] cpuidle_enter_state+0x17/0x50
[<ffffffff81473899>] cpuidle_idle_call+0xd9/0x290
[<ffffffff810203d5>] cpu_idle+0xe5/0x140
[<ffffffff8159c603>] start_secondary+0xdd/0xdf
Code: ff 48 8d 86 38 e0 ff ff 83 e2 08 75 1e 31 d2 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <85> 1d 69 7d 79 00 75 0e 48 8d 75 dc bf 05 00 00 00 e8 91 c1 db
^ permalink raw reply
* Re: [PATCH] i2c: Remove unneeded xxx_set_drvdata(..., NULL) calls
From: Mika Westerberg @ 2013-02-18 10:35 UTC (permalink / raw)
To: Doug Anderson
Cc: Wolfram Sang, Tony Lindgren, Linus Walleij, Thierry Reding,
Sekhar Nori, linux-i2c, Guan Xuetao, Kevin Hilman, Sonic Zhang,
linux-arm-kernel, Deepak Sikri, Havard Skinnemoen, Marek Vasut,
Pawel Moll, Stephen Warren, Sascha Hauer, Uwe Kleine-König,
Rob Herring, uclinux-dist-devel, Jean Delvare, Lars-Peter Clausen,
Ben Dooks (embedded platforms), Barry Song, linux-omap,
Oskar Schirmer, Fabio Estevam, davinci-linux-open-source,
Shawn Guo, Jim Cromie, Greg Kroah-Hartman, Tomoya MORINAGA,
linux-kernel, Kyungmin Park, Viresh Kumar, Karol Lewandowski,
Jiri Kosina, STEricsson, Joe Perches, Andrew Morton,
Alessandro Rubini, linuxppc-dev, Alexander Stein
In-Reply-To: <1360970315-32116-1-git-send-email-dianders@chromium.org>
On Fri, Feb 15, 2013 at 03:18:35PM -0800, Doug Anderson wrote:
> There is simply no reason to be manually setting the private driver
> data to NULL in the remove/fail to probe cases. This is just extra
> cruft code that can be removed.
>
> A few notes:
> * Nothing relies on drvdata being set to NULL.
> * The __device_release_driver() function eventually calls
> dev_set_drvdata(dev, NULL) anyway, so there's no need to do it
> twice.
> * I verified that there were no cases where xxx_get_drvdata() was
> being called in these drivers and checking for / relying on the NULL
> return value.
>
> This could be cleaned up kernel-wide but for now just take the baby
> step and remove from the i2c subsystem.
>
> Reported-by: Wolfram Sang <wsa@the-dreams.de>
> Reported-by: Stephen Warren <swarren@wwwdotorg.org>
> Signed-off-by: Doug Anderson <dianders@chromium.org>
> ---
> drivers/i2c/busses/i2c-au1550.c | 1 -
> drivers/i2c/busses/i2c-bfin-twi.c | 2 --
> drivers/i2c/busses/i2c-cpm.c | 2 --
> drivers/i2c/busses/i2c-davinci.c | 2 --
> drivers/i2c/busses/i2c-designware-pcidrv.c | 2 --
> drivers/i2c/busses/i2c-designware-platdrv.c | 2 --
For i2c-designware-pcidrv.c and i2c-designware-platdrv.c:
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
^ permalink raw reply
* [RFC PATCH 16/17] powerpc: get_user_pages_fast changes
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Handle large pages in get_user_pages_fast. Also take care of large page splitting.
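As a rough illustration of the index arithmetic used by the huge-PMD path in
the diff below, here is a small userspace sketch. The 64K base page and 16MB
huge page geometry, the SKETCH_* constants and both helper names are
assumptions made for this example; they are not part of the patch.

```c
#include <assert.h>

/* Assumed geometry, for illustration only: 64K base pages, 16MB huge PMD. */
#define SKETCH_PAGE_SHIFT 16
#define SKETCH_PMD_SHIFT  24
#define SKETCH_PMD_SIZE   (1UL << SKETCH_PMD_SHIFT)
#define SKETCH_PMD_MASK   (~(SKETCH_PMD_SIZE - 1))

/*
 * Index of the base page that contains addr within its huge page.
 * Mirrors the expression (addr & ~PMD_MASK) >> PAGE_SHIFT that is
 * added to the head page pointer in gup_huge_pmd().
 */
static unsigned long subpage_index(unsigned long addr)
{
	return (addr & ~SKETCH_PMD_MASK) >> SKETCH_PAGE_SHIFT;
}

/*
 * Number of base pages collected for [addr, end). Both addresses are
 * expected to be page aligned, as the do/while loop in the patch
 * terminates only when addr reaches end exactly.
 */
static unsigned long pages_in_range(unsigned long addr, unsigned long end)
{
	assert(addr < end);
	return (end - addr) >> SKETCH_PAGE_SHIFT;
}
```

Under these assumptions, an address three base pages into a huge page yields
index 3, and the reference count taken on the head page equals the value
returned by pages_in_range() for the requested range.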
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/mm/gup.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 74 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d7efdbf..4b9c27e 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -55,6 +55,64 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
return 1;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int gup_huge_pmd(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, int write,
+ struct page **pages, int *nr)
+{
+ int refs;
+ pmd_t pmd;
+ unsigned long mask;
+ struct page *head, *page, *tail;
+
+ pmd = *pmdp;
+ mask = PMD_HUGE_PRESENT | PMD_HUGE_USER;
+ if (write)
+ mask |= PMD_HUGE_RW;
+
+ if ((pmd_val(pmd) & mask) != mask)
+ return 0;
+
+ /* FIXME!! large pages are never "special" */
+ VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));
+
+ refs = 0;
+ head = pmd_page(pmd);
+ page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ tail = page;
+ do {
+ VM_BUG_ON(compound_head(page) != head);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ refs++;
+ } while (addr += PAGE_SIZE, addr != end);
+
+ if (!page_cache_add_speculative(head, refs)) {
+ *nr -= refs;
+ return 0;
+ }
+
+ if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
+ *nr -= refs;
+ while (refs--)
+ put_page(head);
+ return 0;
+ }
+ /*
+ * Any tail page need their mapcount reference taken before we
+ * return.
+ */
+ while (refs--) {
+ if (PageTail(tail))
+ get_huge_page_tail(tail);
+ tail++;
+ }
+
+ return 1;
+}
+#endif
+
static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
{
@@ -66,9 +124,23 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ /*
+ * The pmd_trans_splitting() check below explains why
+ * pmdp_splitting_flush has to flush the tlb, to stop
+ * this gup-fast code from running while we set the
+ * splitting bit in the pmd. Returning zero will take
+ * the slow path that will call wait_split_huge_page()
+ * if the pmd is still in splitting state. gup-fast
+ * can't because it has irq disabled and
+ * wait_split_huge_page() would never return as the
+ * tlb flush IPI wouldn't run.
+ */
+ if (pmd_none(pmd) || pmd_trans_splitting(pmd))
return 0;
- if (is_hugepd(pmdp)) {
+ if (unlikely(pmd_large(pmd))) {
+ if (!gup_huge_pmd(pmdp, addr, next, write, pages, nr))
+ return 0;
+ } else if (is_hugepd(pmdp)) {
if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
addr, next, write, pages, nr))
return 0;
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 03/17] powerpc: Reduce PTE table memory wastage
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
We now have PTE pages consuming only 2K of the 64K page. This is in order to
facilitate transparent huge page support, which works much better if our PMDs
cover 16MB instead of 256MB.
In order to reduce the wastage, we now allocate multiple PTE page fragments
from the same PTE page.
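The numbers above can be checked with a small userspace sketch. It assumes
8-byte PTEs, and the bitmap allocator is a hypothetical stand-in that only
shows how 32 fragments could share one page; it is not the allocator added
by this patch. All SKETCH_* and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* From the description above: 64K base pages, 2K PTE fragments. */
#define SKETCH_PAGE_SIZE (64UL * 1024)
#define SKETCH_FRAG_SIZE (2UL * 1024)
/* Assumption: each PTE is 8 bytes. */
#define SKETCH_PTE_SIZE  8UL

/* Number of PTE fragments that fit in one page. */
static unsigned long frags_per_page(void)
{
	return SKETCH_PAGE_SIZE / SKETCH_FRAG_SIZE;
}

/* Address range mapped by one PTE table of the given size in bytes. */
static unsigned long pmd_coverage(unsigned long table_bytes)
{
	return (table_bytes / SKETCH_PTE_SIZE) * SKETCH_PAGE_SIZE;
}

/*
 * Hypothetical allocator: one bit per fragment, a set bit means "in use".
 * Returns the index of the fragment handed out, or -1 if the page is full.
 */
static int frag_alloc(uint32_t *used)
{
	for (unsigned int i = 0; i < frags_per_page(); i++) {
		if (!(*used & (UINT32_C(1) << i))) {
			*used |= UINT32_C(1) << i;
			return (int)i;
		}
	}
	return -1;
}

/* Mark fragment idx as free again. */
static void frag_free(uint32_t *used, int idx)
{
	assert(idx >= 0 && (unsigned long)idx < frags_per_page());
	*used &= ~(UINT32_C(1) << idx);
}
```

With these assumptions a 2K table holds 256 PTEs and maps 16MB, a 32K table
would map 256MB, and one 64K page provides 32 fragments, so a single 32-bit
mask is enough to track them.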
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/mmu-book3e.h | 4 +
arch/powerpc/include/asm/mmu-hash64.h | 4 +
arch/powerpc/include/asm/page.h | 4 +
arch/powerpc/include/asm/pgalloc-32.h | 45 ++++++++
arch/powerpc/include/asm/pgalloc-64.h | 143 ++++++++++++++++++++-----
arch/powerpc/include/asm/pgalloc.h | 46 +-------
arch/powerpc/kernel/setup_64.c | 4 +-
arch/powerpc/mm/mmu_context_hash64.c | 12 +++
arch/powerpc/mm/pgtable_64.c | 189 +++++++++++++++++++++++++++++++++
9 files changed, 377 insertions(+), 74 deletions(-)
diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h
index 99d43e0..6bd293d 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -231,6 +231,10 @@ typedef struct {
u64 high_slices_psize; /* 4 bits per slice for now */
u16 user_psize; /* page size index */
#endif
+#ifdef CONFIG_PPC_64K_PAGES
+ /* for 2K page table support */
+ struct list_head pgtable_list;
+#endif
} mm_context_t;
/* Page size definitions, common between 32 and 64-bit
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 35bb51e..c3b3518 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -498,6 +498,10 @@ typedef struct {
unsigned long acop; /* mask of enabled coprocessor types */
unsigned int cop_pid; /* pid value used with coprocessors */
#endif /* CONFIG_PPC_ICSWX */
+#ifdef CONFIG_PPC_64K_PAGES
+ /* for 2K page table support */
+ struct list_head pgtable_list;
+#endif
} mm_context_t;
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index f072e97..38e7ff6 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -378,7 +378,11 @@ void arch_free_page(struct page *page, int order);
struct vm_area_struct;
+#ifdef CONFIG_PPC_64K_PAGES
+typedef pte_t *pgtable_t;
+#else
typedef struct page *pgtable_t;
+#endif
#include <asm-generic/memory_model.h>
#endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/pgalloc-32.h b/arch/powerpc/include/asm/pgalloc-32.h
index 580cf73..27b2386 100644
--- a/arch/powerpc/include/asm/pgalloc-32.h
+++ b/arch/powerpc/include/asm/pgalloc-32.h
@@ -37,6 +37,17 @@ extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+ free_page((unsigned long)pte);
+}
+
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
+{
+ pgtable_page_dtor(ptepage);
+ __free_page(ptepage);
+}
+
static inline void pgtable_free(void *table, unsigned index_size)
{
BUG_ON(index_size); /* 32-bit doesn't use this */
@@ -45,4 +56,38 @@ static inline void pgtable_free(void *table, unsigned index_size)
#define check_pgt_cache() do { } while (0)
+#ifdef CONFIG_SMP
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+ void *table, int shift)
+{
+ unsigned long pgf = (unsigned long)table;
+ BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+ pgf |= shift;
+ tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+ void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+ unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+ pgtable_free(table, shift);
+}
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+ void *table, int shift)
+{
+ pgtable_free(table, shift);
+}
+#endif
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
+ unsigned long address)
+{
+ struct page *page = page_address(table);
+
+ tlb_flush_pgtable(tlb, address);
+ pgtable_page_dtor(page);
+ pgtable_free_tlb(tlb, page, 0);
+}
#endif /* _ASM_POWERPC_PGALLOC_32_H */
diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index 292725c..f6875a5 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -72,9 +72,91 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
#define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, (unsigned long)(pte))
#define pmd_pgtable(pmd) pmd_page(pmd)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
+ unsigned long address)
+{
+ return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+}
-#else /* CONFIG_PPC_64K_PAGES */
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
+ unsigned long address)
+{
+ pte_t *pte;
+ struct page *page;
+ pte = pte_alloc_one_kernel(mm, address);
+ if (!pte)
+ return NULL;
+ page = virt_to_page(pte);
+ pgtable_page_ctor(page);
+ return page;
+}
+
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+ free_page((unsigned long)pte);
+}
+
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
+{
+ pgtable_page_dtor(ptepage);
+ __free_page(ptepage);
+}
+
+#ifdef CONFIG_SMP
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+ void *table, int shift)
+{
+ unsigned long pgf = (unsigned long)table;
+ BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+ pgf |= shift;
+ tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+ void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+ unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+ if (!shift)
+ free_page((unsigned long)table);
+ else {
+ BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+ kmem_cache_free(PGT_CACHE(shift), table);
+ }
+}
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+ void *table, int shift)
+{
+ pgtable_free(table, shift);
+}
+#endif
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
+ unsigned long address)
+{
+ struct page *page = page_address(table);
+
+ tlb_flush_pgtable(tlb, address);
+ pgtable_page_dtor(page);
+ pgtable_free_tlb(tlb, page, 0);
+}
+
+#else /* if CONFIG_PPC_64K_PAGES */
+
+extern unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
+extern void page_table_free(struct mm_struct *, unsigned long *);
+#ifdef CONFIG_SMP
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
+extern void __tlb_remove_table(void *_table);
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+ void *table, int shift)
+{
+ pgtable_free(table, shift);
+}
+#endif
#define pud_populate(mm, pud, pmd) pud_set(pud, (unsigned long)pmd)
static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
@@ -83,51 +165,56 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
pmd_set(pmd, (unsigned long)pte);
}
-#define pmd_populate(mm, pmd, pte_page) \
- pmd_populate_kernel(mm, pmd, page_address(pte_page))
-#define pmd_pgtable(pmd) pmd_page(pmd)
-
-#endif /* CONFIG_PPC_64K_PAGES */
-
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
+ pgtable_t pte_page)
{
- return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
- GFP_KERNEL|__GFP_REPEAT);
+ pmd_set(pmd, (unsigned long)pte_page);
}
-static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
+static inline pgtable_t pmd_pgtable(pmd_t pmd)
{
- kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
+ return (pgtable_t)(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE);
}
static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
unsigned long address)
{
- return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+ return (pte_t *)page_table_alloc(mm, address);
}
static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
unsigned long address)
{
- struct page *page;
- pte_t *pte;
+ return (pgtable_t)page_table_alloc(mm, address);
+}
- pte = pte_alloc_one_kernel(mm, address);
- if (!pte)
- return NULL;
- page = virt_to_page(pte);
- pgtable_page_ctor(page);
- return page;
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+ page_table_free(mm, (unsigned long *)pte);
}
-static inline void pgtable_free(void *table, unsigned index_size)
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
{
- if (!index_size)
- free_page((unsigned long)table);
- else {
- BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
- kmem_cache_free(PGT_CACHE(index_size), table);
- }
+ page_table_free(mm, (unsigned long *)ptepage);
+}
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
+ unsigned long address)
+{
+ tlb_flush_pgtable(tlb, address);
+ pgtable_free_tlb(tlb, table, 0);
+}
+#endif /* CONFIG_PPC_64K_PAGES */
+
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
+ GFP_KERNEL|__GFP_REPEAT);
+}
+
+static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
+{
+ kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
}
#define __pmd_free_tlb(tlb, pmd, addr) \
diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..e9a9f60 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -3,6 +3,7 @@
#ifdef __KERNEL__
#include <linux/mm.h>
+#include <asm-generic/tlb.h>
#ifdef CONFIG_PPC_BOOK3E
extern void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address);
@@ -13,56 +14,11 @@ static inline void tlb_flush_pgtable(struct mmu_gather *tlb,
}
#endif /* !CONFIG_PPC_BOOK3E */
-static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
-{
- free_page((unsigned long)pte);
-}
-
-static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
-{
- pgtable_page_dtor(ptepage);
- __free_page(ptepage);
-}
-
#ifdef CONFIG_PPC64
#include <asm/pgalloc-64.h>
#else
#include <asm/pgalloc-32.h>
#endif
-#ifdef CONFIG_SMP
-struct mmu_gather;
-extern void tlb_remove_table(struct mmu_gather *, void *);
-
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
-{
- unsigned long pgf = (unsigned long)table;
- BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
- pgf |= shift;
- tlb_remove_table(tlb, (void *)pgf);
-}
-
-static inline void __tlb_remove_table(void *_table)
-{
- void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
- unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
-
- pgtable_free(table, shift);
-}
-#else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
- pgtable_free(table, shift);
-}
-#endif /* !CONFIG_SMP */
-
-static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
- unsigned long address)
-{
- tlb_flush_pgtable(tlb, address);
- pgtable_page_dtor(ptepage);
- pgtable_free_tlb(tlb, page_address(ptepage), 0);
-}
-
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_PGALLOC_H */
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6da881b..4e2db82 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -575,7 +575,9 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = klimit;
-
+#ifdef CONFIG_PPC_64K_PAGES
+ INIT_LIST_HEAD(&init_mm.context.pgtable_list);
+#endif
irqstack_early_init();
exc_lvl_early_init();
emergency_stack_init();
diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
index 59cd773..83f2222 100644
--- a/arch/powerpc/mm/mmu_context_hash64.c
+++ b/arch/powerpc/mm/mmu_context_hash64.c
@@ -86,6 +86,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
spin_lock_init(mm->context.cop_lockp);
#endif /* CONFIG_PPC_ICSWX */
+ INIT_LIST_HEAD(&mm->context.pgtable_list);
+
return 0;
}
@@ -99,11 +101,21 @@ EXPORT_SYMBOL_GPL(__destroy_context);
void destroy_context(struct mm_struct *mm)
{
+ struct page *page;
+ struct list_head *item, *tmp;
+
#ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
mm->context.cop_lockp = NULL;
#endif /* CONFIG_PPC_ICSWX */
+ list_for_each_safe(item, tmp, &mm->context.pgtable_list) {
+ page = list_entry(item, struct page, lru);
+ list_del(&page->lru);
+ pgtable_page_dtor(page);
+ atomic_set(&page->_mapcount, -1);
+ __free_page(page);
+ }
__destroy_context(mm->context.id);
subpage_prot_free(mm);
mm->context.id = MMU_NO_CONTEXT;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index e212a27..ec80314 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -69,6 +69,7 @@
unsigned long ioremap_bot = IOREMAP_BASE;
#ifdef CONFIG_PPC_MMU_NOHASH
+/* FIXME!! */
static void *early_alloc_pgtable(unsigned long size)
{
void *pt;
@@ -337,3 +338,191 @@ EXPORT_SYMBOL(__ioremap_at);
EXPORT_SYMBOL(iounmap);
EXPORT_SYMBOL(__iounmap);
EXPORT_SYMBOL(__iounmap_at);
+
+#ifdef CONFIG_PPC_64K_PAGES
+/*
+ * we support 15 fragments per PTE page. This is limited by how many
+ * bits we can pack in page->_mapcount. We use the first half for
+ * tracking the usage for rcu page table free.
+ */
+#define FRAG_MASK_BITS 15
+#define FRAG_MASK ((1 << FRAG_MASK_BITS) - 1)
+/*
+ * We use a 2K PTE page fragment and another 2K for storing
+ * real_pte_t hash index
+ */
+#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))
+
+static inline unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
+{
+ unsigned int old, new;
+
+ do {
+ old = atomic_read(v);
+ new = old ^ bits;
+ } while (atomic_cmpxchg(v, old, new) != old);
+ return new;
+}
+
+unsigned long *page_table_alloc(struct mm_struct *mm, unsigned long vmaddr)
+{
+ struct page *page;
+ unsigned int mask, bit;
+ unsigned long *table;
+
+ /* Allocate fragments of a 4K page as 1K/2K page table */
+ spin_lock(&mm->page_table_lock);
+ mask = FRAG_MASK;
+ if (!list_empty(&mm->context.pgtable_list)) {
+ page = list_first_entry(&mm->context.pgtable_list,
+ struct page, lru);
+ table = (unsigned long *) page_address(page);
+ mask = atomic_read(&page->_mapcount);
+ /*
+ * Update with the higher order mask bits accumulated,
+ * added as a part of rcu free.
+ */
+ mask = mask | (mask >> FRAG_MASK_BITS);
+ }
+ if ((mask & FRAG_MASK) == FRAG_MASK) {
+ spin_unlock(&mm->page_table_lock);
+ page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+ if (!page)
+ return NULL;
+ pgtable_page_ctor(page);
+ atomic_set(&page->_mapcount, 1);
+ table = (unsigned long *) page_address(page);
+ spin_lock(&mm->page_table_lock);
+ INIT_LIST_HEAD(&page->lru);
+ list_add(&page->lru, &mm->context.pgtable_list);
+ } else {
+ /* The second half is used for real_pte_t hindex */
+ for (bit = 1; mask & bit; bit <<= 1)
+ table = (unsigned long *)((char *)table + PTE_FRAG_SIZE);
+
+ mask = atomic_xor_bits(&page->_mapcount, bit);
+ /*
+ * We have taken up all the space, remove this from
+ * the list, we will add it back when we have a free slot
+ */
+ if ((mask & FRAG_MASK) == FRAG_MASK)
+ list_del_init(&page->lru);
+ }
+ spin_unlock(&mm->page_table_lock);
+ /*
+ * zero out the newly allocated area, this make sure we don't
+ * see the old left over pte values
+ */
+ memset(table, 0, PTE_FRAG_SIZE);
+ return table;
+}
+
+void page_table_free(struct mm_struct *mm, unsigned long *table)
+{
+ struct page *page;
+ unsigned int bit, mask;
+
+ /* Free 2K page table fragment of a 64K page */
+ page = virt_to_page(table);
+ bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+ spin_lock(&mm->page_table_lock);
+ mask = atomic_xor_bits(&page->_mapcount, bit);
+ if (mask == 0)
+ list_del(&page->lru);
+ else if (mask & FRAG_MASK) {
+ /*
+ * Add the page table page to pgtable_list so that
+ * the free fragment can be used by the next alloc
+ */
+ list_del_init(&page->lru);
+ list_add(&page->lru, &mm->context.pgtable_list);
+ }
+ spin_unlock(&mm->page_table_lock);
+ if (mask == 0) {
+ pgtable_page_dtor(page);
+ atomic_set(&page->_mapcount, -1);
+ __free_page(page);
+ }
+}
+
+#ifdef CONFIG_SMP
+static void __page_table_free_rcu(void *table)
+{
+ unsigned int bit;
+ struct page *page;
+ /*
+ * this is a PTE page free 2K page table
+ * fragment of a 64K page.
+ */
+ page = virt_to_page(table);
+ bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+ bit <<= FRAG_MASK_BITS;
+ /*
+ * clear the higher half and if nobody used the page in
+ * between, even lower half would be zero.
+ */
+ if (atomic_xor_bits(&page->_mapcount, bit) == 0) {
+ pgtable_page_dtor(page);
+ atomic_set(&page->_mapcount, -1);
+ __free_page(page);
+ }
+}
+
+static void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table)
+{
+ struct page *page;
+ struct mm_struct *mm;
+ unsigned int bit, mask;
+
+ mm = tlb->mm;
+ /* Free 2K page table fragment of a 64K page */
+ page = virt_to_page(table);
+ bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+ spin_lock(&mm->page_table_lock);
+ /*
+ * stash the actual mask in higher half, and clear the lower half
+ * and selectively, add remove from pgtable list
+ */
+ mask = atomic_xor_bits(&page->_mapcount, bit | (bit << FRAG_MASK_BITS));
+ if (!(mask & FRAG_MASK))
+ list_del(&page->lru);
+ else {
+ /*
+ * Add the page table page to pgtable_list so that
+ * the free fragment can be used by the next alloc
+ */
+ list_del_init(&page->lru);
+ list_add_tail(&page->lru, &mm->context.pgtable_list);
+ }
+ spin_unlock(&mm->page_table_lock);
+ tlb_remove_table(tlb, table);
+}
+
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+ unsigned long pgf = (unsigned long)table;
+
+ BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+ pgf |= shift;
+ if (shift == 0)
+ /* PTE page needs special handling */
+ page_table_free_rcu(tlb, table);
+ else
+ tlb_remove_table(tlb, (void *)pgf);
+}
+
+void __tlb_remove_table(void *_table)
+{
+ void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+ unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+ if (!shift)
+ /* PTE page needs special handling */
+ __page_table_free_rcu(table);
+ else {
+ BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+ kmem_cache_free(PGT_CACHE(shift), table);
+ }
+}
+#endif
+#endif /* CONFIG_PPC_64K_PAGES */
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 11/17] powerpc: Print page size info during boot
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This gives a hint about the different base and actual page size combinations
supported by the platform.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/mm/hash_utils_64.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index df48ba5..a06b55a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -314,7 +314,7 @@ static int __init htab_dt_scan_page_sizes(unsigned long node,
prop = (u32 *)of_get_flat_dt_prop(node,
"ibm,segment-page-sizes", &size);
if (prop != NULL) {
- DBG("Page sizes from device-tree:\n");
+ pr_info("Page sizes from device-tree:\n");
size /= 4;
cur_cpu_spec->mmu_features &= ~(MMU_FTR_16M_PAGE);
while(size > 0) {
@@ -364,10 +364,10 @@ static int __init htab_dt_scan_page_sizes(unsigned long node,
continue;
def->penc[idx] = penc;
- DBG(" %d: shift=%02x, sllp=%04lx, "
- "avpnm=%08lx, tlbiel=%d, penc=%d\n",
- idx, shift, def->sllp, def->avpnm,
- def->tlbiel, def->penc[idx]);
+ pr_info("base_shift=%d: shift=%d, sllp=0x%04lx,"
+ " avpnm=0x%08lx, tlbiel=%d, penc=%d\n",
+ base_shift, shift, def->sllp,
+ def->avpnm, def->tlbiel, def->penc[idx]);
}
}
return 1;
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 00/17] THP support for PPC64
From: Aneesh Kumar K.V @ 2013-02-18 10:27 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev
Hi,
This is an early RFC version adding transparent huge page support for PPC64.
I am sharing the changes so that we can get an early review of the approach
taken. The TODOs include:
*) Compile issues with different config options
*) HugeTLBfs is disabled for now (mostly compile issues)
*) PPC32 and other sub-architecture details need to be worked out
*) 4K page size details need to be worked out
*) Closer review of PMD* flags
Some numbers:
The latency measurement code from Anton can be found at
http://ozlabs.org/~anton/junkcode/latency2001.c
THP disabled 64K page size
------------------------
[root@llmp24l02 ~]# ./latency2001 8G
8589934592 731.73 cycles 205.77 ns
[root@llmp24l02 ~]# ./latency2001 8G
8589934592 743.39 cycles 209.05 ns
[root@llmp24l02 ~]#
THP disabled large page via hugetlbfs
-------------------------------------
[root@llmp24l02 ~]# ./latency2001 -l 8G
8589934592 416.09 cycles 117.01 ns
[root@llmp24l02 ~]# ./latency2001 -l 8G
8589934592 415.74 cycles 116.91 ns
THP enabled 64K page size.
----------------
[root@llmp24l02 ~]# ./latency2001 8G
8589934592 405.07 cycles 113.91 ns
[root@llmp24l02 ~]# ./latency2001 8G
8589934592 411.82 cycles 115.81 ns
[root@llmp24l02 ~]#
We are close to hugetlbfs in latency and we can achieve this with zero
config/page reservation. Most of the allocations above are fault allocated.
I haven't really measured the collapse alloc impact.
Another test that does 50000000 random accesses over a 1GB area goes from
2.65 seconds to 1.07 seconds with this patchset.
Thanks,
-aneesh
^ permalink raw reply
* [RFC PATCH 01/17] powerpc: Don't hard code the size of pte page
From: Aneesh Kumar K.V @ 2013-02-18 10:27 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Use PTRS_PER_PTE to indicate the size of the PTE page.
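As a rough illustration of why the new constant can replace the hard-coded
0x8000: assuming the 64K-page configuration uses a 12-bit PTE index (an
assumption inferred from the value being replaced, not stated in this patch),
PTRS_PER_PTE * 8 evaluates to 0x8000. The DEMO_ names below are hypothetical
stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel macros; the 12-bit index size
 * is an assumption for the 64K-page configuration. */
#define DEMO_PTE_INDEX_SIZE		12
#define DEMO_PTRS_PER_PTE		(1UL << DEMO_PTE_INDEX_SIZE)
#define DEMO_PTE_PAGE_HIDX_OFFSET	(DEMO_PTRS_PER_PTE * 8)

/* Return the address of the hidx slot that shadows a given PTE slot. */
static inline uintptr_t demo_hidx_addr(uintptr_t pte_addr)
{
	/* Like the "ori" in the assembly, this only works when the PTE
	 * address lies in the first half, so the offset bit is clear. */
	assert((pte_addr & DEMO_PTE_PAGE_HIDX_OFFSET) == 0);
	return pte_addr | DEMO_PTE_PAGE_HIDX_OFFSET;
}
```

With these assumed values, demo_hidx_addr(0x10008) yields 0x18008, i.e. the
same slot index in the second half of the page.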
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/pgtable.h | 6 ++++++
arch/powerpc/mm/hash_low_64.S | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index a9cbd3b..fc57855 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -17,6 +17,12 @@ struct mm_struct;
# include <asm/pgtable-ppc32.h>
#endif
+/*
+ * hidx is in the second half of the page table. We use the
+ * 8 bytes per each pte entry.
+ */
+#define PTE_PAGE_HIDX_OFFSET (PTRS_PER_PTE * 8)
+
#ifndef __ASSEMBLY__
#include <asm/tlbflush.h>
diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index 7443481..abdd5e2 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -490,7 +490,7 @@ END_FTR_SECTION(CPU_FTR_NOEXECUTE|CPU_FTR_COHERENT_ICACHE, CPU_FTR_NOEXECUTE)
beq htab_inval_old_hpte
ld r6,STK_PARAM(R6)(r1)
- ori r26,r6,0x8000 /* Load the hidx mask */
+ ori r26,r6,PTE_PAGE_HIDX_OFFSET /* Load the hidx mask. */
ld r26,0(r26)
addi r5,r25,36 /* Check actual HPTE_SUB bit, this */
rldcr. r0,r31,r5,0 /* must match pgtable.h definition */
@@ -607,7 +607,7 @@ htab_pte_insert_ok:
sld r4,r4,r5
andc r26,r26,r4
or r26,r26,r3
- ori r5,r6,0x8000
+ ori r5,r6,PTE_PAGE_HIDX_OFFSET
std r26,0(r5)
lwsync
std r30,0(r6)
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 15/17] powerpc: hypervisor require few WIMG bit set
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Without this, the insert will return an H_PARAMETER error. Also use
the signed variant when printing the error.
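A minimal userspace sketch of why the signed format matters, assuming hcall
error codes are negative longs (H_PARAMETER is taken as -4 here purely for
illustration). The demo_ helpers are hypothetical and only mimic the message
text, without the trailing newline.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed value for illustration: hcall errors are negative longs. */
#define DEMO_H_PARAMETER	(-4L)

/* Format a return code with the signed "%ld", as the patched code does. */
static int demo_format_signed(char *buf, size_t len, long rc)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, " lpar err %ld", rc);
}

/* Format the same code as unsigned, which is what "%lu" amounted to. */
static int demo_format_unsigned(char *buf, size_t len, long rc)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, " lpar err %lu", (unsigned long)rc);
}
```

The signed variant prints the code as a small negative number that can be
matched against the hcall definitions, while the unsigned variant prints a
very large positive value.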
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/mm/largepage-hash64.c | 2 ++
arch/powerpc/platforms/pseries/lpar.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/largepage-hash64.c b/arch/powerpc/mm/largepage-hash64.c
index 2a5fc39..20a626e 100644
--- a/arch/powerpc/mm/largepage-hash64.c
+++ b/arch/powerpc/mm/largepage-hash64.c
@@ -123,6 +123,8 @@ repeat:
/* Add in WIMG bits. FIXME!! enabled by default */
rflags |= (new_pmd & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
_PAGE_COHERENT | _PAGE_GUARDED));
+#else
+ rflags |= _PAGE_COHERENT;
#endif
/* Insert into the hash table, primary slot */
slot = ppc_md.hpte_insert(hpte_group, vpn, pa, rflags, 0,
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index ca9c2bb..3daced3 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -155,7 +155,7 @@ static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
*/
if (unlikely(lpar_rc != H_SUCCESS)) {
if (!(vflags & HPTE_V_BOLTED))
- pr_devel(" lpar err %lu\n", lpar_rc);
+ pr_devel(" lpar err %ld\n", lpar_rc);
return -2;
}
if (!(vflags & HPTE_V_BOLTED))
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 17/17] powerpc: Save DAR and DSISR in pt_regs on MCE
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
We were not saving DAR and DSISR on MCE. Save them and also print the values
along with exception details in xmon.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/kernel/exceptions-64s.S | 9 +++++++++
arch/powerpc/xmon/xmon.c | 2 +-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 0e9c48c..d02e730 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -640,9 +640,18 @@ slb_miss_user_pseries:
.align 7
.globl machine_check_common
machine_check_common:
+
+ mfspr r10,SPRN_DAR
+ std r10,PACA_EXGEN+EX_DAR(r13)
+ mfspr r10,SPRN_DSISR
+ stw r10,PACA_EXGEN+EX_DSISR(r13)
EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
FINISH_NAP
DISABLE_INTS
+ ld r3,PACA_EXGEN+EX_DAR(r13)
+ lwz r4,PACA_EXGEN+EX_DSISR(r13)
+ std r3,_DAR(r1)
+ std r4,_DSISR(r1)
bl .save_nvgprs
addi r3,r1,STACK_FRAME_OVERHEAD
bl .machine_check_exception
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 1f8d2f1..a72e490 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1423,7 +1423,7 @@ static void excprint(struct pt_regs *fp)
printf(" sp: %lx\n", fp->gpr[1]);
printf(" msr: %lx\n", fp->msr);
- if (trap == 0x300 || trap == 0x380 || trap == 0x600) {
+ if (trap == 0x300 || trap == 0x380 || trap == 0x600 || trap == 0x200) {
printf(" dar: %lx\n", fp->dar);
if (trap != 0x380)
printf(" dsisr: %lx\n", fp->dsisr);
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 05/17] powerpc: Add size argument to pgtable_cache_add
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
We will use this later with THP changes. With THP we want to create a PMD with
twice the size. The second half will be used to deposit the pgtable, which will
carry the hpte hash index value.
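The size relationship can be sketched as follows. This is only an
illustration: demo_table_size() mirrors the "sizeof(void *) << shift"
expression used by pgtable_cache_add(), while demo_thp_pmd_table_size() is a
hypothetical helper showing the doubled size a later caller might pass.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a table of 2^shift pointer-sized entries. */
static inline size_t demo_table_size(unsigned int shift)
{
	assert(shift < 8 * sizeof(size_t));
	return sizeof(void *) << shift;
}

/* Hypothetical THP-style PMD table: twice the normal size, so the
 * second half can hold the deposited pgtable pointers. */
static inline size_t demo_thp_pmd_table_size(unsigned int pmd_index_size)
{
	return 2 * demo_table_size(pmd_index_size);
}
```

Keeping the index and the size as separate arguments lets the cache slot stay
keyed by the index while the object size differs from the default.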
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 7 ++++++-
arch/powerpc/mm/init_64.c | 16 ++++++++--------
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..658ba7c 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -338,8 +338,13 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry)
#define pgoff_to_pte(off) ((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
#define PTE_FILE_MAX_BITS (BITS_PER_LONG - PTE_RPN_SHIFT)
-void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
+extern void __pgtable_cache_add(unsigned index, unsigned long table_size,
+ void (*ctor)(void *));
void pgtable_cache_init(void);
+static inline void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+ return __pgtable_cache_add(shift, sizeof(void *) << shift, ctor);
+}
/*
* find_linux_pte returns the address of a linux pte for a given
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..b378438 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -100,10 +100,10 @@ struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
* everything else. Caches created by this function are used for all
* the higher level pagetables, and for hugepage pagetables.
*/
-void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+void __pgtable_cache_add(unsigned int index, unsigned long table_size,
+ void (*ctor)(void *))
{
char *name;
- unsigned long table_size = sizeof(void *) << shift;
unsigned long align = table_size;
/* When batching pgtable pointers for RCU freeing, we store
@@ -111,7 +111,7 @@ void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
* big enough to fit it.
*
* Likewise, hugeapge pagetable pointers contain a (different)
- * shift value in the low bits. All tables must be aligned so
+ * huge page size in the low bits. All tables must be aligned so
* as to leave enough 0 bits in the address to contain it. */
unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
HUGEPD_SHIFT_MASK + 1);
@@ -121,17 +121,17 @@ void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
* moment, gcc doesn't seem to recognize is_power_of_2 as a
* constant expression, so so much for that. */
BUG_ON(!is_power_of_2(minalign));
- BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
+ BUG_ON((index < 1) || (index > MAX_PGTABLE_INDEX_SIZE));
- if (PGT_CACHE(shift))
+ if (PGT_CACHE(index))
return; /* Already have a cache of this size */
align = max_t(unsigned long, align, minalign);
- name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+ name = kasprintf(GFP_KERNEL, "pgtable-2^%d", index);
new = kmem_cache_create(name, table_size, align, 0, ctor);
- PGT_CACHE(shift) = new;
+ PGT_CACHE(index) = new;
- pr_debug("Allocated pgtable cache for order %d\n", shift);
+ pr_debug("Allocated pgtable cache for order %d\n", index);
}
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 09/17] powerpc/mm: Use encode avpn where we need only avpn values
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/mmu-hash64.h | 8 ++++----
arch/powerpc/mm/hash_native_64.c | 10 +++++-----
arch/powerpc/platforms/pseries/lpar.c | 2 +-
3 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 6ec65b6..aeeee5e 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -237,14 +237,14 @@ static inline unsigned long hpte_encode_avpn(unsigned long vpn, int psize,
/*
* This function sets the AVPN and L fields of the HPTE appropriately
- * for the page size
+ * using the base page size and actual page size.
*/
-static inline unsigned long hpte_encode_v(unsigned long vpn,
- int psize, int ssize)
+static inline unsigned long hpte_encode_v(unsigned long vpn, int psize,
+ int apsize, int ssize)
{
unsigned long v;
v = hpte_encode_avpn(vpn, psize, ssize);
- if (psize != MMU_PAGE_4K)
+ if (apsize != MMU_PAGE_4K)
v |= HPTE_V_LARGE;
return v;
}
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index da46cd3..4cf361f 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -216,7 +216,7 @@ static long native_hpte_insert(unsigned long hpte_group, unsigned long vpn,
if (i == HPTES_PER_GROUP)
return -1;
- hpte_v = hpte_encode_v(vpn, psize, ssize) | vflags | HPTE_V_VALID;
+ hpte_v = hpte_encode_v(vpn, psize, apsize, ssize) | vflags | HPTE_V_VALID;
hpte_r = hpte_encode_r(pa, psize, apsize) | rflags;
if (!(vflags & HPTE_V_BOLTED)) {
@@ -327,7 +327,7 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
int ret = 0;
int actual_psize;
- want_v = hpte_encode_v(vpn, psize, ssize);
+ want_v = hpte_encode_avpn(vpn, psize, ssize);
DBG_LOW(" update(vpn=%016lx, avpnv=%016lx, group=%lx, newpp=%lx)",
vpn, want_v & HPTE_V_AVPN, slot, newpp);
@@ -364,7 +364,7 @@ static long native_hpte_find(unsigned long vpn, int psize, int ssize)
unsigned long want_v, hpte_v;
hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, ssize);
- want_v = hpte_encode_v(vpn, psize, ssize);
+ want_v = hpte_encode_avpn(vpn, psize, ssize);
/* Bolted mappings are only ever in the primary group */
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
@@ -427,7 +427,7 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
DBG_LOW(" invalidate(vpn=%016lx, hash: %lx)\n", vpn, slot);
- want_v = hpte_encode_v(vpn, psize, ssize);
+ want_v = hpte_encode_avpn(vpn, psize, ssize);
native_lock_hpte(hptep);
hpte_v = hptep->v;
@@ -599,7 +599,7 @@ static void native_flush_hash_range(unsigned long number, int local)
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
slot += hidx & _PTEIDX_GROUP_IX;
hptep = htab_address + slot;
- want_v = hpte_encode_v(vpn, psize, ssize);
+ want_v = hpte_encode_avpn(vpn, psize, ssize);
native_lock_hpte(hptep);
hpte_v = hptep->v;
if (!HPTE_V_COMPARE(hpte_v, want_v) ||
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 9f99847..ca9c2bb 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -121,7 +121,7 @@ static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
"pa=%016lx, rflags=%lx, vflags=%lx, psize=%d)\n",
hpte_group, vpn, pa, rflags, vflags, psize);
- hpte_v = hpte_encode_v(vpn, psize, ssize) | vflags | HPTE_V_VALID;
+ hpte_v = hpte_encode_v(vpn, psize, apsize, ssize) | vflags | HPTE_V_VALID;
hpte_r = hpte_encode_r(pa, psize, apsize) | rflags;
if (!(vflags & HPTE_V_BOLTED))
--
1.7.10
^ permalink raw reply related
* [RFC PATCH 13/17] powerpc/THP: Add code to handle HPTE faults for large pages
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
We now have pmd entries covering a 16MB range. To implement THP on powerpc,
we double the size of PMD. The second half is used to deposit the pgtable (PTE page).
We also use the deposited PTE page for tracking the HPTE information. The information
includes [ secondary group | 3 bit hidx | valid ]. We use one byte per HPTE entry.
With a 16MB huge page and 64K HPTE we need 256 entries and with 4K HPTE we need
4096 entries. Both will fit in a 4K PTE page.
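The arithmetic above can be checked with a small sketch. The entry counts
follow directly from the sizes in the text. The bit positions in demo_pack()
are an assumption made for illustration, since the message only names the
three fields, and all DEMO_/demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HUGE_PAGE_SIZE	(16UL << 20)	/* 16MB huge page */
#define DEMO_TRACK_AREA_SIZE	4096UL		/* 4K of tracking space */

/* Number of HPTEs needed to map one huge page at a given HPTE page size. */
static inline unsigned long demo_hpte_count(unsigned long hpte_page_size)
{
	assert(hpte_page_size != 0 &&
	       DEMO_HUGE_PAGE_SIZE % hpte_page_size == 0);
	return DEMO_HUGE_PAGE_SIZE / hpte_page_size;
}

/* One byte per HPTE: [ secondary group | 3 bit hidx | valid ].
 * The exact bit layout here is assumed, not taken from the patch. */
static inline unsigned char demo_pack(unsigned int secondary,
				      unsigned int hidx, unsigned int valid)
{
	return (unsigned char)(((secondary & 1u) << 4) |
			       ((hidx & 7u) << 1) | (valid & 1u));
}
```

At one byte per entry, the 4K HPTE case needs exactly 4096 bytes, so it fills
the 4K tracking area completely, while the 64K HPTE case uses only 256 bytes.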
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/mmu-hash64.h | 5 +
arch/powerpc/include/asm/pgtable-ppc64.h | 33 ++----
arch/powerpc/kernel/io-workarounds.c | 2 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +-
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 5 +-
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/hash_utils_64.c | 12 ++-
arch/powerpc/mm/hugetlbpage.c | 19 +++-
arch/powerpc/mm/largepage-hash64.c | 170 ++++++++++++++++++++++++++++++
arch/powerpc/mm/pgtable.c | 34 ++++++
arch/powerpc/mm/tlb_hash64.c | 2 +-
arch/powerpc/perf/callchain.c | 2 +-
arch/powerpc/platforms/pseries/eeh.c | 2 +-
13 files changed, 248 insertions(+), 41 deletions(-)
create mode 100644 arch/powerpc/mm/largepage-hash64.c
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index aeeee5e..f1024c8 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -319,6 +319,11 @@ extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
pte_t *ptep, unsigned long trap, int local, int ssize,
unsigned int shift, unsigned int mmu_psize);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __hash_page_thp(unsigned long ea, unsigned long access,
+ unsigned long vsid, pmd_t *pmdp, unsigned long trap,
+ int local, int ssize, unsigned int psize);
+#endif
extern void hash_failure_debug(unsigned long ea, unsigned long access,
unsigned long vsid, unsigned long trap,
int ssize, int psize, int lpsize,
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0da8840..d9579a5 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -350,39 +350,18 @@ static inline void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
return __pgtable_cache_add(shift, sizeof(void *) << shift, ctor);
}
-/*
- * find_linux_pte returns the address of a linux pte for a given
- * effective address and directory. If not found, it returns zero.
- */
-static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
-{
- pgd_t *pg;
- pud_t *pu;
- pmd_t *pm;
- pte_t *pt = NULL;
-
- pg = pgdir + pgd_index(ea);
- if (!pgd_none(*pg)) {
- pu = pud_offset(pg, ea);
- if (!pud_none(*pu)) {
- pm = pmd_offset(pu, ea);
- if (pmd_present(*pm))
- pt = pte_offset_kernel(pm, ea);
- }
- }
- return pt;
-}
-
-#ifdef CONFIG_HUGETLB_PAGE
+#if defined(CONFIG_HUGETLB_PAGE)
pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
- unsigned *shift);
+ unsigned *shift, unsigned int *thp);
#else
+pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea, unsigned int *thp);
static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
- unsigned *shift)
+ unsigned *shift,
+ unsigned int *thp)
{
if (shift)
*shift = 0;
- return find_linux_pte(pgdir, ea);
+ return find_linux_pte(pgdir, ea, thp);
}
#endif /* !CONFIG_HUGETLB_PAGE */
diff --git a/arch/powerpc/kernel/io-workarounds.c b/arch/powerpc/kernel/io-workarounds.c
index 50e90b7..a37c5d2 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -70,7 +70,7 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
if (vaddr < PHB_IO_BASE || vaddr >= PHB_IO_END)
return NULL;
- ptep = find_linux_pte(init_mm.pgd, vaddr);
+ ptep = find_linux_pte(init_mm.pgd, vaddr, NULL);
if (ptep == NULL)
paddr = 0;
else
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8cc18ab..4f2a7dc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -683,7 +683,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
*/
rcu_read_lock_sched();
ptep = find_linux_pte_or_hugepte(current->mm->pgd,
- hva, NULL);
+ hva, NULL, NULL);
if (ptep && pte_present(*ptep)) {
pte = kvmppc_read_update_linux_pte(ptep, 1);
if (pte_write(pte))
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 19c93ba..5a9b7f6 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -27,7 +27,7 @@ static void *real_vmalloc_addr(void *x)
unsigned long addr = (unsigned long) x;
pte_t *p;
- p = find_linux_pte(swapper_pg_dir, addr);
+ p = find_linux_pte(swapper_pg_dir, addr, NULL);
if (!p || !pte_present(*p))
return NULL;
/* assume we don't have huge pages in vmalloc space... */
@@ -145,6 +145,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
unlock_rmap(rmap);
}
+/* FIXME!! check */
static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
int writing, unsigned long *pte_sizep)
{
@@ -152,7 +153,7 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
unsigned long ps = *pte_sizep;
unsigned int shift;
- ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+ ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift, NULL);
if (!ptep)
return __pte(0);
if (shift)
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 3787b61..6b09f9d 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -33,6 +33,7 @@ obj-y += hugetlbpage.o
obj-$(CONFIG_PPC_STD_MMU_64) += hugetlbpage-hash64.o
obj-$(CONFIG_PPC_BOOK3E_MMU) += hugetlbpage-book3e.o
endif
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += largepage-hash64.o
obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
obj-$(CONFIG_HIGHMEM) += highmem.o
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index a06b55a..3a1752f 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -939,7 +939,7 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
unsigned long vsid;
struct mm_struct *mm;
pte_t *ptep;
- unsigned hugeshift;
+ unsigned hugeshift, thp;
const struct cpumask *tmp;
int rc, user_region = 0, local = 0;
int psize, ssize;
@@ -1005,7 +1005,7 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
#endif /* CONFIG_PPC_64K_PAGES */
/* Get PTE and page size from page tables */
- ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
+ ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift, &thp);
if (ptep == NULL || !pte_present(*ptep)) {
DBG_LOW(" no PTE !\n");
return 1;
@@ -1028,6 +1028,12 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
ssize, hugeshift, psize);
#endif /* CONFIG_HUGETLB_PAGE */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (thp)
+ return __hash_page_thp(ea, access, vsid, (pmd_t *)ptep,
+ trap, local, ssize, psize);
+#endif
+
#ifndef CONFIG_PPC_64K_PAGES
DBG_LOW(" i-pte: %016lx\n", pte_val(*ptep));
#else
@@ -1133,7 +1139,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
pgdir = mm->pgd;
if (pgdir == NULL)
return;
- ptep = find_linux_pte(pgdir, ea);
+ ptep = find_linux_pte(pgdir, ea, NULL);
if (!ptep)
return;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 1a6de0a..bce7a9f 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -67,7 +67,8 @@ static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize)
#define hugepd_none(hpd) ((hpd).pd == 0)
-pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift)
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+ unsigned *shift, unsigned int *thp)
{
pgd_t *pg;
pud_t *pu;
@@ -77,6 +78,8 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
if (shift)
*shift = 0;
+ if (thp)
+ *thp = 0;
pg = pgdir + pgd_index(ea);
if (is_hugepd(pg)) {
@@ -91,12 +94,20 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
pm = pmd_offset(pu, ea);
if (is_hugepd(pm))
hpdp = (hugepd_t *)pm;
- else if (!pmd_none(*pm)) {
+ else if (pmd_large(*pm)) {
+ /* THP page */
+ if (thp)
+ *thp = 1;
+ /*
+ * This should be ok: apart from a few flags, most of
+ * the pte bits and large page pmd bits map to each other
+ */
+ return (pte_t *)pm;
+ } else if (!pmd_none(*pm)) {
return pte_offset_kernel(pm, ea);
}
}
}
-
if (!hpdp)
return NULL;
@@ -614,7 +625,7 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
unsigned shift;
unsigned long mask;
- ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+ ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift, NULL);
/* Verify it is a huge page else bail. */
if (!ptep || !shift)
diff --git a/arch/powerpc/mm/largepage-hash64.c b/arch/powerpc/mm/largepage-hash64.c
new file mode 100644
index 0000000..2a5fc39
--- /dev/null
+++ b/arch/powerpc/mm/largepage-hash64.c
@@ -0,0 +1,170 @@
+/*
+ * PPC64 THP Support for hash based MMUs
+ */
+
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/cacheflush.h>
+#include <asm/machdep.h>
+#include <asm/udbg.h>
+
+/*
+ * A linux huge page PMD was changed and the corresponding hash table entry
+ * needs to be flushed. FIXME!! there is no batching support yet.
+ *
+ * The linux huge page PMD now includes the pmd entries followed by the address
+ * of the stashed pgtable_t. The stashed pgtable_t contains the hpte bits.
+ * [ secondary group | 3 bit hidx | valid ]. We use one byte per HPTE entry.
+ * With 16MB huge page and 64K HPTE we need 256 entries and with 4K HPTE we need
+ * 4096 entries. Both will fit in a 4K pgtable_t.
+ */
+int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ pmd_t *pmdp, unsigned long trap, int local, int ssize,
+ unsigned int psize)
+{
+ unsigned int index, valid;
+ unsigned char *hpte_slot_array;
+ unsigned long rflags, pa, hidx;
+ unsigned long old_pmd, new_pmd;
+ int ret, lpsize = MMU_PAGE_16M;
+ unsigned long vpn, hash, shift, slot;
+
+ /*
+ * atomically mark the linux large page PMD busy and dirty
+ */
+ do {
+ old_pmd = pmd_val(*pmdp);
+ /* If PMD busy, retry the access */
+ if (unlikely(old_pmd & PMD_HUGE_BUSY))
+ return 0;
+ /* If PMD permissions don't match, take page fault */
+ if (unlikely(access & ~old_pmd))
+ return 1;
+ /*
+ * Try to lock the PTE, add ACCESSED and DIRTY if it was
+ * a write access
+ */
+ new_pmd = old_pmd | PMD_HUGE_BUSY | PMD_HUGE_ACCESSED;
+ if (access & _PAGE_RW)
+ new_pmd |= PMD_HUGE_DIRTY;
+ } while (old_pmd != __cmpxchg_u64((unsigned long *)pmdp,
+ old_pmd, new_pmd));
+ /*
+ * derive the rflags. Default enable read (0x2)
+ */
+ rflags = 0x2 | (!(new_pmd & PMD_HUGE_RW));
+ /* PMD_HUGE_EXEC -> HW_NO_EXEC since it's inverted */
+ rflags |= ((new_pmd & PMD_HUGE_EXEC) ? 0 : HPTE_R_N);
+
+#if 0 /* FIXME!! */
+ if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
+
+ /*
+ * No CPU has hugepages but lacks no execute, so we
+ * don't need to worry about that case
+ */
+ rflags = hash_page_do_lazy_icache(rflags, __pte(old_pte), trap);
+ }
+#endif
+ /*
+ * Find the slot index details for this ea, using base page size.
+ */
+ shift = mmu_psize_defs[psize].shift;
+ index = (ea & (HUGE_PAGE_SIZE - 1)) >> shift;
+ BUG_ON(index >= 4096);
+
+ vpn = hpt_vpn(ea, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ /*
+ * The hpte hidx values are stored in the pgtable whose address is in the
+ * second half of the PMD
+ */
+ hpte_slot_array = *(unsigned char **)(pmdp + PTRS_PER_PMD);
+
+ valid = hpte_slot_array[index] & 0x1;
+ if (unlikely(valid)) {
+ /* update the hpte bits */
+ hidx = hpte_slot_array[index] >> 1;
+ if (hidx & _PTEIDX_SECONDARY)
+ hash = ~hash;
+ slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+ slot += hidx & _PTEIDX_GROUP_IX;
+
+ ret = ppc_md.hpte_updatepp(slot, rflags, vpn,
+ psize, ssize, local);
+ /*
+ * We failed to update, try to insert a new entry.
+ */
+ if (ret == -1) {
+ /*
+ * large pte is marked busy, so we can be sure
+ * nobody is looking at hpte_slot_array. hence we can
+ * safely update this here.
+ */
+ hpte_slot_array[index] = 0;
+ valid = 0;
+ }
+ }
+
+ if (likely(!valid)) {
+ unsigned long hpte_group;
+
+ /* insert new entry */
+ pa = pmd_pfn(__pmd(old_pmd)) << PAGE_SHIFT;
+repeat:
+ hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+
+ /* clear the busy bits and set the hash pte bits */
+ new_pmd = (new_pmd & ~PMD_HUGE_HPTEFLAGS) | PMD_HUGE_HASHPTE;
+
+#if 0
+ /* Add in WIMG bits. FIXME!! enabled by default */
+ rflags |= (new_pmd & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+ _PAGE_COHERENT | _PAGE_GUARDED));
+#endif
+ /* Insert into the hash table, primary slot */
+ slot = ppc_md.hpte_insert(hpte_group, vpn, pa, rflags, 0,
+ psize, lpsize, ssize);
+ /*
+ * Primary is full, try the secondary
+ */
+ if (unlikely(slot == -1)) {
+ hpte_group = ((~hash & htab_hash_mask) *
+ HPTES_PER_GROUP) & ~0x7UL;
+ slot = ppc_md.hpte_insert(hpte_group, vpn, pa,
+ rflags, HPTE_V_SECONDARY,
+ psize, lpsize, ssize);
+ if (slot == -1) {
+ if (mftb() & 0x1)
+ hpte_group = ((hash & htab_hash_mask) *
+ HPTES_PER_GROUP) & ~0x7UL;
+
+ ppc_md.hpte_remove(hpte_group);
+ goto repeat;
+ }
+ }
+ /*
+ * Hypervisor failure. Restore old pmd and return -1
+ * similar to __hash_page_*
+ */
+ if (unlikely(slot == -2)) {
+ *pmdp = __pmd(old_pmd);
+ hash_failure_debug(ea, access, vsid, trap, ssize,
+ psize, lpsize, old_pmd);
+ return -1;
+ }
+ /*
+ * large pte is marked busy, so we can be sure
+ * nobody is looking at hpte_slot_array. hence we can
+ * safely update this here.
+ */
+ hpte_slot_array[index] = slot << 1 | 0x1;
+ }
+ /*
+ * No need to use ldarx/stdcx here
+ */
+ *pmdp = __pmd(new_pmd & ~PMD_HUGE_BUSY);
+ return 0;
+}
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e173b5e..841271f 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -580,3 +580,37 @@ struct page *pmd_page(pmd_t pmd)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
+ * find_linux_pte returns the address of a linux pte for a given
+ * effective address and directory. If not found, it returns NULL.
+ */
+pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea, unsigned int *thp)
+{
+ pgd_t *pg;
+ pud_t *pu;
+ pmd_t *pm;
+ pte_t *pt = NULL;
+
+ if (thp)
+ *thp = 0;
+ pg = pgdir + pgd_index(ea);
+ if (!pgd_none(*pg)) {
+ pu = pud_offset(pg, ea);
+ if (!pud_none(*pu)) {
+ pm = pmd_offset(pu, ea);
+ if (pmd_large(*pm)) {
+ /* THP page */
+ if (thp)
+ *thp = 1;
+ /*
+ * This should be ok: apart from a few flags, most of
+ * the pte bits and large page pmd bits map to each other
+ */
+ return (pte_t *)pm;
+ } else if (pmd_present(*pm))
+ pt = pte_offset_kernel(pm, ea);
+ }
+ }
+ return pt;
+}
diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
index 023ec8a..9a951d5 100644
--- a/arch/powerpc/mm/tlb_hash64.c
+++ b/arch/powerpc/mm/tlb_hash64.c
@@ -206,7 +206,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
for (; start < end; start += PAGE_SIZE) {
- pte_t *ptep = find_linux_pte(mm->pgd, start);
+ pte_t *ptep = find_linux_pte(mm->pgd, start, NULL);
unsigned long pte;
if (ptep == NULL)
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index 74d1e78..578cac7 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -125,7 +125,7 @@ static int read_user_stack_slow(void __user *ptr, void *ret, int nb)
if (!pgdir)
return -EFAULT;
- ptep = find_linux_pte_or_hugepte(pgdir, addr, &shift);
+ ptep = find_linux_pte_or_hugepte(pgdir, addr, &shift, NULL);
if (!shift)
shift = PAGE_SHIFT;
diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c
index 9a04322..d6f8f0e 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -261,7 +261,7 @@ static inline unsigned long eeh_token_to_phys(unsigned long token)
pte_t *ptep;
unsigned long pa;
- ptep = find_linux_pte(init_mm.pgd, token);
+ ptep = find_linux_pte(init_mm.pgd, token, NULL);
if (!ptep)
return token;
pa = pte_pfn(*ptep) << PAGE_SHIFT;
--
1.7.10
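
Note on the slot array used by __hash_page_thp() above: a minimal user-space sketch of the one-byte-per-HPTE encoding described in the comment, [ secondary group | 3 bit hidx | valid ]. The helper names (slot_pack, slot_hidx, slot_entries, slot_index) are hypothetical and exist only for this illustration; the mask values are assumed to mirror _PTEIDX_SECONDARY and _PTEIDX_GROUP_IX.

```c
#include <stdint.h>

/* Assumed to mirror _PTEIDX_SECONDARY / _PTEIDX_GROUP_IX; illustration only. */
#define SLOT_VALID      0x1u
#define HIDX_SECONDARY  0x8u
#define HIDX_GROUP_IX   0x7u

/* Pack a 4-bit hidx (secondary bit + 3-bit group index) with the valid bit. */
static inline uint8_t slot_pack(unsigned int hidx)
{
	return (uint8_t)(((hidx & (HIDX_SECONDARY | HIDX_GROUP_IX)) << 1) | SLOT_VALID);
}

/* Non-zero when the entry records an inserted HPTE. */
static inline int slot_is_valid(uint8_t entry)
{
	return entry & SLOT_VALID;
}

/* Recover the hidx stored above the valid bit. */
static inline unsigned int slot_hidx(uint8_t entry)
{
	return entry >> 1;
}

/* Number of one-byte entries needed to track one huge page. */
static inline unsigned long slot_entries(unsigned int huge_shift,
					 unsigned int base_shift)
{
	return 1ul << (huge_shift - base_shift);
}

/* Index of the base page containing ea within its huge page. */
static inline unsigned long slot_index(uint64_t ea, unsigned int huge_shift,
				       unsigned int base_shift)
{
	return (unsigned long)((ea & ((1ull << huge_shift) - 1)) >> base_shift);
}
```

With a 16MB huge page (shift 24), slot_entries() gives 256 entries for 64K base pages and 4096 entries for 4K base pages, which is where the bound on index comes from.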
* [RFC PATCH 06/17] powerpc/mm: Decode the pte-lp-encoding bits correctly.
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
We look at both the segment base page size and actual page size and store
the pte-lp-encodings in an array per base page size.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
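
Note: a minimal user-space sketch of the two-level lookup this patch introduces, where the LP encoding is selected by segment base page size and then by actual page size. The enum, struct and function names are stand-ins invented for this note, the R_RPN_MASK value is assumed to match HPTE_R_RPN, and the penc values in example_defs are illustrative, not read from any device tree.

```c
#include <stdint.h>

enum { PG_4K, PG_64K, PG_16M, PG_COUNT };	/* stand-ins for MMU_PAGE_* */

struct psize_def {
	unsigned int shift;		/* log2 of the page size */
	unsigned int penc[PG_COUNT];	/* LP encoding, indexed by actual page size */
};

#define LP_SHIFT	12
#define R_RPN_MASK	0x0ffffffffffff000ull	/* assumed equal to HPTE_R_RPN */

/* Illustrative table: 64K base segments may hold 64K or 16M pages. */
static const struct psize_def example_defs[PG_COUNT] = {
	[PG_4K]  = { .shift = 12 },
	[PG_64K] = { .shift = 16, .penc = { [PG_64K] = 1, [PG_16M] = 8 } },
	[PG_16M] = { .shift = 24 },
};

/* Build the second HPTE doubleword: aligned pa plus the LP encoding. */
static uint64_t encode_r(const struct psize_def *defs, uint64_t pa,
			 int base_psize, int actual_psize)
{
	uint64_t shift, penc;

	/* A 4K page needs no special encoding. */
	if (actual_psize == PG_4K)
		return pa & R_RPN_MASK;

	shift = defs[actual_psize].shift;
	penc = defs[base_psize].penc[actual_psize];
	return (pa & ~((1ull << shift) - 1)) | (penc << LP_SHIFT);
}
```

The alignment mask comes from the actual page size while the encoding comes from the (base, actual) pair, which is why both sizes have to be passed down to hpte_insert.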
arch/powerpc/include/asm/machdep.h | 3 +-
arch/powerpc/include/asm/mmu-hash64.h | 12 ++--
arch/powerpc/mm/hash_low_64.S | 18 ++++--
arch/powerpc/mm/hash_native_64.c | 105 ++++++++++++++++++++++++---------
arch/powerpc/mm/hash_utils_64.c | 103 +++++++++++++++++++-------------
arch/powerpc/platforms/pseries/lpar.c | 4 +-
6 files changed, 163 insertions(+), 82 deletions(-)
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 19d9d96..6cee6e0 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -50,7 +50,8 @@ struct machdep_calls {
unsigned long prpn,
unsigned long rflags,
unsigned long vflags,
- int psize, int ssize);
+ int psize, int apsize,
+ int ssize);
long (*hpte_remove)(unsigned long hpte_group);
void (*hpte_removebolted)(unsigned long ea,
int psize, int ssize);
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index c3b3518..6290e26 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -154,7 +154,7 @@ extern unsigned long htab_hash_mask;
struct mmu_psize_def
{
unsigned int shift; /* number of bits */
- unsigned int penc; /* HPTE encoding */
+ unsigned int penc[MMU_PAGE_COUNT]; /* HPTE encoding */
unsigned int tlbiel; /* tlbiel supported for that page size */
unsigned long avpnm; /* bits to mask out in AVPN in the HPTE */
unsigned long sllp; /* SLB L||LP (exact mask to use in slbmte) */
@@ -254,16 +254,18 @@ static inline unsigned long hpte_encode_v(unsigned long vpn,
* for the page size. We assume the pa is already "clean" that is properly
* aligned for the requested page size
*/
-static inline unsigned long hpte_encode_r(unsigned long pa, int psize)
+static inline unsigned long hpte_encode_r(unsigned long pa, int base_psize,
+ int actual_psize)
{
unsigned long r;
/* A 4K page needs no special encoding */
- if (psize == MMU_PAGE_4K)
+ if (actual_psize == MMU_PAGE_4K)
return pa & HPTE_R_RPN;
else {
- unsigned int penc = mmu_psize_defs[psize].penc;
- unsigned int shift = mmu_psize_defs[psize].shift;
+ unsigned int penc = mmu_psize_defs[base_psize].penc[actual_psize];
+ unsigned int shift = mmu_psize_defs[actual_psize].shift;
+ /* FIXME!! replace 12 by LP_SHIFT ? */
return (pa & ~((1ul << shift) - 1)) | (penc << 12);
}
return r;
diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index abdd5e2..0e980ac 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -196,7 +196,8 @@ htab_insert_pte:
mr r4,r29 /* Retrieve vpn */
li r7,0 /* !bolted, !secondary */
li r8,MMU_PAGE_4K /* page size */
- ld r9,STK_PARAM(R9)(r1) /* segment size */
+ li r9,MMU_PAGE_4K /* actual page size */
+ ld r10,STK_PARAM(R9)(r1) /* segment size */
_GLOBAL(htab_call_hpte_insert1)
bl . /* Patched by htab_finish_init() */
cmpdi 0,r3,0
@@ -219,7 +220,8 @@ _GLOBAL(htab_call_hpte_insert1)
mr r4,r29 /* Retrieve vpn */
li r7,HPTE_V_SECONDARY /* !bolted, secondary */
li r8,MMU_PAGE_4K /* page size */
- ld r9,STK_PARAM(R9)(r1) /* segment size */
+ li r9,MMU_PAGE_4K /* actual page size */
+ ld r10,STK_PARAM(R9)(r1) /* segment size */
_GLOBAL(htab_call_hpte_insert2)
bl . /* Patched by htab_finish_init() */
cmpdi 0,r3,0
@@ -515,7 +517,8 @@ htab_special_pfn:
mr r4,r29 /* Retrieve vpn */
li r7,0 /* !bolted, !secondary */
li r8,MMU_PAGE_4K /* page size */
- ld r9,STK_PARAM(R9)(r1) /* segment size */
+ li r9,MMU_PAGE_4K /* actual page size */
+ ld r10,STK_PARAM(R9)(r1) /* segment size */
_GLOBAL(htab_call_hpte_insert1)
bl . /* patched by htab_finish_init() */
cmpdi 0,r3,0
@@ -542,7 +545,8 @@ _GLOBAL(htab_call_hpte_insert1)
mr r4,r29 /* Retrieve vpn */
li r7,HPTE_V_SECONDARY /* !bolted, secondary */
li r8,MMU_PAGE_4K /* page size */
- ld r9,STK_PARAM(R9)(r1) /* segment size */
+ li r9,MMU_PAGE_4K /* actual page size */
+ ld r10,STK_PARAM(R9)(r1) /* segment size */
_GLOBAL(htab_call_hpte_insert2)
bl . /* patched by htab_finish_init() */
cmpdi 0,r3,0
@@ -840,7 +844,8 @@ ht64_insert_pte:
mr r4,r29 /* Retrieve vpn */
li r7,0 /* !bolted, !secondary */
li r8,MMU_PAGE_64K
- ld r9,STK_PARAM(R9)(r1) /* segment size */
+ li r9,MMU_PAGE_64K /* actual page size */
+ ld r10,STK_PARAM(R9)(r1) /* segment size */
_GLOBAL(ht64_call_hpte_insert1)
bl . /* patched by htab_finish_init() */
cmpdi 0,r3,0
@@ -863,7 +868,8 @@ _GLOBAL(ht64_call_hpte_insert1)
mr r4,r29 /* Retrieve vpn */
li r7,HPTE_V_SECONDARY /* !bolted, secondary */
li r8,MMU_PAGE_64K
- ld r9,STK_PARAM(R9)(r1) /* segment size */
+ li r9,MMU_PAGE_64K /* actual page size */
+ ld r10,STK_PARAM(R9)(r1) /* segment size */
_GLOBAL(ht64_call_hpte_insert2)
bl . /* patched by htab_finish_init() */
cmpdi 0,r3,0
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index ffc1e00..16ba033 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -36,10 +36,14 @@
#endif
#define HPTE_LOCK_BIT 3
+#define LP_SHIFT 12
+#define LP_BITS 8
+#define LP_MASK(i) ((0xFF >> (i)) << LP_SHIFT)
+
DEFINE_RAW_SPINLOCK(native_tlbie_lock);
-static inline void __tlbie(unsigned long vpn, int psize, int ssize)
+static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
{
unsigned long va;
unsigned int penc;
@@ -68,7 +72,7 @@ static inline void __tlbie(unsigned long vpn, int psize, int ssize)
break;
default:
/* We need 14 to 14 + i bits of va */
- penc = mmu_psize_defs[psize].penc;
+ penc = mmu_psize_defs[psize].penc[apsize];
va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
va |= penc << 12;
va |= ssize << 8;
@@ -80,7 +84,7 @@ static inline void __tlbie(unsigned long vpn, int psize, int ssize)
}
}
-static inline void __tlbiel(unsigned long vpn, int psize, int ssize)
+static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize)
{
unsigned long va;
unsigned int penc;
@@ -102,7 +106,7 @@ static inline void __tlbiel(unsigned long vpn, int psize, int ssize)
break;
default:
/* We need 14 to 14 + i bits of va */
- penc = mmu_psize_defs[psize].penc;
+ penc = mmu_psize_defs[psize].penc[apsize];
va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
va |= penc << 12;
va |= ssize << 8;
@@ -114,7 +118,8 @@ static inline void __tlbiel(unsigned long vpn, int psize, int ssize)
}
-static inline void tlbie(unsigned long vpn, int psize, int ssize, int local)
+static inline void tlbie(unsigned long vpn, int psize, int apsize,
+ int ssize, int local)
{
unsigned int use_local = local && mmu_has_feature(MMU_FTR_TLBIEL);
int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
@@ -125,10 +130,10 @@ static inline void tlbie(unsigned long vpn, int psize, int ssize, int local)
raw_spin_lock(&native_tlbie_lock);
asm volatile("ptesync": : :"memory");
if (use_local) {
- __tlbiel(vpn, psize, ssize);
+ __tlbiel(vpn, psize, apsize, ssize);
asm volatile("ptesync": : :"memory");
} else {
- __tlbie(vpn, psize, ssize);
+ __tlbie(vpn, psize, apsize, ssize);
asm volatile("eieio; tlbsync; ptesync": : :"memory");
}
if (lock_tlbie && !use_local)
@@ -156,7 +161,7 @@ static inline void native_unlock_hpte(struct hash_pte *hptep)
static long native_hpte_insert(unsigned long hpte_group, unsigned long vpn,
unsigned long pa, unsigned long rflags,
- unsigned long vflags, int psize, int ssize)
+ unsigned long vflags, int psize, int apsize, int ssize)
{
struct hash_pte *hptep = htab_address + hpte_group;
unsigned long hpte_v, hpte_r;
@@ -184,7 +189,7 @@ static long native_hpte_insert(unsigned long hpte_group, unsigned long vpn,
return -1;
hpte_v = hpte_encode_v(vpn, psize, ssize) | vflags | HPTE_V_VALID;
- hpte_r = hpte_encode_r(pa, psize) | rflags;
+ hpte_r = hpte_encode_r(pa, psize, apsize) | rflags;
if (!(vflags & HPTE_V_BOLTED)) {
DBG_LOW(" i=%x hpte_v=%016lx, hpte_r=%016lx\n",
@@ -244,6 +249,47 @@ static long native_hpte_remove(unsigned long hpte_group)
return i;
}
+static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
+{
+ unsigned int mask;
+ int i, penc, shift;
+ unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+
+#if 0
+ /*
+ * FIXME!! hpte_decode have more tricks. why not
+ * How do we find how many bits need to be used for r and z ?
+ */
+ for (i = 0; i < LP_BITS; i++) {
+ if ((hptep->r & LP_MASK(i+1)) == LP_MASK(i+1))
+ break;
+ }
+ penc = LP_MASK(i+1) >> LP_SHIFT;
+ for (i = 0; i < MMU_PAGE_COUNT; i++) {
+ if (penc == mmu_psize_defs[psize].penc[i])
+ return i;
+ }
+ return -1;
+#else
+ penc = 0;
+ /* is this better ? */
+ for (i = 0; i < MMU_PAGE_COUNT; i++) {
+ /* valid entries have a shift value */
+ if (!mmu_psize_defs[i].shift)
+ continue;
+
+ /* encoding bits per actual page size */
+ shift = mmu_psize_defs[i].shift - 11;
+ if (shift > 9)
+ shift = 9;
+ mask = (1 << shift) - 1;
+ if ((lp & mask) == mmu_psize_defs[psize].penc[i])
+ return i;
+ }
+ return -1;
+#endif
+}
+
static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
unsigned long vpn, int psize, int ssize,
int local)
@@ -251,6 +297,7 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
struct hash_pte *hptep = htab_address + slot;
unsigned long hpte_v, want_v;
int ret = 0;
+ int actual_psize;
want_v = hpte_encode_v(vpn, psize, ssize);
@@ -260,6 +307,7 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
native_lock_hpte(hptep);
hpte_v = hptep->v;
+ actual_psize = hpte_actual_psize(hptep, psize);
/* Even if we miss, we need to invalidate the TLB */
if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID)) {
@@ -274,7 +322,7 @@ static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
native_unlock_hpte(hptep);
/* Ensure it is out of the tlb too. */
- tlbie(vpn, psize, ssize, local);
+ tlbie(vpn, psize, actual_psize, ssize, local);
return ret;
}
@@ -315,6 +363,7 @@ static long native_hpte_find(unsigned long vpn, int psize, int ssize)
static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
int psize, int ssize)
{
+ int actual_psize;
unsigned long vpn;
unsigned long vsid;
long slot;
@@ -327,13 +376,14 @@ static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
if (slot == -1)
panic("could not find page to bolt\n");
hptep = htab_address + slot;
+ actual_psize = hpte_actual_psize(hptep, psize);
/* Update the HPTE */
hptep->r = (hptep->r & ~(HPTE_R_PP | HPTE_R_N)) |
(newpp & (HPTE_R_PP | HPTE_R_N));
/* Ensure it is out of the tlb too. */
- tlbie(vpn, psize, ssize, 0);
+ tlbie(vpn, psize, actual_psize, ssize, 0);
}
static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
@@ -343,6 +393,7 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
unsigned long hpte_v;
unsigned long want_v;
unsigned long flags;
+ int actual_psize;
local_irq_save(flags);
@@ -352,6 +403,7 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
native_lock_hpte(hptep);
hpte_v = hptep->v;
+ actual_psize = hpte_actual_psize(hptep, psize);
/* Even if we miss, we need to invalidate the TLB */
if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
native_unlock_hpte(hptep);
@@ -360,23 +412,19 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
hptep->v = 0;
/* Invalidate the TLB */
- tlbie(vpn, psize, ssize, local);
+ tlbie(vpn, psize, actual_psize, ssize, local);
local_irq_restore(flags);
}
-#define LP_SHIFT 12
-#define LP_BITS 8
-#define LP_MASK(i) ((0xFF >> (i)) << LP_SHIFT)
-
static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
- int *psize, int *ssize, unsigned long *vpn)
+ int *psize, int *apsize, int *ssize, unsigned long *vpn)
{
unsigned long avpn, pteg, vpi;
unsigned long hpte_r = hpte->r;
unsigned long hpte_v = hpte->v;
unsigned long vsid, seg_off;
- int i, size, shift, penc;
+ int i, size, a_size = MMU_PAGE_4K, shift, penc;
if (!(hpte_v & HPTE_V_LARGE))
size = MMU_PAGE_4K;
@@ -395,12 +443,13 @@ static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
/* valid entries have a shift value */
if (!mmu_psize_defs[size].shift)
continue;
-
- if (penc == mmu_psize_defs[size].penc)
- break;
+ for (a_size = 0; a_size < MMU_PAGE_COUNT; a_size++)
+ if (penc == mmu_psize_defs[size].penc[a_size])
+ goto out;
}
}
+out:
/* This works for all page sizes, and for 256M and 1T segments */
*ssize = hpte_v >> HPTE_V_SSIZE_SHIFT;
shift = mmu_psize_defs[size].shift;
@@ -433,7 +482,8 @@ static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
default:
*vpn = size = 0;
}
- *psize = size;
+ *psize = size;
+ *apsize = a_size;
}
/*
@@ -451,7 +501,7 @@ static void native_hpte_clear(void)
struct hash_pte *hptep = htab_address;
unsigned long hpte_v;
unsigned long pteg_count;
- int psize, ssize;
+ int psize, apsize, ssize;
pteg_count = htab_hash_mask + 1;
@@ -477,9 +527,9 @@ static void native_hpte_clear(void)
* already hold the native_tlbie_lock.
*/
if (hpte_v & HPTE_V_VALID) {
- hpte_decode(hptep, slot, &psize, &ssize, &vpn);
+ hpte_decode(hptep, slot, &psize, &apsize, &ssize, &vpn);
hptep->v = 0;
- __tlbie(vpn, psize, ssize);
+ __tlbie(vpn, psize, apsize, ssize);
}
}
@@ -491,6 +541,7 @@ static void native_hpte_clear(void)
/*
* Batched hash table flush, we batch the tlbie's to avoid taking/releasing
* the lock all the time
+ * FIXME!! large page support needed ?
*/
static void native_flush_hash_range(unsigned long number, int local)
{
@@ -540,7 +591,7 @@ static void native_flush_hash_range(unsigned long number, int local)
pte_iterate_hashed_subpages(pte, psize,
vpn, index, shift) {
- __tlbiel(vpn, psize, ssize);
+ __tlbiel(vpn, psize, psize, ssize);
} pte_iterate_hashed_end();
}
asm volatile("ptesync":::"memory");
@@ -557,7 +608,7 @@ static void native_flush_hash_range(unsigned long number, int local)
pte_iterate_hashed_subpages(pte, psize,
vpn, index, shift) {
- __tlbie(vpn, psize, ssize);
+ __tlbie(vpn, psize, psize, ssize);
} pte_iterate_hashed_end();
}
asm volatile("eieio; tlbsync; ptesync":::"memory");
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index bfeab83..48edb46 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -125,7 +125,7 @@ static struct mmu_psize_def mmu_psize_defaults_old[] = {
[MMU_PAGE_4K] = {
.shift = 12,
.sllp = 0,
- .penc = 0,
+ .penc[MMU_PAGE_4K] = 0,
.avpnm = 0,
.tlbiel = 0,
},
@@ -139,14 +139,14 @@ static struct mmu_psize_def mmu_psize_defaults_gp[] = {
[MMU_PAGE_4K] = {
.shift = 12,
.sllp = 0,
- .penc = 0,
+ .penc[MMU_PAGE_4K] = 0,
.avpnm = 0,
.tlbiel = 1,
},
[MMU_PAGE_16M] = {
.shift = 24,
.sllp = SLB_VSID_L,
- .penc = 0,
+ .penc[MMU_PAGE_16M] = 0,
.avpnm = 0x1UL,
.tlbiel = 0,
},
@@ -208,7 +208,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
BUG_ON(!ppc_md.hpte_insert);
ret = ppc_md.hpte_insert(hpteg, vpn, paddr, tprot,
- HPTE_V_BOLTED, psize, ssize);
+ HPTE_V_BOLTED, psize, psize, ssize);
if (ret < 0)
break;
@@ -275,6 +275,30 @@ static void __init htab_init_seg_sizes(void)
of_scan_flat_dt(htab_dt_scan_seg_sizes, NULL);
}
+static int __init get_idx_from_shift(unsigned int shift)
+{
+ int idx = -1;
+
+ switch (shift) {
+ case 0xc:
+ idx = MMU_PAGE_4K;
+ break;
+ case 0x10:
+ idx = MMU_PAGE_64K;
+ break;
+ case 0x14:
+ idx = MMU_PAGE_1M;
+ break;
+ case 0x18:
+ idx = MMU_PAGE_16M;
+ break;
+ case 0x22:
+ idx = MMU_PAGE_16G;
+ break;
+ }
+ return idx;
+}
+
static int __init htab_dt_scan_page_sizes(unsigned long node,
const char *uname, int depth,
void *data)
@@ -294,60 +318,57 @@ static int __init htab_dt_scan_page_sizes(unsigned long node,
size /= 4;
cur_cpu_spec->mmu_features &= ~(MMU_FTR_16M_PAGE);
while(size > 0) {
- unsigned int shift = prop[0];
+ unsigned int base_shift = prop[0];
unsigned int slbenc = prop[1];
unsigned int lpnum = prop[2];
- unsigned int lpenc = 0;
struct mmu_psize_def *def;
- int idx = -1;
+ int idx, base_idx;
size -= 3; prop += 3;
- while(size > 0 && lpnum) {
- if (prop[0] == shift)
- lpenc = prop[1];
+ base_idx = get_idx_from_shift(base_shift);
+ if (base_idx < 0) {
+ /*
+ * skip the pte encoding also
+ */
prop += 2; size -= 2;
- lpnum--;
+ continue;
}
- switch(shift) {
- case 0xc:
- idx = MMU_PAGE_4K;
- break;
- case 0x10:
- idx = MMU_PAGE_64K;
- break;
- case 0x14:
- idx = MMU_PAGE_1M;
- break;
- case 0x18:
- idx = MMU_PAGE_16M;
+ def = &mmu_psize_defs[base_idx];
+ if (base_idx == MMU_PAGE_16M)
cur_cpu_spec->mmu_features |= MMU_FTR_16M_PAGE;
- break;
- case 0x22:
- idx = MMU_PAGE_16G;
- break;
- }
- if (idx < 0)
- continue;
- def = &mmu_psize_defs[idx];
- def->shift = shift;
- if (shift <= 23)
+
+ def->shift = base_shift;
+ if (base_shift <= 23)
def->avpnm = 0;
else
- def->avpnm = (1 << (shift - 23)) - 1;
+ def->avpnm = (1 << (base_shift - 23)) - 1;
def->sllp = slbenc;
- def->penc = lpenc;
- /* We don't know for sure what's up with tlbiel, so
+ /*
+ * We don't know for sure what's up with tlbiel, so
* for now we only set it for 4K and 64K pages
*/
- if (idx == MMU_PAGE_4K || idx == MMU_PAGE_64K)
+ if (base_idx == MMU_PAGE_4K || base_idx == MMU_PAGE_64K)
def->tlbiel = 1;
else
def->tlbiel = 0;
- DBG(" %d: shift=%02x, sllp=%04lx, avpnm=%08lx, "
- "tlbiel=%d, penc=%d\n",
- idx, shift, def->sllp, def->avpnm, def->tlbiel,
- def->penc);
+ while (size > 0 && lpnum) {
+ unsigned int shift = prop[0];
+ unsigned int penc = prop[1];
+
+ prop += 2; size -= 2;
+ lpnum--;
+
+ idx = get_idx_from_shift(shift);
+ if (idx < 0)
+ continue;
+
+ def->penc[idx] = penc;
+ DBG(" %d: shift=%02x, sllp=%04lx, "
+ "avpnm=%08lx, tlbiel=%d, penc=%d\n",
+ idx, shift, def->sllp, def->avpnm,
+ def->tlbiel, def->penc[idx]);
+ }
}
return 1;
}
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 0da39fe..9f99847 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -109,7 +109,7 @@ void vpa_init(int cpu)
static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
unsigned long vpn, unsigned long pa,
unsigned long rflags, unsigned long vflags,
- int psize, int ssize)
+ int psize, int apsize, int ssize)
{
unsigned long lpar_rc;
unsigned long flags;
@@ -122,7 +122,7 @@ static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
hpte_group, vpn, pa, rflags, vflags, psize);
hpte_v = hpte_encode_v(vpn, psize, ssize) | vflags | HPTE_V_VALID;
- hpte_r = hpte_encode_r(pa, psize) | rflags;
+ hpte_r = hpte_encode_r(pa, psize, apsize) | rflags;
if (!(vflags & HPTE_V_BOLTED))
pr_devel(" hpte_v=%016lx, hpte_r=%016lx\n", hpte_v, hpte_r);
--
1.7.10
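
Note on the property layout walked by htab_dt_scan_page_sizes() above: each record is (base shift, slbenc, lpnum) followed by lpnum pairs of (actual shift, penc). The following is a minimal user-space sketch of that walk, not the kernel code. All names are hypothetical, the example_prop values are illustrative, and skipping all lpnum pairs of an unrecognised base size is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

enum { IDX_4K, IDX_64K, IDX_1M, IDX_16M, IDX_16G, IDX_COUNT };

struct size_def {
	unsigned int shift;
	unsigned int sllp;
	unsigned int penc[IDX_COUNT];	/* indexed by actual page size */
};

/* Same shift-to-index mapping as get_idx_from_shift() in the patch. */
static int idx_from_shift(unsigned int shift)
{
	switch (shift) {
	case 0x0c: return IDX_4K;
	case 0x10: return IDX_64K;
	case 0x14: return IDX_1M;
	case 0x18: return IDX_16M;
	case 0x22: return IDX_16G;
	}
	return -1;
}

/* Illustrative property contents; the 0x1e record is deliberately unknown. */
static const uint32_t example_prop[] = {
	0x0c, 0x000, 1,  0x0c, 0x0,
	0x10, 0x110, 2,  0x10, 0x1,  0x18, 0x8,
	0x1e, 0x000, 2,  0x1e, 0x2,  0x22, 0x3,
	0x18, 0x100, 1,  0x18, 0x0,
};

static void parse_page_sizes(const uint32_t *prop, size_t n,
			     struct size_def defs[IDX_COUNT])
{
	while (n >= 3) {
		unsigned int base_shift = prop[0];
		unsigned int slbenc = prop[1];
		unsigned int lpnum = prop[2];
		int base = idx_from_shift(base_shift);

		prop += 3; n -= 3;
		if (base < 0) {
			/* Unknown base size: skip every encoding pair it owns. */
			size_t skip = 2 * (size_t)lpnum;

			if (skip > n)
				skip = n;
			prop += skip; n -= skip;
			continue;
		}
		defs[base].shift = base_shift;
		defs[base].sllp = slbenc;
		while (n >= 2 && lpnum) {
			int idx = idx_from_shift(prop[0]);

			if (idx >= 0)
				defs[base].penc[idx] = prop[1];
			prop += 2; n -= 2;
			lpnum--;
		}
	}
}
```

After parsing example_prop, the 64K base entry carries two encodings (one for 64K actual pages, one for 16M actual pages), which is the case the penc[] array exists for.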
* [RFC PATCH 02/17] arch/powerpc: Reduce the PTE_INDEX_SIZE
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This makes one PMD cover a 16MB range, which makes THP easier to implement
on power. The THP core code uses one pmd entry to track the huge page, so
the range mapped by a single pmd entry should be equal to the huge page size
supported by the hardware.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
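
Note: the arithmetic behind the change, as a small sketch. It assumes a 64K base page (PAGE_SHIFT of 16, as in pgtable-ppc64-64k.h); the struct and helper names are made up for this note.

```c
#define PAGE_SHIFT_64K 16	/* assumption: 64K base page size */

struct pt_geometry {
	unsigned int pte, pmd, pud, pgd;	/* *_INDEX_SIZE values */
};

/* log2 of the range mapped by one PMD entry. */
static unsigned int pmd_entry_shift(struct pt_geometry g)
{
	return PAGE_SHIFT_64K + g.pte;
}

/* Total number of effective address bits covered by the page tables. */
static unsigned int total_ea_bits(struct pt_geometry g)
{
	return PAGE_SHIFT_64K + g.pte + g.pmd + g.pud + g.pgd;
}

static const struct pt_geometry old_geometry = { 12, 12, 0, 6 };
static const struct pt_geometry new_geometry = { 8, 12, 0, 10 };
```

With the old values one PMD entry maps 2^28 bytes (256MB); with the new values it maps 2^24 bytes (16MB). Both layouts cover 46 bits in total, since the four bits removed from PTE_INDEX_SIZE are added to PGD_INDEX_SIZE.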
arch/powerpc/include/asm/pgtable-ppc64-64k.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
index be4e287..3c529b4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
@@ -4,10 +4,10 @@
#include <asm-generic/pgtable-nopud.h>
-#define PTE_INDEX_SIZE 12
+#define PTE_INDEX_SIZE 8
#define PMD_INDEX_SIZE 12
#define PUD_INDEX_SIZE 0
-#define PGD_INDEX_SIZE 6
+#define PGD_INDEX_SIZE 10
#ifndef __ASSEMBLY__
#define PTE_TABLE_SIZE (sizeof(real_pte_t) << PTE_INDEX_SIZE)
--
1.7.10
* [RFC PATCH 07/17] powerpc: Update tlbie/tlbiel as per ISA doc
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This makes sure we handle multiple page size segments (MPSS) correctly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
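
Note: a user-space sketch of how the large-page branch below assembles the tlbie operand. It only mirrors the arithmetic in this patch and is not a statement of the ISA. The function name is hypothetical, and the initial va = vpn << 12 step is an assumption about the part of __tlbie() not shown in the hunk.

```c
#include <stdint.h>

#define VPN_SHIFT_ASSUMED 12	/* assumption: va is rebuilt as vpn << 12 */

static uint64_t tlbie_va_large(uint64_t vpn, unsigned int penc,
			       unsigned int ssize, int mpss)
{
	uint64_t va = vpn << VPN_SHIFT_ASSUMED;

	/* Drop the top 16 bits, as the existing code does. */
	va &= ~(0xffffull << 48);
	/* Clear everything after bit 44 (IBM numbering): the low 20 bits. */
	va &= ~((1ull << (64 - 44)) - 1);
	va |= (uint64_t)penc << 12;	/* LP encoding */
	va |= (uint64_t)ssize << 8;	/* segment size */
	if (mpss)
		va |= (vpn >> 2) & 0xfe;	/* AVAL part, base != actual */
	va |= 1;			/* L */
	return va;
}
```

For example, tlbie_va_large(0x123456, 8, 0, 1) yields 0x123408015, while the same call with mpss set to 0 yields 0x123408001; the difference is the AVAL bits taken from the vpn.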
arch/powerpc/mm/hash_native_64.c | 52 +++++++++++++++++++++++++++++---------
1 file changed, 40 insertions(+), 12 deletions(-)
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 16ba033..da46cd3 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -43,7 +43,7 @@
DEFINE_RAW_SPINLOCK(native_tlbie_lock);
-static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
+static inline void __tlbie(unsigned long vpn, int bpsize, int apsize, int ssize)
{
unsigned long va;
unsigned int penc;
@@ -63,19 +63,33 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
*/
va &= ~(0xffffULL << 48);
- switch (psize) {
+ switch (bpsize) {
case MMU_PAGE_4K:
+ /* clear out bits after (52) [0....52.....63] */
+ va &= ~((1ul << (64 - 52)) - 1);
va |= ssize << 8;
+ va |= mmu_psize_defs[apsize].sllp << 6;
asm volatile(ASM_FTR_IFCLR("tlbie %0,0", PPC_TLBIE(%1,%0), %2)
: : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
: "memory");
break;
default:
/* We need 14 to 14 + i bits of va */
- penc = mmu_psize_defs[psize].penc[apsize];
- va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
+ penc = mmu_psize_defs[bpsize].penc[apsize];
+ /* clear out bits after (44) [0....44.....63] */
+ va &= ~((1ul << (64 - 44)) - 1);
va |= penc << 12;
va |= ssize << 8;
+ /* Add AVAL part */
+ if (bpsize != apsize) {
+ /*
+ * MPSS, 64K base page size and 16MB large page size
+ * We don't need all the bits, but this seems to work.
+ * vpn covers up to 65 bits of va. (0...65) and we need
+ * 56..62 bits of va.
+ */
+ va |= ((vpn >> 2) & 0xfe);
+ }
va |= 1; /* L */
asm volatile(ASM_FTR_IFCLR("tlbie %0,1", PPC_TLBIE(%1,%0), %2)
: : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
@@ -84,7 +98,7 @@ static inline void __tlbie(unsigned long vpn, int psize, int apsize, int ssize)
}
}
-static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize)
+static inline void __tlbiel(unsigned long vpn, int bpsize, int apsize, int ssize)
{
unsigned long va;
unsigned int penc;
@@ -98,18 +112,32 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize)
*/
va &= ~(0xffffULL << 48);
- switch (psize) {
+ switch (bpsize) {
case MMU_PAGE_4K:
+ /* clear out bits after(52) [0....52.....63] */
+ va &= ~((1ul << (64 - 52)) - 1);
va |= ssize << 8;
+ va |= mmu_psize_defs[apsize].sllp << 6;
asm volatile(".long 0x7c000224 | (%0 << 11) | (0 << 21)"
: : "r"(va) : "memory");
break;
default:
/* We need 14 to 14 + i bits of va */
- penc = mmu_psize_defs[psize].penc[apsize];
- va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
+ penc = mmu_psize_defs[bpsize].penc[apsize];
+ /* clear out bits after(44) [0....44.....63] */
+ va &= ~((1ul << (64 - 44)) - 1);
va |= penc << 12;
va |= ssize << 8;
+ /* Add AVAL part */
+ if (bpsize != apsize) {
+ /*
+ * MPSS, 64K base page size and 16MB large page size.
+ * We don't need all the bits, but this seems to work.
+ * vpn covers up to 65 bits of va (0...65), and we need
+ * bits 56..62 of va.
+ */
+ va |= ((vpn >> 2) & 0xfe);
+ }
va |= 1; /* L */
asm volatile(".long 0x7c000224 | (%0 << 11) | (1 << 21)"
: : "r"(va) : "memory");
@@ -118,22 +146,22 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize)
}
-static inline void tlbie(unsigned long vpn, int psize, int apsize,
+static inline void tlbie(unsigned long vpn, int bpsize, int apsize,
int ssize, int local)
{
unsigned int use_local = local && mmu_has_feature(MMU_FTR_TLBIEL);
int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
if (use_local)
- use_local = mmu_psize_defs[psize].tlbiel;
+ use_local = mmu_psize_defs[bpsize].tlbiel;
if (lock_tlbie && !use_local)
raw_spin_lock(&native_tlbie_lock);
asm volatile("ptesync": : :"memory");
if (use_local) {
- __tlbiel(vpn, psize, apsize, ssize);
+ __tlbiel(vpn, bpsize, apsize, ssize);
asm volatile("ptesync": : :"memory");
} else {
- __tlbie(vpn, psize, apsize, ssize);
+ __tlbie(vpn, bpsize, apsize, ssize);
asm volatile("eieio; tlbsync; ptesync": : :"memory");
}
if (lock_tlbie && !use_local)
--
1.7.10
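
For readers following the address arithmetic in __tlbie()/__tlbiel() above,
here is a minimal userspace sketch of the two masking expressions and the
AVAL extraction. The helper names (clear_low_bits, aval_bits) are
hypothetical and are not kernel APIs; only the expressions themselves are
taken from the patch, and the sketch makes no claim about the hardware
encoding beyond what the patch comments state.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: keep the top 'keep' bits of va and clear the rest,
 * mirroring "va &= ~((1ul << (64 - keep)) - 1)" from the patch.
 * 'keep' is assumed to be in the range 1..63.
 */
uint64_t clear_low_bits(uint64_t va, unsigned int keep)
{
	return va & ~((UINT64_C(1) << (64 - keep)) - 1);
}

/*
 * Hypothetical helper: the bits ORed into va when bpsize != apsize,
 * mirroring "(vpn >> 2) & 0xfe" from the patch. Bit 0 is left clear
 * because the patch sets it separately ("va |= 1; L").
 */
uint64_t aval_bits(uint64_t vpn)
{
	return (vpn >> 2) & 0xfe;
}
```

With keep = 52 the low 12 bits are cleared (the 4K case); with keep = 44
the low 20 bits are cleared, which leaves room for penc at bit 12 and the
AVAL bits below it.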
* [RFC PATCH 04/17] mm/THP: Add pmd args to pgtable deposit and withdraw APIs
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This will be used later by the powerpc THP support. On powerpc we want to use
the pgtable for storing the hash index values. So instead of adding the
deposited pgtables to the mm_context list, we would like to store them in the
second half of the pmd.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/s390/include/asm/pgtable.h | 5 +++--
arch/s390/mm/pgtable.c | 5 +++--
arch/sparc/include/asm/pgtable_64.h | 5 +++--
arch/sparc/mm/tlb.c | 5 +++--
include/asm-generic/pgtable.h | 5 +++--
mm/huge_memory.c | 12 ++++++------
mm/pgtable-generic.c | 5 +++--
7 files changed, 24 insertions(+), 18 deletions(-)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index c1d7930..d57436c 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1232,10 +1232,11 @@ static inline void __pmd_idte(unsigned long address, pmd_t *pmdp)
#define SEGMENT_RW __pgprot(_HPAGE_TYPE_RW)
#define __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+ pgtable_t pgtable);
#define __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
static inline int pmd_trans_splitting(pmd_t pmd)
{
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ae44d2a..9ab3224 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -920,7 +920,8 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
}
}
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+ pgtable_t pgtable)
{
struct list_head *lh = (struct list_head *) pgtable;
@@ -934,7 +935,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
mm->pmd_huge_pte = pgtable;
}
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
{
struct list_head *lh;
pgtable_t pgtable;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 7870be0..4fa7133 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -855,10 +855,11 @@ extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd);
#define __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+ pgtable_t pgtable);
#define __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif
/* Encode and de-code a swap entry */
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 3e8fec3..79922f4 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -150,7 +150,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
}
}
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+ pgtable_t pgtable)
{
struct list_head *lh = (struct list_head *) pgtable;
@@ -164,7 +165,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
mm->pmd_huge_pte = pgtable;
}
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
{
struct list_head *lh;
pgtable_t pgtable;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 5cf680a..6f87e9e 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -163,11 +163,12 @@ extern void pmdp_splitting_flush(struct vm_area_struct *vma,
#endif
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+ pgtable_t pgtable);
#endif
#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif
#ifndef __HAVE_ARCH_PMDP_INVALIDATE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6001ee6..5beb2e2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -739,7 +739,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
page_add_new_anon_rmap(page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
- pgtable_trans_huge_deposit(mm, pgtable);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm->nr_ptes++;
spin_unlock(&mm->page_table_lock);
@@ -926,7 +926,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
- pgtable_trans_huge_deposit(dst_mm, pgtable);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
dst_mm->nr_ptes++;
ret = 0;
@@ -1091,10 +1091,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
goto out_free_pages;
VM_BUG_ON(!PageHead(page));
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmdp_clear_flush(vma, haddr, pmd);
/* leave pmd empty until pte is filled */
- pgtable = pgtable_trans_huge_withdraw(mm);
pmd_populate(mm, &_pmd, pgtable);
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1373,7 +1373,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
- pgtable = pgtable_trans_huge_withdraw(tlb->mm);
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
if (is_huge_zero_pmd(orig_pmd)) {
@@ -1705,7 +1705,7 @@ static int __split_huge_page_map(struct page *page,
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
haddr = address;
@@ -2393,7 +2393,7 @@ static void collapse_huge_page(struct mm_struct *mm,
page_add_new_anon_rmap(new_page, vma, address);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
- pgtable_trans_huge_deposit(mm, pgtable);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
spin_unlock(&mm->page_table_lock);
*hpage = NULL;
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0c8323f..e1a6e4f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -124,7 +124,8 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+ pgtable_t pgtable)
{
assert_spin_locked(&mm->page_table_lock);
@@ -141,7 +142,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* no "address" argument so destroys page coloring of some arch */
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
{
pgtable_t pgtable;
--
1.7.10
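
To illustrate why the pmd pointer is useful to an architecture, here is a
toy userspace model of "store them in the second half of pmd". Everything
in it is an assumption made for illustration: the names (toy_pmd_page,
toy_deposit, toy_withdraw), the table size, and the exact layout are
hypothetical and are not the powerpc implementation. The generic
implementation changed by this patch simply ignores the new argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTRS_PER_PMD 4	/* hypothetical; tiny for illustration */

/*
 * Toy PMD page: slots [0, N) are the pmd entries, slots [N, 2N) hold the
 * pgtable deposited for the pmd entry at the same index.
 */
struct toy_pmd_page {
	uintptr_t slot[2 * TOY_PTRS_PER_PMD];
};

/* Deposit: park the pgtable in the second-half slot paired with pmdp. */
void toy_deposit(uintptr_t *pmdp, void *pgtable)
{
	uintptr_t *pgtable_slot = pmdp + TOY_PTRS_PER_PMD;

	assert(*pgtable_slot == 0);	/* one deposit per pmd entry */
	*pgtable_slot = (uintptr_t)pgtable;
}

/* Withdraw: return the pgtable paired with pmdp and empty the slot. */
void *toy_withdraw(uintptr_t *pmdp)
{
	uintptr_t *pgtable_slot = pmdp + TOY_PTRS_PER_PMD;
	void *pgtable = (void *)*pgtable_slot;

	*pgtable_slot = 0;
	return pgtable;
}
```

Unlike a per-mm list, this layout returns exactly the table that was
deposited for a given pmd entry, which is why both calls need pmdp.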
* [RFC PATCH 14/17] powerpc: support for zerout withdraw.
From: Aneesh Kumar K.V @ 2013-02-18 10:28 UTC (permalink / raw)
To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1361183295-6958-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This needs changes to other archs and needs to be fixed further.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/pgtable.h | 3 ++-
arch/powerpc/mm/pgtable.c | 11 ++++++++---
mm/huge_memory.c | 18 ++++++++++++------
3 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 4e49c34..3dfbec9 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -262,7 +262,8 @@ extern void pmdp_splitting_flush(struct vm_area_struct *vma,
extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable);
#define __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
+ pmd_t *pmdp, int tozero);
#define __HAVE_ARCH_PMDP_INVALIDATE
extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 841271f..fa5e108 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -355,7 +355,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
/* FIXME!! May be all this should be in pgtable_64.c ? */
#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp, int tozero)
{
pgtable_t pgtable;
unsigned long *pgtable_slot;
@@ -368,8 +368,13 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
* Make sure we are invalidating all the entries. So that
* we fault and create new entries later
*/
- /* zero out the table before returning */
- memset(pgtable, 0, PTE_FRAG_SIZE);
+ /* FIXME!! this is not correct. zero out the table before returning
+ * because we are using this for other things.
+ * zap_huge_pmd
+ */
+ if (tozero)
+ /* Not needed, because we deposit a zeroed table? */
+ memset(pgtable, 0, PTE_FRAG_SIZE);
return pgtable;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5beb2e2..3777a5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -781,7 +781,7 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
set_pmd_at(mm, haddr, pmd, entry);
- pgtable_trans_huge_deposit(mm, pgtable);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
mm->nr_ptes++;
return true;
}
@@ -996,7 +996,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
pmdp_clear_flush(vma, haddr, pmd);
/* leave pmd empty until pte is filled */
- pgtable = pgtable_trans_huge_withdraw(mm);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd, 1);
pmd_populate(mm, &_pmd, pgtable);
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1091,7 +1091,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
goto out_free_pages;
VM_BUG_ON(!PageHead(page));
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd, 1);
pmdp_clear_flush(vma, haddr, pmd);
/* leave pmd empty until pte is filled */
@@ -1373,7 +1373,13 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
- pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+ /*
+ * Withdraw the pgtable without zeroing it, because
+ * the following pmdp_get_and_clear() will look at
+ * the pgtable contents on some architectures,
+ * such as ppc64.
+ */
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd, 0);
orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
if (is_huge_zero_pmd(orig_pmd)) {
@@ -1705,7 +1711,7 @@ static int __split_huge_page_map(struct page *page,
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd, 1);
pmd_populate(mm, &_pmd, pgtable);
haddr = address;
@@ -2699,7 +2705,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pmdp_clear_flush(vma, haddr, pmd);
/* leave pmd empty until pte is filled */
- pgtable = pgtable_trans_huge_withdraw(mm);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd, 1);
pmd_populate(mm, &_pmd, pgtable);
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
--
1.7.10
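
The effect of the new tozero argument can be sketched in userspace as
follows. The names (toy_pgtable, toy_withdraw_tozero) and TOY_FRAG_SIZE
are hypothetical stand-ins; only the "memset when tozero is set" behaviour
is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FRAG_SIZE 16	/* hypothetical stand-in for PTE_FRAG_SIZE */

struct toy_pgtable {
	unsigned char bytes[TOY_FRAG_SIZE];
};

/*
 * Hand back the deposited table and empty the slot. The contents are
 * cleared only when the caller asks for it, so a caller that still needs
 * to read the old contents passes tozero == 0.
 */
struct toy_pgtable *toy_withdraw_tozero(struct toy_pgtable **slot, int tozero)
{
	struct toy_pgtable *pgtable = *slot;

	*slot = NULL;
	if (tozero)
		memset(pgtable, 0, sizeof(*pgtable));
	return pgtable;
}
```

This mirrors the split in the patch: the split/fallback paths pass 1 and
get a clean table to populate, while zap_huge_pmd() passes 0 because the
table contents are still consulted after the withdraw.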