LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 42/52] octeon, watchdog: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:40 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-watchdog, linux-pm,
	linux-kernel, linuxppc-dev, Wim Van Sebroeck, Srivatsa S. Bhat,
	tj, paulmck, Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the watchdog code in octeon by using this latter form of callback
registration.

Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-watchdog@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 drivers/watchdog/octeon-wdt-main.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/watchdog/octeon-wdt-main.c b/drivers/watchdog/octeon-wdt-main.c
index 4612088..4baf2d7 100644
--- a/drivers/watchdog/octeon-wdt-main.c
+++ b/drivers/watchdog/octeon-wdt-main.c
@@ -708,10 +708,13 @@ static int __init octeon_wdt_init(void)
 
 	cpumask_clear(&irq_enabled_cpus);
 
+	cpu_notifier_register_begin();
 	for_each_online_cpu(cpu)
 		octeon_wdt_setup_interrupt(cpu);
 
-	register_hotcpu_notifier(&octeon_wdt_cpu_notifier);
+	__register_hotcpu_notifier(&octeon_wdt_cpu_notifier);
+	cpu_notifier_register_done();
+
 out:
 	return ret;
 }
@@ -725,7 +728,8 @@ static void __exit octeon_wdt_cleanup(void)
 
 	misc_deregister(&octeon_wdt_miscdev);
 
-	unregister_hotcpu_notifier(&octeon_wdt_cpu_notifier);
+	cpu_notifier_register_begin();
+	__unregister_hotcpu_notifier(&octeon_wdt_cpu_notifier);
 
 	for_each_online_cpu(cpu) {
 		int core = cpu2core(cpu);
@@ -734,6 +738,9 @@ static void __exit octeon_wdt_cleanup(void)
 		/* Free the interrupt handler */
 		free_irq(OCTEON_IRQ_WDOG0 + core, octeon_wdt_poke_irq);
 	}
+
+	cpu_notifier_register_done();
+
 	/*
 	 * Disable the boot-bus memory, the code it points to is soon
 	 * to go missing.

^ permalink raw reply related

* [PATCH v3 44/52] hwmon, coretemp: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:41 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-pm, linux-kernel,
	Fenghua Yu, lm-sensors, linuxppc-dev, Guenter Roeck,
	Srivatsa S. Bhat, tj, paulmck, Ingo Molnar, Jean Delvare
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the hwmon coretemp code by using this latter form of callback
registration.

Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: lm-sensors@lm-sensors.org
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 drivers/hwmon/coretemp.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index bbb0b0d..746a6ad 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -849,20 +849,20 @@ static int __init coretemp_init(void)
 	if (err)
 		goto exit;
 
-	get_online_cpus();
+	cpu_notifier_register_begin();
 	for_each_online_cpu(i)
 		get_core_online(i);
 
 #ifndef CONFIG_HOTPLUG_CPU
 	if (list_empty(&pdev_list)) {
-		put_online_cpus();
+		cpu_notifier_register_done();
 		err = -ENODEV;
 		goto exit_driver_unreg;
 	}
 #endif
 
-	register_hotcpu_notifier(&coretemp_cpu_notifier);
-	put_online_cpus();
+	__register_hotcpu_notifier(&coretemp_cpu_notifier);
+	cpu_notifier_register_done();
 	return 0;
 
 #ifndef CONFIG_HOTPLUG_CPU
@@ -877,8 +877,8 @@ static void __exit coretemp_exit(void)
 {
 	struct pdev_entry *p, *n;
 
-	get_online_cpus();
-	unregister_hotcpu_notifier(&coretemp_cpu_notifier);
+	cpu_notifier_register_begin();
+	__unregister_hotcpu_notifier(&coretemp_cpu_notifier);
 	mutex_lock(&pdev_list_mutex);
 	list_for_each_entry_safe(p, n, &pdev_list, list) {
 		platform_device_unregister(p->pdev);
@@ -886,7 +886,7 @@ static void __exit coretemp_exit(void)
 		kfree(p);
 	}
 	mutex_unlock(&pdev_list_mutex);
-	put_online_cpus();
+	cpu_notifier_register_done();
 	platform_driver_unregister(&coretemp_driver);
 }
 

^ permalink raw reply related

* [PATCH v3 45/52] hwmon, via-cputemp: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:41 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-pm, linux-kernel,
	lm-sensors, linuxppc-dev, Guenter Roeck, Srivatsa S. Bhat, tj,
	paulmck, Ingo Molnar, Jean Delvare
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the hwmon via-cputemp code by using this latter form of callback
registration.

Cc: Jean Delvare <jdelvare@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: lm-sensors@lm-sensors.org
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 drivers/hwmon/via-cputemp.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/hwmon/via-cputemp.c b/drivers/hwmon/via-cputemp.c
index 38944e9..8df43c5 100644
--- a/drivers/hwmon/via-cputemp.c
+++ b/drivers/hwmon/via-cputemp.c
@@ -319,7 +319,7 @@ static int __init via_cputemp_init(void)
 	if (err)
 		goto exit;
 
-	get_online_cpus();
+	cpu_notifier_register_begin();
 	for_each_online_cpu(i) {
 		struct cpuinfo_x86 *c = &cpu_data(i);
 
@@ -339,14 +339,14 @@ static int __init via_cputemp_init(void)
 
 #ifndef CONFIG_HOTPLUG_CPU
 	if (list_empty(&pdev_list)) {
-		put_online_cpus();
+		cpu_notifier_register_done();
 		err = -ENODEV;
 		goto exit_driver_unreg;
 	}
 #endif
 
-	register_hotcpu_notifier(&via_cputemp_cpu_notifier);
-	put_online_cpus();
+	__register_hotcpu_notifier(&via_cputemp_cpu_notifier);
+	cpu_notifier_register_done();
 	return 0;
 
 #ifndef CONFIG_HOTPLUG_CPU
@@ -361,8 +361,8 @@ static void __exit via_cputemp_exit(void)
 {
 	struct pdev_entry *p, *n;
 
-	get_online_cpus();
-	unregister_hotcpu_notifier(&via_cputemp_cpu_notifier);
+	cpu_notifier_register_begin();
+	__unregister_hotcpu_notifier(&via_cputemp_cpu_notifier);
 	mutex_lock(&pdev_list_mutex);
 	list_for_each_entry_safe(p, n, &pdev_list, list) {
 		platform_device_unregister(p->pdev);
@@ -370,7 +370,7 @@ static void __exit via_cputemp_exit(void)
 		kfree(p);
 	}
 	mutex_unlock(&pdev_list_mutex);
-	put_online_cpus();
+	cpu_notifier_register_done();
 	platform_driver_unregister(&via_cputemp_driver);
 }
 

^ permalink raw reply related

* [PATCH v3 47/52] trace, ring-buffer: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:41 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-pm, Frederic Weisbecker,
	linux-kernel, Steven Rostedt, linuxppc-dev, Srivatsa S. Bhat, tj,
	paulmck, Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the tracing ring-buffer code by using this latter form of callback
registration.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/trace/ring_buffer.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fc4da2d..c634868 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1301,7 +1301,7 @@ struct ring_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
 	 * In that off case, we need to allocate for all possible cpus.
 	 */
 #ifdef CONFIG_HOTPLUG_CPU
-	get_online_cpus();
+	cpu_notifier_register_begin();
 	cpumask_copy(buffer->cpumask, cpu_online_mask);
 #else
 	cpumask_copy(buffer->cpumask, cpu_possible_mask);
@@ -1324,10 +1324,10 @@ struct ring_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
 #ifdef CONFIG_HOTPLUG_CPU
 	buffer->cpu_notify.notifier_call = rb_cpu_notify;
 	buffer->cpu_notify.priority = 0;
-	register_cpu_notifier(&buffer->cpu_notify);
+	__register_cpu_notifier(&buffer->cpu_notify);
+	cpu_notifier_register_done();
 #endif
 
-	put_online_cpus();
 	mutex_init(&buffer->mutex);
 
 	return buffer;
@@ -1341,7 +1341,9 @@ struct ring_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
 
  fail_free_cpumask:
 	free_cpumask_var(buffer->cpumask);
-	put_online_cpus();
+#ifdef CONFIG_HOTPLUG_CPU
+	cpu_notifier_register_done();
+#endif
 
  fail_free_buffer:
 	kfree(buffer);
@@ -1358,16 +1360,17 @@ ring_buffer_free(struct ring_buffer *buffer)
 {
 	int cpu;
 
-	get_online_cpus();
-
 #ifdef CONFIG_HOTPLUG_CPU
-	unregister_cpu_notifier(&buffer->cpu_notify);
+	cpu_notifier_register_begin();
+	__unregister_cpu_notifier(&buffer->cpu_notify);
 #endif
 
 	for_each_buffer_cpu(buffer, cpu)
 		rb_free_cpu_buffer(buffer->buffers[cpu]);
 
-	put_online_cpus();
+#ifdef CONFIG_HOTPLUG_CPU
+	cpu_notifier_register_done();
+#endif
 
 	kfree(buffer->buffers);
 	free_cpumask_var(buffer->cpumask);

^ permalink raw reply related

* [PATCH v3 48/52] profile: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:42 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, Mauro Carvalho Chehab, linux-pm,
	linux-kernel, linuxppc-dev, Al Viro, Srivatsa S. Bhat, tj,
	paulmck, Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the profile code by using this latter form of callback registration.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/profile.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/profile.c b/kernel/profile.c
index 6631e1e..4c0cb95 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -591,18 +591,28 @@ out_cleanup:
 int __ref create_proc_profile(void) /* false positive from hotcpu_notifier */
 {
 	struct proc_dir_entry *entry;
+	int err = 0;
 
 	if (!prof_on)
 		return 0;
-	if (create_hash_tables())
-		return -ENOMEM;
+
+	cpu_notifier_register_begin();
+
+	if (create_hash_tables()) {
+		err = -ENOMEM;
+		goto out;
+	}
+
 	entry = proc_create("profile", S_IWUSR | S_IRUGO,
 			    NULL, &proc_profile_operations);
 	if (!entry)
-		return 0;
+		goto out;
 	proc_set_size(entry, (1 + prof_len) * sizeof(atomic_t));
-	hotcpu_notifier(profile_cpu_callback, 0);
-	return 0;
+	__hotcpu_notifier(profile_cpu_callback, 0);
+
+out:
+	cpu_notifier_register_done();
+	return err;
 }
 module_init(create_proc_profile);
 #endif /* CONFIG_PROC_FS */

^ permalink raw reply related

* [PATCH v3 46/52] xen, balloon: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:41 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-pm, Konrad Rzeszutek Wilk,
	linux-kernel, linuxppc-dev, David Vrabel, Srivatsa S. Bhat, tj,
	xen-devel, paulmck, Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

The xen balloon driver doesn't take get/put_online_cpus() around this code,
but that is also buggy, since it can miss CPU hotplug events in between the
initialization and callback registration:

	for_each_online_cpu(cpu)
		init_cpu(cpu);
		   ^
		   |  Race window; Can miss CPU hotplug events here.
		   v
	register_cpu_notifier(&foobar_cpu_notifier);

Interestingly, the balloon code in xen can simply be reorganized as shown
below, to have a race-free method to register hotplug callbacks, without even
taking get/put_online_cpus(). This is because the initialization performed for
already online CPUs is exactly the same as that performed for CPUs that come
online later. Moreover, the code has checks in place to avoid double
initialization.

	register_cpu_notifier(&foobar_cpu_notifier);

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	put_online_cpus();

A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.

So reorganize the balloon code in xen this way to fix the issues with CPU
hotplug callback registration.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: xen-devel@lists.xenproject.org
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 drivers/xen/balloon.c |   36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 37d06ea..dd79549 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -592,19 +592,29 @@ static void __init balloon_add_region(unsigned long start_pfn,
 	}
 }
 
+static int alloc_balloon_scratch_page(int cpu)
+{
+	if (per_cpu(balloon_scratch_page, cpu) != NULL)
+		return 0;
+
+	per_cpu(balloon_scratch_page, cpu) = alloc_page(GFP_KERNEL);
+	if (per_cpu(balloon_scratch_page, cpu) == NULL) {
+		pr_warn("Failed to allocate balloon_scratch_page for cpu %d\n", cpu);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+
 static int balloon_cpu_notify(struct notifier_block *self,
 				    unsigned long action, void *hcpu)
 {
 	int cpu = (long)hcpu;
 	switch (action) {
 	case CPU_UP_PREPARE:
-		if (per_cpu(balloon_scratch_page, cpu) != NULL)
-			break;
-		per_cpu(balloon_scratch_page, cpu) = alloc_page(GFP_KERNEL);
-		if (per_cpu(balloon_scratch_page, cpu) == NULL) {
-			pr_warn("Failed to allocate balloon_scratch_page for cpu %d\n", cpu);
+		if (alloc_balloon_scratch_page(cpu))
 			return NOTIFY_BAD;
-		}
 		break;
 	default:
 		break;
@@ -624,15 +634,17 @@ static int __init balloon_init(void)
 		return -ENODEV;
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-		for_each_online_cpu(cpu)
-		{
-			per_cpu(balloon_scratch_page, cpu) = alloc_page(GFP_KERNEL);
-			if (per_cpu(balloon_scratch_page, cpu) == NULL) {
-				pr_warn("Failed to allocate balloon_scratch_page for cpu %d\n", cpu);
+		register_cpu_notifier(&balloon_cpu_notifier);
+
+		get_online_cpus();
+		for_each_online_cpu(cpu) {
+			if (alloc_balloon_scratch_page(cpu)) {
+				put_online_cpus();
+				unregister_cpu_notifier(&balloon_cpu_notifier);
 				return -ENOMEM;
 			}
 		}
-		register_cpu_notifier(&balloon_cpu_notifier);
+		put_online_cpus();
 	}
 
 	pr_info("Initialising balloon driver\n");

^ permalink raw reply related

* [PATCH v3 49/52] mm, vmstat: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:42 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, Christoph Lameter, walken, linux, linux-pm,
	Rik van Riel, linux-kernel, linux-mm, linuxppc-dev, Dave Hansen,
	Srivatsa S. Bhat, Johannes Weiner, tj, Toshi Kani, Andrew Morton,
	paulmck, Cody P Schafer, Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the vmstat code in the MM subsystem by using this latter form of callback
registration.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/vmstat.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index def5dd2..58c6f3d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1292,14 +1292,14 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_SMP
 	int cpu;
 
-	register_cpu_notifier(&vmstat_notifier);
+	cpu_notifier_register_begin();
+	__register_cpu_notifier(&vmstat_notifier);
 
-	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		start_cpu_timer(cpu);
 		node_set_state(cpu_to_node(cpu), N_CPU);
 	}
-	put_online_cpus();
+	cpu_notifier_register_done();
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);

^ permalink raw reply related

* [PATCH v3 50/52] mm, zswap: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:42 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-pm, linux-kernel, linux-mm,
	linuxppc-dev, Srivatsa S. Bhat, tj, paulmck, Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the zswap code by using this latter form of callback registration.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/zswap.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index e55bab9..d7337fb 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -387,18 +387,18 @@ static int zswap_cpu_init(void)
 {
 	unsigned long cpu;
 
-	get_online_cpus();
+	cpu_notifier_register_begin();
 	for_each_online_cpu(cpu)
 		if (__zswap_cpu_notifier(CPU_UP_PREPARE, cpu) != NOTIFY_OK)
 			goto cleanup;
-	register_cpu_notifier(&zswap_cpu_notifier_block);
-	put_online_cpus();
+	__register_cpu_notifier(&zswap_cpu_notifier_block);
+	cpu_notifier_register_done();
 	return 0;
 
 cleanup:
 	for_each_online_cpu(cpu)
 		__zswap_cpu_notifier(CPU_UP_CANCELED, cpu);
-	put_online_cpus();
+	cpu_notifier_register_done();
 	return -ENOMEM;
 }
 

^ permalink raw reply related

* [PATCH v3 51/52] net/core/flow.c: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:42 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-pm, Li RongQing, netdev,
	linux-kernel, Chris Metcalf, David S. Miller, linuxppc-dev,
	Srivatsa S. Bhat, tj, Sasha Levin, Andrew Morton, paulmck,
	Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the code in net/core/flow.c by using this latter form of callback
registration.

Cc: Li RongQing <roy.qing.li@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: netdev@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 net/core/flow.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/core/flow.c b/net/core/flow.c
index dfa602c..9a2151f 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -456,6 +456,8 @@ static int __init flow_cache_init(struct flow_cache *fc)
 	if (!fc->percpu)
 		return -ENOMEM;
 
+	cpu_notifier_register_begin();
+
 	for_each_online_cpu(i) {
 		if (flow_cache_cpu_prepare(fc, i))
 			goto err;
@@ -463,7 +465,9 @@ static int __init flow_cache_init(struct flow_cache *fc)
 	fc->hotcpu_notifier = (struct notifier_block){
 		.notifier_call = flow_cache_cpu,
 	};
-	register_hotcpu_notifier(&fc->hotcpu_notifier);
+	__register_hotcpu_notifier(&fc->hotcpu_notifier);
+
+	cpu_notifier_register_done();
 
 	setup_timer(&fc->rnd_timer, flow_cache_new_hashrnd,
 		    (unsigned long) fc);
@@ -479,6 +483,8 @@ err:
 		fcp->hash_table = NULL;
 	}
 
+	cpu_notifier_register_done();
+
 	free_percpu(fc->percpu);
 	fc->percpu = NULL;
 

^ permalink raw reply related

* [PATCH v3 52/52] net/iucv/iucv.c: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-10 20:42 UTC (permalink / raw)
  To: paulus, oleg, mingo, rjw, rusty, peterz, tglx, akpm
  Cc: linux-arch, ego, walken, linux, linux-pm, linux-s390,
	linux-kernel, Ursula Braun, linuxppc-dev, David S. Miller,
	Srivatsa S. Bhat, netdev, tj, paulmck, Ingo Molnar
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();


Fix the code in net/iucv/iucv.c by using this latter form of callback
registration. Also, provide helper functions to perform the common memory
allocations and frees, to condense repetitive code.

Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 net/iucv/iucv.c |  121 ++++++++++++++++++++++++++-----------------------------
 1 file changed, 57 insertions(+), 64 deletions(-)

diff --git a/net/iucv/iucv.c b/net/iucv/iucv.c
index cd5b8ec..79a0ce9 100644
--- a/net/iucv/iucv.c
+++ b/net/iucv/iucv.c
@@ -621,6 +621,42 @@ static void iucv_disable(void)
 	put_online_cpus();
 }
 
+static void free_iucv_data(int cpu)
+{
+	kfree(iucv_param_irq[cpu]);
+	iucv_param_irq[cpu] = NULL;
+	kfree(iucv_param[cpu]);
+	iucv_param[cpu] = NULL;
+	kfree(iucv_irq_data[cpu]);
+	iucv_irq_data[cpu] = NULL;
+}
+
+static int alloc_iucv_data(int cpu)
+{
+	/* Note: GFP_DMA used to get memory below 2G */
+	iucv_irq_data[cpu] = kmalloc_node(sizeof(struct iucv_irq_data),
+			     GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
+	if (!iucv_irq_data[cpu])
+		goto out_free;
+
+	/* Allocate parameter blocks. */
+	iucv_param[cpu] = kmalloc_node(sizeof(union iucv_param),
+			  GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
+	if (!iucv_param[cpu])
+		goto out_free;
+
+	iucv_param_irq[cpu] = kmalloc_node(sizeof(union iucv_param),
+			  GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
+	if (!iucv_param_irq[cpu])
+		goto out_free;
+
+	return 0;
+
+out_free:
+	free_iucv_data(cpu);
+	return -ENOMEM;
+}
+
 static int iucv_cpu_notify(struct notifier_block *self,
 				     unsigned long action, void *hcpu)
 {
@@ -630,38 +666,14 @@ static int iucv_cpu_notify(struct notifier_block *self,
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		iucv_irq_data[cpu] = kmalloc_node(sizeof(struct iucv_irq_data),
-					GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
-		if (!iucv_irq_data[cpu])
-			return notifier_from_errno(-ENOMEM);
-
-		iucv_param[cpu] = kmalloc_node(sizeof(union iucv_param),
-				     GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
-		if (!iucv_param[cpu]) {
-			kfree(iucv_irq_data[cpu]);
-			iucv_irq_data[cpu] = NULL;
+		if (alloc_iucv_data(cpu))
 			return notifier_from_errno(-ENOMEM);
-		}
-		iucv_param_irq[cpu] = kmalloc_node(sizeof(union iucv_param),
-					GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
-		if (!iucv_param_irq[cpu]) {
-			kfree(iucv_param[cpu]);
-			iucv_param[cpu] = NULL;
-			kfree(iucv_irq_data[cpu]);
-			iucv_irq_data[cpu] = NULL;
-			return notifier_from_errno(-ENOMEM);
-		}
 		break;
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
-		kfree(iucv_param_irq[cpu]);
-		iucv_param_irq[cpu] = NULL;
-		kfree(iucv_param[cpu]);
-		iucv_param[cpu] = NULL;
-		kfree(iucv_irq_data[cpu]);
-		iucv_irq_data[cpu] = NULL;
+		free_iucv_data(cpu);
 		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
@@ -2025,33 +2037,20 @@ static int __init iucv_init(void)
 		goto out_int;
 	}
 
-	for_each_online_cpu(cpu) {
-		/* Note: GFP_DMA used to get memory below 2G */
-		iucv_irq_data[cpu] = kmalloc_node(sizeof(struct iucv_irq_data),
-				     GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
-		if (!iucv_irq_data[cpu]) {
-			rc = -ENOMEM;
-			goto out_free;
-		}
+	cpu_notifier_register_begin();
 
-		/* Allocate parameter blocks. */
-		iucv_param[cpu] = kmalloc_node(sizeof(union iucv_param),
-				  GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
-		if (!iucv_param[cpu]) {
-			rc = -ENOMEM;
-			goto out_free;
-		}
-		iucv_param_irq[cpu] = kmalloc_node(sizeof(union iucv_param),
-				  GFP_KERNEL|GFP_DMA, cpu_to_node(cpu));
-		if (!iucv_param_irq[cpu]) {
+	for_each_online_cpu(cpu) {
+		if (alloc_iucv_data(cpu)) {
 			rc = -ENOMEM;
 			goto out_free;
 		}
-
 	}
-	rc = register_hotcpu_notifier(&iucv_cpu_notifier);
+	rc = __register_hotcpu_notifier(&iucv_cpu_notifier);
 	if (rc)
 		goto out_free;
+
+	cpu_notifier_register_done();
+
 	rc = register_reboot_notifier(&iucv_reboot_notifier);
 	if (rc)
 		goto out_cpu;
@@ -2069,16 +2068,14 @@ static int __init iucv_init(void)
 out_reboot:
 	unregister_reboot_notifier(&iucv_reboot_notifier);
 out_cpu:
-	unregister_hotcpu_notifier(&iucv_cpu_notifier);
+	cpu_notifier_register_begin();
+	__unregister_hotcpu_notifier(&iucv_cpu_notifier);
 out_free:
-	for_each_possible_cpu(cpu) {
-		kfree(iucv_param_irq[cpu]);
-		iucv_param_irq[cpu] = NULL;
-		kfree(iucv_param[cpu]);
-		iucv_param[cpu] = NULL;
-		kfree(iucv_irq_data[cpu]);
-		iucv_irq_data[cpu] = NULL;
-	}
+	for_each_possible_cpu(cpu)
+		free_iucv_data(cpu);
+
+	cpu_notifier_register_done();
+
 	root_device_unregister(iucv_root);
 out_int:
 	unregister_external_interrupt(0x4000, iucv_external_interrupt);
@@ -2105,15 +2102,11 @@ static void __exit iucv_exit(void)
 		kfree(p);
 	spin_unlock_irq(&iucv_queue_lock);
 	unregister_reboot_notifier(&iucv_reboot_notifier);
-	unregister_hotcpu_notifier(&iucv_cpu_notifier);
-	for_each_possible_cpu(cpu) {
-		kfree(iucv_param_irq[cpu]);
-		iucv_param_irq[cpu] = NULL;
-		kfree(iucv_param[cpu]);
-		iucv_param[cpu] = NULL;
-		kfree(iucv_irq_data[cpu]);
-		iucv_irq_data[cpu] = NULL;
-	}
+	cpu_notifier_register_begin();
+	__unregister_hotcpu_notifier(&iucv_cpu_notifier);
+	for_each_possible_cpu(cpu)
+		free_iucv_data(cpu);
+	cpu_notifier_register_done();
 	root_device_unregister(iucv_root);
 	bus_unregister(&iucv_bus);
 	unregister_external_interrupt(0x4000, iucv_external_interrupt);

^ permalink raw reply related

* Re: [PATCH v3 00/52] CPU hotplug: Fix issues with callback registration
From: Rafael J. Wysocki @ 2014-03-11  0:31 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-arch, ego, walken, linux, akpm, linux-pm, peterz, rusty,
	oleg, linux-kernel, linuxppc-dev, paulus, tj, tglx, paulmck,
	mingo
In-Reply-To: <20140310203312.10746.310.stgit@srivatsabhat.in.ibm.com>

On Tuesday, March 11, 2014 02:03:52 AM Srivatsa S. Bhat wrote:
> Hi,

Hi,

> Many subsystems and drivers have the need to register CPU hotplug callbacks
> from their init routines and also perform initialization for the CPUs that are
> already online. But unfortunately there is no race-free way to achieve this
> today.
> 
> For example, consider this piece of code:
> 
> 	get_online_cpus();
> 
> 	for_each_online_cpu(cpu)
> 		init_cpu(cpu);
> 
> 	register_cpu_notifier(&foobar_cpu_notifier);
> 
> 	put_online_cpus();
> 
> This is not safe because there is a possibility of an ABBA deadlock involving
> the cpu_add_remove_lock and the cpu_hotplug.lock.
> 
>           CPU 0                                         CPU 1
>           -----                                         -----
> 
>    Acquire cpu_hotplug.lock
>    [via get_online_cpus()]
> 
>                                               CPU online/offline operation
>                                               takes cpu_add_remove_lock
>                                               [via cpu_maps_update_begin()]
> 
>    Try to acquire
>    cpu_add_remove_lock
>    [via register_cpu_notifier()]
> 
>                                               CPU online/offline operation
>                                               tries to acquire cpu_hotplug.lock
>                                               [via cpu_hotplug_begin()]
> 
>                             *** DEADLOCK! ***
> 
> 
> Other combinations of callback registration also don't work correctly.
> Examples:
> 
> 	register_cpu_notifier(&foobar_cpu_notifier);
> 
> 	get_online_cpus();
> 
> 	for_each_online_cpu(cpu)
> 		init_cpu(cpu);
> 
> 	put_online_cpus();
> 
> This can lead to double initialization if a hotplug operation occurs after
> registering the notifier and before invoking get_online_cpus().
> 
> On the other hand, the following piece of code can miss hotplug events
> altogether:
> 
> 	get_online_cpus();
> 
> 	for_each_online_cpu(cpu)
> 		init_cpu(cpu);
> 
> 	put_online_cpus();
>               ^
>               |   Race window; Can miss hotplug events here
>               v
> 	register_cpu_notifier(&foobar_cpu_notifier);
> 
> 
> To solve these issues and provide a race-free method to register CPU hotplug
> callbacks, this patchset introduces new variants of the callback registration
> APIs that don't hold the cpu_add_remove_lock, and exports the
> cpu_add_remove_lock via 2 new APIs cpu_notifier_register_begin/done() for use
> by various subsystems. With this in place, the following code snippet will
> register a hotplug callback as well as initialize already online CPUs without
> any race conditions.
> 
> 	cpu_notifier_register_begin();
> 
> 	for_each_online_cpu(cpu)
> 		init_cpu(cpu);
> 
> 	/* This doesn't take the cpu_add_remove_lock */
> 	__register_cpu_notifier(&foobar_cpu_notifier);
> 
> 	cpu_notifier_register_done();
> 
> 
> In this patchset, patch 1 adds lockdep annotations to catch the above mentioned
> deadlock scenario. Patch 2 introduces the new APIs and infrastructure necessary
> for race-free callback registration. The remaining patches perform tree-wide
> conversions (to use this model).
> 
> This patchset has been hosted in the below git tree. It applies cleanly on
> v3.14-rc6.
> 
> git://github.com/srivatsabhat/linux.git cpuhp-registration-fixes-v3

So to me, this is a useful patchset and it addresses a real problem.  It has
gone through several rounds of review and recently I haven't really seen any
objections against is from anyone.  On the contrary, I've seen ACKs from a
number of people whom I trust.

Thus, if nobody objects, I'm going to take this series into the PM tree for 3.15.

Thanks!

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* [PATCH 2/5 V2] IBM Akebono: Add support for a new PHY interface to the IBM emac driver
From: Alistair Popple @ 2014-03-11  0:44 UTC (permalink / raw)
  To: alistair, netdev, davem; +Cc: devicetree, linuxppc-dev, linux-kernel
In-Reply-To: <20140307.154142.488351276799532264.davem@davemloft.net>

The IBM PPC476GTR SoC that is used on the Akebono board uses a
different ethernet PHY interface that has wake on lan (WOL) support
with the IBM emac. This patch adds support to the IBM emac driver for
this new PHY interface.

At this stage the wake on lan functionality has not been implemented.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

This version just fixes the coding style as suggested by David M. 

 .../devicetree/bindings/powerpc/4xx/emac.txt       |   9 +
 drivers/net/ethernet/ibm/emac/Kconfig              |   4 +
 drivers/net/ethernet/ibm/emac/Makefile             |   1 +
 drivers/net/ethernet/ibm/emac/core.c               |  50 ++++-
 drivers/net/ethernet/ibm/emac/core.h               |  12 +
 drivers/net/ethernet/ibm/emac/rgmii_wol.c          | 244 +++++++++++++++++++++
 drivers/net/ethernet/ibm/emac/rgmii_wol.h          |  62 ++++++
 7 files changed, 376 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/ethernet/ibm/emac/rgmii_wol.c
 create mode 100644 drivers/net/ethernet/ibm/emac/rgmii_wol.h

diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
index 712baf6..0c20529 100644
--- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
+++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
@@ -61,6 +61,8 @@
 			  Fox Axon: present, whatever value is appropriate for each
 			  EMAC, that is the content of the current (bogus) "phy-port"
 			  property.
+    - rgmii-wol-device  : 1 cell, required iff connected to a RGMII in the WKUP
+                          power domain. phandle of the RGMII-WOL device node.
 
     Optional properties:
     - phy-address       : 1 cell, optional, MDIO address of the PHY. If absent,
@@ -146,3 +148,10 @@
 			   available.
 			   For Axon: 0x0000012a
 
+      iv) RGMII-WOL node
+
+    Required properties:
+    - compatible         : compatible list, containing 2 entries, first is
+			   "ibm,rgmii-wol-CHIP" where CHIP is the host ASIC (like
+			   EMAC) and the second is "ibm,rgmii-wol".
+    - reg                : <registers mapping>
diff --git a/drivers/net/ethernet/ibm/emac/Kconfig b/drivers/net/ethernet/ibm/emac/Kconfig
index 3f44a30..56ea346 100644
--- a/drivers/net/ethernet/ibm/emac/Kconfig
+++ b/drivers/net/ethernet/ibm/emac/Kconfig
@@ -55,6 +55,10 @@ config IBM_EMAC_RGMII
 	bool
 	default n
 
+config IBM_EMAC_RGMII_WOL
+	bool "IBM EMAC RGMII wake-on-LAN support" if COMPILE_TEST
+	default n
+
 config IBM_EMAC_TAH
 	bool
 	default n
diff --git a/drivers/net/ethernet/ibm/emac/Makefile b/drivers/net/ethernet/ibm/emac/Makefile
index eba2183..8843803 100644
--- a/drivers/net/ethernet/ibm/emac/Makefile
+++ b/drivers/net/ethernet/ibm/emac/Makefile
@@ -7,5 +7,6 @@ obj-$(CONFIG_IBM_EMAC) += ibm_emac.o
 ibm_emac-y := mal.o core.o phy.o
 ibm_emac-$(CONFIG_IBM_EMAC_ZMII) += zmii.o
 ibm_emac-$(CONFIG_IBM_EMAC_RGMII) += rgmii.o
+ibm_emac-$(CONFIG_IBM_EMAC_RGMII_WOL) += rgmii_wol.o
 ibm_emac-$(CONFIG_IBM_EMAC_TAH) += tah.o
 ibm_emac-$(CONFIG_IBM_EMAC_DEBUG) += debug.o
diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index ae342fd..ff58474 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -632,6 +632,8 @@ static int emac_configure(struct emac_instance *dev)
 	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
 		rgmii_set_speed(dev->rgmii_dev, dev->rgmii_port,
 				dev->phy.speed);
+	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII_WOL))
+		rgmii_wol_set_speed(dev->rgmii_wol_dev, dev->phy.speed);
 	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
 		zmii_set_speed(dev->zmii_dev, dev->zmii_port, dev->phy.speed);
 
@@ -799,6 +801,8 @@ static int __emac_mdio_read(struct emac_instance *dev, u8 id, u8 reg)
 		zmii_get_mdio(dev->zmii_dev, dev->zmii_port);
 	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
 		rgmii_get_mdio(dev->rgmii_dev, dev->rgmii_port);
+	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII_WOL))
+		rgmii_wol_get_mdio(dev->rgmii_wol_dev);
 
 	/* Wait for management interface to become idle */
 	n = 20;
@@ -846,6 +850,8 @@ static int __emac_mdio_read(struct emac_instance *dev, u8 id, u8 reg)
 	DBG2(dev, "mdio_read -> %04x" NL, r);
 	err = 0;
  bail:
+	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII_WOL))
+		rgmii_wol_put_mdio(dev->rgmii_wol_dev);
 	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
 		rgmii_put_mdio(dev->rgmii_dev, dev->rgmii_port);
 	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
@@ -871,6 +877,8 @@ static void __emac_mdio_write(struct emac_instance *dev, u8 id, u8 reg,
 		zmii_get_mdio(dev->zmii_dev, dev->zmii_port);
 	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
 		rgmii_get_mdio(dev->rgmii_dev, dev->rgmii_port);
+	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII_WOL))
+		rgmii_wol_get_mdio(dev->rgmii_wol_dev);
 
 	/* Wait for management interface to be idle */
 	n = 20;
@@ -909,6 +917,8 @@ static void __emac_mdio_write(struct emac_instance *dev, u8 id, u8 reg,
 	}
 	err = 0;
  bail:
+	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII_WOL))
+		rgmii_wol_put_mdio(dev->rgmii_wol_dev);
 	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
 		rgmii_put_mdio(dev->rgmii_dev, dev->rgmii_port);
 	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
@@ -2277,10 +2287,11 @@ struct emac_depentry {
 #define	EMAC_DEP_MAL_IDX	0
 #define	EMAC_DEP_ZMII_IDX	1
 #define	EMAC_DEP_RGMII_IDX	2
-#define	EMAC_DEP_TAH_IDX	3
-#define	EMAC_DEP_MDIO_IDX	4
-#define	EMAC_DEP_PREV_IDX	5
-#define	EMAC_DEP_COUNT		6
+#define EMAC_DEP_RGMII_WOL_IDX  3
+#define	EMAC_DEP_TAH_IDX	4
+#define	EMAC_DEP_MDIO_IDX	5
+#define	EMAC_DEP_PREV_IDX	6
+#define	EMAC_DEP_COUNT		7
 
 static int emac_check_deps(struct emac_instance *dev,
 			   struct emac_depentry *deps)
@@ -2358,6 +2369,7 @@ static int emac_wait_deps(struct emac_instance *dev)
 	deps[EMAC_DEP_MAL_IDX].phandle = dev->mal_ph;
 	deps[EMAC_DEP_ZMII_IDX].phandle = dev->zmii_ph;
 	deps[EMAC_DEP_RGMII_IDX].phandle = dev->rgmii_ph;
+	deps[EMAC_DEP_RGMII_WOL_IDX].phandle = dev->rgmii_wol_ph;
 	if (dev->tah_ph)
 		deps[EMAC_DEP_TAH_IDX].phandle = dev->tah_ph;
 	if (dev->mdio_ph)
@@ -2380,6 +2392,7 @@ static int emac_wait_deps(struct emac_instance *dev)
 		dev->mal_dev = deps[EMAC_DEP_MAL_IDX].ofdev;
 		dev->zmii_dev = deps[EMAC_DEP_ZMII_IDX].ofdev;
 		dev->rgmii_dev = deps[EMAC_DEP_RGMII_IDX].ofdev;
+		dev->rgmii_wol_dev = deps[EMAC_DEP_RGMII_WOL_IDX].ofdev;
 		dev->tah_dev = deps[EMAC_DEP_TAH_IDX].ofdev;
 		dev->mdio_dev = deps[EMAC_DEP_MDIO_IDX].ofdev;
 	}
@@ -2585,6 +2598,8 @@ static int emac_init_config(struct emac_instance *dev)
 		dev->rgmii_ph = 0;
 	if (emac_read_uint_prop(np, "rgmii-channel", &dev->rgmii_port, 0))
 		dev->rgmii_port = 0xffffffff;
+	if (emac_read_uint_prop(np, "rgmii-wol-device", &dev->rgmii_wol_ph, 0))
+		dev->rgmii_wol_ph = 0;
 	if (emac_read_uint_prop(np, "fifo-entry-size", &dev->fifo_entry_size, 0))
 		dev->fifo_entry_size = 16;
 	if (emac_read_uint_prop(np, "mal-burst-size", &dev->mal_burst_size, 0))
@@ -2671,6 +2686,16 @@ static int emac_init_config(struct emac_instance *dev)
 #endif
 	}
 
+	if (dev->rgmii_wol_ph != 0) {
+#ifdef CONFIG_IBM_EMAC_RGMII_WOL
+		dev->features |= EMAC_FTR_HAS_RGMII_WOL;
+#else
+		printk(KERN_ERR "%s: RGMII WOL support not enabled !\n",
+		       np->full_name);
+		return -ENXIO;
+#endif
+	}
+
 	/* Read MAC-address */
 	p = of_get_property(np, "local-mac-address", NULL);
 	if (p == NULL) {
@@ -2844,10 +2869,15 @@ static int emac_probe(struct platform_device *ofdev)
 	    (err = rgmii_attach(dev->rgmii_dev, dev->rgmii_port, dev->phy_mode)) != 0)
 		goto err_detach_zmii;
 
+	/* Attach to RGMII_WOL, if needed */
+	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII_WOL) &&
+	    (err = rgmii_wol_attach(dev->rgmii_wol_dev, dev->phy_mode)) != 0)
+		goto err_detach_rgmii;
+
 	/* Attach to TAH, if needed */
 	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH) &&
 	    (err = tah_attach(dev->tah_dev, dev->tah_port)) != 0)
-		goto err_detach_rgmii;
+		goto err_detach_rgmii_wol;
 
 	/* Set some link defaults before we can find out real parameters */
 	dev->phy.speed = SPEED_100;
@@ -2920,6 +2950,9 @@ static int emac_probe(struct platform_device *ofdev)
  err_detach_tah:
 	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH))
 		tah_detach(dev->tah_dev, dev->tah_port);
+ err_detach_rgmii_wol:
+	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII_WOL))
+		rgmii_wol_detach(dev->rgmii_wol_dev);
  err_detach_rgmii:
 	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
 		rgmii_detach(dev->rgmii_dev, dev->rgmii_port);
@@ -3081,12 +3114,17 @@ static int __init emac_init(void)
 	rc = tah_init();
 	if (rc)
 		goto err_rgmii;
-	rc = platform_driver_register(&emac_driver);
+	rc = rgmii_wol_init();
 	if (rc)
 		goto err_tah;
+	rc = platform_driver_register(&emac_driver);
+	if (rc)
+		goto err_rgmii_wol;
 
 	return 0;
 
+ err_rgmii_wol:
+	rgmii_wol_exit();
  err_tah:
 	tah_exit();
  err_rgmii:
diff --git a/drivers/net/ethernet/ibm/emac/core.h b/drivers/net/ethernet/ibm/emac/core.h
index 67f342a..7e1a70d 100644
--- a/drivers/net/ethernet/ibm/emac/core.h
+++ b/drivers/net/ethernet/ibm/emac/core.h
@@ -42,6 +42,7 @@
 #include "phy.h"
 #include "zmii.h"
 #include "rgmii.h"
+#include "rgmii_wol.h"
 #include "mal.h"
 #include "tah.h"
 #include "debug.h"
@@ -209,6 +210,10 @@ struct emac_instance {
 	u32				rgmii_port;
 	struct platform_device		*rgmii_dev;
 
+	/* RGMII WOL infos if any */
+	u32				rgmii_wol_ph;
+	struct platform_device		*rgmii_wol_dev;
+
 	/* TAH infos if any */
 	u32				tah_ph;
 	u32				tah_port;
@@ -332,6 +337,10 @@ struct emac_instance {
  * APM821xx does not support Half Duplex mode
  */
 #define EMAC_FTR_APM821XX_NO_HALF_DUPLEX	0x00001000
+/*
+ * Set if we have a RGMII with wake on LAN.
+ */
+#define EMAC_FTR_HAS_RGMII_WOL		0x00020000
 
 /* Right now, we don't quite handle the always/possible masks on the
  * most optimal way as we don't have a way to say something like
@@ -355,6 +364,9 @@ enum {
 #ifdef CONFIG_IBM_EMAC_RGMII
 	    EMAC_FTR_HAS_RGMII	|
 #endif
+#ifdef CONFIG_IBM_EMAC_RGMII_WOL
+	    EMAC_FTR_HAS_RGMII_WOL	|
+#endif
 #ifdef CONFIG_IBM_EMAC_NO_FLOW_CTRL
 	    EMAC_FTR_NO_FLOW_CONTROL_40x |
 #endif
diff --git a/drivers/net/ethernet/ibm/emac/rgmii_wol.c b/drivers/net/ethernet/ibm/emac/rgmii_wol.c
new file mode 100644
index 0000000..69785d7
--- /dev/null
+++ b/drivers/net/ethernet/ibm/emac/rgmii_wol.c
@@ -0,0 +1,244 @@
+/* drivers/net/ethernet/ibm/emac/rgmii_wol.c
+ *
+ * Driver for PowerPC 4xx on-chip ethernet controller, RGMII bridge with
+ * wake on LAN support.
+ *
+ * Copyright 2013 Alistair Popple, IBM Corp.
+ *                <alistair@popple.id.au>
+ *
+ * Based on rgmii.h:
+ * Copyright 2007 Benjamin Herrenschmidt, IBM Corp.
+ *                <benh@kernel.crashing.org>
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ */
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/ethtool.h>
+#include <linux/io.h>
+#include <linux/of_address.h>
+
+#include "emac.h"
+#include "debug.h"
+
+/* RGMII_WOL_REG */
+
+#define WKUP_ETH_RGSPD      0xC0000000
+#define WKUP_ETH_FCSEN      0x20000000
+#define WKUP_ETH_CRSEN      0x02000000
+#define WKUP_ETH_COLEN      0x01000000
+#define WKUP_ETH_TX_OE      0x00040000
+#define WKUP_ETH_RX_IE      0x00020000
+#define WKUP_ETH_RGMIIEN    0x00010000
+
+#define WKUP_ETH_RGSPD_10   0x00000000
+#define WKUP_ETH_RGSPD_100  0x40000000
+#define WKUP_ETH_RGSPD_1000 0x80000000
+
+/* RGMII bridge supports only GMII/TBI and RGMII/RTBI PHYs */
+static inline int rgmii_valid_mode(int phy_mode)
+{
+	return  phy_mode == PHY_MODE_GMII ||
+		phy_mode == PHY_MODE_MII ||
+		phy_mode == PHY_MODE_RGMII ||
+		phy_mode == PHY_MODE_TBI ||
+		phy_mode == PHY_MODE_RTBI;
+}
+
+int rgmii_wol_attach(struct platform_device *ofdev, int mode)
+{
+	struct rgmii_wol_instance *dev = platform_get_drvdata(ofdev);
+
+	dev_dbg(&ofdev->dev, "attach\n");
+
+	/* Check if we need to attach to a RGMII */
+	if (!rgmii_valid_mode(mode)) {
+		dev_err(&ofdev->dev, "unsupported settings !\n");
+		return -ENODEV;
+	}
+
+	mutex_lock(&dev->lock);
+
+	/* Enable this input */
+	out_be32(dev->reg, (in_be32(dev->reg) | WKUP_ETH_RGMIIEN |
+				    WKUP_ETH_TX_OE | WKUP_ETH_RX_IE));
+
+	++dev->users;
+
+	mutex_unlock(&dev->lock);
+
+	return 0;
+}
+
+void rgmii_wol_set_speed(struct platform_device *ofdev, int speed)
+{
+	struct rgmii_wol_instance *dev = platform_get_drvdata(ofdev);
+	u32 reg;
+
+	mutex_lock(&dev->lock);
+
+	reg = in_be32(dev->reg) & ~WKUP_ETH_RGSPD;
+
+	dev_dbg(&ofdev->dev, "speed(%d)\n", speed);
+
+	switch (speed) {
+	case SPEED_1000:
+		reg |= WKUP_ETH_RGSPD_1000;
+		break;
+	case SPEED_100:
+		reg |= WKUP_ETH_RGSPD_100;
+		break;
+	case SPEED_10:
+		reg |= WKUP_ETH_RGSPD_10;
+		break;
+	default:
+		dev_err(&ofdev->dev, "invalid speed set!\n");
+	}
+
+	out_be32(dev->reg, reg);
+
+	mutex_unlock(&dev->lock);
+}
+
+void rgmii_wol_get_mdio(struct platform_device *ofdev)
+{
+	/* MDIO is always enabled when RGMII_WOL is enabled, so we
+	 * don't have to do anything here.
+	 */
+	dev_dbg(&ofdev->dev, "get_mdio\n");
+}
+
+void rgmii_wol_put_mdio(struct platform_device *ofdev)
+{
+	dev_dbg(&ofdev->dev, "put_mdio\n");
+}
+
+void rgmii_wol_detach(struct platform_device *ofdev)
+{
+	struct rgmii_wol_instance *dev = platform_get_drvdata(ofdev);
+
+	BUG_ON(!dev || dev->users == 0);
+
+	mutex_lock(&dev->lock);
+
+	dev_dbg(&ofdev->dev, "detach\n");
+
+	/* Disable this input */
+	out_be32(dev->reg, 0);
+
+	--dev->users;
+
+	mutex_unlock(&dev->lock);
+}
+
+int rgmii_wol_get_regs_len(struct platform_device *ofdev)
+{
+	return sizeof(struct emac_ethtool_regs_subhdr) +
+		sizeof(u32);
+}
+
+void *rgmii_wol_dump_regs(struct platform_device *ofdev, void *buf)
+{
+	struct rgmii_wol_instance *dev = platform_get_drvdata(ofdev);
+	struct emac_ethtool_regs_subhdr *hdr = buf;
+	u32 *regs = (u32 *)(hdr + 1);
+
+	hdr->version = 0;
+	hdr->index = 0; /* for now, are there chips with more than one
+			 * rgmii ? if yes, then we'll add a cell_index
+			 * like we do for emac
+			 */
+	memcpy_fromio(regs, dev->reg, sizeof(u32));
+	return regs + 1;
+}
+
+
+static int rgmii_wol_probe(struct platform_device *ofdev)
+{
+	struct device_node *np = ofdev->dev.of_node;
+	struct rgmii_wol_instance *dev;
+	int rc;
+
+	rc = -ENOMEM;
+	dev = kzalloc(sizeof(struct rgmii_wol_instance), GFP_KERNEL);
+	if (dev == NULL)
+		goto err_gone;
+
+	mutex_init(&dev->lock);
+
+	dev->reg = of_iomap(np, 0);
+	if (!dev->reg) {
+		dev_err(&ofdev->dev, "Can't map registers\n");
+		rc = -ENXIO;
+		goto err_free;
+	}
+
+	/* Check for RGMII flags */
+	if (of_property_read_bool(np, "has-mdio"))
+		dev->flags |= EMAC_RGMII_FLAG_HAS_MDIO;
+
+	dev_dbg(&ofdev->dev, "Boot REG = 0x%08x\n", in_be32(dev->reg));
+
+	/* Disable all inputs by default */
+	out_be32(dev->reg, 0);
+
+	dev_info(&ofdev->dev,
+	       "RGMII %s initialized with%s MDIO support\n",
+	       ofdev->dev.of_node->full_name,
+	       (dev->flags & EMAC_RGMII_FLAG_HAS_MDIO) ? "" : "out");
+
+	wmb();
+	platform_set_drvdata(ofdev, dev);
+
+	return 0;
+
+ err_free:
+	kfree(dev);
+ err_gone:
+	return rc;
+}
+
+static int rgmii_wol_remove(struct platform_device *ofdev)
+{
+	struct rgmii_wol_instance *dev = platform_get_drvdata(ofdev);
+
+	WARN_ON(dev->users != 0);
+
+	iounmap(dev->reg);
+	kfree(dev);
+
+	return 0;
+}
+
+static struct of_device_id rgmii_wol_match[] = {
+	{
+		.compatible	= "ibm,rgmii-wol",
+	},
+	{
+		.type		= "emac-rgmii-wol",
+	},
+	{},
+};
+
+static struct platform_driver rgmii_wol_driver = {
+	.driver = {
+		.name = "emac-rgmii-wol",
+		.owner = THIS_MODULE,
+		.of_match_table = rgmii_wol_match,
+	},
+	.probe = rgmii_wol_probe,
+	.remove = rgmii_wol_remove,
+};
+
+int __init rgmii_wol_init(void)
+{
+	return platform_driver_register(&rgmii_wol_driver);
+}
+
+void rgmii_wol_exit(void)
+{
+	platform_driver_unregister(&rgmii_wol_driver);
+}
diff --git a/drivers/net/ethernet/ibm/emac/rgmii_wol.h b/drivers/net/ethernet/ibm/emac/rgmii_wol.h
new file mode 100644
index 0000000..9f0b589
--- /dev/null
+++ b/drivers/net/ethernet/ibm/emac/rgmii_wol.h
@@ -0,0 +1,62 @@
+/* drivers/net/ethernet/ibm/emac/rgmii_wol.h
+ *
+ * Driver for PowerPC 4xx on-chip ethernet controller, RGMII bridge with
+ * wake on LAN support.
+ *
+ * Copyright 2013 Alistair Popple, IBM Corp.
+ *                <alistair@popple.id.au>
+ *
+ * Based on rgmii.h:
+ * Copyright 2007 Benjamin Herrenschmidt, IBM Corp.
+ *                <benh@kernel.crashing.org>
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ */
+
+#ifndef __IBM_NEWEMAC_RGMII_WOL_H
+#define __IBM_NEWEMAC_RGMII_WOL_H
+
+/* RGMII device */
+struct rgmii_wol_instance {
+	u32 __iomem			*reg;
+
+	/* RGMII bridge flags */
+	int				flags;
+#define EMAC_RGMII_FLAG_HAS_MDIO	0x00000001
+
+	/* Only one EMAC whacks us at a time */
+	struct mutex			lock;
+
+	/* number of EMACs using this RGMII bridge */
+	int				users;
+};
+
+#ifdef CONFIG_IBM_EMAC_RGMII_WOL
+
+extern int rgmii_wol_init(void);
+extern void rgmii_wol_exit(void);
+extern int rgmii_wol_attach(struct platform_device *ofdev, int mode);
+extern void rgmii_wol_detach(struct platform_device *ofdev);
+extern void rgmii_wol_get_mdio(struct platform_device *ofdev);
+extern void rgmii_wol_put_mdio(struct platform_device *ofdev);
+extern void rgmii_wol_set_speed(struct platform_device *ofdev, int speed);
+extern int rgmii_wol_get_regs_len(struct platform_device *ofdev);
+extern void *rgmii_wol_dump_regs(struct platform_device *ofdev, void *buf);
+
+#else
+
+# define rgmii_wol_init()		0
+# define rgmii_wol_exit()		do { } while (0)
+# define rgmii_wol_attach(x, y)		(-ENXIO)
+# define rgmii_wol_detach(x)		do { } while (0)
+# define rgmii_wol_get_mdio(o)		do { } while (0)
+# define rgmii_wol_put_mdio(o)		do { } while (0)
+# define rgmii_wol_set_speed(x, y)	do { } while (0)
+# define rgmii_wol_get_regs_len(x)	0
+# define rgmii_wol_dump_regs(x, buf)	(buf)
+#endif				/* !CONFIG_IBM_EMAC_RGMII_WOL */
+
+#endif /* __IBM_NEWEMAC_RGMII_WOL_H */
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH] powerpc/book3s: Fix CFAR clobbering issue in machine check handler.
From: Mahesh J Salgaonkar @ 2014-03-11  5:26 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt; +Cc: Paul Mackerras

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

While checking powersaving mode in machine check handler at 0x200, we
clobber CFAR register. Fix it by saving and restoring it during beq/bgt.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/exception-64s.h |    8 ++++++++
 arch/powerpc/kernel/exceptions-64s.S     |    5 +++++
 2 files changed, 13 insertions(+)

diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 6683061..aeaa56c 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -147,6 +147,14 @@ BEGIN_FTR_SECTION_NESTED(943)						\
 END_FTR_SECTION_NESTED(ftr,ftr,943)
 
 /*
+ * Set an SPR from a register if the CPU has the given feature
+ */
+#define OPT_SET_SPR(ra, spr, ftr)					\
+BEGIN_FTR_SECTION_NESTED(943)						\
+	mtspr	spr,ra;							\
+END_FTR_SECTION_NESTED(ftr,ftr,943)
+
+/*
  * Save a register to the PACA if the CPU has the given feature
  */
 #define OPT_SAVE_REG_TO_PACA(offset, ra, ftr)				\
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 38d5073..4c34c3c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -164,13 +164,18 @@ BEGIN_FTR_SECTION
 	 */
 	mfspr	r13,SPRN_SRR1
 	rlwinm.	r13,r13,47-31,30,31
+	OPT_GET_SPR(r13, SPRN_CFAR, CPU_FTR_CFAR)
 	beq	9f
 
+	mfspr	r13,SPRN_SRR1
+	rlwinm.	r13,r13,47-31,30,31
 	/* waking up from powersave (nap) state */
 	cmpwi	cr1,r13,2
 	/* Total loss of HV state is fatal. let's just stay stuck here */
+	OPT_GET_SPR(r13, SPRN_CFAR, CPU_FTR_CFAR)
 	bgt	cr1,.
 9:
+	OPT_SET_SPR(r13, SPRN_CFAR, CPU_FTR_CFAR)
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif /* CONFIG_PPC_P7_NAP */
 	EXCEPTION_PROLOG_0(PACA_EXMC)

^ permalink raw reply related

* [PATCHv6 0/6] Getting rid of get_unused_fd() / enable close-on-exec
From: Yann Droneaud @ 2014-03-11 10:23 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Al Viro, linux-ia64, Jeremy Kerr,
	Arnd Bergmann, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, cbe-oss-dev, linux-fsdevel, Eric Paris,
	Andrew Morton
  Cc: Yann Droneaud, Ulrich Drepper, linux-kernel

Hi,

Please find the sixth revision of my patchset to remove get_unused_fd()
macro in order to help subsystems to use get_unused_fd_flags() (or
anon_inode_getfd()) with flags either provided by the userspace or
set to O_CLOEXEC by default where appropriate.

Without get_unused_fd() macro, more subsystems are likely to use
get_unused_fd_flags() (or anon_inode_getfd()) and be taught to
provide an API that let userspace choose the opening flags
of the file descriptor.

Not allowing userspace to provide the "open" flags or not using O_CLOEXEC
by default should be considered bad practice from security point of view:
in most case O_CLOEXEC must be used to not leak file descriptor across
exec().

Not allowing userspace to atomically set close-on-exec flag and not using
O_CLOEXEC should be avoided to protect multi-threaded program
from race condition when it tried to set close-on-exec flag using
fcntl(fd, F_SETFD, FD_CLOEXEC) after opening the file descriptor.

Example:

    int fd;
    int ret;

    fd = open(filename, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /*
     * window opened for another thread to call fork(),
     * then the new process can call exec() at any time
     * and the file descriptor would be inherited
     */

    ret = fcntl(fd, F_SETFD, FD_CLOEXEC)
    if (ret < 0) {
        perror("fnctl()");
        close(fd);
        return -1;
    }

vs.:

    int fd;
    fd = open(filaneme, O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
        perror("open");
        return -1;
    }

Using O_CLOEXEC by default when flags are not (eg. cannot be) provided
by userspace is the safest bet as it allows userspace to choose, if, when
and where the file descriptor is going to be inherited across exec():
userspace is free to call fcntl() to remove FD_CLOEXEC flag in the child
process that will call exec().

Unfortunately, O_CLOEXEC cannot be made the default for most existing system
calls as it's not the default behavior for POSIX / Unix. Reader interested
in this issue could have a look at "Ghosts of Unix past, part 2: Conflated
designs" [1] article by Neil Brown.

FAQ:

- Why do one want close-on-exec ?

Setting close-on-exec flag on file descriptor ensure it won't be inherited
silently by child, child of child, etc. when executing another program.

If the file descriptor is not closed, some kernel resources can be locked
until the last process with the opened file descriptor exit.

If the file descriptor is not closed, this can lead to a security issue,
eg. making resources available to a less privileged program allowing
information leak and/or deny of service.

- Why do one need atomic close-on-exec ?

Even if it's possible to set close-on-exec flag through call to fcntl()
as shown previously, it introduces a race condition in multi-threaded
process, where a thread call fork() / exec() while another thread
is between call to open() and fcntl().

Additionally, using close-on-exec free the programmer from tracking manually
which file descriptor is to close in a child before calling exec():
in a program using multiple third-party libraries, it's difficult to know
which file descriptor must be closed.
AFAIK, while there's a atexit(), pthread_atfork(), there's no atexec()
userspace function in libc to allow libraries to register a handler in
order to close its opened file descriptor before exec().

- Why default to close-on-exec ?

Some kernel interfaces don't allow userspace to pass a O_CLOEXEC-like
flag when creating a new file descriptor.

In such cases, if possible (see below), O_CLOEXEC must be made
the default so that userspace doesn't have to call fcntl()
which, as demonstrated previously, is open to race condition in
multi-threaded program.

- How to choose between flag 0 or O_CLOEXEC in call to
  get_unused_fd_flags() (or anon_inode_getfd()) ?

Short: Will it break existing application ? Will it break kernel ABI ?

       If answer is no, use O_CLOEXEC.
       If answer is yes, use 0.

Long: If userspace expect to retrieve a file descriptor with plain
      old Unix(tm) semantics, O_CLOEXEC must not be the default, as it
      could break some applications expecting that the file descriptor
      will be inherited across exec().

      But for some subsystems, such as InfiniBand, KVM, VFIO, it makes
      no sense to have file descriptors inherited across exec() since
      those are tied to resources that will vanish when a another program
      will replace the current one by mean of exec(), so it's safe to use
      O_CLOEXEC in such cases.

      For others, like XFS, the file descriptor is retrieved by one
      program and will be used by a different program, executed as a
      child. In this case, setting O_CLOEXEC would break existing
      application which do not expect to have to call fcntl(fd,
      F_SETFD, 0) to make it available across exec().

      If file descriptor created by a subsystem is not tied to the
      current process resources, it's likely legal to use it in a
      different process context, thus O_CLOEXEC must not be the default.

      If one, as a subsystem maintainer, cannot tell for sure that
      no userspace program ever rely current behavior, eg. file descriptor
      being inherited across exec(), then the default flag *must*
      be kept 0 to not break application.

- This subsystem cannot be turned to use O_CLOEXEC by default:

If O_CLOEXEC cannot be made the default, it would be interesting
to think to extend the API to have a (set of) function(s) taking
a flag parameter so that userspace can atomically request close-on-exec
if it need it (and it should need it).

- Background:

One might want to read "Secure File Descriptor Handling" [2] by
Ulrich Drepper who is responsible of adding O_CLOEXEC flag on open(),
and flags alike on other syscalls.

One might also want to read PEP-446 "Make newly created file descriptors
non-inheritable" [3] by Victor Stinner since it has lot more background
information on file descriptor leaking.

One also like to read "Excuse me son, but your code is leaking !!!" [4]
by Dan Walsh for advice.

[1] http://lwn.net/Articles/412131/
[2] http://udrepper.livejournal.com/20407.html
[3] http://www.python.org/dev/peps/pep-0446/
[4] http://danwalsh.livejournal.com/53603.html

- Statistics:

In linux-next tag 20140307, they're currently:

- 32 calls to fd_install()
       with one call part of anon_inode_getfd()
- 25 calls to get_unused_fd_flags()
       with one call part of anon_inode_getfd()
       with another part of get_unused_fd() macro
- 11 calls to anon_inode_getfd()
-  9 calls to anon_inode_getfile()
       with one call part of anon_inode_getfd()
-  6 calls to get_unused_fd()

Changes from patchset v5 [PATCHSETv5]

- perf: introduce a flag to enable close-on-exec in perf_event_open()
  DROPPED: applied upstream, commit a21b0b354d4a.

Changes from patchset v4 [PATCHSETv4]:

- rewrote cover letter following discussion with perf maintainer.
  Thanks to Peter Zijlstra.

- modified a bit some commit messages.

- events: use get_unused_fd_flags(0) instead of get_unused_fd()
  DROPPED: replaced by following patch.

- perf: introduce a flag to enable close-on-exec in perf_event_open()
  NEW: instead of hard coding the flags to 0, this patch
       allows userspace to specify close-on-exec flag.

- fanotify: use get_unused_fd_flags(0) instead of get_unused_fd()
  DROPPED: replaced by following patch.

- fanotify: enable close-on-exec on events' fd when requested in
    fanotify_init()
  NEW: instead of hard coding the flags to 0, this patch
       enable close-on-exec if userspace request it.

Changes from patchset v3 [PATCHSETv3]:

- industrialio: use anon_inode_getfd() with O_CLOEXEC flag
  DROPPED: applied upstream, commit a646fbf0fd11.

Changes from patchset v2 [PATCHSETv2]:

- android/sw_sync: use get_unused_fd_flags(O_CLOEXEC) instead of get_unused_fd()
  DROPPED: applied upstream, commit 45acea57335e.

- android/sync: use get_unused_fd_flags(O_CLOEXEC) instead of get_unused_fd()
  DROPPED: applied upstream, commit 9c6cd3b39048.

- vfio: use get_unused_fd_flags(0) instead of get_unused_fd()
  DROPPED: applied upstream, commit a5d550703d2c.
  Additionally subsystem maintainer applied another patch on top
  to set the flags to O_CLOEXEC, commit 5d042fbdbb2d.

- industrialio: use anon_inode_getfd() with O_CLOEXEC flag
  NEW: propose to use O_CLOEXEC as default flag.

Changes from patchset v1 [PATCHSETv1]:

- explicitly added subsystem maintainers as mail recepients.

- infiniband: use get_unused_fd_flags(0) instead of get_unused_fd()
  DROPPED: subsystem maintainer applied another patch
           using get_unused_fd_flags(O_CLOEXEC) as suggested,
	   commit da183c7af844.

- android/sw_sync: use get_unused_fd_flags(0) instead of get_unused_fd()
  MODIFIED: use get_unused_fd_flags(O_CLOEXEC) as suggested.

- android/sync: use get_unused_fd_flags(0) instead of get_unused_fd()
  MODIFIED: use get_unused_fd_flags(O_CLOEXEC) as suggested.

- xfs: use get_unused_fd_flags(0) instead of get_unused_fd()
  DROPPED: applied asis by subsystem maintainer, commit 862a62937e76.

- sctp: use get_unused_fd_flags(0) instead of get_unused_fd()
  DROPPED: applied asis by subsystem maintainer, commit 8a59bd3e9b29.

Links:

[PATCHSETv5]
  http://lkml.kernel.org/r/cover.1388952061.git.ydroneaud@opteya.com

[PATCHSETv4]
  http://lkml.kernel.org/r/cover.1383121137.git.ydroneaud@opteya.com

[PATCHSETv3]
  http://lkml.kernel.org/r/cover.1378460926.git.ydroneaud@opteya.com

[PATCHSETv2]
  http://lkml.kernel.org/r/cover.1376327678.git.ydroneaud@opteya.com

[PATCHSETv1]
  http://lkml.kernel.org/r/cover.1372777600.git.ydroneaud@opteya.com

Yann Droneaud (6):
  ia64: use get_unused_fd_flags(0) instead of get_unused_fd()
  ppc/cell: use get_unused_fd_flags(0) instead of get_unused_fd()
  binfmt_misc: use get_unused_fd_flags(0) instead of get_unused_fd()
  file: use get_unused_fd_flags(0) instead of get_unused_fd()
  fanotify: enable close-on-exec on events' fd when requested in
    fanotify_init()
  file: remove macro get_unused_fd()

 arch/ia64/kernel/perfmon.c                | 2 +-
 arch/powerpc/platforms/cell/spufs/inode.c | 4 ++--
 fs/binfmt_misc.c                          | 2 +-
 fs/file.c                                 | 2 +-
 fs/notify/fanotify/fanotify_user.c        | 2 +-
 include/linux/file.h                      | 1 -
 6 files changed, 6 insertions(+), 7 deletions(-)

-- 
1.8.5.3

^ permalink raw reply

* [PATCHv6 2/6] ppc/cell: use get_unused_fd_flags(0) instead of get_unused_fd()
From: Yann Droneaud @ 2014-03-11 10:23 UTC (permalink / raw)
  To: Jeremy Kerr, Arnd Bergmann, Benjamin Herrenschmidt,
	Paul Mackerras
  Cc: Yann Droneaud, cbe-oss-dev, linux-kernel, Al Viro, Andrew Morton,
	linuxppc-dev
In-Reply-To: <cover.1394532336.git.ydroneaud@opteya.com>

Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().

Instead of macro get_unused_fd(), function get_unused_fd_flags()
(or anon_inode_getfd()) should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.

In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags() (or anon_inode_getfd()) with correct
flags.

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

The hard coded flag value (0) should be reviewed on a per-subsystem
basis, and, if possible, set to O_CLOEXEC.

Link: http://lkml.kernel.org/r/cover.1394532336.git.ydroneaud@opteya.com
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
---
 arch/powerpc/platforms/cell/spufs/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 87ba7cf99cd7..51effcec30d8 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -301,7 +301,7 @@ static int spufs_context_open(struct path *path)
 	int ret;
 	struct file *filp;

-	ret = get_unused_fd();
+	ret = get_unused_fd_flags(0);
 	if (ret < 0)
 		return ret;

@@ -518,7 +518,7 @@ static int spufs_gang_open(struct path *path)
 	int ret;
 	struct file *filp;

-	ret = get_unused_fd();
+	ret = get_unused_fd_flags(0);
 	if (ret < 0)
 		return ret;

-- 
1.8.5.3

^ permalink raw reply related

* Re: [PATCH 3/7] DMA: Freescale: add fsl_dma_free_descriptor() to reduce code duplication
From: Vinod Koul @ 2014-03-11 11:06 UTC (permalink / raw)
  To: hongbo.zhang
  Cc: linux-kernel, scottwood, dmaengine, dan.j.williams, linuxppc-dev
In-Reply-To: <1389851246-8564-4-git-send-email-hongbo.zhang@freescale.com>

On Thu, Jan 16, 2014 at 01:47:22PM +0800, hongbo.zhang@freescale.com wrote:
> From: Hongbo Zhang <hongbo.zhang@freescale.com>
> 
> There are several places where descriptors are freed using identical code.
> This patch puts this code into a function to reduce code duplication.
> 
> Signed-off-by: Hongbo Zhang <hongbo.zhang@freescale.com>
> Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
> ---
>  drivers/dma/fsldma.c |   38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
> index 95236e6..ad73538 100644
> --- a/drivers/dma/fsldma.c
> +++ b/drivers/dma/fsldma.c
> @@ -418,6 +418,21 @@ static dma_cookie_t fsl_dma_tx_submit(struct dma_async_tx_descriptor *tx)
>  }
>  
>  /**
> + * fsl_dma_free_descriptor - Free descriptor from channel's DMA pool.
> + * @chan : Freescale DMA channel
> + * @desc: descriptor to be freed
> + */
> +static void fsl_dma_free_descriptor(struct fsldma_chan *chan,
> +		struct fsl_desc_sw *desc)
> +{
> +	list_del(&desc->node);
> +#ifdef FSL_DMA_LD_DEBUG
> +	chan_dbg(chan, "LD %p free\n", desc);
> +#endif
why not wrap the define stuff in the defination of chan_dbg rather than its
usage :(

-- 
~Vinod

> +	dma_pool_free(chan->desc_pool, desc, desc->async_tx.phys);
> +}
> +
> +/**
>   * fsl_dma_alloc_descriptor - Allocate descriptor from channel's DMA pool.
>   * @chan : Freescale DMA channel
>   *
> @@ -491,13 +506,8 @@ static void fsldma_free_desc_list(struct fsldma_chan *chan,
>  {
>  	struct fsl_desc_sw *desc, *_desc;
>  
> -	list_for_each_entry_safe(desc, _desc, list, node) {
> -		list_del(&desc->node);
> -#ifdef FSL_DMA_LD_DEBUG
> -		chan_dbg(chan, "LD %p free\n", desc);
> -#endif
> -		dma_pool_free(chan->desc_pool, desc, desc->async_tx.phys);
> -	}
> +	list_for_each_entry_safe(desc, _desc, list, node)
> +		fsl_dma_free_descriptor(chan, desc);
>  }
>  
>  static void fsldma_free_desc_list_reverse(struct fsldma_chan *chan,
> @@ -505,13 +515,8 @@ static void fsldma_free_desc_list_reverse(struct fsldma_chan *chan,
>  {
>  	struct fsl_desc_sw *desc, *_desc;
>  
> -	list_for_each_entry_safe_reverse(desc, _desc, list, node) {
> -		list_del(&desc->node);
> -#ifdef FSL_DMA_LD_DEBUG
> -		chan_dbg(chan, "LD %p free\n", desc);
> -#endif
> -		dma_pool_free(chan->desc_pool, desc, desc->async_tx.phys);
> -	}
> +	list_for_each_entry_safe_reverse(desc, _desc, list, node)
> +		fsl_dma_free_descriptor(chan, desc);
>  }
>  
>  /**
> @@ -827,10 +832,7 @@ static void fsldma_cleanup_descriptor(struct fsldma_chan *chan,
>  	dma_run_dependencies(txd);
>  
>  	dma_descriptor_unmap(txd);
> -#ifdef FSL_DMA_LD_DEBUG
> -	chan_dbg(chan, "LD %p free\n", desc);
> -#endif
> -	dma_pool_free(chan->desc_pool, desc, txd->phys);
> +	fsl_dma_free_descriptor(chan, desc);
>  }
>  
>  /**
> -- 
> 1.7.9.5
> 
> 
> 

-- 

^ permalink raw reply

* [RFC PATCH 1/2] powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo
From: Gautham R. Shenoy @ 2014-03-11 11:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gautham R. Shenoy
In-Reply-To: <1394537479-17231-1-git-send-email-ego@linux.vnet.ibm.com>

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Currently, the code in setup-common.c for powerpc assumes that all
clock rates are same in a smp system. This value is cached in the
variable named ppc_proc_freq and is the value that is reported in
/proc/cpuinfo.

However on the PowerNV platform, the clock rate is same only across
the threads of the same core. Hence the value that is reported in
/proc/cpuinfo is incorrect on PowerNV platforms. We need a better way
to query and report the correct value of the processor clock in
/proc/cpuinfo.

The patch achieves this by creating a machdep_call named
get_proc_freq() which is expected to returns the frequency in Hz. The
code in show_cpuinfo() can invoke this method to display the correct
clock rate on platforms that have implemented this method. On the
other powerpc platforms it can use the value cached in ppc_proc_freq.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h |  2 ++
 arch/powerpc/kernel/setup-common.c | 16 ++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index ad3025d..eb1c6d4 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -113,6 +113,8 @@ struct machdep_calls {
 	/* Optional, may be NULL. */
 	void		(*show_cpuinfo)(struct seq_file *m);
 	void		(*show_percpuinfo)(struct seq_file *m, int i);
+	/* Returns the current operating frequency of "cpu" in Hz */
+	unsigned long  	(*get_proc_freq)(unsigned int cpu);

 	void		(*init_IRQ)(void);

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index bc76cc6..bdd3045 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -209,6 +209,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 {
 	unsigned long cpu_id = (unsigned long)v - 1;
 	unsigned int pvr;
+	unsigned long proc_freq;
 	unsigned short maj;
 	unsigned short min;

@@ -260,12 +261,19 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 #endif /* CONFIG_TAU */

 	/*
-	 * Assume here that all clock rates are the same in a
-	 * smp system.  -- Cort
+	 * Platforms that have variable clock rates, should implement
+	 * the method ppc_md.get_proc_freq() that reports the clock
+	 * rate of a given cpu. The rest can use ppc_proc_freq to
+	 * report the clock rate that is same across all cpus.
 	 */
-	if (ppc_proc_freq)
+	if (ppc_md.get_proc_freq)
+		proc_freq = ppc_md.get_proc_freq(cpu_id);
+	else
+		proc_freq = ppc_proc_freq;
+
+	if (proc_freq)
 		seq_printf(m, "clock\t\t: %lu.%06luMHz\n",
-			   ppc_proc_freq / 1000000, ppc_proc_freq % 1000000);
+			   proc_freq / 1000000, proc_freq % 1000000);

 	if (ppc_md.show_percpuinfo != NULL)
 		ppc_md.show_percpuinfo(m, cpu_id);
-- 
1.8.3.1

^ permalink raw reply related

* [RFC PATCH 0/2] powernv: Show the correct clock value in /proc/cpuinfo
From: Gautham R. Shenoy @ 2014-03-11 11:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Hi,

Currently, the code in setup-common.c for powerpc assumes that all
clock rates are same in a smp system. This value is cached in the
variable named ppc_proc_freq and is the value that is reported in
/proc/cpuinfo.

However on the PowerNV platform, the clock rate is same only across
the threads of the same core. Hence the value that is reported in
/proc/cpuinfo is incorrect on PowerNV platforms.

This patch-series fixes this problem by having /proc/cpuinfo report
the value returned by cpufreq_quick_get(cpu) whenever the cpufreq
backend driver is available and fallback to the old way of reporting
the clock rate in its absence.

These patches depend on the patches to enable dynamic cpufrequency
scaling on PowerNV that can be found here:
http://linuxppc.10917.n7.nabble.com/PATCH-v2-0-6-powernv-cpufreq-Dynamic-cpu-frequency-scaling-td80641.html

Gautham R. Shenoy (2):
  powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo
  powerpc: powernv: Implement ppc_md.get_proc_freq()

 arch/powerpc/include/asm/machdep.h     |  2 ++
 arch/powerpc/kernel/setup-common.c     | 16 ++++++++++++----
 arch/powerpc/platforms/powernv/setup.c | 21 +++++++++++++++++++++
 3 files changed, 35 insertions(+), 4 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [RFC PATCH 2/2] powerpc: powernv: Implement ppc_md.get_proc_freq()
From: Gautham R. Shenoy @ 2014-03-11 11:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gautham R. Shenoy
In-Reply-To: <1394537479-17231-1-git-send-email-ego@linux.vnet.ibm.com>

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Implement a method named pnv_get_proc_freq(unsigned int cpu) which
returns the current clock rate on the 'cpu' in Hz to be reported in
/proc/cpuinfo. This method uses the value reported by cpufreq when
such a value is sane. Otherwise it falls back to old way of reporting
the clockrate, i.e. ppc_proc_freq.

Set the ppc_md.get_proc_freq() hook to pnv_get_proc_freq() on the
PowerNV platform.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/setup.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 110f4fb..423e36d 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -28,6 +28,7 @@
 #include <linux/bug.h>
 #include <linux/cpuidle.h>
 #include <linux/pci.h>
+#include <linux/cpufreq.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
@@ -235,6 +236,25 @@ void powernv_idle(void)
 	}
 }
 
+/*
+ * Returns the cpu frequency for 'cpu' in Hz. This is used by
+ * /proc/cpuinfo
+ */
+unsigned long pnv_get_proc_freq(unsigned int cpu)
+{
+	unsigned long ret_freq;
+
+	ret_freq = cpufreq_quick_get(cpu) * 1000ul;
+
+	/*
+	 * If the backend cpufreq driver does not exist,
+         * then fallback to old way of reporting the clockrate.
+	 */
+	if (!ret_freq)
+		ret_freq = ppc_proc_freq;
+	return ret_freq;
+}
+
 define_machine(powernv) {
 	.name			= "PowerNV",
 	.probe			= pnv_probe,
@@ -242,6 +262,7 @@ define_machine(powernv) {
 	.setup_arch		= pnv_setup_arch,
 	.init_IRQ		= pnv_init_IRQ,
 	.show_cpuinfo		= pnv_show_cpuinfo,
+	.get_proc_freq          = pnv_get_proc_freq,
 	.progress		= pnv_progress,
 	.machine_shutdown	= pnv_shutdown,
 	.power_save             = powernv_idle,
-- 
1.8.3.1

^ permalink raw reply related

* Re: how do I increase default kernel stack size for my MPC8548 board?
From: perth1415 @ 2014-03-11 12:35 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <r2u389deec71004291743ra5fe0d49yaf317b5c5aaea6b7@mail.gmail.com>

Hi,

I have the exact same query. We have a board with MPC8378, which is a 32-bit
processor running 2.6-25 kernel. I increased THREAD_SHIFT to 14. Now the
board hangs just after the "Uncompressing Kernel Image ... OK" message from
U-boot.
Is there any other change required for 16k stack apart from THREAD_SHIFT?

Thanks,
Partha

--
View this message in context: http://linuxppc.10917.n7.nabble.com/how-do-I-increase-default-kernel-stack-size-for-my-MPC8548-board-tp52842p80711.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: how do I increase default kernel stack size for my MPC8548 board?
From: perth1415 @ 2014-03-11 14:19 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1394541351657-80711.post@n7.nabble.com>

Looks like I found the issue :-) The stack size for init task needs to be
updated too. Right now it's by default set to 8K for PPC32.

In arch/powerpc/kernel/vmlinux.lds.S -
===
    /* The initial task and kernel stack */
#ifdef CONFIG_PPC32
    . = ALIGN(8192);
#else
    . = ALIGN(16384);
#endif
===

Problem is, how do we link this to THREAD_SHIFT?

Thanks,
Partha

--
View this message in context: http://linuxppc.10917.n7.nabble.com/how-do-I-increase-default-kernel-stack-size-for-my-MPC8548-board-tp52842p80712.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: Build regressions/improvements in v3.14-rc6
From: Geert Uytterhoeven @ 2014-03-11 17:43 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: the arch/x86 maintainers, linuxppc-dev@lists.ozlabs.org,
	Linux-sh list
In-Reply-To: <1394528182-32013-1-git-send-email-geert@linux-m68k.org>

On Tue, Mar 11, 2014 at 9:56 AM, Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> JFYI, when comparing v3.14-rc6[1]  to v3.14-rc5[3], the summaries are:
>   - build errors: +10/-4

Only fuzz from the randconfig police:

  + /scratch/kisskb/src/drivers/gpu/drm/drm_gem_cma_helper.c: error:
implicit declaration of function 'dma_alloc_writecombine'
[-Werror=implicit-function-declaration]:  => 91:2
  + /scratch/kisskb/src/drivers/gpu/drm/drm_gem_cma_helper.c: error:
implicit declaration of function 'dma_free_writecombine'
[-Werror=implicit-function-declaration]:  => 176:3

sh-randconfig

  + /scratch/kisskb/src/kernel/bounds.c: error: -mcall-aixdesc must be
big endian:  => 1:0
  + /scratch/kisskb/src/scripts/mod/devicetable-offsets.c: error:
-mcall-aixdesc must be big endian:  => 1:0
  + /scratch/kisskb/src/scripts/mod/empty.c: error: -mcall-aixdesc
must be big endian:  => 1:0
  + <stdin>: error: -mcall-aixdesc must be big endian:  => 1:0

powerpc-randconfig

  + error: No rule to make target /etc/sound/pndsperm.bin:  => N/A
  + error: No rule to make target /etc/sound/pndspini.bin:  => N/A
  + error: No rule to make target /etc/sound/trxpro.hex:  => N/A

i386-randconfig

  + error: initramfs.c: undefined reference to `__stack_chk_guard':
=> .init.text+0x1aef)

x86_64-randconfig

> [1] http://kisskb.ellerman.id.au/kisskb/head/7258/ (all 119 configs)
> [3] http://kisskb.ellerman.id.au/kisskb/head/7236/ (all 119 configs)

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Node 0 not necessary for powerpc?
From: Nishanth Aravamudan @ 2014-03-11 19:56 UTC (permalink / raw)
  To: linux-mm; +Cc: cl, linuxppc-dev, anton, rientjes

I have a P7 system that has no node0, but a node0 shows up in numactl
--hardware, which has no cpus and no memory (and no PCI devices):

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 7935 MB
node 2 free: 7716 MB
node 3 cpus:
node 3 size: 8395 MB
node 3 free: 8015 MB
node distances:
node   0   1   2   3 
  0:  10  20  10  20 
  1:  20  10  20  20 
  2:  10  20  10  20 
  3:  20  20  20  10 

This is because we statically initialize N_ONLINE to be [0] in
mm/page_alloc.c:

        [N_ONLINE] = { { [0] = 1UL } },

I'm not sure what the architectural requirements are here, but at least
on this test system, removing this initialization, it boots fine and is
running. I've not yet tried stress tests, but it's survived the
beginnings of kernbench so far.

numactl --hardware
available: 3 nodes (1-3)
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 7935 MB
node 2 free: 7479 MB
node 3 cpus:
node 3 size: 8396 MB
node 3 free: 8375 MB
node distances:
node   1   2   3 
  1:  10  20  20 
  2:  20  10  20 
  3:  20  20  10

Perhaps we could put in a ARCH_DOES_NOT_NEED_NODE0 and only define it on
powerpc for now, conditionalizing the above initialization on that?

Thanks,
Nish

^ permalink raw reply

* Re: [RFC PATCH] Remove CONFIG_DCACHE_WORD_ACCESS
From: Benjamin Herrenschmidt @ 2014-03-11 20:37 UTC (permalink / raw)
  To: Joe Perches
  Cc: linux-arch, Anton Blanchard, Russell King, Catalin Marinas, x86,
	Will Deacon, linux-kernel, Ingo Molnar, Paul Mackerras,
	Alexander Viro, H. Peter Anvin, linux-fsdevel, Thomas Gleixner,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <1393964591.20435.58.camel@joe-AO722>

On Tue, 2014-03-04 at 12:23 -0800, Joe Perches wrote:
> It seems to duplicate CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> so use that instead.
> 
> This changes the !CPU_LITTLE_ENDIAN powerpc arch to use unaligned
> accesses in fs/dcache.c and fs/namei.c as
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is enabled for that arch.
> 
> Remove the now unused DCACHE_WORD_ACCESS defines & uses.

Interesting.. we have word-at-a-time but we never enabled
DCACHE_WORD_ACCESS, I wonder why that is. In fact, we should
probably do it for LE as well for P8 if we can make a P8
only config option...

Anton, what do you reckon here ?

Cheers,
Ben.

> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>  arch/arm/Kconfig                      | 1 -
>  arch/arm/include/asm/word-at-a-time.h | 4 ++--
>  arch/arm64/Kconfig                    | 1 -
>  arch/x86/Kconfig                      | 1 -
>  fs/Kconfig                            | 4 ----
>  fs/dcache.c                           | 2 +-
>  fs/namei.c                            | 2 +-
>  7 files changed, 4 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 623a272..d5a2e60 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -12,7 +12,6 @@ config ARM
>  	select BUILDTIME_EXTABLE_SORT if MMU
>  	select CLONE_BACKWARDS
>  	select CPU_PM if (SUSPEND || CPU_IDLE)
> -	select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS
>  	select GENERIC_ATOMIC64 if (CPU_V7M || CPU_V6 || !CPU_32v6K || !AEABI)
>  	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
>  	select GENERIC_IDLE_POLL_SETUP
> diff --git a/arch/arm/include/asm/word-at-a-time.h b/arch/arm/include/asm/word-at-a-time.h
> index a6d0a29..778b2ad 100644
> --- a/arch/arm/include/asm/word-at-a-time.h
> +++ b/arch/arm/include/asm/word-at-a-time.h
> @@ -54,7 +54,7 @@ static inline unsigned long find_zero(unsigned long mask)
>  #include <asm-generic/word-at-a-time.h>
>  #endif
>  
> -#ifdef CONFIG_DCACHE_WORD_ACCESS
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>  
>  /*
>   * Load an unaligned word from kernel space.
> @@ -94,5 +94,5 @@ static inline unsigned long load_unaligned_zeropad(const void *addr)
>  	return ret;
>  }
>  
> -#endif	/* DCACHE_WORD_ACCESS */
> +#endif	/* HAVE_EFFICIENT_UNALIGNED_ACCESS */
>  #endif /* __ASM_ARM_WORD_AT_A_TIME_H */
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 764d682..2d6978c 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -13,7 +13,6 @@ config ARM64
>  	select CLONE_BACKWARDS
>  	select COMMON_CLK
>  	select CPU_PM if (SUSPEND || CPU_IDLE)
> -	select DCACHE_WORD_ACCESS
>  	select GENERIC_CLOCKEVENTS
>  	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
>  	select GENERIC_IOMAP
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index abb261e..60cfa073 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -98,7 +98,6 @@ config X86
>  	select CLKEVT_I8253
>  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	select GENERIC_IOMAP
> -	select DCACHE_WORD_ACCESS
>  	select GENERIC_SMP_IDLE_THREAD
>  	select ARCH_WANT_IPC_PARSE_VERSION if X86_32
>  	select HAVE_ARCH_SECCOMP_FILTER
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 312393f..7511271 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -4,10 +4,6 @@
>  
>  menu "File systems"
>  
> -# Use unaligned word dcache accesses
> -config DCACHE_WORD_ACCESS
> -       bool
> -
>  if BLOCK
>  
>  source "fs/ext2/Kconfig"
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 265e0ce..4e3c195 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -163,7 +163,7 @@ int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
>   * Compare 2 name strings, return 0 if they match, otherwise non-zero.
>   * The strings are both count bytes long, and count is non-zero.
>   */
> -#ifdef CONFIG_DCACHE_WORD_ACCESS
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>  
>  #include <asm/word-at-a-time.h>
>  /*
> diff --git a/fs/namei.c b/fs/namei.c
> index 385f781..1ee33ca 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1618,7 +1618,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
>   *   the final mask". Again, that could be replaced with a
>   *   efficient population count instruction or similar.
>   */
> -#ifdef CONFIG_DCACHE_WORD_ACCESS
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>  
>  #include <asm/word-at-a-time.h>
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Bug in reclaim logic with exhausted nodes?
From: Nishanth Aravamudan @ 2014-03-11 21:06 UTC (permalink / raw)
  To: linux-mm; +Cc: cl, rientjes, linuxppc-dev, anton, mgorman

We have seen the following situation on a test system:

2-node system, each node has 32GB of memory.

2 gigantic (16GB) pages reserved at boot-time, both of which are
allocated from node 1.

SLUB notices this:

[    0.000000] SLUB: Unable to allocate memory from node 1
[    0.000000] SLUB: Allocating a useless per node structure in order to
be able to continue

After boot, user then did:

echo 24 > /proc/sys/vm/nr_hugepages

And tasks are stuck:

[<c0000000010980b8>] kexec_stack+0xb8/0x8000
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f9334b0>] 0xc00000004f9334b0
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f91f440>] 0xc00000004f91f440
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb54c>] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
[<c0000000003662cc>] .kobj_attr_store+0x2c/0x50
[<c0000000002b2c2c>] .sysfs_write_file+0xec/0x1c0
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

kswapd1 is also pegged at this point at 100% cpu.

If we go in and manually:

echo 24 >
/sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages

rather than relying on the interleaving allocator from the sysctl, the
allocation succeeds (and the echo returns immediately).

I think we are hitting the following:

mm/hugetlb.c::alloc_fresh_huge_page_node():

        page = alloc_pages_exact_node(nid,
                htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
                                                __GFP_REPEAT|__GFP_NOWARN,
                huge_page_order(h));

include/linux/gfp.h:

#define GFP_THISNODE    (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

and mm/page_alloc.c::__alloc_pages_slowpath():

        /*
         * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
         * __GFP_NOWARN set) should not cause reclaim since the subsystem
         * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
         * using a larger set of nodes after it has established that the
         * allowed per node queues are empty and that nodes are
         * over allocated.
         */
        if (IS_ENABLED(CONFIG_NUMA) &&
                        (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
                goto nopage;

so we *do* reclaim in this callpath. Under my reading, since node1 is
exhausted, no matter how much work kswapd1 does, it will never reclaim
memory from node1 to satisfy a 16M page allocation request (or any
other, for that matter).

I see the following possible changes/fixes, but am unsure if
a) my analysis is right
b) which is best.

1) Since we did notice early in boot that (in this case) node 1 was
exhausted, perhaps we should mark it as such there somehow, and if a
__GFP_THISNODE allocation request comes through on such a node, we
immediately fallthrough to nopage?

2) There is the following check
        /*
         * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
         * specified, then we retry until we no longer reclaim any pages
         * (above), or we've reclaimed an order of pages at least as
         * large as the allocation's order. In both cases, if the
         * allocation still fails, we stop retrying.
         */
        if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
                return 1;

I wonder if we should add a check to also be sure that the pages we are
reclaiming, if __GFP_THISNODE is set, are from the right node?

       if (gfp_mask & __GFP_THISNODE && the progress we have made is on
       		the node requested?)

3) did_some_progress could be updated to track where the progress is
occuring, and if we are in __GFP_THISNODE allocation request and we
didn't make any progress on the correct node, we fail the allocation?

I think this situation could be reproduced (and am working on it) by
exhausting a NUMA node with 16M hugepages and then using the generic
RR allocator to ask for more. Other node exhaustion cases probably
exist, but since we can't swap the hugepages, it seems like the most
straightforward way to try and reproduce it.

Any thoughts on this? Am I way off base?

Thanks,
Nish

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox