public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] lockup detector fixes
@ 2010-09-01  3:00 Don Zickus
  2010-09-01  3:00 ` [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Don Zickus @ 2010-09-01  3:00 UTC (permalink / raw)
  To: mingo; +Cc: peterz, gorcunov, fweisbec, linux-kernel, Don Zickus

A bunch of fixes for the lockup detector.

Tested on intel/amd on top of ingo's tip branch.

Akinobu Mita (2):
  lockup_detector: convert cpu notifier to return encapsulate errno
    value
  lockup_detector: remove unnecessary panic_notifier

Don Zickus (1):
  [lockup detector] sync touch_*_watchdog back to old semantics

 kernel/watchdog.c |   53 +++++++++++++++++++++++------------------------------
 1 files changed, 23 insertions(+), 30 deletions(-)

-- 
1.7.2.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01  3:00 [PATCH 0/3] lockup detector fixes Don Zickus
@ 2010-09-01  3:00 ` Don Zickus
  2010-09-01  5:30   ` Ingo Molnar
  2010-09-01  9:19   ` [tip:perf/urgent] lockup_detector: Sync " tip-bot for Don Zickus
  2010-09-01  3:00 ` [PATCH 2/3] lockup_detector: convert cpu notifier to return encapsulate errno value Don Zickus
  2010-09-01  3:00 ` [PATCH 3/3] lockup_detector: remove unnecessary panic_notifier Don Zickus
  2 siblings, 2 replies; 15+ messages in thread
From: Don Zickus @ 2010-09-01  3:00 UTC (permalink / raw)
  To: mingo; +Cc: peterz, gorcunov, fweisbec, linux-kernel, Don Zickus

During my rewrite, the semantics of touch_nmi_watchdog and
touch_softlockup_watchdog changed enough to break some drivers
(mostly over preemptable regions).

This change brings those touch_*_watchdog functions back in line
to how they used to work.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 kernel/watchdog.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0d53c8e..7f9c3c5 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -122,7 +122,7 @@ static void __touch_watchdog(void)
 
 void touch_softlockup_watchdog(void)
 {
-	__get_cpu_var(watchdog_touch_ts) = 0;
+	__raw_get_cpu_var(watchdog_touch_ts) = 0;
 }
 EXPORT_SYMBOL(touch_softlockup_watchdog);
 
@@ -142,7 +142,14 @@ void touch_all_softlockup_watchdogs(void)
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 void touch_nmi_watchdog(void)
 {
-	__get_cpu_var(watchdog_nmi_touch) = true;
+	if (watchdog_enabled) {
+		unsigned cpu;
+
+		for_each_present_cpu(cpu) {
+			if (per_cpu(watchdog_nmi_touch, cpu) != true)
+				per_cpu(watchdog_nmi_touch, cpu) = true;
+		}
+	}
 	touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
@@ -433,6 +440,9 @@ static int watchdog_enable(int cpu)
 		wake_up_process(p);
 	}
 
+	/* if any cpu succeeds, watchdog is considered enabled for the system */
+	watchdog_enabled = 1;
+
 	return 0;
 }
 
@@ -455,9 +465,6 @@ static void watchdog_disable(int cpu)
 		per_cpu(softlockup_watchdog, cpu) = NULL;
 		kthread_stop(p);
 	}
-
-	/* if any cpu succeeds, watchdog is considered enabled for the system */
-	watchdog_enabled = 1;
 }
 
 static void watchdog_enable_all_cpus(void)
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] lockup_detector: convert cpu notifier to return encapsulate errno value
  2010-09-01  3:00 [PATCH 0/3] lockup detector fixes Don Zickus
  2010-09-01  3:00 ` [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
@ 2010-09-01  3:00 ` Don Zickus
  2010-09-01  6:24   ` [tip:perf/core] lockup_detector: Convert " tip-bot for Akinobu Mita
  2010-09-01  3:00 ` [PATCH 3/3] lockup_detector: remove unnecessary panic_notifier Don Zickus
  2 siblings, 1 reply; 15+ messages in thread
From: Don Zickus @ 2010-09-01  3:00 UTC (permalink / raw)
  To: mingo; +Cc: peterz, gorcunov, fweisbec, linux-kernel, Akinobu Mita,
	Don Zickus

From: Akinobu Mita <akinobu.mita@gmail.com>

By the commit e6bde73b07edeb703d4c89c1daabc09c303de11f
("cpu-hotplug: return better errno on cpu hotplug failure"),
the cpu notifier can return encapsulate errno value.

This converts the cpu notifier to return encapsulate errno value
for lockup_detector.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 kernel/watchdog.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7f9c3c5..db70647 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -378,7 +378,7 @@ static int watchdog_nmi_enable(int cpu)
 	}
 
 	printk(KERN_ERR "NMI watchdog failed to create perf event on cpu%i: %p\n", cpu, event);
-	return -1;
+	return PTR_ERR(event);
 
 	/* success path */
 out_save:
@@ -422,17 +422,19 @@ static int watchdog_prepare_cpu(int cpu)
 static int watchdog_enable(int cpu)
 {
 	struct task_struct *p = per_cpu(softlockup_watchdog, cpu);
+	int err;
 
 	/* enable the perf event */
-	if (watchdog_nmi_enable(cpu) != 0)
-		return -1;
+	err = watchdog_nmi_enable(cpu);
+	if (err)
+		return err;
 
 	/* create the watchdog thread */
 	if (!p) {
 		p = kthread_create(watchdog, (void *)(unsigned long)cpu, "watchdog/%d", cpu);
 		if (IS_ERR(p)) {
 			printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu);
-			return -1;
+			return PTR_ERR(p);
 		}
 		kthread_bind(p, cpu);
 		per_cpu(watchdog_touch_ts, cpu) = 0;
@@ -526,17 +528,16 @@ static int __cpuinit
 cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 {
 	int hotcpu = (unsigned long)hcpu;
+	int err = 0;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (watchdog_prepare_cpu(hotcpu))
-			return NOTIFY_BAD;
+		err = watchdog_prepare_cpu(hotcpu);
 		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
-		if (watchdog_enable(hotcpu))
-			return NOTIFY_BAD;
+		err = watchdog_enable(hotcpu);
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_UP_CANCELED:
@@ -549,7 +550,7 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 #endif /* CONFIG_HOTPLUG_CPU */
 	}
-	return NOTIFY_OK;
+	return notifier_from_errno(err);
 }
 
 static struct notifier_block __cpuinitdata cpu_nfb = {
@@ -565,7 +566,7 @@ static int __init spawn_watchdog_task(void)
 		return 0;
 
 	err = cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
-	WARN_ON(err == NOTIFY_BAD);
+	WARN_ON(notifier_to_errno(err));
 
 	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
 	register_cpu_notifier(&cpu_nfb);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] lockup_detector: remove unnecessary panic_notifier
  2010-09-01  3:00 [PATCH 0/3] lockup detector fixes Don Zickus
  2010-09-01  3:00 ` [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
  2010-09-01  3:00 ` [PATCH 2/3] lockup_detector: convert cpu notifier to return encapsulate errno value Don Zickus
@ 2010-09-01  3:00 ` Don Zickus
  2010-09-01  6:25   ` [tip:perf/core] lockup_detector: Remove unused panic_notifier tip-bot for Akinobu Mita
  2 siblings, 1 reply; 15+ messages in thread
From: Don Zickus @ 2010-09-01  3:00 UTC (permalink / raw)
  To: mingo; +Cc: peterz, gorcunov, fweisbec, linux-kernel, Akinobu Mita,
	Don Zickus

From: Akinobu Mita <akinobu.mita@gmail.com>

The panic notifer in lockup_detector just set did_panic to 1.
But did_panic is not used anywhere so we can just remove it.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 kernel/watchdog.c |   15 ---------------
 1 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index db70647..fa71aeb 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -43,7 +43,6 @@ static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
 static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
 #endif
 
-static int __read_mostly did_panic;
 static int __initdata no_watchdog;
 
 
@@ -187,18 +186,6 @@ static int is_softlockup(unsigned long touch_ts)
 	return 0;
 }
 
-static int
-watchdog_panic(struct notifier_block *this, unsigned long event, void *ptr)
-{
-	did_panic = 1;
-
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block panic_block = {
-	.notifier_call = watchdog_panic,
-};
-
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 static struct perf_event_attr wd_hw_attr = {
 	.type		= PERF_TYPE_HARDWARE,
@@ -571,8 +558,6 @@ static int __init spawn_watchdog_task(void)
 	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
 	register_cpu_notifier(&cpu_nfb);
 
-	atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
-
 	return 0;
 }
 early_initcall(spawn_watchdog_task);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01  3:00 ` [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
@ 2010-09-01  5:30   ` Ingo Molnar
  2010-09-01  6:00     ` Cyrill Gorcunov
  2010-09-01  9:19   ` [tip:perf/urgent] lockup_detector: Sync " tip-bot for Don Zickus
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2010-09-01  5:30 UTC (permalink / raw)
  To: Don Zickus; +Cc: peterz, gorcunov, fweisbec, linux-kernel


* Don Zickus <dzickus@redhat.com> wrote:

>  void touch_nmi_watchdog(void)
>  {
> -	__get_cpu_var(watchdog_nmi_touch) = true;
> +	if (watchdog_enabled) {
> +		unsigned cpu;
> +
> +		for_each_present_cpu(cpu) {
> +			if (per_cpu(watchdog_nmi_touch, cpu) != true)
> +				per_cpu(watchdog_nmi_touch, cpu) = true;
> +		}

Hm, this is going to be a scalability nightmare with lots of CPUs. Not 
only do we have a nr_cpus loop, but we touch per-cpu areas of _other_ 
CPUs - a big scalability nono.

Why do we need to do this? We never needed to touch other CPU's NMI 
lockup accounting data areas - why has this changed? The changelog does 
not explain this.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01  5:30   ` Ingo Molnar
@ 2010-09-01  6:00     ` Cyrill Gorcunov
  2010-09-01  7:01       ` Cyrill Gorcunov
  0 siblings, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2010-09-01  6:00 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Don Zickus, peterz, fweisbec, linux-kernel

On 9/1/10, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Don Zickus <dzickus@redhat.com> wrote:
>
>>  void touch_nmi_watchdog(void)
>>  {
>> -	__get_cpu_var(watchdog_nmi_touch) = true;
>> +	if (watchdog_enabled) {
>> +		unsigned cpu;
>> +
>> +		for_each_present_cpu(cpu) {
>> +			if (per_cpu(watchdog_nmi_touch, cpu) != true)
>> +				per_cpu(watchdog_nmi_touch, cpu) = true;
>> +		}
>
> Hm, this is going to be a scalability nightmare with lots of CPUs. Not
> only do we have a nr_cpus loop, but we touch per-cpu areas of _other_
> CPUs - a big scalability nono.
>
> Why do we need to do this? We never needed to touch other CPU's NMI
> lockup accounting data areas - why has this changed? The changelog does
> not explain this.
>
> Thanks,
>
> 	Ingo
>
I believe this came from old nmi watchdog code where it might be
useful when nmi watchdog activated via io-apic. I'm trying to figure
out if we really need it still.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [tip:perf/core] lockup_detector: Convert cpu notifier to return encapsulate errno value
  2010-09-01  3:00 ` [PATCH 2/3] lockup_detector: convert cpu notifier to return encapsulate errno value Don Zickus
@ 2010-09-01  6:24   ` tip-bot for Akinobu Mita
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot for Akinobu Mita @ 2010-09-01  6:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, akinobu.mita, tglx, mingo, dzickus

Commit-ID:  eac243355a99d6b9d41bbeba4fc83e7f735485f9
Gitweb:     http://git.kernel.org/tip/eac243355a99d6b9d41bbeba4fc83e7f735485f9
Author:     Akinobu Mita <akinobu.mita@gmail.com>
AuthorDate: Tue, 31 Aug 2010 23:00:08 -0400
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Sep 2010 07:33:34 +0200

lockup_detector: Convert cpu notifier to return encapsulate errno value

By the commit e6bde73b07edeb703d4c89c1daabc09c303de11f
("cpu-hotplug: return better errno on cpu hotplug failure"),
the cpu notifier can return encapsulate errno value, resulting
in more meaningful error codes for CPU hotplug failures.

This converts the cpu notifier to return encapsulate errno value
for the lockup_detector as well.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: fweisbec@gmail.com
LKML-Reference: <1283310009-22168-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/watchdog.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0d53c8e..501cb6e 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -371,7 +371,7 @@ static int watchdog_nmi_enable(int cpu)
 	}
 
 	printk(KERN_ERR "NMI watchdog failed to create perf event on cpu%i: %p\n", cpu, event);
-	return -1;
+	return PTR_ERR(event);
 
 	/* success path */
 out_save:
@@ -415,17 +415,19 @@ static int watchdog_prepare_cpu(int cpu)
 static int watchdog_enable(int cpu)
 {
 	struct task_struct *p = per_cpu(softlockup_watchdog, cpu);
+	int err;
 
 	/* enable the perf event */
-	if (watchdog_nmi_enable(cpu) != 0)
-		return -1;
+	err = watchdog_nmi_enable(cpu);
+	if (err)
+		return err;
 
 	/* create the watchdog thread */
 	if (!p) {
 		p = kthread_create(watchdog, (void *)(unsigned long)cpu, "watchdog/%d", cpu);
 		if (IS_ERR(p)) {
 			printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu);
-			return -1;
+			return PTR_ERR(p);
 		}
 		kthread_bind(p, cpu);
 		per_cpu(watchdog_touch_ts, cpu) = 0;
@@ -519,17 +521,16 @@ static int __cpuinit
 cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 {
 	int hotcpu = (unsigned long)hcpu;
+	int err = 0;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (watchdog_prepare_cpu(hotcpu))
-			return NOTIFY_BAD;
+		err = watchdog_prepare_cpu(hotcpu);
 		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
-		if (watchdog_enable(hotcpu))
-			return NOTIFY_BAD;
+		err = watchdog_enable(hotcpu);
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_UP_CANCELED:
@@ -542,7 +543,7 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 #endif /* CONFIG_HOTPLUG_CPU */
 	}
-	return NOTIFY_OK;
+	return notifier_from_errno(err);
 }
 
 static struct notifier_block __cpuinitdata cpu_nfb = {
@@ -558,7 +559,7 @@ static int __init spawn_watchdog_task(void)
 		return 0;
 
 	err = cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
-	WARN_ON(err == NOTIFY_BAD);
+	WARN_ON(notifier_to_errno(err));
 
 	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
 	register_cpu_notifier(&cpu_nfb);

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [tip:perf/core] lockup_detector: Remove unused panic_notifier
  2010-09-01  3:00 ` [PATCH 3/3] lockup_detector: remove unnecessary panic_notifier Don Zickus
@ 2010-09-01  6:25   ` tip-bot for Akinobu Mita
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot for Akinobu Mita @ 2010-09-01  6:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, akinobu.mita, tglx, mingo, dzickus

Commit-ID:  14416c35b6b9975c9593d7ecc8382d1ecaa0b598
Gitweb:     http://git.kernel.org/tip/14416c35b6b9975c9593d7ecc8382d1ecaa0b598
Author:     Akinobu Mita <akinobu.mita@gmail.com>
AuthorDate: Tue, 31 Aug 2010 23:00:09 -0400
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Sep 2010 07:33:34 +0200

lockup_detector: Remove unused panic_notifier

The panic notifer in lockup_detector just set did_panic to 1.
But did_panic is not used anywhere so we can just remove it.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: fweisbec@gmail.com
LKML-Reference: <1283310009-22168-4-git-send-email-dzickus@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/watchdog.c |   15 ---------------
 1 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 501cb6e..5b1ee4f4 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -43,7 +43,6 @@ static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
 static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
 #endif
 
-static int __read_mostly did_panic;
 static int __initdata no_watchdog;
 
 
@@ -180,18 +179,6 @@ static int is_softlockup(unsigned long touch_ts)
 	return 0;
 }
 
-static int
-watchdog_panic(struct notifier_block *this, unsigned long event, void *ptr)
-{
-	did_panic = 1;
-
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block panic_block = {
-	.notifier_call = watchdog_panic,
-};
-
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 static struct perf_event_attr wd_hw_attr = {
 	.type		= PERF_TYPE_HARDWARE,
@@ -564,8 +551,6 @@ static int __init spawn_watchdog_task(void)
 	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
 	register_cpu_notifier(&cpu_nfb);
 
-	atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
-
 	return 0;
 }
 early_initcall(spawn_watchdog_task);

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01  6:00     ` Cyrill Gorcunov
@ 2010-09-01  7:01       ` Cyrill Gorcunov
  2010-09-01  7:20         ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2010-09-01  7:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Don Zickus, peterz, fweisbec, linux-kernel

On 9/1/10, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On 9/1/10, Ingo Molnar <mingo@elte.hu> wrote:
>>
>> * Don Zickus <dzickus@redhat.com> wrote:
>>
>>>  void touch_nmi_watchdog(void)
>>>  {
>>> -	__get_cpu_var(watchdog_nmi_touch) = true;
>>> +	if (watchdog_enabled) {
>>> +		unsigned cpu;
>>> +
>>> +		for_each_present_cpu(cpu) {
>>> +			if (per_cpu(watchdog_nmi_touch, cpu) != true)
>>> +				per_cpu(watchdog_nmi_touch, cpu) = true;
>>> +		}
>>
>> Hm, this is going to be a scalability nightmare with lots of CPUs. Not
>> only do we have a nr_cpus loop, but we touch per-cpu areas of _other_
>> CPUs - a big scalability nono.
>>
>> Why do we need to do this? We never needed to touch other CPU's NMI
>> lockup accounting data areas - why has this changed? The changelog does
>> not explain this.
>>
>> Thanks,
>>
>> 	Ingo
>>
> I believe this came from old nmi watchdog code where it might be
> useful when nmi watchdog activated via io-apic. I'm trying to figure
> out if we really need it still.
>
Well, we can't drop it or make per-cpu specific, for example we need
it in case of panic with watchdog enabled and panic timeout set, or
boot delay set and etc. Seems same applies to printk_delay. Hmm...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01  7:01       ` Cyrill Gorcunov
@ 2010-09-01  7:20         ` Ingo Molnar
  2010-09-01  7:42           ` Cyrill Gorcunov
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2010-09-01  7:20 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Don Zickus, peterz, fweisbec, linux-kernel


* Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On 9/1/10, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> > On 9/1/10, Ingo Molnar <mingo@elte.hu> wrote:
> >>
> >> * Don Zickus <dzickus@redhat.com> wrote:
> >>
> >>>  void touch_nmi_watchdog(void)
> >>>  {
> >>> -	__get_cpu_var(watchdog_nmi_touch) = true;
> >>> +	if (watchdog_enabled) {
> >>> +		unsigned cpu;
> >>> +
> >>> +		for_each_present_cpu(cpu) {
> >>> +			if (per_cpu(watchdog_nmi_touch, cpu) != true)
> >>> +				per_cpu(watchdog_nmi_touch, cpu) = true;
> >>> +		}
> >>
> >> Hm, this is going to be a scalability nightmare with lots of CPUs. Not
> >> only do we have a nr_cpus loop, but we touch per-cpu areas of _other_
> >> CPUs - a big scalability nono.
> >>
> >> Why do we need to do this? We never needed to touch other CPU's NMI
> >> lockup accounting data areas - why has this changed? The changelog does
> >> not explain this.
> >>
> >> Thanks,
> >>
> >> 	Ingo
> >>
> > I believe this came from old nmi watchdog code where it might be
> > useful when nmi watchdog activated via io-apic. I'm trying to figure
> > out if we really need it still.
>
> Well, we can't drop it or make per-cpu specific, for example we need 
> it in case of panic with watchdog enabled and panic timeout set, or 
> boot delay set and etc. Seems same applies to printk_delay. Hmm...

Ok - can you cite the old watchdog code, did it really do a nr_cpus 
loop?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01  7:20         ` Ingo Molnar
@ 2010-09-01  7:42           ` Cyrill Gorcunov
  0 siblings, 0 replies; 15+ messages in thread
From: Cyrill Gorcunov @ 2010-09-01  7:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Don Zickus, peterz@infradead.org, fweisbec@gmail.com,
	linux-kernel@vger.kernel.org

On Wednesday, September 1, 2010, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
>> On 9/1/10, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> > On 9/1/10, Ingo Molnar <mingo@elte.hu> wrote:
>> >>
>> >> * Don Zickus <dzickus@redhat.com> wrote:
>> >>
>> >>>  void touch_nmi_watchdog(void)
>> >>>  {
>> >>> - __get_cpu_var(watchdog_nmi_touch) = true;
>> >>> + if (watchdog_enabled) {
>> >>> +         unsigned cpu;
>> >>> +
>> >>> +         for_each_present_cpu(cpu) {
>> >>> +                 if (per_cpu(watchdog_nmi_touch, cpu) != true)
>> >>> +                         per_cpu(watchdog_nmi_touch, cpu) = true;
>> >>> +         }
>> >>
>> >> Hm, this is going to be a scalability nightmare with lots of CPUs. Not
>> >> only do we have a nr_cpus loop, but we touch per-cpu areas of _other_
>> >> CPUs - a big scalability nono.
>> >>
>> >> Why do we need to do this? We never needed to touch other CPU's NMI
>> >> lockup accounting data areas - why has this changed? The changelog does
>> >> not explain this.
>> >>
>> >> Thanks,
>> >>
>> >>    Ingo
>> >>
>> > I believe this came from old nmi watchdog code where it might be
>> > useful when nmi watchdog activated via io-apic. I'm trying to figure
>> > out if we really need it still.
>>
>> Well, we can't drop it or make per-cpu specific, for example we need
>> it in case of panic with watchdog enabled and panic timeout set, or
>> boot delay set and etc. Seems same applies to printk_delay. Hmm...
>
> Ok - can you cite the old watchdog code, did it really do a nr_cpus
> loop?
>
> Thanks,
>
>         Ingo
>

Yes, previous touch_nmi_watchdog really did a loop as
for_each_present_cpu and touching per-cpu variable

(bad format)

void touch_nmi_watchdog(void)
{
if (nmi_watchdog_active()) {
  unsigned cpu;
  for_each_present_cpu(cpu) {
    if (per_cpu(nmi_touch,cpu) != 1)
       per_cpu(nmi_touch, cpu) = 1;
    }
  }
touch_softlockup_watchdog();
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [tip:perf/urgent] lockup_detector: Sync touch_*_watchdog back to old semantics
  2010-09-01  3:00 ` [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
  2010-09-01  5:30   ` Ingo Molnar
@ 2010-09-01  9:19   ` tip-bot for Don Zickus
  1 sibling, 0 replies; 15+ messages in thread
From: tip-bot for Don Zickus @ 2010-09-01  9:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, gorcunov, tglx, mingo, dzickus

Commit-ID:  68d3f1d810500e8b975bdf0b20dd83d060076b4b
Gitweb:     http://git.kernel.org/tip/68d3f1d810500e8b975bdf0b20dd83d060076b4b
Author:     Don Zickus <dzickus@redhat.com>
AuthorDate: Tue, 31 Aug 2010 23:00:07 -0400
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Sep 2010 10:02:28 +0200

lockup_detector: Sync touch_*_watchdog back to old semantics

During my rewrite, the semantics of touch_nmi_watchdog and
touch_softlockup_watchdog changed enough to break some drivers
(mostly over preemptable regions).

These are cases where long delays on one CPU (due to
print_delay for example) can cause long delays on other
CPUs - so we must 'touch' the nmi_watchdog flag of those
other CPUs as well.

This change brings those touch_*_watchdog() functions back in line
with to how they used to work.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: peterz@infradead.org
Cc: fweisbec@gmail.com
LKML-Reference: <1283310009-22168-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/watchdog.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0d53c8e..7f9c3c5 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -122,7 +122,7 @@ static void __touch_watchdog(void)
 
 void touch_softlockup_watchdog(void)
 {
-	__get_cpu_var(watchdog_touch_ts) = 0;
+	__raw_get_cpu_var(watchdog_touch_ts) = 0;
 }
 EXPORT_SYMBOL(touch_softlockup_watchdog);
 
@@ -142,7 +142,14 @@ void touch_all_softlockup_watchdogs(void)
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 void touch_nmi_watchdog(void)
 {
-	__get_cpu_var(watchdog_nmi_touch) = true;
+	if (watchdog_enabled) {
+		unsigned cpu;
+
+		for_each_present_cpu(cpu) {
+			if (per_cpu(watchdog_nmi_touch, cpu) != true)
+				per_cpu(watchdog_nmi_touch, cpu) = true;
+		}
+	}
 	touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
@@ -433,6 +440,9 @@ static int watchdog_enable(int cpu)
 		wake_up_process(p);
 	}
 
+	/* if any cpu succeeds, watchdog is considered enabled for the system */
+	watchdog_enabled = 1;
+
 	return 0;
 }
 
@@ -455,9 +465,6 @@ static void watchdog_disable(int cpu)
 		per_cpu(softlockup_watchdog, cpu) = NULL;
 		kthread_stop(p);
 	}
-
-	/* if any cpu succeeds, watchdog is considered enabled for the system */
-	watchdog_enabled = 1;
 }
 
 static void watchdog_enable_all_cpus(void)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
@ 2010-09-01 15:51 Don Zickus
  2010-09-01 16:36 ` Cyrill Gorcunov
  2010-09-01 17:15 ` Ingo Molnar
  0 siblings, 2 replies; 15+ messages in thread
From: Don Zickus @ 2010-09-01 15:51 UTC (permalink / raw)
  To: Ingo Molnar, Cyrill Gorcunov; +Cc: peterz, fweisbec, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=utf-8, Size: 1951 bytes --]

Top posting because droid won't let me bottom post

This patch was the result of a regression with acpi and preempt.  Akpm asked that I not change the semantics of the old touch_nmi_watchdog.  So I tried to revert to the old behaviour.

Sorry for not properly explaining that.

Cheers,
Don

Ingo Molnar <mingo@elte.hu> wrote:

>
>* Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
>> On 9/1/10, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> > On 9/1/10, Ingo Molnar <mingo@elte.hu> wrote:
>> >>
>> >> * Don Zickus <dzickus@redhat.com> wrote:
>> >>
>> >>>  void touch_nmi_watchdog(void)
>> >>>  {
>> >>> -	__get_cpu_var(watchdog_nmi_touch) = true;
>> >>> +	if (watchdog_enabled) {
>> >>> +		unsigned cpu;
>> >>> +
>> >>> +		for_each_present_cpu(cpu) {
>> >>> +			if (per_cpu(watchdog_nmi_touch, cpu) != true)
>> >>> +				per_cpu(watchdog_nmi_touch, cpu) = true;
>> >>> +		}
>> >>
>> >> Hm, this is going to be a scalability nightmare with lots of CPUs. Not
>> >> only do we have a nr_cpus loop, but we touch per-cpu areas of _other_
>> >> CPUs - a big scalability nono.
>> >>
>> >> Why do we need to do this? We never needed to touch other CPU's NMI
>> >> lockup accounting data areas - why has this changed? The changelog does
>> >> not explain this.
>> >>
>> >> Thanks,
>> >>
>> >> 	Ingo
>> >>
>> > I believe this came from old nmi watchdog code where it might be
>> > useful when nmi watchdog activated via io-apic. I'm trying to figure
>> > out if we really need it still.
>>
>> Well, we can't drop it or make per-cpu specific, for example we need 
>> it in case of panic with watchdog enabled and panic timeout set, or 
>> boot delay set and etc. Seems same applies to printk_delay. Hmm...
>
>Ok - can you cite the old watchdog code, did it really do a nr_cpus 
>loop?
>
>Thanks,
>
>	Ingo
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01 15:51 [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
@ 2010-09-01 16:36 ` Cyrill Gorcunov
  2010-09-01 17:15 ` Ingo Molnar
  1 sibling, 0 replies; 15+ messages in thread
From: Cyrill Gorcunov @ 2010-09-01 16:36 UTC (permalink / raw)
  To: Don Zickus; +Cc: Ingo Molnar, peterz, fweisbec, linux-kernel

On Wed, Sep 01, 2010 at 11:51:12AM -0400, Don Zickus wrote:
> Top posting because droid won't let me bottom post
> 
> This patch was the result of a regression with acpi and preempt.
> Akpm asked that I not change the semantics of the old touch_nmi_watchdog.
> So I tried to revert to the old behaviour.
> 
> Sorry for not properly explaining that.
> 

yup, stareing at old behaviour I think there is a place where we could
get rid of traversing all cpus: native_cpu_up() -- I don't get the reason why
watchdog counter should be reset on every other cpu as well. Perhaps
I miss something. On the other hands I think changing behaviour of
touch_nmi_watchdog just for one entry might not be worth thing to do :)

> Cheers,
> Don
> 
> Ingo Molnar <mingo@elte.hu> wrote:
> 
...
> >Ok - can you cite the old watchdog code, did it really do a nr_cpus 
> >loop?
> >
> >Thanks,
> >
> >	Ingo

	-- Cyrill

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics
  2010-09-01 15:51 [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
  2010-09-01 16:36 ` Cyrill Gorcunov
@ 2010-09-01 17:15 ` Ingo Molnar
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2010-09-01 17:15 UTC (permalink / raw)
  To: Don Zickus; +Cc: Cyrill Gorcunov, peterz, fweisbec, linux-kernel


* Don Zickus <dzickus@redhat.com> wrote:

> Top posting because droid won't let me bottom post
> 
> This patch was the result of a regression with acpi and preempt.  Akpm 
> asked that I not change the semantics of the old touch_nmi_watchdog.  
> So I tried to revert to the old behaviour.

that's ok - i extended the changelog a bit before applying the patch.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-09-01 17:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-09-01  3:00 [PATCH 0/3] lockup detector fixes Don Zickus
2010-09-01  3:00 ` [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
2010-09-01  5:30   ` Ingo Molnar
2010-09-01  6:00     ` Cyrill Gorcunov
2010-09-01  7:01       ` Cyrill Gorcunov
2010-09-01  7:20         ` Ingo Molnar
2010-09-01  7:42           ` Cyrill Gorcunov
2010-09-01  9:19   ` [tip:perf/urgent] lockup_detector: Sync " tip-bot for Don Zickus
2010-09-01  3:00 ` [PATCH 2/3] lockup_detector: convert cpu notifier to return encapsulate errno value Don Zickus
2010-09-01  6:24   ` [tip:perf/core] lockup_detector: Convert " tip-bot for Akinobu Mita
2010-09-01  3:00 ` [PATCH 3/3] lockup_detector: remove unnecessary panic_notifier Don Zickus
2010-09-01  6:25   ` [tip:perf/core] lockup_detector: Remove unused panic_notifier tip-bot for Akinobu Mita
  -- strict thread matches above, loose matches on Subject: below --
2010-09-01 15:51 [PATCH 1/3] [lockup detector] sync touch_*_watchdog back to old semantics Don Zickus
2010-09-01 16:36 ` Cyrill Gorcunov
2010-09-01 17:15 ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox