linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] watchdog: remove unnecessary head files
@ 2014-08-04  7:36 chai wen
  2014-08-04  7:36 ` [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu chai wen
  2014-08-04 14:26 ` [PATCH 1/2] watchdog: remove unnecessary head files Don Zickus
  0 siblings, 2 replies; 8+ messages in thread
From: chai wen @ 2014-08-04  7:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: dzickus, mingo, chai wen

Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
---
 kernel/watchdog.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index c3319bd..4c2e11c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -15,11 +15,6 @@
 #include <linux/cpu.h>
 #include <linux/nmi.h>
 #include <linux/init.h>
-#include <linux/delay.h>
-#include <linux/freezer.h>
-#include <linux/kthread.h>
-#include <linux/lockdep.h>
-#include <linux/notifier.h>
 #include <linux/module.h>
 #include <linux/sysctl.h>
 #include <linux/smpboot.h>
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu
  2014-08-04  7:36 [PATCH 1/2] watchdog: remove unnecessary head files chai wen
@ 2014-08-04  7:36 ` chai wen
  2014-08-04 14:31   ` Don Zickus
  2014-08-04 14:26 ` [PATCH 1/2] watchdog: remove unnecessary head files Don Zickus
  1 sibling, 1 reply; 8+ messages in thread
From: chai wen @ 2014-08-04  7:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: dzickus, mingo, chai wen


For now, soft lockup detector warns once for each case of process softlockup.
But the thread 'watchdog/n' may can not always get cpu at the time slot between
the task switch of two processes hogging that cpu.
This case is a false negative of "warn only once for a process", as there may be
a different process that is going to hog the cpu. Is is better for detector to
be aware of it. 

Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
---
 kernel/watchdog.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 4c2e11c..908050c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
+static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
@@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	 */
 	duration = is_softlockup(touch_ts);
 	if (unlikely(duration)) {
+		pid_t pid = task_pid_nr(current);
+
 		/*
 		 * If a virtual machine is stopped by the host it can look to
 		 * the watchdog like a soft lockup, check to see if the host
@@ -326,8 +329,18 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 			return HRTIMER_RESTART;
 
 		/* only warn once */
-		if (__this_cpu_read(soft_watchdog_warn) == true)
+		if (__this_cpu_read(soft_watchdog_warn) == true) {
+			/*
+			 * soft lockup detector should be aware of that there
+			 * may be a task-swicth of two different processes
+			 * hogging the cpu continously
+			 */
+			if (__this_cpu_read(softlockup_warn_pid_saved) != pid) {
+				__this_cpu_write(soft_watchdog_warn, false);
+				__touch_watchdog();
+			}
 			return HRTIMER_RESTART;
+		}
 
 		if (softlockup_all_cpu_backtrace) {
 			/* Prevent multiple soft-lockup reports if one cpu is already
@@ -342,7 +355,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 
 		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
-			current->comm, task_pid_nr(current));
+			current->comm, pid);
+		__this_cpu_write(softlockup_warn_pid_saved, pid);
 		print_modules();
 		print_irqtrace_events(current);
 		if (regs)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] watchdog: remove unnecessary head files
  2014-08-04  7:36 [PATCH 1/2] watchdog: remove unnecessary head files chai wen
  2014-08-04  7:36 ` [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu chai wen
@ 2014-08-04 14:26 ` Don Zickus
  1 sibling, 0 replies; 8+ messages in thread
From: Don Zickus @ 2014-08-04 14:26 UTC (permalink / raw)
  To: chai wen; +Cc: linux-kernel, mingo

On Mon, Aug 04, 2014 at 03:36:18PM +0800, chai wen wrote:
> Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>

Seems reasonable.

Acked-by: Don Zickus <dzickus@redhat.com>

> ---
>  kernel/watchdog.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index c3319bd..4c2e11c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -15,11 +15,6 @@
>  #include <linux/cpu.h>
>  #include <linux/nmi.h>
>  #include <linux/init.h>
> -#include <linux/delay.h>
> -#include <linux/freezer.h>
> -#include <linux/kthread.h>
> -#include <linux/lockdep.h>
> -#include <linux/notifier.h>
>  #include <linux/module.h>
>  #include <linux/sysctl.h>
>  #include <linux/smpboot.h>
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu
  2014-08-04  7:36 ` [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu chai wen
@ 2014-08-04 14:31   ` Don Zickus
  2014-08-05  2:47     ` Chai Wen
  0 siblings, 1 reply; 8+ messages in thread
From: Don Zickus @ 2014-08-04 14:31 UTC (permalink / raw)
  To: chai wen; +Cc: linux-kernel, mingo

On Mon, Aug 04, 2014 at 03:36:19PM +0800, chai wen wrote:
> 
> For now, soft lockup detector warns once for each case of process softlockup.
> But the thread 'watchdog/n' may can not always get cpu at the time slot between
> the task switch of two processes hogging that cpu.
> This case is a false negative of "warn only once for a process", as there may be
> a different process that is going to hog the cpu. Is is better for detector to
> be aware of it. 

I am not sure I fully understand the problem resolved.

>From the changelog I understood that two processes bouncing back and forth
could hog the cpu and could create a 'false negative' (a situation not
reported but should).

But looking at the patch below I was a little confused on the
__touch_watchdog addition.  See below:

> 
> Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
> ---
>  kernel/watchdog.c |   18 ++++++++++++++++--
>  1 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 4c2e11c..908050c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
>  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
>  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
>  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
> +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
>  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
> @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  	 */
>  	duration = is_softlockup(touch_ts);
>  	if (unlikely(duration)) {
> +		pid_t pid = task_pid_nr(current);
> +
>  		/*
>  		 * If a virtual machine is stopped by the host it can look to
>  		 * the watchdog like a soft lockup, check to see if the host
> @@ -326,8 +329,18 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  			return HRTIMER_RESTART;
>  
>  		/* only warn once */
> -		if (__this_cpu_read(soft_watchdog_warn) == true)
> +		if (__this_cpu_read(soft_watchdog_warn) == true) {
> +			/*
> +			 * soft lockup detector should be aware of that there
> +			 * may be a task-swicth of two different processes
> +			 * hogging the cpu continously
> +			 */
> +			if (__this_cpu_read(softlockup_warn_pid_saved) != pid) {
> +				__this_cpu_write(soft_watchdog_warn, false);
> +				__touch_watchdog();
> +			}

The above piece is what I am trying to understand.  Are you saying that
when two different processes are hogging the cpu, undo the
soft_watchdog_warn and allow the second pid to be reported?

Just trying to understand the problem and see if this is the right
approach (because 3 or more processes could cause the same problem???).

Cheers,
Don

>  			return HRTIMER_RESTART;
> +		}
>  
>  		if (softlockup_all_cpu_backtrace) {
>  			/* Prevent multiple soft-lockup reports if one cpu is already
> @@ -342,7 +355,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  
>  		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
>  			smp_processor_id(), duration,
> -			current->comm, task_pid_nr(current));
> +			current->comm, pid);
> +		__this_cpu_write(softlockup_warn_pid_saved, pid);
>  		print_modules();
>  		print_irqtrace_events(current);
>  		if (regs)
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu
  2014-08-04 14:31   ` Don Zickus
@ 2014-08-05  2:47     ` Chai Wen
  2014-08-05 15:20       ` Don Zickus
  0 siblings, 1 reply; 8+ messages in thread
From: Chai Wen @ 2014-08-05  2:47 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, mingo

On 08/04/2014 10:31 PM, Don Zickus wrote:

> On Mon, Aug 04, 2014 at 03:36:19PM +0800, chai wen wrote:
>>
>> For now, soft lockup detector warns once for each case of process softlockup.
>> But the thread 'watchdog/n' may can not always get cpu at the time slot between
>> the task switch of two processes hogging that cpu.
>> This case is a false negative of "warn only once for a process", as there may be
>> a different process that is going to hog the cpu. Is is better for detector to
>> be aware of it. 
> 
> I am not sure I fully understand the problem resolved.
> 
>>From the changelog I understood that two processes bouncing back and forth
> could hog the cpu and could create a 'false negative' (a situation not
> reported but should).


Hi Don

Thanks for your comment.
Perhaps 'task-switch' is wrong and is some misleading here, sorry for that.

Here I mean the very case that between a termination of an old cpu hogging
process and a starting getting cpu of a new process hogging cpu.
The case that two or more processes bouncing back and forth and the thread 'watchdog/n'
getting no chance to run is rather difficult to be supposed. And I think this situation
does not exist.

When I am reading the code of warning once about a case, I think maybe it is
not so reliable by judging a "soft_watchdog_warn". And I tried a simple test to see
if it could handle the cased I mentioned above. Please see the comment and detail of
the test below.

> 
> But looking at the patch below I was a little confused on the
> __touch_watchdog addition.  See below:
> 
>>
>> Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
>> ---
>>  kernel/watchdog.c |   18 ++++++++++++++++--
>>  1 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 4c2e11c..908050c 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
>>  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
>>  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
>>  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
>> +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
>>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>>  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
>>  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
>> @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>>  	 */
>>  	duration = is_softlockup(touch_ts);
>>  	if (unlikely(duration)) {
>> +		pid_t pid = task_pid_nr(current);
>> +
>>  		/*
>>  		 * If a virtual machine is stopped by the host it can look to
>>  		 * the watchdog like a soft lockup, check to see if the host
>> @@ -326,8 +329,18 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>>  			return HRTIMER_RESTART;
>>  
>>  		/* only warn once */
>> -		if (__this_cpu_read(soft_watchdog_warn) == true)
>> +		if (__this_cpu_read(soft_watchdog_warn) == true) {
>> +			/*
>> +			 * soft lockup detector should be aware of that there
>> +			 * may be a task-swicth of two different processes
>> +			 * hogging the cpu continously
>> +			 */
>> +			if (__this_cpu_read(softlockup_warn_pid_saved) != pid) {
>> +				__this_cpu_write(soft_watchdog_warn, false);
>> +				__touch_watchdog();
>> +			}
> 
> The above piece is what I am trying to understand.  Are you saying that
> when two different processes are hogging the cpu, undo the
> soft_watchdog_warn and allow the second pid to be reported?
> 


Yes, Indeed.

> Just trying to understand the problem and see if this is the right
> approach (because 3 or more processes could cause the same problem???).
> 


Only 2 processes is involved in this case as mentioned above, and it is a case about
a termination of an old process and a starting of a new process.

Here is my test about the case:

	stuck.c:
	#include <stdlib.h>
	#include <stdio.h>

	int main(int argc, char **argv)
	{
    		while(1);
    		exit(0);
	}

	# gcc -o stuck stuck.c
	# ./stuck &
	[1] 30309
	# ./stuck &
	[2] 30310
	# taskset -pc 3 30309
	pid 30309's current affinity list: 0-3
	pid 30309's new affinity list: 3
	# taskset -pc 3 30310
	pid 30310's current affinity list: 0-3
	pid 30310's new affinity list: 3

	Then change the schedule policy of 30309 and 30310 to be SCHED_FIFO with the MAX_RT_PRIO-1 priority.
	As the firstly changed to SCHED_FIFO process hogging cpu#3, and is reported after about ~20s.
	After it is killed or terminated, the process 30310 is going out and keeping hogging the cpu continuously
	But this process can not be always reported by the detector in this test.
	If removing the 'warn once' checking, pid change and rather big lockup duration could be found.

thanks
chai wen

> Cheers,
> Don
> 
>>  			return HRTIMER_RESTART;
>> +		}
>>  
>>  		if (softlockup_all_cpu_backtrace) {
>>  			/* Prevent multiple soft-lockup reports if one cpu is already
>> @@ -342,7 +355,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>>  
>>  		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
>>  			smp_processor_id(), duration,
>> -			current->comm, task_pid_nr(current));
>> +			current->comm, pid);
>> +		__this_cpu_write(softlockup_warn_pid_saved, pid);
>>  		print_modules();
>>  		print_irqtrace_events(current);
>>  		if (regs)
>> -- 
>> 1.7.1
>>
> .
> 



-- 
Regards

Chai Wen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu
  2014-08-05  2:47     ` Chai Wen
@ 2014-08-05 15:20       ` Don Zickus
  2014-08-06  2:23         ` Chai Wen
  0 siblings, 1 reply; 8+ messages in thread
From: Don Zickus @ 2014-08-05 15:20 UTC (permalink / raw)
  To: Chai Wen; +Cc: linux-kernel, mingo

On Tue, Aug 05, 2014 at 10:47:57AM +0800, Chai Wen wrote:
> On 08/04/2014 10:31 PM, Don Zickus wrote:
> 
> > On Mon, Aug 04, 2014 at 03:36:19PM +0800, chai wen wrote:
> >>
> >> For now, soft lockup detector warns once for each case of process softlockup.
> >> But the thread 'watchdog/n' may can not always get cpu at the time slot between
> >> the task switch of two processes hogging that cpu.
> >> This case is a false negative of "warn only once for a process", as there may be
> >> a different process that is going to hog the cpu. Is is better for detector to
> >> be aware of it. 
> > 
> > I am not sure I fully understand the problem resolved.
> > 
> >>From the changelog I understood that two processes bouncing back and forth
> > could hog the cpu and could create a 'false negative' (a situation not
> > reported but should).
> 
> 
> Hi Don
> 
> Thanks for your comment.
> Perhaps 'task-switch' is wrong and is some misleading here, sorry for that.
> 
> Here I mean the very case that between a termination of an old cpu hogging
> process and a starting getting cpu of a new process hogging cpu.
> The case that two or more processes bouncing back and forth and the thread 'watchdog/n'
> getting no chance to run is rather difficult to be supposed. And I think this situation
> does not exist.
> 
> When I am reading the code of warning once about a case, I think maybe it is
> not so reliable by judging a "soft_watchdog_warn". And I tried a simple test to see
> if it could handle the cased I mentioned above. Please see the comment and detail of
> the test below.

Thank you for your test case.  I understand the problem now.  If you have
multiple processes hogging the cpu and you kill the one reported by
the softlockup warning, you will never know about the other processes
because the soft_watchdog_warn variable never gets a chance to reset.

I am ok with your patch then.

Do you mind if I modify the changelog a little bit to maybe help explain
the problem better?  I am thinking of something like below:

"
For now, soft lockup detector warns once for each case of process softlockup.
But the thread 'watchdog/n' may not always get the cpu at the time slot between
the task switch of two processes hogging that cpu to reset
soft_watchdog_warn.

An example would be two processes hogging the cpu.  Process A causes the
softlockup warning and is killed manually by a user.  Process B
immediately becomes the new process hogging the cpu preventing the
softlockup code from resetting the soft_watchdog_warn variable.

This case is a false negative of "warn only once for a process", as there may be
a different process that is going to hog the cpu.  Resolve this by
saving/checking the pid of the hogging process and use that to reset
soft_watchdog_warn too.
"

> 
> > 
> > But looking at the patch below I was a little confused on the
> > __touch_watchdog addition.  See below:
> > 
> >>
> >> Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
> >> ---
> >>  kernel/watchdog.c |   18 ++++++++++++++++--
> >>  1 files changed, 16 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >> index 4c2e11c..908050c 100644
> >> --- a/kernel/watchdog.c
> >> +++ b/kernel/watchdog.c
> >> @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
> >>  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
> >>  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
> >>  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
> >> +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
> >>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> >>  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
> >>  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
> >> @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >>  	 */
> >>  	duration = is_softlockup(touch_ts);
> >>  	if (unlikely(duration)) {
> >> +		pid_t pid = task_pid_nr(current);
> >> +
> >>  		/*
> >>  		 * If a virtual machine is stopped by the host it can look to
> >>  		 * the watchdog like a soft lockup, check to see if the host
> >> @@ -326,8 +329,18 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >>  			return HRTIMER_RESTART;
> >>  
> >>  		/* only warn once */
> >> -		if (__this_cpu_read(soft_watchdog_warn) == true)
> >> +		if (__this_cpu_read(soft_watchdog_warn) == true) {
> >> +			/*
> >> +			 * soft lockup detector should be aware of that there
> >> +			 * may be a task-swicth of two different processes
> >> +			 * hogging the cpu continously
> >> +			 */

Can I modify the comment above to something like:


+			/*
+			 * Handle the case where multiple processes are
+			 * causing softlockups but the duration is small
+			 * enough, the softlockup detector can not reset
+			 * itself in time.  Use pids to detect this.
+			 */


Cheers,
Don

> > 
> > The above piece is what I am trying to understand.  Are you saying that
> > when two different processes are hogging the cpu, undo the
> > soft_watchdog_warn and allow the second pid to be reported?
> > 
> 
> 
> Yes, Indeed.
> 
> > Just trying to understand the problem and see if this is the right
> > approach (because 3 or more processes could cause the same problem???).
> > 
> 
> 
> Only 2 processes is involved in this case as mentioned above, and it is a case about
> a termination of an old process and a starting of a new process.
> 
> Here is my test about the case:
> 
> 	stuck.c:
> 	#include <stdlib.h>
> 	#include <stdio.h>
> 
> 	int main(int argc, char **argv)
> 	{
>     		while(1);
>     		exit(0);
> 	}
> 
> 	# gcc -o stuck stuck.c
> 	# ./stuck &
> 	[1] 30309
> 	# ./stuck &
> 	[2] 30310
> 	# taskset -pc 3 30309
> 	pid 30309's current affinity list: 0-3
> 	pid 30309's new affinity list: 3
> 	# taskset -pc 3 30310
> 	pid 30310's current affinity list: 0-3
> 	pid 30310's new affinity list: 3
> 
> 	Then change the schedule policy of 30309 and 30310 to be SCHED_FIFO with the MAX_RT_PRIO-1 priority.
> 	As the firstly changed to SCHED_FIFO process hogging cpu#3, and is reported after about ~20s.
> 	After it is killed or terminated, the process 30310 is going out and keeping hogging the cpu continuously
> 	But this process can not be always reported by the detector in this test.
> 	If removing the 'warn once' checking, pid change and rather big lockup duration could be found.
> 
> thanks
> chai wen
> 
> > Cheers,
> > Don
> > 
> >>  			return HRTIMER_RESTART;
> >> +		}
> >>  
> >>  		if (softlockup_all_cpu_backtrace) {
> >>  			/* Prevent multiple soft-lockup reports if one cpu is already
> >> @@ -342,7 +355,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >>  
> >>  		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
> >>  			smp_processor_id(), duration,
> >> -			current->comm, task_pid_nr(current));
> >> +			current->comm, pid);
> >> +		__this_cpu_write(softlockup_warn_pid_saved, pid);
> >>  		print_modules();
> >>  		print_irqtrace_events(current);
> >>  		if (regs)
> >> -- 
> >> 1.7.1
> >>
> > .
> > 
> 
> 
> 
> -- 
> Regards
> 
> Chai Wen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu
  2014-08-05 15:20       ` Don Zickus
@ 2014-08-06  2:23         ` Chai Wen
  2014-08-06 13:46           ` Don Zickus
  0 siblings, 1 reply; 8+ messages in thread
From: Chai Wen @ 2014-08-06  2:23 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, mingo

On 08/05/2014 11:20 PM, Don Zickus wrote:

> On Tue, Aug 05, 2014 at 10:47:57AM +0800, Chai Wen wrote:
>> On 08/04/2014 10:31 PM, Don Zickus wrote:
>>
>>> On Mon, Aug 04, 2014 at 03:36:19PM +0800, chai wen wrote:
>>>>
>>>> For now, soft lockup detector warns once for each case of process softlockup.
>>>> But the thread 'watchdog/n' may can not always get cpu at the time slot between
>>>> the task switch of two processes hogging that cpu.
>>>> This case is a false negative of "warn only once for a process", as there may be
>>>> a different process that is going to hog the cpu. Is is better for detector to
>>>> be aware of it. 
>>>
>>> I am not sure I fully understand the problem resolved.
>>>
>>> >From the changelog I understood that two processes bouncing back and forth
>>> could hog the cpu and could create a 'false negative' (a situation not
>>> reported but should).
>>
>>
>> Hi Don
>>
>> Thanks for your comment.
>> Perhaps 'task-switch' is wrong and is some misleading here, sorry for that.
>>
>> Here I mean the very case that between a termination of an old cpu hogging
>> process and a starting getting cpu of a new process hogging cpu.
>> The case that two or more processes bouncing back and forth and the thread 'watchdog/n'
>> getting no chance to run is rather difficult to be supposed. And I think this situation
>> does not exist.
>>
>> When I am reading the code of warning once about a case, I think maybe it is
>> not so reliable by judging a "soft_watchdog_warn". And I tried a simple test to see
>> if it could handle the cased I mentioned above. Please see the comment and detail of
>> the test below.
> 
> Thank you for your test case.  I understand the problem now.  If you have
> multiple processes hogging the cpu and you kill the one reported by
> the softlockup warning, you will never know about the other processes
> because the soft_watchdog_warn variable never gets a chance to reset.
> 
> I am ok with your patch then.
> 
> Do you mind if I modify the changelog a little bit to maybe help explain
> the problem better?  I am thinking of something like below:
> 
> "
> For now, soft lockup detector warns once for each case of process softlockup.
> But the thread 'watchdog/n' may not always get the cpu at the time slot between
> the task switch of two processes hogging that cpu to reset
> soft_watchdog_warn.
> 
> An example would be two processes hogging the cpu.  Process A causes the
> softlockup warning and is killed manually by a user.  Process B
> immediately becomes the new process hogging the cpu preventing the
> softlockup code from resetting the soft_watchdog_warn variable.
> 
> This case is a false negative of "warn only once for a process", as there may be
> a different process that is going to hog the cpu.  Resolve this by
> saving/checking the pid of the hogging process and use that to reset
> soft_watchdog_warn too.
> "


Your changelog and comment below is more specific for this case.
Thanks for your work on this patch.
Please feel free to do it like this.


thanks
chai wen


> 
>>
>>>
>>> But looking at the patch below I was a little confused on the
>>> __touch_watchdog addition.  See below:
>>>
>>>>
>>>> Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
>>>> ---
>>>>  kernel/watchdog.c |   18 ++++++++++++++++--
>>>>  1 files changed, 16 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>>>> index 4c2e11c..908050c 100644
>>>> --- a/kernel/watchdog.c
>>>> +++ b/kernel/watchdog.c
>>>> @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
>>>>  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
>>>>  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
>>>>  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
>>>> +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
>>>>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>>>>  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
>>>>  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
>>>> @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>>>>  	 */
>>>>  	duration = is_softlockup(touch_ts);
>>>>  	if (unlikely(duration)) {
>>>> +		pid_t pid = task_pid_nr(current);
>>>> +
>>>>  		/*
>>>>  		 * If a virtual machine is stopped by the host it can look to
>>>>  		 * the watchdog like a soft lockup, check to see if the host
>>>> @@ -326,8 +329,18 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>>>>  			return HRTIMER_RESTART;
>>>>  
>>>>  		/* only warn once */
>>>> -		if (__this_cpu_read(soft_watchdog_warn) == true)
>>>> +		if (__this_cpu_read(soft_watchdog_warn) == true) {
>>>> +			/*
>>>> +			 * soft lockup detector should be aware of that there
>>>> +			 * may be a task-swicth of two different processes
>>>> +			 * hogging the cpu continously
>>>> +			 */
> 
> Can I modify the comment above to something like:
> 
> 
> +			/*
> +			 * Handle the case where multiple processes are
> +			 * causing softlockups but the duration is small
> +			 * enough, the softlockup detector can not reset
> +			 * itself in time.  Use pids to detect this.
> +			 */
> 
> 
> Cheers,
> Don
> 
>>>
>>> The above piece is what I am trying to understand.  Are you saying that
>>> when two different processes are hogging the cpu, undo the
>>> soft_watchdog_warn and allow the second pid to be reported?
>>>
>>
>>
>> Yes, Indeed.
>>
>>> Just trying to understand the problem and see if this is the right
>>> approach (because 3 or more processes could cause the same problem???).
>>>
>>
>>
>> Only 2 processes is involved in this case as mentioned above, and it is a case about
>> a termination of an old process and a starting of a new process.
>>
>> Here is my test about the case:
>>
>> 	stuck.c:
>> 	#include <stdlib.h>
>> 	#include <stdio.h>
>>
>> 	int main(int argc, char **argv)
>> 	{
>>     		while(1);
>>     		exit(0);
>> 	}
>>
>> 	# gcc -o stuck stuck.c
>> 	# ./stuck &
>> 	[1] 30309
>> 	# ./stuck &
>> 	[2] 30310
>> 	# taskset -pc 3 30309
>> 	pid 30309's current affinity list: 0-3
>> 	pid 30309's new affinity list: 3
>> 	# taskset -pc 3 30310
>> 	pid 30310's current affinity list: 0-3
>> 	pid 30310's new affinity list: 3
>>
>> 	Then change the schedule policy of 30309 and 30310 to be SCHED_FIFO with the MAX_RT_PRIO-1 priority.
>> 	As the firstly changed to SCHED_FIFO process hogging cpu#3, and is reported after about ~20s.
>> 	After it is killed or terminated, the process 30310 is going out and keeping hogging the cpu continuously
>> 	But this process can not be always reported by the detector in this test.
>> 	If removing the 'warn once' checking, pid change and rather big lockup duration could be found.
>>
>> thanks
>> chai wen
>>
>>> Cheers,
>>> Don
>>>
>>>>  			return HRTIMER_RESTART;
>>>> +		}
>>>>  
>>>>  		if (softlockup_all_cpu_backtrace) {
>>>>  			/* Prevent multiple soft-lockup reports if one cpu is already
>>>> @@ -342,7 +355,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>>>>  
>>>>  		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
>>>>  			smp_processor_id(), duration,
>>>> -			current->comm, task_pid_nr(current));
>>>> +			current->comm, pid);
>>>> +		__this_cpu_write(softlockup_warn_pid_saved, pid);
>>>>  		print_modules();
>>>>  		print_irqtrace_events(current);
>>>>  		if (regs)
>>>> -- 
>>>> 1.7.1
>>>>
>>> .
>>>
>>
>>
>>
>> -- 
>> Regards
>>
>> Chai Wen
> .
> 



-- 
Regards

Chai Wen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu
  2014-08-06  2:23         ` Chai Wen
@ 2014-08-06 13:46           ` Don Zickus
  0 siblings, 0 replies; 8+ messages in thread
From: Don Zickus @ 2014-08-06 13:46 UTC (permalink / raw)
  To: Chai Wen; +Cc: linux-kernel, mingo

On Wed, Aug 06, 2014 at 10:23:40AM +0800, Chai Wen wrote:
> On 08/05/2014 11:20 PM, Don Zickus wrote:
> 
> > On Tue, Aug 05, 2014 at 10:47:57AM +0800, Chai Wen wrote:
> >> On 08/04/2014 10:31 PM, Don Zickus wrote:
> >>
> >>> On Mon, Aug 04, 2014 at 03:36:19PM +0800, chai wen wrote:
> >>>>
> >>>> For now, soft lockup detector warns once for each case of process softlockup.
> >>>> But the thread 'watchdog/n' may can not always get cpu at the time slot between
> >>>> the task switch of two processes hogging that cpu.
> >>>> This case is a false negative of "warn only once for a process", as there may be
> >>>> a different process that is going to hog the cpu. Is is better for detector to
> >>>> be aware of it. 
> >>>
> >>> I am not sure I fully understand the problem resolved.
> >>>
> >>> >From the changelog I understood that two processes bouncing back and forth
> >>> could hog the cpu and could create a 'false negative' (a situation not
> >>> reported but should).
> >>
> >>
> >> Hi Don
> >>
> >> Thanks for your comment.
> >> Perhaps 'task-switch' is wrong and is some misleading here, sorry for that.
> >>
> >> Here I mean the very case that between a termination of an old cpu hogging
> >> process and a starting getting cpu of a new process hogging cpu.
> >> The case that two or more processes bouncing back and forth and the thread 'watchdog/n'
> >> getting no chance to run is rather difficult to be supposed. And I think this situation
> >> does not exist.
> >>
> >> When I am reading the code of warning once about a case, I think maybe it is
> >> not so reliable by judging a "soft_watchdog_warn". And I tried a simple test to see
> >> if it could handle the cased I mentioned above. Please see the comment and detail of
> >> the test below.
> > 
> > Thank you for your test case.  I understand the problem now.  If you have
> > multiple processes hogging the cpu and you kill the one reported by
> > the softlockup warning, you will never know about the other processes
> > because the soft_watchdog_warn variable never gets a chance to reset.
> > 
> > I am ok with your patch then.
> > 
> > Do you mind if I modify the changelog a little bit to maybe help explain
> > the problem better?  I am thinking of something like below:
> > 
> > "
> > For now, soft lockup detector warns once for each case of process softlockup.
> > But the thread 'watchdog/n' may not always get the cpu at the time slot between
> > the task switch of two processes hogging that cpu to reset
> > soft_watchdog_warn.
> > 
> > An example would be two processes hogging the cpu.  Process A causes the
> > softlockup warning and is killed manually by a user.  Process B
> > immediately becomes the new process hogging the cpu preventing the
> > softlockup code from resetting the soft_watchdog_warn variable.
> > 
> > This case is a false negative of "warn only once for a process", as there may be
> > a different process that is going to hog the cpu.  Resolve this by
> > saving/checking the pid of the hogging process and use that to reset
> > soft_watchdog_warn too.
> > "
> 
> 
> Your changelog and comment below is more specific for this case.
> Thanks for your work on this patch.
> Please feel free to do it like this.

Ok.  Great.  I'll repost with my signoff and updated changelog.

Cheers,
Don

> 
> 
> thanks
> chai wen
> 
> 
> > 
> >>
> >>>
> >>> But looking at the patch below I was a little confused on the
> >>> __touch_watchdog addition.  See below:
> >>>
> >>>>
> >>>> Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
> >>>> ---
> >>>>  kernel/watchdog.c |   18 ++++++++++++++++--
> >>>>  1 files changed, 16 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >>>> index 4c2e11c..908050c 100644
> >>>> --- a/kernel/watchdog.c
> >>>> +++ b/kernel/watchdog.c
> >>>> @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
> >>>>  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
> >>>>  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
> >>>>  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
> >>>> +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
> >>>>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> >>>>  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
> >>>>  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
> >>>> @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >>>>  	 */
> >>>>  	duration = is_softlockup(touch_ts);
> >>>>  	if (unlikely(duration)) {
> >>>> +		pid_t pid = task_pid_nr(current);
> >>>> +
> >>>>  		/*
> >>>>  		 * If a virtual machine is stopped by the host it can look to
> >>>>  		 * the watchdog like a soft lockup, check to see if the host
> >>>> @@ -326,8 +329,18 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >>>>  			return HRTIMER_RESTART;
> >>>>  
> >>>>  		/* only warn once */
> >>>> -		if (__this_cpu_read(soft_watchdog_warn) == true)
> >>>> +		if (__this_cpu_read(soft_watchdog_warn) == true) {
> >>>> +			/*
> >>>> +			 * soft lockup detector should be aware of that there
> >>>> +			 * may be a task-swicth of two different processes
> >>>> +			 * hogging the cpu continously
> >>>> +			 */
> > 
> > Can I modify the comment above to something like:
> > 
> > 
> > +			/*
> > +			 * Handle the case where multiple processes are
> > +			 * causing softlockups but the duration is small
> > +			 * enough, the softlockup detector can not reset
> > +			 * itself in time.  Use pids to detect this.
> > +			 */
> > 
> > 
> > Cheers,
> > Don
> > 
> >>>
> >>> The above piece is what I am trying to understand.  Are you saying that
> >>> when two different processes are hogging the cpu, undo the
> >>> soft_watchdog_warn and allow the second pid to be reported?
> >>>
> >>
> >>
> >> Yes, Indeed.
> >>
> >>> Just trying to understand the problem and see if this is the right
> >>> approach (because 3 or more processes could cause the same problem???).
> >>>
> >>
> >>
> >> Only 2 processes is involved in this case as mentioned above, and it is a case about
> >> a termination of an old process and a starting of a new process.
> >>
> >> Here is my test about the case:
> >>
> >> 	stuck.c:
> >> 	#include <stdlib.h>
> >> 	#include <stdio.h>
> >>
> >> 	int main(int argc, char **argv)
> >> 	{
> >>     		while(1);
> >>     		exit(0);
> >> 	}
> >>
> >> 	# gcc -o stuck stuck.c
> >> 	# ./stuck &
> >> 	[1] 30309
> >> 	# ./stuck &
> >> 	[2] 30310
> >> 	# taskset -pc 3 30309
> >> 	pid 30309's current affinity list: 0-3
> >> 	pid 30309's new affinity list: 3
> >> 	# taskset -pc 3 30310
> >> 	pid 30310's current affinity list: 0-3
> >> 	pid 30310's new affinity list: 3
> >>
> >> 	Then change the schedule policy of 30309 and 30310 to be SCHED_FIFO with the MAX_RT_PRIO-1 priority.
> >> 	As the firstly changed to SCHED_FIFO process hogging cpu#3, and is reported after about ~20s.
> >> 	After it is killed or terminated, the process 30310 is going out and keeping hogging the cpu continuously
> >> 	But this process can not be always reported by the detector in this test.
> >> 	If removing the 'warn once' checking, pid change and rather big lockup duration could be found.
> >>
> >> thanks
> >> chai wen
> >>
> >>> Cheers,
> >>> Don
> >>>
> >>>>  			return HRTIMER_RESTART;
> >>>> +		}
> >>>>  
> >>>>  		if (softlockup_all_cpu_backtrace) {
> >>>>  			/* Prevent multiple soft-lockup reports if one cpu is already
> >>>> @@ -342,7 +355,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >>>>  
> >>>>  		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
> >>>>  			smp_processor_id(), duration,
> >>>> -			current->comm, task_pid_nr(current));
> >>>> +			current->comm, pid);
> >>>> +		__this_cpu_write(softlockup_warn_pid_saved, pid);
> >>>>  		print_modules();
> >>>>  		print_irqtrace_events(current);
> >>>>  		if (regs)
> >>>> -- 
> >>>> 1.7.1
> >>>>
> >>> .
> >>>
> >>
> >>
> >>
> >> -- 
> >> Regards
> >>
> >> Chai Wen
> > .
> > 
> 
> 
> 
> -- 
> Regards
> 
> Chai Wen

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-08-06 13:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-04  7:36 [PATCH 1/2] watchdog: remove unnecessary head files chai wen
2014-08-04  7:36 ` [PATCH 2/2] softlockup: make detector be aware of task switch of processes hogging cpu chai wen
2014-08-04 14:31   ` Don Zickus
2014-08-05  2:47     ` Chai Wen
2014-08-05 15:20       ` Don Zickus
2014-08-06  2:23         ` Chai Wen
2014-08-06 13:46           ` Don Zickus
2014-08-04 14:26 ` [PATCH 1/2] watchdog: remove unnecessary head files Don Zickus

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).