public inbox for linux-kernel@vger.kernel.org
* [PATCH] Softlockup (out of cpu) killer
@ 2011-12-11 22:48 Vincent Li
  2011-12-12  0:28 ` Frederic Weisbecker
  2011-12-12  9:38 ` Peter Zijlstra
  0 siblings, 2 replies; 4+ messages in thread
From: Vincent Li @ 2011-12-11 22:48 UTC
  To: Ingo Molnar
  Cc: Don Zickus, Peter Zijlstra, Andrew Morton, Mandeep Singh Baines,
	linux-kernel, Vincent Li

The kernel has an out-of-memory (OOM) killer, so why not an out-of-CPU (OOC) killer?
I tested the following patch by running a user-space CPU-hogging process, and the
softlockup detector killed the process successfully.

 A softlockup can be caused by a user-space process hogging the CPU. Add a softlockup_kill
 kernel config option to allow the kernel to kill the user-space CPU-hogging process. This
 feature is useful for high-availability systems that have uptime guarantees and where a
 softlockup must be resolved ASAP.

echo 1 > /proc/sys/kernel/softlockup_kill to enable the cpu hog process killer
echo 0 > /proc/sys/kernel/softlockup_kill to disable the cpu hog process killer

Signed-off-by: Vincent Li <vincent.mc.li@gmail.com>
---
 Documentation/kernel-parameters.txt |    4 ++++
 include/linux/sched.h               |    1 +
 kernel/sysctl.c                     |    9 +++++++++
 kernel/watchdog.c                   |   18 ++++++++++++++++++
 lib/Kconfig.debug                   |   21 +++++++++++++++++++++
 5 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 81c287f..1609387 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2418,6 +2418,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			[KNL] Should the soft-lockup detector generate panics.
 			Format: <integer>
 
+	softlockup_kill=
+			[KNL] Should the soft-lockup detector kill the cpu hog process.
+			Format: <integer>
+
 	sonypi.*=	[HW] Sony Programmable I/O Control Device driver
 			See Documentation/laptops/sonypi.txt
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..4783fac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -315,6 +315,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
 				  void __user *buffer,
 				  size_t *lenp, loff_t *ppos);
 extern unsigned int  softlockup_panic;
+extern unsigned int  softlockup_kill;
 void lockup_detector_init(void);
 #else
 static inline void touch_softlockup_watchdog(void)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ae27196..e79ea9c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -770,6 +770,15 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 	{
+		.procname	= "softlockup_kill",
+		.data		= &softlockup_kill,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
 		.procname       = "nmi_watchdog",
 		.data           = &watchdog_enabled,
 		.maxlen         = sizeof (int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 1d7bca7..5832a90 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -75,6 +75,17 @@ static int __init softlockup_panic_setup(char *str)
 }
 __setup("softlockup_panic=", softlockup_panic_setup);
 
+unsigned int __read_mostly softlockup_kill =
+			CONFIG_BOOTPARAM_SOFTLOCKUP_KILL_VALUE;
+
+static int __init softlockup_kill_setup(char *str)
+{
+	softlockup_kill = simple_strtoul(str, NULL, 0);
+
+	return 1;
+}
+__setup("softlockup_kill=", softlockup_kill_setup);
+
 static int __init nowatchdog_setup(char *str)
 {
 	watchdog_enabled = 0;
@@ -306,6 +317,13 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		else
 			dump_stack();
 
+		if (softlockup_kill) {
+			printk(KERN_ERR "Kill softlockup process [%s:%d] on CPU#%d\n",
+				current->comm, task_pid_nr(current),
+				smp_processor_id());
+			force_sig(SIGKILL, current);
+		}
+
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
 		__this_cpu_write(soft_watchdog_warn, true);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 82928f5..e4afc98 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -224,6 +224,27 @@ config BOOTPARAM_SOFTLOCKUP_PANIC_VALUE
 	default 0 if !BOOTPARAM_SOFTLOCKUP_PANIC
 	default 1 if BOOTPARAM_SOFTLOCKUP_PANIC
 
+config BOOTPARAM_SOFTLOCKUP_KILL
+	bool "Kill (cpu hog process) On Soft Lockups"
+	depends on LOCKUP_DETECTOR
+	help
+	  Say Y here to enable the kernel to kill the cpu hog process on
+	  "soft lockups", which are bugs that cause the kernel to
+	  loop in kernel mode for more than 60 seconds, without giving
+	  other tasks a chance to run.
+
+	  This feature is useful for high-availability systems that
+	  have uptime guarantees and where a lockup must be resolved ASAP.
+
+	  Say N if unsure.
+
+config BOOTPARAM_SOFTLOCKUP_KILL_VALUE
+	int
+	depends on LOCKUP_DETECTOR
+	range 0 1
+	default 0 if !BOOTPARAM_SOFTLOCKUP_KILL
+	default 1 if BOOTPARAM_SOFTLOCKUP_KILL
+
 config DETECT_HUNG_TASK
 	bool "Detect Hung Tasks"
 	depends on DEBUG_KERNEL
-- 
1.7.0.4



* Re: [PATCH] Softlockup (out of cpu) killer
  2011-12-11 22:48 [PATCH] Softlockup (out of cpu) killer Vincent Li
@ 2011-12-12  0:28 ` Frederic Weisbecker
  2011-12-12  9:38 ` Peter Zijlstra
  1 sibling, 0 replies; 4+ messages in thread
From: Frederic Weisbecker @ 2011-12-12  0:28 UTC
  To: Vincent Li
  Cc: Ingo Molnar, Don Zickus, Peter Zijlstra, Andrew Morton,
	Mandeep Singh Baines, linux-kernel

On Sun, Dec 11, 2011 at 02:48:55PM -0800, Vincent Li wrote:
> The kernel has an out-of-memory (OOM) killer, so why not an out-of-CPU (OOC) killer?
> I tested the following patch by running a user-space CPU-hogging process, and the
> softlockup detector killed the process successfully.
> 
>  A softlockup can be caused by a user-space process hogging the CPU. Add a softlockup_kill
>  kernel config option to allow the kernel to kill the user-space CPU-hogging process. This
>  feature is useful for high-availability systems that have uptime guarantees and where a
>  softlockup must be resolved ASAP.
> 
> echo 1 > /proc/sys/kernel/softlockup_kill to enable the cpu hog process killer
> echo 0 > /proc/sys/kernel/softlockup_kill to disable the cpu hog process killer

That assumes a signal would be enough to pull a process out of its softlockup.
I believe this is seldom the case. A process in a softlockup is stuck in some
place that has preemption disabled. Unless it luckily polls there for pending
signals, that won't work.

But maybe that happens more often than I think. Maybe other people have
more insight.
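
As a rough user-space analogy for the point about pending signals (a sketch
under stated assumptions, not kernel code): a signal stays pending while it
cannot be delivered and is only acted on once the task reaches a delivery
point. SIGUSR1 stands in for SIGKILL here, because SIGKILL cannot be blocked
from user space, and the function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t handled;

static void on_usr1(int sig)
{
	(void)sig;
	handled = 1;
}

/*
 * Hypothetical helper: returns 1 if SIGUSR1 stayed pending (handler not run)
 * while blocked and was delivered only after being unblocked, 0 otherwise.
 */
static int pending_until_delivery_point(void)
{
	struct sigaction sa;
	sigset_t block, old, pend;
	int was_pending, ran_while_blocked;

	handled = 0;
	sa.sa_handler = on_usr1;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return 0;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
		return 0;

	raise(SIGUSR1);			/* queued, but not delivered yet */
	sigemptyset(&pend);
	sigpending(&pend);
	was_pending = sigismember(&pend, SIGUSR1) == 1;
	ran_while_blocked = handled;

	/* Unblocking is the delivery point: the handler runs here. */
	sigprocmask(SIG_SETMASK, &old, NULL);

	return was_pending && !ran_while_blocked && handled;
}
```

A task looping in kernel mode that never checks for pending signals is, in
this analogy, a task that never reaches the final sigprocmask() call.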


* Re: [PATCH] Softlockup (out of cpu) killer
  2011-12-11 22:48 [PATCH] Softlockup (out of cpu) killer Vincent Li
  2011-12-12  0:28 ` Frederic Weisbecker
@ 2011-12-12  9:38 ` Peter Zijlstra
  2011-12-12 18:00   ` Vincent Li
  1 sibling, 1 reply; 4+ messages in thread
From: Peter Zijlstra @ 2011-12-12  9:38 UTC
  To: Vincent Li
  Cc: Ingo Molnar, Don Zickus, Andrew Morton, Mandeep Singh Baines,
	linux-kernel

On Sun, 2011-12-11 at 14:48 -0800, Vincent Li wrote:
> The kernel has an out-of-memory (OOM) killer, so why not an out-of-CPU (OOC) killer?
> I tested the following patch by running a user-space CPU-hogging process, and the
> softlockup detector killed the process successfully.
> 
>  A softlockup can be caused by a user-space process hogging the CPU. Add a softlockup_kill
>  kernel config option to allow the kernel to kill the user-space CPU-hogging process. This
>  feature is useful for high-availability systems that have uptime guarantees and where a
>  softlockup must be resolved ASAP.
> 
> echo 1 > /proc/sys/kernel/softlockup_kill to enable the cpu hog process killer
> echo 0 > /proc/sys/kernel/softlockup_kill to disable the cpu hog process killer
> 
> Signed-off-by: Vincent Li <vincent.mc.li@gmail.com>

Your whole premise is broken. Being a cpu hog and the softlockup
mechanism aren't related at all.

Furthermore, since the normal scheduling policy is a proportional one, a
cpu hog can't in fact starve anybody (although a fork bomb could). And
FIFO/RR are privileged ops.
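
To make the proportional argument concrete, here is a deliberately simplified
model (an illustration only, not the scheduler's actual code): each runnable
task receives CPU time in proportion to its weight. The function name is
hypothetical; 1024 is the weight the kernel assigns to a nice-0 task.

```c
#include <assert.h>

/*
 * Simplified proportional-share model: the fraction of CPU time a task
 * receives is its weight divided by the total weight of all runnable tasks.
 */
static double cpu_share(unsigned int weight, unsigned int total_weight)
{
	assert(total_weight > 0 && weight <= total_weight);
	return (double)weight / (double)total_weight;
}
```

With one hog and three other busy tasks, all at nice 0, cpu_share(1024, 4096)
gives the hog a quarter of the CPU and leaves the rest to the others, so under
this model nobody is starved.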

Furthermore the distinction between memory and cpu-time is that memory
isn't a renewable resource, whereas time is. There's always more time,
but there's not always more memory.

So no, I don't think either your patch or your concept makes any sense.
Consider it nacked.


* Re: [PATCH] Softlockup (out of cpu) killer
  2011-12-12  9:38 ` Peter Zijlstra
@ 2011-12-12 18:00   ` Vincent Li
  0 siblings, 0 replies; 4+ messages in thread
From: Vincent Li @ 2011-12-12 18:00 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Don Zickus, Andrew Morton, Mandeep Singh Baines,
	linux-kernel

>
> Your whole premise is broken. Being a cpu hog and the softlockup
> mechanism aren't related at all.
>
I fully accept that I may have misunderstood the relationship between
CPU hogs and the softlockup mechanism :)

> Furthermore, since the normal scheduling policy is a proportional one, a
> cpu hog can't in fact starve anybody (although a fork bomb could). And
> FIFO/RR are privileged ops.
>

I have a test program that runs with FIFO privileges,
http://www.vcn.bc.ca/~vli/schedrtcpu.c.txt, which reliably eats 100% CPU
in top, and the patch kills it reliably. We have a user-space
traffic-processing program that runs under FIFO much like the test
program. Under some conditions that program can get stuck on the CPU,
and we want to kill it for high-availability reasons. With this
patch, we were able to do that.
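
The linked schedrtcpu.c is not reproduced here. The following is only a
minimal sketch of the general shape such a program might take, assuming it
simply switches to SCHED_FIFO and spins. become_fifo() is a hypothetical
helper name, and the body of busyloop() is a guess.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>

/*
 * Hypothetical helper: switch the calling process to SCHED_FIFO at the
 * given priority. Returns 0 on success or a negative errno value
 * (typically -EPERM when run without the required privileges).
 */
static int become_fifo(int prio)
{
	struct sched_param sp;

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = prio;
	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
		return -errno;
	return 0;
}

/* Spin forever without sleeping or making system calls. */
static void busyloop(void)
{
	volatile unsigned long spins = 0;

	for (;;)
		spins++;
}
```

Calling become_fifo(1) followed by busyloop() as root keeps one CPU busy
until the process is killed, so it should only be tried on a test machine.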

I do notice that in the schedrtcpu.c test program, if I fork two
processes like below:

pid_t spawn(void)
{
        pid_t pid = fork();

        if (pid == 0)
                busyloop();     /* child: spins, never returns */
        return pid;             /* parent: child's pid, or -1 on error */
}



        pid1 = spawn();
        pid2 = spawn();

        waitpid(pid1, &status, 0);
        waitpid(pid2, &status, 0);

and run it on a two-CPU box, I get "sched: RT throttling activated" on the
console; the test program does not get stuck on the CPU and only reaches
95%. Strangely, if I don't fork and only run the busyloop, RT throttling
is not activated and the program consistently eats 100% of a single CPU.
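
For reference, the 95% ceiling is consistent with the default RT throttling
budget, which comes from /proc/sys/kernel/sched_rt_runtime_us (950000 by
default) and /proc/sys/kernel/sched_rt_period_us (1000000 by default). The
sketch below only does that arithmetic; the function name is hypothetical,
and it does not explain why the single-process case escapes throttling.

```c
#include <assert.h>

/*
 * Percentage of each period that realtime tasks may run, given the values
 * of sched_rt_runtime_us and sched_rt_period_us. A negative runtime (-1)
 * means throttling is disabled.
 */
static long rt_budget_percent(long runtime_us, long period_us)
{
	assert(period_us > 0);
	if (runtime_us < 0)
		return 100;
	return runtime_us * 100 / period_us;
}
```

With the defaults, rt_budget_percent(950000, 1000000) evaluates to 95.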

In our corner case, it appears that the patch does help solve our problem.


> Furthermore the distinction between memory and cpu-time is that memory
> isn't a renewable resource, whereas time is. There's always more time,
> but there's not always more memory.
>
understood, thanks

> So no, I don't think either your patch or your concept makes any sense.
> Consider it nacked.
