[PATCH] stop on cpu lost

public inbox for linux-kernel@vger.kernel.org

* [PATCH] stop on cpu lost
@ 2006-06-20  3:51 KAMEZAWA Hiroyuki
  2006-06-22  5:56 ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-06-20  3:51 UTC (permalink / raw)
  To: LKML; +Cc: ashok.raj, pavel, clameter, ak, nickpiggin, mingo, Andrew Morton

When an application is misconfigured at cpu hot removal, a task's
cpus_allowed can become empty. This patch adds a sysctl to stop tasks whose
cpus_allowed is empty.

I don't think there is one good answer to this problem; it depends on
system management policy. On one system, forced migration is better than
stopping. On another, stopping (and killing) tasks will meet the requirement.

How about this?

-Kame

Currently, when a task loses all of its allowed cpus because of cpu hot removal,
it is forced to migrate to a cpu it is not allowed on.

In this case, the task was not properly reconfigured by the user before
cpu hot removal, so the task (and the system) is in an unexpected, wrong state.
This migration is perhaps one realistic workaround, but sometimes it is
harmful (stealing other tasks' cpu time, triggering bugs in thread controllers,
unexpected execution...).

This patch adds the sysctl "sigstop_on_cpu_lost". When sigstop_on_cpu_lost==1,
a task which loses its cpus will be stopped by SIGSTOP.
Depending on system management policy, misconfigured applications are stopped.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 include/linux/sysctl.h |    1 +
 kernel/sched.c         |   14 ++++++++++++++
 kernel/sysctl.c        |   14 ++++++++++++++
 3 files changed, 29 insertions(+)

Index: linux-2.6.17/kernel/sched.c
===================================================================
--- linux-2.6.17.orig/kernel/sched.c
+++ linux-2.6.17/kernel/sched.c
@@ -4562,11 +4562,13 @@ wait_to_die:
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
+int sigstop_on_cpu_lost;
 /* Figure out where task on dead CPU should go, use force if neccessary. */
 static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *tsk)
 {
 	int dest_cpu;
 	cpumask_t mask;
+	int force = 0;
 
 	/* On same node? */
 	mask = node_to_cpumask(cpu_to_node(dead_cpu));
@@ -4591,8 +4593,20 @@ static void move_task_off_dead_cpu(int d
 			printk(KERN_INFO "process %d (%s) no "
 			       "longer affine to cpu%d\n",
 			       tsk->pid, tsk->comm, dead_cpu);
+		/*
+		 * This thread was not properly reconfigured before cpu hot
+		 * removal. This means the process is now in the wrong state.
+		 * If system management policy doesn't allow misconfigured
+		 * applications, this process should be stopped.
+		 */
+		if (tsk->mm && sigstop_on_cpu_lost)
+			force = 1;
 	}
 	__migrate_task(tsk, dead_cpu, dest_cpu);
+
+	if (force) {
+		force_sig_specific(SIGSTOP, tsk);
+	}
 }
 
 /*
Index: linux-2.6.17/kernel/sysctl.c
===================================================================
--- linux-2.6.17.orig/kernel/sysctl.c
+++ linux-2.6.17/kernel/sysctl.c
@@ -127,6 +127,10 @@ extern int sysctl_hz_timer;
 extern int acct_parm[];
 #endif
 
+#ifdef CONFIG_HOTPLUG_CPU
+extern int sigstop_on_cpu_lost;
+#endif
+
 #ifdef CONFIG_IA64
 extern int no_unaligned_warning;
 #endif
@@ -683,6 +687,16 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+	{
+		.ctl_name	= KERN_STOP_ON_CPU_LOST,
+		.procname	= "sigstop_on_cpu_lost",
+		.data		= &sigstop_on_cpu_lost,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.17/include/linux/sysctl.h
===================================================================
--- linux-2.6.17.orig/include/linux/sysctl.h
+++ linux-2.6.17/include/linux/sysctl.h
@@ -148,6 +148,7 @@ enum
 	KERN_SPIN_RETRY=70,	/* int: number of spinlock retries */
 	KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
 	KERN_IA64_UNALIGNED=72, /* int: ia64 unaligned userland trap enable */
+	KERN_STOP_ON_CPU_LOST=73, /* int: SIGSTOP when a task loses its cpus */
 };
 
 


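For illustration, here is a minimal userspace sketch of how an admin tool
might flip the proposed knob. It assumes the patch above is applied, so that
the kern_table entry shows up as /proc/sys/kernel/sigstop_on_cpu_lost. The
helper names write_int_knob() and read_int_knob() are hypothetical and are
not part of the patch.

```c
#include <stdio.h>

/* Assumed path: this file exists only on a kernel carrying the patch. */
#define SIGSTOP_ON_CPU_LOST_PATH "/proc/sys/kernel/sigstop_on_cpu_lost"

/* Write an integer to a sysctl-style file. Returns 0 on success, -1 on error. */
static int write_int_knob(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read an integer back from the same kind of file. Returns 0 on success. */
static int read_int_knob(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

With the patch applied, calling write_int_knob(SIGSTOP_ON_CPU_LOST_PATH, 1)
as root would be the equivalent of echoing 1 into the file. The mode in the
patch is 0644, so unprivileged users could only read the value.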
^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-20  3:51 [PATCH] stop on cpu lost KAMEZAWA Hiroyuki
@ 2006-06-22  5:56 ` Andrew Morton
  2006-06-22  6:14   ` Christoph Lameter
  2006-06-22 15:08   ` Nathan Lynch
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2006-06-22  5:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, ashok.raj, pavel, clameter, ak, nickpiggin, mingo

On Tue, 20 Jun 2006 12:51:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> When an application is misconfigured at cpu hot removal, a task's
> cpus_allowed can become empty. This patch adds a sysctl to stop tasks whose
> cpus_allowed is empty.
> 
> I don't think there is one good answer to this problem; it depends on
> system management policy. On one system, forced migration is better than
> stopping. On another, stopping (and killing) tasks will meet the requirement.
> 
> How about this?
> 
> -Kame
> 
> Currently, when a task loses all of its allowed cpus because of cpu hot removal,
> it is forced to migrate to a cpu it is not allowed on.
> 
> In this case, the task was not properly reconfigured by the user before
> cpu hot removal, so the task (and the system) is in an unexpected, wrong state.
> This migration is perhaps one realistic workaround, but sometimes it is
> harmful (stealing other tasks' cpu time, triggering bugs in thread controllers,
> unexpected execution...).
> 
> This patch adds the sysctl "sigstop_on_cpu_lost". When sigstop_on_cpu_lost==1,
> a task which loses its cpus will be stopped by SIGSTOP.
> Depending on system management policy, misconfigured applications are stopped.
> 
> 

Well that's a pretty unpleasant patch, isn't it?

But I guess it's policy, and if we cannot think of anything better then we'll
have to do it this way :(

> 
> 
>  include/linux/sysctl.h |    1 +
>  kernel/sched.c         |   14 ++++++++++++++
>  kernel/sysctl.c        |   14 ++++++++++++++

An update to Documentation/cpu-hotplug.txt would seem appropriate, please, and a
line in Documentation/sysctl/kernel.txt which refers to it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22  5:56 ` Andrew Morton
@ 2006-06-22  6:14   ` Christoph Lameter
  2006-06-22 15:08   ` Nathan Lynch
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2006-06-22  6:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, linux-kernel, ashok.raj, pavel, ak, nickpiggin,
	mingo

On Wed, 21 Jun 2006, Andrew Morton wrote:

> > Currently, when a task loses all of its allowed cpus because of cpu hot removal,
> > it is forced to migrate to a cpu it is not allowed on.
> > 
> > In this case, the task was not properly reconfigured by the user before
> > cpu hot removal, so the task (and the system) is in an unexpected, wrong state.
> > This migration is perhaps one realistic workaround, but sometimes it is
> > harmful (stealing other tasks' cpu time, triggering bugs in thread controllers,
> > unexpected execution...).
> > 
> > This patch adds the sysctl "sigstop_on_cpu_lost". When sigstop_on_cpu_lost==1,
> > a task which loses its cpus will be stopped by SIGSTOP.
> > Depending on system management policy, misconfigured applications are stopped.
> > 
> 
> Well that's a pretty unpleasant patch, isn't it?

The cleanest solution is to terminate the process. If the user has 
configured the process to only run on a certain cpu and the processor then 
becomes unavailable then the process needs to terminate by default since 
it has no resource left to run. This is similar to an Out of Memory 
condition.

We can add this sigstop_on_cpu_lost as an additional measure but it should 
be off by default. So far we have never had the system stop a process 
because resources are not available. This would be unexpected system 
behavior.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22  5:56 ` Andrew Morton
  2006-06-22  6:14   ` Christoph Lameter
@ 2006-06-22 15:08   ` Nathan Lynch
  2006-06-22 15:45     ` Randy.Dunlap
  1 sibling, 1 reply; 22+ messages in thread
From: Nathan Lynch @ 2006-06-22 15:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, linux-kernel, ashok.raj, pavel, clameter, ak,
	nickpiggin, mingo

Andrew Morton wrote:
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > Currently, when a task loses all of its allowed cpus because of cpu hot removal,
> > it is forced to migrate to a cpu it is not allowed on.
> > 
> > In this case, the task was not properly reconfigured by the user before
> > cpu hot removal, so the task (and the system) is in an unexpected, wrong state.
> > This migration is perhaps one realistic workaround, but sometimes it is
> > harmful (stealing other tasks' cpu time, triggering bugs in thread controllers,
> > unexpected execution...).
> > 
> > This patch adds the sysctl "sigstop_on_cpu_lost". When sigstop_on_cpu_lost==1,
> > a task which loses its cpus will be stopped by SIGSTOP.
> > Depending on system management policy, misconfigured applications are stopped.
> > 
> 
> Well that's a pretty unpleasant patch, isn't it?
> 
> But I guess it's policy, and if we cannot think of anything better then we'll
> have to do it this way :(

I tend to favor not changing the kernel to handle this case.  We're
already making a best effort attempt to handle conflicting directives
from the admin.  This is a policy that can be implemented in userspace
without much trouble.

If we really want to keep the admin from shooting himself in the foot,
wouldn't it be preferable to fail the offline operation if there are
user tasks exclusively bound to the cpu?

While we're on the subject, what if there are interrupts bound to the
cpu you want to offline?  Should we consider handling that case
differently as well?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 15:08   ` Nathan Lynch
@ 2006-06-22 15:45     ` Randy.Dunlap
  2006-06-22 15:45       ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Randy.Dunlap @ 2006-06-22 15:45 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: akpm, kamezawa.hiroyu, linux-kernel, ashok.raj, pavel, clameter,
	ak, nickpiggin, mingo

On Thu, 22 Jun 2006 10:08:48 -0500 Nathan Lynch wrote:

> Andrew Morton wrote:
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > Currently, when a task loses all of its allowed cpus because of cpu hot removal,
> > > it is forced to migrate to a cpu it is not allowed on.
> > > 
> > > In this case, the task was not properly reconfigured by the user before
> > > cpu hot removal, so the task (and the system) is in an unexpected, wrong state.
> > > This migration is perhaps one realistic workaround, but sometimes it is
> > > harmful (stealing other tasks' cpu time, triggering bugs in thread controllers,
> > > unexpected execution...).
> > > 
> > > This patch adds the sysctl "sigstop_on_cpu_lost". When sigstop_on_cpu_lost==1,
> > > a task which loses its cpus will be stopped by SIGSTOP.
> > > Depending on system management policy, misconfigured applications are stopped.
> > > 
> > 
> > Well that's a pretty unpleasant patch, isn't it?
> > 
> > But I guess it's policy, and if we cannot think of anything better then we'll
> > have to do it this way :(
> 
> I tend to favor not changing the kernel to handle this case.  We're
> already making a best effort attempt to handle conflicting directives
> from the admin.  This is a policy that can be implemented in userspace
> without much trouble.
> 
> If we really want to keep the admin from shooting himself in the foot,
> wouldn't it be preferable to fail the offline operation if there are
> user tasks exclusively bound to the cpu?

Sounds much better than just killing the process.

> While we're on the subject, what if there are interrupts bound to the
> cpu you want to offline?  Should we consider handling that case
> differently as well?


---
~Randy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 15:45     ` Randy.Dunlap
@ 2006-06-22 15:45       ` Christoph Lameter
  2006-06-22 16:05         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2006-06-22 15:45 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: Nathan Lynch, akpm, kamezawa.hiroyu, linux-kernel, ashok.raj,
	pavel, ak, nickpiggin, mingo

On Thu, 22 Jun 2006, Randy.Dunlap wrote:

> Sounds much better than just killing the process.

Right and having active interrupts or devices using that processor should 
also stop offlining a processor.

So just remove everything from a processor before offlining. If you cannot 
remove all users then the processor cannot be offlined.
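As a rough illustration of the interrupt side of this check, the sketch
below tests from userspace whether an IRQ is bound to a single cpu. It
assumes the usual /proc/irq/<n>/smp_affinity text format (a hex bitmask,
optionally split into comma-separated 32-bit groups, most significant group
first). The function name irq_bound_only_to() is hypothetical, and this is
not a kernel-side implementation of the proposal.

```c
#include <ctype.h>
#include <string.h>

/*
 * Return 1 if `cpu` is the only cpu set in the hex mask string,
 * 0 if the mask names any other cpu (or none), -1 if it is malformed.
 */
static int irq_bound_only_to(const char *mask, int cpu)
{
	size_t len = strlen(mask);
	int bit = 0;	/* bit index of the lowest bit of the current digit */
	int seen_cpu = 0, seen_other = 0, digits = 0;

	/* Walk from the last character: that is the least significant digit. */
	while (len > 0) {
		unsigned char c = (unsigned char)mask[--len];
		int v, i;

		if (c == ',' || isspace(c))
			continue;
		if (!isxdigit(c))
			return -1;
		v = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
		for (i = 0; i < 4; i++) {
			if (!(v & (1 << i)))
				continue;
			if (bit + i == cpu)
				seen_cpu = 1;
			else
				seen_other = 1;
		}
		bit += 4;
		digits++;
	}
	if (digits == 0)
		return -1;
	return seen_cpu && !seen_other;
}
```

A pre-offline script could read each smp_affinity file, pass its contents to
this function, and refuse (or re-route the IRQ) when the result is 1.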

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 15:45       ` Christoph Lameter
@ 2006-06-22 16:05         ` KAMEZAWA Hiroyuki
  2006-06-22 16:14           ` Christoph Lameter
  2006-06-22 16:24           ` Randy.Dunlap
  0 siblings, 2 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-06-22 16:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: rdunlap, ntl, akpm, linux-kernel, ashok.raj, pavel, ak,
	nickpiggin, mingo

On Thu, 22 Jun 2006 08:45:55 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 22 Jun 2006, Randy.Dunlap wrote:
> 
> > Sounds much better than just killing the process.
> 
> Right and having active interrupts or devices using that processor should 
> also stop offlining a processor.
> 
> So just remove everything from a processor before offlining. If you cannot 
> remove all users then the processor cannot be offlined.
> 
Hm..
Then, there are several ways to manage this situation.

1. migrate all even if it's not allowed by users
2. kill mis-configured tasks.
3. stop ...
4. cancel cpu-hot-removal.

I just don't like "1". 
I discussed this problem with my colleagues before sending patch,
one said "4" seems regular way but another said "4" is bad.

I sent a patch for "4" in the first place but Andi Kleen said it's bad.
As he said, I'm handling the problem for which I can't find a good answer :(

my point is that "1" is bad.
-Kame
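The four options listed above can be modelled in a few lines. This is only
an illustrative userspace sketch, not kernel code: the enum names and the
decide() function are hypothetical, and cpu masks are simplified to a single
unsigned long, so dead_cpu must be smaller than the number of bits in that
type.

```c
/* The four policies discussed in this thread, numbered as in the list. */
enum lost_cpu_policy {
	POLICY_FORCE_MIGRATE = 1,	/* 1. migrate even if not allowed */
	POLICY_KILL,			/* 2. kill misconfigured tasks */
	POLICY_STOP,			/* 3. stop them (SIGSTOP) */
	POLICY_CANCEL_OFFLINE,		/* 4. cancel cpu hot removal */
};

enum lost_cpu_action {
	ACTION_MIGRATE_ALLOWED,		/* an allowed cpu is still online */
	ACTION_MIGRATE_FORCED,
	ACTION_KILL,
	ACTION_STOP,
	ACTION_REFUSE_OFFLINE,
};

/*
 * Decide what happens to a task when dead_cpu goes away.
 * `allowed` and `online` are bitmasks with bit N standing for cpu N.
 */
static enum lost_cpu_action decide(unsigned long allowed, unsigned long online,
				   int dead_cpu, enum lost_cpu_policy policy)
{
	unsigned long remaining = allowed & online & ~(1UL << dead_cpu);

	/* The policy only matters once no allowed cpu is left. */
	if (remaining)
		return ACTION_MIGRATE_ALLOWED;

	switch (policy) {
	case POLICY_KILL:
		return ACTION_KILL;
	case POLICY_STOP:
		return ACTION_STOP;
	case POLICY_CANCEL_OFFLINE:
		return ACTION_REFUSE_OFFLINE;
	case POLICY_FORCE_MIGRATE:
	default:
		return ACTION_MIGRATE_FORCED;
	}
}
```

For example, a task allowed only on cpu 2 (mask 0x4) on a four-cpu system
(online mask 0xf) gets ACTION_STOP under option 3 when cpu 2 is removed,
while a task allowed on cpus 1 and 2 (mask 0x6) is simply migrated to cpu 1
under every policy.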


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 16:05         ` KAMEZAWA Hiroyuki
@ 2006-06-22 16:14           ` Christoph Lameter
  2006-06-22 16:24           ` Randy.Dunlap
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2006-06-22 16:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: rdunlap, ntl, akpm, linux-kernel, ashok.raj, pavel, ak,
	nickpiggin, mingo

On Fri, 23 Jun 2006, KAMEZAWA Hiroyuki wrote:

> Hm..
> Then, there are several ways to manage this situation.
> 
> 1. migrate all even if it's not allowed by users

If it's not allowed then the system should not do this. Otherwise we get an 
inconsistent system with lots of exceptions just because the user can
do something stupid.

> 2. kill mis-configured tasks.

If the user misconfigured then it's their problem.

> 3. stop ...

That won't work well since the process may ignore stops. We have no history 
of stopping processes. This would be new functionality to pioneer in 
Linux.

> 4. cancel cpu-hot-removal.
> 
> I just don't like "1". 
> I discussed this problem with my colleagues before sending patch,
> one said "4" seems regular way but another said "4" is bad.

4 is a good thing. Just give the user some feedback as to why. F.e. write 
a message to the syslog. This is the way we deal with many other 
problem situations.

> I sent a patch for "4" in the first place but Andi Kleen said it's bad.
> As he said, I'm handling the problem for which I can't find a good answer :(

Andi: Why is 4 bad?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 16:05         ` KAMEZAWA Hiroyuki
  2006-06-22 16:14           ` Christoph Lameter
@ 2006-06-22 16:24           ` Randy.Dunlap
  2006-06-22 17:04             ` Nathan Lynch
  2006-06-22 18:22             ` Pavel Machek
  1 sibling, 2 replies; 22+ messages in thread
From: Randy.Dunlap @ 2006-06-22 16:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: clameter, ntl, akpm, linux-kernel, ashok.raj, pavel, ak,
	nickpiggin, mingo

On Fri, 23 Jun 2006 01:05:50 +0900 KAMEZAWA Hiroyuki wrote:

> On Thu, 22 Jun 2006 08:45:55 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Thu, 22 Jun 2006, Randy.Dunlap wrote:
> > 
> > > Sounds much better than just killing the process.
> > 
> > Right and having active interrupts or devices using that processor should 
> > also stop offlining a processor.
> > 
> > So just remove everything from a processor before offlining. If you cannot 
> > remove all users then the processor cannot be offlined.
> > 
> Hm..
> Then, there are several ways to manage this situation.
> 
> 1. migrate all even if it's not allowed by users
> 2. kill mis-configured tasks.

I would claim that the tasks are not misconfigured,
but that the admin misconfigured the hardware (CPU).

> 3. stop ...
> 4. cancel cpu-hot-removal.
> 
> I just don't like "1". 

I like it better than 2.

> I discussed this problem with my colleagues before sending patch,
> one said "4" seems regular way but another said "4" is bad.
> 
> I sent a patch for "4" in the first place but Andi Kleen said it's bad.
> As he said, I'm handling the problem for which I can't find a good answer :(
> 
> my point is that "1" is bad.

Sounds like we are getting nowhere.  The sysctl knob might
have to be the answer.

---
~Randy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 16:24           ` Randy.Dunlap
@ 2006-06-22 17:04             ` Nathan Lynch
  2006-06-22 17:20               ` KAMEZAWA Hiroyuki
  2006-06-22 18:22             ` Pavel Machek
  1 sibling, 1 reply; 22+ messages in thread
From: Nathan Lynch @ 2006-06-22 17:04 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: KAMEZAWA Hiroyuki, clameter, akpm, linux-kernel, ashok.raj, pavel,
	ak, nickpiggin, mingo

Randy.Dunlap wrote:
> On Fri, 23 Jun 2006 01:05:50 +0900 KAMEZAWA Hiroyuki wrote:
> 
> > On Thu, 22 Jun 2006 08:45:55 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> > 
> > > On Thu, 22 Jun 2006, Randy.Dunlap wrote:
> > > 
> > > > Sounds much better than just killing the process.
> > > 
> > > Right and having active interrupts or devices using that processor should 
> > > also stop offlining a processor.
> > > 
> > > So just remove everything from a processor before offlining. If you cannot 
> > > remove all users then the processor cannot be offlined.
> > > 
> > Hm..
> > Then, there are several ways to manage this situation.
> > 
> > 1. migrate all even if it's not allowed by users
> > 2. kill mis-configured tasks.
> 
> I would claim that the tasks are not misconfigured,
> but that the admin misconfigured the hardware (CPU).
> 
> > 3. stop ...
> > 4. cancel cpu-hot-removal.
> > 
> > I just don't like "1". 
> 
> I like it better than 2.
> 
> > I discussed this problem with my colleagues before sending patch,
> > one said "4" seems regular way but another said "4" is bad.
> > 
> > I sent a patch for "4" in the first place but Andi Kleen said it's bad.
> > As he said, I'm handling the problem for which I can't find a good answer :(
> > 
> > my point is that "1" is bad.
> 
> Sounds like we are getting nowhere.  The sysctl knob might
> have to be the answer.

I don't like having the kernel forcibly kill or stop tasks for this
case, regardless of whether the behavior is configurable.  What I
originally meant to suggest was a sysctl knob which will control
whether the offline will fail in this situation.  But I'm still more
inclined to leave the kernel's handling of this as it stands, since
this is policy that can be implemented in userspace.

We need to preserve the current behavior as the default, in any case.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 17:04             ` Nathan Lynch
@ 2006-06-22 17:20               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-06-22 17:20 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: rdunlap, clameter, akpm, linux-kernel, ashok.raj, pavel, ak,
	nickpiggin, mingo

On Thu, 22 Jun 2006 12:04:31 -0500
Nathan Lynch <ntl@pobox.com> wrote:

> Randy.Dunlap wrote:
> > Sounds like we are getting nowhere.  The sysctl knob might
> > have to be the answer.
> 
> I don't like having the kernel forcibly kill or stop tasks for this
> case, regardless of whether the behavior is configurable.  What I
> originally meant to suggest was a sysctl knob which will control
> whether the offline will fail in this situation. 
Okay, the stop_on_cpu_lost patch is not good anyway.
Andrew, could you drop the stop_on_cpu_lost patch?

From this discussion, it seems there is a direction. 
I'll update my avoid_cpu_removal_if_busy patch and add sysctl for it.

> But I'm still more inclined to leave the kernel's handling of this as it
> stands, since this is policy that can be implemented in userspace.
> 
A program that walks through all tasks and checks their cpus_allowed?

> We need to preserve the current behavior as the default, in any case.
> 
I agree here.

-Kame
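A minimal sketch of such a userspace checker follows, under stated
assumptions. It relies on the "Cpus_allowed_list:" line in /proc/<pid>/status,
which newer kernels provide (it is not assumed to exist in the 2.6.17 tree
this patch is against). The function names cpulist_is_only() and
count_tasks_bound_only_to() are hypothetical. It looks only at thread group
leaders, not at every thread under /proc/<pid>/task, and it will also report
per-cpu kernel threads.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return 1 if a cpu list such as "3" or "3-3" names only `cpu`,
 * 0 if it names any other cpu (e.g. "0-3" or "3,5"), -1 if malformed.
 */
static int cpulist_is_only(const char *list, int cpu)
{
	const char *p = list;
	int ranges = 0;

	while (*p) {
		char *end;
		long lo, hi;

		while (isspace((unsigned char)*p))
			p++;
		if (!*p)
			break;
		if (!isdigit((unsigned char)*p))
			return -1;
		lo = hi = strtol(p, &end, 10);
		if (*end == '-') {
			p = end + 1;
			if (!isdigit((unsigned char)*p))
				return -1;
			hi = strtol(p, &end, 10);
		}
		if (hi < lo)
			return -1;
		if (lo != cpu || hi != cpu)
			return 0;
		ranges++;
		p = end;
		if (*p == ',')
			p++;
	}
	return ranges > 0 ? 1 : -1;
}

/*
 * Print and count processes whose allowed list names only `cpu`.
 * Returns the count, or -1 if /proc cannot be opened.
 */
int count_tasks_bound_only_to(int cpu)
{
	static const char key[] = "Cpus_allowed_list:";
	DIR *proc = opendir("/proc");
	struct dirent *de;
	int count = 0;

	if (!proc)
		return -1;
	while ((de = readdir(proc)) != NULL) {
		char path[300], line[256];
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;	/* not a pid directory */
		snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* task exited in the meantime */
		while (fgets(line, sizeof(line), f)) {
			if (strncmp(line, key, sizeof(key) - 1) != 0)
				continue;
			if (cpulist_is_only(line + sizeof(key) - 1, cpu) == 1) {
				printf("pid %s is bound only to cpu%d\n",
				       de->d_name, cpu);
				count++;
			}
			break;
		}
		fclose(f);
	}
	closedir(proc);
	return count;
}
```

An admin script could call count_tasks_bound_only_to(N) before writing 0 to
the cpu's online file and decide for itself whether to re-bind, stop, or
kill the reported tasks, or to skip the offline altogether. The check is
racy by nature, since tasks can change affinity between the scan and the
offline.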


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 16:24           ` Randy.Dunlap
  2006-06-22 17:04             ` Nathan Lynch
@ 2006-06-22 18:22             ` Pavel Machek
  2006-06-22 18:35               ` Christoph Lameter
                                 ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Pavel Machek @ 2006-06-22 18:22 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: KAMEZAWA Hiroyuki, clameter, ntl, akpm, linux-kernel, ashok.raj,
	ak, nickpiggin, mingo

Hi!

> > Hm..
> > Then, there are several ways to manage this situation.
> > 
> > 1. migrate all even if it's not allowed by users

That's what I'd prefer... as swsusp uses cpu hotplug. All the other
options are bad... admin will probably not realize suspend involves
cpu unplugs..

> > 2. kill mis-configured tasks.
> > 3. stop ...
> > 4. cancel cpu-hot-removal.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 18:22             ` Pavel Machek
@ 2006-06-22 18:35               ` Christoph Lameter
  2006-06-22 18:37                 ` Pavel Machek
  2006-06-22 18:54               ` Hugh Dickins
  2006-06-22 19:52               ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2006-06-22 18:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Randy.Dunlap, KAMEZAWA Hiroyuki, ntl, akpm, linux-kernel,
	ashok.raj, ak, nickpiggin, mingo

On Thu, 22 Jun 2006, Pavel Machek wrote:

> > > Hm..
> > > Then, there are several ways to manage this situation.
> > > 
> > > 1. migrate all even if it's not allowed by users
> 
> That's what I'd prefer... as swsusp uses cpu hotplug. All the other
> options are bad... admin will probably not realize suspend involves
> cpu unplugs..

You probably first suspend a process? If a process was suspended by 
swsusp then we can just ignore the restriction because it will be 
returned later.

The admin wants the system to behave in a consistent way. If he suddenly 
finds a process running on a cpu that was forbidden then that is weird 
and surprising to say the least and may go undetected for a long time.
If the process gets killed when he disables the cpu then he will have to 
fix up his cpu restrictions.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 18:35               ` Christoph Lameter
@ 2006-06-22 18:37                 ` Pavel Machek
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2006-06-22 18:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Randy.Dunlap, KAMEZAWA Hiroyuki, ntl, akpm, linux-kernel,
	ashok.raj, ak, nickpiggin, mingo

Hi!

> > > > Hm..
> > > > Then, there are several ways to manage this situation.
> > > > 
> > > > 1. migrate all even if it's not allowed by users
> > 
> > That's what I'd prefer... as swsusp uses cpu hotplug. All the other
> > options are bad... admin will probably not realize suspend involves
> > cpu unplugs..
> 
> You probably first suspend a process? If a process was suspended by 
> swsusp then we can just ignore the restriction because it will be 
> returned later.

Yes, I stop processes, first.

> The admin wants the system to behave in a consistent way. If he suddenly 
> finds a process running on a cpu that was forbidden then that is weird 
> and surprising to say the least and may go undetected for a long time.
> If the process gets killed when he disables the cpu then he will have to 
> fix up his cpu restrictions.

Would keeping the current behaviour, plus adding a _loud_ printk, not be
enough?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 18:22             ` Pavel Machek
  2006-06-22 18:35               ` Christoph Lameter
@ 2006-06-22 18:54               ` Hugh Dickins
  2006-06-22 19:27                 ` Nick Piggin
  2006-06-22 19:52               ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2006-06-22 18:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl, akpm,
	linux-kernel, ashok.raj, ak, nickpiggin, mingo

On Thu, 22 Jun 2006, Pavel Machek wrote:
> 
> > > Hm..
> > > Then, there are several ways to manage this situation.
> > > 
> > > 1. migrate all even if it's not allowed by users
> 
> That's what I'd prefer... as swsusp uses cpu hotplug. All the other
> options are bad... admin will probably not realize suspend involves
> cpu unplugs..
> 
> > > 2. kill mis-configured tasks.
> > > 3. stop ...
> > > 4. cancel cpu-hot-removal.

I'm very reluctant to expose my ignorance by joining this thread;
but what I'd naively expect would, I think, suit swsusp also -
you don't really want tasks to be migrated when resuming?

I'd expect tasks bound to the unplugged cpu simply not to be run
until "that" cpu is plugged back in.

With proviso that it should be possible to "kill -9" such a task
i.e. it be allowed to run in kernel on a wrong cpu just to exit.

Presumably this is difficult, because unplugging a cpu will also
remove infrastructure which would, for example, allow "ps" to show
such tasks.  Perhaps such infrastructure should remain so long as
there are tasks there.

Ignore me if I'm talking nonsense.

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 18:54               ` Hugh Dickins
@ 2006-06-22 19:27                 ` Nick Piggin
  2006-06-22 19:46                   ` Hugh Dickins
  2006-06-22 21:44                   ` Pavel Machek
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2006-06-22 19:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pavel Machek, Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl,
	akpm, linux-kernel, ashok.raj, ak, mingo

Hugh Dickins wrote:
> On Thu, 22 Jun 2006, Pavel Machek wrote:
> 
>>>>Hm..
>>>>Then, there are several ways to manage this situation.
>>>>
>>>>1. migrate all even if it's not allowed by users
>>
>>That's what I'd prefer... as swsusp uses cpu hotplug. All the other
>>options are bad... admin will probably not realize suspend involves
>>cpu unplugs..
>>
>>
>>>>2. kill mis-configured tasks.
>>>>3. stop ...
>>>>4. cancel cpu-hot-removal.
> 
> 
> I'm very reluctant to expose my ignorance by joining this thread;
> but what I'd naively expect would, I think, suit swsusp also -
> you don't really want tasks to be migrated when resuming?

No. And the problem with force migrating things is that we lose the
cpus_allowed mask that has been carefully configured.

For this reason, I'm also in favour of #4, however we would need
a solution for swsusp.

> 
> I'd expect tasks bound to the unplugged cpu simply not to be run
> until "that" cpu is plugged back in.

Yes, I don't see why swsusp tasks would need to be migrated and
run. OTOH, this would require more swsusp special casing, but
apparently that's encouraged ;)

> 
> With proviso that it should be possible to "kill -9" such a task
> i.e. it be allowed to run in kernel on a wrong cpu just to exit.
> 
> Presumably this is difficult, because unplugging a cpu will also
> remove infrastructure which would, for example, allow "ps" to show
> such tasks.  Perhaps such infrastructure should remain so long as
> there are tasks there.

They'll be in the global tasklist, so there should be no reason why
they couldn't be migrated over to an online CPU with taskset. Shouldn't
require any rewrites, IIRC.

But after swsusp comes back up, it will be bringing up the same number
of CPUs as went down, won't it? So you shouldn't get into that
situation where you'd need to kill stuff, should you?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] stop on cpu lost
  2006-06-22 19:27                 ` Nick Piggin
@ 2006-06-22 19:46                   ` Hugh Dickins
  2006-06-22 19:57                     ` Nick Piggin
  2006-06-22 21:44                   ` Pavel Machek
  1 sibling, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2006-06-22 19:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pavel Machek, Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl,
	akpm, linux-kernel, ashok.raj, ak, mingo

On Fri, 23 Jun 2006, Nick Piggin wrote:
> Hugh Dickins wrote:
> > 
> > I'd expect tasks bound to the unplugged cpu simply not to be run
> > until "that" cpu is plugged back in.
> 
> Yes, I don't see why swsusp tasks would need to be migrated and
> run. OTOH, this would require more swsusp special casing, but
> apparently that's encouraged ;)

No, I wasn't meaning any swsusp special casing at all.

I was just using Pavel's swsusp-related mail as the hook to raise
the point that had been haunting me with every earlier mail on
this subject, mails I'd already deleted.

Pavel seemed to imply overriding the requested affinity for tasks
(in preferring #1 migration), I doubted he really wanted that.

> > With proviso that it should be possible to "kill -9" such a task
> > i.e. it be allowed to run in kernel on a wrong cpu just to exit.
> > 
> > Presumably this is difficult, because unplugging a cpu will also
> > remove infrastructure which would, for example, allow "ps" to show
> > such tasks.  Perhaps such infrastructure should remain so long as
> > there are tasks there.
> 
> They'll be in the global tasklist, so there should be no reason why
> they couldn't be migrated over to an online CPU with taskset. Shouldn't
> require any rewrites, IIRC.

I was afraid that "for_each_online_cpu"-type scans would skip over
the unplugged cpus, in such a way that the homeless tasks might be
awkwardly invisible in some contexts.  If no such problem, fine.

> But after swsusp comes back up, it will be bringing up the same number
> of CPUs as went down, won't it? So you shouldn't get into that
> situation where you'd need to kill stuff, should you?

I wasn't meaning "kill -9" for the swsusp case, but for the general
unplug cpu case.  We have a number of homeless tasks, which the admin
might want to run again when "the" cpu is plugged back in; or might
want to kill off without having to plug a cpu back in.

Hugh

* Re: [PATCH] stop on cpu lost
  2006-06-22 19:46                   ` Hugh Dickins
@ 2006-06-22 19:57                     ` Nick Piggin
  2006-06-22 20:25                       ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2006-06-22 19:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pavel Machek, Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl,
	akpm, linux-kernel, ashok.raj, ak, mingo

Hugh Dickins wrote:
> On Fri, 23 Jun 2006, Nick Piggin wrote:
> 
>>Hugh Dickins wrote:
>>
>>>I'd expect tasks bound to the unplugged cpu simply not to be run
>>>until "that" cpu is plugged back in.
>>
>>Yes, I don't see why swsusp tasks would need to be migrated and
>>run. OTOH, this would require more swsusp special casing, but
>>apparently that's encouraged ;)
> 
> 
> No, I wasn't meaning any swsusp special casing at all.
> 
> I was just using Pavel's swsusp-related mail as the hook to raise
> the point that had been haunting me with every earlier mail on
> this subject, mails I'd already deleted.
> 
> Pavel seemed to imply overriding the requested affinity for tasks
> (in preferring #1 migration); I doubted he really wanted that.

No, but it is currently the only way to do it.

What I had thought you meant was to disallow cpu unplugging,
except with the special case to allow it from swsusp when
suspending the system.

> 
> 
>>>With proviso that it should be possible to "kill -9" such a task
>>>i.e. it be allowed to run in kernel on a wrong cpu just to exit.
>>>
>>>Presumably this is difficult, because unplugging a cpu will also
>>>remove infrastructure which would, for example, allow "ps" to show
>>>such tasks.  Perhaps such infrastructure should remain so long as
>>>there are tasks there.
>>
>>They'll be in the global tasklist, so there should be no reason why
>>they couldn't be migrated over to an online CPU with taskset. Shouldn't
>>require any rewrites, IIRC.
> 
> 
> I was afraid that "for_each_online_cpu"-type scans would skip over
> the unplugged cpus, in such a way that the homeless tasks might be
> awkwardly invisible in some contexts.  If no such problem, fine.

The management stuff tends to go via the pid hashes or the global
tasklist rather than the runqueues. But you might be right that
there would be some corner cases.
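
A minimal user-space sketch of the concern discussed here, not kernel code: a scan that walks only online CPUs misses tasks parked on an offline CPU, while a walk of the global task list still sees them. The names toy_task, count_via_online_cpus and count_via_tasklist are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical, simplified task record; not the kernel's task_struct. */
struct toy_task {
    int pid;
    int cpu;    /* CPU whose runqueue the task sits on */
};

/* Count tasks by visiting only the runqueues of online CPUs. */
static size_t count_via_online_cpus(const struct toy_task *tasks, size_t n,
                                    const bool online[NR_CPUS])
{
    size_t seen = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!online[cpu])
            continue;   /* models a for_each_online_cpu()-style skip */
        for (size_t i = 0; i < n; i++)
            if (tasks[i].cpu == cpu)
                seen++;
    }
    return seen;
}

/* Count tasks by walking the global list, whatever the CPU state is. */
static size_t count_via_tasklist(const struct toy_task *tasks, size_t n)
{
    size_t seen = 0;

    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].cpu >= 0 && tasks[i].cpu < NR_CPUS);
        seen++;
    }
    return seen;
}
```

With CPU 3 offline and two tasks still assigned to it, the first function reports two tasks fewer than the second; that gap is the "awkwardly invisible" case.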

> 
> 
>>But after swsusp comes back up, it will be bringing up the same number
>>of CPUs as went down, won't it? So you shouldn't get into that
>>situation where you'd need to kill stuff, should you?
> 
> 
> I wasn't meaning "kill -9" for the swsusp case, but for the general
> unplug cpu case.  We have a number of homeless tasks, which the admin
> might want to run again when "the" cpu is plugged back in; or might
> want to kill off without having to plug a cpu back in.

Possible maybe... I presumed that would lead to a nightmare of
resource deadlocks (think mutexes). I'd hoped it could still
be useful for the swsusp case where everything gets turned off
at once, though. But I could be wrong...
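
The deadlock worry can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: lock_task, NO_LOCK and parked_holder_blocking are made-up names, each task holds or waits for at most one lock, and "parked" means the task's only allowed CPU is offline so it never runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_LOCK (-1)

/* Hypothetical task model: at most one lock held and one lock awaited. */
struct lock_task {
    int pid;
    bool parked;      /* true if its only allowed CPU is offline */
    int holds_lock;   /* id of the lock held, or NO_LOCK */
    int wants_lock;   /* id of the lock waited for, or NO_LOCK */
};

/*
 * Return the pid of a parked task that holds the lock tasks[idx] is
 * waiting for, or -1 if no parked task is in the way.
 */
static int parked_holder_blocking(const struct lock_task *tasks, size_t n,
                                  size_t idx)
{
    assert(idx < n);

    int wanted = tasks[idx].wants_lock;
    if (wanted == NO_LOCK)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (i != idx && tasks[i].parked && tasks[i].holds_lock == wanted)
            return tasks[i].pid;
    }
    return -1;
}
```

A runnable task for which this returns a pid stays blocked until the parked holder's CPU comes back, since the holder cannot run to release the lock.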

-- 
SUSE Labs, Novell Inc.

* Re: [PATCH] stop on cpu lost
  2006-06-22 19:57                     ` Nick Piggin
@ 2006-06-22 20:25                       ` Hugh Dickins
  0 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2006-06-22 20:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pavel Machek, Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl,
	akpm, linux-kernel, ashok.raj, ak, mingo

On Fri, 23 Jun 2006, Nick Piggin wrote:
> > >Hugh Dickins wrote:
> > >
> > > >I'd expect tasks bound to the unplugged cpu simply not to be run
> > > >until "that" cpu is plugged back in.
> > 
> > I wasn't meaning "kill -9" for the swsusp case, but for the general
> > unplug cpu case.  We have a number of homeless tasks, which the admin
> > might want to run again when "the" cpu is plugged back in; or might
> > want to kill off without having to plug a cpu back in.
> 
> Possible maybe... I presumed that would lead to a nightmare of
> resource deadlocks (think mutexes).

Yes, that's what I've naively overlooked - thanks.

But _maybe_ there's still a case for allowing such tasks to run
_in_kernel_ on a wrong cpu, to release resources, and to be killed.
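
One possible reading of this suggestion, written as a user-space policy check rather than scheduler code. Everything here is an assumption made for illustration: the homeless_task fields, the may_run_on helper, and the choice to treat "in kernel" and "fatal signal pending" as the two exceptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a task whose allowed CPUs may all be offline. */
struct homeless_task {
    unsigned long cpus_allowed;   /* bit n set: task may run on CPU n */
    bool fatal_signal_pending;    /* e.g. a SIGKILL has been queued */
    bool in_kernel;               /* task is executing kernel code */
};

/* Decide whether the task may be run on the given CPU. */
static bool may_run_on(const struct homeless_task *t, int cpu,
                       unsigned long online_mask)
{
    assert(cpu >= 0 && cpu < (int)(8 * sizeof(unsigned long)));

    unsigned long bit = 1UL << cpu;

    if (!(online_mask & bit))
        return false;             /* never run on an offline CPU */
    if (t->cpus_allowed & bit)
        return true;              /* normal case: affinity permits it */

    /*
     * Exception under discussion: no allowed CPU is online, so let the
     * task run on a "wrong" CPU only to finish kernel work (releasing
     * resources) or to die.
     */
    if ((t->cpus_allowed & online_mask) == 0)
        return t->in_kernel || t->fatal_signal_pending;

    return false;
}
```

A task bound to offline CPU 3 with neither flag set stays parked, while the same task with a pending fatal signal may run on any online CPU long enough to exit.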

Hugh

* Re: [PATCH] stop on cpu lost
  2006-06-22 19:27                 ` Nick Piggin
  2006-06-22 19:46                   ` Hugh Dickins
@ 2006-06-22 21:44                   ` Pavel Machek
  1 sibling, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2006-06-22 21:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl,
	akpm, linux-kernel, ashok.raj, ak, mingo

Hi!

> >remove infrastructure which would, for example, allow "ps" to show
> >such tasks.  Perhaps such infrastructure should remain so long as
> >there are tasks there.
> 
> They'll be in the global tasklist, so there should be no reason why
> they couldn't be migrated over to an online CPU with taskset. Shouldn't
> require any rewrites, IIRC.
> 
> But after swsusp comes back up, it will be bringing up the same number
> of CPUs as went down, won't it? So you shouldn't get into that
> situation where you'd need to kill stuff, should you?

Well... unless something goes *very* wrong, we wake with the same number
of CPUs. I've seen it fail in error cases (went to sleep with two
CPUs, but could not kick the second CPU to life during resume).
								Pavel
-- 
(English) http://www.livejournal.com/~pavelmachek
(Czech, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: [PATCH] stop on cpu lost
  2006-06-22 18:22             ` Pavel Machek
  2006-06-22 18:35               ` Christoph Lameter
  2006-06-22 18:54               ` Hugh Dickins
@ 2006-06-22 19:52               ` Jeremy Fitzhardinge
  2006-06-22 21:46                 ` Pavel Machek
  2 siblings, 1 reply; 22+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-22 19:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl, akpm,
	linux-kernel, ashok.raj, ak, nickpiggin, mingo

Pavel Machek wrote:
> That's what I'd prefer... as swsusp uses cpu hotplug.

Does it have to?  I presume this has been considered before, but what if 
the other CPUs were just idled for suspend rather than "removed"?  Or do 
you actually need to simulate a hot-remove to make sure they get 
suspended properly?  In general, the "hot remove as suspend" thing seems 
semantically awkward.

    J

* Re: [PATCH] stop on cpu lost
  2006-06-22 19:52               ` Jeremy Fitzhardinge
@ 2006-06-22 21:46                 ` Pavel Machek
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2006-06-22 21:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Randy.Dunlap, KAMEZAWA Hiroyuki, clameter, ntl, akpm,
	linux-kernel, ashok.raj, ak, nickpiggin, mingo

On Thu 22-06-06 12:52:32, Jeremy Fitzhardinge wrote:
> Pavel Machek wrote:
> >That's what I'd prefer... as swsusp uses cpu hotplug.
> 
> Does it have to?  I presume this has been considered before, but what if 
> the other CPUs were just idled for suspend rather than "removed"?

Basically yes, it has to. Idling cpus is easy, but bringing cpus back
up during resume is not, and we'd like to reuse cpu hotplug code.
									Pavel
-- 
(English) http://www.livejournal.com/~pavelmachek
(Czech, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

end of thread (~2006-06-22 21:47 UTC)

Thread overview: 22+ messages
2006-06-20  3:51 [PATCH] stop on cpu lost KAMEZAWA Hiroyuki
2006-06-22  5:56 ` Andrew Morton
2006-06-22  6:14   ` Christoph Lameter
2006-06-22 15:08   ` Nathan Lynch
2006-06-22 15:45     ` Randy.Dunlap
2006-06-22 15:45       ` Christoph Lameter
2006-06-22 16:05         ` KAMEZAWA Hiroyuki
2006-06-22 16:14           ` Christoph Lameter
2006-06-22 16:24           ` Randy.Dunlap
2006-06-22 17:04             ` Nathan Lynch
2006-06-22 17:20               ` KAMEZAWA Hiroyuki
2006-06-22 18:22             ` Pavel Machek
2006-06-22 18:35               ` Christoph Lameter
2006-06-22 18:37                 ` Pavel Machek
2006-06-22 18:54               ` Hugh Dickins
2006-06-22 19:27                 ` Nick Piggin
2006-06-22 19:46                   ` Hugh Dickins
2006-06-22 19:57                     ` Nick Piggin
2006-06-22 20:25                       ` Hugh Dickins
2006-06-22 21:44                   ` Pavel Machek
2006-06-22 19:52               ` Jeremy Fitzhardinge
2006-06-22 21:46                 ` Pavel Machek
