* init's children list is long and slows reaping children.
@ 2007-04-05 19:51 Robin Holt
2007-04-05 20:57 ` Linus Torvalds
2007-04-06 22:38 ` Jeff Garzik
0 siblings, 2 replies; 100+ messages in thread
From: Robin Holt @ 2007-04-05 19:51 UTC (permalink / raw)
To: Eric W. Biederman, Ingo Molnar, Linus Torvalds; +Cc: linux-kernel, Jack Steiner
We have been testing a new larger configuration and we are seeing a very
large scan time of init's tsk->children list. In the cases we are seeing,
there are numerous kernel processes created for each cpu (ie: events/0
... events/<big number>, xfslogd/0 ... xfslogd/<big number>). These are
all on the list ahead of the processes we are currently trying to reap.
wait_task_zombie() is taking many seconds to get through the list.
For the case of a modprobe, stop_machine creates one thread per cpu
(remember big number). All are parented to init and their exit will
cause wait_task_zombie to scan multiple times most of the way through
this very long list looking for threads which need to be reaped. As
a reference point, when we tried to mount the xfs root filesystem,
we ran out of pid space and had to recompile a kernel with a larger
default max pids.
For testing, Jack Steiner create the following patch. All it does
is moves tasks which are transitioning to the zombie state from where
they are in the children list to the head of the list. In this way,
they will be the first found and reaping does speed up. We will still
do a full scan of the list once the rearranged tasks are all removed.
This does not seem to be a significant problem.
This does, however, modify the order of reaping of children. Is there a
guarantee of the order for reaping children which needs to be preserved
or can this simple patch be used to speed up the reaping? If this
simple patch is not acceptable, are there any preferred methods for
linking together the tasks that have been zombied so they can be reaped
more quickly? Maybe add a zombie list_head to the task_struct and chain
them together in the children list order?
In comparison, without this patch, following modprobe on that particular
machine init is still reaping zombied tasks more than 30 seconds
following command completion. With this patch, all the zombied tasks
are removed within the first couple seconds.
Any suggestions would be greatly appreciated.
Thanks,
Robin Holt
Patch against 2.6.16 SLES 10 kernel.
Index: linux-2.6.16/kernel/exit.c
===================================================================
--- linux-2.6.16.orig/kernel/exit.c 2007-03-28 21:56:20.601860403 -0500
+++ linux-2.6.16/kernel/exit.c 2007-03-28 22:01:12.233942431 -0500
@@ -710,6 +710,13 @@ static void exit_notify(struct task_stru
write_lock_irq(&tasklist_lock);
/*
+ * Relink to head of parent's child list. This makes it easier to find.
+ * On large systems, init has way too many children that never terminate.
+ */
+ list_del_init(&tsk->sibling);
+ list_add(&tsk->sibling, &tsk->parent->children);
+
+ /*
* This does two things:
*
* A. Make init inherit all the child processes
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-05 19:51 Robin Holt
@ 2007-04-05 20:57 ` Linus Torvalds
2007-04-06 0:51 ` Chris Snook
2007-04-06 22:38 ` Jeff Garzik
1 sibling, 1 reply; 100+ messages in thread
From: Linus Torvalds @ 2007-04-05 20:57 UTC (permalink / raw)
To: Robin Holt; +Cc: Eric W. Biederman, Ingo Molnar, linux-kernel, Jack Steiner
On Thu, 5 Apr 2007, Robin Holt wrote:
>
> For testing, Jack Steiner create the following patch. All it does
> is moves tasks which are transitioning to the zombie state from where
> they are in the children list to the head of the list. In this way,
> they will be the first found and reaping does speed up. We will still
> do a full scan of the list once the rearranged tasks are all removed.
> This does not seem to be a significant problem.
I'd almost prefer to just put the zombie children on a separate list. I
wonder how painful that would be..
That would still make it expensive for people who use WUNTRACED to get
stopped children (since they'd have to look at all lists), but maybe
that's not a big deal.
Another thing we could do is to just make sure that kernel threads simply
don't end up as children of init. That whole thing is silly, they're
really not children of the user-space init anyway. Comments?
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-05 20:57 ` Linus Torvalds
@ 2007-04-06 0:51 ` Chris Snook
2007-04-06 1:03 ` Chris Snook
2007-04-06 1:29 ` Linus Torvalds
0 siblings, 2 replies; 100+ messages in thread
From: Chris Snook @ 2007-04-06 0:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Robin Holt, Eric W. Biederman, Ingo Molnar, linux-kernel,
Jack Steiner
Linus Torvalds wrote:
>
> On Thu, 5 Apr 2007, Robin Holt wrote:
>> For testing, Jack Steiner create the following patch. All it does
>> is moves tasks which are transitioning to the zombie state from where
>> they are in the children list to the head of the list. In this way,
>> they will be the first found and reaping does speed up. We will still
>> do a full scan of the list once the rearranged tasks are all removed.
>> This does not seem to be a significant problem.
>
> I'd almost prefer to just put the zombie children on a separate list. I
> wonder how painful that would be..
>
> That would still make it expensive for people who use WUNTRACED to get
> stopped children (since they'd have to look at all lists), but maybe
> that's not a big deal.
Shouldn't be any worse than it already is.
> Another thing we could do is to just make sure that kernel threads simply
> don't end up as children of init. That whole thing is silly, they're
> really not children of the user-space init anyway. Comments?
>
> Linus
Does anyone remember why we started doing this in the first place? I'm sure
there are some tools that expect a process tree, rather than a forest, and
making it a forest could make them unhappy.
The support angel on my shoulder says we should just put all the kernel threads
under a kthread subtree to shorten init's child list and minimize impact. The
hacker devil on my other shoulder says that with usermode helpers, containers,
etc. it's about time we treat it as a tree, and any tools that have a problem
with that need to be fixed.
-- Chris
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 0:51 ` Chris Snook
@ 2007-04-06 1:03 ` Chris Snook
2007-04-06 1:29 ` Linus Torvalds
1 sibling, 0 replies; 100+ messages in thread
From: Chris Snook @ 2007-04-06 1:03 UTC (permalink / raw)
To: Chris Snook
Cc: Linus Torvalds, Robin Holt, Eric W. Biederman, Ingo Molnar,
linux-kernel, Jack Steiner
Chris Snook wrote:
> Linus Torvalds wrote:
>>
>> On Thu, 5 Apr 2007, Robin Holt wrote:
>>> For testing, Jack Steiner create the following patch. All it does
>>> is moves tasks which are transitioning to the zombie state from where
>>> they are in the children list to the head of the list. In this way,
>>> they will be the first found and reaping does speed up. We will still
>>> do a full scan of the list once the rearranged tasks are all removed.
>>> This does not seem to be a significant problem.
>>
>> I'd almost prefer to just put the zombie children on a separate list.
>> I wonder how painful that would be..
>>
>> That would still make it expensive for people who use WUNTRACED to get
>> stopped children (since they'd have to look at all lists), but maybe
>> that's not a big deal.
>
> Shouldn't be any worse than it already is.
>
>> Another thing we could do is to just make sure that kernel threads
>> simply don't end up as children of init. That whole thing is silly,
>> they're really not children of the user-space init anyway. Comments?
>>
>> Linus
>
> Does anyone remember why we started doing this in the first place? I'm
> sure there are some tools that expect a process tree, rather than a
> forest, and making it a forest could make them unhappy.
>
> The support angel on my shoulder says we should just put all the kernel
> threads under a kthread subtree to shorten init's child list and
> minimize impact. The hacker devil on my other shoulder says that with
> usermode helpers, containers, etc. it's about time we treat it as a
> tree, and any tools that have a problem with that need to be fixed.
>
> -- Chris
Err, that should have been "about time we treat it as a forest".
-- Chris
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 0:51 ` Chris Snook
2007-04-06 1:03 ` Chris Snook
@ 2007-04-06 1:29 ` Linus Torvalds
2007-04-06 2:15 ` Eric W. Biederman
2007-04-06 18:05 ` Christoph Hellwig
1 sibling, 2 replies; 100+ messages in thread
From: Linus Torvalds @ 2007-04-06 1:29 UTC (permalink / raw)
To: Chris Snook
Cc: Robin Holt, Eric W. Biederman, Ingo Molnar, linux-kernel,
Jack Steiner
On Thu, 5 Apr 2007, Chris Snook wrote:
> Linus Torvalds wrote:
>
> > Another thing we could do is to just make sure that kernel threads simply
> > don't end up as children of init. That whole thing is silly, they're really
> > not children of the user-space init anyway. Comments?
>
> Does anyone remember why we started doing this in the first place? I'm sure
> there are some tools that expect a process tree, rather than a forest, and
> making it a forest could make them unhappy.
I'm not sure anybody would really be unhappy with pptr pointing to some
magic and special task that has pid 0 (which makes it clear to everybody
that the parent is something special), and that has SIGCHLD set to SIG_IGN
(which should make the exit case not even go through the zombie phase).
I can't even imagine *how* you'd make a tool unhappy with that, since even
tools like "ps" (and even more "pstree" won't read all the process states
atomically, so they invariably will see parent pointers that don't even
exist any more, because by the time they get to the parent, it has exited
already.
> The support angel on my shoulder says we should just put all the kernel
> threads under a kthread subtree to shorten init's child list and minimize
> impact.
A number are already there, of course, since they use the kthread
infrastructure to get there.
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 1:29 ` Linus Torvalds
@ 2007-04-06 2:15 ` Eric W. Biederman
2007-04-06 10:43 ` Robin Holt
2007-04-09 17:37 ` Chris Snook
2007-04-06 18:05 ` Christoph Hellwig
1 sibling, 2 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 2:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Snook, Robin Holt, Ingo Molnar, linux-kernel, Jack Steiner
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, 5 Apr 2007, Chris Snook wrote:
>
>> Linus Torvalds wrote:
>>
>> > Another thing we could do is to just make sure that kernel threads simply
>> > don't end up as children of init. That whole thing is silly, they're really
>> > not children of the user-space init anyway. Comments?
>>
>> Does anyone remember why we started doing this in the first place? I'm sure
>> there are some tools that expect a process tree, rather than a forest, and
>> making it a forest could make them unhappy.
>
> I'm not sure anybody would really be unhappy with pptr pointing to some
> magic and special task that has pid 0 (which makes it clear to everybody
> that the parent is something special), and that has SIGCHLD set to SIG_IGN
> (which should make the exit case not even go through the zombie phase).
>
> I can't even imagine *how* you'd make a tool unhappy with that, since even
> tools like "ps" (and even more "pstree" won't read all the process states
> atomically, so they invariably will see parent pointers that don't even
> exist any more, because by the time they get to the parent, it has exited
> already.
Right. pid == 1 being missing might cause some confusing having
but having ppid == 0 should be fine. Heck pid == 1 already has
ppid == 0, so it is a value user space has had to deal with for a
while.
In addition there was a period in 2.6 where most kernel threads
and init had a pgid == 0 and a session == 0, and nothing seemed
to complain.
We should probably make all of the kernel threads children of
init_task. The initial idle thread on the first cpu that is the
parent of pid == 1. That will give the ppid == 0 naturally because
the idle thread has pid == 0.
>> The support angel on my shoulder says we should just put all the kernel
>> threads under a kthread subtree to shorten init's child list and minimize
>> impact.
>
> A number are already there, of course, since they use the kthread
> infrastructure to get there.
Almost everything should be using kthread by now. I do admit that there
are a handful of kernel threads that still use kthread_create but it
is a relatively short list.
Looking we apparently have a couple of process started by
kthread_create that are not under kthread. They all have pid numbers
lower than kthread so I'm guessing it is some startup ordering issue.
Currently it looks like daemonize is reparenting everything to init,
changing that to init_task and making the threads self reaping
should be trivial.
.....
I'm a little nervous that we exceeded our default pid max just booting
the kernel. 32768 is a lot of kernel threads. That sounds like 32
kernel threads per cpu. That seems to be more than I have on any
of my little machines.
There is no defined order for reaping of child processes and in fact
I can't even see anything in the kernel right now that would even
accidentally give user space the idea we had a defined order.
So I think we have some options once we get the kernel threads out
of the way. Getting the kernel threads out of the way would seem
to be the first priority.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
@ 2007-04-06 8:42 Oleg Nesterov
2007-04-06 9:10 ` Eric W. Biederman
0 siblings, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-06 8:42 UTC (permalink / raw)
To: Robin Holt
Cc: Chris Snook, Eric W. Biederman, Ingo Molnar, Linus Torvalds,
linux-kernel
Robin Holt wrote:
>
> wait_task_zombie() is taking many seconds to get through the list.
> For the case of a modprobe, stop_machine creates one thread per cpu
> (remember big number). All are parented to init and their exit will
> cause wait_task_zombie to scan multiple times most of the way through
> this very long list looking for threads which need to be reaped.
Could you try this patch
http://marc.info/?l=linux-kernel&m=117337194209912
?
It can't solve the whole problem, but at least an exiting kernel thread
won't kick init.
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 8:42 init's children list is long and slows reaping children Oleg Nesterov
@ 2007-04-06 9:10 ` Eric W. Biederman
2007-04-06 9:44 ` Oleg Nesterov
0 siblings, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 9:10 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Robin Holt, Chris Snook, Ingo Molnar, Linus Torvalds,
linux-kernel
Oleg Nesterov <oleg@tv-sign.ru> writes:
> Robin Holt wrote:
>>
>> wait_task_zombie() is taking many seconds to get through the list.
>> For the case of a modprobe, stop_machine creates one thread per cpu
>> (remember big number). All are parented to init and their exit will
>> cause wait_task_zombie to scan multiple times most of the way through
>> this very long list looking for threads which need to be reaped.
>
> Could you try this patch
>
> http://marc.info/?l=linux-kernel&m=117337194209912
>
> ?
>
> It can't solve the whole problem, but at least an exiting kernel thread
> won't kick init.
At first glance your patch looks reasonable.
Unfortunately it only applies to the rare thread that calls daemonize,
and not also to kernel/kthread/kthread() which means it will miss many of
our current kernel threads.
For the sgi problem I doubt it is will make a difference but the change
is certainly worth making.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 9:10 ` Eric W. Biederman
@ 2007-04-06 9:44 ` Oleg Nesterov
2007-04-06 15:45 ` Eric W. Biederman
0 siblings, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-06 9:44 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Robin Holt, Chris Snook, Ingo Molnar, Linus Torvalds,
linux-kernel
On 04/06, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> writes:
>
> > Robin Holt wrote:
> >>
> >> wait_task_zombie() is taking many seconds to get through the list.
> >> For the case of a modprobe, stop_machine creates one thread per cpu
> >> (remember big number). All are parented to init and their exit will
> >> cause wait_task_zombie to scan multiple times most of the way through
> >> this very long list looking for threads which need to be reaped.
> >
> > Could you try this patch
> >
> > http://marc.info/?l=linux-kernel&m=117337194209912
> >
> > ?
> >
> > It can't solve the whole problem, but at least an exiting kernel thread
> > won't kick init.
>
> At first glance your patch looks reasonable.
>
> Unfortunately it only applies to the rare thread that calls daemonize,
> and not also to kernel/kthread/kthread() which means it will miss many of
> our current kernel threads.
Note that a thread created by kthread_create() has ->parent == "[kthread]",
not /sbin/init (unless it was created before core_initcall). So I don't really
understand how stop_machine() creates the threads parented to init.
> For the sgi problem I doubt it is will make a difference
most probably you are right.
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 2:15 ` Eric W. Biederman
@ 2007-04-06 10:43 ` Robin Holt
2007-04-06 15:38 ` Eric W. Biederman
2007-04-09 17:37 ` Chris Snook
1 sibling, 1 reply; 100+ messages in thread
From: Robin Holt @ 2007-04-06 10:43 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Chris Snook, Robin Holt, Ingo Molnar,
linux-kernel, Jack Steiner
> So I think we have some options once we get the kernel threads out
> of the way. Getting the kernel threads out of the way would seem
> to be the first priority.
I think both avenues would probably be the right way to proceeed.
Getting kthreads to not be parented by init would be an opportunity
for optimization.
I think organizing the zombie tasks to be easily reaped also has merit.
Rapidly forking/exiting processes like udevd during the boot of a
different machine were also shown to benefit significantly from this
patch. That machine had 512 cpus and 4608 disk devices, we dropped the
device discovery under udevd by 30%. This, honestly, surprised us.
It makes some sense now that I think about it. This would be a case
where improving the zombie handling would be beneficial to more than
just kthreads.
Thanks,
Robin
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 10:43 ` Robin Holt
@ 2007-04-06 15:38 ` Eric W. Biederman
2007-04-06 16:31 ` Oleg Nesterov
2007-04-06 16:41 ` Robin Holt
0 siblings, 2 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 15:38 UTC (permalink / raw)
To: Robin Holt
Cc: Linus Torvalds, Chris Snook, Ingo Molnar, linux-kernel,
Jack Steiner, Oleg Nesterov
Robin Holt <holt@sgi.com> writes:
>> So I think we have some options once we get the kernel threads out
>> of the way. Getting the kernel threads out of the way would seem
>> to be the first priority.
>
> I think both avenues would probably be the right way to proceeed.
> Getting kthreads to not be parented by init would be an opportunity
> for optimization.
>
> I think organizing the zombie tasks to be easily reaped also has merit.
> Rapidly forking/exiting processes like udevd during the boot of a
> different machine were also shown to benefit significantly from this
> patch. That machine had 512 cpus and 4608 disk devices, we dropped the
> device discovery under udevd by 30%. This, honestly, surprised us.
> It makes some sense now that I think about it. This would be a case
> where improving the zombie handling would be beneficial to more than
> just kthreads.
How hard is tasklist_lock hit on these systems?
How hard is the pid hash hit on these systems?
My hunch is that if you are doing a lot of forking and exiting
zombie reaping isn't the only problem you are observing.
Thinking about it I do agree with Linus that two lists sounds like the
right solution because it ensures we always have O(1) time when
waiting for a zombie. I'd like to place the list head for the zombie
list in the signal_struct and not in the task_struct so our
performance continues to be O(1) when we have a threaded process.
The big benefit of the zombie list over your proposed list reordering
is that waitpid can return immediately when we don't have zombies to
wait for, but we have lots of children. So it looks like a universal
benefit and about as good as it is possible to make zombie handling
of waitpid.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 9:44 ` Oleg Nesterov
@ 2007-04-06 15:45 ` Eric W. Biederman
2007-04-06 15:47 ` Oleg Nesterov
0 siblings, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 15:45 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Robin Holt, Chris Snook, Ingo Molnar, Linus Torvalds,
linux-kernel
Oleg Nesterov <oleg@tv-sign.ru> writes:
>> At first glance your patch looks reasonable.
>>
>> Unfortunately it only applies to the rare thread that calls daemonize,
>> and not also to kernel/kthread/kthread() which means it will miss many of
>> our current kernel threads.
>
> Note that a thread created by kthread_create() has ->parent == "[kthread]",
> not /sbin/init (unless it was created before core_initcall). So I don't really
> understand how stop_machine() creates the threads parented to init.
Duh. I was looking for that detail but the code was sufficiently
abstracted that I missed it skimming through. Now I see it thanks.
That at least explains how a handful of threads created with
kthread_create have init as their parent.
My gut feel says it makes sense to change reparent_to_init to
reparent_to_kthread. Which will get the all of the kernel daemons
under a single parent. Further I think it makes sense to see if
we can start kthread earlier. Having kthread have pid == 2 would be
nice. That way we can kill the conditional that is only used during
early initialization (make the code simpler and more predictable).
Since we seem to need kthread for the work of forking kernel threads,
I don't see much point in not having a proper process tree, and a
single extra child for pid == 1 isn't going to hurt anyone.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 15:45 ` Eric W. Biederman
@ 2007-04-06 15:47 ` Oleg Nesterov
2007-04-06 17:16 ` Linus Torvalds
0 siblings, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-06 15:47 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Robin Holt, Chris Snook, Ingo Molnar, Linus Torvalds,
linux-kernel
On 04/06, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> writes:
>
> >> At first glance your patch looks reasonable.
> >>
> >> Unfortunately it only applies to the rare thread that calls daemonize,
> >> and not also to kernel/kthread/kthread() which means it will miss many of
> >> our current kernel threads.
> >
> > Note that a thread created by kthread_create() has ->parent == "[kthread]",
> > not /sbin/init (unless it was created before core_initcall). So I don't really
> > understand how stop_machine() creates the threads parented to init.
>
> Duh. I was looking for that detail but the code was sufficiently
> abstracted that I missed it skimming through. Now I see it thanks.
> That at least explains how a handful of threads created with
> kthread_create have init as their parent.
Oops. I misread stop_machine(), it does kernel_thread(), not kthread_create().
So "stopmachine" threads are all re-parented to init when the caller exits.
I think it makes sense to set ->exit_state = -1 in stopmachine(), regadless
of any other changes.
> My gut feel says it makes sense to change reparent_to_init to
> reparent_to_kthread. Which will get the all of the kernel daemons
> under a single parent. Further I think it makes sense to see if
> we can start kthread earlier. Having kthread have pid == 2 would be
> nice. That way we can kill the conditional that is only used during
> early initialization (make the code simpler and more predictable).
Yes. But we can't create helper_wq ([kthread]) before init_workqueues().
We can call helper_init() from init_workqueues() before creating "events"
though.
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 15:38 ` Eric W. Biederman
@ 2007-04-06 16:31 ` Oleg Nesterov
2007-04-06 17:32 ` Ingo Molnar
` (2 more replies)
2007-04-06 16:41 ` Robin Holt
1 sibling, 3 replies; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-06 16:31 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Robin Holt, Linus Torvalds, Chris Snook, Ingo Molnar,
linux-kernel, Jack Steiner
On 04/06, Eric W. Biederman wrote:
>
> Thinking about it I do agree with Linus that two lists sounds like the
> right solution because it ensures we always have O(1) time when
> waiting for a zombie.
Well. I bet this will be painful, and will uglify the code even more.
do_wait() has to iterate over 2 lists, __ptrace_unlink() should check
->exit_state, etc...
> I'd like to place the list head for the zombie
> list in the signal_struct and not in the task_struct so our
> performance continues to be O(1) when we have a threaded process.
Sure. It would be nice to move ->children into signal_struct at first.
Except this change breaks (in fact fixes) ->pdeath_signal behaviour.
> The big benefit of the zombie list over your proposed list reordering
> is that waitpid can return immediately when we don't have zombies to
> wait for, but we have lots of children.
TASK_TRACED/TASK_STOPPED ?
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 15:38 ` Eric W. Biederman
2007-04-06 16:31 ` Oleg Nesterov
@ 2007-04-06 16:41 ` Robin Holt
1 sibling, 0 replies; 100+ messages in thread
From: Robin Holt @ 2007-04-06 16:41 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Robin Holt, Linus Torvalds, Chris Snook, Ingo Molnar,
linux-kernel, Jack Steiner, Oleg Nesterov
On Fri, Apr 06, 2007 at 09:38:24AM -0600, Eric W. Biederman wrote:
> How hard is tasklist_lock hit on these systems?
The major hold-off we are seeing is from tasks reaping children,
especially tasks with very large children lists.
> How hard is the pid hash hit on these systems?
In the little bit of testing we got before the machine got taken away,
we never observed significant issues with the pid hash. We only got
time to run a few benchmarks like aim7.
...
> The big benefit of the zombie list over your proposed list reordering
> is that waitpid can return immediately when we don't have zombies to
> wait for, but we have lots of children. So it looks like a universal
> benefit and about as good as it is possible to make zombie handling
> of waitpid.
Is this something you are taking on as a task for yourself or do you
want me to pursue? I am extremely swamped on other non-kernel issues
and Jack is off working on x86_64 issues so we would both _greatly_
appreciate any help you can give. Of course, we understand this is an
issue that is affecting us and the responsibility ultimately lies here.
As for testing a proposed patch, we have a customer machine available for
the next few days which could be put into the same configuration we had
for this test, but it will be limited availability and then only assuming
the SGI site test personnel are certain the machine meets customer needs.
Thanks,
Robin
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 15:47 ` Oleg Nesterov
@ 2007-04-06 17:16 ` Linus Torvalds
2007-04-06 17:27 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 100+ messages in thread
From: Linus Torvalds @ 2007-04-06 17:16 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Eric W. Biederman, Robin Holt, Chris Snook, Ingo Molnar,
linux-kernel
On Fri, 6 Apr 2007, Oleg Nesterov wrote:
>
> Oops. I misread stop_machine(), it does kernel_thread(), not kthread_create().
> So "stopmachine" threads are all re-parented to init when the caller exits.
> I think it makes sense to set ->exit_state = -1 in stopmachine(), regadless
> of any other changes.
I agree that "->exit_state = -1" makes sense, but I disagree in that I
don't think it *matters*.
Most of the problematic kernel threads are long-running (as in "never
exit"). Things like eventd, khelper, migration-threads etc will have init
as their parent and never actually exit, so their ->exit_state doesn't
matter. What matters is that they are on the children list, so when you
have 1024 CPU's, and init has many thousand of these per-cpu threads as
its children, then the *user* threads (that do exit) will cause problems!
I'd almost prefer to just not add kernel threads to any parent process
list *at*all*.
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:16 ` Linus Torvalds
@ 2007-04-06 17:27 ` Ingo Molnar
2007-04-06 17:31 ` Ingo Molnar
2007-04-06 17:34 ` Eric W. Biederman
2007-04-06 19:36 ` Oleg Nesterov
2 siblings, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2007-04-06 17:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Eric W. Biederman, Robin Holt, Chris Snook,
linux-kernel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> I'd almost prefer to just not add kernel threads to any parent process
> list *at*all*.
i think part of the problem is the legacy that the list is artificially
unified: tasks that 'will possibly exit' are on the same list as tasks
that 'have already exited'. If we split it up into its natural data
structure, having a list of tasks that are there and do not intend to
exit, plus a separate list of tasks that are exiting and want to notify
their parent, all this scanning goes away. I can see no real reason for
this other than legacy - i dont think the semantics of the wait4() API
force us to scan all those threads.
putting the freshly reaped tasks at the 'head' of the list is just a
fancy (and incomplete) way of splitting the list up into two lists, and
i'd advocate a clean split. Just like have have split the ptrace_list
away from the main list too.
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:27 ` Ingo Molnar
@ 2007-04-06 17:31 ` Ingo Molnar
0 siblings, 0 replies; 100+ messages in thread
From: Ingo Molnar @ 2007-04-06 17:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Eric W. Biederman, Robin Holt, Chris Snook,
linux-kernel
* Ingo Molnar <mingo@elte.hu> wrote:
> putting the freshly reaped tasks at the 'head' of the list is just a
> fancy (and incomplete) way of splitting the list up into two lists, and
> i'd advocate a clean split. Just like have have split the ptrace_list
^---s/have/we
> away from the main list too.
and in this sense your suggestion will happen:
> > I'd almost prefer to just not add kernel threads to any parent
> > process list *at*all*.
because that way kernel threads will never show up in the to-be-reaped
list, they'll only show up in the main list that represents the tree of
tasks.
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 16:31 ` Oleg Nesterov
@ 2007-04-06 17:32 ` Ingo Molnar
2007-04-06 17:39 ` Roland Dreier
2007-04-06 18:30 ` Eric W. Biederman
2007-04-06 18:02 ` Eric W. Biederman
2007-04-06 18:21 ` Davide Libenzi
2 siblings, 2 replies; 100+ messages in thread
From: Ingo Molnar @ 2007-04-06 17:32 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Eric W. Biederman, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
* Oleg Nesterov <oleg@tv-sign.ru> wrote:
> > Thinking about it I do agree with Linus that two lists sounds like
> > the right solution because it ensures we always have O(1) time when
> > waiting for a zombie.
>
> Well. I bet this will be painful, and will uglify the code even more.
>
> do_wait() has to iterate over 2 lists, __ptrace_unlink() should check
> ->exit_state, etc...
no. Two _completely separate_ lists.
i.e. a to-be-reaped task will still be on the main list _too_. The main
list is for all the PID semantics rules. The reap-list is just for
wait4() processing. The two would be completely separate.
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:16 ` Linus Torvalds
2007-04-06 17:27 ` Ingo Molnar
@ 2007-04-06 17:34 ` Eric W. Biederman
2007-04-06 19:06 ` H. Peter Anvin
2007-04-06 19:36 ` Oleg Nesterov
2 siblings, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 17:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Robin Holt, Chris Snook, Ingo Molnar, linux-kernel
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Fri, 6 Apr 2007, Oleg Nesterov wrote:
>>
>> Oops. I misread stop_machine(), it does kernel_thread(), not kthread_create().
>> So "stopmachine" threads are all re-parented to init when the caller exits.
>> I think it makes sense to set ->exit_state = -1 in stopmachine(), regadless
>> of any other changes.
>
> I agree that "->exit_state = -1" makes sense, but I disagree in that I
> don't think it *matters*.
>
> Most of the problematic kernel threads are long-running (as in "never
> exit"). Things like eventd, khelper, migration-threads etc will have init
> as their parent and never actually exit, so their ->exit_state doesn't
> matter. What matters is that they are on the children list, so when you
> have 1024 CPU's, and init has many thousand of these per-cpu threads as
> its children, then the *user* threads (that do exit) will cause problems!
>
> I'd almost prefer to just not add kernel threads to any parent process
> list *at*all*.
Oleg is coming from a different case where it was found that exiting kernel
threads were causing problems for nash when nash was run as init in an
initramfs. While I think that case is likely a user space bug because
nash should check the pid from waidpid before assuming the process it
was waiting for returned.
I do think you can construct valid embedded cases where you can prove
there that among the user space processes certain actions are safe
when they exit that a spare kernel space thread could invalidate.
It just happens that the exit signal of kernel threads just matters on
the other end of system size.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:32 ` Ingo Molnar
@ 2007-04-06 17:39 ` Roland Dreier
2007-04-06 18:04 ` Eric W. Biederman
2007-04-06 18:30 ` Eric W. Biederman
1 sibling, 1 reply; 100+ messages in thread
From: Roland Dreier @ 2007-04-06 17:39 UTC (permalink / raw)
To: Ingo Molnar
Cc: Oleg Nesterov, Eric W. Biederman, Robin Holt, Linus Torvalds,
Chris Snook, linux-kernel, Jack Steiner
> no. Two _completely separate_ lists.
>
> i.e. a to-be-reaped task will still be on the main list _too_. The main
> list is for all the PID semantics rules. The reap-list is just for
> wait4() processing. The two would be completely separate.
I guess this means we add another list head to struct task_struct.
Not that big a deal, but it does make me a little sad to think about
task_struct getting even bigger....
- R.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 16:31 ` Oleg Nesterov
2007-04-06 17:32 ` Ingo Molnar
@ 2007-04-06 18:02 ` Eric W. Biederman
2007-04-06 18:21 ` Davide Libenzi
2 siblings, 0 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 18:02 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Robin Holt, Linus Torvalds, Chris Snook, Ingo Molnar,
linux-kernel, Jack Steiner
Oleg Nesterov <oleg@tv-sign.ru> writes:
> On 04/06, Eric W. Biederman wrote:
>>
>> Thinking about it I do agree with Linus that two lists sounds like the
>> right solution because it ensures we always have O(1) time when
>> waiting for a zombie.
>
> Well. I bet this will be painful, and will uglify the code even more.
>
> do_wait() has to iterate over 2 lists, __ptrace_unlink() should check
> ->exit_state, etc...
Well I would prefer to iterate over 2 lists as opposed to N lists that
we have now.
The __ptrace_unlink issue sounds moderately valid although a ptraced
zombie is down right peculiar.
>> I'd like to place the list head for the zombie
>> list in the signal_struct and not in the task_struct so our
>> performance continues to be O(1) when we have a threaded process.
>
> Sure. It would be nice to move ->children into signal_struct at first.
> Except this change breaks (in fact fixes) ->pdeath_signal behaviour.
Actually this should be independent of the pdeath_signal issue.
As long as we record pdeath_signal per task_struct we can continue
to implement the existing semantics.
Although we really should fix pdeath_signal and push the patch to
-mm. We only didn't do that because the original patch was a
bug fix for stable kernels, and we didn't have time to verify fixing
pdeath_signal would not break anything else.
>> The big benefit of the zombie list over your proposed list reordering
>> is that waitpid can return immediately when we don't have zombies to
>> wait for, but we have lots of children.
>
> TASK_TRACED/TASK_STOPPED ?
Well it doesn't fix everything yet but it does fix the common case.
I wonder if it would make sense to have other lists as well.
Regardless of how this fixes scaling someone needs to dig in there load
up all the messy state in their head and clean up, simplify,
and optimize this mess.
We still have the stupid case where we can create unkillable
zombies from user space with a threaded init.
I think it is safe to say that this part of the code has reach the point
of fragility where it is hard to maintain. A clear sign that it is time to
refactor something.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:39 ` Roland Dreier
@ 2007-04-06 18:04 ` Eric W. Biederman
0 siblings, 0 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 18:04 UTC (permalink / raw)
To: Roland Dreier
Cc: Ingo Molnar, Oleg Nesterov, Robin Holt, Linus Torvalds,
Chris Snook, linux-kernel, Jack Steiner
Roland Dreier <rdreier@cisco.com> writes:
> > no. Two _completely separate_ lists.
> >
> > i.e. a to-be-reaped task will still be on the main list _too_. The main
> > list is for all the PID semantics rules. The reap-list is just for
> > wait4() processing. The two would be completely separate.
>
> I guess this means we add another list head to struct task_struct.
> Not that big a deal, but it does make me a little sad to think about
> task_struct getting even bigger....
signal_struct please. It isn't that much better but still...
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 1:29 ` Linus Torvalds
2007-04-06 2:15 ` Eric W. Biederman
@ 2007-04-06 18:05 ` Christoph Hellwig
2007-04-06 19:39 ` Eric W. Biederman
1 sibling, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2007-04-06 18:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Snook, Robin Holt, Eric W. Biederman, Ingo Molnar,
linux-kernel, Jack Steiner
On Thu, Apr 05, 2007 at 06:29:16PM -0700, Linus Torvalds wrote:
> > The support angel on my shoulder says we should just put all the kernel
> > threads under a kthread subtree to shorten init's child list and minimize
> > impact.
>
> A number are already there, of course, since they use the kthread
> infrastructure to get there.
As all kernel thread (1) should be converted to kthread anyway for
proper containers support and general "let's get rid of a crappy API'
cleanups I think that's enough. It would be nice to have SGI helping
to convert more drivers over to the proper API as conversions have stalled
a little bit.
(1) There's a few very core kernel threads that need to stick to the
low-level API, but they're too few to make any differences.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 16:31 ` Oleg Nesterov
2007-04-06 17:32 ` Ingo Molnar
2007-04-06 18:02 ` Eric W. Biederman
@ 2007-04-06 18:21 ` Davide Libenzi
2007-04-06 18:56 ` Eric W. Biederman
2 siblings, 1 reply; 100+ messages in thread
From: Davide Libenzi @ 2007-04-06 18:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Eric W. Biederman, Robin Holt, Linus Torvalds, Chris Snook,
Ingo Molnar, Linux Kernel Mailing List, Jack Steiner
On Fri, 6 Apr 2007, Oleg Nesterov wrote:
> Sure. It would be nice to move ->children into signal_struct at first.
> Except this change breaks (in fact fixes) ->pdeath_signal behaviour.
Ohhh, the "signal" struct! Funny name for something that nowadays has
probably no more than a 5% affinity with signal-related tasks :/
- Davide
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:32 ` Ingo Molnar
2007-04-06 17:39 ` Roland Dreier
@ 2007-04-06 18:30 ` Eric W. Biederman
2007-04-10 13:48 ` Ingo Molnar
1 sibling, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 18:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
Ingo Molnar <mingo@elte.hu> writes:
> no. Two _completely separate_ lists.
>
> i.e. a to-be-reaped task will still be on the main list _too_. The main
> list is for all the PID semantics rules. The reap-list is just for
> wait4() processing. The two would be completely separate.
And what pray tell except for heuristics is the list of children used for?
I could find a use in the scheduler (oldest_child and younger/older_sibling).
I could find a use in mm/oom_kill.
I could find a use in irixsig where it roles it's own version of wait4.
Perhaps I was blind but that was about it.
I didn't see the child list implementing any semantics we really care
about to user space.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 18:21 ` Davide Libenzi
@ 2007-04-06 18:56 ` Eric W. Biederman
2007-04-06 19:16 ` Davide Libenzi
0 siblings, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 18:56 UTC (permalink / raw)
To: Davide Libenzi
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
Ingo Molnar, Linux Kernel Mailing List, Jack Steiner
Davide Libenzi <davidel@xmailserver.org> writes:
> On Fri, 6 Apr 2007, Oleg Nesterov wrote:
>
>> Sure. It would be nice to move ->children into signal_struct at first.
>> Except this change breaks (in fact fixes) ->pdeath_signal behaviour.
>
> Ohhh, the "signal" struct! Funny name for something that nowadays has
> probably no more than a 5% affinity with signal-related tasks :/
Hmm. I wonder if we should just rename it the struct thread_group,
or struct task_group. Those seem slightly more accurate names.
I remember last time the conversation about renaming it came up no
one had a good name for it that wasn't already taken.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:34 ` Eric W. Biederman
@ 2007-04-06 19:06 ` H. Peter Anvin
2007-04-06 19:15 ` Eric W. Biederman
0 siblings, 1 reply; 100+ messages in thread
From: H. Peter Anvin @ 2007-04-06 19:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Oleg Nesterov, Robin Holt, Chris Snook,
Ingo Molnar, linux-kernel
Eric W. Biederman wrote:
>
> Oleg is coming from a different case where it was found that exiting kernel
> threads were causing problems for nash when nash was run as init in an
> initramfs. While I think that case is likely a user space bug because
> nash should check the pid from waidpid before assuming the process it
> was waiting for returned.
>
Are you saying waitpid() (wait4) *with a pid specified* can return
another pid? That definitely sounds like a bug.
-hpa
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:06 ` H. Peter Anvin
@ 2007-04-06 19:15 ` Eric W. Biederman
2007-04-06 19:21 ` H. Peter Anvin
2007-04-06 21:04 ` Jeremy Fitzhardinge
0 siblings, 2 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 19:15 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Oleg Nesterov, Robin Holt, Chris Snook,
Ingo Molnar, linux-kernel
"H. Peter Anvin" <hpa@zytor.com> writes:
> Eric W. Biederman wrote:
>>
>> Oleg is coming from a different case where it was found that exiting kernel
>> threads were causing problems for nash when nash was run as init in an
>> initramfs. While I think that case is likely a user space bug because
>> nash should check the pid from waidpid before assuming the process it
>> was waiting for returned.
>>
>
> Are you saying waitpid() (wait4) *with a pid specified* can return another pid?
> That definitely sounds like a bug.
No.
For the full context look back a couple of messages.
I'm guessing the issue is nash just calls wait and doesn't check the
returned pid value, assuming it is the only child it forked returning.
Which is valid except when you are running as pid == 1.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 18:56 ` Eric W. Biederman
@ 2007-04-06 19:16 ` Davide Libenzi
2007-04-06 19:19 ` Ingo Molnar
0 siblings, 1 reply; 100+ messages in thread
From: Davide Libenzi @ 2007-04-06 19:16 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
Ingo Molnar, Linux Kernel Mailing List, Jack Steiner
On Fri, 6 Apr 2007, Eric W. Biederman wrote:
> Davide Libenzi <davidel@xmailserver.org> writes:
>
> > On Fri, 6 Apr 2007, Oleg Nesterov wrote:
> >
> >> Sure. It would be nice to move ->children into signal_struct at first.
> >> Except this change breaks (in fact fixes) ->pdeath_signal behaviour.
> >
> > Ohhh, the "signal" struct! Funny name for something that nowadays has
> > probably no more than a 5% affinity with signal-related tasks :/
>
> Hmm. I wonder if we should just rename it the struct thread_group,
> or struct task_group. Those seem slightly more accurate names.
Almost *anything* is better than "signal_struct" ;)
A task_group could be fine, so something on the line of task_shared_ctx.
- Davide
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:16 ` Davide Libenzi
@ 2007-04-06 19:19 ` Ingo Molnar
2007-04-06 21:29 ` Davide Libenzi
0 siblings, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2007-04-06 19:19 UTC (permalink / raw)
To: Davide Libenzi
Cc: Eric W. Biederman, Oleg Nesterov, Robin Holt, Linus Torvalds,
Chris Snook, Linux Kernel Mailing List, Jack Steiner
* Davide Libenzi <davidel@xmailserver.org> wrote:
> > > Ohhh, the "signal" struct! Funny name for something that nowadays
> > > has probably no more than a 5% affinity with signal-related tasks
> > > :/
> >
> > Hmm. I wonder if we should just rename it the struct thread_group,
> > or struct task_group. Those seem slightly more accurate names.
>
> Almost *anything* is better than "signal_struct" ;) A task_group could
> be fine, so something on the line of task_shared_ctx.
or lets just face it and name it what it is: process_struct ;-)
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:15 ` Eric W. Biederman
@ 2007-04-06 19:21 ` H. Peter Anvin
2007-04-06 21:04 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 100+ messages in thread
From: H. Peter Anvin @ 2007-04-06 19:21 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Oleg Nesterov, Robin Holt, Chris Snook,
Ingo Molnar, linux-kernel
Eric W. Biederman wrote:
>>>
>> Are you saying waitpid() (wait4) *with a pid specified* can return another pid?
>> That definitely sounds like a bug.
>
> No.
>
> For the full context look back a couple of messages.
>
> I'm guessing the issue is nash just calls wait and doesn't check the
> returned pid value, assuming it is the only child it forked returning.
> Which is valid except when you are running as pid == 1.
>
OK, userspace error :)
-hpa
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 17:16 ` Linus Torvalds
2007-04-06 17:27 ` Ingo Molnar
2007-04-06 17:34 ` Eric W. Biederman
@ 2007-04-06 19:36 ` Oleg Nesterov
2007-04-06 19:43 ` Ingo Molnar
` (2 more replies)
2 siblings, 3 replies; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-06 19:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Robin Holt, Chris Snook, Ingo Molnar,
linux-kernel
On 04/06, Linus Torvalds wrote:
>
> On Fri, 6 Apr 2007, Oleg Nesterov wrote:
> >
> > Oops. I misread stop_machine(), it does kernel_thread(), not kthread_create().
> > So "stopmachine" threads are all re-parented to init when the caller exits.
> > I think it makes sense to set ->exit_state = -1 in stopmachine(), regadless
> > of any other changes.
>
> I agree that "->exit_state = -1" makes sense, but I disagree in that I
> don't think it *matters*.
>
> Most of the problematic kernel threads are long-running (as in "never
> exit"). Things like eventd, khelper, migration-threads etc will have init
> as their parent and never actually exit, so their ->exit_state doesn't
> matter. What matters is that they are on the children list, so when you
> have 1024 CPU's, and init has many thousand of these per-cpu threads as
> its children, then the *user* threads (that do exit) will cause problems!
>
> I'd almost prefer to just not add kernel threads to any parent process
> list *at*all*.
Yes sure, I didn't argue with that. However, "->exit_state = -1" does matter,
we can't detach process unless we make it auto-reap.
Perhaps,
--- t/kernel/exit.c~ 2007-04-06 23:31:31.000000000 +0400
+++ t/kernel/exit.c 2007-04-06 23:31:57.000000000 +0400
@@ -275,10 +275,7 @@ static void reparent_to_init(void)
remove_parent(current);
current->parent = child_reaper(current);
current->real_parent = child_reaper(current);
- add_parent(current);
-
- /* Set the exit signal to SIGCHLD so we signal init on exit */
- current->exit_signal = SIGCHLD;
+ current->exit_signal = -1;
if (!has_rt_policy(current) && (task_nice(current) < 0))
set_user_nice(current, 0);
is enough. init is still our parent (make ps happy), but it can't see us,
we are not on ->children list.
Off course, we also need to add preparent_to_init() to kthread() and
(say) stopmachine(). Or we can create kernel_thread_detached() and
modify callers to use it.
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 18:05 ` Christoph Hellwig
@ 2007-04-06 19:39 ` Eric W. Biederman
0 siblings, 0 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 19:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Linus Torvalds, Chris Snook, Robin Holt, Ingo Molnar,
linux-kernel, Jack Steiner
Christoph Hellwig <hch@infradead.org> writes:
> As all kernel thread (1) should be converted to kthread anyway for
> proper containers support and general "let's get rid of a crappy API'
> cleanups I think that's enough. It would be nice to have SGI helping
> to convert more drivers over to the proper API as conversions have stalled
> a little bit.
>
>
> (1) There's a few very core kernel threads that need to stick to the
> low-level API, but they're too few to make any differences.
Yes. I had to step back and remind myself why I care as daemonize is
currently doing an effective job of removing the namespace information.
However there is nothing daemonize can do about the pid of the task
because it is returned by kernel_thread and used by the callers
(to at the very least find the task_struct).
So not finishing that conversion is certainly one of the last blockers
for implementing the pid namespace.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:36 ` Oleg Nesterov
@ 2007-04-06 19:43 ` Ingo Molnar
2007-04-06 20:01 ` Oleg Nesterov
2007-04-06 19:47 ` Oleg Nesterov
2007-04-07 20:31 ` Oleg Nesterov
2 siblings, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2007-04-06 19:43 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Eric W. Biederman, Robin Holt, Chris Snook,
linux-kernel
* Oleg Nesterov <oleg@tv-sign.ru> wrote:
> > I'd almost prefer to just not add kernel threads to any parent
> > process list *at*all*.
>
> Yes sure, I didn't argue with that. However, "->exit_state = -1" does
> matter, we can't detach process unless we make it auto-reap.
> Off course, we also need to add preparent_to_init() to kthread() and
> (say) stopmachine(). Or we can create kernel_thread_detached() and
> modify callers to use it.
this isnt a kernel-thread special case. The right solution IMO is to
first migrate wait4()'s ->children use over to a new p->exiting_children
list and then to gradually get rid of all remaining uses of p->children.
(the first patch of which i sent a few minutes ago)
that way wait4() will be sped up, and quite dramatically i believe. No
need to deal with kthreads here at all - those just wont ever show up in
the ->exiting_children list. Am i missing something?
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:36 ` Oleg Nesterov
2007-04-06 19:43 ` Ingo Molnar
@ 2007-04-06 19:47 ` Oleg Nesterov
2007-04-06 19:59 ` Eric W. Biederman
2007-04-07 20:31 ` Oleg Nesterov
2 siblings, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-06 19:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Robin Holt, Chris Snook, Ingo Molnar,
linux-kernel
On 04/06, Oleg Nesterov wrote:
>
> --- t/kernel/exit.c~ 2007-04-06 23:31:31.000000000 +0400
> +++ t/kernel/exit.c 2007-04-06 23:31:57.000000000 +0400
> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
> remove_parent(current);
> current->parent = child_reaper(current);
> current->real_parent = child_reaper(current);
Just noticed. Eric, is it ok we are using child_reaper() here?
This is for kernel threads only, shouldn't we always use
init_pid_ns.child_reaper ? (yes, it is the same currently).
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:47 ` Oleg Nesterov
@ 2007-04-06 19:59 ` Eric W. Biederman
0 siblings, 0 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-06 19:59 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Robin Holt, Chris Snook, Ingo Molnar,
linux-kernel
Oleg Nesterov <oleg@tv-sign.ru> writes:
> On 04/06, Oleg Nesterov wrote:
>>
>> --- t/kernel/exit.c~ 2007-04-06 23:31:31.000000000 +0400
>> +++ t/kernel/exit.c 2007-04-06 23:31:57.000000000 +0400
>> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
>> remove_parent(current);
>> current->parent = child_reaper(current);
>> current->real_parent = child_reaper(current);
>
> Just noticed. Eric, is it ok we are using child_reaper() here?
> This is for kernel threads only, shouldn't we always use
> init_pid_ns.child_reaper ? (yes, it is the same currently).
Yes. It is a bug, and the description reparent_to_init is likewise
buggy.
I have recently realized that a number of the inital pid namespace
patches have stupid little bugs in them like this. I am upset with
myself for failing to code review them closely before the went in.
Live and learn I guess.
So when I saw this I added this to my growing list of things I needed
to review before I could assume they were done correctly.
Feel free to fix it. I'm still a couple days out from having enough
time before I can start working on pid related issues.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:43 ` Ingo Molnar
@ 2007-04-06 20:01 ` Oleg Nesterov
2007-04-06 20:21 ` Ingo Molnar
0 siblings, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-06 20:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Eric W. Biederman, Robin Holt, Chris Snook,
linux-kernel
On 04/06, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@tv-sign.ru> wrote:
>
> > > I'd almost prefer to just not add kernel threads to any parent
> > > process list *at*all*.
> >
> > Yes sure, I didn't argue with that. However, "->exit_state = -1" does
> > matter, we can't detach process unless we make it auto-reap.
>
> > Off course, we also need to add preparent_to_init() to kthread() and
> > (say) stopmachine(). Or we can create kernel_thread_detached() and
> > modify callers to use it.
>
> this isnt a kernel-thread special case. The right solution IMO is to
> first migrate wait4()'s ->children use over to a new p->exiting_children
> list and then to gradually get rid of all remaining uses of p->children.
> (the first patch of which i sent a few minutes ago)
>
> that way wait4() will be sped up, and quite dramatically i believe. No
> need to deal with kthreads here at all - those just wont ever show up in
> the ->exiting_children list. Am i missing something?
Probably it is I who missed something :)
But why can't we do both changes? I think it is just ugly to use init to
reap the kernel thread. Ok, wait4() can find zombie quickly if we do the
->children split. But /sbin/init could be swapped out, we still need to
deliver SIGCHLD, etc.
And I personally agree with Linus, it is nice to hide the kernel threads
from /sbin/init (or whatever) completely.
No?
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 20:01 ` Oleg Nesterov
@ 2007-04-06 20:21 ` Ingo Molnar
0 siblings, 0 replies; 100+ messages in thread
From: Ingo Molnar @ 2007-04-06 20:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Eric W. Biederman, Robin Holt, Chris Snook,
linux-kernel
* Oleg Nesterov <oleg@tv-sign.ru> wrote:
> Probably it is I who missed something :)
>
> But why can't we do both changes? I think it is just ugly to use init
> to reap the kernel thread. Ok, wait4() can find zombie quickly if we
> do the ->children split. But /sbin/init could be swapped out, we still
> need to deliver SIGCHLD, etc.
>
> And I personally agree with Linus, it is nice to hide the kernel
> threads from /sbin/init (or whatever) completely.
yeah, they should all be made auto-reap - nobody is really interested in
their wait4()-driven completion. (any completion they do is via other
mechanisms)
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:15 ` Eric W. Biederman
2007-04-06 19:21 ` H. Peter Anvin
@ 2007-04-06 21:04 ` Jeremy Fitzhardinge
2007-04-06 21:07 ` H. Peter Anvin
1 sibling, 1 reply; 100+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-06 21:04 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Linus Torvalds, Oleg Nesterov, Robin Holt,
Chris Snook, Ingo Molnar, linux-kernel
Eric W. Biederman wrote:
> I'm guessing the issue is nash just calls wait and doesn't check the
> returned pid value, assuming it is the only child it forked returning.
> Which is valid except when you are running as pid == 1.
>
Hm, that's always a bug; a process can always have children it doesn't
expect (think forking, and then the parent execs).
J
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 21:04 ` Jeremy Fitzhardinge
@ 2007-04-06 21:07 ` H. Peter Anvin
0 siblings, 0 replies; 100+ messages in thread
From: H. Peter Anvin @ 2007-04-06 21:07 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Eric W. Biederman, Linus Torvalds, Oleg Nesterov, Robin Holt,
Chris Snook, Ingo Molnar, linux-kernel
Jeremy Fitzhardinge wrote:
> Eric W. Biederman wrote:
>> I'm guessing the issue is nash just calls wait and doesn't check the
>> returned pid value, assuming it is the only child it forked returning.
>> Which is valid except when you are running as pid == 1.
>>
>
> Hm, that's always a bug; a process can always have children it doesn't
> expect (think forking, and then the parent execs).
>
Not if the process already forked.
But yes, in general this is a major screwup.
-hpa
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:19 ` Ingo Molnar
@ 2007-04-06 21:29 ` Davide Libenzi
2007-04-06 21:51 ` Linus Torvalds
0 siblings, 1 reply; 100+ messages in thread
From: Davide Libenzi @ 2007-04-06 21:29 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric W. Biederman, Oleg Nesterov, Robin Holt, Linus Torvalds,
Chris Snook, Linux Kernel Mailing List, Jack Steiner
On Fri, 6 Apr 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <davidel@xmailserver.org> wrote:
>
> > > > Ohhh, the "signal" struct! Funny name for something that nowadays
> > > > has probably no more than a 5% affinity with signal-related tasks
> > > > :/
> > >
> > > Hmm. I wonder if we should just rename it the struct thread_group,
> > > or struct task_group. Those seem slightly more accurate names.
> >
> > Almost *anything* is better than "signal_struct" ;) A task_group could
> > be fine, so something on the line of task_shared_ctx.
>
> or lets just face it and name it what it is: process_struct ;-)
That'd be fine too! Wonder if Linus would swallow a rename patch like
that...
- Davide
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 21:29 ` Davide Libenzi
@ 2007-04-06 21:51 ` Linus Torvalds
2007-04-06 22:31 ` Davide Libenzi
2007-04-09 8:28 ` Ingo Molnar
0 siblings, 2 replies; 100+ messages in thread
From: Linus Torvalds @ 2007-04-06 21:51 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Eric W. Biederman, Oleg Nesterov, Robin Holt,
Chris Snook, Linux Kernel Mailing List, Jack Steiner
On Fri, 6 Apr 2007, Davide Libenzi wrote:
> >
> > or lets just face it and name it what it is: process_struct ;-)
>
> That'd be fine too! Wonder if Linus would swallow a rename patch like
> that...
I don't really see the point. It's not even *true*. A "process" includes
more than the shared signal-handling - it would include files and fs etc
too.
So it's actually *more* correct to call it the shared signal state than it
would be to call it "process" state.
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 21:51 ` Linus Torvalds
@ 2007-04-06 22:31 ` Davide Libenzi
2007-04-06 22:46 ` Linus Torvalds
2007-04-09 8:28 ` Ingo Molnar
1 sibling, 1 reply; 100+ messages in thread
From: Davide Libenzi @ 2007-04-06 22:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Eric W. Biederman, Oleg Nesterov, Robin Holt,
Chris Snook, Linux Kernel Mailing List, Jack Steiner
On Fri, 6 Apr 2007, Linus Torvalds wrote:
> On Fri, 6 Apr 2007, Davide Libenzi wrote:
> > >
> > > or lets just face it and name it what it is: process_struct ;-)
> >
> > That'd be fine too! Wonder if Linus would swallow a rename patch like
> > that...
>
> I don't really see the point. It's not even *true*. A "process" includes
> more than the shared signal-handling - it would include files and fs etc
> too.
>
> So it's actually *more* correct to call it the shared signal state than it
> would be to call it "process" state.
But "signal" has *nothing* to do with what the structure store nowadays,
really. It's a pool of "things" that are not Linux task specific.
IMO something like "struct task_shared_ctx" or simply "struct task_shared"
would better fit the nature of the thing.
- Davide
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-05 19:51 Robin Holt
2007-04-05 20:57 ` Linus Torvalds
@ 2007-04-06 22:38 ` Jeff Garzik
2007-04-06 22:51 ` Linus Torvalds
2007-04-10 0:23 ` Andrew Morton
1 sibling, 2 replies; 100+ messages in thread
From: Jeff Garzik @ 2007-04-06 22:38 UTC (permalink / raw)
To: Robin Holt
Cc: Eric W. Biederman, Ingo Molnar, Linus Torvalds, linux-kernel,
Jack Steiner
Robin Holt wrote:
> We have been testing a new larger configuration and we are seeing a very
> large scan time of init's tsk->children list. In the cases we are seeing,
> there are numerous kernel processes created for each cpu (ie: events/0
> ... events/<big number>, xfslogd/0 ... xfslogd/<big number>). These are
> all on the list ahead of the processes we are currently trying to reap.
What about attacking the explosion of kernel threads?
As CPU counts increase, the number of per-CPU kernel threads gets really
ridiculous.
I would rather change the implementation under the hood to start per-CPU
threads on demand, similar to a thread-pool implementation.
Boxes with $BigNum CPUs probably won't ever use half of those threads.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 22:31 ` Davide Libenzi
@ 2007-04-06 22:46 ` Linus Torvalds
2007-04-06 22:59 ` Davide Libenzi
0 siblings, 1 reply; 100+ messages in thread
From: Linus Torvalds @ 2007-04-06 22:46 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Eric W. Biederman, Oleg Nesterov, Robin Holt,
Chris Snook, Linux Kernel Mailing List, Jack Steiner
On Fri, 6 Apr 2007, Davide Libenzi wrote:
> On Fri, 6 Apr 2007, Linus Torvalds wrote:
> >
> > I don't really see the point. It's not even *true*. A "process" includes
> > more than the shared signal-handling - it would include files and fs etc
> > too.
> >
> > So it's actually *more* correct to call it the shared signal state than it
> > would be to call it "process" state.
>
> But "signal" has *nothing* to do with what the structure store nowadays,
> really. It's a pool of "things" that are not Linux task specific.
You're ignoring reality. It has more to do with signals than with
processes. Look at *all* the fields in the top half of the structure, up
to (and including) the "tty" field. They're *all* about signal semantics
in one form or another (whether it's directly about shared signal
behaviour, or indirectly about *sources* of signals like process control
or timers).
And renaming it really has no upsides, even *if* you had a point, which
you don't.
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 22:38 ` Jeff Garzik
@ 2007-04-06 22:51 ` Linus Torvalds
2007-04-06 23:37 ` Jeff Garzik
2007-04-10 0:23 ` Andrew Morton
1 sibling, 1 reply; 100+ messages in thread
From: Linus Torvalds @ 2007-04-06 22:51 UTC (permalink / raw)
To: Jeff Garzik
Cc: Robin Holt, Eric W. Biederman, Ingo Molnar, linux-kernel,
Jack Steiner
On Fri, 6 Apr 2007, Jeff Garzik wrote:
>
> I would rather change the implementation under the hood to start per-CPU
> threads on demand, similar to a thread-pool implementation.
>
> Boxes with $BigNum CPUs probably won't ever use half of those threads.
The counter-argument is that boxes with $BigNum CPU's really don't hurt
from it either, and having per-process data structures is often simpler
and more efficient than trying to have some thread pool.
IOW, once we get the processes off the global list, there just isn't any
downside from them. Sure, they use some memory, but people who buy
1024-cpu machines won't care about a few kB per CPU..
So the *only* downside is literally the process list, and one suggested
patch already just removes kernel threads entirely from the parenthood
lists.
The other potential downside could be "ps is slow", but on the other hand,
having the things stick around and have things like CPU-time accumulate is
probably worth it - if there are some issues, they'd show up properly
accounted for in a way that process pools would have a hard time doing.
So I really don't think this is worth changing things over, apart from
literally removing them from process lists, which I think everybody agrees
we should just do - it just never even came up before!
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 22:46 ` Linus Torvalds
@ 2007-04-06 22:59 ` Davide Libenzi
0 siblings, 0 replies; 100+ messages in thread
From: Davide Libenzi @ 2007-04-06 22:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Eric W. Biederman, Oleg Nesterov, Robin Holt,
Chris Snook, Linux Kernel Mailing List, Jack Steiner
On Fri, 6 Apr 2007, Linus Torvalds wrote:
>
>
> On Fri, 6 Apr 2007, Davide Libenzi wrote:
>
> > On Fri, 6 Apr 2007, Linus Torvalds wrote:
> > >
> > > I don't really see the point. It's not even *true*. A "process" includes
> > > more than the shared signal-handling - it would include files and fs etc
> > > too.
> > >
> > > So it's actually *more* correct to call it the shared signal state than it
> > > would be to call it "process" state.
> >
> > But "signal" has *nothing* to do with what the structure store nowadays,
> > really. It's a pool of "things" that are not Linux task specific.
>
> You're ignoring reality. It has more to do with signals than with
> processes. Look at *all* the fields in the top half of the structure, up
> to (and including) the "tty" field. They're *all* about signal semantics
> in one form or another (whether it's directly about shared signal
> behaviour, or indirectly about *sources* of signals like process control
> or timers).
>
> And renaming it really has no upsides, even *if* you had a point, which
> you don't.
OTOH, the other half of the fields has nothing to do with them (signals).
Not only, the more time it passes, the more ppl (reason why I posted this
comment in the beginning) sees the "struct signal_struct" has a boilerplate
where to store shared resources.
Chosing a name like "struct task_shared_ctx" fits it, because "signals"
are *a* task_shared thing, whereas all the fields on the bottom of the
"struct signal_struct" (on top of the ones that ppl will want to add
everytime there's somethign to be shared between task structs) are *not* a
"signal".
- Davide
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 22:51 ` Linus Torvalds
@ 2007-04-06 23:37 ` Jeff Garzik
2007-04-11 7:28 ` Nick Piggin
0 siblings, 1 reply; 100+ messages in thread
From: Jeff Garzik @ 2007-04-06 23:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Robin Holt, Eric W. Biederman, Ingo Molnar, linux-kernel,
Jack Steiner
Linus Torvalds wrote:
>
> On Fri, 6 Apr 2007, Jeff Garzik wrote:
>> I would rather change the implementation under the hood to start per-CPU
>> threads on demand, similar to a thread-pool implementation.
>>
>> Boxes with $BigNum CPUs probably won't ever use half of those threads.
>
> The counter-argument is that boxes with $BigNum CPU's really don't hurt
> from it either, and having per-process data structures is often simpler
> and more efficient than trying to have some thread pool.
Two points here:
* A lot of the users in the current kernel tree don't rely on the
per-CPU qualities. They just need multiple threads running.
* Even with per-CPU data structures and code, you don't necessarily have
to keep a thread alive and running for each CPU. Reap the ones that
haven't been used in $TimeFrame, and add thread creation to the slow
path that already exists in the bowels of schedule_work().
Or if some kernel hacker is really motivated, all workqueue users in the
kernel would benefit from a "thread audit", looking at working
conditions to decide if the new kthread APIs are more appropriate.
> IOW, once we get the processes off the global list, there just isn't any
> downside from them. Sure, they use some memory, but people who buy
> 1024-cpu machines won't care about a few kB per CPU..
>
> So the *only* downside is literally the process list, and one suggested
> patch already just removes kernel threads entirely from the parenthood
> lists.
>
> The other potential downside could be "ps is slow", but on the other hand,
> having the things stick around and have things like CPU-time accumulate is
> probably worth it - if there are some issues, they'd show up properly
> accounted for in a way that process pools would have a hard time doing.
Regardless of how things are shuffled about internally, there will
always be annoying overhead /somewhere/ when you have a metric ton of
kernel threads. I think that people should also be working on ways to
make the kernel threads a bit more manageable for the average human.
> So I really don't think this is worth changing things over, apart from
> literally removing them from process lists, which I think everybody agrees
> we should just do - it just never even came up before!
I think there is a human downside. For an admin you have to wade
through a ton of processes on your machine, if you are attempting to
evaluate the overall state of the machine. Just google around for all
the admins complaining about the explosion of kernel threads on
production machines :)
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 19:36 ` Oleg Nesterov
2007-04-06 19:43 ` Ingo Molnar
2007-04-06 19:47 ` Oleg Nesterov
@ 2007-04-07 20:31 ` Oleg Nesterov
2007-04-08 0:38 ` Eric W. Biederman
2 siblings, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-07 20:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Robin Holt, Chris Snook, Ingo Molnar,
linux-kernel
On 04/06, Oleg Nesterov wrote:
>
> Perhaps,
>
> --- t/kernel/exit.c~ 2007-04-06 23:31:31.000000000 +0400
> +++ t/kernel/exit.c 2007-04-06 23:31:57.000000000 +0400
> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
> remove_parent(current);
> current->parent = child_reaper(current);
> current->real_parent = child_reaper(current);
> - add_parent(current);
> -
> - /* Set the exit signal to SIGCHLD so we signal init on exit */
> - current->exit_signal = SIGCHLD;
> + current->exit_signal = -1;
>
> if (!has_rt_policy(current) && (task_nice(current) < 0))
> set_user_nice(current, 0);
>
> is enough. init is still our parent (make ps happy), but it can't see us,
> we are not on ->children list.
OK, this doesn't work. A multi-threaded init may do execve(). Can we forbid
exec() from non-main thread? init is special anyway. We can simplify de_thread()
in this case, and kill tasklist_lock around zap_other_threads().
So, we can re-parent a kernel thread to swapper. In that case it doesn't matter
if we put task on ->children list or not.
User-visible change. Acceptable?
> Off course, we also need to add preparent_to_init() to kthread() and
> (say) stopmachine(). Or we can create kernel_thread_detached() and
> modify callers to use it.
It would be very nice to introduce CLONE_KERNEL_THREAD instead, then
--- kernel/fork.c~ 2007-04-07 20:11:14.000000000 +0400
+++ kernel/fork.c 2007-04-07 23:40:35.000000000 +0400
@@ -1159,7 +1159,8 @@ static struct task_struct *copy_process(
p->parent_exec_id = p->self_exec_id;
/* ok, now we should be set up.. */
- p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
+ p->exit_signal = (clone_flags & (CLONE_THREAD|CLONE_KERNEL_THREAD))
+ ? -1 : (clone_flags & CSIGNAL);
p->pdeath_signal = 0;
p->exit_state = 0;
@@ -1196,6 +1197,8 @@ static struct task_struct *copy_process(
/* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD))
p->parent = current->parent;
+ else if (unlikely(clone_flags & CLONE_KERNEL_THREAD))
+ p->parent = &init_task;
else
p->parent = current;
That is all, very simple. However, in that case we should introduce
#define SYS_CLONE_MASK (~CLONE_KERNEL_THREAD)
and change every sys_clone() implementation to filter out non-SYS_CLONE_MASK
flags. This is trivial (and imho useful), but
arch/sparc/kernel/process.c
arch/sparc64/kernel/process.c
arch/m68knommu/kernel/process.c
arch/m68k/kernel/process.c
arch/alpha/kernel/entry.S
arch/h8300/kernel/process.c
arch/v850/kernel/process.c
arch/frv/kernel/kernel_thread.S
arch/frv/kernel/kernel_thread.S
arch/powerpc/kernel/misc_32.S
arch/powerpc/kernel/misc_64.S
implement kthread_create() via sys_clone(). This means that without additional
effort from mainteners these arches can't take advantage of CLONE_KERNEL_THREAD.
Dear CC list, any advice?
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-07 20:31 ` Oleg Nesterov
@ 2007-04-08 0:38 ` Eric W. Biederman
2007-04-08 15:46 ` Oleg Nesterov
0 siblings, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-08 0:38 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Robin Holt, Chris Snook, Ingo Molnar,
linux-kernel
Oleg Nesterov <oleg@tv-sign.ru> writes:
> On 04/06, Oleg Nesterov wrote:
>>
>> Perhaps,
>>
>> --- t/kernel/exit.c~ 2007-04-06 23:31:31.000000000 +0400
>> +++ t/kernel/exit.c 2007-04-06 23:31:57.000000000 +0400
>> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
>> remove_parent(current);
>> current->parent = child_reaper(current);
>> current->real_parent = child_reaper(current);
>> - add_parent(current);
>> -
>> - /* Set the exit signal to SIGCHLD so we signal init on exit */
>> - current->exit_signal = SIGCHLD;
>> + current->exit_signal = -1;
>>
>> if (!has_rt_policy(current) && (task_nice(current) < 0))
>> set_user_nice(current, 0);
>>
>> is enough. init is still our parent (make ps happy), but it can't see us,
>> we are not on ->children list.
>
> OK, this doesn't work. A multi-threaded init may do execve().
Good catch. daemonize must die!
> So, we can re-parent a kernel thread to swapper. In that case it doesn't matter
> if we put task on ->children list or not.
Yes. We can.
> User-visible change. Acceptable?
We would have user visible changes when we changed to ktrhead_create anyway,
and it's none of user space's business so certainly.
>> Off course, we also need to add preparent_to_init() to kthread() and
>> (say) stopmachine(). Or we can create kernel_thread_detached() and
>> modify callers to use it.
>
> It would be very nice to introduce CLONE_KERNEL_THREAD instead, then
If we are going to do something to copy_process and the like let's take
this one step farther. Let's pass in the value of the task to copy.
Then we can add a wrapper around copy_process to build kernel_thread
something like:
struct task_struct *__kernel_thread(int (*fn)(void *), void * arg,
unsigned long flags)
{
struct task_struct *task;
struct pt_regs regs, *reg;
reg = kernel_thread_regs(®s, fn, arg);
task = copy_process(&init_task, flags, 0, reg, 0, NULL, NULL, NULL, 0);
if (!IS_ERR(task))
wake_up_new_task(task, flags);
return task;
}
long kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
struct task_struct *task;
task = __kernel_thread(fn, arg, flags);
if (IS_ERR(task))
return PTR_ERR(task);
return task->pid;
}
After that daemonize just becomes:
void daemonize(const char *name, ...)
{
va_list args;
va_start(args, name);
vsnprintf(current->comm, sizeof(current->comm), name, args);
va_end(args);
}
And kthread_create becomes:
struct task_struct *kthread_create(int (*threadfn)(void *data),
void *data,
const char namefmt[],
...)
{
struct kthread_create_info create;
struct task_struct *task;
create.threadfn = threadfn;
create.data = data;
/* We want our own signal handler (we take no signals by default). */
task = __kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
if (!IS_ERR(task)) {
va_list args;
va_start(args, namefmt);
vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
va_end(args);
}
return task;
}
If we are willing to go that far. I think it is worth it to touch the
architecture specific code. As that removes an unnecessary wait
queue, and ensures that kernel threads always start with a consistent
state. So it is a performance, scalability and simplicity boost.
Otherwise we should just put the code in the callers of kernel_thread
like we do today.
I don't have the energy to rework all of th architecture or even a
noticeable fraction of them right now. I have to much on my plate.
A non-architecture specific solution looks like a fairly simple
patch and I might get around to it one of these times..
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-08 0:38 ` Eric W. Biederman
@ 2007-04-08 15:46 ` Oleg Nesterov
0 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-08 15:46 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Robin Holt, Chris Snook, Ingo Molnar,
linux-kernel
On 04/07, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> writes:
>
> > On 04/06, Oleg Nesterov wrote:
> >>
> >> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
> >> remove_parent(current);
> >> current->parent = child_reaper(current);
> >> current->real_parent = child_reaper(current);
> >> - add_parent(current);
> >> -
> >> - /* Set the exit signal to SIGCHLD so we signal init on exit */
> >> - current->exit_signal = SIGCHLD;
> >> + current->exit_signal = -1;
> >>
> >> if (!has_rt_policy(current) && (task_nice(current) < 0))
> >> set_user_nice(current, 0);
> >>
> >> is enough. init is still our parent (make ps happy), but it can't see us,
> >> we are not on ->children list.
> >
> > OK, this doesn't work. A multi-threaded init may do execve().
>
> Good catch. daemonize must die!
Well, this is not the problem of daemonize(), but I agree, daemonize() should
die with the changes you proposed.
> > So, we can re-parent a kernel thread to swapper. In that case it doesn't matter
> > if we put task on ->children list or not.
>
> Yes. We can.
>
> > User-visible change. Acceptable?
>
> We would have user visible changes when we changed to ktrhead_create anyway,
> and it's none of user space's business so certainly.
OK, I'll try to do "just for review" patch. I'm afraid pstree will be confused.
> >> Off course, we also need to add preparent_to_init() to kthread() and
> >> (say) stopmachine(). Or we can create kernel_thread_detached() and
> >> modify callers to use it.
> >
> > It would be very nice to introduce CLONE_KERNEL_THREAD instead, then
>
> If we are going to do something to copy_process and the like let's take
> this one step farther. Let's pass in the value of the task to copy.
Yes! I thought about that too. but see below,
> Then we can add a wrapper around copy_process to build kernel_thread
> something like:
>
> struct task_struct *__kernel_thread(int (*fn)(void *), void * arg,
> unsigned long flags)
> {
> struct task_struct *task;
> struct pt_regs regs, *reg;
>
> reg = kernel_thread_regs(®s, fn, arg);
> task = copy_process(&init_task, flags, 0, reg, 0, NULL, NULL, NULL, 0);
> if (!IS_ERR(task))
> wake_up_new_task(task, flags);
>
> return task;
> }
First, we still need some additional flag/parameter to set ->exit_state = -1.
Let's use CLONE_KERNEL_THREAD for discussion.
Now. Can we do copy_process(an_arbitrary_task) ? If yes, we need additional
locking in copy_process(). Just think about ->signal,->parent access. Nasty,
and probably not so useful.
If we can only use "current" or init_task, CLONE_KERNEL_THREAD is enough.
Just now it only
- sets ->exit_state = -1
- sets ->parent = &init_task
(note that the second change could be omitted, we can use CLONE_PARENT
if we know that all users of CLONE_KERNEL_THREAD are kernel threads).
> After that daemonize just becomes:
>
> void daemonize(const char *name, ...)
> {
> va_list args;
>
> va_start(args, name);
> vsnprintf(current->comm, sizeof(current->comm), name, args);
> va_end(args);
> }
>
> And kthread_create becomes:
>
> struct task_struct *kthread_create(int (*threadfn)(void *data),
> void *data,
> const char namefmt[],
> ...)
> {
> struct kthread_create_info create;
> struct task_struct *task;
>
> create.threadfn = threadfn;
> create.data = data;
>
> /* We want our own signal handler (we take no signals by default). */
> task = __kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
> if (!IS_ERR(task)) {
> va_list args;
> va_start(args, namefmt);
> vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
> va_end(args);
> }
> return task;
> }
Yes, nice. This is not complete, we still have to wait for completion,
because kthread_create() promises that a new thread has no CPU on
return (so we can use kthread_bind() for example), but nice anyway.
> If we are willing to go that far. I think it is worth it to touch the
> architecture specific code. As that removes an unnecessary wait
> queue, and ensures that kernel threads always start with a consistent
> state. So it is a performance, scalability and simplicity boost.
Yes.
But note that CLONE_KERNEL_THREAD allows us to achieve the same step by
step. For example, we can extend the meaning of CLONE_KERNEL_THREAD to
use init_task.mm in copy_mm(), then copy_fs(), etc.
Eric, I am far from sure I am right though. Probably it is better to do
one big change as you suggested.
However, I like the idea of for-in-kernel-only CLONE_ flags. For example,
we can kill the ugly "pid" parameter of copy_process() and use CLONE_IDLE
instead.
Also, we can't use __kernel_thread() you propose if we want do_wait(),
using CLONE_ flags a bit more flexible.
BTW. I think your idea of kernel_thread_regs() is very nice in any case. Is
it enough (and possible) to implement an architecture neutral kernel_thread?
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 21:51 ` Linus Torvalds
2007-04-06 22:31 ` Davide Libenzi
@ 2007-04-09 8:28 ` Ingo Molnar
2007-04-09 18:09 ` Bill Davidsen
1 sibling, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2007-04-09 8:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Davide Libenzi, Eric W. Biederman, Oleg Nesterov, Robin Holt,
Chris Snook, Linux Kernel Mailing List, Jack Steiner
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 6 Apr 2007, Davide Libenzi wrote:
> > >
> > > or lets just face it and name it what it is: process_struct ;-)
> >
> > That'd be fine too! Wonder if Linus would swallow a rename patch like
> > that...
>
> I don't really see the point. It's not even *true*. A "process"
> includes more than the shared signal-handling - it would include files
> and fs etc too.
>
> So it's actually *more* correct to call it the shared signal state
> than it would be to call it "process" state.
we could call it "structure for everything that we know to be ugly about
POSIX process semantics" ;-) The rest, like files and fs we've
abstracted out already.
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 2:15 ` Eric W. Biederman
2007-04-06 10:43 ` Robin Holt
@ 2007-04-09 17:37 ` Chris Snook
1 sibling, 0 replies; 100+ messages in thread
From: Chris Snook @ 2007-04-09 17:37 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Robin Holt, Ingo Molnar, linux-kernel,
Jack Steiner
Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> I'm not sure anybody would really be unhappy with pptr pointing to some
>> magic and special task that has pid 0 (which makes it clear to everybody
>> that the parent is something special), and that has SIGCHLD set to SIG_IGN
>> (which should make the exit case not even go through the zombie phase).
>>
>> I can't even imagine *how* you'd make a tool unhappy with that, since even
>> tools like "ps" (and even more "pstree" won't read all the process states
>> atomically, so they invariably will see parent pointers that don't even
>> exist any more, because by the time they get to the parent, it has exited
>> already.
>
> Right. pid == 1 being missing might cause some confusing having
> but having ppid == 0 should be fine. Heck pid == 1 already has
> ppid == 0, so it is a value user space has had to deal with for a
> while.
>
> In addition there was a period in 2.6 where most kernel threads
> and init had a pgid == 0 and a session == 0, and nothing seemed
> to complain.
>
> We should probably make all of the kernel threads children of
> init_task. The initial idle thread on the first cpu that is the
> parent of pid == 1. That will give the ppid == 0 naturally because
> the idle thread has pid == 0.
Linus, Eric, thanks for the history lesson. I think it's safe to say
that anything that breaks because of this sort of change was already
broken anyway.
If we're going to scale to an obscene number of CPUs (which I believe
was the original motivation on this thread) then putting the dead
children on their own list will probably scale better.
-- Chris
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-09 8:28 ` Ingo Molnar
@ 2007-04-09 18:09 ` Bill Davidsen
2007-04-09 19:28 ` Kyle Moffett
0 siblings, 1 reply; 100+ messages in thread
From: Bill Davidsen @ 2007-04-09 18:09 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Davide Libenzi, Eric W. Biederman, Oleg Nesterov,
Robin Holt, Chris Snook, Linux Kernel Mailing List, Jack Steiner
Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Fri, 6 Apr 2007, Davide Libenzi wrote:
>>>> or lets just face it and name it what it is: process_struct ;-)
>>> That'd be fine too! Wonder if Linus would swallow a rename patch like
>>> that...
>> I don't really see the point. It's not even *true*. A "process"
>> includes more than the shared signal-handling - it would include files
>> and fs etc too.
>>
>> So it's actually *more* correct to call it the shared signal state
>> than it would be to call it "process" state.
>
> we could call it "structure for everything that we know to be ugly about
> POSIX process semantics" ;-) The rest, like files and fs we've
> abstracted out already.
>
> Ingo
So are you voting for ugly_struct? ;-)
I do think this is still waiting for a more descriptive name, like
proc_misc_struct or some such. Kernel code should be treated as
literature, intended to be both read and readable.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-09 18:09 ` Bill Davidsen
@ 2007-04-09 19:28 ` Kyle Moffett
2007-04-09 19:51 ` Linus Torvalds
2007-04-09 20:00 ` Eric W. Biederman
0 siblings, 2 replies; 100+ messages in thread
From: Kyle Moffett @ 2007-04-09 19:28 UTC (permalink / raw)
To: Bill Davidsen
Cc: linux-kernel, Linus Torvalds, Davide Libenzi, Eric W. Biederman,
Oleg Nesterov, Robin Holt, Chris Snook, Jack Steiner
On Apr 09, 2007, at 14:09:51, Bill Davidsen wrote:
> Ingo Molnar wrote:
>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>> On Fri, 6 Apr 2007, Davide Libenzi wrote:
>>>>> or lets just face it and name it what it is: process_struct ;-)
>>>> That'd be fine too! Wonder if Linus would swallow a rename patch
>>>> like that...
>>> I don't really see the point. It's not even *true*. A "process"
>>> includes more than the shared signal-handling - it would include
>>> files and fs etc too.
>>>
>>> So it's actually *more* correct to call it the shared signal
>>> state than it would be to call it "process" state.
>> we could call it "structure for everything that we know to be ugly
>> about POSIX process semantics" ;-) The rest, like files and fs
>> we've abstracted out already.
> So are you voting for ugly_struct? ;-)
>
> I do think this is still waiting for a more descriptive name, like
> proc_misc_struct or some such. Kernel code should be treated as
> literature, intended to be both read and readable.
Maybe "struct posix_process" is more descriptive? "struct
process_posix"? "Ugly POSIX process semantics data" seems simple
enough to stick in a struct name. "struct uglyposix_process"?
Cheers,
Kyle Moffett
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-09 19:28 ` Kyle Moffett
@ 2007-04-09 19:51 ` Linus Torvalds
2007-04-09 20:03 ` Davide Libenzi
2007-04-09 20:00 ` Eric W. Biederman
1 sibling, 1 reply; 100+ messages in thread
From: Linus Torvalds @ 2007-04-09 19:51 UTC (permalink / raw)
To: Kyle Moffett
Cc: Bill Davidsen, linux-kernel, Davide Libenzi, Eric W. Biederman,
Oleg Nesterov, Robin Holt, Chris Snook, Jack Steiner
On Mon, 9 Apr 2007, Kyle Moffett wrote:
>
> Maybe "struct posix_process" is more descriptive? "struct process_posix"?
> "Ugly POSIX process semantics data" seems simple enough to stick in a struct
> name. "struct uglyposix_process"?
Guys, you didn't read my message.
It's *not* about "process" stuff. Anything that tries to call it a
"process" is *by*definition* worse than what it is now. Processes have all
the things that we've cleanly separated out for filesystem, VM, SysV
semaphore state, namespaces etc.
The "struct signal_struct" is the random *leftovers* from all the other
stuff. It's *not* about "processes". Never has been, and never will be.
It's mainly about signal cruft, because that's the nastiest part of POSIX
threads behaviour, and that can't be clearly separated as one clear
structure.
So
- it really *is* mostly about signal handling and signal sources.
- it has some random *cruft* in it that isn't about signals, but even
that is mostly a matter of "it was random cruft in the original task
structure too, and it DID NOT MERIT a flag of its own"
- if you wanted to clean things up, you'd actually make things like
the "rlimit" info structures of their own, and have pointers to them.
So that cruft largely got put into "signal_struct" just because they were
the last thing to be moved out, along with the signal stuff (which was the
big and painful part). NOT because "struct signal_struct" is somehow about
"process state".
So stop blathering about processes. It has *nothing* to do with processes.
It's primarily about signals, but it has "cruft" in it.
So an accurate name is
struct signal_struct_with_some_cruft_in_it_that_did_not_merit_a_struct_of_its_own
but that's actually fairly unwieldly to type, and so in the name of sanity
and clear source code, it's called
struct signal_struct
and that's it.
And people who have argued for renaming it don't even seem to understand
what it's *about*, so the arguments for renaming it have been very weak
indeed so far.
IT IS NOT ABOUT "PROCESSES".
To be a "posix process", you have to share *everything*. The signal-struct
isn't even a very important part of that sharing. In fact, it's quite
arguably the *least* important part to share, which is why it was
separated out as a separate thing absolutely *last*, and which is why we
could do without it for quite long, with just some annoyingly subtle
non-POSIX semantics.
Get it? It's the structure that is *least* important for a "process".
So stop blathering about "processes". Anybody who does that, just shows
that they haven't thought it through!
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-09 19:28 ` Kyle Moffett
2007-04-09 19:51 ` Linus Torvalds
@ 2007-04-09 20:00 ` Eric W. Biederman
1 sibling, 0 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-09 20:00 UTC (permalink / raw)
To: Kyle Moffett
Cc: Bill Davidsen, linux-kernel, Linus Torvalds, Davide Libenzi,
Oleg Nesterov, Robin Holt, Chris Snook, Jack Steiner
Kyle Moffett <mrmacman_g4@mac.com> writes:
> Maybe "struct posix_process" is more descriptive? "struct process_posix"?
> "Ugly POSIX process semantics data" seems simple enough to stick in a struct
> name. "struct uglyposix_process"?
Nack.
Linux internally doesn't have processes it has tasks with different
properties. Anything that talks about processes will be even more
confusing.
The only thing really wrong with struct signal is that it is easy to
confuse with struct sighand.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-09 19:51 ` Linus Torvalds
@ 2007-04-09 20:03 ` Davide Libenzi
2007-04-10 15:12 ` Bill Davidsen
0 siblings, 1 reply; 100+ messages in thread
From: Davide Libenzi @ 2007-04-09 20:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kyle Moffett, Bill Davidsen, Linux Kernel Mailing List,
Eric W. Biederman, Oleg Nesterov, Robin Holt, Chris Snook,
Jack Steiner
On Mon, 9 Apr 2007, Linus Torvalds wrote:
> On Mon, 9 Apr 2007, Kyle Moffett wrote:
> >
> > Maybe "struct posix_process" is more descriptive? "struct process_posix"?
> > "Ugly POSIX process semantics data" seems simple enough to stick in a struct
> > name. "struct uglyposix_process"?
>
> Guys, you didn't read my message.
>
> It's *not* about "process" stuff. Anything that tries to call it a
> "process" is *by*definition* worse than what it is now. Processes have all
> the things that we've cleanly separated out for filesystem, VM, SysV
> semaphore state, namespaces etc.
>
> The "struct signal_struct" is the random *leftovers* from all the other
> stuff. It's *not* about "processes". Never has been, and never will be.
I proposed "struct task_shared_ctx" but you ducked :)
- Davide
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 22:38 ` Jeff Garzik
2007-04-06 22:51 ` Linus Torvalds
@ 2007-04-10 0:23 ` Andrew Morton
2007-04-10 0:48 ` Eric W. Biederman
2007-04-10 1:59 ` Dave Jones
1 sibling, 2 replies; 100+ messages in thread
From: Andrew Morton @ 2007-04-10 0:23 UTC (permalink / raw)
To: Jeff Garzik
Cc: Robin Holt, Eric W. Biederman, Ingo Molnar, Linus Torvalds,
linux-kernel, Jack Steiner
On Fri, 06 Apr 2007 18:38:40 -0400
Jeff Garzik <jeff@garzik.org> wrote:
> Robin Holt wrote:
> > We have been testing a new larger configuration and we are seeing a very
> > large scan time of init's tsk->children list. In the cases we are seeing,
> > there are numerous kernel processes created for each cpu (ie: events/0
> > ... events/<big number>, xfslogd/0 ... xfslogd/<big number>). These are
> > all on the list ahead of the processes we are currently trying to reap.
>
> What about attacking the explosion of kernel threads?
>
> As CPU counts increase, the number of per-CPU kernel threads gets really
> ridiculous.
>
> I would rather change the implementation under the hood to start per-CPU
> threads on demand, similar to a thread-pool implementation.
>
> Boxes with $BigNum CPUs probably won't ever use half of those threads.
I suspect there are quite a few kernel threads which don't really need to
be threads at all: the code would quite happily work if it was changed to
use keventd, via schedule_work() and friends. But kernel threads are
somewhat easier to code for.
I also suspect that there are a number of workqueue threads which
could/should have used create_singlethread_workqueue(). Often this is
because the developer just didn't think to do it.
otoh, a lot of these inefficeincies are probably down in scruffy drivers
rather than in core or top-level code.
<I also wonder where all these parented-by-init,
presumably-not-using-kthread kernel threads are coming from>
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 0:23 ` Andrew Morton
@ 2007-04-10 0:48 ` Eric W. Biederman
2007-04-10 1:15 ` Andrew Morton
` (2 more replies)
2007-04-10 1:59 ` Dave Jones
1 sibling, 3 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-10 0:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Jeff Garzik, Robin Holt, Ingo Molnar, Linus Torvalds,
linux-kernel, Jack Steiner
Andrew Morton <akpm@linux-foundation.org> writes:
> I suspect there are quite a few kernel threads which don't really need to
> be threads at all: the code would quite happily work if it was changed to
> use keventd, via schedule_work() and friends. But kernel threads are
> somewhat easier to code for.
>
> I also suspect that there are a number of workqueue threads which
> could/should have used create_singlethread_workqueue(). Often this is
> because the developer just didn't think to do it.
>
> otoh, a lot of these inefficeincies are probably down in scruffy drivers
> rather than in core or top-level code.
>
> <I also wonder where all these parented-by-init,
> presumably-not-using-kthread kernel threads are coming from>
>From another piece of this thread.
> > Robin how many kernel thread per cpu are you seeing?
>
> 10.
>
> FYI, pid 1539 is kthread.
>
> a01:~ # ps -ef | egrep "\[.*\/255\]"
> root 512 1 0 Apr08 ? 00:00:00 [migration/255]
> root 513 1 0 Apr08 ? 00:00:00 [ksoftirqd/255]
> root 1281 1 0 Apr08 ? 00:00:02 [events/255]
> root 2435 1539 0 Apr08 ? 00:00:00 [kblockd/255]
> root 3159 1539 0 Apr08 ? 00:00:00 [aio/255]
> root 4007 1539 0 Apr08 ? 00:00:00 [cqueue/255]
> root 8653 1539 0 Apr08 ? 00:00:00 [ata/255]
> root 17438 1539 0 Apr08 ? 00:00:00 [xfslogd/255]
> root 17950 1539 0 Apr08 ? 00:00:00 [xfsdatad/255]
> root 18426 1539 0 Apr08 ? 00:00:00 [rpciod/255]
So it looks like there were about 1500 kernel threads that started up before
kthread started.
So the kernel threads appear to have init as their parent is because
they started before kthread for the most part.
At 10 kernel threads per cpu there may be a little bloat but it isn't
out of control. It is mostly that we are observing the kernel as
NR_CPUS approaches infinity. 4096 isn't infinity yet but it's easily
a 1000 fold bigger then most people are used to :)
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 0:48 ` Eric W. Biederman
@ 2007-04-10 1:15 ` Andrew Morton
2007-04-10 6:53 ` Jeff Garzik
2007-04-10 9:42 ` Robin Holt
2 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2007-04-10 1:15 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeff Garzik, Robin Holt, Ingo Molnar, Linus Torvalds,
linux-kernel, Jack Steiner
On Mon, 09 Apr 2007 18:48:54 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
> > I suspect there are quite a few kernel threads which don't really need to
> > be threads at all: the code would quite happily work if it was changed to
> > use keventd, via schedule_work() and friends. But kernel threads are
> > somewhat easier to code for.
> >
> > I also suspect that there are a number of workqueue threads which
> > could/should have used create_singlethread_workqueue(). Often this is
> > because the developer just didn't think to do it.
> >
> > otoh, a lot of these inefficeincies are probably down in scruffy drivers
> > rather than in core or top-level code.
> >
> > <I also wonder where all these parented-by-init,
> > presumably-not-using-kthread kernel threads are coming from>
>
>
> >From another piece of this thread.
Yeah, sorry. Without mentioning any names, someone@tv-sign.ru broke the
threading so I came in halfway.
> > > Robin how many kernel thread per cpu are you seeing?
> >
> > 10.
> >
> > FYI, pid 1539 is kthread.
> >
> > a01:~ # ps -ef | egrep "\[.*\/255\]"
> > root 512 1 0 Apr08 ? 00:00:00 [migration/255]
> > root 513 1 0 Apr08 ? 00:00:00 [ksoftirqd/255]
> > root 1281 1 0 Apr08 ? 00:00:02 [events/255]
> > root 2435 1539 0 Apr08 ? 00:00:00 [kblockd/255]
> > root 3159 1539 0 Apr08 ? 00:00:00 [aio/255]
> > root 4007 1539 0 Apr08 ? 00:00:00 [cqueue/255]
> > root 8653 1539 0 Apr08 ? 00:00:00 [ata/255]
> > root 17438 1539 0 Apr08 ? 00:00:00 [xfslogd/255]
> > root 17950 1539 0 Apr08 ? 00:00:00 [xfsdatad/255]
> > root 18426 1539 0 Apr08 ? 00:00:00 [rpciod/255]
>
>
> So it looks like there were about 1500 kernel threads that started up before
> kthread started.
>
> So the kernel threads appear to have init as their parent is because
> they started before kthread for the most part.
>
> At 10 kernel threads per cpu there may be a little bloat but it isn't
> out of control. It is mostly that we are observing the kernel as
> NR_CPUS approaches infinity. 4096 isn't infinity yet but it's easily
> a 1000 fold bigger then most people are used to :)
Yes, I expect we could run init_workqueues() much earlier, from process 0
rather than from process 1. Something _might_ blow up as it often does when
we change startup ordering, but in this case it would be somewhat surprising.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 0:23 ` Andrew Morton
2007-04-10 0:48 ` Eric W. Biederman
@ 2007-04-10 1:59 ` Dave Jones
2007-04-10 2:30 ` Andrew Morton
1 sibling, 1 reply; 100+ messages in thread
From: Dave Jones @ 2007-04-10 1:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Jeff Garzik, Robin Holt, Eric W. Biederman, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
On Mon, Apr 09, 2007 at 05:23:39PM -0700, Andrew Morton wrote:
> I suspect there are quite a few kernel threads which don't really need to
> be threads at all: the code would quite happily work if it was changed to
> use keventd, via schedule_work() and friends. But kernel threads are
> somewhat easier to code for.
Perhaps too easy. We have a bunch of kthreads sitting around that afaict,
are there 'just in case', not because they're actually in use.
10 ? S< 0:00 [khelper]
Why doesn't this get spawned when it needs to?
164 ? S< 0:00 [cqueue/0]
165 ? S< 0:00 [cqueue/1]
I'm not even sure wth these are.
166 ? S< 0:00 [ksuspend_usbd]
Just in case I decide to suspend ? Sounds.. handy.
But why not spawn this just after we start a suspend?
209 ? S< 0:00 [aio/0]
210 ? S< 0:00 [aio/1]
I'm sure I'd appreciate these a lot more if I had any AIO using apps.
364 ? S< 0:00 [kpsmoused]
I never did figure out why this was a thread.
417 ? S< 0:00 [scsi_eh_1]
418 ? S< 0:00 [scsi_eh_2]
419 ? S< 5:28 [scsi_eh_3]
426 ? S< 0:00 [scsi_eh_4]
427 ? S< 0:00 [scsi_eh_5]
428 ? S< 0:00 [scsi_eh_6]
429 ? S< 0:00 [scsi_eh_7]
Just in case my 7-1 in card reader gets an error.
(Which is unlikely on at least 6 of the slots as evidenced by the runtime column.
-- Though now I'm curious as to why the error handler was running so much given
I've not experienced any errors.).
This must be a fun one of on huge diskfarms.
884 ? S< 0:00 [kgameportd]
Just in case I ever decide to plug something into my soundcard.
2179 ? S< 0:00 [kmpathd/0]
2180 ? S< 0:00 [kmpathd/1]
2189 ? S< 0:00 [kmirrord]
Just loading the modules starts up the threads, regardless
of whether you use them. (Not sure why they're getting loaded,
something for me to look into)
3246 ? S< 0:00 [krfcommd]
I don't even *have* bluetooth hardware.
(Yes, the module shouldn't have loaded, but that's another battle..)
> otoh, a lot of these inefficeincies are probably down in scruffy drivers
> rather than in core or top-level code.
You say scruffy, but most of the proliferation of kthreads comes
from code written in the last few years. Compare the explosion of kthreads
we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
slowing down at all.
On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
It doesn't surprise me at all that bigger iron is starting to see issues.
Dave
--
http://www.codemonkey.org.uk
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 1:59 ` Dave Jones
@ 2007-04-10 2:30 ` Andrew Morton
2007-04-10 2:46 ` Linus Torvalds
` (4 more replies)
0 siblings, 5 replies; 100+ messages in thread
From: Andrew Morton @ 2007-04-10 2:30 UTC (permalink / raw)
To: Dave Jones
Cc: Jeff Garzik, Robin Holt, Eric W. Biederman, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <davej@redhat.com> wrote:
> On Mon, Apr 09, 2007 at 05:23:39PM -0700, Andrew Morton wrote:
>
> > I suspect there are quite a few kernel threads which don't really need to
> > be threads at all: the code would quite happily work if it was changed to
> > use keventd, via schedule_work() and friends. But kernel threads are
> > somewhat easier to code for.
>
> Perhaps too easy. We have a bunch of kthreads sitting around that afaict,
> are there 'just in case', not because they're actually in use.
This is fun.
> 10 ? S< 0:00 [khelper]
That one's needed to parent the call_usermodehelper() apps. I don't think
it does anything else. We used to use keventd for this but that had some
problem whcih I forget. (Who went and misnamed keventd to "events", too?
Nobody calls it "events", and with good reason)
> Why doesn't this get spawned when it needs to?
>
> 164 ? S< 0:00 [cqueue/0]
> 165 ? S< 0:00 [cqueue/1]
>
> I'm not even sure wth these are.
Me either.
> ...
>
: root 3 0.0 0.0 0 0 ? S 18:51 0:00 [watchdog/0]
That's the softlockup detector. Confusingly named to look like a, err,
watchdog. Could probably use keventd.
: root 5 0.0 0.0 0 0 ? S 18:51 0:00 [khelper]
That's there to parent the kthread_create()d threads. Could presumably use
khelper.
: root 152 0.0 0.0 0 0 ? S 18:51 0:00 [ata/0]
Does it need to be per-cpu?
: root 153 0.0 0.0 0 0 ? S 18:51 0:00 [ata_aux]
That's a single-threaded workqueue handler. Perhaps could use keventd.
: root 299 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_0]
: root 300 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_1]
: root 305 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_2]
: root 306 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_3]
This machine has one CPU, one sata disk and one DVD drive. The above is
hard to explain.
: root 319 0.0 0.0 0 0 ? S 18:51 0:00 [pccardd]
hm.
: root 331 0.0 0.0 0 0 ? S 18:51 0:00 [kpsmoused]
hm.
: root 337 0.0 0.0 0 0 ? S 18:51 0:00 [kedac]
hm. I didn't know that the Vaio had EDAC.
: root 1173 0.0 0.0 0 0 ? S 18:51 0:00 [khpsbpkt]
I can't even pronounce that.
: root 1354 0.0 0.0 0 0 ? S 18:51 0:00 [knodemgrd_0]
OK, I do have 1394 hardware, but it hasn't been used.
: root 1636 0.0 0.0 0 0 ? S 18:52 0:00 [kondemand/0]
I blame davej.
> > otoh, a lot of these inefficeincies are probably down in scruffy drivers
> > rather than in core or top-level code.
>
> You say scruffy, but most of the proliferation of kthreads comes
> from code written in the last few years. Compare the explosion of kthreads
> we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
> slowing down at all.
>
> On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
> It doesn't surprise me at all that bigger iron is starting to see issues.
>
Sure.
I don't think it's completely silly to object to all this. Sure, a kernel
thread is worth 4k in the best case, but I bet they have associated unused
resources and as we've seen, they can cause overhead.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 2:30 ` Andrew Morton
@ 2007-04-10 2:46 ` Linus Torvalds
2007-04-10 7:07 ` Jeff Garzik
2007-04-10 5:07 ` Alexey Dobriyan
` (3 subsequent siblings)
4 siblings, 1 reply; 100+ messages in thread
From: Linus Torvalds @ 2007-04-10 2:46 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Jones, Jeff Garzik, Robin Holt, Eric W. Biederman,
Ingo Molnar, linux-kernel, Jack Steiner
On Mon, 9 Apr 2007, Andrew Morton wrote:
>
> > 10 ? S< 0:00 [khelper]
>
> That one's needed to parent the call_usermodehelper() apps. I don't think
> it does anything else. We used to use keventd for this but that had some
> problem whcih I forget.
I think it was one of a long series of deadlocks.
Using a "keventd" for many different things sounds clever and nice, but
then sucks horribly when one event triggers another event, and they depend
on each other. Solution: use independent threads for the events.
Linus
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 2:30 ` Andrew Morton
2007-04-10 2:46 ` Linus Torvalds
@ 2007-04-10 5:07 ` Alexey Dobriyan
2007-04-10 5:21 ` Dave Jones
2007-04-10 6:09 ` Torsten Kaiser
` (2 subsequent siblings)
4 siblings, 1 reply; 100+ messages in thread
From: Alexey Dobriyan @ 2007-04-10 5:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Jones, Jeff Garzik, Robin Holt, Eric W. Biederman,
Ingo Molnar, Linus Torvalds, linux-kernel, Jack Steiner
On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <davej@redhat.com> wrote:
[possible topic for KS2007]
> > 164 ? S< 0:00 [cqueue/0]
> > 165 ? S< 0:00 [cqueue/1]
> >
> > I'm not even sure wth these are.
>
> Me either.
drivers/connector/connector.c:
455 dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 5:07 ` Alexey Dobriyan
@ 2007-04-10 5:21 ` Dave Jones
0 siblings, 0 replies; 100+ messages in thread
From: Dave Jones @ 2007-04-10 5:21 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, Jeff Garzik, Robin Holt, Eric W. Biederman,
Ingo Molnar, Linus Torvalds, linux-kernel, Jack Steiner
On Tue, Apr 10, 2007 at 09:07:54AM +0400, Alexey Dobriyan wrote:
> On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
> > On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <davej@redhat.com> wrote:
>
> [possible topic for KS2007]
>
> > > 164 ? S< 0:00 [cqueue/0]
> > > 165 ? S< 0:00 [cqueue/1]
> > >
> > > I'm not even sure wth these are.
> >
> > Me either.
>
> drivers/connector/connector.c:
> 455 dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);
Maybe I have apps relying on the connector stuff that I don't
even realise, but ttbomk, nothing actually *needs* this
for 99% of users if I'm not mistaken.
* wonders why he never built this modular..
config PROC_EVENTS
boolean "Report process events to userspace"
depends on CONNECTOR=y
Yay.
Dave
--
http://www.codemonkey.org.uk
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 2:30 ` Andrew Morton
2007-04-10 2:46 ` Linus Torvalds
2007-04-10 5:07 ` Alexey Dobriyan
@ 2007-04-10 6:09 ` Torsten Kaiser
2007-04-10 7:08 ` Jeff Garzik
2007-04-10 7:05 ` Jeff Garzik
2007-04-10 7:44 ` Russell King
4 siblings, 1 reply; 100+ messages in thread
From: Torsten Kaiser @ 2007-04-10 6:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Jones, Jeff Garzik, Robin Holt, Eric W. Biederman,
Ingo Molnar, Linus Torvalds, linux-kernel, Jack Steiner
On 4/10/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> : root 299 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_0]
> : root 300 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_1]
> : root 305 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_2]
> : root 306 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_3]
>
> This machine has one CPU, one sata disk and one DVD drive. The above is
> hard to explain.
One thread per port, not per device.
796 ? S 0:00 \_ [scsi_eh_0]
797 ? S 0:00 \_ [scsi_eh_1]
798 ? S 0:00 \_ [scsi_eh_2]
819 ? S 0:00 \_ [scsi_eh_3]
820 ? S 0:00 \_ [scsi_eh_4]
824 ? S 0:00 \_ [scsi_eh_5]
825 ? S 0:14 \_ [scsi_eh_6]
bardioc ~ # lsscsi -d
[0:0:0:0] disk ATA ST3160827AS 3.42 /dev/sda[8:0]
[1:0:0:0] disk ATA ST3160827AS 3.42 /dev/sdb[8:16]
[5:0:0:0] disk ATA IBM-DHEA-36480 HE8O /dev/sdc[8:32]
[5:0:1:0] disk ATA Maxtor 6L160P0 BAH4 /dev/sdd[8:48]
[6:0:0:0] cd/dvd HL-DT-ST DVDRAM GSA-4081B A100 /dev/sr0[11:0]
bardioc ~ # lsscsi -H
[0] sata_promise
[1] sata_promise
[2] sata_promise
[3] sata_via
[4] sata_via
[5] pata_via
[6] pata_via
The bad is, that there is always a thread, even if the hardware is not
even hotplug capable.
Don't know if the thread is even needed for hotplug...
> I don't think it's completely silly to object to all this. Sure, a kernel
> thread is worth 4k in the best case, but I bet they have associated unused
> resources and as we've seen, they can cause overhead.
For me its not the 4k that annoy me, but the clutter in ps or top.
Torsten
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 0:48 ` Eric W. Biederman
2007-04-10 1:15 ` Andrew Morton
@ 2007-04-10 6:53 ` Jeff Garzik
2007-04-10 9:42 ` Robin Holt
2 siblings, 0 replies; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 6:53 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Robin Holt, Ingo Molnar, Linus Torvalds,
linux-kernel, Jack Steiner
Eric W. Biederman wrote:
> At 10 kernel threads per cpu there may be a little bloat but it isn't
> out of control. It is mostly that we are observing the kernel as
> NR_CPUS approaches infinity. 4096 isn't infinity yet but it's easily
> a 1000 fold bigger then most people are used to :)
I disagree there is only a little bloat: the current mechanism in place
does not scale as NR_CPUS increases, as this thread demonstrates.
Beyond a certain point, on an 8-CPU box, it gets silly. You certainly
don't need eight kblockd threads or eight ata threads.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 2:30 ` Andrew Morton
` (2 preceding siblings ...)
2007-04-10 6:09 ` Torsten Kaiser
@ 2007-04-10 7:05 ` Jeff Garzik
2007-04-10 7:37 ` Andrew Morton
2007-04-10 16:35 ` Matt Mackall
2007-04-10 7:44 ` Russell King
4 siblings, 2 replies; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 7:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Jones, Robin Holt, Eric W. Biederman, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
Andrew Morton wrote:
> : root 3 0.0 0.0 0 0 ? S 18:51 0:00 [watchdog/0]
>
> That's the softlockup detector. Confusingly named to look like a, err,
> watchdog. Could probably use keventd.
I would think this would run into the keventd "problem", where $N
processes can lock out another?
IMO a lot of these could potentially be simply started as brand new
threads, when an exception arises.
> : root 5 0.0 0.0 0 0 ? S 18:51 0:00 [khelper]
>
> That's there to parent the kthread_create()d threads. Could presumably use
> khelper.
>
> : root 152 0.0 0.0 0 0 ? S 18:51 0:00 [ata/0]
>
> Does it need to be per-cpu?
No, it does not.
It is used for PIO data transfer, so it merely has to respond quickly,
which rules out keventd. You also don't want PIO data xfer for port A
blocked, sitting around waiting for PIO data xfer to complete on port C.
So, we merely need fast-reacting threads that keventd will not block.
We do not need per-CPU threads.
Again, I think a model where threads are created on demand, and reaped
after inactivity, would fit here. As I feel it would fit with many
other non-libata drivers.
> : root 153 0.0 0.0 0 0 ? S 18:51 0:00 [ata_aux]
>
> That's a single-threaded workqueue handler. Perhaps could use keventd.
That is used by libata exception handler, for hotpug and such. My main
worry with keventd is that we might get stuck behind an unrelated
process for an undefined length of time.
IMO the best model would be to create ata_aux thread on demand, and kill
it if it hasn't been used recently.
> : root 299 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_0]
> : root 300 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_1]
> : root 305 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_2]
> : root 306 0.0 0.0 0 0 ? S 18:51 0:00 [scsi_eh_3]
>
> This machine has one CPU, one sata disk and one DVD drive. The above is
> hard to explain.
Nod. I've never thought we needed this many threads. At least it
doesn't scale out of control for $BigNum-CPU boxen.
As the name implies, this is SCSI exception handling thread. Although
some synchronization is required, this could probably work with an
on-demand thread creation model too.
> : root 319 0.0 0.0 0 0 ? S 18:51 0:00 [pccardd]
>
> hm.
>
> : root 331 0.0 0.0 0 0 ? S 18:51 0:00 [kpsmoused]
>
> hm.
This kernel thread is used as a "bottom half" handler for the PS2 mouse
interrupt. This one is a bit more justifiable.
> : root 337 0.0 0.0 0 0 ? S 18:51 0:00 [kedac]
>
> hm. I didn't know that the Vaio had EDAC.
>
> : root 1173 0.0 0.0 0 0 ? S 18:51 0:00 [khpsbpkt]
>
> I can't even pronounce that.
>
> : root 1354 0.0 0.0 0 0 ? S 18:51 0:00 [knodemgrd_0]
>
> OK, I do have 1394 hardware, but it hasn't been used.
>
> : root 1636 0.0 0.0 0 0 ? S 18:52 0:00 [kondemand/0]
>
> I blame davej.
>
>> > otoh, a lot of these inefficeincies are probably down in scruffy drivers
>> > rather than in core or top-level code.
>>
>> You say scruffy, but most of the proliferation of kthreads comes
>> from code written in the last few years. Compare the explosion of kthreads
>> we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
>> slowing down at all.
>>
>> On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
>> It doesn't surprise me at all that bigger iron is starting to see issues.
>>
>
> Sure.
>
> I don't think it's completely silly to object to all this. Sure, a kernel
> thread is worth 4k in the best case, but I bet they have associated unused
> resources and as we've seen, they can cause overhead.
Agreed.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 2:46 ` Linus Torvalds
@ 2007-04-10 7:07 ` Jeff Garzik
2007-04-10 22:20 ` Ingo Oeser
0 siblings, 1 reply; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 7:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Dave Jones, Robin Holt, Eric W. Biederman,
Ingo Molnar, linux-kernel, Jack Steiner
Linus Torvalds wrote:
>
> On Mon, 9 Apr 2007, Andrew Morton wrote:
>>> 10 ? S< 0:00 [khelper]
>> That one's needed to parent the call_usermodehelper() apps. I don't think
>> it does anything else. We used to use keventd for this but that had some
>> problem whcih I forget.
>
> I think it was one of a long series of deadlocks.
>
> Using a "keventd" for many different things sounds clever and nice, but
> then sucks horribly when one event triggers another event, and they depend
> on each other. Solution: use independent threads for the events.
Nod. That's the key problem with keventd. Independent things must wait
on each other.
That's why I feel thread creation -- cheap under Linux -- is quite
appropriate for many of these situations.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 6:09 ` Torsten Kaiser
@ 2007-04-10 7:08 ` Jeff Garzik
0 siblings, 0 replies; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 7:08 UTC (permalink / raw)
To: Torsten Kaiser
Cc: Andrew Morton, Dave Jones, Robin Holt, Eric W. Biederman,
Ingo Molnar, Linus Torvalds, linux-kernel, Jack Steiner
Torsten Kaiser wrote:
> One thread per port, not per device.
>
> 796 ? S 0:00 \_ [scsi_eh_0]
> 797 ? S 0:00 \_ [scsi_eh_1]
> 798 ? S 0:00 \_ [scsi_eh_2]
> 819 ? S 0:00 \_ [scsi_eh_3]
> 820 ? S 0:00 \_ [scsi_eh_4]
> 824 ? S 0:00 \_ [scsi_eh_5]
> 825 ? S 0:14 \_ [scsi_eh_6]
>
> bardioc ~ # lsscsi -d
> [0:0:0:0] disk ATA ST3160827AS 3.42 /dev/sda[8:0]
> [1:0:0:0] disk ATA ST3160827AS 3.42 /dev/sdb[8:16]
> [5:0:0:0] disk ATA IBM-DHEA-36480 HE8O /dev/sdc[8:32]
> [5:0:1:0] disk ATA Maxtor 6L160P0 BAH4 /dev/sdd[8:48]
> [6:0:0:0] cd/dvd HL-DT-ST DVDRAM GSA-4081B A100 /dev/sr0[11:0]
> bardioc ~ # lsscsi -H
> [0] sata_promise
> [1] sata_promise
> [2] sata_promise
> [3] sata_via
> [4] sata_via
> [5] pata_via
> [6] pata_via
>
> The bad is, that there is always a thread, even if the hardware is not
> even hotplug capable.
> Don't know if the thread is even needed for hotplug...
Nope, it's not. At least for SATA (your chosen examples), hotplug is
handled by a libata-specific thread.
The SCSI EH threads are there purely for SCSI exception handling. For
the majority of SAS and SATA, we replace the entire SCSI EH handling
code with our own, making those threads less useful than on older (read:
majority) of SCSI drivers.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 7:05 ` Jeff Garzik
@ 2007-04-10 7:37 ` Andrew Morton
2007-04-10 8:33 ` Jeff Garzik
2007-04-10 16:35 ` Matt Mackall
1 sibling, 1 reply; 100+ messages in thread
From: Andrew Morton @ 2007-04-10 7:37 UTC (permalink / raw)
To: Jeff Garzik
Cc: Dave Jones, Robin Holt, Eric W. Biederman, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik <jeff@garzik.org> wrote:
> My main
> worry with keventd is that we might get stuck behind an unrelated
> process for an undefined length of time.
I don't think it has ever been demonstrated that keventd latency is
excessive, or a problem. I guess we could instrument it and fix stuff
easily enough.
The main problem with keventd has been flush_scheduled_work() deadlocks: the
caller of flush_scheduled_work() wants to flush work item A, but holds some
lock which is also needed by unrelated work item B. Most of the time, it
works. But if item B happens to be queued the flush_scheduled_work() will
deadlock.
The fix is to flush-and-cancel just item A: if it's not running yet, cancel
it. If it is running, wait until it has finished. Oleg's
void cancel_work_sync(struct work_struct *work)
is queued for 2.6.22 and should permit some kthread->keventd conversions
which would previously been deadlocky.
The thing to concentrate on here is the per-cpu threads, which is where the
proliferation appears to be coming from. Conversions to
schedule_work()+cancel_work_sync() and conversions to
create_singlethread_workqueue().
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 2:30 ` Andrew Morton
` (3 preceding siblings ...)
2007-04-10 7:05 ` Jeff Garzik
@ 2007-04-10 7:44 ` Russell King
2007-04-10 8:16 ` Jeff Garzik
2007-04-10 8:59 ` Ingo Molnar
4 siblings, 2 replies; 100+ messages in thread
From: Russell King @ 2007-04-10 7:44 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Jones, Jeff Garzik, Robin Holt, Eric W. Biederman,
Ingo Molnar, Linus Torvalds, linux-kernel, Jack Steiner
On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
> : root 319 0.0 0.0 0 0 ? S 18:51 0:00 [pccardd]
>
> hm.
One per PC card socket to avoid the sysfs locking crappyness that would
otherwise deadlock, and to convert from the old unreadable state machine
implementation to a much more readable linearly coded implementation.
Could probably be eliminated if we had some mechanism to spawn a helper
thread to do some task as required which didn't block other helper
threads until it completes.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 7:44 ` Russell King
@ 2007-04-10 8:16 ` Jeff Garzik
2007-04-10 8:59 ` Ingo Molnar
1 sibling, 0 replies; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 8:16 UTC (permalink / raw)
To: Andrew Morton, Dave Jones, Jeff Garzik, Robin Holt,
Eric W. Biederman, Ingo Molnar, Linus Torvalds, linux-kernel,
Jack Steiner
Russell King wrote:
> Could probably be eliminated if we had some mechanism to spawn a helper
> thread to do some task as required which didn't block other helper
> threads until it completes.
kthread_run() should go that for you. Creates a new thread with
kthread_create(), and wakes it up immediately. Goes away when you're
done with it.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 7:37 ` Andrew Morton
@ 2007-04-10 8:33 ` Jeff Garzik
2007-04-10 8:41 ` Andrew Morton
0 siblings, 1 reply; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 8:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Jones, Robin Holt, Eric W. Biederman, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
Andrew Morton wrote:
> On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik <jeff@garzik.org> wrote:
>
>> My main
>> worry with keventd is that we might get stuck behind an unrelated
>> process for an undefined length of time.
>
> I don't think it has ever been demonstrated that keventd latency is
> excessive, or a problem. I guess we could instrument it and fix stuff
> easily enough.
It's simple math, combined with user expectations.
On a 1-CPU or 2-CPU box, if you have three or more tasks, all of which
are doing hardware reset tasks that could take 30-60 seconds (realistic
for libata, SCSI and network drivers, at least), then OBVIOUSLY you have
other tasks blocked for that length of time.
Since the cause of the latency is msleep() -- the entire reason why the
driver wanted to use a kernel thread in the first place -- it would seem
to me that the simple fix is to start a new thread, possibly exceeding
the number of CPUs in the box.
> The main problem with keventd has been flush_scheduled_work() deadlocks: the
That's been a problem in the past, yes, but a minor one.
I'm talking about a key conceptual problem with keventd.
It is easy to see how an extra-long tg3 hardware reset might prevent a
disk hotplug event from being processed for 30-60 seconds. And as
hardware gets more complex -- see the Intel IOP storage card which runs
Linux -- the reset times get longer, too.
So the issue is /not/ deadlocks.
> The thing to concentrate on here is the per-cpu threads, which is where the
> proliferation appears to be coming from.
Strongly agreed.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 8:33 ` Jeff Garzik
@ 2007-04-10 8:41 ` Andrew Morton
2007-04-10 8:48 ` Jeff Garzik
0 siblings, 1 reply; 100+ messages in thread
From: Andrew Morton @ 2007-04-10 8:41 UTC (permalink / raw)
To: Jeff Garzik
Cc: Dave Jones, Robin Holt, Eric W. Biederman, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
On Tue, 10 Apr 2007 04:33:57 -0400 Jeff Garzik <jeff@garzik.org> wrote:
> Andrew Morton wrote:
> > On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik <jeff@garzik.org> wrote:
> >
> >> My main
> >> worry with keventd is that we might get stuck behind an unrelated
> >> process for an undefined length of time.
> >
> > I don't think it has ever been demonstrated that keventd latency is
> > excessive, or a problem. I guess we could instrument it and fix stuff
> > easily enough.
>
> It's simple math, combined with user expectations.
>
> On a 1-CPU or 2-CPU box, if you have three or more tasks, all of which
> are doing hardware reset tasks that could take 30-60 seconds (realistic
> for libata, SCSI and network drivers, at least), then OBVIOUSLY you have
> other tasks blocked for that length of time.
Well that obviously would be a dumb way to use keventd. One would need
to do schedule_work(), kick off the reset then do schedule_delayed_work()
to wait (or poll) for its termination.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 8:41 ` Andrew Morton
@ 2007-04-10 8:48 ` Jeff Garzik
2007-04-10 22:35 ` Ingo Oeser
0 siblings, 1 reply; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 8:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Jones, Robin Holt, Eric W. Biederman, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
Andrew Morton wrote:
> Well that obviously would be a dumb way to use keventd. One would need
> to do schedule_work(), kick off the reset then do schedule_delayed_work()
> to wait (or poll) for its termination.
Far too complex. See what Russell wrote, for instance.
When you are in a kernel thread, you can write more simple,
straightforward, easy-to-debug code that does
blah
msleep()
blah
rather than creating an insanely complicated state machine for the same
thing.
ESPECIALLY if you are already inside a state machine (the case with
libata PIO data xfer), you do not want to add another state machine
inside of that.
A great many uses of kernel threads are to simplify device reset and
polling in this way. I know; a year ago I audited every kernel thread,
because I was so frustrated with the per-CPU thread explosion.
Thus, rather than forcing authors to make their code more complex, we
should find another solution.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 7:44 ` Russell King
2007-04-10 8:16 ` Jeff Garzik
@ 2007-04-10 8:59 ` Ingo Molnar
2007-04-10 9:33 ` Jeff Garzik
1 sibling, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2007-04-10 8:59 UTC (permalink / raw)
To: Andrew Morton, Dave Jones, Jeff Garzik, Robin Holt,
Eric W. Biederman, Linus Torvalds, linux-kernel, Jack Steiner
* Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> One per PC card socket to avoid the sysfs locking crappyness that
> would otherwise deadlock, and to convert from the old unreadable state
> machine implementation to a much more readable linearly coded
> implementation.
>
> Could probably be eliminated if we had some mechanism to spawn a
> helper thread to do some task as required which didn't block other
> helper threads until it completes.
looks like the perfect usecase for threadlets. (threadlets only use up a
separate context if necessary and can be coded in the familiar
sequential/linear model)
(btw., threadlets could in theory be executed in irq context too, and if
we block on anything it gets bounced off to a real context - although
this certainly pushes the limits and there would still be some deadlock
potential for things like irq-unsafe non-sleeping locks (spinlocks,
rwlocks).)
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 8:59 ` Ingo Molnar
@ 2007-04-10 9:33 ` Jeff Garzik
0 siblings, 0 replies; 100+ messages in thread
From: Jeff Garzik @ 2007-04-10 9:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Dave Jones, Robin Holt, Eric W. Biederman,
Linus Torvalds, linux-kernel, Jack Steiner
Ingo Molnar wrote:
> * Russell King <rmk+lkml@arm.linux.org.uk> wrote:
>
>> One per PC card socket to avoid the sysfs locking crappyness that
>> would otherwise deadlock, and to convert from the old unreadable state
>> machine implementation to a much more readable linearly coded
>> implementation.
>>
>> Could probably be eliminated if we had some mechanism to spawn a
>> helper thread to do some task as required which didn't block other
>> helper threads until it completes.
>
> looks like the perfect usecase for threadlets. (threadlets only use up a
> separate context if necessary and can be coded in the familiar
> sequential/linear model)
Same response as to Andrew: AFAICS that just increases complexity.
The simple path for programmers is writing straightforward code that
does something like
blah
msleep()
blah
or in pccardd's case,
mutex_lock()
blah
mutex_unlock()
to permit sleeping without having to write more-complex code that deals
with context transitions.
For slow-path, infrequently executed code, it is best to keep it as
simple as possible.
Jeff
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 0:48 ` Eric W. Biederman
2007-04-10 1:15 ` Andrew Morton
2007-04-10 6:53 ` Jeff Garzik
@ 2007-04-10 9:42 ` Robin Holt
2 siblings, 0 replies; 100+ messages in thread
From: Robin Holt @ 2007-04-10 9:42 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Jeff Garzik, Robin Holt, Ingo Molnar,
Linus Torvalds, linux-kernel, Jack Steiner
On Mon, Apr 09, 2007 at 06:48:54PM -0600, Eric W. Biederman wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
> > I suspect there are quite a few kernel threads which don't really need to
> > be threads at all: the code would quite happily work if it was changed to
> > use keventd, via schedule_work() and friends. But kernel threads are
> > somewhat easier to code for.
> >
> > I also suspect that there are a number of workqueue threads which
> > could/should have used create_singlethread_workqueue(). Often this is
> > because the developer just didn't think to do it.
> >
> > otoh, a lot of these inefficeincies are probably down in scruffy drivers
> > rather than in core or top-level code.
> >
> > <I also wonder where all these parented-by-init,
> > presumably-not-using-kthread kernel threads are coming from>
>
>
> >From another piece of this thread.
>
> > > Robin how many kernel thread per cpu are you seeing?
> >
> > 10.
> >
> > FYI, pid 1539 is kthread.
> >
> > a01:~ # ps -ef | egrep "\[.*\/255\]"
> > root 512 1 0 Apr08 ? 00:00:00 [migration/255]
> > root 513 1 0 Apr08 ? 00:00:00 [ksoftirqd/255]
> > root 1281 1 0 Apr08 ? 00:00:02 [events/255]
> > root 2435 1539 0 Apr08 ? 00:00:00 [kblockd/255]
> > root 3159 1539 0 Apr08 ? 00:00:00 [aio/255]
> > root 4007 1539 0 Apr08 ? 00:00:00 [cqueue/255]
> > root 8653 1539 0 Apr08 ? 00:00:00 [ata/255]
> > root 17438 1539 0 Apr08 ? 00:00:00 [xfslogd/255]
> > root 17950 1539 0 Apr08 ? 00:00:00 [xfsdatad/255]
> > root 18426 1539 0 Apr08 ? 00:00:00 [rpciod/255]
>
>
> So it looks like there were about 1500 kernel threads that started up before
> kthread started.
I should have been more clear, this is from that 4096 broken down to a
512 cpu partition. This is the configuration the customer will receive
the machine. The 4096 was just to see if it worked.
> So the kernel threads appear to have init as their parent is because
> they started before kthread for the most part.
>
> At 10 kernel threads per cpu there may be a little bloat but it isn't
> out of control. It is mostly that we are observing the kernel as
> NR_CPUS approaches infinity. 4096 isn't infinity yet but it's easily
> a 1000 fold bigger then most people are used to :)
>
> Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 13:48 ` Ingo Molnar
@ 2007-04-10 13:38 ` Oleg Nesterov
2007-04-10 15:00 ` Eric W. Biederman
2007-04-10 14:51 ` Eric W. Biederman
1 sibling, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-10 13:38 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric W. Biederman, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
On 04/10, Ingo Molnar wrote:
>
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> > Ingo Molnar <mingo@elte.hu> writes:
> >
> > > no. Two _completely separate_ lists.
> > >
> > > i.e. a to-be-reaped task will still be on the main list _too_. The
> > > main list is for all the PID semantics rules. The reap-list is just
> > > for wait4() processing. The two would be completely separate.
> >
> > And what pray tell except for heuristics is the list of children used
> > for?
>
> on a second thought: the p->children list is needed for the whole
> child/parent task tree, which is needed for sys_getppid(). The question
> is, does anything require us to reparent to within the same thread
> group?
No! That is why I suggest (a long ago, in fact) to move ->children into
->signal_struct. When sub-thread forks, we set ->parent = group_leader.
We don't need forget_original_parent() until the last thead exists. This
also simplify do_wait().
However, this breaks the current ->pdeath_signal behaviour. In fact (and
Eric thinks the same) this _fixes_ this behaviour, but may break things.
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 18:30 ` Eric W. Biederman
@ 2007-04-10 13:48 ` Ingo Molnar
2007-04-10 13:38 ` Oleg Nesterov
2007-04-10 14:51 ` Eric W. Biederman
0 siblings, 2 replies; 100+ messages in thread
From: Ingo Molnar @ 2007-04-10 13:48 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ingo Molnar <mingo@elte.hu> writes:
>
> > no. Two _completely separate_ lists.
> >
> > i.e. a to-be-reaped task will still be on the main list _too_. The
> > main list is for all the PID semantics rules. The reap-list is just
> > for wait4() processing. The two would be completely separate.
>
> And what pray tell except for heuristics is the list of children used
> for?
on a second thought: the p->children list is needed for the whole
child/parent task tree, which is needed for sys_getppid(). The question
is, does anything require us to reparent to within the same thread
group?
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 13:48 ` Ingo Molnar
2007-04-10 13:38 ` Oleg Nesterov
@ 2007-04-10 14:51 ` Eric W. Biederman
2007-04-10 15:06 ` Ingo Molnar
2007-04-10 16:44 ` Oleg Nesterov
1 sibling, 2 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-10 14:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
Ingo Molnar <mingo@elte.hu> writes:
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> Ingo Molnar <mingo@elte.hu> writes:
>>
>> > no. Two _completely separate_ lists.
>> >
>> > i.e. a to-be-reaped task will still be on the main list _too_. The
>> > main list is for all the PID semantics rules. The reap-list is just
>> > for wait4() processing. The two would be completely separate.
>>
>> And what pray tell except for heuristics is the list of children used
>> for?
>
> on a second thought: the p->children list is needed for the whole
> child/parent task tree, which is needed for sys_getppid().
Yes, something Oleg said made me realize that.
As long as the reparent isn't to complex it isn't required that we
have exactly one list .
> The question
> is, does anything require us to reparent to within the same thread
> group?
I think my head is finally on straight about this question.
Currently there is the silly linux specific parent death signal
(pdeath_signal). Oleg's memory was a better than mine on this score.
However there is no indication that the parent death signal being sent
when a thread leader dies is actually correct, or even interesting.
It probably should only be sent when getppid changes.
So with pdeath_signal fixed that is nothing that requires us to
reparent within the same thread group.
I'm trying to remember what the story is now. There is a nasty
race somewhere with reparenting, a threaded parent setting SIGCHLD to
SIGIGN, and non-default signals that results in an zombie that no one
can wait for and reap. It requires being reparented twice to trigger.
Anyway it is a real mess and if we can remove the stupid multi-headed
child lists things would become much simpler and the problem could
not occur.
Plus the code would become much simpler...
utrace appears to have removed the ptrace_children list and the
special cases that entailed.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 13:38 ` Oleg Nesterov
@ 2007-04-10 15:00 ` Eric W. Biederman
0 siblings, 0 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-10 15:00 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Ingo Molnar, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
Oleg Nesterov <oleg@tv-sign.ru> writes:
> No! That is why I suggest (a long ago, in fact) to move ->children into
> ->signal_struct. When sub-thread forks, we set ->parent = group_leader.
> We don't need forget_original_parent() until the last thead exists. This
> also simplify do_wait().
>
> However, this breaks the current ->pdeath_signal behaviour. In fact (and
> Eric thinks the same) this _fixes_ this behaviour, but may break things.
Thinking about this. As contingency planning if there is something in user
space that actually somehow cares about pdeath_signal, from a threaded
parent we can add a pdeath_signal list, to the task_struct and get
rid of the rest of the complexity.
I think I want to wait until someone screams first.
This does very much mean that we can remove the complexity of
a per thread ->children without fear, of having to revert everything.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 14:51 ` Eric W. Biederman
@ 2007-04-10 15:06 ` Ingo Molnar
2007-04-10 15:22 ` Eric W. Biederman
2007-04-10 16:44 ` Oleg Nesterov
1 sibling, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2007-04-10 15:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> > on a second thought: the p->children list is needed for the whole
> > child/parent task tree, which is needed for sys_getppid().
>
> Yes, something Oleg said made me realize that.
>
> As long as the reparent isn't to complex it isn't required that we
> have exactly one list .
>
> > The question is, does anything require us to reparent to within the
> > same thread group?
>
> I think my head is finally on straight about this question.
>
> Currently there is the silly linux specific parent death signal
> (pdeath_signal). Oleg's memory was a better than mine on this score.
>
> However there is no indication that the parent death signal being sent
> when a thread leader dies is actually correct, or even interesting. It
> probably should only be sent when getppid changes.
>
> So with pdeath_signal fixed that is nothing that requires us to
> reparent within the same thread group.
>
> I'm trying to remember what the story is now. There is a nasty race
> somewhere with reparenting, a threaded parent setting SIGCHLD to
> SIGIGN, and non-default signals that results in an zombie that no one
> can wait for and reap. It requires being reparented twice to trigger.
>
> Anyway it is a real mess and if we can remove the stupid multi-headed
> child lists things would become much simpler and the problem could not
> occur.
>
> Plus the code would become much simpler...
>
> utrace appears to have removed the ptrace_children list and the
> special cases that entailed.
so ... is anyone pursuing this? This would allow us to make sys_wait4()
faster and more scalable: no tasklist_lock bouncing for example.
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-09 20:03 ` Davide Libenzi
@ 2007-04-10 15:12 ` Bill Davidsen
2007-04-10 19:17 ` Davide Libenzi
0 siblings, 1 reply; 100+ messages in thread
From: Bill Davidsen @ 2007-04-10 15:12 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Kyle Moffett, Linux Kernel Mailing List,
Eric W. Biederman, Oleg Nesterov, Robin Holt, Chris Snook,
Jack Steiner
Davide Libenzi wrote:
> On Mon, 9 Apr 2007, Linus Torvalds wrote:
>
>> On Mon, 9 Apr 2007, Kyle Moffett wrote:
>>> Maybe "struct posix_process" is more descriptive? "struct process_posix"?
>>> "Ugly POSIX process semantics data" seems simple enough to stick in a struct
>>> name. "struct uglyposix_process"?
>> Guys, you didn't read my message.
>>
>> It's *not* about "process" stuff. Anything that tries to call it a
>> "process" is *by*definition* worse than what it is now. Processes have all
>> the things that we've cleanly separated out for filesystem, VM, SysV
>> semaphore state, namespaces etc.
>>
>> The "struct signal_struct" is the random *leftovers* from all the other
>> stuff. It's *not* about "processes". Never has been, and never will be.
>
> I proposed "struct task_shared_ctx" but you ducked :)
>
Descriptive, correct, I like it!
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 15:06 ` Ingo Molnar
@ 2007-04-10 15:22 ` Eric W. Biederman
2007-04-10 15:53 ` Ingo Molnar
0 siblings, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-10 15:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
Ingo Molnar <mingo@elte.hu> writes:
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> > on a second thought: the p->children list is needed for the whole
>> > child/parent task tree, which is needed for sys_getppid().
>>
>> Yes, something Oleg said made me realize that.
>>
>> As long as the reparent isn't to complex it isn't required that we
>> have exactly one list .
>>
>> > The question is, does anything require us to reparent to within the
>> > same thread group?
>>
>> I think my head is finally on straight about this question.
>>
>> Currently there is the silly linux specific parent death signal
>> (pdeath_signal). Oleg's memory was a better than mine on this score.
>>
>> However there is no indication that the parent death signal being sent
>> when a thread leader dies is actually correct, or even interesting. It
>> probably should only be sent when getppid changes.
>>
>> So with pdeath_signal fixed that is nothing that requires us to
>> reparent within the same thread group.
>>
>> I'm trying to remember what the story is now. There is a nasty race
>> somewhere with reparenting, a threaded parent setting SIGCHLD to
>> SIGIGN, and non-default signals that results in an zombie that no one
>> can wait for and reap. It requires being reparented twice to trigger.
>>
>> Anyway it is a real mess and if we can remove the stupid multi-headed
>> child lists things would become much simpler and the problem could not
>> occur.
>>
>> Plus the code would become much simpler...
>>
>> utrace appears to have removed the ptrace_children list and the
>> special cases that entailed.
>
> so ... is anyone pursuing this? This would allow us to make sys_wait4()
> faster and more scalable: no tasklist_lock bouncing for example.
which part?
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 15:22 ` Eric W. Biederman
@ 2007-04-10 15:53 ` Ingo Molnar
2007-04-10 16:17 ` Eric W. Biederman
0 siblings, 1 reply; 100+ messages in thread
From: Ingo Molnar @ 2007-04-10 15:53 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> > so ... is anyone pursuing this? This would allow us to make
> > sys_wait4() faster and more scalable: no tasklist_lock bouncing for
> > example.
>
> which part?
all of it :) Everything you mentioned makes sense quite a bit. The
thread signal handling of do_wait was added in a pretty arbitrary
fashion so i doubt there are strong requirements in that area. Apps
might have grown to get used to it meanwhile though, so we've got to do
it carefully.
Ingo
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 15:53 ` Ingo Molnar
@ 2007-04-10 16:17 ` Eric W. Biederman
0 siblings, 0 replies; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-10 16:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Oleg Nesterov, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
Ingo Molnar <mingo@elte.hu> writes:
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> > so ... is anyone pursuing this? This would allow us to make
>> > sys_wait4() faster and more scalable: no tasklist_lock bouncing for
>> > example.
>>
>> which part?
>
> all of it :) Everything you mentioned makes sense quite a bit. The
> thread signal handling of do_wait was added in a pretty arbitrary
> fashion so i doubt there are strong requirements in that area. Apps
> might have grown to get used to it meanwhile though, so we've got to do
> it carefully.
I'm looking at. If only because there is a reasonable chance doing this
will fix the races with a threaded init.
However I just found something nasty. The wait __WNOTHREAD flag.
And my quick search seems to find at least one user space applications
that uses it, and it is widely documented so I suspect there are
others :(
I played with moving the lists into signal_struct, and short of
architecture specific users of task->children all I had to touch
were:
include/linux/init_task.h | 2 +-
include/linux/sched.h | 5 +-
kernel/exit.c | 159 +++++++++++++++++++++------------------------
kernel/fork.c | 2 +-
mm/oom_kill.c | 4 +-
So it should be relatively easy to change this child lists around...
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 7:05 ` Jeff Garzik
2007-04-10 7:37 ` Andrew Morton
@ 2007-04-10 16:35 ` Matt Mackall
1 sibling, 0 replies; 100+ messages in thread
From: Matt Mackall @ 2007-04-10 16:35 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Dave Jones, Robin Holt, Eric W. Biederman,
Ingo Molnar, Linus Torvalds, linux-kernel, Jack Steiner
On Tue, Apr 10, 2007 at 03:05:56AM -0400, Jeff Garzik wrote:
> Andrew Morton wrote:
> >: root 3 0.0 0.0 0 0 ? S 18:51 0:00
> >[watchdog/0]
> >
> >That's the softlockup detector. Confusingly named to look like a, err,
> >watchdog. Could probably use keventd.
>
> I would think this would run into the keventd "problem", where $N
> processes can lock out another?
>
> IMO a lot of these could potentially be simply started as brand new
> threads, when an exception arises.
>
>
> >: root 5 0.0 0.0 0 0 ? S 18:51 0:00
> >[khelper]
> >
> >That's there to parent the kthread_create()d threads. Could presumably use
> >khelper.
> >
> >: root 152 0.0 0.0 0 0 ? S 18:51 0:00 [ata/0]
> >
> >Does it need to be per-cpu?
>
> No, it does not.
>
> It is used for PIO data transfer, so it merely has to respond quickly,
> which rules out keventd. You also don't want PIO data xfer for port A
> blocked, sitting around waiting for PIO data xfer to complete on port C.
>
> So, we merely need fast-reacting threads that keventd will not block.
> We do not need per-CPU threads.
>
> Again, I think a model where threads are created on demand, and reaped
> after inactivity, would fit here. As I feel it would fit with many
> other non-libata drivers.
>
>
> >: root 153 0.0 0.0 0 0 ? S 18:51 0:00
> >[ata_aux]
> >
> >That's a single-threaded workqueue handler. Perhaps could use keventd.
>
> That is used by libata exception handler, for hotpug and such. My main
> worry with keventd is that we might get stuck behind an unrelated
> process for an undefined length of time.
>
> IMO the best model would be to create ata_aux thread on demand, and kill
> it if it hasn't been used recently.
>
>
>
> >: root 299 0.0 0.0 0 0 ? S 18:51 0:00
> >[scsi_eh_0]
> >: root 300 0.0 0.0 0 0 ? S 18:51 0:00
> >[scsi_eh_1]
> >: root 305 0.0 0.0 0 0 ? S 18:51 0:00
> >[scsi_eh_2]
> >: root 306 0.0 0.0 0 0 ? S 18:51 0:00
> >[scsi_eh_3]
> >
> >This machine has one CPU, one sata disk and one DVD drive. The above is
> >hard to explain.
>
> Nod. I've never thought we needed this many threads. At least it
> doesn't scale out of control for $BigNum-CPU boxen.
>
> As the name implies, this is SCSI exception handling thread. Although
> some synchronization is required, this could probably work with an
> on-demand thread creation model too.
>
>
> >: root 319 0.0 0.0 0 0 ? S 18:51 0:00
> >[pccardd]
> >
> >hm.
> >
> >: root 331 0.0 0.0 0 0 ? S 18:51 0:00
> >[kpsmoused]
> >
> >hm.
>
> This kernel thread is used as a "bottom half" handler for the PS2 mouse
> interrupt. This one is a bit more justifiable.
>
>
> >: root 337 0.0 0.0 0 0 ? S 18:51 0:00 [kedac]
> >
> >hm. I didn't know that the Vaio had EDAC.
> >
> >: root 1173 0.0 0.0 0 0 ? S 18:51 0:00
> >[khpsbpkt]
> >
> >I can't even pronounce that.
> >
> >: root 1354 0.0 0.0 0 0 ? S 18:51 0:00
> >[knodemgrd_0]
> >
> >OK, I do have 1394 hardware, but it hasn't been used.
> >
> >: root 1636 0.0 0.0 0 0 ? S 18:52 0:00
> >[kondemand/0]
> >
> >I blame davej.
> >
> >> > otoh, a lot of these inefficeincies are probably down in scruffy
> >> drivers
> >> > rather than in core or top-level code.
> >>
> >>You say scruffy, but most of the proliferation of kthreads comes
> >>from code written in the last few years. Compare the explosion of
> >>kthreads
> >>we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
> >>slowing down at all.
> >>
> >>On the 2-way box I grabbed the above ps output from, I end up with 69
> >>kthreads.
> >>It doesn't surprise me at all that bigger iron is starting to see issues.
> >>
> >
> >Sure.
> >
> >I don't think it's completely silly to object to all this. Sure, a kernel
> >thread is worth 4k in the best case, but I bet they have associated unused
> >resources and as we've seen, they can cause overhead.
>
> Agreed.
There are some upsides to persistent kernel threads that we might want
to keep in mind:
- they can be reniced, made RT, etc. as needed
- they can be bound to CPUs
- they collect long-term CPU usage information
Most of the above doesn't matter for the average kernel thread that's
handling the occassional hardware reset, but for others it could.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 14:51 ` Eric W. Biederman
2007-04-10 15:06 ` Ingo Molnar
@ 2007-04-10 16:44 ` Oleg Nesterov
2007-04-11 19:55 ` Bill Davidsen
1 sibling, 1 reply; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-10 16:44 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Ingo Molnar, Robin Holt, Linus Torvalds, Chris Snook,
linux-kernel, Jack Steiner
On 04/10, Eric W. Biederman wrote:
> I'm trying to remember what the story is now. There is a nasty
> race somewhere with reparenting, a threaded parent setting SIGCHLD to
> SIGIGN, and non-default signals that results in an zombie that no one
> can wait for and reap. It requires being reparented twice to trigger.
reparent_thread:
...
/* If we'd notified the old parent about this child's death,
* also notify the new parent.
*/
if (!traced && p->exit_state == EXIT_ZOMBIE &&
p->exit_signal != -1 && thread_group_empty(p))
do_notify_parent(p, p->exit_signal);
We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
We don't do this.
The best fix I believe is to cleanup the forget_original_parent/reparent_thread
interaction and factor out this "exit_state == EXIT_ZOMBIE && exit_signal == -1"
checks.
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 15:12 ` Bill Davidsen
@ 2007-04-10 19:17 ` Davide Libenzi
0 siblings, 0 replies; 100+ messages in thread
From: Davide Libenzi @ 2007-04-10 19:17 UTC (permalink / raw)
To: Bill Davidsen
Cc: Linus Torvalds, Kyle Moffett, Linux Kernel Mailing List,
Eric W. Biederman, Oleg Nesterov, Robin Holt, Chris Snook,
Jack Steiner
On Tue, 10 Apr 2007, Bill Davidsen wrote:
> Davide Libenzi wrote:
> > On Mon, 9 Apr 2007, Linus Torvalds wrote:
> >
> > > On Mon, 9 Apr 2007, Kyle Moffett wrote:
> > > > Maybe "struct posix_process" is more descriptive? "struct
> > > > process_posix"?
> > > > "Ugly POSIX process semantics data" seems simple enough to stick in a
> > > > struct
> > > > name. "struct uglyposix_process"?
> > > Guys, you didn't read my message.
> > >
> > > It's *not* about "process" stuff. Anything that tries to call it a
> > > "process" is *by*definition* worse than what it is now. Processes have all
> > > the things that we've cleanly separated out for filesystem, VM, SysV
> > > semaphore state, namespaces etc.
> > >
> > > The "struct signal_struct" is the random *leftovers* from all the other
> > > stuff. It's *not* about "processes". Never has been, and never will be.
> >
> > I proposed "struct task_shared_ctx" but you ducked :)
> >
> Descriptive, correct, I like it!
He's stubborn, he'll never accept patches. Must be a seasonal thing ;)
- Davide
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 7:07 ` Jeff Garzik
@ 2007-04-10 22:20 ` Ingo Oeser
0 siblings, 0 replies; 100+ messages in thread
From: Ingo Oeser @ 2007-04-10 22:20 UTC (permalink / raw)
To: Jeff Garzik
Cc: Linus Torvalds, Andrew Morton, Dave Jones, Robin Holt,
Eric W. Biederman, Ingo Molnar, linux-kernel, Jack Steiner
On Tuesday 10 April 2007, Jeff Garzik wrote:
> That's why I feel thread creation -- cheap under Linux -- is quite
> appropriate for many of these situations.
Maybe that (thread creation) can be done at open(), socket-creation,
service request, syscall or whatever event triggers a driver/subsystem
to actually queue work into a thread.
And when there is a close(), socket-destruction, service completion
or whatever these threads can be marked for destruction and destroyed
by a timer or even immediately.
Regards
Ingo Oeser
--
If something is getting cheap, it is getting wasted just because it is cheap.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 8:48 ` Jeff Garzik
@ 2007-04-10 22:35 ` Ingo Oeser
0 siblings, 0 replies; 100+ messages in thread
From: Ingo Oeser @ 2007-04-10 22:35 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Dave Jones, Robin Holt, Eric W. Biederman,
Ingo Molnar, Linus Torvalds, linux-kernel, Jack Steiner
On Tuesday 10 April 2007, Jeff Garzik wrote:
> Thus, rather than forcing authors to make their code more complex, we
> should find another solution.
What about sth. like the "pre-forking" concept? So just have a thread creator thread,
which checks the amount of unused threads and keeps them within certain limits.
So that anything which needs a thread now simply queues up the work and
specifies, that it wants a new thread, if possible.
One problem seems to be, that a thread is nothing else but a statement
on what other tasks I can wait before doing my current one (e.g. I don't want to
mlseep() twice on the same reset timeout).
But we usually use locking to order that.
Do I miss anything fundamental here?
Regards
Ingo Oeser
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-06 23:37 ` Jeff Garzik
@ 2007-04-11 7:28 ` Nick Piggin
0 siblings, 0 replies; 100+ messages in thread
From: Nick Piggin @ 2007-04-11 7:28 UTC (permalink / raw)
To: Jeff Garzik
Cc: Linus Torvalds, Robin Holt, Eric W. Biederman, Ingo Molnar,
linux-kernel, Jack Steiner
Jeff Garzik wrote:
> Linus Torvalds wrote:
>
>>
>> On Fri, 6 Apr 2007, Jeff Garzik wrote:
>>
>>> I would rather change the implementation under the hood to start per-CPU
>>> threads on demand, similar to a thread-pool implementation.
>>>
>>> Boxes with $BigNum CPUs probably won't ever use half of those threads.
>>
>>
>> The counter-argument is that boxes with $BigNum CPU's really don't
>> hurt from it either, and having per-process data structures is often
>> simpler and more efficient than trying to have some thread pool.
>
>
> Two points here:
>
> * A lot of the users in the current kernel tree don't rely on the
> per-CPU qualities. They just need multiple threads running.
>
> * Even with per-CPU data structures and code, you don't necessarily have
> to keep a thread alive and running for each CPU. Reap the ones that
> haven't been used in $TimeFrame, and add thread creation to the slow
> path that already exists in the bowels of schedule_work().
>
> Or if some kernel hacker is really motivated, all workqueue users in the
> kernel would benefit from a "thread audit", looking at working
> conditions to decide if the new kthread APIs are more appropriate.
spawn on demand would require heuristics and complexity though. And
I think there is barely any positive tradeoff to weigh it against.
>> IOW, once we get the processes off the global list, there just isn't
>> any downside from them. Sure, they use some memory, but people who buy
>> 1024-cpu machines won't care about a few kB per CPU..
>>
>> So the *only* downside is literally the process list, and one
>> suggested patch already just removes kernel threads entirely from the
>> parenthood lists.
>>
>> The other potential downside could be "ps is slow", but on the other
>> hand, having the things stick around and have things like CPU-time
>> accumulate is probably worth it - if there are some issues, they'd
>> show up properly accounted for in a way that process pools would have
>> a hard time doing.
>
>
> Regardless of how things are shuffled about internally, there will
> always be annoying overhead /somewhere/ when you have a metric ton of
> kernel threads. I think that people should also be working on ways to
> make the kernel threads a bit more manageable for the average human.
There are a few per CPU, but they should need no human management to
speak of.
Presumably if you have a 1024 CPU system, you'd generally want to be
running at least 1024 of your own processes there, so you already need
some tools to handle that magnitude of processes anyway.
>> So I really don't think this is worth changing things over, apart from
>> literally removing them from process lists, which I think everybody
>> agrees we should just do - it just never even came up before!
>
>
> I think there is a human downside. For an admin you have to wade
> through a ton of processes on your machine, if you are attempting to
> evaluate the overall state of the machine. Just google around for all
> the admins complaining about the explosion of kernel threads on
> production machines :)
User tools should be improved. It shouldn't be too hard to be able to
aggregate kernel thread stats into a single top entry, for example.
I'm not saying the number of threads couldn't be cut down, but there
is still be an order of magnitude problem there...
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-10 16:44 ` Oleg Nesterov
@ 2007-04-11 19:55 ` Bill Davidsen
2007-04-11 20:17 ` Eric W. Biederman
2007-04-11 20:19 ` Oleg Nesterov
0 siblings, 2 replies; 100+ messages in thread
From: Bill Davidsen @ 2007-04-11 19:55 UTC (permalink / raw)
To: linux-kernel
Cc: Eric W. Biederman, Ingo Molnar, Robin Holt, Linus Torvalds,
Chris Snook, linux-kernel, Jack Steiner
Oleg Nesterov wrote:
> On 04/10, Eric W. Biederman wrote:
>
>> I'm trying to remember what the story is now. There is a nasty
>> race somewhere with reparenting, a threaded parent setting SIGCHLD to
>> SIGIGN, and non-default signals that results in an zombie that no one
>> can wait for and reap. It requires being reparented twice to trigger.
>
> reparent_thread:
>
> ...
>
> /* If we'd notified the old parent about this child's death,
> * also notify the new parent.
> */
> if (!traced && p->exit_state == EXIT_ZOMBIE &&
> p->exit_signal != -1 && thread_group_empty(p))
> do_notify_parent(p, p->exit_signal);
>
> We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
> We don't do this.
>
> The best fix I believe is to cleanup the forget_original_parent/reparent_thread
> interaction and factor out this "exit_state == EXIT_ZOMBIE && exit_signal == -1"
> checks.
>
As long as the original parent is preserved for getppid(). There are
programs out there which communicate between the parent and child with
signals, and if the original parent dies, it undesirable to have the
child getppid() and start sending signals to a program not expecting
them. Invites undefined behavior.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-11 19:55 ` Bill Davidsen
@ 2007-04-11 20:17 ` Eric W. Biederman
2007-04-11 21:24 ` Bill Davidsen
2007-04-11 20:19 ` Oleg Nesterov
1 sibling, 1 reply; 100+ messages in thread
From: Eric W. Biederman @ 2007-04-11 20:17 UTC (permalink / raw)
To: Bill Davidsen
Cc: Oleg Nesterov, Ingo Molnar, Robin Holt, Linus Torvalds,
Chris Snook, linux-kernel, Jack Steiner
Bill Davidsen <davidsen@tmr.com> writes:
> As long as the original parent is preserved for getppid(). There are programs
> out there which communicate between the parent and child with signals, and if
> the original parent dies, it undesirable to have the child getppid() and start
> sending signals to a program not expecting them. Invites undefined behavior.
Then the programs are broken. getppid is defined to change if the process
is reparented to init.
Eric
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-11 19:55 ` Bill Davidsen
2007-04-11 20:17 ` Eric W. Biederman
@ 2007-04-11 20:19 ` Oleg Nesterov
1 sibling, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2007-04-11 20:19 UTC (permalink / raw)
To: Bill Davidsen
Cc: Eric W. Biederman, Ingo Molnar, Robin Holt, Linus Torvalds,
Chris Snook, linux-kernel, Jack Steiner
On 04/11, Bill Davidsen wrote:
>
> Oleg Nesterov wrote:
> >On 04/10, Eric W. Biederman wrote:
> >
> >>I'm trying to remember what the story is now. There is a nasty
> >>race somewhere with reparenting, a threaded parent setting SIGCHLD to
> >>SIGIGN, and non-default signals that results in an zombie that no one
> >>can wait for and reap. It requires being reparented twice to trigger.
> >
> >reparent_thread:
> >
> > ...
> >
> > /* If we'd notified the old parent about this child's death,
> > * also notify the new parent.
> > */
> > if (!traced && p->exit_state == EXIT_ZOMBIE &&
> > p->exit_signal != -1 && thread_group_empty(p))
> > do_notify_parent(p, p->exit_signal);
> >
> >We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
> >We don't do this.
> >
> >The best fix I believe is to cleanup the
> >forget_original_parent/reparent_thread
> >interaction and factor out this "exit_state == EXIT_ZOMBIE && exit_signal
> >== -1"
> >checks.
> >
> As long as the original parent is preserved for getppid(). There are
> programs out there which communicate between the parent and child with
> signals, and if the original parent dies, it undesirable to have the
> child getppid() and start sending signals to a program not expecting
> them. Invites undefined behavior.
Sorry, can't understand.
If p->exit_signal == -1 after do_notify_parent() above, the task is completely
dead. Nobody can release it, we should do this (if EXIT_ZOMBIE). At this point
"p" was already re-parented, but this (and getppid) doesn't matter at all.
OK. Most likely you meant something else. Could you clarify?
Oleg.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: init's children list is long and slows reaping children.
2007-04-11 20:17 ` Eric W. Biederman
@ 2007-04-11 21:24 ` Bill Davidsen
0 siblings, 0 replies; 100+ messages in thread
From: Bill Davidsen @ 2007-04-11 21:24 UTC (permalink / raw)
To: linux-kernel
Cc: Oleg Nesterov, Ingo Molnar, Robin Holt, Linus Torvalds,
Chris Snook, linux-kernel, Jack Steiner
Eric W. Biederman wrote:
> Bill Davidsen <davidsen@tmr.com> writes:
>
>> As long as the original parent is preserved for getppid(). There are programs
>> out there which communicate between the parent and child with signals, and if
>> the original parent dies, it undesirable to have the child getppid() and start
>> sending signals to a program not expecting them. Invites undefined behavior.
>
> Then the programs are broken. getppid is defined to change if the process
> is reparented to init.
>
The short answer is that kthreads don't do this so it doesn't matter.
But user programs are NOT broken, currently getppid returns either the
original parent or init, and a program can identify init. Reparenting to
another pid would not be easily noted, and as SUS notes, no values are
reserved to error. So there's no way to check, and no neeed for
kthreads, I was prematurely paranoid.
Presumably user processes will still be reparented to init so that's not
an issue. Since there's no atomic signal_parent() the ppid could change
between getppid() and signal(), but that's never actually been a problem
AFAIK.
Related: Is there a benefit from having separate queues for original
children of init and reparented (to init) tasks? Even in a server would
there be enough to save anything?
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
^ permalink raw reply [flat|nested] 100+ messages in thread
end of thread, other threads:[~2007-04-11 21:31 UTC | newest]
Thread overview: 100+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-06 8:42 init's children list is long and slows reaping children Oleg Nesterov
2007-04-06 9:10 ` Eric W. Biederman
2007-04-06 9:44 ` Oleg Nesterov
2007-04-06 15:45 ` Eric W. Biederman
2007-04-06 15:47 ` Oleg Nesterov
2007-04-06 17:16 ` Linus Torvalds
2007-04-06 17:27 ` Ingo Molnar
2007-04-06 17:31 ` Ingo Molnar
2007-04-06 17:34 ` Eric W. Biederman
2007-04-06 19:06 ` H. Peter Anvin
2007-04-06 19:15 ` Eric W. Biederman
2007-04-06 19:21 ` H. Peter Anvin
2007-04-06 21:04 ` Jeremy Fitzhardinge
2007-04-06 21:07 ` H. Peter Anvin
2007-04-06 19:36 ` Oleg Nesterov
2007-04-06 19:43 ` Ingo Molnar
2007-04-06 20:01 ` Oleg Nesterov
2007-04-06 20:21 ` Ingo Molnar
2007-04-06 19:47 ` Oleg Nesterov
2007-04-06 19:59 ` Eric W. Biederman
2007-04-07 20:31 ` Oleg Nesterov
2007-04-08 0:38 ` Eric W. Biederman
2007-04-08 15:46 ` Oleg Nesterov
-- strict thread matches above, loose matches on Subject: below --
2007-04-05 19:51 Robin Holt
2007-04-05 20:57 ` Linus Torvalds
2007-04-06 0:51 ` Chris Snook
2007-04-06 1:03 ` Chris Snook
2007-04-06 1:29 ` Linus Torvalds
2007-04-06 2:15 ` Eric W. Biederman
2007-04-06 10:43 ` Robin Holt
2007-04-06 15:38 ` Eric W. Biederman
2007-04-06 16:31 ` Oleg Nesterov
2007-04-06 17:32 ` Ingo Molnar
2007-04-06 17:39 ` Roland Dreier
2007-04-06 18:04 ` Eric W. Biederman
2007-04-06 18:30 ` Eric W. Biederman
2007-04-10 13:48 ` Ingo Molnar
2007-04-10 13:38 ` Oleg Nesterov
2007-04-10 15:00 ` Eric W. Biederman
2007-04-10 14:51 ` Eric W. Biederman
2007-04-10 15:06 ` Ingo Molnar
2007-04-10 15:22 ` Eric W. Biederman
2007-04-10 15:53 ` Ingo Molnar
2007-04-10 16:17 ` Eric W. Biederman
2007-04-10 16:44 ` Oleg Nesterov
2007-04-11 19:55 ` Bill Davidsen
2007-04-11 20:17 ` Eric W. Biederman
2007-04-11 21:24 ` Bill Davidsen
2007-04-11 20:19 ` Oleg Nesterov
2007-04-06 18:02 ` Eric W. Biederman
2007-04-06 18:21 ` Davide Libenzi
2007-04-06 18:56 ` Eric W. Biederman
2007-04-06 19:16 ` Davide Libenzi
2007-04-06 19:19 ` Ingo Molnar
2007-04-06 21:29 ` Davide Libenzi
2007-04-06 21:51 ` Linus Torvalds
2007-04-06 22:31 ` Davide Libenzi
2007-04-06 22:46 ` Linus Torvalds
2007-04-06 22:59 ` Davide Libenzi
2007-04-09 8:28 ` Ingo Molnar
2007-04-09 18:09 ` Bill Davidsen
2007-04-09 19:28 ` Kyle Moffett
2007-04-09 19:51 ` Linus Torvalds
2007-04-09 20:03 ` Davide Libenzi
2007-04-10 15:12 ` Bill Davidsen
2007-04-10 19:17 ` Davide Libenzi
2007-04-09 20:00 ` Eric W. Biederman
2007-04-06 16:41 ` Robin Holt
2007-04-09 17:37 ` Chris Snook
2007-04-06 18:05 ` Christoph Hellwig
2007-04-06 19:39 ` Eric W. Biederman
2007-04-06 22:38 ` Jeff Garzik
2007-04-06 22:51 ` Linus Torvalds
2007-04-06 23:37 ` Jeff Garzik
2007-04-11 7:28 ` Nick Piggin
2007-04-10 0:23 ` Andrew Morton
2007-04-10 0:48 ` Eric W. Biederman
2007-04-10 1:15 ` Andrew Morton
2007-04-10 6:53 ` Jeff Garzik
2007-04-10 9:42 ` Robin Holt
2007-04-10 1:59 ` Dave Jones
2007-04-10 2:30 ` Andrew Morton
2007-04-10 2:46 ` Linus Torvalds
2007-04-10 7:07 ` Jeff Garzik
2007-04-10 22:20 ` Ingo Oeser
2007-04-10 5:07 ` Alexey Dobriyan
2007-04-10 5:21 ` Dave Jones
2007-04-10 6:09 ` Torsten Kaiser
2007-04-10 7:08 ` Jeff Garzik
2007-04-10 7:05 ` Jeff Garzik
2007-04-10 7:37 ` Andrew Morton
2007-04-10 8:33 ` Jeff Garzik
2007-04-10 8:41 ` Andrew Morton
2007-04-10 8:48 ` Jeff Garzik
2007-04-10 22:35 ` Ingo Oeser
2007-04-10 16:35 ` Matt Mackall
2007-04-10 7:44 ` Russell King
2007-04-10 8:16 ` Jeff Garzik
2007-04-10 8:59 ` Ingo Molnar
2007-04-10 9:33 ` Jeff Garzik
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox