* [BUG] scheduler: first timeslice of the exiting thread
@ 2007-04-07 7:31 Satoru Takeuchi
2007-04-07 7:45 ` Satoru Takeuchi
2007-04-09 6:09 ` Andrew Morton
0 siblings, 2 replies; 15+ messages in thread
From: Satoru Takeuchi @ 2007-04-07 7:31 UTC (permalink / raw)
To: Ingo Molnar, Linux Kernel; +Cc: Satoru Takeuchi
Hi Ingo and all,
When I was examining the following program ...
1. There are a large amount of small jobs takes several msecs,
and the number of job increases constantly.
2. The process creates a thread or a process per job (I examined both
the thread model and the process model).
3. Each child process/thread does the assigned job and exit immediately.
... I found that the thread model's latency is longer than proess
model's one against my expectation. It's because of the current
sched_fork()/sched_exit() implementation as follows:
a) On sched_fork, the creator share its timeslice with new process.
b) On sched_exit, if the exiting process didn't exhaust its first
timeslice yet, it gives its timeslice to the parent.
It has no problem on the process model since the creator is the parent.
However, on the thread model, the creator is not the parent, it is same
as the creator's parent. Hence, on this kind of program, the creator
can't retrieve shared timeslice and exausts its timeslice at a rate of
knots. In addition, somehow, the parent (typically shell?) gets extra
timeslice.
I believe it's a bug and the exiting process should give its timeslice
to the creator. Now I have some patch plan to fix this problem as follow:
a) Add the field for the creator to task_struct. It needs extra memory.
b) Doesn't add extra field and have thread's parent the creater, which is
same as process creation. However it has many side effects, for example,
we also need to change sys_getppid() implementation.
What do you think? Any comments are welcome.
BTW, We can easily confirm the problem with systemtap, a convenient diagnostic
program.
Test programs(attached in the mail):
- satprocess.c: Process model. It creates a child process and wait for it
several times. Each child process exits immediately.
- satthread.c: Thread model. It creates a child thread and join it several
times. Each child thread exits immediately.
- fork_exit.stp: systemtap script to overlook satprocess/satthread
My kernel:
2.6.21-rc6(i386).
How to confirm:
1) Execute systemtap script.
# stap fork_exit.stp -v
Pass 1: parsed user script and 54 library script(s) in 680usr/40sys/728real ms.
Pass 2: analyzed script: 2 probe(s), 8 function(s), 0 embed(s), 0 global(s) in 670usr/40sys/699real ms.
Pass 3: using cached /root/.systemtap/cache/02/stap_02d3975cd5dedf7b6697a2d5f92f966a_3811.c
Pass 4: using cached /root/.systemtap/cache/02/stap_02d3975cd5dedf7b6697a2d5f92f966a_3811.ko
Pass 5: starting run.
2) Execute the process model program on another terminal.
$ ./satprocess
Then systemtap overlooks satprocess's fork/exit and prints the followings:
fork: pid = 11635, tgid = 11635, ppid = 5969, time_slice = 11
exit: pid = 11635, tgid = 11635, ppid = 11634, time_slice = 6
fork: pid = 11636, tgid = 11636, ppid = 5969, time_slice = 11
exit: pid = 11636, tgid = 11636, ppid = 11634, time_slice = 6
fork: pid = 11637, tgid = 11637, ppid = 5969, time_slice = 11
exit: pid = 11637, tgid = 11637, ppid = 11634, time_slice = 6
fork: pid = 11638, tgid = 11638, ppid = 5969, time_slice = 11
exit: pid = 11638, tgid = 11638, ppid = 11634, time_slice = 6
fork: pid = 11639, tgid = 11639, ppid = 5969, time_slice = 11
exit: pid = 11639, tgid = 11639, ppid = 11634, time_slice = 6
fork: pid = 11640, tgid = 11640, ppid = 5969, time_slice = 11
exit: pid = 11640, tgid = 11640, ppid = 11634, time_slice = 6
fork: pid = 11641, tgid = 11641, ppid = 5969, time_slice = 11
exit: pid = 11641, tgid = 11641, ppid = 11634, time_slice = 6
fork: pid = 11642, tgid = 11642, ppid = 5969, time_slice = 11
exit: pid = 11642, tgid = 11642, ppid = 11634, time_slice = 6
fork: pid = 11643, tgid = 11643, ppid = 5969, time_slice = 11
exit: pid = 11643, tgid = 11643, ppid = 11634, time_slice = 6
fork: pid = 11644, tgid = 11644, ppid = 5969, time_slice = 11
exit: pid = 11644, tgid = 11644, ppid = 11634, time_slice = 6
exit: pid = 11634, tgid = 11634, ppid = 5969, time_slice = 10
It looks good.
3) Execute the thread model program on another terminal.
$ ./satthread
Then systemtap overlooks satthread's fork/exit and prints the followings:
fork: pid = 11646, tgid = 11645, ppid = 5969, time_slice = 10
exit: pid = 11646, tgid = 11645, ppid = 5969, time_slice = 5
fork: pid = 11647, tgid = 11645, ppid = 5969, time_slice = 5
fork: pid = 11648, tgid = 11645, ppid = 5969, time_slice = 2
exit: pid = 11647, tgid = 11645, ppid = 5969, time_slice = 3
fork: pid = 11649, tgid = 11645, ppid = 5969, time_slice = 1
exit: pid = 11648, tgid = 11645, ppid = 5969, time_slice = 1
fork: pid = 11650, tgid = 11645, ppid = 5969, time_slice = 25
exit: pid = 11649, tgid = 11645, ppid = 5969, time_slice = 1
fork: pid = 11651, tgid = 11645, ppid = 5969, time_slice = 12
exit: pid = 11650, tgid = 11645, ppid = 5969, time_slice = 13
fork: pid = 11652, tgid = 11645, ppid = 5969, time_slice = 6
exit: pid = 11651, tgid = 11645, ppid = 5969, time_slice = 6
fork: pid = 11653, tgid = 11645, ppid = 5969, time_slice = 3
exit: pid = 11652, tgid = 11645, ppid = 5969, time_slice = 3
fork: pid = 11654, tgid = 11645, ppid = 5969, time_slice = 1
exit: pid = 11653, tgid = 11645, ppid = 5969, time_slice = 2
fork: pid = 11655, tgid = 11645, ppid = 5969, time_slice = 25
exit: pid = 11654, tgid = 11645, ppid = 5969, time_slice = 1
exit: pid = 11655, tgid = 11645, ppid = 5969, time_slice = 13
exit: pid = 11645, tgid = 11645, ppid = 5969, time_slice = 12
The creator couldn't retrieve child thread's first timeslice and exhausts
its timeslice at lg(25) th tread creation at best, hmm...
Thanks,
Satoru
/// satprocess.c //////////////////////////////////////////////////////////////
/*
* satprocess - Create child process and wait for it several times. Each child
* process does nothing and exits immediately.
*
* Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
*
* This software may be used and distributed according to the terms
* of the GNU General Public License, incorporated herein by reference.
*/
#include <stdio.h>
#include <stdlib.h>
#include <err.h>
#include <signal.h>
#define NPROC 10
int
main(int argc, char **argv)
{
int i;
pid_t pid;
for (i = 0; i < NPROC; i++) {
pid = fork();
if (pid < 0)
err(EXIT_FAILURE, "fork(%d) failed\n", i);
if (pid == 0)
exit(EXIT_SUCCESS);
wait(NULL);
}
exit(EXIT_SUCCESS);
}
///////////////////////////////////////////////////////////////////////////////
/// satthread.c ///////////////////////////////////////////////////////////////
/*
* satthread - Create child thread and join it several times. Each child thread
* does nothing and exits immediately.
*
* Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
*
* This software may be used and distributed according to the terms
* of the GNU General Public License, incorporated herein by reference.
*/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREAD 10
static void *thread_fn(void *arg)
{
return NULL;
}
int
main(int argc, char **argv)
{
int i;
pthread_t t;
for (i = 0; i < NTHREAD; i++) {
if (pthread_create(&t, NULL, thread_fn, NULL)) {
fprintf(stderr, "pthread_create(%d) failed\n", i);
exit(EXIT_FAILURE);
}
pthread_join(t, NULL);
}
exit(EXIT_SUCCESS);
}
///////////////////////////////////////////////////////////////////////////////
/// fork_exit.stp /////////////////////////////////////////////////////////////
/*
* fork_exit.stp - Overlooks sched_fork()/exit_exit() for satprocess/satthread
* and prints some information
*
* Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
*
* This software may be used and distributed according to the terms
* of the GNU General Public License, incorporated herein by reference.
*/
function is_my_testpro(comm)
{
if (comm == "satthread" || comm == "satprocess")
return 1
else
return 0
}
function print_log(name, pid, tgid, ppid, time_slice)
{
printf("%s: pid = %d, tgid = %d, ppid = %d, time_slice = %u\n",
name, pid, tgid, ppid, time_slice);
}
probe kernel.function("sched_fork")
{
if (is_my_testpro(kernel_string($p->comm)))
print_log(kernel_string($p->comm),
$p->pid, $p->tgid, $p->parent->pid, $p->time_slice);
}
probe kernel.function("sched_exit")
{
if (is_my_testpro(kernel_string($p->comm)))
print_log(kernel_string($p->comm),
$p->pid, $p->tgid, $p->parent->pid, $p->time_slice);
}
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
2007-04-07 7:31 [BUG] scheduler: first timeslice of the exiting thread Satoru Takeuchi
@ 2007-04-07 7:45 ` Satoru Takeuchi
2007-04-09 6:09 ` Andrew Morton
1 sibling, 0 replies; 15+ messages in thread
From: Satoru Takeuchi @ 2007-04-07 7:45 UTC (permalink / raw)
To: Ingo Molnar, Linux Kernel; +Cc: Satoru Takeuchi
At Sat, 07 Apr 2007 16:31:39 +0900,
Satoru Takeuchi wrote:
> Test programs(attached in the mail):
>
> - satprocess.c: Process model. It creates a child process and wait for it
> several times. Each child process exits immediately.
> - satthread.c: Thread model. It creates a child thread and join it several
> times. Each child thread exits immediately.
> - fork_exit.stp: systemtap script to overlook satprocess/satthread
Oh, I sent a old systemtap script. Correct one is here.
Satoru
/// fork_exit.stp /////////////////////////////////////////////////////////////
/*
* fork_exit.stp - Overlooks sched_fork()/exit_exit() for satprocess/satthread
* and prints some information
*
* Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
*
* This software may be used and distributed according to the terms
* of the GNU General Public License, incorporated herein by reference.
*/
function is_my_testpro(comm)
{
if (comm == "satthread" || comm == "satprocess")
return 1
else
return 0
}
function print_log(name, pid, tgid, ppid, time_slice)
{
printf("%s: pid = %d, tgid = %d, ppid = %d, time_slice = %u\n",
name, pid, tgid, ppid, time_slice);
}
probe kernel.function("sched_fork")
{
if (is_my_testpro(kernel_string($p->comm)))
print_log("fork",
$p->pid, $p->tgid, $p->parent->pid, $p->time_slice);
}
probe kernel.function("sched_exit")
{
if (is_my_testpro(kernel_string($p->comm)))
print_log("exit",
$p->pid, $p->tgid, $p->parent->pid, $p->time_slice);
}
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
2007-04-07 7:31 [BUG] scheduler: first timeslice of the exiting thread Satoru Takeuchi
2007-04-07 7:45 ` Satoru Takeuchi
@ 2007-04-09 6:09 ` Andrew Morton
2007-04-09 6:50 ` Mike Galbraith
2007-04-13 15:31 ` Con Kolivas
1 sibling, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2007-04-09 6:09 UTC (permalink / raw)
To: Satoru Takeuchi; +Cc: Ingo Molnar, Linux Kernel, Con Kolivas, Mike Galbraith
On Sat, 07 Apr 2007 16:31:39 +0900 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> wrote:
> When I was examining the following program ...
>
> 1. There are a large amount of small jobs takes several msecs,
> and the number of job increases constantly.
> 2. The process creates a thread or a process per job (I examined both
> the thread model and the process model).
> 3. Each child process/thread does the assigned job and exit immediately.
>
> ... I found that the thread model's latency is longer than proess
> model's one against my expectation. It's because of the current
> sched_fork()/sched_exit() implementation as follows:
>
> a) On sched_fork, the creator share its timeslice with new process.
> b) On sched_exit, if the exiting process didn't exhaust its first
> timeslice yet, it gives its timeslice to the parent.
>
> It has no problem on the process model since the creator is the parent.
> However, on the thread model, the creator is not the parent, it is same
> as the creator's parent. Hence, on this kind of program, the creator
> can't retrieve shared timeslice and exausts its timeslice at a rate of
> knots. In addition, somehow, the parent (typically shell?) gets extra
> timeslice.
>
> I believe it's a bug and the exiting process should give its timeslice
> to the creator. Now I have some patch plan to fix this problem as follow:
>
> a) Add the field for the creator to task_struct. It needs extra memory.
> b) Doesn't add extra field and have thread's parent the creater, which is
> same as process creation. However it has many side effects, for example,
> we also need to change sys_getppid() implementation.
>
> What do you think? Any comments are welcome.
This comes at an awkward time, because we might well merge the
staircase/deadline work into 2.6.22, and I think it rewrites the part of
the scheduler which is causing the problems you're observing.
Has anyone verified that SD fixes this problem and the one at
http://lkml.org/lkml/2007/4/7/21 ?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
2007-04-09 6:09 ` Andrew Morton
@ 2007-04-09 6:50 ` Mike Galbraith
2007-04-10 7:18 ` Satoru Takeuchi
2007-04-13 15:31 ` Con Kolivas
1 sibling, 1 reply; 15+ messages in thread
From: Mike Galbraith @ 2007-04-09 6:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Satoru Takeuchi, Ingo Molnar, Linux Kernel, Con Kolivas
On Sun, 2007-04-08 at 23:09 -0700, Andrew Morton wrote:
> On Sat, 07 Apr 2007 16:31:39 +0900 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> wrote:
>
> > When I was examining the following program ...
> >
> > 1. There are a large amount of small jobs takes several msecs,
> > and the number of job increases constantly.
> > 2. The process creates a thread or a process per job (I examined both
> > the thread model and the process model).
> > 3. Each child process/thread does the assigned job and exit immediately.
> >
> > ... I found that the thread model's latency is longer than proess
> > model's one against my expectation. It's because of the current
> > sched_fork()/sched_exit() implementation as follows:
> >
> > a) On sched_fork, the creator share its timeslice with new process.
> > b) On sched_exit, if the exiting process didn't exhaust its first
> > timeslice yet, it gives its timeslice to the parent.
> >
> > It has no problem on the process model since the creator is the parent.
> > However, on the thread model, the creator is not the parent, it is same
> > as the creator's parent. Hence, on this kind of program, the creator
> > can't retrieve shared timeslice and exausts its timeslice at a rate of
> > knots. In addition, somehow, the parent (typically shell?) gets extra
> > timeslice.
> >
> > I believe it's a bug and the exiting process should give its timeslice
> > to the creator. Now I have some patch plan to fix this problem as follow:
> >
> > a) Add the field for the creator to task_struct. It needs extra memory.
> > b) Doesn't add extra field and have thread's parent the creater, which is
> > same as process creation. However it has many side effects, for example,
> > we also need to change sys_getppid() implementation.
> >
> > What do you think? Any comments are welcome.
>
> This comes at an awkward time, because we might well merge the
> staircase/deadline work into 2.6.22, and I think it rewrites the part of
> the scheduler which is causing the problems you're observing.
>
> Has anyone verified that SD fixes this problem and the one at
> http://lkml.org/lkml/2007/4/7/21 ?
Not verified either way in testing, but I believe this should be a
problem for SD as well because timeslice fork/exit handling is identical
with mainline. Individual slices are much smaller than mainline, so
priority should drop rapidly, consuming bandwidth allotted for the
current rotation, sending the creator off to the expired array
prematurely.
-Mike
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
@ 2007-04-09 10:29 Oleg Nesterov
2007-04-09 11:03 ` Oleg Nesterov
2007-04-10 1:19 ` Satoru Takeuchi
0 siblings, 2 replies; 15+ messages in thread
From: Oleg Nesterov @ 2007-04-09 10:29 UTC (permalink / raw)
To: Satoru Takeuchi; +Cc: Ingo Molnar, Andrew Morton, Mike Galbraith, linux-kernel
Satoru Takeuchi wrote:
>
> a) On sched_fork, the creator share its timeslice with new process.
> b) On sched_exit, if the exiting process didn't exhaust its first
> timeslice yet, it gives its timeslice to the parent.
>
> It has no problem on the process model since the creator is the parent.
> However, on the thread model, the creator is not the parent, it is same
> as the creator's parent. Hence, on this kind of program, the creator
> can't retrieve shared timeslice and exausts its timeslice at a rate of
> knots. In addition, somehow, the parent (typically shell?) gets extra
> timeslice.
Yes, this is an old oddity.
> I believe it's a bug and the exiting process should give its timeslice
> to the creator. Now I have some patch plan to fix this problem as follow:
>
> a) Add the field for the creator to task_struct. It needs extra memory.
... and locking/complications. The creator may already exited at this time.
> b) Doesn't add extra field and have thread's parent the creater, which is
> same as process creation. However it has many side effects, for example,
> we also need to change sys_getppid() implementation.
can't understand this, sorry.
Ingo, I think sched_exit() has another buglet,
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
should be
if (unlikely(parent->time_slice > task_timeslice(parent)))
parent->time_slice = task_timeslice(parent);
The
[PATCH 1/2] sched_exit: fix parent->time_slice calculation
http://marc.info/?l=linux-kernel&m=115101842505365
was dropped because "[PATCH 2/2]" was very wrong.
Could you look at it?
Oleg.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
2007-04-09 10:29 [BUG] scheduler: first timeslice of the exiting thread Oleg Nesterov
@ 2007-04-09 11:03 ` Oleg Nesterov
2007-04-10 1:19 ` Satoru Takeuchi
1 sibling, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2007-04-09 11:03 UTC (permalink / raw)
To: Satoru Takeuchi; +Cc: Ingo Molnar, Andrew Morton, Mike Galbraith, linux-kernel
On 04/09, Oleg Nesterov wrote:
>
> Satoru Takeuchi wrote:
> >
> > a) On sched_fork, the creator share its timeslice with new process.
> > b) On sched_exit, if the exiting process didn't exhaust its first
> > timeslice yet, it gives its timeslice to the parent.
> >
> > It has no problem on the process model since the creator is the parent.
> > However, on the thread model, the creator is not the parent, it is same
> > as the creator's parent. Hence, on this kind of program, the creator
> > can't retrieve shared timeslice and exausts its timeslice at a rate of
> > knots. In addition, somehow, the parent (typically shell?) gets extra
> > timeslice.
>
> Yes, this is an old oddity.
Perhaps
void sched_exit(struct task_struct *p)
{
task_t *t, *did_fork = p->parent;
if (!thread_group_leader(p))
list_for_each_entry(t, p->thread_group)
if (!t->exit_state) {
did_fork = t;
break;
}
... give time_slice to did_fork ..
}
may improve things a little bit, because the rest of time_slice doesn't leak
off the thread group.
However this is not very nice and we have the same problem with CLONE_PARENT.
Oleg.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
2007-04-09 10:29 [BUG] scheduler: first timeslice of the exiting thread Oleg Nesterov
2007-04-09 11:03 ` Oleg Nesterov
@ 2007-04-10 1:19 ` Satoru Takeuchi
1 sibling, 0 replies; 15+ messages in thread
From: Satoru Takeuchi @ 2007-04-10 1:19 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Satoru Takeuchi, Ingo Molnar, Andrew Morton, Mike Galbraith,
linux-kernel
> > b) Doesn't add extra field and have thread's parent the creater, which is
> > same as process creation. However it has many side effects, for example,
> > we also need to change sys_getppid() implementation.
>
> can't understand this, sorry.
Sorry for my obscure English, perhaps I need more training about writing.
What I meant is ...
If thread A creates thread B and A's parent is C, B's parent is also C
currently. In my idea (b), copy_process() doesn't varies B's parent by
CLONE_THREAD flag and B's parent is A. In this case, there is no need to
change sched_exit() and doesn't need to add extra field. However, we need
to change other code as sysy_getppid(), forget_original_parent(), and so on.
It's same as past linux kernel behavior. I don't know the details when and
why these code was changed (and this problem was overlooked then). Once I
thought I came up with good idea, but now I'm rocking back on my heels
because it seems to need many code change.
Satoru
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
2007-04-09 6:50 ` Mike Galbraith
@ 2007-04-10 7:18 ` Satoru Takeuchi
0 siblings, 0 replies; 15+ messages in thread
From: Satoru Takeuchi @ 2007-04-10 7:18 UTC (permalink / raw)
To: Mike Galbraith
Cc: Andrew Morton, Satoru Takeuchi, Ingo Molnar, Linux Kernel,
Con Kolivas
> > This comes at an awkward time, because we might well merge the
> > staircase/deadline work into 2.6.22, and I think it rewrites the part of
> > the scheduler which is causing the problems you're observing.
> >
> > Has anyone verified that SD fixes this problem and the one at
> > http://lkml.org/lkml/2007/4/7/21 ?
>
> Not verified either way in testing, but I believe this should be a
> problem for SD as well because timeslice fork/exit handling is identical
> with mainline. Individual slices are much smaller than mainline, so
> priority should drop rapidly, consuming bandwidth allotted for the
> current rotation, sending the creator off to the expired array
> prematurely.
>
> -Mike
>
Yes, this probelm also occurs on the SD. Here is the 2.6.21-rc6-mm1's
trace.
Process model:
fork: pid = 12431, tgid = 12431, ppid = 2737, time_slice = 8000000
exit: pid = 12431, tgid = 12431, ppid = 12430, time_slice = 3842675
fork: pid = 12432, tgid = 12432, ppid = 2737, time_slice = 6175786
exit: pid = 12432, tgid = 12432, ppid = 12430, time_slice = 2948968
fork: pid = 12433, tgid = 12433, ppid = 2737, time_slice = 5862607
exit: pid = 12433, tgid = 12433, ppid = 12430, time_slice = 2793696
fork: pid = 12434, tgid = 12434, ppid = 2737, time_slice = 5515983
exit: pid = 12434, tgid = 12434, ppid = 12430, time_slice = 2583218
fork: pid = 12435, tgid = 12435, ppid = 2737, time_slice = 5174253
exit: pid = 12435, tgid = 12435, ppid = 12430, time_slice = 2421119
fork: pid = 12436, tgid = 12436, ppid = 2737, time_slice = 4433521
exit: pid = 12436, tgid = 12436, ppid = 12430, time_slice = 2068271
fork: pid = 12437, tgid = 12437, ppid = 2737, time_slice = 4115828
exit: pid = 12437, tgid = 12437, ppid = 12430, time_slice = 1906364
fork: pid = 12438, tgid = 12438, ppid = 2737, time_slice = 3817721
exit: pid = 12438, tgid = 12438, ppid = 12430, time_slice = 1763586
fork: pid = 12439, tgid = 12439, ppid = 2737, time_slice = 3523880
exit: pid = 12439, tgid = 12439, ppid = 12430, time_slice = 1621289
fork: pid = 12440, tgid = 12440, ppid = 2737, time_slice = 2856168
exit: pid = 12440, tgid = 12440, ppid = 12430, time_slice = 1269502
exit: pid = 12430, tgid = 12430, ppid = 2737, time_slice = 2434379
Thread model:
fork: pid = 12554, tgid = 12553, ppid = 2737, time_slice = 2841904
exit: pid = 12554, tgid = 12553, ppid = 2737, time_slice = 1420952
fork: pid = 12555, tgid = 12553, ppid = 2737, time_slice = 29977
exit: pid = 12555, tgid = 12553, ppid = 2737, time_slice = 14988
fork: pid = 12556, tgid = 12553, ppid = 2737, time_slice = 8000000
exit: pid = 12556, tgid = 12553, ppid = 2737, time_slice = 4000000
fork: pid = 12557, tgid = 12553, ppid = 2737, time_slice = 3976416
exit: pid = 12557, tgid = 12553, ppid = 2737, time_slice = 1988208
fork: pid = 12558, tgid = 12553, ppid = 2737, time_slice = 1965192
exit: pid = 12558, tgid = 12553, ppid = 2737, time_slice = 982596
fork: pid = 12559, tgid = 12553, ppid = 2737, time_slice = 955481
exit: pid = 12559, tgid = 12553, ppid = 2737, time_slice = 477740
fork: pid = 12560, tgid = 12553, ppid = 2737, time_slice = 453317
exit: pid = 12560, tgid = 12553, ppid = 2737, time_slice = 226658
fork: pid = 12561, tgid = 12553, ppid = 2737, time_slice = 203189
exit: pid = 12561, tgid = 12553, ppid = 2737, time_slice = 101594
fork: pid = 12562, tgid = 12553, ppid = 2737, time_slice = 76756
exit: pid = 12562, tgid = 12553, ppid = 2737, time_slice = 38378
fork: pid = 12563, tgid = 12553, ppid = 2737, time_slice = 15096
exit: pid = 12563, tgid = 12553, ppid = 2737, time_slice = 7548
exit: pid = 12553, tgid = 12553, ppid = 2737, time_slice = 7851311
NOTE: Nanosecond is the unit of time_slice on SD, so the time_slice
value is very differ from the current scheduler's one.
Satoru
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [BUG] scheduler: first timeslice of the exiting thread
2007-04-09 6:09 ` Andrew Morton
2007-04-09 6:50 ` Mike Galbraith
@ 2007-04-13 15:31 ` Con Kolivas
2007-04-16 2:16 ` [PATCH] scheduler: fix the return of the first time_slice Satoru Takeuchi
1 sibling, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2007-04-13 15:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: Satoru Takeuchi, Ingo Molnar, Linux Kernel, Mike Galbraith
On Monday 09 April 2007 16:09, Andrew Morton wrote:
> On Sat, 07 Apr 2007 16:31:39 +0900 Satoru Takeuchi
<takeuchi_satoru@jp.fujitsu.com> wrote:
> > When I was examining the following program ...
> >
> > 1. There are a large amount of small jobs takes several msecs,
> > and the number of job increases constantly.
> > 2. The process creates a thread or a process per job (I examined both
> > the thread model and the process model).
> > 3. Each child process/thread does the assigned job and exit
> > immediately.
> >
> > ... I found that the thread model's latency is longer than proess
> > model's one against my expectation. It's because of the current
> > sched_fork()/sched_exit() implementation as follows:
> >
> > a) On sched_fork, the creator share its timeslice with new process.
> > b) On sched_exit, if the exiting process didn't exhaust its first
> > timeslice yet, it gives its timeslice to the parent.
> >
> > It has no problem on the process model since the creator is the parent.
> > However, on the thread model, the creator is not the parent, it is same
> > as the creator's parent. Hence, on this kind of program, the creator
> > can't retrieve shared timeslice and exausts its timeslice at a rate of
> > knots. In addition, somehow, the parent (typically shell?) gets extra
> > timeslice.
> >
> > I believe it's a bug and the exiting process should give its timeslice
> > to the creator. Now I have some patch plan to fix this problem as follow:
> >
> > a) Add the field for the creator to task_struct. It needs extra memory.
> > b) Doesn't add extra field and have thread's parent the creater, which
> > is same as process creation. However it has many side effects, for
> > example, we also need to change sys_getppid() implementation.
> >
> > What do you think? Any comments are welcome.
>
> This comes at an awkward time, because we might well merge the
> staircase/deadline work into 2.6.22, and I think it rewrites the part of
> the scheduler which is causing the problems you're observing.
>
> Has anyone verified that SD fixes this problem and the one at
> http://lkml.org/lkml/2007/4/7/21 ?
No, SD does nothing different in this regard. There is no notion of who made
the thread and who the remaining timeslice should go to. As you say, some
decision on sched_exit should probably be used to decide where to send it. In
the absence of yet more fields in task_struct, the easiest thing to do would
be to check if the the thread id is equal to the pid and if not, send it to
the pid (or any parent of that if it no longer exists). Whether it's worth
adding yet another field to task_struct to track this or not I do not have
enough information to make an informed choice in this regard. Either way I'm
low on patch-creation cycles so please feel free to provide your solution.
--
-ck
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] scheduler: fix the return of the first time_slice
2007-04-13 15:31 ` Con Kolivas
@ 2007-04-16 2:16 ` Satoru Takeuchi
2007-04-16 4:36 ` [PATCH -mm] " Satoru Takeuchi
0 siblings, 1 reply; 15+ messages in thread
From: Satoru Takeuchi @ 2007-04-16 2:16 UTC (permalink / raw)
To: Con Kolivas
Cc: Andrew Morton, Satoru Takeuchi, Ingo Molnar, Linux Kernel,
Mike Galbraith, Oleg Nesterov
At Sat, 14 Apr 2007 01:31:12 +1000,
Con Kolivas wrote:
>
> On Monday 09 April 2007 16:09, Andrew Morton wrote:
> > On Sat, 07 Apr 2007 16:31:39 +0900 Satoru Takeuchi
> <takeuchi_satoru@jp.fujitsu.com> wrote:
> > > When I was examining the following program ...
> > >
> > > 1. There are a large amount of small jobs takes several msecs,
> > > and the number of job increases constantly.
> > > 2. The process creates a thread or a process per job (I examined both
> > > the thread model and the process model).
> > > 3. Each child process/thread does the assigned job and exit
> > > immediately.
> > >
> > > ... I found that the thread model's latency is longer than proess
> > > model's one against my expectation. It's because of the current
> > > sched_fork()/sched_exit() implementation as follows:
> > >
> > > a) On sched_fork, the creator share its timeslice with new process.
> > > b) On sched_exit, if the exiting process didn't exhaust its first
> > > timeslice yet, it gives its timeslice to the parent.
> > >
> > > It has no problem on the process model since the creator is the parent.
> > > However, on the thread model, the creator is not the parent, it is same
> > > as the creator's parent. Hence, on this kind of program, the creator
> > > can't retrieve shared timeslice and exausts its timeslice at a rate of
> > > knots. In addition, somehow, the parent (typically shell?) gets extra
> > > timeslice.
> > >
> > > I believe it's a bug and the exiting process should give its timeslice
> > > to the creator. Now I have some patch plan to fix this problem as follow:
> > >
> > > a) Add the field for the creator to task_struct. It needs extra memory.
> > > b) Doesn't add extra field and have thread's parent the creater, which
> > > is same as process creation. However it has many side effects, for
> > > example, we also need to change sys_getppid() implementation.
> > >
> > > What do you think? Any comments are welcome.
> >
> > This comes at an awkward time, because we might well merge the
> > staircase/deadline work into 2.6.22, and I think it rewrites the part of
> > the scheduler which is causing the problems you're observing.
> >
> > Has anyone verified that SD fixes this problem and the one at
> > http://lkml.org/lkml/2007/4/7/21 ?
>
> No, SD does nothing different in this regard. There is no notion of who made
> the thread and who the remaining timeslice should go to. As you say, some
> decision on sched_exit should probably be used to decide where to send it. In
> the absence of yet more fields in task_struct, the easiest thing to do would
> be to check if the the thread id is equal to the pid and if not, send it to
> the pid (or any parent of that if it no longer exists). Whether it's worth
> adding yet another field to task_struct to track this or not I do not have
> enough information to make an informed choice in this regard. Either way I'm
> low on patch-creation cycles so please feel free to provide your solution.
I wrote the patch fixing this problem. It's for 2.6.21-rc6 and works well.
I also examined the following load tests:
- Compile kernel with -j4 option continuously for 8 hours on i386 UP box.
- Compile kernel with -j48 option continuously for 2 hours on ia64 12 CPU
box.
Now I'm testing its 2.6.21-rc6-mm1(with SD) version and would submit it soon.
Thanks,
Satoru
---
Fix the return of the first time_slice
Currently exiting process takes back its first_time_slice to its
parent. If the task is created with CLONE_PARENT or CLONE_THREAD
flag, the parent is not the creator. Hence the creator can't
collect the time_slice and time_slice leak occurs.
To fix this problem, remove first_time_slice field of task_struct
and introduce time_slice_reaper field instead. The new field means
the task which exiting task should return its first_time_slice to,
and is NULL after exhausting its first time_slice.
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Index: linux-2.6.21-rc6.tsfix/kernel/exit.c
===================================================================
--- linux-2.6.21-rc6.tsfix.orig/kernel/exit.c 2007-04-13 20:53:11.000000000 +0900
+++ linux-2.6.21-rc6.tsfix/kernel/exit.c 2007-04-14 13:58:29.000000000 +0900
@@ -679,6 +679,8 @@
reaper = child_reaper(father);
break;
}
+ if (reaper->time_slice_reaper == father)
+ p->time_slice_reaper = NULL;
} while (reaper->exit_state);
/*
@@ -700,6 +702,8 @@
if (father == p->real_parent) {
/* reparent with a reaper, real father it's us */
+ if (p->time_slice_reaper == father)
+ p->time_slice_reaper = NULL;
choose_new_parent(p, reaper);
reparent_thread(p, father, 0);
} else {
@@ -721,6 +725,8 @@
}
list_for_each_safe(_p, _n, &father->ptrace_children) {
p = list_entry(_p, struct task_struct, ptrace_list);
+ if (p->time_slice_reaper == father)
+ p->time_slice_reaper = NULL;
choose_new_parent(p, reaper);
reparent_thread(p, father, 1);
}
Index: linux-2.6.21-rc6.tsfix/kernel/sched.c
===================================================================
--- linux-2.6.21-rc6.tsfix.orig/kernel/sched.c 2007-04-13 20:53:11.000000000 +0900
+++ linux-2.6.21-rc6.tsfix/kernel/sched.c 2007-04-14 01:27:20.000000000 +0900
@@ -1628,7 +1628,7 @@
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
- p->first_time_slice = 1;
+ p->time_slice_reaper = current;
current->time_slice >>= 1;
p->timestamp = sched_clock();
if (unlikely(!current->time_slice)) {
@@ -1739,19 +1739,23 @@
{
unsigned long flags;
struct rq *rq;
+ struct task_struct *reaper = p->time_slice_reaper;
/*
* If the child was a (relative-) CPU hog then decrease
* the sleep_avg of the parent as well.
*/
- rq = task_rq_lock(p->parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
- p->parent->time_slice += p->time_slice;
- if (unlikely(p->parent->time_slice > task_timeslice(p)))
- p->parent->time_slice = task_timeslice(p);
+ if (!reaper)
+ return;
+
+ rq = task_rq_lock(reaper, &flags);
+ if (task_cpu(p) == task_cpu(reaper)) {
+ reaper->time_slice += p->time_slice;
+ if (unlikely(reaper->time_slice > task_timeslice(reaper)))
+ reaper->time_slice = task_timeslice(reaper);
}
- if (p->sleep_avg < p->parent->sleep_avg)
- p->parent->sleep_avg = p->parent->sleep_avg /
+ if (p->sleep_avg < reaper->sleep_avg)
+ reaper->sleep_avg = reaper->sleep_avg /
(EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
(EXIT_WEIGHT + 1);
task_rq_unlock(rq, &flags);
@@ -3153,7 +3157,7 @@
*/
if ((p->policy == SCHED_RR) && !--p->time_slice) {
p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ p->time_slice_reaper = NULL;
set_tsk_need_resched(p);
/* put it at the end of the queue: */
@@ -3166,7 +3170,7 @@
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ p->time_slice_reaper = NULL;
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
Index: linux-2.6.21-rc6.tsfix/include/linux/sched.h
===================================================================
--- linux-2.6.21-rc6.tsfix.orig/include/linux/sched.h 2007-04-13 20:58:27.000000000 +0900
+++ linux-2.6.21-rc6.tsfix/include/linux/sched.h 2007-04-13 20:59:03.000000000 +0900
@@ -827,7 +827,8 @@
unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ unsigned int time_slice;
+ struct task_struct *time_slice_reaper;
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info sched_info;
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH -mm] scheduler: fix the return of the first time_slice
2007-04-16 2:16 ` [PATCH] scheduler: fix the return of the first time_slice Satoru Takeuchi
@ 2007-04-16 4:36 ` Satoru Takeuchi
2007-04-16 10:45 ` Oleg Nesterov
0 siblings, 1 reply; 15+ messages in thread
From: Satoru Takeuchi @ 2007-04-16 4:36 UTC (permalink / raw)
To: Con Kolivas
Cc: Andrew Morton, Satoru Takeuchi, Ingo Molnar, Linux Kernel,
Mike Galbraith, Oleg Nesterov
At Mon, 16 Apr 2007 11:16:24 +0900,
Satoru Takeuchi wrote:
>
> At Sat, 14 Apr 2007 01:31:12 +1000,
> Con Kolivas wrote:
> >
> > On Monday 09 April 2007 16:09, Andrew Morton wrote:
> > > On Sat, 07 Apr 2007 16:31:39 +0900 Satoru Takeuchi
> > <takeuchi_satoru@jp.fujitsu.com> wrote:
> > > > When I was examining the following program ...
> > > >
> > > > 1. There are a large amount of small jobs takes several msecs,
> > > > and the number of job increases constantly.
> > > > 2. The process creates a thread or a process per job (I examined both
> > > > the thread model and the process model).
> > > > 3. Each child process/thread does the assigned job and exit
> > > > immediately.
> > > >
> > > > ... I found that the thread model's latency is longer than proess
> > > > model's one against my expectation. It's because of the current
> > > > sched_fork()/sched_exit() implementation as follows:
> > > >
> > > > a) On sched_fork, the creator share its timeslice with new process.
> > > > b) On sched_exit, if the exiting process didn't exhaust its first
> > > > timeslice yet, it gives its timeslice to the parent.
> > > >
> > > > It has no problem on the process model since the creator is the parent.
> > > > However, on the thread model, the creator is not the parent, it is same
> > > > as the creator's parent. Hence, on this kind of program, the creator
> > > > can't retrieve shared timeslice and exausts its timeslice at a rate of
> > > > knots. In addition, somehow, the parent (typically shell?) gets extra
> > > > timeslice.
> > > >
> > > > I believe it's a bug and the exiting process should give its timeslice
> > > > to the creator. Now I have some patch plan to fix this problem as follow:
> > > >
> > > > a) Add the field for the creator to task_struct. It needs extra memory.
> > > > b) Doesn't add extra field and have thread's parent the creater, which
> > > > is same as process creation. However it has many side effects, for
> > > > example, we also need to change sys_getppid() implementation.
> > > >
> > > > What do you think? Any comments are welcome.
> > >
> > > This comes at an awkward time, because we might well merge the
> > > staircase/deadline work into 2.6.22, and I think it rewrites the part of
> > > the scheduler which is causing the problems you're observing.
> > >
> > > Has anyone verified that SD fixes this problem and the one at
> > > http://lkml.org/lkml/2007/4/7/21 ?
> >
> > No, SD does nothing different in this regard. There is no notion of who made
> > the thread and who the remaining timeslice should go to. As you say, some
> > decision on sched_exit should probably be used to decide where to send it. In
> > the absence of yet more fields in task_struct, the easiest thing to do would
> > be to check if the the thread id is equal to the pid and if not, send it to
> > the pid (or any parent of that if it no longer exists). Whether it's worth
> > adding yet another field to task_struct to track this or not I do not have
> > enough information to make an informed choice in this regard. Either way I'm
> > low on patch-creation cycles so please feel free to provide your solution.
>
> I wrote the patch fixing this problem. It's for 2.6.21-rc6 and works well.
> I also examined the following load tests:
>
> - Compile kernel with -j4 option continuously for 8 hours on i386 UP box.
> - Compile kernel with -j48 option continuously for 2 hours on ia64 12 CPU
> box.
>
> Now I'm testing its 2.6.21-rc6-mm1(with SD) version and would submit it soon.
Here is the -mm version. I examined the same test as -rc6.
Thanks,
Satoru
---
Fix the return of the first time_slice
Currently exiting process takes back its first_time_slice to its
parent. If the task is created with CLONE_PARENT or CLONE_THREAD
flag, the parent is not the creator. Hence the creator can't
collect the time_slice and time_slice leak occurs.
To fix this problem, remove first_time_slice field of task_struct
and introduce time_slice_reaper field instead. The new field means
the task which exiting task should return its first_time_slice to,
and is NULL after exhausting its first time_slice.
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Index: linux-2.6.21-rc6-mm1.tsfix/include/linux/sched.h
===================================================================
--- linux-2.6.21-rc6-mm1.tsfix.orig/include/linux/sched.h 2007-04-15 16:56:12.000000000 +0900
+++ linux-2.6.21-rc6-mm1.tsfix/include/linux/sched.h 2007-04-15 16:56:17.000000000 +0900
@@ -864,8 +864,8 @@
* before being requeued at a lower priority.
*/
int time_slice;
- /* Is this the very first time_slice this task has ever run. */
- unsigned int first_time_slice;
+ /* The task which this task should bring back its first time_slice. */
+ struct task_struct *time_slice_reaper;
/* How much this task receives at each priority level */
int quota;
Index: linux-2.6.21-rc6-mm1.tsfix/kernel/exit.c
===================================================================
--- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/exit.c 2007-04-15 16:56:12.000000000 +0900
+++ linux-2.6.21-rc6-mm1.tsfix/kernel/exit.c 2007-04-15 16:56:17.000000000 +0900
@@ -680,10 +680,14 @@
reaper = child_reaper(father);
break;
}
+ if (reaper->time_slice_reaper == father)
+ reaper->time_slice_reaper = NULL;
} while (reaper->exit_state);
list_for_each_safe(_p, _n, &father->children) {
p = list_entry(_p, struct task_struct, sibling);
+ if (p->time_slice_reaper == father)
+ p->time_slice_reaper = NULL;
choose_new_parent(p, reaper);
reparent_thread(p, father);
}
Index: linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c
===================================================================
--- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/sched.c 2007-04-15 16:56:12.000000000 +0900
+++ linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c 2007-04-15 16:56:17.000000000 +0900
@@ -1693,7 +1693,7 @@
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
- p->first_time_slice = 1;
+ p->time_slice_reaper = current;
p->timestamp = sched_clock();
local_irq_enable();
out:
@@ -1768,16 +1768,18 @@
*/
void fastcall sched_exit(struct task_struct *p)
{
- struct task_struct *parent;
+ struct task_struct *reaper = p->time_slice_reaper;
unsigned long flags;
struct rq *rq;
- parent = p->parent;
- rq = task_rq_lock(parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(parent)) {
- parent->time_slice += p->time_slice;
- if (unlikely(parent->time_slice > parent->quota))
- parent->time_slice = parent->quota;
+ if (!reaper)
+ return;
+
+ rq = task_rq_lock(reaper, &flags);
+ if (p->time_slice_reaper && task_cpu(p) == task_cpu(reaper)) {
+ reaper->time_slice += p->time_slice;
+ if (unlikely(reaper->time_slice > reaper->quota))
+ reaper->time_slice = reaper->quota;
}
task_rq_unlock(rq, &flags);
}
@@ -3352,8 +3354,8 @@
int overrun;
int old_prio;
- if (unlikely(p->first_time_slice))
- p->first_time_slice = 0;
+ if (unlikely(p->time_slice_reaper))
+ p->time_slice_reaper = NULL;
if (rt_task(p)) {
p->time_slice = p->quota;
list_move_tail(&p->run_list, p->array->queue + p->prio);
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH -mm] scheduler: fix the return of the first time_slice
2007-04-16 4:36 ` [PATCH -mm] " Satoru Takeuchi
@ 2007-04-16 10:45 ` Oleg Nesterov
2007-04-16 11:47 ` Oleg Nesterov
2007-04-16 11:58 ` Satoru Takeuchi
0 siblings, 2 replies; 15+ messages in thread
From: Oleg Nesterov @ 2007-04-16 10:45 UTC (permalink / raw)
To: Satoru Takeuchi
Cc: Con Kolivas, Andrew Morton, Ingo Molnar, Linux Kernel,
Mike Galbraith
On 04/16, Satoru Takeuchi wrote:
>
> --- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/exit.c 2007-04-15 16:56:12.000000000 +0900
> +++ linux-2.6.21-rc6-mm1.tsfix/kernel/exit.c 2007-04-15 16:56:17.000000000 +0900
> @@ -680,10 +680,14 @@
> reaper = child_reaper(father);
> break;
> }
> + if (reaper->time_slice_reaper == father)
> + reaper->time_slice_reaper = NULL;
> } while (reaper->exit_state);
>
> list_for_each_safe(_p, _n, &father->children) {
> p = list_entry(_p, struct task_struct, sibling);
> + if (p->time_slice_reaper == father)
> + p->time_slice_reaper = NULL;
> choose_new_parent(p, reaper);
> reparent_thread(p, father);
> }
> Index: linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c
> ===================================================================
> --- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/sched.c 2007-04-15 16:56:12.000000000 +0900
> +++ linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c 2007-04-15 16:56:17.000000000 +0900
> @@ -1693,7 +1693,7 @@
> * The remainder of the first timeslice might be recovered by
> * the parent if the child exits early enough.
> */
> - p->first_time_slice = 1;
> + p->time_slice_reaper = current;
> p->timestamp = sched_clock();
> local_irq_enable();
I am afraid this doesn't work for CLONE_THREAD. Suppose that some sub-thread (not
main thread) T1 creates another sub-thread, T2.
T2->time_slice_reaper = T1.
then T1 exits, but forget_original_parent(T1) doesn't change T2->time_slice_reaper,
because T2 is not on T1->children list. Now T2->time_slice_reaper points to an
already freed task_struct.
Oleg.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH -mm] scheduler: fix the return of the first time_slice
2007-04-16 10:45 ` Oleg Nesterov
@ 2007-04-16 11:47 ` Oleg Nesterov
2007-04-16 11:54 ` Ingo Molnar
2007-04-16 11:58 ` Satoru Takeuchi
1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2007-04-16 11:47 UTC (permalink / raw)
To: Satoru Takeuchi
Cc: Con Kolivas, Andrew Morton, Ingo Molnar, Linux Kernel,
Mike Galbraith
On 04/16, Oleg Nesterov wrote:
>
> On 04/16, Satoru Takeuchi wrote:
> >
> > --- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/exit.c 2007-04-15 16:56:12.000000000 +0900
> > +++ linux-2.6.21-rc6-mm1.tsfix/kernel/exit.c 2007-04-15 16:56:17.000000000 +0900
> > @@ -680,10 +680,14 @@
> > reaper = child_reaper(father);
> > break;
> > }
> > + if (reaper->time_slice_reaper == father)
> > + reaper->time_slice_reaper = NULL;
> > } while (reaper->exit_state);
> >
> > list_for_each_safe(_p, _n, &father->children) {
> > p = list_entry(_p, struct task_struct, sibling);
> > + if (p->time_slice_reaper == father)
> > + p->time_slice_reaper = NULL;
> > choose_new_parent(p, reaper);
> > reparent_thread(p, father);
> > }
> > Index: linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/sched.c 2007-04-15 16:56:12.000000000 +0900
> > +++ linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c 2007-04-15 16:56:17.000000000 +0900
> > @@ -1693,7 +1693,7 @@
> > * The remainder of the first timeslice might be recovered by
> > * the parent if the child exits early enough.
> > */
> > - p->first_time_slice = 1;
> > + p->time_slice_reaper = current;
> > p->timestamp = sched_clock();
> > local_irq_enable();
>
> I am afraid this doesn't work for CLONE_THREAD. Suppose that some sub-thread (not
> main thread) T1 creates another sub-thread, T2.
>
> T2->time_slice_reaper = T1.
>
> then T1 exits, but forget_original_parent(T1) doesn't change T2->time_slice_reaper,
> because T2 is not on T1->children list. Now T2->time_slice_reaper points to an
> already freed task_struct.
In case I was not clear...
Note that "do {} while ()" loop in forget_original_parent() stops when it finds
a sub-thread with ->exit_state == 0. This means we can miss another sub-thread
which has ->time_slice_reaper == father.
To make this correct, we should iterate over all thread-group, but this can slow
down exit() when we have a lot of threads.
I guess we need Ingo's opinion on that.
Oleg.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH -mm] scheduler: fix the return of the first time_slice
2007-04-16 11:47 ` Oleg Nesterov
@ 2007-04-16 11:54 ` Ingo Molnar
0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2007-04-16 11:54 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Satoru Takeuchi, Con Kolivas, Andrew Morton, Linux Kernel,
Mike Galbraith
* Oleg Nesterov <oleg@tv-sign.ru> wrote:
> > > * The remainder of the first timeslice might be recovered by
> > > * the parent if the child exits early enough.
> > > */
> > > - p->first_time_slice = 1;
> > > + p->time_slice_reaper = current;
> > > p->timestamp = sched_clock();
> > > local_irq_enable();
> >
> > I am afraid this doesn't work for CLONE_THREAD. Suppose that some
> > sub-thread (not main thread) T1 creates another sub-thread, T2.
> In case I was not clear...
> To make this correct, we should iterate over all thread-group, but
> this can slow down exit() when we have a lot of threads.
>
> I guess we need Ingo's opinion on that.
right now my first cautious estimation seems to be that we might be able
to get rid of this whole child/parent timeslice sharing complexity and
do all the scheduling setup without affecting the parent - hence
avoiding all the reaper problems as well. People reported interactivity
improvements with this removed from CFS. (It all still needs a ton of
validation to make sure, but the trend seems to be this.)
(the only valid component of that complexity is 'child runs first' - but
it's not really related to the timesplice splitting thing just
intermixed with it.)
Ingo
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH -mm] scheduler: fix the return of the first time_slice
2007-04-16 10:45 ` Oleg Nesterov
2007-04-16 11:47 ` Oleg Nesterov
@ 2007-04-16 11:58 ` Satoru Takeuchi
1 sibling, 0 replies; 15+ messages in thread
From: Satoru Takeuchi @ 2007-04-16 11:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Satoru Takeuchi, Con Kolivas, Andrew Morton, Ingo Molnar,
Linux Kernel, Mike Galbraith
At Mon, 16 Apr 2007 14:45:59 +0400,
Oleg Nesterov wrote:
>
> On 04/16, Satoru Takeuchi wrote:
> >
> > --- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/exit.c 2007-04-15 16:56:12.000000000 +0900
> > +++ linux-2.6.21-rc6-mm1.tsfix/kernel/exit.c 2007-04-15 16:56:17.000000000 +0900
> > @@ -680,10 +680,14 @@
> > reaper = child_reaper(father);
> > break;
> > }
> > + if (reaper->time_slice_reaper == father)
> > + reaper->time_slice_reaper = NULL;
> > } while (reaper->exit_state);
> >
> > list_for_each_safe(_p, _n, &father->children) {
> > p = list_entry(_p, struct task_struct, sibling);
> > + if (p->time_slice_reaper == father)
> > + p->time_slice_reaper = NULL;
> > choose_new_parent(p, reaper);
> > reparent_thread(p, father);
> > }
> > Index: linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.21-rc6-mm1.tsfix.orig/kernel/sched.c 2007-04-15 16:56:12.000000000 +0900
> > +++ linux-2.6.21-rc6-mm1.tsfix/kernel/sched.c 2007-04-15 16:56:17.000000000 +0900
> > @@ -1693,7 +1693,7 @@
> > * The remainder of the first timeslice might be recovered by
> > * the parent if the child exits early enough.
> > */
> > - p->first_time_slice = 1;
> > + p->time_slice_reaper = current;
> > p->timestamp = sched_clock();
> > local_irq_enable();
>
> I am afraid this doesn't work for CLONE_THREAD. Suppose that some sub-thread (not
> main thread) T1 creates another sub-thread, T2.
>
> T2->time_slice_reaper = T1.
>
> then T1 exits, but forget_original_parent(T1) doesn't change T2->time_slice_reaper,
> because T2 is not on T1->children list. Now T2->time_slice_reaper points to an
> already freed task_struct.
I intended that CLONE_THREAD case is managed on the do-while loop seeking `reaper'.
However, anyway, my patch has a bug. On T1 exiting, if there are any living threads
ahead of T2, T2->time_slice_reaper isn't checked. What a easy mistake ... sorry.
I'll post fixed version soon.
Satoru
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2007-04-16 12:01 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-07 7:31 [BUG] scheduler: first timeslice of the exiting thread Satoru Takeuchi
2007-04-07 7:45 ` Satoru Takeuchi
2007-04-09 6:09 ` Andrew Morton
2007-04-09 6:50 ` Mike Galbraith
2007-04-10 7:18 ` Satoru Takeuchi
2007-04-13 15:31 ` Con Kolivas
2007-04-16 2:16 ` [PATCH] scheduler: fix the return of the first time_slice Satoru Takeuchi
2007-04-16 4:36 ` [PATCH -mm] " Satoru Takeuchi
2007-04-16 10:45 ` Oleg Nesterov
2007-04-16 11:47 ` Oleg Nesterov
2007-04-16 11:54 ` Ingo Molnar
2007-04-16 11:58 ` Satoru Takeuchi
-- strict thread matches above, loose matches on Subject: below --
2007-04-09 10:29 [BUG] scheduler: first timeslice of the exiting thread Oleg Nesterov
2007-04-09 11:03 ` Oleg Nesterov
2007-04-10 1:19 ` Satoru Takeuchi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox