* Job control bug in revision 3800d4934391b,
@ 2010-05-28 0:21 Kris Maglione
2010-05-28 4:53 ` Herbert Xu
2010-05-28 23:09 ` Jilles Tjoelker
0 siblings, 2 replies; 9+ messages in thread
From: Kris Maglione @ 2010-05-28 0:21 UTC (permalink / raw)
To: dash
I'm not sure how to describe this bug, but it's affected one of my
scripts, and those of several of my users. Basically, we've had loops
dieing when backgrounded programs exit. This is the simplest test case
I can come up with:
#!/bin/dash
{
echo foo
sleep 1
echo foo
echo done>/dev/tty
} | while read p; do
( echo good & ) &
done
echo done
In versions prior to 3800d4934391b, the output would
"good\ndone\ndone\ngood" (or some permutation thereof depending on
system load), but from 3800d4934391b on, it's "good\ndone".
The offending revision:
[JOBS] Fix dowait signal race
author Herbert Xu <herbert@gondor.apana.org.au>
Sun, 22 Feb 2009 10:10:01 +0000 (18:10 +0800)
committer Herbert Xu <herbert@gondor.apana.org.au>
Sun, 22 Feb 2009 10:10:01 +0000 (18:10 +0800)
commit 3800d4934391b144fd261a7957aea72ced7d47ea
tree 40c003ab3063ceab7f3615a623a09d3c610332a0
parent 6045fe25078345074f027312d106d3fc19df56e5
[JOBS] Fix dowait signal race
This test program by Alexey Gladkov can cause dash to enter an
infinite loop in waitcmd.
#!/bin/dash
trap "echo TRAP" USR1
stub() {
echo ">>> STUB $1" >&2
sleep $1
echo "<<< STUB $1" >&2
kill -USR1 $$
}
stub 3 &
stub 2 &
until { echo "###"; wait; } do
echo "*** $?"
done
The problem is that if we get a signal after the wait3 system
call has returned but before we get to INTON in dowait, then
we can jump back up to the top and lose the exit status. So
if we then wait for the job that has just exited, then it'll
stay there forever.
I made the original change that caused this bug to fix pretty
much the same bug but in the opposite direction. That is, if
we get a signal after we enter wait3 but before we hit the kernel
then it too can cause the wait to go on forever (assuming the
child doesn't exit).
In fact this is pretty much exactly the scenario that you'll
find in glibc's documentation on pause(). The solution is given
there too, in the form of sigsuspend, which is the only way to
do the check and wait atomically.
So this patch fixes Alexey's race without reintroducing the old
bug by converting the blocking wait3 to a sigsuspend.
In order to do this we need to set a signal handler for SIGCHLD,
so the code has been modified to always do that.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
--
Kris Maglione
If you want to go somewhere, goto is the best way to get there.
--Ken Thompson
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 0:21 Job control bug in revision 3800d4934391b, Kris Maglione
@ 2010-05-28 4:53 ` Herbert Xu
2010-05-28 15:34 ` Kris Maglione
2010-05-28 23:09 ` Jilles Tjoelker
1 sibling, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2010-05-28 4:53 UTC (permalink / raw)
To: Kris Maglione; +Cc: dash
Kris Maglione <maglione.k@gmail.com> wrote:
>
> I'm not sure how to describe this bug, but it's affected one of my
> scripts, and those of several of my users. Basically, we've had loops
> dieing when backgrounded programs exit. This is the simplest test case
> I can come up with:
>
> #!/bin/dash
> {
> echo foo
> sleep 1
> echo foo
> echo done>/dev/tty
> } | while read p; do
> ( echo good & ) &
> done
> echo done
>
>
> In versions prior to 3800d4934391b, the output would
> "good\ndone\ndone\ngood" (or some permutation thereof depending on
> system load), but from 3800d4934391b on, it's "good\ndone".
This should be fixed by the patch that I posted yesterday.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 4:53 ` Herbert Xu
@ 2010-05-28 15:34 ` Kris Maglione
2010-05-28 22:43 ` Herbert Xu
0 siblings, 1 reply; 9+ messages in thread
From: Kris Maglione @ 2010-05-28 15:34 UTC (permalink / raw)
To: Herbert Xu, dash
On Fri, May 28, 2010 at 02:53:16PM +1000, Herbert Xu wrote:
>Kris Maglione <maglione.k@gmail.com> wrote:
>>
>> I'm not sure how to describe this bug, but it's affected one of my
>> scripts, and those of several of my users. Basically, we've had loops
>> dieing when backgrounded programs exit. This is the simplest test case
>> I can come up with:
>>
>> #!/bin/dash
>> {
>> echo foo
>> sleep 1
>> echo foo
>> echo done>/dev/tty
>> } | while read p; do
>> ( echo good & ) &
>> done
>> echo done
>>
>>
>> In versions prior to 3800d4934391b, the output would
>> "good\ndone\ndone\ngood" (or some permutation thereof depending on
>> system load), but from 3800d4934391b on, it's "good\ndone".
>
>This should be fixed by the patch that I posted yesterday.
I should have mentioned that I tested it with every revision
upto and including that one (especially as it looked promising).
I definitely still have this problem as of
207e4c2a322fe: [JOBS] Fix wait regression where it does not wait for all jobs master
Thanks,
--
Kris Maglione
Real Programmers don't believe in schedules. Planners make up
schedules. Managers "firm up" schedules. Frightened coders strive to
meet schedules. Real Programmers ignore schedules.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 15:34 ` Kris Maglione
@ 2010-05-28 22:43 ` Herbert Xu
2010-05-28 23:01 ` Kris Maglione
0 siblings, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2010-05-28 22:43 UTC (permalink / raw)
To: Kris Maglione; +Cc: dash
Kris Maglione <maglione.k@gmail.com> wrote:
>
> I should have mentioned that I tested it with every revision
> upto and including that one (especially as it looked promising).
>
> I definitely still have this problem as of
> 207e4c2a322fe: [JOBS] Fix wait regression where it does not wait for all jobs master
Hmm, I can't reproduce this here at all. Can you please run it
with strace -f and send the result to us?
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 22:43 ` Herbert Xu
@ 2010-05-28 23:01 ` Kris Maglione
2010-05-28 23:10 ` Herbert Xu
0 siblings, 1 reply; 9+ messages in thread
From: Kris Maglione @ 2010-05-28 23:01 UTC (permalink / raw)
To: Herbert Xu; +Cc: dash
[-- Attachment #1: Type: text/plain, Size: 741 bytes --]
On Sat, May 29, 2010 at 08:43:34AM +1000, Herbert Xu wrote:
>Kris Maglione <maglione.k@gmail.com> wrote:
>>
>> I should have mentioned that I tested it with every revision
>> upto and including that one (especially as it looked promising).
>>
>> I definitely still have this problem as of
>> 207e4c2a322fe: [JOBS] Fix wait regression where it does not wait for all jobs master
>
>Hmm, I can't reproduce this here at all. Can you please run it
>with strace -f and send the result to us?
That's interesting. I've had three users complain about it so
far. strace is attached.
--
Kris Maglione
Insane people are always sure that they are fine. It is only the sane
people who are willing to admit that they are crazy.
--Nora Ephron
[-- Attachment #2: strace-dash1.log --]
[-- Type: text/plain, Size: 11444 bytes --]
execve("/home/kris/code/dash.git/src/dash", ["/home/kris/code/dash.git/src/das"...], [/* 97 vars */]) = 0
brk(0) = 0x174d000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b299768c000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=258429, ...}) = 0
mmap(NULL, 258429, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b299768d000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\354\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1651953, ...}) = 0
mmap(NULL, 3521384, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b299788e000
mprotect(0x2b29979e0000, 2097152, PROT_NONE) = 0
mmap(0x2b2997be0000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x152000) = 0x2b2997be0000
mmap(0x2b2997be5000, 19304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b2997be5000
close(3) = 0
open("/usr/lib/libedit.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\266\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=199791, ...}) = 0
mmap(NULL, 2289024, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b2997bea000
mprotect(0x2b2997c12000, 2097152, PROT_NONE) = 0
mmap(0x2b2997e12000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x28000) = 0x2b2997e12000
mmap(0x2b2997e15000, 15744, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b2997e15000
close(3) = 0
open("/lib/libncursesw.so.5", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0B\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=394095, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2997e19000
mmap(NULL, 2456904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b2997e1a000
mprotect(0x2b2997e6e000, 2093056, PROT_NONE) = 0
mmap(0x2b299806d000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x53000) = 0x2b299806d000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2998072000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2998073000
arch_prctl(ARCH_SET_FS, 0x2b2998072700) = 0
mprotect(0x2b2997be0000, 16384, PROT_READ) = 0
mprotect(0x2b299788b000, 4096, PROT_READ) = 0
munmap(0x2b299768d000, 258429) = 0
getpid() = 23948
rt_sigaction(SIGCHLD, {0x41384e, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0
geteuid() = 1000
brk(0) = 0x174d000
brk(0x176e000) = 0x176e000
getppid() = 23947
stat("/home/kris", {st_mode=S_IFDIR|0755, st_size=270336, ...}) = 0
stat(".", {st_mode=S_IFDIR|0755, st_size=270336, ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff75f09bf0) = -1 ENOTTY (Inappropriate ioctl for device)
rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0
read(0, "#!/bin/dash\n{\n\techo foo\n\tsleep 1"..., 8192) = 118
pipe([3, 4]) = 0
clone(Process 23949 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23949
[pid 23948] close(4) = 0
[pid 23948] clone( <unfinished ...>
[pid 23949] close(3) = 0
[pid 23949] dup2(4, 1) = 1
Process 23950 attached
[pid 23948] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23950
[pid 23949] close(4) = 0
[pid 23948] close(3 <unfinished ...>
[pid 23949] write(1, "foo\n", 4 <unfinished ...>
[pid 23950] dup2(3, 0 <unfinished ...>
[pid 23948] <... close resumed> ) = 0
[pid 23950] <... dup2 resumed> ) = 0
[pid 23948] close(4294967295 <unfinished ...>
[pid 23950] close(3 <unfinished ...>
[pid 23948] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 23950] <... close resumed> ) = 0
[pid 23948] wait4(-1, <unfinished ...>
[pid 23949] <... write resumed> ) = 4
[pid 23950] read(0, <unfinished ...>
[pid 23949] stat("./sleep", <unfinished ...>
[pid 23950] <... read resumed> "f", 1) = 1
[pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory)
[pid 23950] read(0, <unfinished ...>
[pid 23949] stat("/home/kris/bin/sleep", <unfinished ...>
[pid 23950] <... read resumed> "o", 1) = 1
[pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory)
[pid 23950] read(0, <unfinished ...>
[pid 23949] stat("/home/kris/wmiiinst/bin/sleep", <unfinished ...>
[pid 23950] <... read resumed> "o", 1) = 1
[pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory)
[pid 23950] read(0, <unfinished ...>
[pid 23949] stat("/home/kris/.gem/ruby/1.8/bin/sleep", <unfinished ...>
[pid 23950] <... read resumed> "\n", 1) = 1
[pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory)
[pid 23949] stat("/opt/wine/bin/sleep", <unfinished ...>
[pid 23950] clone( <unfinished ...>
[pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory)
[pid 23949] stat("/bin/sleep", {st_mode=S_IFREG|0755, st_size=28448, ...}) = 0
Process 23951 attached
[pid 23950] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23951
[pid 23949] clone( <unfinished ...>
[pid 23951] rt_sigaction(SIGINT, {SIG_IGN, [INT], SA_RESTORER|SA_RESTART, 0x2b29978c04e0}, Process 23952 attached (waiting for parent)
{SIG_DFL, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, 8) = 0
[pid 23950] read(0, Process 23952 resumed (parent 23949 ready)
<unfinished ...>
[pid 23949] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23952
[pid 23952] execve("/bin/sleep", ["sleep", "1"], [/* 97 vars */] <unfinished ...>
[pid 23951] rt_sigaction(SIGQUIT, {SIG_IGN, [QUIT], SA_RESTORER|SA_RESTART, 0x2b29978c04e0}, <unfinished ...>
[pid 23949] wait4(-1, <unfinished ...>
[pid 23951] <... rt_sigaction resumed> {SIG_DFL, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, 8) = 0
[pid 23951] close(0) = 0
[pid 23951] open("/dev/null", O_RDONLY) = 0
[pid 23951] clone( <unfinished ...>
[pid 23952] <... execve resumed> ) = 0
Process 23953 attached (waiting for parent)
Process 23953 resumed (parent 23951 ready)
[pid 23951] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23953
[pid 23953] close(0 <unfinished ...>
[pid 23952] brk(0 <unfinished ...>
[pid 23953] <... close resumed> ) = 0
[pid 23952] <... brk resumed> ) = 0x78f000
[pid 23953] open("/dev/null", O_RDONLY <unfinished ...>
[pid 23951] exit_group(0) = ?
Process 23951 detached
[pid 23953] <... open resumed> ) = 0
[pid 23952] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 23950] <... read resumed> 0x7fff75f098b7, 1) = ? ERESTARTSYS (To be restarted)
[pid 23952] <... mmap resumed> ) = 0x2b185172f000
[pid 23950] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 23952] access("/etc/ld.so.preload", R_OK <unfinished ...>
[pid 23950] rt_sigreturn(0x11 <unfinished ...>
[pid 23952] <... access resumed> ) = -1 ENOENT (No such file or directory)
[pid 23950] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 23953] write(1, "good\n", 5 <unfinished ...>
[pid 23952] open("/etc/ld.so.cache", O_RDONLY <unfinished ...>
[pid 23950] exit_group(0) = ?
Process 23950 detached
[pid 23953] <... write resumed> ) = 5
[pid 23952] <... open resumed> ) = 3
[pid 23953] exit_group(0) = ?
Process 23953 detached
[pid 23948] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23950
[pid 23948] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 23952] fstat(3, <unfinished ...>
[pid 23948] rt_sigreturn(0x11 <unfinished ...>
[pid 23952] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=258429, ...}) = 0
[pid 23948] <... rt_sigreturn resumed> ) = 23950
[pid 23952] mmap(NULL, 258429, PROT_READ, MAP_PRIVATE, 3, 0 <unfinished ...>
[pid 23948] wait4(-1, <unfinished ...>
[pid 23952] <... mmap resumed> ) = 0x2b1851730000
[pid 23952] close(3) = 0
[pid 23952] open("/lib/libc.so.6", O_RDONLY) = 3
[pid 23952] read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\354\1\0\0\0\0\0"..., 832) = 832
[pid 23952] fstat(3, {st_mode=S_IFREG|0755, st_size=1651953, ...}) = 0
[pid 23952] mmap(NULL, 3521384, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b1851931000
[pid 23952] mprotect(0x2b1851a83000, 2097152, PROT_NONE) = 0
[pid 23952] mmap(0x2b1851c83000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x152000) = 0x2b1851c83000
[pid 23952] mmap(0x2b1851c88000, 19304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b1851c88000
[pid 23952] close(3) = 0
[pid 23952] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b1851c8d000
[pid 23952] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b1851c8e000
[pid 23952] arch_prctl(ARCH_SET_FS, 0x2b1851c8db20) = 0
[pid 23952] mprotect(0x2b1851c83000, 16384, PROT_READ) = 0
[pid 23952] mprotect(0x2b185192e000, 4096, PROT_READ) = 0
[pid 23952] munmap(0x2b1851730000, 258429) = 0
[pid 23952] brk(0) = 0x78f000
[pid 23952] brk(0x7b0000) = 0x7b0000
[pid 23952] open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
[pid 23952] fstat(3, {st_mode=S_IFREG|0644, st_size=3196096, ...}) = 0
[pid 23952] mmap(NULL, 3196096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b1851c8f000
[pid 23952] close(3) = 0
[pid 23952] nanosleep({1, 0}, NULL) = 0
[pid 23952] close(1) = 0
[pid 23952] close(2) = 0
[pid 23952] exit_group(0) = ?
Process 23952 detached
[pid 23949] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23952
[pid 23949] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 23949] rt_sigreturn(0x11) = 23952
[pid 23949] write(1, "foo\n", 4) = -1 EPIPE (Broken pipe)
[pid 23949] --- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 23949 detached
<... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL) = 23949
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigreturn(0x11) = 23949
write(1, "done\n", 5) = 5
read(0, "", 8192) = 0
exit_group(0) = ?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 0:21 Job control bug in revision 3800d4934391b, Kris Maglione
2010-05-28 4:53 ` Herbert Xu
@ 2010-05-28 23:09 ` Jilles Tjoelker
1 sibling, 0 replies; 9+ messages in thread
From: Jilles Tjoelker @ 2010-05-28 23:09 UTC (permalink / raw)
To: Kris Maglione; +Cc: dash
On Thu, May 27, 2010 at 08:21:57PM -0400, Kris Maglione wrote:
> I'm not sure how to describe this bug, but it's affected one of my
> scripts, and those of several of my users. Basically, we've had loops
> dieing when backgrounded programs exit. This is the simplest test case
> I can come up with:
> #!/bin/dash
> {
> echo foo
> sleep 1
> echo foo
> echo done>/dev/tty
> } | while read p; do
> ( echo good & ) &
> done
> echo done
> In versions prior to 3800d4934391b, the output would
> "good\ndone\ndone\ngood" (or some permutation thereof depending on
> system load), but from 3800d4934391b on, it's "good\ndone".
> The offending revision:
> [JOBS] Fix dowait signal race
I can reproduce this here. It happens because the child process exit's
SIGCHLD causes an EINTR on the read, which then fails.
While the above testcase works with FreeBSD sh, a similar script with a
trap on CHLD fails in both (and likely also in older dash):
#!/bin/dash
{
echo foo
sleep 1
echo foo
echo done>/dev/tty
} | { trap : CHLD; while read p; do
( echo good & ) &
done }
echo done
Most other shells do not seem to abort a read builtin for a trap. bash
and ksh93 execute the trap right away and then continue the read, while
mksh postpones it until the read is done. POSIX seems vague about it,
but a look at existing scripts suggests that traps should not abort a
read. Executing the trap right away is useful in case it does something
such as exiting, and should be safe given that traps are already invoked
from deeper levels of recursion of the interpreter. If the trap executes
a break, continue or return command outside its boundaries this should
abort the read.
--
Jilles Tjoelker
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 23:01 ` Kris Maglione
@ 2010-05-28 23:10 ` Herbert Xu
2010-05-28 23:40 ` Herbert Xu
0 siblings, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2010-05-28 23:10 UTC (permalink / raw)
To: Kris Maglione; +Cc: dash
On Fri, May 28, 2010 at 07:01:25PM -0400, Kris Maglione wrote:
>
> That's interesting. I've had three users complain about it so far. strace
> is attached.
Thanks, I have reproduced it here now. It looks like it only
happens occasionally on my system.
> [pid 23950] <... read resumed> 0x7fff75f098b7, 1) = ? ERESTARTSYS (To be restarted)
> [pid 23952] <... mmap resumed> ) = 0x2b185172f000
> [pid 23950] --- SIGCHLD (Child exited) @ 0 (0) ---
> [pid 23952] access("/etc/ld.so.preload", R_OK <unfinished ...>
> [pid 23950] rt_sigreturn(0x11 <unfinished ...>
> [pid 23952] <... access resumed> ) = -1 ENOENT (No such file or directory)
> [pid 23950] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
> [pid 23953] write(1, "good\n", 5 <unfinished ...>
> [pid 23952] open("/etc/ld.so.cache", O_RDONLY <unfinished ...>
> [pid 23950] exit_group(0) = ?
So the crux of the issue is that SIGCHLD is causing read to fail
and the shell doing the read to exit.
I'll look into this.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 23:10 ` Herbert Xu
@ 2010-05-28 23:40 ` Herbert Xu
2010-05-29 0:00 ` Kris Maglione
0 siblings, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2010-05-28 23:40 UTC (permalink / raw)
To: Kris Maglione; +Cc: dash
On Sat, May 29, 2010 at 09:10:39AM +1000, Herbert Xu wrote:
> On Fri, May 28, 2010 at 07:01:25PM -0400, Kris Maglione wrote:
> >
> > That's interesting. I've had three users complain about it so far. strace
> > is attached.
>
> Thanks, I have reproduced it here now. It looks like it only
> happens occasionally on my system.
I think this patch should fix it, but please do test it as I
can only rarely reproduce this on my machine.
diff --git a/ChangeLog b/ChangeLog
index 54d2972..5a11a8c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,7 @@
+2010-05-29 Herbert Xu <herbert@gondor.apana.org.au>
+
+ * Continue after EINTR in read(1) with no pending signals.
+
2010-05-27 Jilles Tjoelker <jilles@stack.nl>
* Force fork if any trap is set, not just on EXIT.
diff --git a/src/miscbltin.c b/src/miscbltin.c
index 046f2f2..5ab1648 100644
--- a/src/miscbltin.c
+++ b/src/miscbltin.c
@@ -57,6 +57,7 @@
#include "main.h"
#include "expand.h"
#include "parser.h"
+#include "trap.h"
#undef rflag
@@ -158,9 +159,16 @@ readcmd(int argc, char **argv)
backslash = 0;
STARTSTACKSTR(p);
for (;;) {
- if (read(0, &c, 1) != 1) {
- status = 1;
+ switch (read(0, &c, 1)) {
+ case 1:
break;
+ default:
+ if (errno == EINTR && !pendingsigs)
+ continue;
+ /* fall through */
+ case 0:
+ status = 1;
+ goto out;
}
if (c == '\0')
continue;
@@ -181,6 +189,7 @@ put:
resetbs:
backslash = 0;
}
+out:
STACKSTRNUL(p);
readcmd_handle_line(stackblock(), ap, p + 1 - (char *)stackblock());
return status;
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b,
2010-05-28 23:40 ` Herbert Xu
@ 2010-05-29 0:00 ` Kris Maglione
0 siblings, 0 replies; 9+ messages in thread
From: Kris Maglione @ 2010-05-29 0:00 UTC (permalink / raw)
To: Herbert Xu; +Cc: dash
On Sat, May 29, 2010 at 09:40:49AM +1000, Herbert Xu wrote:
>On Sat, May 29, 2010 at 09:10:39AM +1000, Herbert Xu wrote:
>> On Fri, May 28, 2010 at 07:01:25PM -0400, Kris Maglione wrote:
>> >
>> > That's interesting. I've had three users complain about it so far. strace
>> > is attached.
>>
>> Thanks, I have reproduced it here now. It looks like it only
>> happens occasionally on my system.
>
>I think this patch should fix it, but please do test it as I
>can only rarely reproduce this on my machine.
Yes, this fixes it for me.
Thank you,
--
Kris Maglione
If you don't think carefully, you might think that programming is just
typing statements in a programming language.
--Ward Cunningham
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2010-05-29 0:01 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-28 0:21 Job control bug in revision 3800d4934391b, Kris Maglione
2010-05-28 4:53 ` Herbert Xu
2010-05-28 15:34 ` Kris Maglione
2010-05-28 22:43 ` Herbert Xu
2010-05-28 23:01 ` Kris Maglione
2010-05-28 23:10 ` Herbert Xu
2010-05-28 23:40 ` Herbert Xu
2010-05-29 0:00 ` Kris Maglione
2010-05-28 23:09 ` Jilles Tjoelker
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox