* Job control bug in revision 3800d4934391b,
@ 2010-05-28 0:21 Kris Maglione
2010-05-28 4:53 ` Herbert Xu
2010-05-28 23:09 ` Jilles Tjoelker
0 siblings, 2 replies; 9+ messages in thread
From: Kris Maglione @ 2010-05-28 0:21 UTC (permalink / raw)
To: dash
I'm not sure how to describe this bug, but it's affected one of my
scripts, and those of several of my users. Basically, we've had loops
dieing when backgrounded programs exit. This is the simplest test case
I can come up with:
#!/bin/dash
{
echo foo
sleep 1
echo foo
echo done>/dev/tty
} | while read p; do
( echo good & ) &
done
echo done
In versions prior to 3800d4934391b, the output would
"good\ndone\ndone\ngood" (or some permutation thereof depending on
system load), but from 3800d4934391b on, it's "good\ndone".
The offending revision:
[JOBS] Fix dowait signal race
author Herbert Xu <herbert@gondor.apana.org.au>
Sun, 22 Feb 2009 10:10:01 +0000 (18:10 +0800)
committer Herbert Xu <herbert@gondor.apana.org.au>
Sun, 22 Feb 2009 10:10:01 +0000 (18:10 +0800)
commit 3800d4934391b144fd261a7957aea72ced7d47ea
tree 40c003ab3063ceab7f3615a623a09d3c610332a0
parent 6045fe25078345074f027312d106d3fc19df56e5
[JOBS] Fix dowait signal race
This test program by Alexey Gladkov can cause dash to enter an
infinite loop in waitcmd.
#!/bin/dash
trap "echo TRAP" USR1
stub() {
echo ">>> STUB $1" >&2
sleep $1
echo "<<< STUB $1" >&2
kill -USR1 $$
}
stub 3 &
stub 2 &
until { echo "###"; wait; } do
echo "*** $?"
done
The problem is that if we get a signal after the wait3 system
call has returned but before we get to INTON in dowait, then
we can jump back up to the top and lose the exit status. So
if we then wait for the job that has just exited, then it'll
stay there forever.
I made the original change that caused this bug to fix pretty
much the same bug but in the opposite direction. That is, if
we get a signal after we enter wait3 but before we hit the kernel
then it too can cause the wait to go on forever (assuming the
child doesn't exit).
In fact this is pretty much exactly the scenario that you'll
find in glibc's documentation on pause(). The solution is given
there too, in the form of sigsuspend, which is the only way to
do the check and wait atomically.
So this patch fixes Alexey's race without reintroducing the old
bug by converting the blocking wait3 to a sigsuspend.
In order to do this we need to set a signal handler for SIGCHLD,
so the code has been modified to always do that.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
--
Kris Maglione
If you want to go somewhere, goto is the best way to get there.
--Ken Thompson
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: Job control bug in revision 3800d4934391b, 2010-05-28 0:21 Job control bug in revision 3800d4934391b, Kris Maglione @ 2010-05-28 4:53 ` Herbert Xu 2010-05-28 15:34 ` Kris Maglione 2010-05-28 23:09 ` Jilles Tjoelker 1 sibling, 1 reply; 9+ messages in thread From: Herbert Xu @ 2010-05-28 4:53 UTC (permalink / raw) To: Kris Maglione; +Cc: dash Kris Maglione <maglione.k@gmail.com> wrote: > > I'm not sure how to describe this bug, but it's affected one of my > scripts, and those of several of my users. Basically, we've had loops > dieing when backgrounded programs exit. This is the simplest test case > I can come up with: > > #!/bin/dash > { > echo foo > sleep 1 > echo foo > echo done>/dev/tty > } | while read p; do > ( echo good & ) & > done > echo done > > > In versions prior to 3800d4934391b, the output would > "good\ndone\ndone\ngood" (or some permutation thereof depending on > system load), but from 3800d4934391b on, it's "good\ndone". This should be fixed by the patch that I posted yesterday. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b, 2010-05-28 4:53 ` Herbert Xu @ 2010-05-28 15:34 ` Kris Maglione 2010-05-28 22:43 ` Herbert Xu 0 siblings, 1 reply; 9+ messages in thread From: Kris Maglione @ 2010-05-28 15:34 UTC (permalink / raw) To: Herbert Xu, dash On Fri, May 28, 2010 at 02:53:16PM +1000, Herbert Xu wrote: >Kris Maglione <maglione.k@gmail.com> wrote: >> >> I'm not sure how to describe this bug, but it's affected one of my >> scripts, and those of several of my users. Basically, we've had loops >> dieing when backgrounded programs exit. This is the simplest test case >> I can come up with: >> >> #!/bin/dash >> { >> echo foo >> sleep 1 >> echo foo >> echo done>/dev/tty >> } | while read p; do >> ( echo good & ) & >> done >> echo done >> >> >> In versions prior to 3800d4934391b, the output would >> "good\ndone\ndone\ngood" (or some permutation thereof depending on >> system load), but from 3800d4934391b on, it's "good\ndone". > >This should be fixed by the patch that I posted yesterday. I should have mentioned that I tested it with every revision upto and including that one (especially as it looked promising). I definitely still have this problem as of 207e4c2a322fe: [JOBS] Fix wait regression where it does not wait for all jobs master Thanks, -- Kris Maglione Real Programmers don't believe in schedules. Planners make up schedules. Managers "firm up" schedules. Frightened coders strive to meet schedules. Real Programmers ignore schedules. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b, 2010-05-28 15:34 ` Kris Maglione @ 2010-05-28 22:43 ` Herbert Xu 2010-05-28 23:01 ` Kris Maglione 0 siblings, 1 reply; 9+ messages in thread From: Herbert Xu @ 2010-05-28 22:43 UTC (permalink / raw) To: Kris Maglione; +Cc: dash Kris Maglione <maglione.k@gmail.com> wrote: > > I should have mentioned that I tested it with every revision > upto and including that one (especially as it looked promising). > > I definitely still have this problem as of > 207e4c2a322fe: [JOBS] Fix wait regression where it does not wait for all jobs master Hmm, I can't reproduce this here at all. Can you please run it with strace -f and send the result to us? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b, 2010-05-28 22:43 ` Herbert Xu @ 2010-05-28 23:01 ` Kris Maglione 2010-05-28 23:10 ` Herbert Xu 0 siblings, 1 reply; 9+ messages in thread From: Kris Maglione @ 2010-05-28 23:01 UTC (permalink / raw) To: Herbert Xu; +Cc: dash [-- Attachment #1: Type: text/plain, Size: 741 bytes --] On Sat, May 29, 2010 at 08:43:34AM +1000, Herbert Xu wrote: >Kris Maglione <maglione.k@gmail.com> wrote: >> >> I should have mentioned that I tested it with every revision >> upto and including that one (especially as it looked promising). >> >> I definitely still have this problem as of >> 207e4c2a322fe: [JOBS] Fix wait regression where it does not wait for all jobs master > >Hmm, I can't reproduce this here at all. Can you please run it >with strace -f and send the result to us? That's interesting. I've had three users complain about it so far. strace is attached. -- Kris Maglione Insane people are always sure that they are fine. It is only the sane people who are willing to admit that they are crazy. --Nora Ephron [-- Attachment #2: strace-dash1.log --] [-- Type: text/plain, Size: 11444 bytes --] execve("/home/kris/code/dash.git/src/dash", ["/home/kris/code/dash.git/src/das"...], [/* 97 vars */]) = 0 brk(0) = 0x174d000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b299768c000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=258429, ...}) = 0 mmap(NULL, 258429, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b299768d000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\354\1\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1651953, ...}) = 0 mmap(NULL, 3521384, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b299788e000 mprotect(0x2b29979e0000, 2097152, PROT_NONE) = 0 mmap(0x2b2997be0000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x152000) = 0x2b2997be0000 mmap(0x2b2997be5000, 19304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b2997be5000 close(3) = 0 open("/usr/lib/libedit.so.0", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\266\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=199791, ...}) = 0 mmap(NULL, 2289024, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b2997bea000 mprotect(0x2b2997c12000, 2097152, PROT_NONE) = 0 mmap(0x2b2997e12000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x28000) = 0x2b2997e12000 mmap(0x2b2997e15000, 15744, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b2997e15000 close(3) = 0 open("/lib/libncursesw.so.5", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0B\1\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=394095, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2997e19000 mmap(NULL, 2456904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b2997e1a000 mprotect(0x2b2997e6e000, 2093056, PROT_NONE) = 0 mmap(0x2b299806d000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x53000) = 0x2b299806d000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2998072000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b2998073000 arch_prctl(ARCH_SET_FS, 0x2b2998072700) = 0 mprotect(0x2b2997be0000, 16384, PROT_READ) = 0 mprotect(0x2b299788b000, 4096, PROT_READ) = 0 munmap(0x2b299768d000, 258429) = 0 getpid() = 23948 rt_sigaction(SIGCHLD, {0x41384e, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0 geteuid() = 1000 brk(0) = 0x174d000 brk(0x176e000) = 0x176e000 getppid() = 23947 stat("/home/kris", {st_mode=S_IFDIR|0755, st_size=270336, ...}) = 0 stat(".", {st_mode=S_IFDIR|0755, st_size=270336, ...}) = 0 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff75f09bf0) = -1 ENOTTY (Inappropriate ioctl for device) rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGINT, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0 rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGQUIT, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0 rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGTERM, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, NULL, 8) = 0 read(0, "#!/bin/dash\n{\n\techo foo\n\tsleep 1"..., 8192) = 118 pipe([3, 4]) = 0 clone(Process 23949 attached child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23949 [pid 23948] close(4) = 0 [pid 23948] clone( <unfinished ...> [pid 23949] close(3) = 0 [pid 23949] dup2(4, 1) = 1 Process 23950 attached [pid 23948] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23950 [pid 23949] close(4) = 0 [pid 23948] close(3 <unfinished ...> [pid 23949] write(1, "foo\n", 4 <unfinished ...> [pid 23950] dup2(3, 0 <unfinished ...> [pid 23948] <... close resumed> ) = 0 [pid 23950] <... dup2 resumed> ) = 0 [pid 23948] close(4294967295 <unfinished ...> [pid 23950] close(3 <unfinished ...> [pid 23948] <... close resumed> ) = -1 EBADF (Bad file descriptor) [pid 23950] <... close resumed> ) = 0 [pid 23948] wait4(-1, <unfinished ...> [pid 23949] <... write resumed> ) = 4 [pid 23950] read(0, <unfinished ...> [pid 23949] stat("./sleep", <unfinished ...> [pid 23950] <... read resumed> "f", 1) = 1 [pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory) [pid 23950] read(0, <unfinished ...> [pid 23949] stat("/home/kris/bin/sleep", <unfinished ...> [pid 23950] <... read resumed> "o", 1) = 1 [pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory) [pid 23950] read(0, <unfinished ...> [pid 23949] stat("/home/kris/wmiiinst/bin/sleep", <unfinished ...> [pid 23950] <... read resumed> "o", 1) = 1 [pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory) [pid 23950] read(0, <unfinished ...> [pid 23949] stat("/home/kris/.gem/ruby/1.8/bin/sleep", <unfinished ...> [pid 23950] <... read resumed> "\n", 1) = 1 [pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory) [pid 23949] stat("/opt/wine/bin/sleep", <unfinished ...> [pid 23950] clone( <unfinished ...> [pid 23949] <... stat resumed> 0x7fff75f098e0) = -1 ENOENT (No such file or directory) [pid 23949] stat("/bin/sleep", {st_mode=S_IFREG|0755, st_size=28448, ...}) = 0 Process 23951 attached [pid 23950] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23951 [pid 23949] clone( <unfinished ...> [pid 23951] rt_sigaction(SIGINT, {SIG_IGN, [INT], SA_RESTORER|SA_RESTART, 0x2b29978c04e0}, Process 23952 attached (waiting for parent) {SIG_DFL, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, 8) = 0 [pid 23950] read(0, Process 23952 resumed (parent 23949 ready) <unfinished ...> [pid 23949] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23952 [pid 23952] execve("/bin/sleep", ["sleep", "1"], [/* 97 vars */] <unfinished ...> [pid 23951] rt_sigaction(SIGQUIT, {SIG_IGN, [QUIT], SA_RESTORER|SA_RESTART, 0x2b29978c04e0}, <unfinished ...> [pid 23949] wait4(-1, <unfinished ...> [pid 23951] <... rt_sigaction resumed> {SIG_DFL, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x2b29978c04e0}, 8) = 0 [pid 23951] close(0) = 0 [pid 23951] open("/dev/null", O_RDONLY) = 0 [pid 23951] clone( <unfinished ...> [pid 23952] <... execve resumed> ) = 0 Process 23953 attached (waiting for parent) Process 23953 resumed (parent 23951 ready) [pid 23951] <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b29980729d0) = 23953 [pid 23953] close(0 <unfinished ...> [pid 23952] brk(0 <unfinished ...> [pid 23953] <... close resumed> ) = 0 [pid 23952] <... brk resumed> ) = 0x78f000 [pid 23953] open("/dev/null", O_RDONLY <unfinished ...> [pid 23951] exit_group(0) = ? Process 23951 detached [pid 23953] <... open resumed> ) = 0 [pid 23952] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...> [pid 23950] <... read resumed> 0x7fff75f098b7, 1) = ? ERESTARTSYS (To be restarted) [pid 23952] <... mmap resumed> ) = 0x2b185172f000 [pid 23950] --- SIGCHLD (Child exited) @ 0 (0) --- [pid 23952] access("/etc/ld.so.preload", R_OK <unfinished ...> [pid 23950] rt_sigreturn(0x11 <unfinished ...> [pid 23952] <... access resumed> ) = -1 ENOENT (No such file or directory) [pid 23950] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call) [pid 23953] write(1, "good\n", 5 <unfinished ...> [pid 23952] open("/etc/ld.so.cache", O_RDONLY <unfinished ...> [pid 23950] exit_group(0) = ? Process 23950 detached [pid 23953] <... write resumed> ) = 5 [pid 23952] <... open resumed> ) = 3 [pid 23953] exit_group(0) = ? Process 23953 detached [pid 23948] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23950 [pid 23948] --- SIGCHLD (Child exited) @ 0 (0) --- [pid 23952] fstat(3, <unfinished ...> [pid 23948] rt_sigreturn(0x11 <unfinished ...> [pid 23952] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=258429, ...}) = 0 [pid 23948] <... rt_sigreturn resumed> ) = 23950 [pid 23952] mmap(NULL, 258429, PROT_READ, MAP_PRIVATE, 3, 0 <unfinished ...> [pid 23948] wait4(-1, <unfinished ...> [pid 23952] <... mmap resumed> ) = 0x2b1851730000 [pid 23952] close(3) = 0 [pid 23952] open("/lib/libc.so.6", O_RDONLY) = 3 [pid 23952] read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\354\1\0\0\0\0\0"..., 832) = 832 [pid 23952] fstat(3, {st_mode=S_IFREG|0755, st_size=1651953, ...}) = 0 [pid 23952] mmap(NULL, 3521384, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b1851931000 [pid 23952] mprotect(0x2b1851a83000, 2097152, PROT_NONE) = 0 [pid 23952] mmap(0x2b1851c83000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x152000) = 0x2b1851c83000 [pid 23952] mmap(0x2b1851c88000, 19304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b1851c88000 [pid 23952] close(3) = 0 [pid 23952] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b1851c8d000 [pid 23952] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b1851c8e000 [pid 23952] arch_prctl(ARCH_SET_FS, 0x2b1851c8db20) = 0 [pid 23952] mprotect(0x2b1851c83000, 16384, PROT_READ) = 0 [pid 23952] mprotect(0x2b185192e000, 4096, PROT_READ) = 0 [pid 23952] munmap(0x2b1851730000, 258429) = 0 [pid 23952] brk(0) = 0x78f000 [pid 23952] brk(0x7b0000) = 0x7b0000 [pid 23952] open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 [pid 23952] fstat(3, {st_mode=S_IFREG|0644, st_size=3196096, ...}) = 0 [pid 23952] mmap(NULL, 3196096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b1851c8f000 [pid 23952] close(3) = 0 [pid 23952] nanosleep({1, 0}, NULL) = 0 [pid 23952] close(1) = 0 [pid 23952] close(2) = 0 [pid 23952] exit_group(0) = ? Process 23952 detached [pid 23949] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23952 [pid 23949] --- SIGCHLD (Child exited) @ 0 (0) --- [pid 23949] rt_sigreturn(0x11) = 23952 [pid 23949] write(1, "foo\n", 4) = -1 EPIPE (Broken pipe) [pid 23949] --- SIGPIPE (Broken pipe) @ 0 (0) --- Process 23949 detached <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL) = 23949 --- SIGCHLD (Child exited) @ 0 (0) --- rt_sigreturn(0x11) = 23949 write(1, "done\n", 5) = 5 read(0, "", 8192) = 0 exit_group(0) = ? ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b, 2010-05-28 23:01 ` Kris Maglione @ 2010-05-28 23:10 ` Herbert Xu 2010-05-28 23:40 ` Herbert Xu 0 siblings, 1 reply; 9+ messages in thread From: Herbert Xu @ 2010-05-28 23:10 UTC (permalink / raw) To: Kris Maglione; +Cc: dash On Fri, May 28, 2010 at 07:01:25PM -0400, Kris Maglione wrote: > > That's interesting. I've had three users complain about it so far. strace > is attached. Thanks, I have reproduced it here now. It looks like it only happens occasionally on my system. > [pid 23950] <... read resumed> 0x7fff75f098b7, 1) = ? ERESTARTSYS (To be restarted) > [pid 23952] <... mmap resumed> ) = 0x2b185172f000 > [pid 23950] --- SIGCHLD (Child exited) @ 0 (0) --- > [pid 23952] access("/etc/ld.so.preload", R_OK <unfinished ...> > [pid 23950] rt_sigreturn(0x11 <unfinished ...> > [pid 23952] <... access resumed> ) = -1 ENOENT (No such file or directory) > [pid 23950] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call) > [pid 23953] write(1, "good\n", 5 <unfinished ...> > [pid 23952] open("/etc/ld.so.cache", O_RDONLY <unfinished ...> > [pid 23950] exit_group(0) = ? So the crux of the issue is that SIGCHLD is causing read to fail and the shell doing the read to exit. I'll look into this. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b, 2010-05-28 23:10 ` Herbert Xu @ 2010-05-28 23:40 ` Herbert Xu 2010-05-29 0:00 ` Kris Maglione 0 siblings, 1 reply; 9+ messages in thread From: Herbert Xu @ 2010-05-28 23:40 UTC (permalink / raw) To: Kris Maglione; +Cc: dash On Sat, May 29, 2010 at 09:10:39AM +1000, Herbert Xu wrote: > On Fri, May 28, 2010 at 07:01:25PM -0400, Kris Maglione wrote: > > > > That's interesting. I've had three users complain about it so far. strace > > is attached. > > Thanks, I have reproduced it here now. It looks like it only > happens occasionally on my system. I think this patch should fix it, but please do test it as I can only rarely reproduce this on my machine. diff --git a/ChangeLog b/ChangeLog index 54d2972..5a11a8c 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,7 @@ +2010-05-29 Herbert Xu <herbert@gondor.apana.org.au> + + * Continue after EINTR in read(1) with no pending signals. + 2010-05-27 Jilles Tjoelker <jilles@stack.nl> * Force fork if any trap is set, not just on EXIT. diff --git a/src/miscbltin.c b/src/miscbltin.c index 046f2f2..5ab1648 100644 --- a/src/miscbltin.c +++ b/src/miscbltin.c @@ -57,6 +57,7 @@ #include "main.h" #include "expand.h" #include "parser.h" +#include "trap.h" #undef rflag @@ -158,9 +159,16 @@ readcmd(int argc, char **argv) backslash = 0; STARTSTACKSTR(p); for (;;) { - if (read(0, &c, 1) != 1) { - status = 1; + switch (read(0, &c, 1)) { + case 1: break; + default: + if (errno == EINTR && !pendingsigs) + continue; + /* fall through */ + case 0: + status = 1; + goto out; } if (c == '\0') continue; @@ -181,6 +189,7 @@ put: resetbs: backslash = 0; } +out: STACKSTRNUL(p); readcmd_handle_line(stackblock(), ap, p + 1 - (char *)stackblock()); return status; Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b, 2010-05-28 23:40 ` Herbert Xu @ 2010-05-29 0:00 ` Kris Maglione 0 siblings, 0 replies; 9+ messages in thread From: Kris Maglione @ 2010-05-29 0:00 UTC (permalink / raw) To: Herbert Xu; +Cc: dash On Sat, May 29, 2010 at 09:40:49AM +1000, Herbert Xu wrote: >On Sat, May 29, 2010 at 09:10:39AM +1000, Herbert Xu wrote: >> On Fri, May 28, 2010 at 07:01:25PM -0400, Kris Maglione wrote: >> > >> > That's interesting. I've had three users complain about it so far. strace >> > is attached. >> >> Thanks, I have reproduced it here now. It looks like it only >> happens occasionally on my system. > >I think this patch should fix it, but please do test it as I >can only rarely reproduce this on my machine. Yes, this fixes it for me. Thank you, -- Kris Maglione If you don't think carefully, you might think that programming is just typing statements in a programming language. --Ward Cunningham ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Job control bug in revision 3800d4934391b, 2010-05-28 0:21 Job control bug in revision 3800d4934391b, Kris Maglione 2010-05-28 4:53 ` Herbert Xu @ 2010-05-28 23:09 ` Jilles Tjoelker 1 sibling, 0 replies; 9+ messages in thread From: Jilles Tjoelker @ 2010-05-28 23:09 UTC (permalink / raw) To: Kris Maglione; +Cc: dash On Thu, May 27, 2010 at 08:21:57PM -0400, Kris Maglione wrote: > I'm not sure how to describe this bug, but it's affected one of my > scripts, and those of several of my users. Basically, we've had loops > dieing when backgrounded programs exit. This is the simplest test case > I can come up with: > #!/bin/dash > { > echo foo > sleep 1 > echo foo > echo done>/dev/tty > } | while read p; do > ( echo good & ) & > done > echo done > In versions prior to 3800d4934391b, the output would > "good\ndone\ndone\ngood" (or some permutation thereof depending on > system load), but from 3800d4934391b on, it's "good\ndone". > The offending revision: > [JOBS] Fix dowait signal race I can reproduce this here. It happens because the child process exit's SIGCHLD causes an EINTR on the read, which then fails. While the above testcase works with FreeBSD sh, a similar script with a trap on CHLD fails in both (and likely also in older dash): #!/bin/dash { echo foo sleep 1 echo foo echo done>/dev/tty } | { trap : CHLD; while read p; do ( echo good & ) & done } echo done Most other shells do not seem to abort a read builtin for a trap. bash and ksh93 execute the trap right away and then continue the read, while mksh postpones it until the read is done. POSIX seems vague about it, but a look at existing scripts suggests that traps should not abort a read. Executing the trap right away is useful in case it does something such as exiting, and should be safe given that traps are already invoked from deeper levels of recursion of the interpreter. If the trap executes a break, continue or return command outside its boundaries this should abort the read. -- Jilles Tjoelker ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2010-05-29 0:01 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-05-28 0:21 Job control bug in revision 3800d4934391b, Kris Maglione 2010-05-28 4:53 ` Herbert Xu 2010-05-28 15:34 ` Kris Maglione 2010-05-28 22:43 ` Herbert Xu 2010-05-28 23:01 ` Kris Maglione 2010-05-28 23:10 ` Herbert Xu 2010-05-28 23:40 ` Herbert Xu 2010-05-29 0:00 ` Kris Maglione 2010-05-28 23:09 ` Jilles Tjoelker
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox