* Re: Question regarding ptrace work for LInux v3.1
       [not found] <CALJO4zGaZBzCEHsD4oan=nhpQasmxWiN535RLM+2bXngcabQmA@mail.gmail.com>
From: Oleg Nesterov @ 2016-03-21 17:47 UTC
To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel

Hello Patrick,

On 03/18, Patrick Donnelly wrote:
>
> We are currently trying to debug a problem with ptrace that I believe
> was incidentally fixed by you and maybe Tejun Heo in Linux v3.1.

So let me add Tejun and lkml,

> The
> issue is on github [2] but I will describe it here briefly. My hope is
> that you may remember fixing this and a patch may be made for v3.0.
> [An HPC center is using Linux v3.0 which exhibits this.]

Heh, sorry, I can't recall anything related ;)

> The basic problem is that the application we are tracing spawns
> threads which **sometimes** are not traced (or lost). For Linux v3.0,
> we are using PTRACE_ATTACH and the PTRACE_O_TRACE(CLONE|FORK|VFORK)
> options to follow children [3]. The problem we see is that we will
> receive a PTRACE_EVENT_CLONE event that a thread is created but we
> receive no other events for the thread.

IOW, the new thread does not report the SIGSTOP injected by implicit attach?

> What's worse is that the
> thread eventually "comes back" via a PTRACE_EVENT_CLONE when it clones
> its own thread.

OK, so at least the new child is traced too, and it also has PT_TRACE_*
flags copied from its parent.

> Do you recall fixing anything for v3.1 that might cause this problem?

No. I do not see how the new tracee can miss that SIGSTOP/TIF_SIGPENDING.

To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
For example, SIGCONT from somewhere can remove the pending (and not yet
reported) SIGSTOP, and this _can_ explain the problem you hit.

But unless you use PTRACE_SEIZE the same can happen on v3.1, so it seems
there is something else.

It would be nice to have a test-case :/

Oleg.

> [1] http://ccl.cse.nd.edu/software/parrot/
> [2] https://github.com/cooperative-computing-lab/cctools/issues/1207
> [3] https://github.com/cooperative-computing-lab/cctools/blob/f82288167b1b5abb836b1d9b8135c98f71ed90f6/parrot/src/tracer.c#L91-L128
>
> --
> Patrick Donnelly
* Re: Question regarding ptrace work for LInux v3.1
From: Patrick Donnelly @ 2016-03-21 18:28 UTC
To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel

On Mon, Mar 21, 2016 at 1:47 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> The basic problem is that the application we are tracing spawns
>> threads which **sometimes** are not traced (or lost). For Linux v3.0,
>> we are using PTRACE_ATTACH and the PTRACE_O_TRACE(CLONE|FORK|VFORK)
>> options to follow children [3]. The problem we see is that we will
>> receive a PTRACE_EVENT_CLONE event that a thread is created but we
>> receive no other events for the thread.
>
> IOW, the new thread does not report the SIGSTOP injected by implicit attach?

It reports neither the SIGSTOP nor the system calls leading up to the
PTRACE_EVENT_CLONE (not even the entry into the clone syscall).

>> What's worse is that the
>> thread eventually "comes back" via a PTRACE_EVENT_CLONE when it clones
>> its own thread.
>
> OK, so at least the new child is traced too, and it also has PT_TRACE_*
> flags copied from its parent.

That seems to be the case but it will only report certain events (not
syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
events... Hmm, now that I think about this, it would be necessary to
see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
problem.

> To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
> For example, SIGCONT from somewhere can remove the pending (and not yet
> reported) SIGSTOP, and this _can_ explain the problem you hit.

The tree of processes being traced does not send any signals but an
external process may have. However, I did notice the use of futexes
near these clones. Perhaps that may be causing this?

> But unless you use PTRACE_SEIZE the same can happen on v3.1, so it seems
> there is something else.

Okay, it might be that PTRACE_SEIZE fixes it.

> It would be nice to have a test-case :/

Unfortunately, I have not yet been able to isolate a test case.

Thanks for your help!

--
Patrick Donnelly
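The distinction discussed above (a PTRACE_EVENT_CLONE stop versus the new
thread's initial SIGSTOP) can be illustrated with a small status decoder.
This is only a sketch under stated assumptions, not Parrot's code:
`enum stop_kind` and `classify_stop` are hypothetical names, and the status
value is assumed to come from waitpid() on a tracee attached with
PTRACE_O_TRACECLONE. The event check follows the form documented in
ptrace(2).

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical classification of a waitpid() status for a tracee. */
enum stop_kind {
	STOP_NONE,        /* not a stop at all (exited, killed, ...) */
	STOP_CLONE_EVENT, /* parent reported PTRACE_EVENT_CLONE */
	STOP_SIGSTOP,     /* SIGSTOP stop, e.g. the new thread's first stop */
	STOP_OTHER        /* any other stop */
};

enum stop_kind classify_stop(int status)
{
	if (!WIFSTOPPED(status))
		return STOP_NONE;

	/* ptrace(2): event stops are reported as SIGTRAP with the event
	 * code in the next-higher byte of the status word. */
	if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_CLONE << 8)))
		return STOP_CLONE_EVENT;

	/* This is the stop that may never be reported if a SIGCONT
	 * removes the pending SIGSTOP first. */
	if (WSTOPSIG(status) == SIGSTOP)
		return STOP_SIGSTOP;

	return STOP_OTHER;
}
```

A tracer that sees STOP_CLONE_EVENT for the parent but never STOP_SIGSTOP
for the new thread has no stop at which to issue PTRACE_SYSCALL for it,
which matches the symptom described in this message.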
* Re: Question regarding ptrace work for LInux v3.1
From: Oleg Nesterov @ 2016-03-21 19:07 UTC
To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel

On 03/21, Patrick Donnelly wrote:
>
> That seems to be the case but it will only report certain events (not
> syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
> events... Hmm, now that I think about this, it would be necessary to
> see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
> syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
> problem.

Yes, exactly, you need to see the initial SIGSTOP or another event which
can be reported before it.

> > To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
> > For example, SIGCONT from somewhere can remove the pending (and not yet
> > reported) SIGSTOP, and this _can_ explain the problem you hit.
>
> The tree of processes being traced does not send any signals but an
> external process may have.

I am looking into

	https://github.com/cooperative-computing-lab/cctools/blob/5ccb04599ba2ee125730981f53add80d98cf8161/parrot/src/pfs_main.cc

and this code

	case SIGSTOP:
		/* Black magic to get threads working on old Linux kernels... */

		if(p->nsyscalls == 0) { /* stop before we begin running the process */
			debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
			signum = 0; /* suppress delivery */
			kill(p->pid,SIGCONT);
		}
		break;

doesn't look right. Note that kill(pid,SIGCONT) affects the whole
thread-group. So if this kill() races with another thread doing clone()
you can hit the problem you described.

> However, I did notice the use of futexes
> near these clones. Perhaps that may be causing this?

I don't think so,

> > But unless you use PTRACE_SEIZE the same can happen on v3.1, so it seems
> > there is something else.
>
> Okay, it might be that PTRACE_SEIZE fixes it.

Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?

Oleg.
* Re: Question regarding ptrace work for LInux v3.1
From: Patrick Donnelly @ 2016-03-21 19:24 UTC
To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel

On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/21, Patrick Donnelly wrote:
>>
>> That seems to be the case but it will only report certain events (not
>> syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
>> events... Hmm, now that I think about this, it would be necessary to
>> see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
>> syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
>> problem.
>
> Yes, exactly, you need to see the initial SIGSTOP or another event which
> can be reported before it.

Assuming a SIGSTOP is being silenced, is there anything we can do to
forcibly start tracing syscalls? (For kernels without PTRACE_SEIZE)

>> > To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
>> > For example, SIGCONT from somewhere can remove the pending (and not yet
>> > reported) SIGSTOP, and this _can_ explain the problem you hit.
>>
>> The tree of processes being traced does not send any signals but an
>> external process may have.
>
> I am looking into
>
>         https://github.com/cooperative-computing-lab/cctools/blob/5ccb04599ba2ee125730981f53add80d98cf8161/parrot/src/pfs_main.cc
>
> and this code
>
>         case SIGSTOP:
>                 /* Black magic to get threads working on old Linux kernels... */
>
>                 if(p->nsyscalls == 0) { /* stop before we begin running the process */
>                         debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
>                         signum = 0; /* suppress delivery */
>                         kill(p->pid,SIGCONT);
>                 }
>                 break;
>
> doesn't look right. Note that kill(pid,SIGCONT) affects the whole
> thread-group. So if this kill() races with another thread doing clone()
> you can hit the problem you described.

You're right, that should be tkill! I will give that a try and report
back if that solved the issue for our collaborators...

>> > But unless you use PTRACE_SEIZE the same can happen on v3.1, so it seems
>> > there is something else.
>>
>> Okay, it might be that PTRACE_SEIZE fixes it.
>
> Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?

I have not tested on >v3.1 with PTRACE_ATTACH. As you know, v3.1 was
when the PTRACE_SEIZE code was merged along with many other changes.
[I actually thought the merge occurred in 3.4 because of the ptrace
man page. I have submitted a bug report to get that fixed.] I have not
had any reports of the problem with Linux versions after and including
v3.1.

Again, I will see if the kill system call was the cause and report back if so.

Thanks for taking the time to look at the code!

--
Patrick Donnelly
* Re: Question regarding ptrace work for LInux v3.1
From: Oleg Nesterov @ 2016-03-21 19:35 UTC
To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel

On 03/21, Patrick Donnelly wrote:
>
> On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, exactly, you need to see the initial SIGSTOP or another event which
> > can be reported before it.
>
> Assuming a SIGSTOP is being silenced, is there anything we can do to
> forcibly start tracing syscalls? (For kernels without PTRACE_SEIZE)

No. Only PTRACE_SYSCALL can set TIF_SYSCALL_TRACE.

> >         case SIGSTOP:
> >                 /* Black magic to get threads working on old Linux kernels... */
> >
> >                 if(p->nsyscalls == 0) { /* stop before we begin running the process */
> >                         debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
> >                         signum = 0; /* suppress delivery */
> >                         kill(p->pid,SIGCONT);
> >                 }
> >                 break;
> >
> > doesn't look right. Note that kill(pid,SIGCONT) affects the whole
> > thread-group. So if this kill() races with another thread doing clone()
> > you can hit the problem you described.
>
> You're right, that should be tkill! I will give that a try and report
> back if that solved the issue for our collaborators...

Ah, sorry, I should have mentioned this...

No, tkill() won't help. See prepare_signal(): SIGCONT always removes
the SIG_KERNEL_STOP_MASK signals from all threads, no matter if it was
sent by tkill() or kill().

Perhaps you should just remove this kill(SIGCONT)?

tracer_continue(signr => 0) should equally suppress the delivery. To
clarify, this won't be right either, but without PTRACE_SEIZE you simply
can't write code that handles the stop/cont/etc events correctly
anyway...

> >> > But unless you use PTRACE_SEIZE the same can happen on v3.1, so it seems
> >> > there is something else.
> >>
> >> Okay, it might be that PTRACE_SEIZE fixes it.
> >
> > Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?
>
> I have not tested on >v3.1 with PTRACE_ATTACH.

OK, thanks. So perhaps this is not v3.0-specific.

Oleg.
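The suggestion in this message (drop the kill(SIGCONT) and simply resume
the tracee with signal 0) might look roughly like the sketch below. It is
an illustration under assumptions, not the actual Parrot change:
`signal_to_inject` and `resume_tracee` are hypothetical names, and
`nsyscalls` stands in for the `p->nsyscalls` counter in the quoted snippet.
As noted above, even this is not fully correct without PTRACE_SEIZE.

```c
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Hypothetical helper: pick the signal to inject when resuming a tracee
 * that stopped with `stop_sig`. The bootstrap SIGSTOP (seen before any
 * syscall has been traced) is suppressed by returning 0; every other
 * signal is passed through unchanged. */
int signal_to_inject(int stop_sig, unsigned long nsyscalls)
{
	if (stop_sig == SIGSTOP && nsyscalls == 0)
		return 0;
	return stop_sig;
}

/* Hypothetical helper: resume the tracee until its next syscall stop.
 * Passing sig == 0 in the data argument suppresses delivery, so no
 * SIGCONT has to be sent to the thread group. */
long resume_tracee(pid_t pid, int sig)
{
	return ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(intptr_t)sig);
}
```

The point of the design is that nothing is sent to the thread group at
all: the SIGSTOP is swallowed at the ptrace level, so a pending and not
yet reported SIGSTOP of a freshly cloned sibling thread is not removed.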
* Re: Question regarding ptrace work for LInux v3.1
From: Patrick Donnelly @ 2016-03-23 14:12 UTC
To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel

On Mon, Mar 21, 2016 at 3:35 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/21, Patrick Donnelly wrote:
>> On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >         case SIGSTOP:
>> >                 /* Black magic to get threads working on old Linux kernels... */
>> >
>> >                 if(p->nsyscalls == 0) { /* stop before we begin running the process */
>> >                         debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
>> >                         signum = 0; /* suppress delivery */
>> >                         kill(p->pid,SIGCONT);
>> >                 }
>> >                 break;
>> >
>> > doesn't look right. Note that kill(pid,SIGCONT) affects the whole
>> > thread-group. So if this kill() races with another thread doing clone()
>> > you can hit the problem you described.
>>
>> You're right, that should be tkill! I will give that a try and report
>> back if that solved the issue for our collaborators...
>
> Ah, sorry, I should have mentioned this...
>
> No, tkill() won't help. See prepare_signal(): SIGCONT always removes
> the SIG_KERNEL_STOP_MASK signals from all threads, no matter if it was
> sent by tkill() or kill().
>
> Perhaps you should just remove this kill(SIGCONT)?
>
> tracer_continue(signr => 0) should equally suppress the delivery. To
> clarify, this won't be right either, but without PTRACE_SEIZE you simply
> can't write code that handles the stop/cont/etc events correctly
> anyway...

Thanks so much Oleg. Indeed this was the problem.

--
Patrick Donnelly
Thread overview: 6+ messages
     [not found] <CALJO4zGaZBzCEHsD4oan=nhpQasmxWiN535RLM+2bXngcabQmA@mail.gmail.com>
2016-03-21 17:47 ` Question regarding ptrace work for LInux v3.1 Oleg Nesterov
2016-03-21 18:28   ` Patrick Donnelly
2016-03-21 19:07     ` Oleg Nesterov
2016-03-21 19:24       ` Patrick Donnelly
2016-03-21 19:35         ` Oleg Nesterov
2016-03-23 14:12           ` Patrick Donnelly