* Re: Question regarding ptrace work for LInux v3.1
From: Oleg Nesterov @ 2016-03-21 17:47 UTC (permalink / raw)
To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel
Hello Patrick,
On 03/18, Patrick Donnelly wrote:
>
> We are currently trying to debug a problem with ptrace that I believe
> was incidentally fixed by you and maybe Tejun Heo in Linux v3.1.
So let me add Tejun and lkml,
> The
> issue is on github [2] but I will describe it here briefly. My hope is
> that you may remember fixing this and a patch may be made for v3.0.
> [An HPC center is using Linux v3.0 which exhibits this.]
Heh, sorry, I can't recall anything related ;)
> The basic problem is that the application we are tracing spawns
> threads which **sometimes** are not traced (or lost). For Linux v3.0,
> we are using PTRACE_ATTACH and the PTRACE_O_TRACE(CLONE|FORK|VFORK)
> options to follow children [3]. The problem we see is that we will
> receive a PTRACE_EVENT_CLONE event that a thread is created but we
> receive no other events for the thread.
IOW, the new thread does not report the SIGSTOP injected by the implicit attach?
> What's worse is that the
> thread eventually "comes back" via a PTRACE_EVENT_CLONE when it clones
> its own thread.
OK, so at least the new child is traced too, and it also has PT_TRACE_*
flags copied from its parent.
> Do you recall fixing anything for v3.1 that might cause this problem?
No. I do not see how the new tracee can miss that SIGSTOP/TIF_SIGPENDING.
To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
For example, SIGCONT from somewhere can remove the pending (and not yet
reported) SIGSTOP, and this _can_ explain the problem you hit.
But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
there is something else.
It would be nice to have a test-case :/
Oleg.
> [1] http://ccl.cse.nd.edu/software/parrot/
> [2] https://github.com/cooperative-computing-lab/cctools/issues/1207
> [3] https://github.com/cooperative-computing-lab/cctools/blob/f82288167b1b5abb836b1d9b8135c98f71ed90f6/parrot/src/tracer.c#L91-L128
>
> --
> Patrick Donnelly
* Re: Question regarding ptrace work for LInux v3.1
From: Patrick Donnelly @ 2016-03-21 18:28 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel
On Mon, Mar 21, 2016 at 1:47 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> The basic problem is that the application we are tracing spawns
>> threads which **sometimes** are not traced (or lost). For Linux v3.0,
>> we are using PTRACE_ATTACH and the PTRACE_O_TRACE(CLONE|FORK|VFORK)
>> options to follow children [3]. The problem we see is that we will
>> receive a PTRACE_EVENT_CLONE event that a thread is created but we
>> receive no other events for the thread.
>
> IOW, the new thread does not report the SIGSTOP injected by the implicit attach?
It does not report the SIGSTOP nor the system calls leading up to the
PTRACE_EVENT_CLONE (not even the entry into the clone syscall).
>> What's worse is that the
>> thread eventually "comes back" via a PTRACE_EVENT_CLONE when it clones
>> its own thread.
>
> OK, so at least the new child is traced too, and it also has PT_TRACE_*
> flags copied from its parent.
That seems to be the case but it will only report certain events (not
syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
events... Hmm, now that I think about this, it would be necessary to
see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
problem.
> To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
> For example, SIGCONT from somewhere can remove the pending (and not yet
> reported) SIGSTOP, and this _can_ explain the problem you hit.
The tree of processes being traced does not send any signals, but an
external process may have. However, I did notice the use of futexes
near these clones. Perhaps that may be causing this?
> But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
> there is something else.
Okay, it might be that PTRACE_SEIZE fixes it.
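For reference, a minimal sketch of what attaching with PTRACE_SEIZE instead of PTRACE_ATTACH could look like, keeping the same follow-children options mentioned earlier in the thread. The helper names `follow_options` and `seize_tracee` are hypothetical and are not taken from the Parrot sources; the fallback request number is only used if the installed headers do not provide PTRACE_SEIZE.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* Fallback for older headers; 0x4206 is the kernel's request number. */
#ifndef PTRACE_SEIZE
#define PTRACE_SEIZE 0x4206
#endif

/* Options that make the kernel auto-attach new children and threads
 * (hypothetical helper, mirrors the options named in the thread). */
long follow_options(void)
{
	return PTRACE_O_TRACECLONE | PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK;
}

/* Attach without injecting SIGSTOP. PTRACE_SEIZE takes the option mask
 * in the data argument, so no separate PTRACE_SETOPTIONS call is needed.
 * Returns 0 on success, -1 with errno set on failure. */
long seize_tracee(pid_t pid)
{
	return ptrace(PTRACE_SEIZE, pid, NULL, (void *)follow_options());
}
```

A seized tracee is not stopped by the attach itself, and auto-attached children report PTRACE_EVENT_STOP rather than a SIGSTOP, so a stray SIGCONT has no pending SIGSTOP to discard.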
> It would be nice to have a test-case :/
Unfortunately, I have not yet been able to isolate a test case.
Thanks for your help!
--
Patrick Donnelly
* Re: Question regarding ptrace work for LInux v3.1
From: Oleg Nesterov @ 2016-03-21 19:07 UTC (permalink / raw)
To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel
On 03/21, Patrick Donnelly wrote:
>
> That seems to be the case but it will only report certain events (not
> syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
> events... Hmm, now that I think about this, it would be necessary to
> see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
> syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
> problem.
Yes, exactly, you need to see the initial SIGSTOP or another event which
can be reported before it.
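A minimal sketch of how a tracer might recognise that first stop from a waitpid() status, assuming the usual Linux status encoding (ptrace event in the bits above the low 16). The enum and the function names `stop_event` and `classify_stop` are hypothetical illustrations, not Parrot code.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Fallback for older headers; 128 is the kernel's value for this event. */
#ifndef PTRACE_EVENT_STOP
#define PTRACE_EVENT_STOP 128
#endif

/* Hypothetical classification of a tracee stop. */
enum stop_kind {
	STOP_NONE,    /* status does not describe a stop */
	STOP_INITIAL, /* first stop of a new tracee (SIGSTOP or PTRACE_EVENT_STOP) */
	STOP_EVENT,   /* PTRACE_EVENT_* stop such as clone, fork or exit */
	STOP_SIGNAL   /* ordinary signal-delivery stop */
};

/* Return the ptrace event encoded in a waitpid() status, or 0 if none. */
int stop_event(int status)
{
	return (int)((unsigned int)status >> 16);
}

/* seen_first_stop is the tracer's own per-tracee flag: zero until any
 * stop has been observed for that tracee. */
enum stop_kind classify_stop(int status, int seen_first_stop)
{
	if (!WIFSTOPPED(status))
		return STOP_NONE;

	int event = stop_event(status);

	if (!seen_first_stop) {
		/* PTRACE_SEIZE: auto-attached children report this event. */
		if (event == PTRACE_EVENT_STOP)
			return STOP_INITIAL;
		/* PTRACE_ATTACH: the implicit attach injects SIGSTOP. */
		if (event == 0 && WSTOPSIG(status) == SIGSTOP)
			return STOP_INITIAL;
	}
	return event != 0 ? STOP_EVENT : STOP_SIGNAL;
}
```

Whatever stop arrives first, the tracer has to resume the tracee with PTRACE_SYSCALL from it; if that stop is never reported, syscall tracing never starts for the thread.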
> > To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
> > For example, SIGCONT from somewhere can remove the pending (and not yet
> > reported) SIGSTOP, and this _can_ explain the problem you hit.
>
> The tree of processes being traced does not send any signals, but an
> external process may have.
I am looking into
https://github.com/cooperative-computing-lab/cctools/blob/5ccb04599ba2ee125730981f53add80d98cf8161/parrot/src/pfs_main.cc
and this code
	case SIGSTOP:
		/* Black magic to get threads working on old Linux kernels... */
		if(p->nsyscalls == 0) { /* stop before we begin running the process */
			debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
			signum = 0; /* suppress delivery */
			kill(p->pid,SIGCONT);
		}
		break;
doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
group. So if this kill() races with another thread doing clone() you can
hit the problem you described.
> However, I did notice the use of futexes
> near these clones. Perhaps that may be causing this?
I don't think so.
> > But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
> > there is something else.
>
> Okay, it might be that PTRACE_SEIZE fixes it.
Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?
Oleg.
* Re: Question regarding ptrace work for LInux v3.1
From: Patrick Donnelly @ 2016-03-21 19:24 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel
On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/21, Patrick Donnelly wrote:
>>
>> That seems to be the case but it will only report certain events (not
>> syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
>> events... Hmm, now that I think about this, it would be necessary to
>> see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
>> syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
>> problem.
>
> Yes, exactly, you need to see the initial SIGSTOP or another event which
> can be reported before it.
Assuming a SIGSTOP is being silenced, is there anything we can do to
forcibly start tracing syscalls? (For kernels without PTRACE_SEIZE)
>> > To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
>> > For example, SIGCONT from somewhere can remove the pending (and not yet
>> > reported) SIGSTOP, and this _can_ explain the problem you hit.
>>
>> The tree of processes being traced does not send any signals, but an
>> external process may have.
>
> I am looking into
>
> https://github.com/cooperative-computing-lab/cctools/blob/5ccb04599ba2ee125730981f53add80d98cf8161/parrot/src/pfs_main.cc
>
> and this code
>
> 	case SIGSTOP:
> 		/* Black magic to get threads working on old Linux kernels... */
>
> 		if(p->nsyscalls == 0) { /* stop before we begin running the process */
> 			debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
> 			signum = 0; /* suppress delivery */
> 			kill(p->pid,SIGCONT);
> 		}
> 		break;
>
> doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
> group. So if this kill() races with another thread doing clone() you can
> hit the problem you described.
You're right, that should be tkill! I will give that a try and report
back if that solved the issue for our collaborators...
>> > But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
>> > there is something else.
>>
>> Okay, it might be that PTRACE_SEIZE fixes it.
>
> Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?
I have not tested on >v3.1 with PTRACE_ATTACH. As you know, v3.1 was
when the PTRACE_SEIZE code was merged along with many other changes.
[I actually thought the merge occurred in 3.4 because of the ptrace
man page. I have submitted a bug report to get that fixed.] I have not
had any reports of the problem with Linux versions after and including
v3.1.
Again, I will see if the kill system call was the cause and report
back if so. Thanks for taking the time to look at the code!
--
Patrick Donnelly
* Re: Question regarding ptrace work for LInux v3.1
From: Oleg Nesterov @ 2016-03-21 19:35 UTC (permalink / raw)
To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel
On 03/21, Patrick Donnelly wrote:
>
> On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, exactly, you need to see the initial SIGSTOP or another event which
> > can be reported before it.
>
> Assuming a SIGSTOP is being silenced, is there anything we can do to
> forcibly start tracing syscalls? (For kernels without PTRACE_SEIZE)
No. Only PTRACE_SYSCALL can set TIF_SYSCALL_TRACE.
> > 	case SIGSTOP:
> > 		/* Black magic to get threads working on old Linux kernels... */
> >
> > 		if(p->nsyscalls == 0) { /* stop before we begin running the process */
> > 			debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
> > 			signum = 0; /* suppress delivery */
> > 			kill(p->pid,SIGCONT);
> > 		}
> > 		break;
> >
> > doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
> > group. So if this kill() races with another thread doing clone() you can
> > hit the problem you described.
>
> You're right, that should be tkill! I will give that a try and report
> back if that solved the issue for our collaborators...
Ah, sorry, I should have mentioned this...
No, tkill() won't help. See prepare_signal(): SIGCONT always removes
the SIG_KERNEL_STOP_MASK signals from all threads, no matter whether it
was sent by tkill() or kill().
Perhaps you should just remove this kill(SIGCONT)?
tracer_continue(signr => 0) should equally suppress the delivery. To
clarify, this won't be right either, but without PTRACE_SEIZE you simply
can't write code that handles the stop/cont/etc. events correctly
anyway...
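A minimal sketch of that suggestion: pick the signal to inject when resuming, and suppress the bootstrap SIGSTOP by passing 0 rather than by sending SIGCONT. The helpers `resume_signal` and `resume_tracee` are hypothetical stand-ins; the `nsyscalls` parameter mirrors the `p->nsyscalls` counter in the quoted code, and the real tracer_continue() in Parrot may differ.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* Decide which signal, if any, to deliver when the tracee is resumed.
 * event is the PTRACE_EVENT_* code of the stop, or 0 for a signal stop. */
int resume_signal(int stop_sig, int event, long nsyscalls)
{
	if (event != 0)
		return 0;        /* event stops carry no signal to forward */
	if (stop_sig == SIGSTOP && nsyscalls == 0)
		return 0;        /* bootstrap SIGSTOP: suppress, no SIGCONT needed */
	return stop_sig;     /* anything else is passed through to the tracee */
}

/* Resume the tracee until its next syscall entry or exit, injecting sig
 * (0 means no signal). Returns 0 on success, -1 with errno set on failure. */
long resume_tracee(pid_t pid, int sig)
{
	return ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)sig);
}
```

As noted above, this is still not fully correct for later SIGSTOP/SIGCONT handling without PTRACE_SEIZE; it only avoids the tracer itself discarding a sibling thread's pending SIGSTOP.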
> >> > But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
> >> > there is something else.
> >>
> >> Okay, it might be that PTRACE_SEIZE fixes it.
> >
> > Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?
>
> I have not tested on >v3.1 with PTRACE_ATTACH.
OK, thanks. So perhaps this is not v3.0-specific.
Oleg.
* Re: Question regarding ptrace work for LInux v3.1
From: Patrick Donnelly @ 2016-03-23 14:12 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel
On Mon, Mar 21, 2016 at 3:35 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/21, Patrick Donnelly wrote:
>> On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> > 	case SIGSTOP:
>> > 		/* Black magic to get threads working on old Linux kernels... */
>> >
>> > 		if(p->nsyscalls == 0) { /* stop before we begin running the process */
>> > 			debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
>> > 			signum = 0; /* suppress delivery */
>> > 			kill(p->pid,SIGCONT);
>> > 		}
>> > 		break;
>> >
>> > doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
>> > group. So if this kill() races with another thread doing clone() you can
>> > hit the problem you described.
>>
>> You're right, that should be tkill! I will give that a try and report
>> back if that solved the issue for our collaborators...
>
> Ah, sorry, I should have mentioned this...
>
> No, tkill() won't help. See prepare_signal(): SIGCONT always removes
> the SIG_KERNEL_STOP_MASK signals from all threads, no matter whether it
> was sent by tkill() or kill().
>
> Perhaps you should just remove this kill(SIGCONT)?
>
> tracer_continue(signr => 0) should equally suppress the delivery. To
> clarify, this won't be right either, but without PTRACE_SEIZE you simply
> can't write code that handles the stop/cont/etc. events correctly
> anyway...
Thanks so much Oleg. Indeed this was the problem.
--
Patrick Donnelly