public inbox for linux-kernel@vger.kernel.org
* Re: Question regarding ptrace work for Linux v3.1
From: Oleg Nesterov @ 2016-03-21 17:47 UTC
  To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel

Hello Patrick,

On 03/18, Patrick Donnelly wrote:
>
> We are currently trying to debug a problem with ptrace that I believe
> was incidentally fixed by you and maybe Tejun Heo in Linux v3.1.

So let me add Tejun and lkml,

> The
> issue is on github [2] but I will describe it here briefly. My hope is
> that you may remember fixing this and a patch may be made for v3.0.
> [An HPC center is using Linux v3.0 which exhibits this.]

Heh, sorry, I can't recall anything related ;)

> The basic problem is that the application we are tracing spawns
> threads which **sometimes** are not traced (or lost). For Linux v3.0,
> we are using PTRACE_ATTACH and the PTRACE_O_TRACE(CLONE|FORK|VFORK)
> options to follow children [3]. The problem we see is that we will
> receive a PTRACE_EVENT_CLONE event that a thread is created but we
> receive no other events for the thread.

IOW, the new thread does not report the SIGSTOP injected by the implicit attach?

> What's worse is that the
> thread eventually "comes back" via a PTRACE_EVENT_CLONE when it clones
> its own thread.

OK, so at least the new child is traced too, and it also has PT_TRACE_*
flags copied from its parent.

> Do you recall fixing anything for v3.1 that might cause this problem?

No. I do not see how the new tracee can miss that SIGSTOP/TIF_SIGPENDING.

To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
For example, SIGCONT from somewhere can remove the pending (and not yet
reported) SIGSTOP, and this _can_ explain the problem you hit.

But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
there is something else.
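
A minimal sketch of the effect described above, under stated assumptions. SIGSTOP itself cannot be blocked, so a pending SIGSTOP cannot be observed from user space without a tracer; the sketch uses SIGTSTP as a stand-in, since it is also one of the stop signals that SIGCONT discards. The function name is hypothetical and not taken from Parrot.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a pending stop signal disappeared once SIGCONT was sent,
 * 0 if it did not, and -1 if the signal mask could not be changed.
 */
int sigcont_discards_pending_stop(void)
{
	sigset_t block, old, pending;
	int before, after, sig;

	/* Block SIGTSTP so it stays pending instead of stopping us. */
	sigemptyset(&block);
	sigaddset(&block, SIGTSTP);
	if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
		return -1;

	raise(SIGTSTP);
	sigpending(&pending);
	before = sigismember(&pending, SIGTSTP);

	/* SIGCONT "from somewhere": here we simply send it to ourselves. */
	kill(getpid(), SIGCONT);
	sigpending(&pending);
	after = sigismember(&pending, SIGTSTP);

	/* If the stop signal survived, consume it before unblocking. */
	if (after == 1)
		sigwait(&block, &sig);
	sigprocmask(SIG_SETMASK, &old, NULL);

	return before == 1 && after == 0;
}
```

In a tracer the same thing can happen to the SIGSTOP queued for a freshly attached thread: if it is discarded before it is reported, the tracer never sees the initial stop.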

It would be nice to have a test-case :/

Oleg.

> [1] http://ccl.cse.nd.edu/software/parrot/
> [2] https://github.com/cooperative-computing-lab/cctools/issues/1207
> [3] https://github.com/cooperative-computing-lab/cctools/blob/f82288167b1b5abb836b1d9b8135c98f71ed90f6/parrot/src/tracer.c#L91-L128
>
> --
> Patrick Donnelly


* Re: Question regarding ptrace work for Linux v3.1
From: Patrick Donnelly @ 2016-03-21 18:28 UTC
  To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel

On Mon, Mar 21, 2016 at 1:47 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> The basic problem is that the application we are tracing spawns
>> threads which **sometimes** are not traced (or lost). For Linux v3.0,
>> we are using PTRACE_ATTACH and the PTRACE_O_TRACE(CLONE|FORK|VFORK)
>> options to follow children [3]. The problem we see is that we will
>> receive a PTRACE_EVENT_CLONE event that a thread is created but we
>> receive no other events for the thread.
>
> IOW, the new thread does not report the SIGSTOP injected by the implicit attach?

It does not report the SIGSTOP nor the system calls leading up to the
PTRACE_EVENT_CLONE (not even the entry into the clone syscall).

>> What's worse is that the
>> thread eventually "comes back" via a PTRACE_EVENT_CLONE when it clones
>> its own thread.
>
> OK, so at least the new child is traced too, and it also has PT_TRACE_*
> flags copied from its parent.

That seems to be the case but it will only report certain events (not
syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
events... Hmm, now that I think about this, it would be necessary to
see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
problem.
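
For reference, a rough sketch of how a tracer might tell these stops apart from the status returned by waitpid(). The enum and function names are hypothetical; the syscall-stop case assumes the tracer set PTRACE_O_TRACESYSGOOD, which the thread does not confirm for Parrot.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical classification of a waitpid() status for a tracee. */
enum stop_kind {
	STOP_NOT_STOPPED,   /* exited or killed, not a stop at all */
	STOP_EVENT_CLONE,   /* PTRACE_EVENT_CLONE */
	STOP_EVENT_EXIT,    /* PTRACE_EVENT_EXIT */
	STOP_OTHER_EVENT,   /* any other PTRACE_EVENT_* */
	STOP_SYSCALL,       /* syscall entry/exit (needs PTRACE_O_TRACESYSGOOD) */
	STOP_SIGNAL         /* ordinary signal stop, e.g. the initial SIGSTOP */
};

enum stop_kind classify_stop(int status)
{
	int sig, event;

	if (!WIFSTOPPED(status))
		return STOP_NOT_STOPPED;

	sig = WSTOPSIG(status);
	/* Event stops report SIGTRAP with the event code in the next byte. */
	event = (status >> 16) & 0xffff;

	if (sig == SIGTRAP && event == PTRACE_EVENT_CLONE)
		return STOP_EVENT_CLONE;
	if (sig == SIGTRAP && event == PTRACE_EVENT_EXIT)
		return STOP_EVENT_EXIT;
	if (event != 0)
		return STOP_OTHER_EVENT;
	if (sig == (SIGTRAP | 0x80))
		return STOP_SYSCALL;
	return STOP_SIGNAL;
}
```

A thread whose initial SIGSTOP was lost would only ever produce the event kinds here, never STOP_SYSCALL, which matches the symptom described above.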

> To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
> For example, SIGCONT from somewhere can remove the pending (and not yet
> reported) SIGSTOP, and this _can_ explain the problem you hit.

The tree of processes being traced does not send any signals, but an
external process may have. However, I did notice the use of futexes
near these clones. Perhaps that may be causing this?

> But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
> there is something else.

Okay, it might be that PTRACE_SEIZE fixes it.

> It would be nice to have a test-case :/

Unfortunately, I have not yet been able to isolate a test case.

Thanks for your help!

-- 
Patrick Donnelly


* Re: Question regarding ptrace work for Linux v3.1
From: Oleg Nesterov @ 2016-03-21 19:07 UTC
  To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel

On 03/21, Patrick Donnelly wrote:
>
> That seems to be the case but it will only report certain events (not
> syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
> events... Hmm, now that I think about this, it would be necessary to
> see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
> syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
> problem.

Yes, exactly, you need to see the initial SIGSTOP or another event which
can be reported before it.

> > To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
> > For example, SIGCONT from somewhere can remove the pending (and not yet
> > reported) SIGSTOP, and this _can_ explain the problem you hit.
>
> The tree of processes being traced does not send any signals, but an
> external process may have.

I am looking into

   https://github.com/cooperative-computing-lab/cctools/blob/5ccb04599ba2ee125730981f53add80d98cf8161/parrot/src/pfs_main.cc

and this code

	case SIGSTOP:
	/* Black magic to get threads working on old Linux kernels... */

	if(p->nsyscalls == 0) { /* stop before we begin running the process */
		debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
		signum = 0; /* suppress delivery */
		kill(p->pid,SIGCONT);
	}
	break;

doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
group. So if this kill() races with another thread doing clone() you can
hit the problem you described.

> However, I did notice the use of futexes
> near these clones. Perhaps that may be causing this?

I don't think so.

> > But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
> > there is something else.
>
> Okay, it might be that PTRACE_SEIZE fixes it.

Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?

Oleg.


* Re: Question regarding ptrace work for Linux v3.1
From: Patrick Donnelly @ 2016-03-21 19:24 UTC
  To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel

On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/21, Patrick Donnelly wrote:
>>
>> That seems to be the case but it will only report certain events (not
>> syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
>> events... Hmm, now that I think about this, it would be necessary to
>> see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
>> syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
>> problem.
>
> Yes, exactly, you need to see the initial SIGSTOP or another event which
> can be reported before it.

Assuming a SIGSTOP is being silenced, is there anything we can do to
forcibly start tracing syscalls? (For kernels without PTRACE_SEIZE)

>> > To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
>> > For example, SIGCONT from somewhere can remove the pending (and not yet
>> > reported) SIGSTOP, and this _can_ explain the problem you hit.
>>
>> The tree of processes being traced does not send any signals, but an
>> external process may have.
>
> I am looking into
>
>    https://github.com/cooperative-computing-lab/cctools/blob/5ccb04599ba2ee125730981f53add80d98cf8161/parrot/src/pfs_main.cc
>
> and this code
>
>         case SIGSTOP:
>         /* Black magic to get threads working on old Linux kernels... */
>
>         if(p->nsyscalls == 0) { /* stop before we begin running the process */
>                 debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
>                 signum = 0; /* suppress delivery */
>                 kill(p->pid,SIGCONT);
>         }
>         break;
>
> doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
> group. So if this kill() races with another thread doing clone() you can
> hit the problem you described.

You're right, that should be tkill! I will give that a try and report
back if that solved the issue for our collaborators...

>> > But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
>> > there is something else.
>>
>> Okay, it might be that PTRACE_SEIZE fixes it.
>
> Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?

I have not tested on >v3.1 with PTRACE_ATTACH. As you know, v3.1 was
when the PTRACE_SEIZE code was merged along with many other changes.
[I actually thought the merge occurred in 3.4 because of the ptrace
man page. I have submitted a bug report to get that fixed.] I have not
had any reports of the problem with Linux versions after and including
v3.1.

Again, I will see if the kill system call was the cause and report
back if so. Thanks for taking the time to look at the code!

-- 
Patrick Donnelly


* Re: Question regarding ptrace work for Linux v3.1
From: Oleg Nesterov @ 2016-03-21 19:35 UTC
  To: Patrick Donnelly; +Cc: Tejun Heo, linux-kernel

On 03/21, Patrick Donnelly wrote:
>
> On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, exactly, you need to see the initial SIGSTOP or another event which
> > can be reported before it.
>
> Assuming a SIGSTOP is being silenced, is there anything we can do to
> forcibly start tracing syscalls? (For kernels without PTRACE_SEIZE)

No. Only PTRACE_SYSCALL can set TIF_SYSCALL_TRACE.
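
As an illustration only, the resume step could be wrapped as below. The helper name is hypothetical; it simply issues PTRACE_SYSCALL for a tracee that is currently in a ptrace stop, passing the signal to deliver on resume (0 delivers nothing).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: resume a stopped tracee so that it stops again at
 * the next syscall entry or exit. Returns 0 on success, or -1 with errno
 * set, for example when pid is not a stopped tracee of the caller.
 */
long resume_with_syscall_tracing(pid_t pid, int signum)
{
	return ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)signum);
}
```

Because the request only succeeds while the tracee is stopped, a thread whose first stop was never reported cannot be switched into syscall tracing this way.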

> >         case SIGSTOP:
> >         /* Black magic to get threads working on old Linux kernels... */
> >
> >         if(p->nsyscalls == 0) { /* stop before we begin running the process */
> >                 debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
> >                 signum = 0; /* suppress delivery */
> >                 kill(p->pid,SIGCONT);
> >         }
> >         break;
> >
> > doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
> > group. So if this kill() races with another thread doing clone() you can
> > hit the problem you described.
>
> You're right, that should be tkill! I will give that a try and report
> back if that solved the issue for our collaborators...

Ah, sorry, I should have mentioned this...

No, tkill() won't help. See prepare_signal(): SIGCONT always removes
the SIG_KERNEL_STOP_MASK signals from all threads, no matter whether it
was sent by tkill() or kill().
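
A small sketch of the thread-directed case, with caveats. It is single-threaded, so it only shows that a SIGCONT sent to one specific thread still discards a pending stop signal; it does not demonstrate the cross-thread flush. SIGTSTP stands in for SIGSTOP because SIGSTOP cannot be blocked, and the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send sig to one specific thread via tgkill. */
static long send_to_thread(pid_t tgid, pid_t tid, int sig)
{
	return syscall(SYS_tgkill, tgid, tid, sig);
}

/* Returns 1 if the pending stop signal was discarded, 0 if not, -1 on error. */
int thread_directed_sigcont_discards_stop(void)
{
	sigset_t block, old, pending;
	pid_t tid = (pid_t)syscall(SYS_gettid);
	int before, after, sig;

	/* Keep SIGTSTP pending rather than letting it stop us. */
	sigemptyset(&block);
	sigaddset(&block, SIGTSTP);
	if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
		return -1;

	send_to_thread(getpid(), tid, SIGTSTP);
	sigpending(&pending);
	before = sigismember(&pending, SIGTSTP);

	/* Thread-directed SIGCONT, as a tkill()-based fix would send it. */
	send_to_thread(getpid(), tid, SIGCONT);
	sigpending(&pending);
	after = sigismember(&pending, SIGTSTP);

	/* If the stop signal survived, consume it before unblocking. */
	if (after == 1)
		sigwait(&block, &sig);
	sigprocmask(SIG_SETMASK, &old, NULL);

	return before == 1 && after == 0;
}
```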

Perhaps you should just remove this kill(SIGCONT) ?

tracer_continue(signr => 0) should equally suppress the delivery. To
clarify, this won't be right either, but without PTRACE_SEIZE you simply
can't write code that handles the stop/cont/etc. events correctly
anyway...
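
One way to picture the suggested change is the sketch below: the bootstrap SIGSTOP is suppressed purely by choosing signal 0 when resuming, with no SIGCONT sent. The struct and function are hypothetical stand-ins; only the nsyscalls field mirrors the quoted Parrot snippet.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-in for the tracer's per-process state. */
struct traced_process {
	unsigned long nsyscalls;   /* syscalls seen so far for this tracee */
};

/*
 * Decide which signal to pass along when resuming the tracee.
 * A SIGSTOP seen before any syscall is treated as the bootstrap stop
 * from attach and is swallowed; everything else is forwarded unchanged.
 */
int signal_to_forward(const struct traced_process *p, int signum)
{
	if (signum == SIGSTOP && p->nsyscalls == 0)
		return 0;   /* suppress delivery; no kill(SIGCONT) needed */
	return signum;
}
```

The returned value would then be passed as the signal argument of the resume request. As noted above, this is still not fully correct without PTRACE_SEIZE; it only avoids discarding other threads' pending stop signals.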

> >> > But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
> >> > there is something else.
> >>
> >> Okay, it might be that PTRACE_SEIZE fixes it.
> >
> > Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?
>
> I have not tested on >v3.1 with PTRACE_ATTACH.

OK, thanks. So perhaps this is not v3.0-specific.

Oleg.


* Re: Question regarding ptrace work for Linux v3.1
From: Patrick Donnelly @ 2016-03-23 14:12 UTC
  To: Oleg Nesterov; +Cc: Tejun Heo, linux-kernel

On Mon, Mar 21, 2016 at 3:35 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/21, Patrick Donnelly wrote:
>> On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >         case SIGSTOP:
>> >         /* Black magic to get threads working on old Linux kernels... */
>> >
>> >         if(p->nsyscalls == 0) { /* stop before we begin running the process */
>> >                 debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
>> >                 signum = 0; /* suppress delivery */
>> >                 kill(p->pid,SIGCONT);
>> >         }
>> >         break;
>> >
>> > doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
>> > group. So if this kill() races with another thread doing clone() you can
>> > hit the problem you described.
>>
>> You're right, that should be tkill! I will give that a try and report
>> back if that solved the issue for our collaborators...
>
> Ah, sorry, I should have mentioned this...
>
> No, tkill() won't help. See prepare_signal(): SIGCONT always removes
> the SIG_KERNEL_STOP_MASK signals from all threads, no matter whether it
> was sent by tkill() or kill().
>
> Perhaps you should just remove this kill(SIGCONT) ?
>
> tracer_continue(signr => 0) should equally suppress the delivery. To
> clarify, this won't be right either, but without PTRACE_SEIZE you simply
> can't write code that handles the stop/cont/etc. events correctly
> anyway...

Thanks so much Oleg. Indeed this was the problem.

-- 
Patrick Donnelly
