Linux Container Development
 help / color / mirror / Atom feed
* [PATCH 00/26] Fixing wait, exit, ptrace, exec, and CLONE_THREAD
@ 2017-06-06 19:01 Eric W. Biederman
       [not found] ` <877f0pym71.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2017-06-06 19:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Kees Cook, Roland McGrath, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Linux Containers, Oleg Nesterov, David Howells, Al Viro,
	Thomas Gleixner, Linus Torvalds, Ingo Molnar,
	Michael Kerrisk (man-pages)


All,

I am posting this patches in the hope of some review of the strategy I
am taking and to let the individual patches be reviewed.

The intersection of wait, exit, ptrace, exec, and CLONE_THREAD does not
work correctly in the linux kernel.

An easy entry point into seeing the problem is that exec can wait
forever holding a mutex while waiting for userspace to call waitpid on a
ptraced thread.  This is bad enough it has been observed to cause
deadlocks in real world situations today.

Another easy entry point is to see that a multi-threaded setuid won't
change the credentials on a zombie thread group leader.  Which can allow
sending signals to a process that the credential change should forbid.
This is in violation of posix and the semantics we attempt to enforce in
linux.

I have been silent the last couple of weeks reading up on the history
and seeing if I could come up with a solution to the buggy semantics and
great big piles of technical debt that exists in this area of the
kernel.  It unfortunately requires understanding all of the pieces to
come up with a satisfactory long term solution to this problem.  My goal
is to fix enough of the issues that futher cleanups and bug fixes won't
require such comprehensive knowledge.

In large the solution I have found is to:

- Fix the semantics of wait.

  This makes it clear what the semantics the rest of the fixes need to
  maintain.

- To move everything possible from release_task to do_exit.

  Moving code from release_task to do_exit means that zombie
  threads can no longer be used to access shared thread group state,
  removing the problem of stale credentials on zombie threads.

  Much of the state that is freed in release_task instead of do_exit
  is freed there because originally during the CLONE_THREAD development
  there was no where else to put code that freed thread group state.
  And my changes largely move state freeing back where it used to
  be in 2.4 and earlier.

- To remove the concept of thread group leader (and allowing the first
  thread to exit).

  The concept of thread group leader keeps leading people into false
  mental models of what the kernel actually does.

  There is a lot of code that makes the mistake of assuming the thread
  group leader is always alive.  When the code doesn't make that mistake
  it requires additional code complexity to deal with zombie thread
  group leaders.  As seen in the signal code which still doesn't
  handle zombie thread group leaders correctly from a permission
  checking perspective.

- To promote tgid to a first class concept in the kernel.

  This is necessary if there isn't a thread group leader, and a
  significant part of the functional changes I am making.

- To only allow sending signals through living threads.

  This is a subset of removing the concept of thread group leader.  But
  worth calling out because signal handling is the most significant
  offender.

My first batch of changes includes some trivial cleanups and then
includes the small semantic changes (AKA user visible) I noticed that
are necessary to give userspace consistent semantics.

The full set of changes that I have come up with is at:
   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git exit-cleanups-for-testing

Unless someone has a better suggestion I intend to collect this up as
a topic branch and merge this through my tree.

Review on the details and of the concept would be very much appreciated.

Eric W. Biederman (26):
      alpha: Remove unused TASK_GROUP_LEADER
      cgroup: Don't open code tasklist_empty()
      signal: Do not perform permission checks when sending pdeath_signal
      signal: Make group_send_sig_info static
      exit:  Remove the pointless clearing of SIGPENDING in __exit_signal
      rlimit: Remove unnecessary grab of tasklist_lock
      pidns: Improve the error handling in alloc_pid
      exit: Make the runqueue rcu safe
      signal: Don't allow sending SIGKILL or SIGSTOP to init
      ptrace: Simplify ptrace_detach & exit_ptrace
      wait: Properly implement __WCLONE handling in the presence of exec and ptrace
      wait: Directly test for the two cases where wait_task_zombie is called
      wait: Remove unused delay_group_leader
      wait: Move changing of ptrace from wait_consider_task into wait_task_stopped
      wait: Don't delay !ptrace_reparented leaders
      exit: Fix reporting a ptraced !reparented leader has exited
      exit: Rework the exit states for ptracees
      wait: Fix WSTOPPED on a ptraced child
      wait: Simpler code for clearing notask_error in wait_consider_task
      wait: Don't pass the list to wait_consider_task
      wait: Optmize waitpid
      exit: Fix auto-wait of ptraced children
      signal:  Fix SIGCONT before group stop completes.
      signal: In ptrace_stop improve identical signal detection
      signal: In ptrace_stop use CLD_TRAPPED in all ptrace signals
      pidns: Ensure zap_pid_ns_processes always terminates

 arch/alpha/kernel/asm-offsets.c |   1 -
 include/linux/sched.h           |   7 +-
 include/linux/sched/signal.h    |  29 ++--
 include/linux/sched/task.h      |   4 +-
 include/linux/signal.h          |   1 -
 kernel/cgroup/cgroup.c          |   2 +-
 kernel/exit.c                   | 375 ++++++++++++++++------------------------
 kernel/fork.c                   |   3 +-
 kernel/pid.c                    |   8 +-
 kernel/pid_namespace.c          |   2 +-
 kernel/ptrace.c                 |  44 ++---
 kernel/sched/core.c             |   2 +-
 kernel/sched/fair.c             |   2 +-
 kernel/signal.c                 | 122 +++++++------
 kernel/sys.c                    |  13 +-
 15 files changed, 267 insertions(+), 348 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 00/26] Fixing wait, exit, ptrace, exec, and CLONE_THREAD
       [not found] ` <877f0pym71.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2017-06-06 19:40   ` Aleksa Sarai
       [not found]     ` <dd16b1bb-e99e-69f2-72f4-1be4cb24d18d-l3A5Bk7waGM@public.gmane.org>
       [not found]     ` <87ink8vxkf.fsf@xmission.com>
  2017-06-06 20:07   ` Linus Torvalds
  1 sibling, 2 replies; 6+ messages in thread
From: Aleksa Sarai @ 2017-06-06 19:40 UTC (permalink / raw)
  To: Eric W. Biederman, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Kees Cook, Roland McGrath, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Linux Containers, Oleg Nesterov, David Howells, Al Viro,
	Thomas Gleixner, Linus Torvalds, Ingo Molnar,
	Michael Kerrisk (man-pages)

> Another easy entry point is to see that a multi-threaded setuid won't
> change the credentials on a zombie thread group leader.  Which can allow
> sending signals to a process that the credential change should forbid.
> This is in violation of posix and the semantics we attempt to enforce in
> linux.

I might be completely wrong on this point (and I haven't looked at the 
patches), but I was under the impression that multi-threaded set[ug]id 
was implemented in userspace (by glibc's nptl(7) library that uses RT 
signals internally to get each thread to update their credentials). And 
given that, I wouldn't be surprised (as a user) that zombie threads will 
have stale credentials (glibc isn't running in those threads anymore).

Am I mistaken in that belief?

</off-topic>

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 00/26] Fixing wait, exit, ptrace, exec, and CLONE_THREAD
       [not found] ` <877f0pym71.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2017-06-06 19:40   ` Aleksa Sarai
@ 2017-06-06 20:07   ` Linus Torvalds
       [not found]     ` <CA+55aFze5rR+rGcG6kt=8PtfgAcs02jqQ7Gm-1=1MzkbA7_nqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2017-06-06 20:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Roland McGrath, Linux API, Linux Containers,
	Linux Kernel Mailing List, Oleg Nesterov, David Howells, Al Viro,
	Thomas Gleixner, Ingo Molnar, Michael Kerrisk (man-pages)

On Tue, Jun 6, 2017 at 12:01 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> I am posting this patches in the hope of some review of the strategy I
> am taking and to let the individual patches be reviewed.

I'm trying to look through these, and finding (as usual) that the
signal handling and exit code is extremely scary from a correctness
and security standpoint.

I really want Oleg to review/ack these. Oleg?

I also would really really want to see the stuff that actually changes
semantics split out.

For example, I feel much less nervous about things like making the
tasklist RCU-safe. So I'd like to see changes like that be separated
out from the much scarier ones. Would that be possible? Hint hint..

                Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 00/26] Fixing wait, exit, ptrace, exec, and CLONE_THREAD
       [not found]     ` <dd16b1bb-e99e-69f2-72f4-1be4cb24d18d-l3A5Bk7waGM@public.gmane.org>
@ 2017-06-07 11:36       ` Eric W. Biederman
  0 siblings, 0 replies; 6+ messages in thread
From: Eric W. Biederman @ 2017-06-07 11:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Kees Cook, Roland McGrath, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Linux Containers, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, David Howells, Al Viro,
	Thomas Gleixner, Linus Torvalds, Ingo Molnar,
	Michael Kerrisk (man-pages)

Aleksa Sarai <asarai-l3A5Bk7waGM@public.gmane.org> writes:

>> Another easy entry point is to see that a multi-threaded setuid won't
>> change the credentials on a zombie thread group leader.  Which can allow
>> sending signals to a process that the credential change should forbid.
>> This is in violation of posix and the semantics we attempt to enforce in
>> linux.
>
> I might be completely wrong on this point (and I haven't looked at the patches),
> but I was under the impression that multi-threaded set[ug]id was implemented in
> userspace (by glibc's nptl(7) library that uses RT signals internally to get
> each thread to update their credentials). And given that, I wouldn't be
> surprised (as a user) that zombie threads will have stale credentials (glibc
> isn't running in those threads anymore).
>
> Am I mistaken in that belief?

Would you be surprised if you learned that if your first thread
exits, it will become a zombie and persist for the lifetime of your
process?

Furthermore all non-thread specific signals will permission check
against that first zombie thread.

Which I think makes this surprising even if you know that setuid is
implemented in userspace.

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 00/26] Fixing wait, exit, ptrace, exec, and CLONE_THREAD
       [not found]       ` <87ink8vxkf.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2017-06-07 12:21         ` Aleksa Sarai
  0 siblings, 0 replies; 6+ messages in thread
From: Aleksa Sarai @ 2017-06-07 12:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Roland McGrath, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Linux Containers, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, David Howells, Al Viro,
	Thomas Gleixner, Linus Torvalds, Ingo Molnar,
	Michael Kerrisk (man-pages)

On 06/07/2017 09:36 PM, Eric W. Biederman wrote:
>>> Another easy entry point is to see that a multi-threaded setuid won't
>>> change the credentials on a zombie thread group leader.  Which can allow
>>> sending signals to a process that the credential change should forbid.
>>> This is in violation of posix and the semantics we attempt to enforce in
>>> linux.
>>
>> I might be completely wrong on this point (and I haven't looked at the patches),
>> but I was under the impression that multi-threaded set[ug]id was implemented in
>> userspace (by glibc's nptl(7) library that uses RT signals internally to get
>> each thread to update their credentials). And given that, I wouldn't be
>> surprised (as a user) that zombie threads will have stale credentials (glibc
>> isn't running in those threads anymore).
>>
>> Am I mistaken in that belief?
> 
> Would you be surprised if you learned that if your first thread
> exits, it will become a zombie and persist for the lifetime of your
> process?
> 
> Furthermore all non-thread specific signals will permission check
> against that first zombie thread.

Ah okay, so it really is a matter of Linux's threadgroup semantics just 
not being "right" on a more fundamental level than nptl.

> Which I think makes this surprising even if you know that setuid is
> implemented in userspace.

Quite surprising, thanks for the explanation.

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 00/26] Fixing wait, exit, ptrace, exec, and CLONE_THREAD
       [not found]     ` <CA+55aFze5rR+rGcG6kt=8PtfgAcs02jqQ7Gm-1=1MzkbA7_nqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-07 15:59       ` Eric W. Biederman
  0 siblings, 0 replies; 6+ messages in thread
From: Eric W. Biederman @ 2017-06-07 15:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Roland McGrath, Linux API, Linux Containers,
	Linux Kernel Mailing List, Oleg Nesterov, David Howells, Al Viro,
	Thomas Gleixner, Ingo Molnar, Michael Kerrisk (man-pages)

Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> writes:

> On Tue, Jun 6, 2017 at 12:01 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> I am posting this patches in the hope of some review of the strategy I
>> am taking and to let the individual patches be reviewed.
>
> I'm trying to look through these, and finding (as usual) that the
> signal handling and exit code is extremely scary from a correctness
> and security standpoint.
>
> I really want Oleg to review/ack these. Oleg?

Definitely.  The more review I can get the better. 

> I also would really really want to see the stuff that actually changes
> semantics split out.
>
> For example, I feel much less nervous about things like making the
> tasklist RCU-safe. So I'd like to see changes like that be separated
> out from the much scarier ones. Would that be possible? Hint hint..

The patches that I posted are the ones that I would really like to have
ready for the 4.13 merge window.  They are supposed to be the least
scary patches that don't really depend on anything else, and the semantic changes.

After a first brush with code review the non-scary patches wind up just
being these.

[PATCH 01/26] alpha: Remove unused TASK_GROUP_LEADER
[PATCH 02/26] cgroup: Don't open code tasklist_empty()
[PATCH 05/26] exit:  Remove the pointless clearing of SIGPENDING in __exit_signal
[PATCH 07/26] pidns: Improve the error handling in alloc_pid
[PATCH 08/26] exit: Make the runqueue rcu safe

This turns out to have a dependency I overlooked so I am dropping it for now.
[PATCH 06/26] rlimit: Remove unnecessary grab of tasklist_lock

There is a small pair of a semantic change and it's dependency:
[PATCH 03/26] signal: Do not perform permission checks when sending pdeath_signal
[PATCH 04/26] signal: Make group_send_sig_info static

There is a teeny tiny semantic change:
[PATCH 09/26] signal: Don't allow sending SIGKILL or SIGSTOP to init

The deeply related scarier changes (+ indicates it includes a semantic change):
 [PATCH 10/26] ptrace: Simplify ptrace_detach & exit_ptrace
+[PATCH 11/26] wait: Properly implement __WCLONE handling in the presence of exec and ptrace
 [PATCH 12/26] wait: Directly test for the two cases where wait_task_zombie is called
 [PATCH 13/26] wait: Remove unused delay_group_leader
+[PATCH 14/26] wait: Move changing of ptrace from wait_consider_task into wait_task_stopped
+[PATCH 15/26] wait: Don't delay !ptrace_reparented leaders
+[PATCH 16/26] exit: Fix reporting a ptraced !reparented leader has exited
 [PATCH 17/26] exit: Rework the exit states for ptracees
+[PATCH 18/26] wait: Fix WSTOPPED on a ptraced child
 [PATCH 19/26] wait: Simpler code for clearing notask_error in wait_consider_task
 [PATCH 20/26] wait: Don't pass the list to wait_consider_task
 [PATCH 21/26] wait: Optmize waitpid
+[PATCH 22/26] exit: Fix auto-wait of ptraced children
+[PATCH 23/26] signal: Fix SIGCONT before group stop completes.
+[PATCH 24/26] signal: In ptrace_stop improve identical signal detection
+[PATCH 25/26] signal: In ptrace_stop use CLD_TRAPPED in all ptrace signals
+[PATCH 26/26] pidns: Ensure zap_pid_ns_processes always terminates

Out of my 170 or so total changes those are the bulk of the semantic
changes and definitely the scariest.  Even those above are very
conservative and are really just about sorting out the weirdness in the
semantics when we are ptracing our own child which causes wait and the
siginfo of SIGCHLD to have two sets of meanings.

I am hoping we can just get 1,2,5,7, and 8 reviewed and I can just apply
them to my for-next branch.

If I need to repost I will respect the split-out I have described above.

At this point I am happy to see that people are not scared when I
suggest killing the concept of the thread group leader.

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-06-07 15:59 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-06 19:01 [PATCH 00/26] Fixing wait, exit, ptrace, exec, and CLONE_THREAD Eric W. Biederman
     [not found] ` <877f0pym71.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-06-06 19:40   ` Aleksa Sarai
     [not found]     ` <dd16b1bb-e99e-69f2-72f4-1be4cb24d18d-l3A5Bk7waGM@public.gmane.org>
2017-06-07 11:36       ` Eric W. Biederman
     [not found]     ` <87ink8vxkf.fsf@xmission.com>
     [not found]       ` <87ink8vxkf.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-06-07 12:21         ` Aleksa Sarai
2017-06-06 20:07   ` Linus Torvalds
     [not found]     ` <CA+55aFze5rR+rGcG6kt=8PtfgAcs02jqQ7Gm-1=1MzkbA7_nqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-07 15:59       ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox