Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack


From: Pavel Emelyanov <xemul@parallels.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>,
	Andrew Vagin <avagin@gmail.com>,
	Aditya Kali <adityakali@google.com>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Oleg Nesterov <oleg@redhat.com>, <linux-kernel@vger.kernel.org>,
	<criu@openvz.org>, Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kees Cook <keescook@chromium.org>
Subject: Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
Date: Mon, 17 Feb 2014 12:34:12 +0400
Message-ID: <5301C984.40904@parallels.com>
In-Reply-To: <87txc1pibc.fsf@xmission.com>

On 02/15/2014 12:09 AM, Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> On 02/14/2014 11:16 PM, Eric W. Biederman wrote:
>>> Cyrill Gorcunov <gorcunov@gmail.com> writes:
>>>
>>>> On Fri, Feb 14, 2014 at 09:43:14PM +0400, Andrew Vagin wrote:
>>>>>> My brain hurts just looking at this patch and how you are justifying it.
>>>>>>
>>>>>> For the resources you are mucking with below all you have to do is to
>>>>>> verify that you are below the appropriate rlimit at all times and no
>>>>>> CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
>>>>>> to exceed your per process limits.
>>>>>>
>>>>>> All you have to do is to fix the current code to properly enforce the
>>>>>> limits.
>>>>>
>>>>> I'm afraid what you are suggesting doesn't work.
>>>>>
>>>>> The first reason is that we can not change both boundaries in one call.
>>>>> But when we are restoring these attributes, we may need to move their
>>>>> too far.
>>>>
>>>> When this code was introduced, there were no user-namespace implementation,
>>>> if I remember correctly, so CAP_SYS_RESOURCE was enough barrier point
>>>> to prevent modifying this values by anyone. Now user-ns brings a limit --
>>>> we need somehow to provide a way to modify these mm fields having no
>>>> CAP_SYS_RESOURCE set. "Verifying rlimit" not an option here because
>>>> we're modifying members one by one (looking back I think this was not
>>>> a good idea to modify the fields in this manner).
>>>>
>>>> Maybe we could improve this api and provide argument as a pointer
>>>> to a structure, which would have all the fields we're going to
>>>> modify, which in turn would allow us to verify that all new values
>>>> are sane and fit rlimits, then we could (probably) deprecate old
>>>> api if noone except c/r camp is using it (I actually can't imagine
>>>> who else might need this api). Then CAP_SYS_RESOURCE requirement
>>>> could be ripped off. Hm? (sure touching api is always "no-no"
>>>> case, but maybe...)
>>>
>>> Hmm.  Let me rewind this a little bit.
>>>
>>> I want to be very stupid and ask the following.
>>>
>>> Why can't you have the process of interest do:
>>> 	ptrace(PTRACE_ATTACHME);
>>> 	execve(executable, args, ...);
>>>         
>>>         /* Have the ptracer inject the recovery/fixup code */
>>> 	/* Fix up the mostly correct process to look like it has been
>>>          * executing for a while.
>>>          */
>>
>> Let's imagine we do that.
>>
>> This means, that the whole memory contents should be restored _after_
>> the execve() call, since the execve() flushes old mappings. In
>> that case we lose the ability to preserve any shared memory regions
>> between any two processes. This "shared" can be either regular
>> MAP_SHARED mappings or MAP_ANONYMOUS but still not COW-ed ones.
> 
> If we have MAP_ANONYMOUS but not COW-ed mappings we have the correct
> executable, which implies we have everything else correct except for the
> brk and the stack addresses, because the process was started with fork.
> 
> So while that sounds like an interesting case to handle it does not seem
> to invalidate the idea of using exec to set all of the other fields when
> we need to set them.

Well, yes, what you describe is what we call "inheritable resources". These are,
e.g., SIDs or shared FD tables and MMs. It's OK to restore them at fork(), but I'd
like to draw your attention to two concerns I have with this approach.

1. Inheritable resources can potentially be restored more than once. Consider
   a task tree that looks like this:

task-A  -[exe]-> A
 `- task-C1 -[exe]-> C
 `- task-C2 -[exe]-> C

IOW -- task A has executable A and two kids C1 and C2 that share executable C.

In that case the restore sequence should look like this

* Task A calls execve() on C
* Task A forks C1
* Task A forks C2
* Task A calls execve() on A

This does work, I agree, but task A has to call execve() twice, and even more times
if we also had, e.g., kids D1 and D2 with a different exe D. Now, why do I think
that's a problem? Please see concern #2 :)
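
A minimal sketch of the counting argument above. The helper name
parent_execve_calls() is hypothetical, and it assumes that kids sharing the
parent's own executable are forked after the parent's final execve(), so they
add no extra calls:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: how many times must a parent call execve() so that
 * each kid is forked from the right executable and the parent still ends
 * up running its own one?
 */
static size_t parent_execve_calls(const char *parent_exe,
                                  const char *const child_exe[], size_t n)
{
    size_t calls = 1; /* the final execve() onto the parent's own exe */

    assert(parent_exe != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Assumption: kids with the parent's exe cost nothing extra. */
        int seen = strcmp(child_exe[i], parent_exe) == 0;

        /* Has an earlier kid already paid for this executable? */
        for (size_t j = 0; j < i && !seen; j++)
            seen = strcmp(child_exe[i], child_exe[j]) == 0;
        if (!seen)
            calls++;
    }
    return calls;
}
```

For the tree above (parent A, kids C1 and C2 on exe C) this gives 2 calls;
adding D1 and D2 on exe D gives 3.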

2. What you propose means we have to effectively strace an execve-ing task.
Compared with a plain prctl this is up to ~600 times slower. I ran the following
experiment:

* Idle node with plenty of free RAM
* Simple proggie doing execve() on itself 1000 times, compiled statically to avoid
  ld.so spoiling the times, run under strace
* Another proggie doing open() + prctl() 1000 times.

The first task took ~12 sec to complete. The second -- ~0.02 seconds.

If we take an average container of 100 tasks, even with all different exe links, your
approach would give us ~1 sec more to restore, while the existing one would be almost
a no-op. And this hits us even without the inheritance scenario I demonstrated above.
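
The estimate is plain arithmetic on the measured numbers, assuming one call
per task. As a sketch (the function name restore_overhead_sec() is
hypothetical):

```c
#include <assert.h>

/* Extra restore time if every task pays for one call of the measured kind. */
static double restore_overhead_sec(double batch_sec, int batch_calls, int tasks)
{
    assert(batch_calls > 0);
    return batch_sec / batch_calls * tasks;
}
```

With 12 sec per 1000 traced execve() calls and 100 tasks this comes to about
1.2 sec; with 0.02 sec per 1000 open() + prctl() rounds it is about 0.002 sec.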

Please keep in mind that checkpoint-restore is not only live migration; we have
use cases where the restore cannot be prepared in advance to improve down-time.
It _must_ be as fast as possible.

That said, Eric, I do agree with your concern about security, I _am_ ready to rework
this stuff and kill the whole bunch of prctls we have. But please! Very please! Can
we come up with an API for restoring the mm->foo-s and ->exe_link that is at most ...
5 times slower than the existing prctl? It's really-really important for us!

Maybe we can make prctl() do a lite-execve()? It would open the executable, read the
required amount of headers and just put the data read from there onto the mm-struct.
This should be MUCH better than a full execve() with loading all binary data, plus
strace and flushing old mm-s.
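
To give a feel for how little needs to be read, here is a userspace-only
sketch that fetches just the ELF file header of an executable. It is not the
proposed kernel code: the function name read_elf64_header() is hypothetical,
64-bit ELF is assumed, and which mm fields would be filled from the headers is
left open above.

```c
#include <assert.h>
#include <elf.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Read only the ELF file header of `path` into *hdr.
 * Returns 0 on success, -1 if the file cannot be read or is not 64-bit ELF.
 */
static int read_elf64_header(const char *path, Elf64_Ehdr *hdr)
{
    int fd, ok;

    assert(path != NULL && hdr != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* One short read, then check the magic bytes and the ELF class. */
    ok = read(fd, hdr, sizeof(*hdr)) == (ssize_t)sizeof(*hdr) &&
         memcmp(hdr->e_ident, ELFMAG, SELFMAG) == 0 &&
         hdr->e_ident[EI_CLASS] == ELFCLASS64;
    close(fd);
    return ok ? 0 : -1;
}
```

On success, hdr->e_phoff and hdr->e_phnum say where the program headers live
and how many there are, which is the next (still small) chunk such a
lite-execve() would need to look at.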

> Eric

Thanks,
Pavel


Thread overview: 21+ messages
2014-02-14 14:13 [PATCH RFC 0/3] c/r: add ability to restore mm attributes in a non-root userns Andrey Vagin
2014-02-14 14:13 ` [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack Andrey Vagin
2014-02-14 16:05   ` Eric W. Biederman
2014-02-14 17:43     ` Andrew Vagin
2014-02-14 18:01       ` [CRIU] " Cyrill Gorcunov
2014-02-14 19:16         ` Eric W. Biederman
2014-02-14 19:47           ` Pavel Emelyanov
2014-02-14 20:06             ` Cyrill Gorcunov
2014-02-14 20:18               ` Eric W. Biederman
2014-02-15  6:29                 ` Cyrill Gorcunov
2014-02-15 23:01                   ` Eric W. Biederman
2014-02-14 20:09             ` Eric W. Biederman
2014-02-17  8:34               ` Pavel Emelyanov [this message]
2014-02-17  8:52                 ` Cyrill Gorcunov
2014-02-17 16:57                   ` Pavel Emelyanov
2014-03-07 13:51                 ` Pavel Emelyanov
2014-02-14 20:44           ` Andrey Wagin
2014-02-15 23:05             ` Eric W. Biederman
2014-02-14 14:13 ` [PATCH 2/3] capabilities: add a secure bit to allow changing a task exe link Andrey Vagin
2014-02-18  4:53   ` Serge E. Hallyn
2014-02-14 14:13 ` [PATCH 3/3] prctl: allow to use PR_MM_SET_* which affect only a current task Andrey Vagin
