Linux Container Development
 help / color / mirror / Atom feed
From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
To: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Subject: Re: [RFC][PATCH 4/4] PID: use the target ID specified in procfs
Date: Thu, 13 Mar 2008 16:01:02 -0400	[thread overview]
Message-ID: <47D987FE.9040909@cs.columbia.edu> (raw)
In-Reply-To: <m1r6eewmj2.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>



Eric W. Biederman wrote:
> Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org> writes:
> 
>> Eric W. Biederman wrote:
>>> "Nadia Derbey" <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org> writes:
>>>
>>>> A couple of weeks ago, a discussion has started after Pierre's proposal for
>>>> a new syscall to change an ipc id (see thread
>>>> http://lkml.org/lkml/2008/1/29/209).
>>>>
>>>>
>>>> Oren's suggestion was to force an object's id during its creation, rather
>>>> than 1. create it, 2. change its id.
>>>>
>>>> So here is an implementation of what Oren has suggested.
>>>>
>>>> 2 new files are defined under /proc/self:
>>>>  . next_ipcid --> next id to use for ipc object creation
>>>>  . next_pids --> next upid nr(s) to use for next task to be forked
>>>>                  (see patch #2 for more details).
>>>>
>>>> When one of these files (or both of them) is filled, a structure pointed to
>>>> by the calling task struct is filled with these ids.
>>>>
>>>> Then, when the object is created, the id(s) present in that structure are
>>>> used, instead of the default ones.
>>>> A couple of weeks ago, a discussion has started after Pierre's proposal for
>>>> a new syscall to change an ipc id (see thread
>>>> http://lkml.org/lkml/2008/1/29/209).
>>>>
>>>>
>>>> Oren's suggestion was to force an object's id during its creation, rather
>>>> than 1. create it, 2. change its id.
>>>>
>>>> So here is an implementation of what Oren has suggested.
>>>>
>>>> 2 new files are defined under /proc/self:
>>>>  . next_ipcid --> next id to use for ipc object creation
>>>>  . next_pids --> next upid nr(s) to use for next task to be forked
>>>>                  (see patch #2 for more details).
>>>>
>>>> When one of these files (or both of them) is filled, a structure pointed to
>>>> by the calling task struct is filled with these ids.
>>>>
>>>> Then, when the object is created, the id(s) present in that structure are
>>>> used, instead of the default ones.
>>>
>>> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>>>
>>>
>>>> Right the alloc_pidmap() changes will probably be pretty much the same
>>>> no matter how we do set_it(), so it's worth discussing.  But I'm
>>>> particularly curious to see what opinions are on the sys_setid().
>>>
>>> A couple of comments.  With respect to alloc_pidmap we already have
>>> the necessary controls (a minimum and a maximum) in place for the
>>> allocation.  So except for double checking that those controls are exported
>>> in /proc/sys we don't necessarily need to do anything, special.
>>>
>>> Just play games with the minimum pid value before you fork.
>> Excellent idea! It's trus that properly setting things, we can make the loops
>> executed only once in case we want to use a predefined id.
>>
>>> Second at least to get the memory map correct we need additional
>>> kernel support.
>>>
>>>
>>> Third to actually get the values out it appears we need additional kernel
>>> support as well.  From my limited playing with these things at least
>>> parts of the code were easier to implement in the kernel.  The hoops
>>> you have to go to restore a single process (without threads) are
>>> absolutely horrendous in user space.
> 
> I was thinking in particular of the mm setup above.
> 
>> Ok, but if we have a process that belongs to nested namepsaces, we have no other
>> choice than provide its upid nrs hierarchy, right?
> 
> At least a part of it.  You only have to provide as much of the nested
> pid hiearchy as you are restoring from your checkpoint.
> 
>> So, I guess it's more the way it is presented that you don't agree with (of
>> course provided that we are in a user space oriented solution)?
> 
> Yes.  On the kernel side we are essentially fine.
> 
>>> So this patchset whether it is setid or setting the id at creation time
>>> seems to be jumping the gun.  For some namespaces renames are valid and
>>> we can support them.  For other namespaces setting the id is a big no-no,
>>> and possibly even controlling the id at creation time is a problem (at
>>> least in the general sense).
>> I'm sorry but I'm pretty new in this domain, so I don't see what are the
>> namespaces where setting (or pre-setting) the id would be a problem?
> 
> pids to some extent as people use them in all kinds of files.  Being
> able to force the pid of another process could make a hard to trigger
> security hole with file permissions absolutely trivial to hit.

Since the intent of this mechanism is to allow ckpt/restart, it makes
sense to only allow this operation during restart. For example, in zap,
containers have a state, e.g. running, stopped, ckpt, restart, and this
is only possible in restart state; Furthermore, a container can only be
put in restart state at creation time, and only by root. Of course, you
should only trust that as much as you trust the root  :O

> 
> Changing the pid on a process would be even worse because then how
> could you send it signals.
> 
> I haven't looked close enough to be able to say in situation X it is a
> problem or in situation Y it is clearly not a problem.  I just know
> there is a lot that happens with ids and security so we need to tread
> lightly, and carefully.
> 
>>> Because if you can easily control the id
>>> you may be able to more easily exploit security holes.
>>>
>>> I'm not at all inclined to make it easy for userspace to rename or set
>>> the id of a new resource unless it already makes sense in that
>>> namespace.
>>>
>>> We need to limit anything in a checkpoint to user space visible
>>> state.
>> OK, but a tasks that belongs to nested pid namespaces is known from other tasks
>> as one of its "intermediate" pids, depending on the namespace level we are
>> considering. So we can consider these "intermediate pids" as user space visible,
>> can't we?
> 
> Yes.
> 
>> Given the following pid namespaces hierarchy:
>> PNS0 -> PNS1 -> PNS2 -> PNS3 -> PNS4
>> A task that belongs to PSN2 has 3 upid nrs:
>> UP0, UP1, and UP2
>> So:
>> . UP0 can be obtained if we do a "ps -ef" in pid ns #1 (PNS0)
>> . UP1 can be obtained if we do a "ps -ef" in pid ns #1 (PNS1)
>> . UP2 can be obtained if we do a "ps -ef" in pid ns #1 (PNS2)
>>
>> So UP[0-2] are user space visible (again, depending on "where" we are in pid ns
>> hierarchy, aren't they?
> 
> Totally.
> 
>> Sure, I completely agree with you! So may be the 1st thing to do would be to
>> decide which approach (user space vs kernel) should be adopted for c/r? Sorry if
>> what I'm saying is stupid, but imho a clear answer to this question would make
>> us all go the same direction ==> save time for further investigations.
> 
> Agreed, although it isn't quite that cut and dry.
> 
> The question is what granularity do we export the checkpoint restore
> functionality with, and do we manage to have it serve additional
> functions as well.
> 
> Because ultimately it is user space saying checkpoint and it is user
> space saying restore.  Although we need to be careful that there
> are not cheaper migration solutions that we are precluding.
> 
> Getting the user space interface correct is going to be the tricky
> and important.
> 
> 
> 
> My inclination is that the cost in migrating a container is the cost
> of moving the data between machines.  Which (not counting filesystems)
> would be the anonymous pages and the shared memory areas.
> 
> So if we have a checkpoint as a directory.
> We could have files for the data of each shared memory region.
> 
> We could have files for the data of each shared anonymous region
> shared between mm's.
> 
> We could have ELF core files with extra notes to describe the full
> state of each process.
> 
> We could have files describing each sysvipc object (shared memory,
> message queues, and semaphores).
> 
> Futexes?
> 
> I would suggest a directory for each different kind of file, making
> conflicts easier to avoid, and data types easier to predict.
> 
> 
> Once we have a least one valid checkpoint format the question becomes
> how do we get the checkpoint out of the kernel, and how do we get
> the checkpoint into the kernel.  And especially how do we allow for
> incremental data movement so live migration with minimal interruption
> of service is possible.
> 
> This means that we need to preload all of the large memory areas into
> shared memory areas or processes.  Everything else file descriptors
> and the like I expect will be sufficiently inexpesive that we can
> treat their recreation as instantaneous.
> 
> 
> If incremental migration of data was not so useful I would say just
> have a single syscall to dump everything, and another syscall to
> restore everything.
> 
> Since incremental migration is so useful.  My gut feel is
> checkpointfs or something similar where we can repeated read
> and write a checkpoint before we commit to starting it is useful.
> 
> I am welcome to other better suggestions from the experience of the
> people who have implemented this before.
> 
> 
> A very fine grained user space solution where user space createsn
> each object using the traditional APIs and restores them similarly
> seems harder to implement, harder to get right, and harder to
> maintain.  In part because it hides what we are actually trying to do.
> 
> Further a fine grained user space approach appears to offer no
> advantages when it comes to restoring a checkpoint and incrementally
> updating it so that the down time between machines is negligible.
> 
>>> My inclination is that create with a specified set of ids is the
>>> proper internal kernel API, so we don't have rework things later,
>>> because reworking things seems to be a lot more work.
>> Completely agree with you: the earlier in the object's life we do it, the best
>> it is ;-)
>>
>>> How we want to
>>> export this to user space is another matter.
>>>
>>> One suggestion is to a have /proc or a proc like filesystem that
>>> allows us to read, create, and populate files to see all of the
>>> application state.
>> And that would be quite useful for debugging purpose, as you were saying
>> earlier.
> 
> Yes.  Part of why I suggested it in that way.  Debugging and
> checkpoint/restart have a lot in common.  If we can take advantage of
> that it would be cool.
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

  parent reply	other threads:[~2008-03-13 20:01 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-03-10 13:50 [RFC][PATCH 0/4] Object creation with a specified id Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 1/4] Provide a new procfs interface to set next ipc id Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 2/4] Provide a new procfs interface to set next upid nr(s) Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 3/4] IPC: use the target ID specified in procfs Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 4/4] PID: " Nadia.Derbey-6ktuUTfB/bM
     [not found]   ` <20080310135209.769712000-6ktuUTfB/bM@public.gmane.org>
2008-03-11 12:04     ` Pavel Emelyanov
     [not found]       ` <47D67557.7080506-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-11 15:28         ` Nadia Derbey
     [not found]           ` <47D6A52D.6030701-6ktuUTfB/bM@public.gmane.org>
2008-03-11 15:37             ` Pavel Emelyanov
     [not found]               ` <47D6A741.9080708-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-11 15:55                 ` Nadia Derbey
2008-03-11 16:47                 ` Serge E. Hallyn
     [not found]                   ` <20080311164725.GA12918-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-03-11 16:55                     ` Pavel Emelyanov
     [not found]                       ` <47D6B990.4080400-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-11 17:53                         ` Serge E. Hallyn
     [not found]                           ` <20080311175328.GA14171-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-03-12 19:58                             ` Eric W. Biederman
     [not found]                               ` <m1zlt3yarj.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2008-03-13 10:41                                 ` Nadia Derbey
     [not found]                                   ` <47D904E4.4000208-6ktuUTfB/bM@public.gmane.org>
2008-03-13 17:40                                     ` Eric W. Biederman
     [not found]                                       ` <m1r6eewmj2.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2008-03-13 19:06                                         ` Serge E. Hallyn
2008-03-13 20:01                                         ` Oren Laadan [this message]
     [not found]                                           ` <47D987FE.9040909-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-13 23:12                                             ` Eric W. Biederman
     [not found]                                               ` <m163vqw74n.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2008-03-13 23:24                                                 ` Oren Laadan
     [not found] ` <20080310135054.312992000-6ktuUTfB/bM@public.gmane.org>
2008-03-13 23:16   ` [RFC][PATCH 0/4] Object creation with a specified id Oren Laadan
     [not found]     ` <47D9B5B7.6060803-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-14  6:21       ` Nadia Derbey
     [not found]         ` <47DA195B.8070704-6ktuUTfB/bM@public.gmane.org>
2008-03-14 15:50           ` Oren Laadan
     [not found]             ` <47DA9EB5.8040704-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-14 15:56               ` Pavel Emelyanov
     [not found]                 ` <47DAA041.9090009-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-14 16:02                   ` Oren Laadan
     [not found]                     ` <47DAA1A6.6010509-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-14 16:08                       ` Pavel Emelyanov
2008-03-14 16:11                   ` Nadia Derbey
2008-03-14 16:11               ` Nadia Derbey
     [not found]                 ` <47DAA3AA.4050906-6ktuUTfB/bM@public.gmane.org>
2008-03-14 16:45                   ` Oren Laadan
     [not found]                     ` <47DAABAB.7000706-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-16  3:43                       ` Serge E. Hallyn
     [not found]                         ` <20080316034320.GA19793-6s5zFf/epYLPQpwDFJZrxFMas7LaWZ9n@public.gmane.org>
2008-03-16 19:08                           ` Oren Laadan
     [not found]                             ` <47DD703C.4030809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-17 14:44                               ` Serge E. Hallyn
  -- strict thread matches above, loose matches on Subject: below --
2008-03-28  9:53 [RFC][PATCH 0/4] Object creation with a pre-defined id (v2) Nadia.Derbey-6ktuUTfB/bM
2008-03-28  9:53 ` [RFC][PATCH 4/4] PID: use the target ID specified in procfs Nadia.Derbey-6ktuUTfB/bM
     [not found] <20080404145129.637145000@bull.net>
2008-04-04 14:51 ` Nadia.Derbey-6ktuUTfB/bM

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47D987FE.9040909@cs.columbia.edu \
    --to=orenl-eqauephvms7envbuuze7ea@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org \
    --cc=xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox