All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
To: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Subject: Re: [RFC][PATCH 4/4] PID: use the target ID specified in procfs
Date: Thu, 13 Mar 2008 16:01:02 -0400	[thread overview]
Message-ID: <47D987FE.9040909@cs.columbia.edu> (raw)
In-Reply-To: <m1r6eewmj2.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>



Eric W. Biederman wrote:
> Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org> writes:
> 
>> Eric W. Biederman wrote:
>>> "Nadia Derbey" <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org> writes:
>>>
>>>> A couple of weeks ago, a discussion has started after Pierre's proposal for
>>>> a new syscall to change an ipc id (see thread
>>>> http://lkml.org/lkml/2008/1/29/209).
>>>>
>>>>
>>>> Oren's suggestion was to force an object's id during its creation, rather
>>>> than 1. create it, 2. change its id.
>>>>
>>>> So here is an implementation of what Oren has suggested.
>>>>
>>>> 2 new files are defined under /proc/self:
>>>>  . next_ipcid --> next id to use for ipc object creation
>>>>  . next_pids --> next upid nr(s) to use for next task to be forked
>>>>                  (see patch #2 for more details).
>>>>
>>>> When one of these files (or both of them) is filled, a structure pointed to
>>>> by the calling task struct is filled with these ids.
>>>>
>>>> Then, when the object is created, the id(s) present in that structure are
>>>> used, instead of the default ones.
>>>> A couple of weeks ago, a discussion has started after Pierre's proposal for
>>>> a new syscall to change an ipc id (see thread
>>>> http://lkml.org/lkml/2008/1/29/209).
>>>>
>>>>
>>>> Oren's suggestion was to force an object's id during its creation, rather
>>>> than 1. create it, 2. change its id.
>>>>
>>>> So here is an implementation of what Oren has suggested.
>>>>
>>>> 2 new files are defined under /proc/self:
>>>>  . next_ipcid --> next id to use for ipc object creation
>>>>  . next_pids --> next upid nr(s) to use for next task to be forked
>>>>                  (see patch #2 for more details).
>>>>
>>>> When one of these files (or both of them) is filled, a structure pointed to
>>>> by the calling task struct is filled with these ids.
>>>>
>>>> Then, when the object is created, the id(s) present in that structure are
>>>> used, instead of the default ones.
>>>
>>> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>>>
>>>
>>>> Right the alloc_pidmap() changes will probably be pretty much the same
>>>> no matter how we do set_it(), so it's worth discussing.  But I'm
>>>> particularly curious to see what opinions are on the sys_setid().
>>>
>>> A couple of comments.  With respect to alloc_pidmap we already have
>>> the necessary controls (a minimum and a maximum) in place for the
>>> allocation.  So except for double checking that those controls are exported
>>> in /proc/sys we don't necessarily need to do anything, special.
>>>
>>> Just play games with the minimum pid value before you fork.
>> Excellent idea! It's trus that properly setting things, we can make the loops
>> executed only once in case we want to use a predefined id.
>>
>>> Second at least to get the memory map correct we need additional
>>> kernel support.
>>>
>>>
>>> Third to actually get the values out it appears we need additional kernel
>>> support as well.  From my limited playing with these things at least
>>> parts of the code were easier to implement in the kernel.  The hoops
>>> you have to go to restore a single process (without threads) are
>>> absolutely horrendous in user space.
> 
> I was thinking in particular of the mm setup above.
> 
>> Ok, but if we have a process that belongs to nested namepsaces, we have no other
>> choice than provide its upid nrs hierarchy, right?
> 
> At least a part of it.  You only have to provide as much of the nested
> pid hiearchy as you are restoring from your checkpoint.
> 
>> So, I guess it's more the way it is presented that you don't agree with (of
>> course provided that we are in a user space oriented solution)?
> 
> Yes.  On the kernel side we are essentially fine.
> 
>>> So this patchset whether it is setid or setting the id at creation time
>>> seems to be jumping the gun.  For some namespaces renames are valid and
>>> we can support them.  For other namespaces setting the id is a big no-no,
>>> and possibly even controlling the id at creation time is a problem (at
>>> least in the general sense).
>> I'm sorry but I'm pretty new in this domain, so I don't see what are the
>> namespaces where setting (or pre-setting) the id would be a problem?
> 
> pids to some extent as people use them in all kinds of files.  Being
> able to force the pid of another process could make a hard to trigger
> security hole with file permissions absolutely trivial to hit.

Since the intent of this mechanism is to allow ckpt/restart, it makes
sense to only allow this operation during restart. For example, in zap,
containers have a state, e.g. running, stopped, ckpt, restart, and this
is only possible in restart state; Furthermore, a container can only be
put in restart state at creation time, and only by root. Of course, you
should only trust that as much as you trust the root  :O

> 
> Changing the pid on a process would be even worse because then how
> could you send it signals.
> 
> I haven't looked close enough to be able to say in situation X it is a
> problem or in situation Y it is clearly not a problem.  I just know
> there is a lot that happens with ids and security so we need to tread
> lightly, and carefully.
> 
>>> Because if you can easily control the id
>>> you may be able to more easily exploit security holes.
>>>
>>> I'm not at all inclined to make it easy for userspace to rename or set
>>> the id of a new resource unless it already makes sense in that
>>> namespace.
>>>
>>> We need to limit anything in a checkpoint to user space visible
>>> state.
>> OK, but a tasks that belongs to nested pid namespaces is known from other tasks
>> as one of its "intermediate" pids, depending on the namespace level we are
>> considering. So we can consider these "intermediate pids" as user space visible,
>> can't we?
> 
> Yes.
> 
>> Given the following pid namespaces hierarchy:
>> PNS0 -> PNS1 -> PNS2 -> PNS3 -> PNS4
>> A task that belongs to PSN2 has 3 upid nrs:
>> UP0, UP1, and UP2
>> So:
>> . UP0 can be obtained if we do a "ps -ef" in pid ns #1 (PNS0)
>> . UP1 can be obtained if we do a "ps -ef" in pid ns #1 (PNS1)
>> . UP2 can be obtained if we do a "ps -ef" in pid ns #1 (PNS2)
>>
>> So UP[0-2] are user space visible (again, depending on "where" we are in pid ns
>> hierarchy, aren't they?
> 
> Totally.
> 
>> Sure, I completely agree with you! So may be the 1st thing to do would be to
>> decide which approach (user space vs kernel) should be adopted for c/r? Sorry if
>> what I'm saying is stupid, but imho a clear answer to this question would make
>> us all go the same direction ==> save time for further investigations.
> 
> Agreed, although it isn't quite that cut and dry.
> 
> The question is what granularity do we export the checkpoint restore
> functionality with, and do we manage to have it serve additional
> functions as well.
> 
> Because ultimately it is user space saying checkpoint and it is user
> space saying restore.  Although we need to be careful that there
> are not cheaper migration solutions that we are precluding.
> 
> Getting the user space interface correct is going to be the tricky
> and important.
> 
> 
> 
> My inclination is that the cost in migrating a container is the cost
> of moving the data between machines.  Which (not counting filesystems)
> would be the anonymous pages and the shared memory areas.
> 
> So if we have a checkpoint as a directory.
> We could have files for the data of each shared memory region.
> 
> We could have files for the data of each shared anonymous region
> shared between mm's.
> 
> We could have ELF core files with extra notes to describe the full
> state of each process.
> 
> We could have files describing each sysvipc object (shared memory,
> message queues, and semaphores).
> 
> Futexes?
> 
> I would suggest a directory for each different kind of file, making
> conflicts easier to avoid, and data types easier to predict.
> 
> 
> Once we have a least one valid checkpoint format the question becomes
> how do we get the checkpoint out of the kernel, and how do we get
> the checkpoint into the kernel.  And especially how do we allow for
> incremental data movement so live migration with minimal interruption
> of service is possible.
> 
> This means that we need to preload all of the large memory areas into
> shared memory areas or processes.  Everything else file descriptors
> and the like I expect will be sufficiently inexpesive that we can
> treat their recreation as instantaneous.
> 
> 
> If incremental migration of data was not so useful I would say just
> have a single syscall to dump everything, and another syscall to
> restore everything.
> 
> Since incremental migration is so useful.  My gut feel is
> checkpointfs or something similar where we can repeated read
> and write a checkpoint before we commit to starting it is useful.
> 
> I am welcome to other better suggestions from the experience of the
> people who have implemented this before.
> 
> 
> A very fine grained user space solution where user space createsn
> each object using the traditional APIs and restores them similarly
> seems harder to implement, harder to get right, and harder to
> maintain.  In part because it hides what we are actually trying to do.
> 
> Further a fine grained user space approach appears to offer no
> advantages when it comes to restoring a checkpoint and incrementally
> updating it so that the down time between machines is negligible.
> 
>>> My inclination is that create with a specified set of ids is the
>>> proper internal kernel API, so we don't have rework things later,
>>> because reworking things seems to be a lot more work.
>> Completely agree with you: the earlier in the object's life we do it, the best
>> it is ;-)
>>
>>> How we want to
>>> export this to user space is another matter.
>>>
>>> One suggestion is to a have /proc or a proc like filesystem that
>>> allows us to read, create, and populate files to see all of the
>>> application state.
>> And that would be quite useful for debugging purpose, as you were saying
>> earlier.
> 
> Yes.  Part of why I suggested it in that way.  Debugging and
> checkpoint/restart have a lot in common.  If we can take advantage of
> that it would be cool.
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

  parent reply	other threads:[~2008-03-13 20:01 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-03-10 13:50 [RFC][PATCH 0/4] Object creation with a specified id Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 1/4] Provide a new procfs interface to set next ipc id Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 2/4] Provide a new procfs interface to set next upid nr(s) Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 3/4] IPC: use the target ID specified in procfs Nadia.Derbey-6ktuUTfB/bM
2008-03-10 13:50 ` [RFC][PATCH 4/4] PID: " Nadia.Derbey-6ktuUTfB/bM
     [not found]   ` <20080310135209.769712000-6ktuUTfB/bM@public.gmane.org>
2008-03-11 12:04     ` Pavel Emelyanov
     [not found]       ` <47D67557.7080506-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-11 15:28         ` Nadia Derbey
     [not found]           ` <47D6A52D.6030701-6ktuUTfB/bM@public.gmane.org>
2008-03-11 15:37             ` Pavel Emelyanov
     [not found]               ` <47D6A741.9080708-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-11 15:55                 ` Nadia Derbey
2008-03-11 16:47                 ` Serge E. Hallyn
     [not found]                   ` <20080311164725.GA12918-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-03-11 16:55                     ` Pavel Emelyanov
     [not found]                       ` <47D6B990.4080400-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-11 17:53                         ` Serge E. Hallyn
     [not found]                           ` <20080311175328.GA14171-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-03-12 19:58                             ` Eric W. Biederman
     [not found]                               ` <m1zlt3yarj.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2008-03-13 10:41                                 ` Nadia Derbey
     [not found]                                   ` <47D904E4.4000208-6ktuUTfB/bM@public.gmane.org>
2008-03-13 17:40                                     ` Eric W. Biederman
     [not found]                                       ` <m1r6eewmj2.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2008-03-13 19:06                                         ` Serge E. Hallyn
2008-03-13 20:01                                         ` Oren Laadan [this message]
     [not found]                                           ` <47D987FE.9040909-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-13 23:12                                             ` Eric W. Biederman
     [not found]                                               ` <m163vqw74n.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2008-03-13 23:24                                                 ` Oren Laadan
     [not found] ` <20080310135054.312992000-6ktuUTfB/bM@public.gmane.org>
2008-03-13 23:16   ` [RFC][PATCH 0/4] Object creation with a specified id Oren Laadan
     [not found]     ` <47D9B5B7.6060803-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-14  6:21       ` Nadia Derbey
     [not found]         ` <47DA195B.8070704-6ktuUTfB/bM@public.gmane.org>
2008-03-14 15:50           ` Oren Laadan
     [not found]             ` <47DA9EB5.8040704-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-14 15:56               ` Pavel Emelyanov
     [not found]                 ` <47DAA041.9090009-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-03-14 16:02                   ` Oren Laadan
     [not found]                     ` <47DAA1A6.6010509-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-14 16:08                       ` Pavel Emelyanov
2008-03-14 16:11                   ` Nadia Derbey
2008-03-14 16:11               ` Nadia Derbey
     [not found]                 ` <47DAA3AA.4050906-6ktuUTfB/bM@public.gmane.org>
2008-03-14 16:45                   ` Oren Laadan
     [not found]                     ` <47DAABAB.7000706-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-16  3:43                       ` Serge E. Hallyn
     [not found]                         ` <20080316034320.GA19793-6s5zFf/epYLPQpwDFJZrxFMas7LaWZ9n@public.gmane.org>
2008-03-16 19:08                           ` Oren Laadan
     [not found]                             ` <47DD703C.4030809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-03-17 14:44                               ` Serge E. Hallyn
  -- strict thread matches above, loose matches on Subject: below --
2008-03-28  9:53 [RFC][PATCH 0/4] Object creation with a pre-defined id (v2) Nadia.Derbey-6ktuUTfB/bM
2008-03-28  9:53 ` [RFC][PATCH 4/4] PID: use the target ID specified in procfs Nadia.Derbey-6ktuUTfB/bM
2008-04-04 14:51 [RFC][PATCH 0/4] Object creation with a specified id Nadia.Derbey
2008-04-04 14:51 ` [RFC][PATCH 4/4] PID: use the target ID specified in procfs Nadia.Derbey-6ktuUTfB/bM
2008-04-04 14:51 ` Nadia.Derbey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47D987FE.9040909@cs.columbia.edu \
    --to=orenl-eqauephvms7envbuuze7ea@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org \
    --cc=xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.