linux-api.vger.kernel.org archive mirror
From: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
To: Chen Fan <chen.fan.fnst-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Konstantin Khlebnikov
	<khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Serge Hallyn
	<serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>,
	Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	"Eric W. Biederman"
	<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	Linus Torvalds
	<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Subject: Re: Re: [PATCH RFC] pidns: introduce syscall getvpid
Date: Mon, 21 Sep 2015 09:22:22 -0500
Message-ID: <20150921142222.GA24005@mail.hallyn.com>
In-Reply-To: <55FF7043.5020701-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

On Mon, Sep 21, 2015 at 10:49:39AM +0800, Chen Fan wrote:
> 
> On 09/17/2015 12:31 AM, Serge E. Hallyn wrote:
> >On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote:
> >>"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >>>On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote:
> >>>>On 15.09.2015 20:41, Serge Hallyn wrote:
> >>>>>Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org):
> >>>>>>On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
> >>>>>>>On 15.09.2015 17:27, Eric W. Biederman wrote:
> >>>>>>>>Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes:
> >>>>>>>>
> >>>>>>>>>pid_t getvpid(pid_t pid, pid_t source, pid_t target);
> >>>>>>>>>
> >>>>>>>>>This syscall converts pid from one pid-ns into pid in another pid-ns:
> >>>>>>>>>it takes @pid in namespace of @source task (zero for current) and
> >>>>>>>>>returns related pid in namespace of @target task (zero for current too).
> >>>>>>>>>If pid is unreachable from target pid-ns then it returns zero.
> >>>>>>>>This interface as presented is inherently racy.  It would be better
> >>>>>>>>if source and target were file descriptors referring to the namespaces
> >>>>>>>>you wish to translate between.
> >>>>>>>Yep, it's racy. As well as any operation with non-child pids.
> >>>>>>>With file descriptors for source/target result will be racy anyway.
> >>>>>>>
> >>>>>>>>>Such conversion is required for interaction between processes from
> >>>>>>>>>different pid-namespaces. For example when system service talks with
> >>>>>>>>>client from isolated container via socket about task in container:
> >>>>>>>>Sockets are already supported.  At least the metadata of sockets is.
> >>>>>>>>
> >>>>>>>>Maybe we need this but I am not convinced of its utility.
> >>>>>>>>
> >>>>>>>>What are you trying to do that motivates this?
> >>>>>>>I'm working on a hierarchical container management system which
> >>>>>>>allows creating and controlling nested sub-containers from containers
> >>>>>>>( https://github.com/yandex/porto ). The main server runs on the host and
> >>>>>>>has to interact with all levels of nested namespaces. This syscall
> >>>>>>>makes some operations much easier: the server must remember only the pid
> >>>>>>>in the host pid namespace and convert it into the right vpid on demand.
> >>>>>>Note that as Eric said earlier, sending a PID inside a ucred through a
> >>>>>>unix socket will have the pid translated.
> >>>>>>
> >>>>>>So while your solution certainly should be faster, you can already achieve
> >>>>>>what you want today by doing:
> >>>>>>
> >>>>>>== Translate PID in container to PID in host
> >>>>>>  - open a socket
> >>>>>>  - setns to container's pidns
> >>>>>>  - send ucred from that container containing the requested container PID
> >>>>>>  - host sees the host PID
> >>>>>>
> >>>>>>== Translate PID on host to PID in container
> >>>>>>  - open a socket
> >>>>>>  - setns to container's pidns
> >>>>>>  - send ucred from the host containing the request host PID
> >>>>>>    (send will fail if the host PID isn't part of that container)
> >>>>>>  - container sees the container PID
> >>>>>In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns
> >>>>>we now also have 'NSpid' etc in /proc/$$/status.
> >>>>>
> >>>>As I see this works perfectly only for converting host pid into virtual.
> >>>>
> >>>>Backward conversion is troublesome: we have to scan all pids in host
> >>>>procfs and somehow filter tasks from container and its sub-pid-ns.
> >>>>Or I am missing something trivial?
> >>>Ah, no that doesn't help with this.
> >>>
> >>>What Stéphane describes is what I've done in several projects.
> >>>Getting it right is however actually quite tricky.  I'm not
> >>>convinced it's at the level of "since you can do (sweep hands)
> >>>all this, we don't need a simple syscall to do it."
> >>>
> >>>So I'd encourage you to resend using namespace inode fds for
> >>>source and target as Eric suggested.  We still may decide that
> >>>the syscall isn't needed, but it's a trivial change to your
> >>>patch and removes that race.  And I'm not convinced it's not
> >>>needed.
> >>At this point my primary concern is that a pattern that would need to be
> >>converting to and from pids quickly is potentially fundamentally racy to
> >>the point of broken.
> >The cgmanager GetTasks and GetTasksRecursive, and reading of the
> >lxcfs cgroup /tasks files, require converting every pid from the
> >cgmanager's namespace to the reading task's namespace.
> >
> >>Especially with unix domain sockets passing and converting pids in a way
> >>that covers the common case.
> >>
> >>I am clearly missing some nuance of this use case.
> >lxcfs and cgmanager are imo proof that we *can* do without the new
> >syscall.  However, the git history will show that there are some
> >complications, and the system load when a few systemds are starting
> >will show that it does take a performance toll on the host at some
> >point.  Still as I say it's doable.  The syscall implementation was
> >very simple, though.
> 
> Yes, a previous thread discussed implementing this as a syscall or via procfs:
> http://www.gossamer-threads.com/lists/linux/kernel/1971723?search_string=chen%20hanxiao;#1971723
> 
> but the procfs implementation seems complicated; the original discussion is at:
> http://www.gossamer-threads.com/lists/linux/kernel/2076440?search_string=chen%20hanxiao;#2076440

So please implement it, as Eric suggested, using the ns inode fds
instead of racy pid_t hints for namespaces.
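
For reference, the socket-based translation Stéphane describes above relies
on SCM_CREDENTIALS ancillary data on a unix socket: the pid carried in the
ucred is reported to the receiver in the receiver's pid namespace. Below is
a minimal Python sketch of only the send/receive half. It is not taken from
lxcfs or cgmanager; the helper names (enable_cred_passing, send_pid,
recv_translated_pid) are made up for illustration, and the setns() step into
the container's pidns is left out, so as written both ends share one
namespace and the pid comes back unchanged.

```python
import os
import socket
import struct

# struct ucred { pid_t pid; uid_t uid; gid_t gid; } as described in unix(7).
UCRED = struct.Struct("iII")


def enable_cred_passing(sock):
    """Ask the kernel to deliver SCM_CREDENTIALS on this receiving socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)


def send_pid(sock, pid):
    """Send one byte with a ucred naming `pid` attached.

    An unprivileged sender may only name its own pid; naming another pid
    needs CAP_SYS_ADMIN, and per the discussion above the send fails if the
    pid is not visible where it is being translated.
    """
    creds = UCRED.pack(pid, os.getuid(), os.getgid())
    sock.sendmsg([b"p"], [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, creds)])


def recv_translated_pid(sock):
    """Return the pid from the received ucred, as seen in our pid namespace."""
    _data, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            pid, _uid, _gid = UCRED.unpack(cdata[:UCRED.size])
            return pid
    return None  # no credentials were attached


if __name__ == "__main__":
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    enable_cred_passing(receiver)
    send_pid(sender, os.getpid())
    print("receiver sees pid", recv_translated_pid(receiver))
```

In a real translator one end of the socket pair would be held by a helper
that has joined the target pid namespace, which is where most of the
complications come from.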

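The NSpid field mentioned in the quoted discussion can be read as sketched
below. This is only an illustration: parse_nspid and read_status are
hypothetical helper names, and the sample status text is invented rather
than captured from a real system. As Konstantin points out, this covers only
the host-to-container direction; going the other way still means scanning
the host procfs.

```python
def parse_nspid(status_text):
    """Return the NSpid values from /proc/<pid>/status text.

    The leftmost value is the pid in the namespace of the procfs mount,
    followed by the pid in each successively nested pid namespace.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field absent, e.g. the kernel predates NSpid


def read_status(pid):
    """Read the status file of `pid` from the local procfs."""
    with open("/proc/%d/status" % pid) as status_file:
        return status_file.read()


def innermost_pid(status_text):
    """Return the pid in the task's own (innermost) namespace, or None."""
    pids = parse_nspid(status_text)
    return pids[-1] if pids else None


# Invented sample: host pid 12345 is pid 7 in a container and pid 1 in a
# nested sub-container.
SAMPLE_STATUS = "Name:\tinit\nPid:\t12345\nNSpid:\t12345\t7\t1\n"

if __name__ == "__main__":
    print(parse_nspid(SAMPLE_STATUS))
```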