Message-ID: <55FF7043.5020701@cn.fujitsu.com>
Date: Mon, 21 Sep 2015 10:49:39 +0800
From: Chen Fan
To: "Serge E. Hallyn", "Eric W. Biederman"
CC: Konstantin Khlebnikov, Serge Hallyn, Stéphane Graber, Oleg Nesterov,
    Andrew Morton, Linus Torvalds
Subject: Re: Re: [PATCH RFC] pidns: introduce syscall getvpid
References: <20150915120924.14818.49490.stgit@buzz>
    <87h9mvg3kw.fsf@x220.int.ebiederm.org>
    <55F832D2.1070605@yandex-team.ru>
    <20150915151729.GA144242@dakara>
    <20150915174143.GE4699@ubuntumail>
    <55F91C3D.1040209@yandex-team.ru>
    <20150916143939.GA32226@mail.hallyn.com>
    <87twquzag1.fsf@x220.int.ebiederm.org>
    <20150916163123.GA1039@mail.hallyn.com>
In-Reply-To: <20150916163123.GA1039@mail.hallyn.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/17/2015 12:31 AM, Serge E. Hallyn wrote:
> On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn" writes:
>>
>>> On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote:
>>>> On 15.09.2015 20:41, Serge Hallyn wrote:
>>>>> Quoting Stéphane Graber (stgraber@ubuntu.com):
>>>>>> On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
>>>>>>> On 15.09.2015 17:27, Eric W. Biederman wrote:
>>>>>>>> Konstantin Khlebnikov writes:
>>>>>>>>
>>>>>>>>> pid_t getvpid(pid_t pid, pid_t source, pid_t target);
>>>>>>>>>
>>>>>>>>> This syscall converts pid from one pid-ns into pid in another pid-ns:
>>>>>>>>> it takes @pid in namespace of @source task (zero for current) and
>>>>>>>>> returns related pid in namespace of @target task (zero for current too).
>>>>>>>>> If pid is unreachable from target pid-ns then it returns zero.
>>>>>>>> This interface as presented is inherently racy.  It would be better
>>>>>>>> if source and target were file descriptors referring to the namespaces
>>>>>>>> you wish to translate between.
>>>>>>> Yep, it's racy, as is any operation on non-child pids.
>>>>>>> Even with file descriptors for source/target, the result will be
>>>>>>> racy anyway.
>>>>>>>
>>>>>>>>> Such conversion is required for interaction between processes from
>>>>>>>>> different pid-namespaces. For example when system service talks with
>>>>>>>>> client from isolated container via socket about task in container:
>>>>>>>> Sockets are already supported.  At least the metadata of sockets is.
>>>>>>>>
>>>>>>>> Maybe we need this but I am not convinced of its utility.
>>>>>>>>
>>>>>>>> What are you trying to do that motivates this?
>>>>>>> I'm working on a hierarchical container management system which
>>>>>>> allows creating and controlling nested sub-containers from containers
>>>>>>> ( https://github.com/yandex/porto ). The main server runs on the host
>>>>>>> and has to interact with all levels of nested namespaces. This syscall
>>>>>>> makes some operations much easier: the server needs to remember only
>>>>>>> the pid in the host pid namespace and convert it into the right vpid
>>>>>>> on demand.
>>>>>> Note that as Eric said earlier, sending a PID inside a ucred through a
>>>>>> unix socket will have the pid translated.
>>>>>>
>>>>>> So while your solution certainly should be faster, you can already achieve
>>>>>> what you want today by doing:
>>>>>>
>>>>>> == Translate PID in container to PID in host
>>>>>>  - open a socket
>>>>>>  - setns to container's pidns
>>>>>>  - send ucred from that container containing the requested container PID
>>>>>>  - host sees the host PID
>>>>>>
>>>>>> == Translate PID on host to PID in container
>>>>>>  - open a socket
>>>>>>  - setns to container's pidns
>>>>>>  - send ucred from the host containing the requested host PID
>>>>>>    (send will fail if the host PID isn't part of that container)
>>>>>>  - container sees the container PID
>>>>> In addition, since commit
>>>>> e4bc332451 : /proc/PID/status: show all sets of pid according to ns
>>>>> we now also have 'NSpid' etc in /proc/$$/status.
>>>>>
>>>> As I see it, this works perfectly only for converting a host pid into
>>>> a virtual one.
>>>>
>>>> Backward conversion is troublesome: we have to scan all pids in host
>>>> procfs and somehow filter tasks from the container and its sub-pid-ns.
>>>> Or am I missing something trivial?
>>> Ah, no that doesn't help with this.
>>>
>>> What Stéphane describes is what I've done in several projects.
>>> Getting it right is however actually quite tricky.  I'm not
>>> convinced it's at the level of "since you can do (sweep hands)
>>> all this, we don't need a simple syscall to do it."
>>>
>>> So I'd encourage you to resend using namespace inode fds for
>>> source and target as Eric suggested.  We still may decide that
>>> the syscall isn't needed, but it's a trivial change to your
>>> patch and removes that race.  And I'm not convinced it's not
>>> needed.
>> At this point my primary concern is that a pattern that would need to be
>> converting to and from pids quickly is potentially fundamentally racy to
>> the point of being broken.
> The cgmanager GetTasks and GetTasksRecursive, and reading of the
> lxcfs cgroup /tasks files, require converting every pid from the
> cgmanager's namespace to the reading task's namespace.
>
>> Especially with unix domain sockets passing and converting pids in a way
>> that covers the common case.
>>
>> I am clearly missing some nuance of this use case.
> lxcfs and cgmanager are imo proof that we *can* do without the new
> syscall.  However, the git history will show that there are some
> complications, and the system load when a few systemds are starting
> will show that it does take a performance toll on the host at some
> point.  Still as I say it's doable.  The syscall implementation was
> very simple, though.

Yes, an earlier thread discussed implementing this either as a syscall
or through procfs:
http://www.gossamer-threads.com/lists/linux/kernel/1971723?search_string=chen%20hanxiao;#1971723

but the procfs implementation seems complicated. The original discussion
is at:
http://www.gossamer-threads.com/lists/linux/kernel/2076440?search_string=chen%20hanxiao;#2076440

Thanks,
Chen

>
> -serge
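
As an aside on the 'NSpid' field mentioned earlier in the thread, here is a
minimal userspace sketch of reading it. It is not taken from any patch
discussed here. It assumes a kernel that includes commit e4bc332451 and a
procfs mounted from the outer pid namespace; the helper names parse_nspid
and read_nspid and the sample status text are made up for illustration.
The leftmost value is the pid as seen from the pid namespace of the procfs
mount, followed by the pid in each successively nested namespace, so this
only covers the host-to-container direction that Konstantin describes.

```python
def parse_nspid(status_text):
    """Return the NSpid values found in the text of /proc/<pid>/status.

    The list is ordered from the outermost namespace (that of the procfs
    mount) to the task's own, innermost namespace. An empty list means
    the field is absent, e.g. on a kernel without the NSpid patch.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            # Fields are whitespace-separated; the first token is the label.
            return [int(field) for field in line.split()[1:]]
    return []


def read_nspid(pid="self"):
    """Read and parse NSpid for a pid as named in the local procfs."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_nspid(status_file.read())


# Hypothetical status excerpt for a task nested two pid namespaces deep.
sample = "Name:\tinit\nPid:\t4321\nNSpid:\t4321\t17\t1\n"
host_pid, *inner_pids = parse_nspid(sample)
```

With the sample above, host_pid is the pid seen by the host and
inner_pids[-1] is the pid the task sees for itself. The reverse direction
still needs the ucred approach or a scan of host procfs, which is the gap
the proposed syscall is meant to close.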