From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751521AbdJOWkk (ORCPT ); Sun, 15 Oct 2017 18:40:40 -0400
Received: from out01.mta.xmission.com ([166.70.13.231]:52157 "EHLO
	out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751030AbdJOWkj (ORCPT );
	Sun, 15 Oct 2017 18:40:39 -0400
From: ebiederm@xmission.com (Eric W. Biederman)
To: Aleksa Sarai
Cc: Linux Containers , netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Christian Brauner , Evgeniy Polyakov ,
	dev , "cyphar\@cyphar.com \>\> Aleksa Sarai"
References:
Date: Sun, 15 Oct 2017 17:40:08 -0500
In-Reply-To: (Aleksa Sarai's message of "Sun, 15 Oct 2017 21:05:49 +1100")
Message-ID: <87r2u4nign.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: RFC: making cn_proc work in {pid,user} namespaces
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Aleksa Sarai writes:

> Hi all,
>
> At the moment, cn_proc is not usable by containers or container runtimes. In
> addition, all connectors have an odd relationship with init_net (for example,
> /proc/net/connectors only exists in init_net). There are two main use-cases that
> would be perfect for cn_proc, which is the reason for me pushing this:
>
> First, when adding a process to an existing container, in certain modes runc
> would like to know that process's exit code. But, when joining a PID namespace,
> it is advisable[1] to always double-fork after doing the setns(2) to reparent
> the joining process to the init of the container (this causes the SIGCHLD to be
> received by the container init).
> It would also be useful to be able to monitor
> the exit code of the init process in a container without being its parent. At
> the moment, cn_proc doesn't allow unprivileged users to use it (making it a
> problem for user namespaces and "rootless containers"). In addition, it also
> doesn't allow nested containers to use it, because it requires the process to be
> in init_pid. As a result, runc cannot use cn_proc and relies on SIGCHLD (which
> can only be used if we don't double-fork, or keep around a long-running process
> which is something that runc also cannot do).

As far as I know there are no technical issues that require a daemonizing
double fork when injecting a process into a pid namespace. A fork is
required because the pid is changing and that requires another process.

Monitoring and acting on the monitored state without keeping around a
single process to do the monitoring does not make sense to me. So I am
just going to ignore that.

So I don't think fixing cn_proc for this issue makes sense.

> Secondly, there are/were some init systems that rely on cn_proc to manage
> service state. From a "it would be neat" perspective, I think it would be quite
> nice if such init systems could be used inside containers. But that requires
> cn_proc to be able to be used as an unprivileged user and in a pid namespace
> other than init_pid.

Any pointers to these init systems?

In general I agree. Given how much work it takes to go through a
subsystem and ensure that it is safe for non-root users I am happy to
see the work done, but I am not volunteering for the work when I have
as many tasks as I have on my plate right now.

> The /proc/net/connectors thing is quite easily resolved (just make it the
> connector driver perdev and make some small changes to make sure the interfaces
> stay sane inside of a container's network namespace).
> I'm sure that we'll
> probably have to make some changes to the registration API, so that a connector
> can specify whether they want to be visible to non-init_net
> namespaces.
>
> However, the cn_proc problem is a bit harder to resolve nicely and there are
> quite a few interface questions that would need to be agreed upon. The basic
> idea would be that a process can only get cn_proc events if it has
> ptrace_may_access rights over said process (effectively a forced filter -- which
> would ideally be done send-side but it looks like it might have to be done
> receive-side). This should resolve possible concerns about an unprivileged
> process being able to inspect (fairly granular) information about the host. And
> obviously the pids, uids, and gids would all be translated according to the
> receiving process's user namespaces (if it cannot be translated then the message
> is not received). I guess that the translation would be done in the same way as
> SCM_CREDENTIALS (and cgroup.procs files), which is that it's done on the receive
> side not the send side.

Hmm. We have several of these things such as BSD process accounting
which appear to be working fine.

The basic logic winds up being:

    for_each_receiver:
        compose_msg in receiver's namespace
        send_msg.

The tricky bit in my mind is dealing with receivers because of the
connection with the network namespace.

SCM_CREDENTIALS is an unfortunate case, that really should not be
followed as a model. The major challenge there is not knowing the
receiving socket, or the receiver. If I had been smarter when I coded
that originally I would have forced everything into the namespace of the
opener of the receiving socket. I may have to revisit that one again
someday and see if there are improvements that can be made.

> My reason for sending this email rather than just writing the patch is to see
> whether anyone has any solid NACKs against the use-case or whether there is some
> fundamental issue that I'm not seeing.
> If nobody objects, I'll be happy to work
> on this.

If you want a non-crazy (with respect to namespace involvement) model
please look at kernel/acct.c:acct_process()

If there are use cases that people still care about that use the proc
connector and want to run in a container it seems sensible to dig in and
sort things out. I think I have been hoping it is little enough used we
won't have to mess with making it work in namespaces.

> [1]: https://lwn.net/Articles/532748/

Eric