netdev.vger.kernel.org archive mirror
From: Serge Hallyn <serge.hallyn@ubuntu.com>
To: Gao feng <gaofeng@cn.fujitsu.com>
Cc: James Bottomley <jbottomley@parallels.com>,
	"systemd-devel@lists.freedesktop.org"
	<systemd-devel@lists.freedesktop.org>,
	"libvir-list@redhat.com" <libvir-list@redhat.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Linux Containers <containers@lists.linux-foundation.org>,
	Kay Sievers <kay@vrfy.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	"lxc-devel@lists.sourceforge.net"
	<lxc-devel@lists.sourceforge.net>,
	"davem@davemloft.net" <davem@davemloft.net>
Subject: Re: [systemd-devel] [PATCH] netns: unix: only allow to find out unix socket in same net namespace
Date: Mon, 26 Aug 2013 08:53:37 -0500
Message-ID: <20130826135337.GA9030@tp>
In-Reply-To: <521ACCEF.4050101@cn.fujitsu.com>

Quoting Gao feng (gaofeng@cn.fujitsu.com):
> On 08/26/2013 11:19 AM, James Bottomley wrote:
> > On Mon, 2013-08-26 at 09:06 +0800, Gao feng wrote:
> >> On 08/26/2013 02:16 AM, James Bottomley wrote:
> >>> On Sun, 2013-08-25 at 19:37 +0200, Kay Sievers wrote:
> >>>> On Sun, Aug 25, 2013 at 7:16 PM, James Bottomley
> >>>> <jbottomley@parallels.com> wrote:
> >>>>> On Wed, 2013-08-21 at 11:51 +0200, Kay Sievers wrote:
> >>>>>> On Wed, Aug 21, 2013 at 9:22 AM, Gao feng <gaofeng@cn.fujitsu.com> wrote:
> >>>>>>> On 08/21/2013 03:06 PM, Eric W. Biederman wrote:
> >>>>>>
> >>>>>>>> I suspect libvirt should simply not share /run or any other normally
> >>>>>>>> writable directory with the host.  Sharing /run /var/run or even /tmp
> >>>>>>>> seems extremely dubious if you want some kind of containment, and
> >>>>>>>> without strange things spilling through.
> >>>>>>
> >>>>>> Right, /run or /var cannot be shared. It's not only about sockets,
> >>>>>> many other things will also go really wrong that way.
> >>>>>
> >>>>> This is very narrow thinking about what a container might be and will
> >>>>> cause trouble as people start to create novel uses for containers in the
> >>>>> cloud if you try to impose this on our current infrastructure.
> >>>>>
> >>>>> One of the cgroup only container uses we see at Parallels (so no
> >>>>> separate filesystem and no net namespaces) is pure apache load balancer
> >>>>> type shared hosting.  In this scenario, base apache is effectively
> >>>>> brought up in the host environment, but then spawned instances are
> >>>>> resource limited using cgroups according to what the customer has paid.
> >>>>> Obviously all apache instances are sharing /var and /run from the host
> >>>>> (mostly for logging and pid storage and static pages).  The reason some
> >>>>> hosters do this is that it allows much higher density simple web serving
> >>>>> (either static pages from quota limited chroots or dynamic pages limited
> >>>>> by database space constraints) because each "instance" shares so much
> >>>>> from the host.  The service is obviously much more basic than giving
> >>>>> each customer a container running apache, but it's much easier for the
> >>>>> hoster to administer and it serves the customer just as well for a large
> >>>>> cross section of use cases and for those it doesn't serve, the hoster
> >>>>> usually has separate container hosting (for a higher price, of course).
> >>>>
> >>>> The "container" as we talk about has its own init, and no, it cannot
> >>>> share /var or /run.
> >>>
> >>> This is what we would call an IaaS container: bringing up init and
> >>> effectively a new OS inside a container is the closest containers come
> >>> to being like hypervisors.  It's the most common use case of Parallels
> >>> containers in the field, so I'm certainly not telling you it's a bad
> >>> idea.
> >>>
> >>>> The stuff you talk about has nothing to do with that, it's not
> >>>> different from all services or a multi-instantiated service on the
> >>>> host sharing the same /run and /var.
> >>>
> >>> I gave you one example: a really simplistic one.  A more sophisticated
> >>> example is a PaaS or SaaS container where you bring the OS up in the
> >>> host but spawn a particular application into its own container (this is
> >>> essentially similar to what Docker does).  Often in this case, you do
> >>> add separate mount and network namespaces to make the application
> >>> isolated and migratable with its own IP address.  The reason you share
> >>> init and most of the OS from the host is for elasticity and density,
> >>> which are fast becoming a holy grail type quest of cloud orchestration
> >>> systems: if you don't have to bring up the OS from init and you can just
> >>> start the application from a C/R image (orders of magnitude smaller than
> >>> a full system image) and slap on the necessary namespaces as you clone
> >>> it, you have something that comes online in milliseconds which is a feat
> >>> no hypervisor based virtualisation can match.
> >>>
> >>> I'm not saying don't pursue the IaaS case, it's definitely useful ...
> >>> I'm just saying it would be a serious mistake to think that's the only
> >>> use case for containers and we certainly shouldn't adjust Linux to serve
> >>> only that use case.
> >>>
> >>
> >> Weighing the feature you describe above against the
> >> container-reboots-host bug, I prefer to fix the bug.
> > 
> > What bug?
> > 
> >> And this feature can still be achieved even if the container does not
> >> share the /run directory with the host by default: with libvirt, the
> >> user can set the container configuration to make the container share
> >> the /run directory with the host.
> >>
> >> I would like to say, the reboot-from-container bug is more urgent and
> >> needs to be fixed.
> > 
> > Are you talking about the old bug where trying to reboot an lxc
> > container from within it would reboot the entire system? 
> 
> Yes, we are discussing this problem in this whole thread.
> 
> >  If so, OpenVZ
> > has never suffered from that problem and I thought it was fixed
> > upstream.  I've not tested lxc tools, but the latest vzctl from the
> > openvz website will bring up a container on the vanilla 3.9 kernel
> > (provided you have USER_NS compiled in) and can also be used to reboot the
> > container, so I see no reason it wouldn't work for lxc as well.
> > 
> 
> I'm using libvirt lxc, not lxc-tools.
> Not all users enable user namespaces.  I trust these container management
> tools can have the right settings to prevent this reboot problem from
> occurring, but I don't believe the problem is ruled out in every
> configuration.

On any recent kernel, the reboot syscall from inside a non-init pid-ns
will not reboot the host.  If from within a non-init pid-ns you are
managing to reboot the host, then you have a problem with how userspace
is set up.  The container is being allowed to request init on the host
to do the reboot - i.e. by sharing the /dev/initctl inode with the host,
or by being in the same net namespace as upstart on the host.

The fact that it's possible to create such containers is not a bug.

(On older kernels, you have to drop CAP_SYS_BOOT to prevent use of the
reboot system call, as all lxc-like programs did.)

-serge
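
The older-kernel workaround mentioned above (dropping CAP_SYS_BOOT) can be
sketched roughly as follows.  This is only an illustration, not code from
lxc or libvirt: it assumes Linux with glibc reachable through ctypes, the
helper names cap_in_bounding_set() and drop_sys_boot() are made up for this
example, and the numeric constants are copied by hand from <linux/prctl.h>
and <linux/capability.h>.  A container manager would do the equivalent
before exec'ing the container's init.

```python
import ctypes
import errno
import os

# Values taken from <linux/prctl.h> and <linux/capability.h>.
PR_CAPBSET_READ = 23
PR_CAPBSET_DROP = 24
CAP_SYS_BOOT = 22

# Load the already-linked C library; use_errno lets us read errno afterwards.
_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2):
    """Call prctl(2) with unsigned long arguments, as the kernel expects."""
    return _libc.prctl(
        ctypes.c_int(option),
        ctypes.c_ulong(arg2),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )


def cap_in_bounding_set(cap):
    """Return True if `cap` is in the calling thread's bounding set."""
    ret = _prctl(PR_CAPBSET_READ, cap)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret == 1


def drop_sys_boot():
    """Try to remove CAP_SYS_BOOT from the bounding set.

    Returns True on success and False if the caller lacks CAP_SETPCAP
    (the kernel reports EPERM in that case).  Any other error is raised.
    """
    if _prctl(PR_CAPBSET_DROP, CAP_SYS_BOOT) == 0:
        return True
    err = ctypes.get_errno()
    if err == errno.EPERM:
        return False
    raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    print("CAP_SYS_BOOT in bounding set:", cap_in_bounding_set(CAP_SYS_BOOT))
    print("drop succeeded:", drop_sys_boot())
    print("CAP_SYS_BOOT in bounding set:", cap_in_bounding_set(CAP_SYS_BOOT))
```

Once the capability is gone from the bounding set, processes started
afterwards cannot regain it through exec, so reboot(2) from inside the
container fails regardless of what the container's userspace does.  It does
not address the userspace paths described in the message (a shared
/dev/initctl or a shared net namespace with the host's init).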
