From: Serge Hallyn <serge.hallyn@ubuntu.com>
To: Gao feng <gaofeng@cn.fujitsu.com>
Cc: James Bottomley <jbottomley@parallels.com>,
	"systemd-devel@lists.freedesktop.org"
	<systemd-devel@lists.freedesktop.org>,
	"libvir-list@redhat.com" <libvir-list@redhat.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Linux Containers <containers@lists.linux-foundation.org>,
	Kay Sievers <kay@vrfy.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	"lxc-devel@lists.sourceforge.net"
	<lxc-devel@lists.sourceforge.net>,
	"davem@davemloft.net" <davem@davemloft.net>
Subject: Re: [systemd-devel] [PATCH] netns: unix: only allow to find out unix socket in same net namespace
Date: Mon, 26 Aug 2013 08:53:37 -0500
Message-ID: <20130826135337.GA9030@tp>
In-Reply-To: <521ACCEF.4050101@cn.fujitsu.com>

Quoting Gao feng (gaofeng@cn.fujitsu.com):
> On 08/26/2013 11:19 AM, James Bottomley wrote:
> > On Mon, 2013-08-26 at 09:06 +0800, Gao feng wrote:
> >> On 08/26/2013 02:16 AM, James Bottomley wrote:
> >>> On Sun, 2013-08-25 at 19:37 +0200, Kay Sievers wrote:
> >>>> On Sun, Aug 25, 2013 at 7:16 PM, James Bottomley
> >>>> <jbottomley@parallels.com> wrote:
> >>>>> On Wed, 2013-08-21 at 11:51 +0200, Kay Sievers wrote:
> >>>>>> On Wed, Aug 21, 2013 at 9:22 AM, Gao feng <gaofeng@cn.fujitsu.com> wrote:
> >>>>>>> On 08/21/2013 03:06 PM, Eric W. Biederman wrote:
> >>>>>>
> >>>>>>>> I suspect libvirt should simply not share /run or any other normally
> >>>>>>>> writable directory with the host.  Sharing /run /var/run or even /tmp
> >>>>>>>> seems extremely dubious if you want some kind of containment, and
> >>>>>>>> without strange things spilling through.
> >>>>>>
> >>>>>> Right, /run or /var cannot be shared. It's not only about sockets,
> >>>>>> many other things will also go really wrong that way.
> >>>>>
> >>>>> This is very narrow thinking about what a container might be and will
> >>>>> cause trouble as people start to create novel uses for containers in the
> >>>>> cloud if you try to impose this on our current infrastructure.
> >>>>>
> >>>>> One of the cgroup only container uses we see at Parallels (so no
> >>>>> separate filesystem and no net namespaces) is pure apache load balancer
> >>>>> type shared hosting.  In this scenario, base apache is effectively
> >>>>> brought up in the host environment, but then spawned instances are
> >>>>> resource limited using cgroups according to what the customer has paid.
> >>>>> Obviously all apache instances are sharing /var and /run from the host
> >>>>> (mostly for logging and pid storage and static pages).  The reason some
> >>>>> hosters do this is that it allows much higher density simple web serving
> >>>>> (either static pages from quota limited chroots or dynamic pages limited
> >>>>> by database space constraints) because each "instance" shares so much
> >>>>> from the host.  The service is obviously much more basic than giving
> >>>>> each customer a container running apache, but it's much easier for the
> >>>>> hoster to administer and it serves the customer just as well for a large
> >>>>> cross section of use cases and for those it doesn't serve, the hoster
> >>>>> usually has separate container hosting (for a higher price, of course).
> >>>>
> >>>> The "container" we are talking about has its own init, and no, it cannot
> >>>> share /var or /run.
> >>>
> >>> This is what we would call an IaaS container: bringing up init and
> >>> effectively a new OS inside a container is the closest containers come
> >>> to being like hypervisors.  It's the most common use case of Parallels
> >>> containers in the field, so I'm certainly not telling you it's a bad
> >>> idea.
> >>>
> >>>> The stuff you talk about has nothing to do with that, it's not
> >>>> different from all services or a multi-instantiated service on the
> >>>> host sharing the same /run and /var.
> >>>
> >>> I gave you one example: a really simplistic one.  A more sophisticated
> >>> example is a PaaS or SaaS container where you bring the OS up in the
> >>> host but spawn a particular application into its own container (this is
> >>> essentially similar to what Docker does).  Often in this case, you do
> >>> add separate mount and network namespaces to make the application
> >>> isolated and migrateable with its own IP address.  The reason you share
> >>> init and most of the OS from the host is for elasticity and density,
> >>> which are fast becoming a holy grail type quest of cloud orchestration
> >>> systems: if you don't have to bring up the OS from init and you can just
> >>> start the application from a C/R image (orders of magnitude smaller than
> >>> a full system image) and slap on the necessary namespaces as you clone
> >>> it, you have something that comes online in milliseconds, which is a feat
> >>> no hypervisor based virtualisation can match.
> >>>
> >>> I'm not saying don't pursue the IaaS case, it's definitely useful ...
> >>> I'm just saying it would be a serious mistake to think that's the only
> >>> use case for containers and we certainly shouldn't adjust Linux to serve
> >>> only that use case.
> >>>
> >>
> >> Between the feature you describe above and the container-reboots-host
> >> bug, I prefer to fix the bug.
> > 
> > What bug?
> > 
> >> And this feature can still be achieved even if the container does not
> >> share the /run directory with the host by default: with libvirt, the
> >> user can set the container configuration to make the container share
> >> /run with the host.
> >>
> >> I would like to say that the reboot-from-container bug is more urgent
> >> and needs to be fixed.
> > 
> > Are you talking about the old bug where trying to reboot an lxc
> > container from within it would reboot the entire system? 
> 
> Yes, we are discussing this problem in this whole thread.
> 
> > If so, OpenVZ
> > has never suffered from that problem and I thought it was fixed
> > upstream.  I've not tested lxc tools, but the latest vzctl from the
> > openvz website will bring up a container on the vanilla 3.9 kernel
> > (provided you have USER_NS compiled in) and can also be used to reboot the
> > container, so I see no reason it wouldn't work for lxc as well.
> > 
> 
> I'm using libvirt lxc, not lxc-tools.
> Not all users enable the user namespace.  I trust these container management
> tools can have proper settings that prevent this reboot problem from occurring,
> but I don't think the reboot problem is impossible in every configuration.

On any recent kernel, the reboot syscall from inside a non-init pid-ns
will not reboot the host.  If you are managing to reboot the host from
within a non-init pid-ns, then you have a problem with how userspace is
set up.  The container is being allowed to ask init on the host to do
the reboot, i.e. by sharing the /dev/initctl inode with the host, or by
being in the same net namespace as upstart on the host.
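
As a rough illustration of how one might check for the two kinds of
sharing mentioned above, here is a minimal Python sketch.  The helper
names same_inode and shares_net_namespace are made up for this example.
It assumes a Linux /proc that exposes /proc/<pid>/ns/net and that the
caller is allowed to stat the paths involved.  Two paths refer to the
same object when their device and inode numbers match.

```python
import os


def same_inode(path_a, path_b):
    """Return True if both paths resolve to the same inode on the same device."""
    a = os.stat(path_a)  # os.stat follows symlinks
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def shares_net_namespace(pid_a, pid_b="self"):
    """Compare the net namespace of two processes via /proc/<pid>/ns/net.

    Assumption: both /proc entries exist and are readable by the caller.
    """
    return same_inode(f"/proc/{pid_a}/ns/net", f"/proc/{pid_b}/ns/net")
```

For example, shares_net_namespace(1) run inside a container whose /proc
shows the host's pid 1 would report whether the caller is in the same
net namespace as the host's init.  A /dev/initctl comparison would
need both the host's and the container's view of that path to be
reachable from one place, which depends on the setup.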

The fact that it's possible to create such containers is not a bug.

(On older kernels, you have to drop CAP_SYS_BOOT to prevent use of the
reboot system call, as all lxc-like programs did.)
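
To see whether a process could still regain CAP_SYS_BOOT, one option is
to look at the CapBnd line in /proc/<pid>/status.  The sketch below is
only an illustration, not how any particular container tool does it.
The function names are invented here, and the capability number 22 is
taken from <linux/capability.h>.

```python
CAP_SYS_BOOT = 22  # capability number from <linux/capability.h>


def cap_in_mask(mask_hex, cap):
    """Return True if bit `cap` is set in a hex capability mask string."""
    return bool((int(mask_hex, 16) >> cap) & 1)


def bounding_set_has(cap, status_path="/proc/self/status"):
    """Check the CapBnd line of a /proc/<pid>/status file for `cap`.

    Assumption: the file has a 'CapBnd:' line holding a hex mask, as on Linux.
    """
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapBnd:"):
                return cap_in_mask(line.split()[1], cap)
    raise RuntimeError("no CapBnd line in " + status_path)
```

Calling bounding_set_has(CAP_SYS_BOOT) from inside the container should
return False if the container tool dropped the capability from the
bounding set before starting the container's init.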

-serge
