public inbox for linux-kernel@vger.kernel.org
From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <haveblue@us.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Pavel Emelyanov <xemul@openvz.org>,
	Ulrich Drepper <drepper@redhat.com>,
	linux-kernel@vger.kernel.org,
	"Dinakar Guniguntala [imap]" <dino@in.ibm.com>,
	Sripathi Kodi <sripathik@in.ibm.com>
Subject: Re: [patch] PID namespace design bug, workaround
Date: Sat, 3 Nov 2007 21:12:51 +0100
Message-ID: <20071103201251.GB26366@elte.hu>
In-Reply-To: <alpine.LFD.0.999.0711021038480.3342@woody.linux-foundation.org>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 2 Nov 2007, Dave Hansen wrote:
> > 
> > There are certainly more of these, but here is one. In the futex 
> > userspace address, we install the current pid's vnr into a userspace 
> > address.
> 
> Now, realistically, why not just say "you can't use these things 
> across namespaces"? Does anybody really care? After all, somebody who 
> screws this up only screws himself, not anybody else.

i see two main categories of problems:

- one problem is that this condition is 'invisible'. If two namespaces 
  happen to access the same robust futex (say a yum update from two 
  PID namespaces sharing the same read-mostly filesystem) there's silent
  breakage and data corruption due to PID overlap (the same PID value
  can name different tasks in different namespaces). The other
  namespaces have no such problems. I think the "don't do that" answer is
  lame because most apps _will_ work across PID namespaces, since 
  things like fcntl based locking do work. And there's no valid
  technical excuse why futexes shouldn't work: it's all controlled by the
  same native kernel, there's no untrusted network separating the nodes,
  etc.

- the second problem is that via this we isolate an important category
  of syscalls from cross-namespace use, perhaps forever. Pick just about
  any other kernel resource and it can be shared between namespaces. But
  not futexes - which happen to be the most scalable locking primitive,
  and people will almost certainly want to use them across namespaces. A
  completely new breed of futexes has to be introduced and trickled
  through userspace and all the architectures to make it work again
  across namespaces. Who will do that work? Generally the people who
  introduce a new concept are the ones who should do that. But in this
  case they are apparently not interested in making it generic enough
  (they are concentrating on their 'isolate it all' aspect) so
  nobody else will do it and we are stuck with an incomplete concept.
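
A minimal sketch of the PID overlap from the first item, assuming the
usual robust/PI futex word layout from <linux/futex.h> (owner TID in the
low 30 bits, two flag bits on top). The helper names and the TID value 42
are hypothetical and only for illustration; this is plain userspace
arithmetic, not kernel code and not part of the patch under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of a robust/PI futex word, mirroring <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical helper: extract the owner TID recorded in the word. */
static uint32_t owner_tid(uint32_t word)
{
	return word & FUTEX_TID_MASK;
}

/* Hypothetical helper: the value a locker stores, using the TID it
 * sees inside its own PID namespace (its "vnr"). */
static uint32_t lock_word_for(uint32_t tid)
{
	return tid & FUTEX_TID_MASK;
}

/* Hypothetical helper: would a task with my_tid conclude that it is
 * the owner named in the word? */
static int looks_self_owned(uint32_t word, uint32_t my_tid)
{
	return owner_tid(word) == (my_tid & FUTEX_TID_MASK);
}

/* Two unrelated tasks that both happen to be TID 42, each in its own
 * PID namespace, sharing one futex word through a shared file. */
static int demo_pid_overlap(void)
{
	uint32_t tid_in_ns_a = 42;
	uint32_t tid_in_ns_b = 42;

	/* The task in namespace A takes the lock; a waiter queues up. */
	uint32_t word = lock_word_for(tid_in_ns_a) | FUTEX_WAITERS;

	assert(owner_tid(word) == 42);

	/* Nothing in the word says which namespace the 42 came from, so
	 * the task in namespace B is indistinguishable from the owner. */
	return looks_self_owned(word, tid_in_ns_b);
}
```

Here demo_pid_overlap() returns 1: the word carries no namespace
information, so any owner comparison based on it alone cannot tell the
two tasks apart.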

The answer of user-space/apps is predictable: they'll gravitate towards 
the path of least resistance, and that will be "don't use futexes". PID 
namespaces basically single out an important API category and use the 
natural pressure of the other 300 syscalls and tens of thousands of apps 
against this category. Linux is basically used against itself. The 
counter-force is relatively weak and there's no solution available _at 
all_ presently so it's not even the fight of patches against each other, 
it's the sheer lack of a feature which has an obvious end-result.

We've already got way too many incomplete concepts and APIs in the 
kernel. Maybe i'm over-worrying, but i fear we'll end up like we did with 
capabilities or sendfile - code merged too soon and never completed for 
many years - perhaps never completed at all. VMS and WNT did those 
things a bit better i think - their API frameworks were/are pervasive 
and complete, even in the corner cases.

Whether it's the right approach to force reasonable perfection of 
frameworks like this from the get go is another question - but in 
practice even for relatively popular new APIs like epoll we see way 
too slow movement towards the 'completion of the API', and that hinders 
adoption of new APIs very much. (With splice being a notable exception - 
there the central concept was so strong that it quickly pushed itself to 
total completion - combined with a capable maintainer of the API.) But 
it's not that easy for futexes, and here we are putting another roadblock 
in their path.

	Ingo

Thread overview: 41+ messages
2007-11-01 14:43 [patch] PID namespace design bug, workaround Ingo Molnar
2007-11-01 14:51 ` Pavel Emelyanov
2007-11-01 14:56   ` Peter Zijlstra
2007-11-01 15:06     ` Pavel Emelyanov
2007-11-01 15:17       ` Ingo Molnar
2007-11-01 15:30         ` Pavel Emelyanov
2007-11-01 14:56   ` Ulrich Drepper
2007-11-01 15:05     ` Pavel Emelyanov
2007-11-02  0:21       ` Ulrich Drepper
2007-11-02  7:55         ` Pavel Emelyanov
2007-11-02  8:04           ` Andrew Morton
2007-11-02  8:14             ` Pavel Emelyanov
2007-11-02 14:05               ` Ulrich Drepper
2007-11-02 14:21                 ` Pavel Emelyanov
2007-11-02 15:34                   ` Ulrich Drepper
2007-11-02 15:58                     ` Pavel Emelyanov
2007-11-02 21:39                       ` Theodore Tso
2007-11-03  4:34                       ` Ulrich Drepper
2007-11-06  7:49                         ` Pavel Emelyanov
2007-11-03 20:01                   ` sukadev
2007-11-04  7:17                     ` Eric W. Biederman
2007-11-02 17:30             ` Dave Hansen
2007-11-02 17:39               ` Linus Torvalds
2007-11-03  4:02                 ` Nicholas Miell
2007-11-03 20:12                 ` Ingo Molnar [this message]
2007-11-03 22:40                   ` Linus Torvalds
2007-11-03 23:55                     ` Arjan van de Ven
2007-11-04  0:21                       ` david
2007-11-04 10:38                     ` [patch] PID namespaces Ingo Molnar
2007-11-04 20:12                       ` Dave Hansen
2007-11-05 14:47                       ` Denys Vlasenko
2007-11-20 22:53                   ` Futexes and network filesystems Eric W. Biederman
2007-11-21  6:16                     ` Kyle Moffett
2007-11-21  6:30                       ` Eric W. Biederman
2007-11-01 16:12     ` [patch] PID namespace design bug, workaround Dave Hansen
2007-11-01 14:53 ` Ulrich Drepper
2007-11-01 15:05   ` Ingo Molnar
2007-11-01 18:57     ` Theodore Tso
2007-11-01 19:53       ` Ingo Molnar
2007-11-02  0:23         ` Ulrich Drepper
2007-11-01 15:02 ` Pavel Emelyanov
