From: Andrew Morton <akpm@linux-foundation.org>
To: Pavel Emelyanov <xemul@parallels.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Tejun Heo <tj@kernel.org>, Oleg Nesterov <oleg@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] pidns: Make pid_max per namespace
Date: Thu, 10 Mar 2011 02:44:59 -0800
Message-ID: <20110310024459.e54fd99e.akpm@linux-foundation.org>
In-Reply-To: <4D78A2B8.1030605@parallels.com>

On Thu, 10 Mar 2011 13:06:48 +0300 Pavel Emelyanov <xemul@parallels.com> wrote:

> On 03/10/2011 12:50 PM, Andrew Morton wrote:
> > On Thu, 10 Mar 2011 12:35:32 +0300 Pavel Emelyanov <xemul@parallels.com> wrote:
> > 
> >> On 03/08/2011 02:58 AM, Andrew Morton wrote:
> >>> On Thu, 03 Mar 2011 11:39:17 +0300
> >>> Pavel Emelyanov <xemul@parallels.com> wrote:
> >>>
> >>>> Rationale:
> >>>>
> >>>> On x86_64 with big ram people running containers set pid_max on host to 
> >>>> large values to be able to launch more containers. At the same time 
> >>>> containers running 32-bit software experience problems with large pids - ps
> >>>> calls readdir/stat on proc entries and inode's i_ino happen to be too big 
> >>>> for the 32-bit API.
> >>>>
> >>>> Thus, the ability to limit the pid value inside container is required.
> >>>>
> >>>
> >>> This is a behavioural change, isn't it?  In current kernels a write to
> >>> /proc/sys/kernel/pid_max will change the max pid on all processes. 
> >>> After this change, that write will only affect processes in the current
> >>> namespace.  Anyone who was depending on the old behaviour might run
> >>> into problems?
> >>
> >> Hardly. If the behavior of some two apps depends on its synchronous change,
> >> these two might want to run in the same pid namespace.
> > 
> > I don't understand your answer.  What is this "synchronous change" of which
> > you speak?  Does your "might want to run" suggestion mean that userspace 
> > changes would be required for this operation to again work correctly?
> 
> Your concern was about "anyone who was depending on the old behaviour", where
> the old behavior meant "a write to sys.pid_max will change the max pid on all
> processes".
> 
> I wanted to say, that if someone changes pid_max and expects someone else to
> act differently after this, then these two should live in the same pid namespace.

So it's a non-back-compatible change to the userspace interface.  uh-oh.

> IOW, if X raises the pid_max, then all the processes X sees in its pid namespace
> *may* have pids up to this value. All the other process, that are not visible
> in X's pid space will have other values, but X doesn't see them, so why should
> we care?

Current userspace has no *need* to be running in the same pidns to
alter the pid_max of some processes.  So the chances are good that
some existing userspace relies on this.

Silly example:

	if (fork() == 0) {
		/* child */
		create_new_pidns();
		start_doing_stuff();
	} else {
		/* parent */
		increase_pid_max();
	}

Another example would be logging into a system as root in the init_ns
and modifying /proc/sys/kernel/pid_max by hand.
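
To make the by-hand case concrete, here is a minimal C sketch of what
such a write amounts to from userspace.  The helper names
write_pid_max() and read_pid_max() are hypothetical, and the path is a
parameter only so the sketch can be tried against an ordinary file; on
a real system it would be /proc/sys/kernel/pid_max and the write would
need root.

```c
#include <stdio.h>

/* Hypothetical helper: write a new limit to a pid_max-style file.
 * Returns 0 on success, -1 on failure. */
static int write_pid_max(const char *path, long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Hypothetical helper: read the current limit back; -1 on failure. */
static long read_pid_max(const char *path)
{
	long value = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

Nothing in such a caller says which pid namespace it expects the write
to apply to, which is why a silent change in scope is hard to audit.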

I don't have a clue how much code is out there using pid namespaces,
nor how much of that code alters the default pid_max.  Hard to tell.


The proposed interface is a bit weird and hacky anyway, isn't it?  We
have a single pseudo-file in a well-known location -
/proc/sys/kernel/pid_max.  One would expect alteration of that
system-wide file to have system-wide effects, only that isn't the case.
Instead a modification to the system-wide file has local-pidns-only
effects.  It would be much more logical to have a per-pidns pid_max
pseudo file.

And if we do that, we then need to work out what to do with writes to
/proc/sys/kernel/pid_max.  Remember the user expects those writes to
alter all processes on the machine!  I guess it would be acceptable to
permit that to continue to happen - a write to /proc/sys/kernel/pid_max
will overwrite all the per-pidns pid_max settings.
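
A minimal sketch of those semantics, as a userspace model rather than
kernel code.  The struct and function names are hypothetical and exist
only for illustration: a write to the per-pidns file touches one
namespace, while a write to /proc/sys/kernel/pid_max overwrites every
per-pidns setting.

```c
#include <stddef.h>

/* Hypothetical model: each pid namespace carries its own limit. */
struct pidns_model {
	int pid_max;
};

/* Write to the (hypothetical) per-pidns file: one namespace only. */
static void set_pidns_pid_max(struct pidns_model *ns, int value)
{
	ns->pid_max = value;
}

/* Write to /proc/sys/kernel/pid_max: stays system-wide by
 * overwriting all the per-pidns settings. */
static void set_global_pid_max(struct pidns_model *all, size_t count,
			       int value)
{
	size_t i;

	for (i = 0; i < count; i++)
		all[i].pid_max = value;
}
```

In this model a container can lower its own limit for 32-bit software,
and an administrator writing the global file still sees the old
everything-changes behaviour.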

