[GIT PULL] Namespace file descriptors for 2.6.40

netdev.vger.kernel.org archive mirror

* [GIT PULL] Namespace file descriptors for 2.6.40
@ 2011-05-23 21:05 Eric W. Biederman
  2011-05-25 21:05 ` C Anthony Risinger
  0 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2011-05-23 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Linux Containers, netdev, James Bottomley,
	Geert Uytterhoeven


Please pull the namespace file descriptor git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd.git

Because other syscall work has happened in other trees, there
are conflicts on alpha and m68k.

For alpha, all that is needed is to increment the syscall number
in my tree and add my syscall to the end of the list.

For m68k, please just delete all of the syscall entries that the conflict
adds to arch/m68k/kernel/entry_mm.S.  The m68k tree has consolidated
everything in arch/m68k/kernel/syscalltable.S.


This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
/proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
process at the time those files are opened, and can be bind mounted to
keep the specified namespace alive without a process.
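
As a rough illustration of the bind-mount usage described above, here is
a minimal userspace sketch in C.  The helper name pin_namespace and any
target path passed to it are hypothetical, chosen only for this example;
the sketch assumes the target already exists as a regular file and that
the caller is allowed to mount.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind mount /proc/<pid>/ns/<ns_name> onto `target`
 * so that the namespace stays alive even after its last process exits.
 * `target` must already exist as a regular file.
 * Returns 0 on success and -1 on failure (errno is set by mount(2),
 * except when the path does not fit in the local buffer).
 */
int pin_namespace(pid_t pid, const char *ns_name, const char *target)
{
    char source[64];
    int n = snprintf(source, sizeof(source), "/proc/%ld/ns/%s",
                     (long)pid, ns_name);

    if (n < 0 || (size_t)n >= sizeof(source))
        return -1;  /* namespace name too long for the buffer */

    /* MS_BIND ignores the filesystem type and data arguments. */
    return mount(source, target, NULL, MS_BIND, NULL);
}
```

A caller might use pin_namespace(getpid(), "net", some_existing_file)
and later umount(2) the target to drop the reference.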

This tree adds the setns system call that can be used to change the
specified namespace of a process to the namespace referred to by a file
descriptor passed to the system call.
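
A minimal sketch of how userspace might combine the proc files with
setns, assuming a C library that provides a setns() wrapper in <sched.h>
(older libraries may need syscall(2) instead).  The helper name
enter_namespace is hypothetical and exists only for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move the caller into the <ns_name> namespace
 * ("net", "ipc" or "uts") of process `pid`.
 * Returns 0 on success and -1 on failure.
 */
int enter_namespace(pid_t pid, const char *ns_name)
{
    char path[64];
    int fd, rc, saved_errno, n;

    n = snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, ns_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;          /* e.g. ENOENT for an unknown namespace name */

    rc = setns(fd, 0);      /* 0: accept whatever namespace type fd names */
    saved_errno = errno;
    close(fd);              /* the fd is not needed after the switch */
    errno = saved_errno;    /* report the setns() error, not close()'s */
    return rc;
}
```

The same open-then-setns sequence should also work on a bind-mounted
copy of one of these files, since setns only looks at the descriptor.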

This tree adds a new rtnetlink attribute that allows for moving a
network device into a network namespace specified by a file descriptor.
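
For illustration only, the request for such a move could be laid out
roughly as below.  This is a sketch under assumptions: it uses the
IFLA_NET_NS_FD attribute from <linux/if_link.h>, the struct and function
names (move_req, build_move_request) are hypothetical, and the step of
sending the message over a NETLINK_ROUTE socket and reading the ack is
left out.

```c
#include <linux/if_link.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical fixed-size buffer: header, link info, one int attribute. */
struct move_req {
    struct nlmsghdr  nh;
    struct ifinfomsg ifi;
    char             attrs[RTA_SPACE(sizeof(int))];
};

/*
 * Fill `req` with an RTM_NEWLINK message asking the kernel to move the
 * device with index `ifindex` into the network namespace that the open
 * file descriptor `ns_fd` refers to.
 */
void build_move_request(struct move_req *req, int ifindex, int ns_fd)
{
    struct rtattr *rta;

    memset(req, 0, sizeof(*req));
    req->nh.nlmsg_len   = NLMSG_LENGTH(sizeof(req->ifi));
    req->nh.nlmsg_type  = RTM_NEWLINK;
    req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    req->ifi.ifi_family = AF_UNSPEC;
    req->ifi.ifi_index  = ifindex;

    /* Append the attribute right after the aligned fixed part. */
    rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nh.nlmsg_len));
    rta->rta_type = IFLA_NET_NS_FD;
    rta->rta_len  = RTA_LENGTH(sizeof(int));
    memcpy(RTA_DATA(rta), &ns_fd, sizeof(int));

    req->nh.nlmsg_len = NLMSG_ALIGN(req->nh.nlmsg_len) + rta->rta_len;
}
```

Here ns_fd would typically come from opening /proc/<pid>/ns/net, which
is what lets a device be handed to a namespace without naming a pid.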

Support for the other namespaces is planned but is not ready for 2.6.40.

These changes dramatically simplify what a userspace process has to do
to keep a namespace alive, and to execute system calls in it.

The shortlog:

Stephen Rothwell (1):
      net: fix get_net_ns_by_fd for !CONFIG_NET_NS

Eric W. Biederman (11):
      ns: proc files for namespace naming policy.
      ns: Introduce the setns syscall
      ns proc: Add support for the network namespace.
      ns proc: Add support for the uts namespace
      ns proc: Add support for the ipc namespace
      net: Allow setting the network namespace by fd
      Merge commit '2e7bad5f34b5beed47542490c760ed26574e38ba' into HEAD
      Merge commit '7143b7d41218d4fc2ea33e6056c73609527ae687' into HEAD
      ns: Wire up the setns system call
      ns: Declare sys_setns in syscalls.h
      ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.

The diffstat:

 arch/alpha/include/asm/unistd.h        |    3 +-
 arch/alpha/kernel/systbls.S            |    1 +
 arch/arm/include/asm/unistd.h          |    1 +
 arch/arm/kernel/calls.S                |    1 +
 arch/avr32/include/asm/unistd.h        |    3 +-
 arch/avr32/kernel/syscall_table.S      |    1 +
 arch/blackfin/include/asm/unistd.h     |    3 +-
 arch/blackfin/mach-common/entry.S      |    1 +
 arch/cris/arch-v10/kernel/entry.S      |    1 +
 arch/cris/arch-v32/kernel/entry.S      |    1 +
 arch/cris/include/asm/unistd.h         |    3 +-
 arch/frv/include/asm/unistd.h          |    3 +-
 arch/frv/kernel/entry.S                |    1 +
 arch/h8300/include/asm/unistd.h        |    3 +-
 arch/h8300/kernel/syscalls.S           |    1 +
 arch/ia64/include/asm/unistd.h         |    3 +-
 arch/ia64/kernel/entry.S               |    1 +
 arch/m32r/include/asm/unistd.h         |    3 +-
 arch/m32r/kernel/syscall_table.S       |    1 +
 arch/m68k/include/asm/unistd.h         |    3 +-
 arch/m68k/kernel/syscalltable.S        |    1 +
 arch/microblaze/include/asm/unistd.h   |    3 +-
 arch/microblaze/kernel/syscall_table.S |    1 +
 arch/mips/include/asm/unistd.h         |   15 ++-
 arch/mips/kernel/scall32-o32.S         |    1 +
 arch/mips/kernel/scall64-64.S          |    1 +
 arch/mips/kernel/scall64-n32.S         |    1 +
 arch/mips/kernel/scall64-o32.S         |    1 +
 arch/mn10300/include/asm/unistd.h      |    3 +-
 arch/mn10300/kernel/entry.S            |    1 +
 arch/parisc/include/asm/unistd.h       |    4 +-
 arch/parisc/kernel/syscall_table.S     |    1 +
 arch/powerpc/include/asm/systbl.h      |    1 +
 arch/powerpc/include/asm/unistd.h      |    3 +-
 arch/s390/include/asm/unistd.h         |    3 +-
 arch/s390/kernel/syscalls.S            |    1 +
 arch/sh/include/asm/unistd_32.h        |    3 +-
 arch/sh/include/asm/unistd_64.h        |    3 +-
 arch/sh/kernel/syscalls_32.S           |    1 +
 arch/sh/kernel/syscalls_64.S           |    1 +
 arch/sparc/include/asm/unistd.h        |    3 +-
 arch/sparc/kernel/systbls_32.S         |    2 +-
 arch/sparc/kernel/systbls_64.S         |    4 +-
 arch/x86/ia32/ia32entry.S              |    1 +
 arch/x86/include/asm/unistd_32.h       |    3 +-
 arch/x86/include/asm/unistd_64.h       |    2 +
 arch/x86/kernel/syscall_table_32.S     |    1 +
 arch/xtensa/include/asm/unistd.h       |    4 +-
 fs/proc/Makefile                       |    1 +
 fs/proc/base.c                         |   20 ++--
 fs/proc/inode.c                        |    7 +
 fs/proc/internal.h                     |   18 +++
 fs/proc/namespaces.c                   |  198 ++++++++++++++++++++++++++++++++
 include/asm-generic/unistd.h           |    4 +-
 include/linux/if_link.h                |    1 +
 include/linux/proc_fs.h                |   21 ++++
 include/linux/syscalls.h               |    1 +
 include/net/net_namespace.h            |    1 +
 ipc/namespace.c                        |   37 ++++++
 kernel/nsproxy.c                       |   42 +++++++
 kernel/utsname.c                       |   39 ++++++
 net/core/net_namespace.c               |   65 +++++++++++
 net/core/rtnetlink.c                   |    5 +-
 63 files changed, 525 insertions(+), 42 deletions(-)

Thanks,
Eric

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-23 21:05 [GIT PULL] Namespace file descriptors for 2.6.40 Eric W. Biederman
@ 2011-05-25 21:05 ` C Anthony Risinger
  2011-05-25 21:38   ` Serge E. Hallyn
  0 siblings, 1 reply; 23+ messages in thread
From: C Anthony Risinger @ 2011-05-25 21:05 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, netdev, linux-kernel

On Mon, May 23, 2011 at 4:05 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
> /proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
> process at the time those files are opened, and can be bind mounted to
> keep the specified namespace alive without a process.
>
> This tree adds the setns system call that can be used to change the
> specified namespace of a process to the namespace referred to by a file
> descriptor passed to the system call.

i just have a quick question regarding these, apologies if wrong place
to respond -- i trimmed to lists only.

if i understand correctly, mount namespaces (for example), allow one
to build such constructs as "private /tmp" and similar that even
`root` cannot access ... and there are many reasons `root` does not
deserve to completely know/interact with user processes (FUSE makes a
good example ... just because i [user] have SSH access to a machine,
why should `root`?)

would these /proc additions break such guarantees?  IOW, would it now
become possible for `root` to inject stuff into my private namespaces,
and/or have these guarantees never existed and i am mistaken?  is there
any kind of ACL mechanism that endows the origin process (or similar)
with the ability to dictate who can hold and/or interact with these
references?

Thanks for your time,

-- 

C Anthony

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 21:05 ` C Anthony Risinger
@ 2011-05-25 21:38   ` Serge E. Hallyn
  2011-05-25 21:55     ` C Anthony Risinger
  0 siblings, 1 reply; 23+ messages in thread
From: Serge E. Hallyn @ 2011-05-25 21:38 UTC (permalink / raw)
  To: C Anthony Risinger
  Cc: Eric W. Biederman, Linux Containers, netdev, linux-kernel

Quoting C Anthony Risinger (anthony@xtfx.me):
> On Mon, May 23, 2011 at 4:05 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
> > /proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
> > process at the time those files are opened, and can be bind mounted to
> > keep the specified namespace alive without a process.
> >
> > This tree adds the setns system call that can be used to change the
> > specified namespace of a process to the namespace referred to by a file
> > descriptor passed to the system call.
> 
> i just have a quick question regarding these, apologies if wrong place
> to respond -- i trimmed to lists only.
> 
> if i understand correctly, mount namespaces (for example), allow one
> to build such constructs as "private /tmp" and similar that even
> `root` cannot access ... and there are many reasons `root` does not
> deserve to completely know/interact with user processes (FUSE makes a
> good example ... just because i [user] have SSH access to a machine,
> why should `root`?)
> 
> would these /proc additions break such guarantees?  IOW, would it now
> become possible for `root` to inject stuff into my private namespaces,
> and/or have these guarantees never existed and i am mistaken?  is there
> any kind of ACL mechanism that endows the origin process (or similar)
> with the ability to dictate who can hold and/or interact with these
> references?

If for instance you have a file open in your private /tmp, then root
in another mounts ns can open the file through /proc/$$/fd/N anyway.
If it's a directory, he can now traverse the whole fs.
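
The access path described above can be sketched in C as follows.  The
helper name reopen_via_proc is hypothetical; whether the open succeeds
depends on the caller's privileges relative to the target process.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open whatever file descriptor `fd_number` of
 * process `pid` refers to, by going through /proc/<pid>/fd/<n>.
 * Returns a new descriptor in the caller, or -1 on failure.
 */
int reopen_via_proc(pid_t pid, int fd_number, int flags)
{
    char path[64];
    int n = snprintf(path, sizeof(path), "/proc/%ld/fd/%d",
                     (long)pid, fd_number);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* The mount namespace of `pid` does not have to match the caller's. */
    return open(path, flags);
}
```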

-serge

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 21:38   ` Serge E. Hallyn
@ 2011-05-25 21:55     ` C Anthony Risinger
  2011-05-25 22:11       ` Michał Mirosław
  2011-05-25 23:40       ` Eric W. Biederman
  0 siblings, 2 replies; 23+ messages in thread
From: C Anthony Risinger @ 2011-05-25 21:55 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Eric W. Biederman, Linux Containers, netdev, linux-kernel

On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting C Anthony Risinger (anthony@xtfx.me):
>> On Mon, May 23, 2011 at 4:05 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>> >
>> > This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
>> > /proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
>> > process at the time those files are opened, and can be bind mounted to
>> > keep the specified namespace alive without a process.
>> >
>> > This tree adds the setns system call that can be used to change the
>> > specified namespace of a process to the namespace referred to by a file
>> > descriptor passed to the system call.
>>
>> i just have a quick question regarding these, apologies if wrong place
>> to respond -- i trimmed to lists only.
>>
>> if i understand correctly, mount namespaces (for example), allow one
>> to build such constructs as "private /tmp" and similar that even
>> `root` cannot access ... and there are many reasons `root` does not
>> deserve to completely know/interact with user processes (FUSE makes a
>> good example ... just because i [user] have SSH access to a machine,
>> why should `root`?)
>>
>> would these /proc additions break such guarantees?  IOW, would it now
>> become possible for `root` to inject stuff into my private namespaces,
>> and/or have these guarantees never existed and i am mistaken?  is there
>> any kind of ACL mechanism that endows the origin process (or similar)
>> with the ability to dictate who can hold and/or interact with these
>> references?
>
> If for instance you have a file open in your private /tmp, then root
> in another mounts ns can open the file through /proc/$$/fd/N anyway.
> If it's a directory, he can now traverse the whole fs.

aaah right :-( ... there's always another way isn't there ... curse
you Linux for being so flexible! (just kidding baby i love you)

this seems like a more fundamental issue then?  or should i not expect
to be able to achieve separation like this?  i ask in the context of
OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
i've used/followed these technologies for couple years now ... and
it's starting to feel like "the right time".

C Anthony

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 21:55     ` C Anthony Risinger
@ 2011-05-25 22:11       ` Michał Mirosław
  2011-05-25 23:40       ` Eric W. Biederman
  1 sibling, 0 replies; 23+ messages in thread
From: Michał Mirosław @ 2011-05-25 22:11 UTC (permalink / raw)
  To: C Anthony Risinger
  Cc: Serge E. Hallyn, Eric W. Biederman, Linux Containers, netdev,
	linux-kernel

2011/5/25 C Anthony Risinger <anthony@xtfx.me>:
> On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> Quoting C Anthony Risinger (anthony@xtfx.me):
[...]
>>> if i understand correctly, mount namespaces (for example), allow one
>>> to build such constructs as "private /tmp" and similar that even
>>> `root` cannot access ... and there are many reasons `root` does not
>>> deserve to completely know/interact with user processes (FUSE makes a
>>> good example ... just because i [user] have SSH access to a machine,
>>> why should `root`?)
>> If for instance you have a file open in your private /tmp, then root
>> in another mounts ns can open the file through /proc/$$/fd/N anyway.
>> If it's a directory, he can now traverse the whole fs.
> aaah right :-( ... there's always another way isn't there ... curse
> you Linux for being so flexible! (just kidding baby i love you)
>
> this seems like a more fundamental issue then?  or should i not expect
> to be able to achieve separation like this?  i ask in the context of
> OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
> perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
> i've used/followed these technologies for couple years now ... and
> it's starting to feel like "the right time".

You either trust the admin or don't use the machine. There is no third way.

Best Regards,
Michał Mirosław

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 21:55     ` C Anthony Risinger
  2011-05-25 22:11       ` Michał Mirosław
@ 2011-05-25 23:40       ` Eric W. Biederman
  2011-05-27 20:18         ` C Anthony Risinger
  1 sibling, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2011-05-25 23:40 UTC (permalink / raw)
  To: C Anthony Risinger
  Cc: Serge E. Hallyn, Linux Containers, netdev, linux-kernel

C Anthony Risinger <anthony@xtfx.me> writes:

> On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> Quoting C Anthony Risinger (anthony@xtfx.me):
>>> On Mon, May 23, 2011 at 4:05 PM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>> >
>>> > This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
>>> > /proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
>>> > process at the time those files are opened, and can be bind mounted to
>>> > keep the specified namespace alive without a process.
>>> >
>>> > This tree adds the setns system call that can be used to change the
>>> > specified namespace of a process to the namespace referred to by a file
>>> > descriptor passed to the system call.
>>>
>>> i just have a quick question regarding these, apologies if wrong place
>>> to respond -- i trimmed to lists only.
>>>
>>> if i understand correctly, mount namespaces (for example), allow one
>>> to build such constructs as "private /tmp" and similar that even
>>> `root` cannot access ... and there are many reasons `root` does not
>>> deserve to completely know/interact with user processes (FUSE makes a
>>> good example ... just because i [user] have SSH access to a machine,
>>> why should `root`?)
>>>
>>> would these /proc additions break such guarantees?  IOW, would it now
>>> become possible for `root` to inject stuff into my private namespaces,
>>> and/or have these guarantees never existed and i am mistaken?  is there
>>> any kind of ACL mechanism that endows the origin process (or similar)
>>> with the ability to dictate who can hold and/or interact with these
>>> references?
>>
>> If for instance you have a file open in your private /tmp, then root
>> in another mounts ns can open the file through /proc/$$/fd/N anyway.
>> If it's a directory, he can now traverse the whole fs.
>
> aaah right :-( ... there's always another way isn't there ... curse
> you Linux for being so flexible! (just kidding baby i love you)

Even more significantly, access to the new files is guarded by the
ptrace access checks.  And if root can ptrace your process, root
can remote control your process.

> this seems like a more fundamental issue then?  or should i not expect
> to be able to achieve separation like this?  i ask in the context of
> OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
> perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
> i've used/followed these technologies for couple years now ... and
> it's starting to feel like "the right time".

I don't think anything really new is allowed, but we haven't designed
anything that radically reduces the power of root either.

At some point we may have the user namespace done and that should
give you a root like user with vastly reduced powers, but we aren't
there yet.

Eric

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 23:40       ` Eric W. Biederman
@ 2011-05-27 20:18         ` C Anthony Risinger
  0 siblings, 0 replies; 23+ messages in thread
From: C Anthony Risinger @ 2011-05-27 20:18 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Serge E. Hallyn, Linux Containers, netdev, linux-kernel

On Wed, May 25, 2011 at 6:40 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> C Anthony Risinger <anthony@xtfx.me> writes:
>
>> On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>>> Quoting C Anthony Risinger (anthony@xtfx.me):
>>>> On Mon, May 23, 2011 at 4:05 PM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>> >
>>>> > This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
>>>> > /proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
>>>> > process at the time those files are opened, and can be bind mounted to
>>>> > keep the specified namespace alive without a process.
>>>> >
>>>> > This tree adds the setns system call that can be used to change the
>>>> > specified namespace of a process to the namespace referred to by a file
>>>> > descriptor passed to the system call.
>>>>
>>>> i just have a quick question regarding these, apologies if wrong place
>>>> to respond -- i trimmed to lists only.
>>>>
>>>> if i understand correctly, mount namespaces (for example), allow one
>>>> to build such constructs as "private /tmp" and similar that even
>>>> `root` cannot access ... and there are many reasons `root` does not
>>>> deserve to completely know/interact with user processes (FUSE makes a
>>>> good example ... just because i [user] have SSH access to a machine,
>>>> why should `root`?)
>>>>
>>>> would these /proc additions break such guarantees?  IOW, would it now
>>>> become possible for `root` to inject stuff into my private namespaces,
>>>> and/or have these guarantees never existed and i am mistaken?  is there
>>>> any kind of ACL mechanism that endows the origin process (or similar)
>>>> with the ability to dictate who can hold and/or interact with these
>>>> references?
>>>
>>> If for instance you have a file open in your private /tmp, then root
>>> in another mounts ns can open the file through /proc/$$/fd/N anyway.
>>> If it's a directory, he can now traverse the whole fs.
>>
>> aaah right :-( ... there's always another way isn't there ... curse
>> you Linux for being so flexible! (just kidding baby i love you)
>
> Even more significantly, access to the new files is guarded by the
> ptrace access checks.  And if root can ptrace your process, root
> can remote control your process.
>
>> this seems like a more fundamental issue then?  or should i not expect
>> to be able to achieve separation like this?  i ask in the context of
>> OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
>> perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
>> i've used/followed these technologies for couple years now ... and
>> it's starting to feel like "the right time".
>
> I don't think anything really new is allowed, but we haven't designed
> anything that radically reduces the power of root either.
>
> At some point we may have the user namespace done and that should
> give you a root like user with vastly reduced powers, but we aren't
> there yet.

ok -- i knew there was some user namespace work still left for a
namespaced root -- i was specifically thinking of the root user in the
host.  i was under the impression that namespaces could achieve
separation even from the host (save the kernel itself) ... but it
seems i was mistaken ... still much to learn about Linux i suppose,
even though i have used it every day for years and years :-)  it kind of
makes sense i guess, since maybe the host needs supervisory powers
over the guests?  there could be some merit to real separation in the
future (not only from a malevolent root user on the host, but say an
attacker/script that manages to break through the container?), though how
feasible that is i don't know.  i wouldn't expect the root user to be
prevented from killing/etc the container, but maybe only prevented from
snooping, e.g. the container looks like a black box that he may only
resource control or kill completely.  either way, what we have is just
fine for my (and likely many others') uses.

anyways, thanks for all the answers and all the work on
namespacing/cgroups ... very useful constructs for a wide array of
problems.

-- 

C Anthony

* [GIT PULL] Namespace file descriptors for 2.6.40
@ 2011-05-21 23:39 Eric W. Biederman
  2011-05-21 23:42 ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2011-05-21 23:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Linux Containers, netdev, James Bottomley,
	Geert Uytterhoeven


Please pull the namespace file descriptor git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd.git

In a hopeless quest to avoid conflicts when merging a new system call
and wiring it up I have pulled in bits of net-next and the parisc tree.
You have already pulled the net-next bits.  The parisc bits in my tree
are:

James Bottomley (4):
      [PARISC] wire up fanotify syscalls
      [PARISC] wire up clock_adjtime syscall
      [PARISC] wire up the fhandle syscalls
      [PARISC] wire up syncfs syscall

Meelis Roos (1):
      [PARISC] fix pacache .size with new binutils

Since then I have gained conflicts in alpha and m68k.
For alpha, all that is needed is to increment the syscall
number in my tree and add my syscall to the end of the list.

For m68k, please just delete all of the syscall entries that the conflict
adds to arch/m68k/kernel/entry_mm.S.  The m68k tree has consolidated
everything in arch/m68k/kernel/syscalltable.S.


This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
/proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
process at the time those files are opened, and can be bind mounted to
keep the specified namespace alive without a process.

This tree adds the setns system call that can be used to change the
specified namespace of a process to the namespace referred to by a file
descriptor passed to the system call.

This tree adds a new rtnetlink attribute that allows for moving a
network device into a network namespace specified by a file descriptor.

Support for the other namespaces is planned but is not ready for 2.6.40.

These changes dramatically simplify what a userspace process has to do
to keep a namespace alive, and to execute system calls in it.

The shortlog:

Stephen Rothwell (1):
      net: fix get_net_ns_by_fd for !CONFIG_NET_NS

Eric W. Biederman (11):
      ns: proc files for namespace naming policy.
      ns: Introduce the setns syscall
      ns proc: Add support for the network namespace.
      ns proc: Add support for the uts namespace
      ns proc: Add support for the ipc namespace
      net: Allow setting the network namespace by fd
      Merge commit '2e7bad5f34b5beed47542490c760ed26574e38ba' into HEAD
      Merge commit '7143b7d41218d4fc2ea33e6056c73609527ae687' into HEAD
      ns: Wire up the setns system call
      ns: Declare sys_setns in syscalls.h
      ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.

The diffstat:

 arch/alpha/include/asm/unistd.h        |    3 +-
 arch/alpha/kernel/systbls.S            |    1 +
 arch/arm/include/asm/unistd.h          |    1 +
 arch/arm/kernel/calls.S                |    1 +
 arch/avr32/include/asm/unistd.h        |    3 +-
 arch/avr32/kernel/syscall_table.S      |    1 +
 arch/blackfin/include/asm/unistd.h     |    3 +-
 arch/blackfin/mach-common/entry.S      |    1 +
 arch/cris/arch-v10/kernel/entry.S      |    1 +
 arch/cris/arch-v32/kernel/entry.S      |    1 +
 arch/cris/include/asm/unistd.h         |    3 +-
 arch/frv/include/asm/unistd.h          |    3 +-
 arch/frv/kernel/entry.S                |    1 +
 arch/h8300/include/asm/unistd.h        |    3 +-
 arch/h8300/kernel/syscalls.S           |    1 +
 arch/ia64/include/asm/unistd.h         |    3 +-
 arch/ia64/kernel/entry.S               |    1 +
 arch/m32r/include/asm/unistd.h         |    3 +-
 arch/m32r/kernel/syscall_table.S       |    1 +
 arch/m68k/include/asm/unistd.h         |    3 +-
 arch/m68k/kernel/syscalltable.S        |    1 +
 arch/microblaze/include/asm/unistd.h   |    3 +-
 arch/microblaze/kernel/syscall_table.S |    1 +
 arch/mips/include/asm/unistd.h         |   15 ++-
 arch/mips/kernel/scall32-o32.S         |    1 +
 arch/mips/kernel/scall64-64.S          |    1 +
 arch/mips/kernel/scall64-n32.S         |    1 +
 arch/mips/kernel/scall64-o32.S         |    1 +
 arch/mn10300/include/asm/unistd.h      |    3 +-
 arch/mn10300/kernel/entry.S            |    1 +
 arch/parisc/include/asm/unistd.h       |   10 ++-
 arch/parisc/kernel/pacache.S           |    6 +-
 arch/parisc/kernel/sys_parisc32.c      |    8 ++
 arch/parisc/kernel/syscall_table.S     |    7 +
 arch/powerpc/include/asm/systbl.h      |    1 +
 arch/powerpc/include/asm/unistd.h      |    3 +-
 arch/s390/include/asm/unistd.h         |    3 +-
 arch/s390/kernel/syscalls.S            |    1 +
 arch/sh/include/asm/unistd_32.h        |    3 +-
 arch/sh/include/asm/unistd_64.h        |    3 +-
 arch/sh/kernel/syscalls_32.S           |    1 +
 arch/sh/kernel/syscalls_64.S           |    1 +
 arch/sparc/include/asm/unistd.h        |    3 +-
 arch/sparc/kernel/systbls_32.S         |    2 +-
 arch/sparc/kernel/systbls_64.S         |    4 +-
 arch/x86/ia32/ia32entry.S              |    1 +
 arch/x86/include/asm/unistd_32.h       |    3 +-
 arch/x86/include/asm/unistd_64.h       |    2 +
 arch/x86/kernel/syscall_table_32.S     |    1 +
 arch/xtensa/include/asm/unistd.h       |    4 +-
 fs/proc/Makefile                       |    1 +
 fs/proc/base.c                         |   20 ++--
 fs/proc/inode.c                        |    7 +
 fs/proc/internal.h                     |   18 +++
 fs/proc/namespaces.c                   |  198 ++++++++++++++++++++++++++++++++
 include/asm-generic/unistd.h           |    4 +-
 include/linux/if_link.h                |    1 +
 include/linux/proc_fs.h                |   21 ++++
 include/linux/syscalls.h               |    1 +
 include/net/net_namespace.h            |    1 +
 ipc/namespace.c                        |   37 ++++++
 kernel/nsproxy.c                       |   42 +++++++
 kernel/utsname.c                       |   39 ++++++
 net/core/net_namespace.c               |   65 +++++++++++
 net/core/rtnetlink.c                   |    5 +-
 65 files changed, 547 insertions(+), 46 deletions(-)

Thanks,
Eric

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-21 23:39 Eric W. Biederman
@ 2011-05-21 23:42 ` Linus Torvalds
  2011-05-22  0:33   ` Eric W. Biederman
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2011-05-21 23:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linux Containers, netdev, James Bottomley,
	Geert Uytterhoeven

On Sat, May 21, 2011 at 4:39 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> In a hopeless quest to avoid conflicts when merging a new system call
> and wiring it up I have pulled in bits of net-next and the parisc tree.
> You have already pulled the net-next bits.  The parisc bits in my tree
> are:

Ok, this just means that I won't pull from you.

It's that simple. We don't do this. Ever.

Why the hell did you even worry about wiring up parisc system calls?
That's not your job.

                              Linus

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-21 23:42 ` Linus Torvalds
@ 2011-05-22  0:33   ` Eric W. Biederman
       [not found]     ` <m1boyvpo9r.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2011-05-22  0:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Linux Containers, netdev, James Bottomley,
	Geert Uytterhoeven

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, May 21, 2011 at 4:39 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> In a hopeless quest to avoid conflicts when merging a new system call
>> and wiring it up I have pulled in bits of net-next and the parisc tree.
>> You have already pulled the net-next bits.  The parisc bits in my tree
>> are:
>
> Ok, this just means that I won't pull from you.

Sure.  I will try to be a little more patient and resend the pull
request after James has sent the pull request for the parisc tree.
At which point the only unique changes in my tree will be mine.

> It's that simple. We don't do this. Ever.

Hah. I seem to remember bits of pulling from non-rebasing trees being ok
in well defined contexts.  This seems like one.  Especially when you
have checked with the maintainers.

Plus all of the parisc bits, in addition to being in linux-next,
are trivially correct.

> Why the hell did you even worry about wiring up parisc system calls?
> That's not your job.

Because in general it is the job of he who changes something to fix up
every possible place.

Now maybe I went a little too far in trying to resolve the conflicts,
but I did check with David Miller and James Bottomley and they knew
what I was doing.

Quite honestly, adding system calls is a mess that no one seems to
know how to do right.  So I flipped a coin and took a stab at it.

Eric

[parent not found: <m1boyvpo9r.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>]

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
       [not found]     ` <m1boyvpo9r.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2011-05-22  7:13       ` James Bottomley
  2011-05-22  8:42         ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: James Bottomley @ 2011-05-22  7:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, netdev, Linus Torvalds,
	linux-kernel, Geert Uytterhoeven

On Sat, 2011-05-21 at 17:33 -0700, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Sat, May 21, 2011 at 4:39 PM, Eric W. Biederman
> > <ebiederm@xmission.com> wrote:
> >>
> >> In a hopeless quest to avoid conflicts when merging a new system call
> >> and wiring it up I have pulled in bits of net-next and the parisc tree.
> >> You have already pulled the net-next bits.  The parisc bits in my tree
> >> are:
> >
> > Ok, this just means that I won't pull from you.
> 
> Sure.  I will try to be a little more patient and resend the pull
> request after James has sent the pull request for the parisc tree.
> At which point the only unique changes in my tree will be mine.

Right ... effectively you're running a postmerge tree, since you now
depend on bits I have in the parisc tree.

Traditionally, the arch trees tend to go a bit later because they wait
to see if there's any fallout from x86; but this time, I think it looks
OK, so I've sent the pull request:

http://marc.info/?l=linux-parisc&m=130604805417277

As soon as that's in, you should be good to go.

James


> > It's that simple. We don't do this. Ever.
> 
> Hah. I seem to remember bits of pulling from non-rebasing trees being ok
> in well defined contexts.  This seems like one.  Especially when you
> have checked with the maintainers.
> 
> Plus all of the parisc bits, in addition to being in linux-next,
> are trivially correct.
> 
> > Why the hell did you even worry about wiring up parisc system calls?
> > That's not your job.
> 
> Because in general it is the job of he who changes something to fix up
> every possible place.
> 
> Now maybe I went a little too far in trying to resolve the conflicts,
> but I did check with David Miller and James Bottomley and they knew
> what I was doing.
> 
> Quite honestly adding system calls is a mess that no one seems to
> know how to do right.  So I flipped a coin and took a stab at it.

Right, the solution is reasonable and means linux-next doesn't have to
carry a conflict resolution patch for this.  It also means we agree on
the syscall numbering ...

The only real mistake was not waiting for the merge sequence: the base
trees have to go first before you can push a postmerge tree.

James


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-22  7:13       ` James Bottomley
@ 2011-05-22  8:42         ` Ingo Molnar
  2011-05-24  7:03           ` Eric W. Biederman
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2011-05-22  8:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Eric W. Biederman, Linus Torvalds, linux-kernel, Linux Containers,
	netdev, Geert Uytterhoeven

* James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> Traditionally, the arch trees tend to go a bit later because they wait to see 
> if there's any fallout from x86; [...]

Not really - most of the arch trees 'traditionally' went late even when the x86 
tree itself was monolithic and was itself sent late in the merge window (with 
the notable exception of the powerpc tree).

> [...] but this time, I think it looks OK, [...]

That's not really a surprise, there hasn't been a serious 'problem' with the 
x86 tree for a long time, roughly since we switched to the finegrained Git 
topical split-up maintenance model about two years ago.

[ That split-up also means that there is no 'x86 tree' anymore as such: if you 
  check lkml we send roughly 20-30 independent trees in the merge window and 
  have done that for the past ~10 kernel cycles. ]

In fact exactly *because* there are few problems with the x86 topic trees can we 
push them so soon: if problems were frequent then 1) we would not be able to be 
ready on time and 2) i suspect we'd be pulled in later in the window as well as 
a maintainer generally wants to pull low risk items first, high risk items 
last, to maximize the utilization of testing capacity.

I agree with Linus's notion in this thread though, a core kernel change should 
generally not worry about hooking up rare-arch system calls (concentrate on the 
architectures that get tested most) - those are better enabled gradually 
anyway.

Also, system call table conflicts are trivial to resolve. Merging in net-next 
to avoid such a conflict is like cracking a nut with a sledgehammer.

Thanks,

	Ingo


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-22  8:42         ` Ingo Molnar
@ 2011-05-24  7:03           ` Eric W. Biederman
  2011-05-24  7:16             ` Ingo Molnar
  2011-05-24  7:26             ` James Bottomley
  0 siblings, 2 replies; 23+ messages in thread
From: Eric W. Biederman @ 2011-05-24  7:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: James Bottomley, Linus Torvalds, linux-kernel, Linux Containers,
	netdev, Geert Uytterhoeven

Ingo Molnar <mingo@elte.hu> writes:

> I agree with Linus's notion in this thread though, a core kernel change should 
> generally not worry about hooking up rare-arch system calls (concentrate on the 
> architectures that get tested most) - those are better enabled gradually 
> anyway.

The way I read it he was complaining about my having parisc bits and
asking for my branch to be merged before the parisc bits had been
merged.  Which I credit as a fair complaint.  If I am going to depend on
other people's trees I should wait.

At the same time, when I am busy looking for every possible source of
trouble and putting code into net-next to detect pending conflicts,
and when maintainers complain when I ask for review that my patches
conflict with their patches, being a contentious developer I am
inclined to do something.

Now that the reality has sunk in that it means waiting for other people's
code to be merged before I request for my changes to be merged I don't
think I will structure a tree that way again while I remember.

> Also, system call table conflicts are trivial to resolve. Merging in net-next 
> to avoid such a conflict is like cracking a nut with a sledgehammer.

Well I still have trauma from how nasty it was to test with syscall
numbers continuing to change when I was working on the kexec_load system
call.

As far as I can tell any one system call conflict on any one
architecture is easy to resolve.  Resolving a conflict on all
architectures would amount to at least 50 files that need to be resolved,
which feels a bit more than trivial.

My gut feel says we should really implement an
include/asm-generic/unistd-common.h to include all new system calls.

That way there would be only one file to touch instead of 50.
Certainly it works for include/asm-generic/unistd.h for the
architectures that use it.  And all we really need is just a little
abstraction on that concept.

Eric


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-24  7:03           ` Eric W. Biederman
@ 2011-05-24  7:16             ` Ingo Molnar
  2011-05-25  0:34               ` Valdis.Kletnieks
  2011-05-24  7:26             ` James Bottomley
  1 sibling, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2011-05-24  7:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Bottomley, Linus Torvalds, linux-kernel, Linux Containers,
	netdev, Geert Uytterhoeven


* Eric W. Biederman <ebiederm@xmission.com> wrote:

> > Also, system call table conflicts are trivial to resolve. Merging in 
> > net-next to avoid such a conflict is like cracking a nut with a 
> > sledgehammer.
> 
> Well I still have trauma from how nasty it was to test with syscall numbers 
> continuing to change when I was working on the kexec_load system call.
> 
> As far as I can tell any one system call conflict on any one
> architecture is easy to resolve.  Resolving a conflict on all
> architectures would amount to at least 50 files that need to be resolved,
> which feels a bit more than trivial.

Of course - and the straightforward solution is to not do it but concentrate on 
the 2-3 archs you find to be the primary target of your patches. How many 
parisc systems are there on the planet, which in the future will be upgraded to 
both kernel and user-space running your new syscall for real? Less than 10? How 
many ARM and x86 systems?

> My gut feel says we should really implement an 
> include/asm-generic/unistd-common.h to include all new system calls.
> 
> That way there would be only one file to touch instead of 50. Certainly it 
> works for include/asm-generic/unistd.h for the architectures that use it.  
> And all we really need is just a little abstraction on that concept.

I suppose that could be tried, although in practice it would probably be 
somewhat complex due to the various compat syscall handling differences.
So i guess this is one of the 'lets see how ugly/fragile it becomes' patches.

Thanks,

	Ingo


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-24  7:16             ` Ingo Molnar
@ 2011-05-25  0:34               ` Valdis.Kletnieks
  2011-05-25  8:25                 ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Valdis.Kletnieks @ 2011-05-25  0:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric W. Biederman, James Bottomley, Linus Torvalds, linux-kernel,
	Linux Containers, netdev, Geert Uytterhoeven


On Tue, 24 May 2011 09:16:28 +0200, Ingo Molnar said:
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
> > My gut feel says we should really implement an
> > include/asm-generic/unistd-common.h to include all new system calls.
> >
> > That way there would be only one file to touch instead of 50. Certainly it
> > works for include/asm-generic/unistd.h for the architectures that use it. 
> > And all we really need is just a little abstraction on that concept.
>
> I suppose that could be tried, although in practice it would probably be
> somewhat complex due to the various compat syscall handling differences.

Can somebody fill us newcomers in on the arch-aeology of why some syscalls have
different numbers on different archs? I know it's partially because some simply
didn't implement some syscalls so there were numbering mismatches, but would it
have been *that* hard to wire all of those skipped syscalls up to one stub
'return -ENOSYS'?
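
A minimal user-space sketch of the stub idea asked about above, assuming a
toy dispatch table. Every name here (toy_table, toy_dispatch, toy_double,
toy_negate, stub_ni_syscall) is hypothetical and not taken from any kernel
tree; the kernel's own catch-all for unimplemented slots is
sys_ni_syscall(), which this stub only imitates.

```c
#include <errno.h>

/* Toy syscall signature: one argument, one return value. */
typedef long (*syscall_fn)(long arg);

/* Hypothetical stand-in for an "unimplemented" handler: always -ENOSYS. */
static long stub_ni_syscall(long arg)
{
    (void)arg;
    return -ENOSYS;
}

/* Two toy syscalls that this imaginary architecture does implement. */
static long toy_double(long arg) { return 2 * arg; }
static long toy_negate(long arg) { return -arg; }

#define TOY_NR_SYSCALLS 4

/*
 * Slots 1 and 3 are syscalls this imaginary architecture skips.  They
 * keep their numbers and point at the stub, so the numbering leaves no
 * holes for later syscalls to be squeezed into.
 */
static const syscall_fn toy_table[TOY_NR_SYSCALLS] = {
    [0] = toy_double,
    [1] = stub_ni_syscall,
    [2] = toy_negate,
    [3] = stub_ni_syscall,
};

/* Dispatch by number; anything past the end of the table is also -ENOSYS. */
static long toy_dispatch(unsigned long nr, long arg)
{
    if (nr >= TOY_NR_SYSCALLS)
        return -ENOSYS;
    return toy_table[nr](arg);
}
```

With this layout a skipped syscall and an out-of-range number look the
same to the caller: both return -ENOSYS, and slot numbers never get reused.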




* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25  0:34               ` Valdis.Kletnieks
@ 2011-05-25  8:25                 ` Ingo Molnar
  2011-05-25  8:35                   ` Geert Uytterhoeven
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2011-05-25  8:25 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Eric W. Biederman, James Bottomley, Linus Torvalds, linux-kernel,
	Linux Containers, netdev, Geert Uytterhoeven

* Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:

> On Tue, 24 May 2011 09:16:28 +0200, Ingo Molnar said:
> > * Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > My gut feel says we should really implement an
> > > include/asm-generic/unistd-common.h to include all new system calls.
> > >
> > > That way there would be only one file to touch instead of 50. Certainly it
> > > works for include/asm-generic/unistd.h for the architectures that use it. 
> > > And all we really need is just a little abstraction on that concept.
> >
> > I suppose that could be tried, although in practice it would probably be
> > somewhat complex due to the various compat syscall handling differences.
> 
> Can somebody fill us newcomers in on the arch-aeology of why some syscalls have
> different numbers on different archs? I know it's partially because some simply
> didn't implement some syscalls so there were numbering mismatches, but would it
> have been *that* hard to wire all of those skipped syscalls up to one stub
> 'return -ENOSYS'?

It was done so for hysterical raisons mostly, and once a bad ABI is done it's 
very hard to undo it: beyond pushing the 'good ABI' you'd also still have to 
deal with the bad ABI for a decade or more.

So the background is that most architectures start out as quick concept 
prototypes, doing:

	cp -a arch/existingarch arch/newarch

where 'existingarch' used to be arch/i386/ in the early days. Now i386 had a 
fair amount of x86 specific syscalls that were naturally removed from 
'newarch'. Those created 'holes' in the numbers, which were then filled in with 
new syscalls - a nice idea in itself!

Also sometimes 'newarch' did a 'clean', compressed list of syscall numbers 
straight away, reordering syscalls. Once the 'quick prototype' hack starts 
working on real hardware, once the syscall numbers get into the C library and 
binutils it's very hard to ever transition away: you'd break the world!

An added source of noise is that architectures tend to add new syscalls in a 
different order: some are more interesting to them - some less.

So these syscall table hacks done very early during an arch's lifetime stick 
around and create wild numbering noise in 20+ syscall tables:

                                       [ slightly edited for readability ]

 arch/alpha/include/asm/unistd.h:      #define __NR_perf_event_open 493
 arch/arm/include/asm/unistd.h:        #define __NR_perf_event_open 364
 arch/blackfin/include/asm/unistd.h:   #define __NR_perf_event_open 369
 arch/frv/include/asm/unistd.h:        #define __NR_perf_event_open 336
 arch/m68k/include/asm/unistd.h:       #define __NR_perf_event_open 332
 arch/microblaze/include/asm/unistd.h: #define __NR_perf_event_open 366
 arch/mips/include/asm/unistd.h:       #define __NR_perf_event_open 333
 arch/mips/include/asm/unistd.h:       #define __NR_perf_event_open 292
 arch/mips/include/asm/unistd.h:       #define __NR_perf_event_open 296
 arch/mn10300/include/asm/unistd.h:    #define __NR_perf_event_open 337
 arch/parisc/include/asm/unistd.h:     #define __NR_perf_event_open 318
 arch/powerpc/include/asm/unistd.h:    #define __NR_perf_event_open 319
 arch/s390/include/asm/unistd.h:       #define __NR_perf_event_open 331
 arch/sh/include/asm/unistd_32.h:      #define __NR_perf_event_open 336
 arch/sh/include/asm/unistd_64.h:      #define __NR_perf_event_open 364
 arch/sparc/include/asm/unistd.h:      #define __NR_perf_event_open 327
 arch/x86/include/asm/unistd_32.h:     #define __NR_perf_event_open 336
 arch/x86/include/asm/unistd_64.h:     #define __NR_perf_event_open 298

To fix this we'd create a new, clean offset defined by each architecture, and a 
generic enumeration of new syscalls.

This would make it much easier to add new, generic syscalls to all 
architectures indeed.

It would still leave compat syscall wrappers unaddressed though: those are 
often numbered differently and sometimes need arch specific wrapper entry 
functions, which then call the real generic syscall.

But at least the primary, 'native' syscall table of every arch could be kept 
rather fresh via generic enumeration.

Thanks,

	Ingo


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25  8:25                 ` Ingo Molnar
@ 2011-05-25  8:35                   ` Geert Uytterhoeven
  2011-05-25 12:47                     ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Geert Uytterhoeven @ 2011-05-25  8:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Valdis.Kletnieks, Eric W. Biederman, James Bottomley,
	Linus Torvalds, linux-kernel, Linux Containers, netdev

On Wed, May 25, 2011 at 10:25, Ingo Molnar <mingo@elte.hu> wrote:
> * Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:
>
>> On Tue, 24 May 2011 09:16:28 +0200, Ingo Molnar said:
>> > * Eric W. Biederman <ebiederm@xmission.com> wrote:
>> > > My gut feel says we should really implement an
>> > > include/asm-generic/unistd-common.h to include all new system calls.
>> > >
>> > > That way there would be only one file to touch instead of 50. Certainly it
>> > > works for include/asm-generic/unistd.h for the architectures that use it.
>> > > And all we really need is just a little abstraction on that concept.
>> >
>> > I suppose that could be tried, although in practice it would probably be
>> > somewhat complex due to the various compat syscall handling differences.
>>
>> Can somebody fill us newcomers in on the arch-aeology of why some syscalls have
>> different numbers on different archs? I know it's partially because some simply
>> didn't implement some syscalls so there were numbering mismatches, but would it
>> have been *that* hard to wire all of those skipped syscalls up to one stub
>> 'return -ENOSYS'?
>
> It was done so for hysterical raisons mostly, and once a bad ABI is done it's
> very hard to undo it: beyond pushing the 'good ABI' you'd also still have to
> deal with the bad ABI for a decade or more.
>
> So the background is that most architectures start out as quick concept
> prototypes, doing:
>
>        cp -a arch/existingarch arch/newarch
>
> where 'existingarch' used to be arch/i386/ in the early days. Now i386 had a
> fair amount of x86 specific syscalls that were naturally removed from
> 'newarch'. Those created 'holes' in the numbers, which were then filled in with
> new syscalls - a nice idea in itself!
>
> Also sometimes 'newarch' did a 'clean', compressed list of syscall numbers
> straight away, reordering syscalls. Once the 'quick prototype' hack starts
> working on real hardware, once the syscall numbers get into the C library and
> binutils it's very hard to ever transition away: you'd break the world!
>
> An added source of noise is that architectures tend to add new syscalls in a
> different order: some are more interesting to them - some less.
>
> So these syscall table hacks done very early during an arch's lifetime stick
> around and create wild numbering noise in 20+ syscall tables:
>
>                                       [ slightly edited for readability ]
>
>  arch/alpha/include/asm/unistd.h:      #define __NR_perf_event_open 493
>  arch/arm/include/asm/unistd.h:        #define __NR_perf_event_open 364
>  arch/blackfin/include/asm/unistd.h:   #define __NR_perf_event_open 369
>  arch/frv/include/asm/unistd.h:        #define __NR_perf_event_open 336
>  arch/m68k/include/asm/unistd.h:       #define __NR_perf_event_open 332
>  arch/microblaze/include/asm/unistd.h: #define __NR_perf_event_open 366
>  arch/mips/include/asm/unistd.h:       #define __NR_perf_event_open 333
>  arch/mips/include/asm/unistd.h:       #define __NR_perf_event_open 292
>  arch/mips/include/asm/unistd.h:       #define __NR_perf_event_open 296
>  arch/mn10300/include/asm/unistd.h:    #define __NR_perf_event_open 337
>  arch/parisc/include/asm/unistd.h:     #define __NR_perf_event_open 318
>  arch/powerpc/include/asm/unistd.h:    #define __NR_perf_event_open 319
>  arch/s390/include/asm/unistd.h:       #define __NR_perf_event_open 331
>  arch/sh/include/asm/unistd_32.h:      #define __NR_perf_event_open 336
>  arch/sh/include/asm/unistd_64.h:      #define __NR_perf_event_open 364
>  arch/sparc/include/asm/unistd.h:      #define __NR_perf_event_open 327
>  arch/x86/include/asm/unistd_32.h:     #define __NR_perf_event_open 336
>  arch/x86/include/asm/unistd_64.h:     #define __NR_perf_event_open 298
>
> To fix this we'd create a new, clean offset defined by each architecture, and a
> generic enumeration of new syscalls.
>
> This would make it much easier to add new, generic syscalls to all
> architectures indeed.
>
> It would still leave compat syscall wrappers unaddressed though: those are
> often numbered differently and sometimes need arch specific wrapper entry
> functions, which then call the real generic syscall.
>
> But at least the primary, 'native' syscall table of every arch could be kept
> rather fresh via generic enumeration.

So we can start all over at offset 501 (alpha just started using 500) with a
unified, clean, and compressed list of syscalls? Or do we have some more
other-os-compat syscalls around in this range?

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25  8:35                   ` Geert Uytterhoeven
@ 2011-05-25 12:47                     ` Ingo Molnar
  2011-05-25 13:00                       ` Geert Uytterhoeven
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2011-05-25 12:47 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Valdis.Kletnieks, Eric W. Biederman, James Bottomley,
	Linus Torvalds, linux-kernel, Linux Containers, netdev

* Geert Uytterhoeven <geert@linux-m68k.org> wrote:

> > But at least the primary, 'native' syscall table of every arch 
> > could be kept rather fresh via generic enumeration.
> 
> So we can start all over at offset 501 (alpha just started using 
> 500) with a unified, clean, and compressed list of syscalls? Or do 
> we have some more other-os-compat syscalls around in this range?

No, that would leave a big hole in the syscall table of most 
architectures.

So what would be needed is for each architecture to define a 'generic 
syscall table base index', ARCH_SYSCALL_BASE or so, and the generic 
syscalls would be numbered relative to that.

Alpha would have 501, the others lower numbers.

The only general assumption we can rely on is that there's a range of 
not yet used syscall numbers starting at the end of the current 
syscall table.
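
A minimal sketch of the base-index idea, assuming one shared list of new
syscalls and a per-architecture ARCH_SYSCALL_BASE. Only the
ARCH_SYSCALL_BASE name, the alpha value 501 and setns come from this
thread; GENERIC_SYSCALLS, the NR_ prefix, example_a/example_b and
generic_syscall_name() are made up for illustration, and compat tables are
not covered.

```c
#include <stddef.h>

/*
 * Hypothetical: each architecture defines the first free slot after its
 * legacy table.  501 is the alpha value mentioned in the thread; other
 * architectures would use lower numbers.
 */
#ifndef ARCH_SYSCALL_BASE
#define ARCH_SYSCALL_BASE 501
#endif

/*
 * Hypothetical shared list: adding a generic syscall means adding one
 * line here, instead of touching every architecture's unistd.h.
 */
#define GENERIC_SYSCALLS(X) \
    X(setns)                \
    X(example_a)            \
    X(example_b)

/* Architecture-independent index of each generic syscall: 0, 1, 2, ... */
enum generic_syscall_index {
#define X(name) GENERIC_IDX_##name,
    GENERIC_SYSCALLS(X)
#undef X
    GENERIC_SYSCALL_COUNT
};

/* Per-architecture number: the arch base plus the generic index. */
enum arch_syscall_number {
#define X(name) NR_##name = ARCH_SYSCALL_BASE + GENERIC_IDX_##name,
    GENERIC_SYSCALLS(X)
#undef X
};

/* Map a per-architecture number back to its name, or NULL if not generic. */
static const char *generic_syscall_name(long nr)
{
    static const char *const names[] = {
#define X(name) #name,
        GENERIC_SYSCALLS(X)
#undef X
    };
    long idx = nr - ARCH_SYSCALL_BASE;

    if (idx < 0 || idx >= GENERIC_SYSCALL_COUNT)
        return NULL;
    return names[idx];
}
```

Built with the default base, the three entries get 501, 502 and 503; built
with -DARCH_SYSCALL_BASE=341 the same list yields 341, 342 and 343, so the
legacy part of each table is left untouched.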

Thanks,

	Ingo


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 12:47                     ` Ingo Molnar
@ 2011-05-25 13:00                       ` Geert Uytterhoeven
  2011-05-25 13:17                         ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Geert Uytterhoeven @ 2011-05-25 13:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Valdis.Kletnieks, Eric W. Biederman, James Bottomley,
	Linus Torvalds, linux-kernel, Linux Containers, netdev

On Wed, May 25, 2011 at 14:47, Ingo Molnar <mingo@elte.hu> wrote:
> * Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>
>> > But at least the primary, 'native' syscall table of every arch
>> > could be kept rather fresh via generic enumeration.
>>
>> So we can start all over at offset 501 (alpha just started using
>> 500) with a unified, clean, and compressed list of syscalls? Or do
>> we have some more other-os-compat syscalls around in this range?
>
> No, that would leave a big hole in the syscall table of most
> architectures.

Sure, but we could (a) optimize for the case where the syscall number is
larger than 500 and/or (b) drop support for syscall numbers smaller than
501, depending on a config option.

> So what would be needed is for each architecture to define a 'generic
> syscall table base index', ARCH_SYSCALL_BASE or so, and the generic
> syscalls would be added for that.
>
> Alpha would have 501, the others lower numbers.
>
> The only general assumption we can rely on is that there's a range of
> not yet used syscall numbers starting at the end of the current
> syscall table.

Yep, that would work too.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 13:00                       ` Geert Uytterhoeven
@ 2011-05-25 13:17                         ` Ingo Molnar
  2011-05-25 15:22                           ` Geert Uytterhoeven
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2011-05-25 13:17 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Valdis.Kletnieks, Eric W. Biederman, James Bottomley,
	Linus Torvalds, linux-kernel, Linux Containers, netdev

* Geert Uytterhoeven <geert@linux-m68k.org> wrote:

> On Wed, May 25, 2011 at 14:47, Ingo Molnar <mingo@elte.hu> wrote:
> > * Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> >
> >> > But at least the primary, 'native' syscall table of every arch
> >> > could be kept rather fresh via generic enumeration.
> >>
> >> So we can start all over at offset 501 (alpha just started using
> >> 500) with a unified, clean, and compressed list of syscalls? Or do
> >> we have some more other-os-compat syscalls around in this range?
> >
> > No, that would leave a big hole in the syscall table of most
> > architectures.
> 
> Sure, but we could (a) optimize for the case where the syscall number is
> larger than 500 and/or (b) drop support for syscall numbers smaller than
> 501, depending on a config option.

Dunno why there is so much desire to complicate and break 
well-working ABIs while we have a 14+ MLOC kernel with so much code 
in it that is in dire need to be improved! :-)

Yes, we can reduce the syscall addition pain via the 
ARCH_SYSCALL_BASE trick, but we should really forget about 
*removing* (or reordering) syscall numbers as the advantages are 
marginal at best while the disadvantages are huge.

Messy syscall tables are irreversibly ingrained in tens of millions 
of systems and there's nothing we can do about that. We can improve 
the future shape of syscall tables and we can try not to make new 
mistakes, and that's a large enough job in itself ;-)

Thanks,

	Ingo


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-25 13:17                         ` Ingo Molnar
@ 2011-05-25 15:22                           ` Geert Uytterhoeven
  0 siblings, 0 replies; 23+ messages in thread
From: Geert Uytterhoeven @ 2011-05-25 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Valdis.Kletnieks, Eric W. Biederman, James Bottomley,
	Linus Torvalds, linux-kernel, Linux Containers, netdev

On Wed, May 25, 2011 at 15:17, Ingo Molnar <mingo@elte.hu> wrote:
> * Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>> On Wed, May 25, 2011 at 14:47, Ingo Molnar <mingo@elte.hu> wrote:
>> > * Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>> >
>> >> > But at least the primary, 'native' syscall table of every arch
>> >> > could be kept rather fresh via generic enumeration.
>> >>
>> >> So we can start all over at offset 501 (alpha just started using
>> >> 500) with a unified, clean, and compressed list of syscalls? Or do
>> >> we have some more other-os-compat syscalls around in this range?
>> >
>> > No, that would leave a big hole in the syscall table of most
>> > architectures.
>>
>> Sure, but we could (a) optimize for the case where the syscall number is
>> larger than 500 and/or (b) drop support for syscall numbers smaller than
>> 501, depending on a config option.
>
> Dunno why there is so much desire to complicate and break
> well-working ABIs while we have a 14+ MLOC kernel with so much code
> in it that is in dire need to be improved! :-)

Because we (think we) need fewer active brain cells to write emails than to code.
So when we're not "active" enough to hack, we tend to respond to long-winded
email threads that are drifting off-topic...

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-24  7:03           ` Eric W. Biederman
  2011-05-24  7:16             ` Ingo Molnar
@ 2011-05-24  7:26             ` James Bottomley
  2011-05-24  8:11               ` Eric W. Biederman
  1 sibling, 1 reply; 23+ messages in thread
From: James Bottomley @ 2011-05-24  7:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ingo Molnar, netdev, linux-kernel, Geert Uytterhoeven,
	Linux Containers, Linus Torvalds

On Tue, 2011-05-24 at 00:03 -0700, Eric W. Biederman wrote:
> Ingo Molnar <mingo@elte.hu> writes:
> 
> > I agree with Linus's notion in this thread though, a core kernel change should 
> > generally not worry about hooking up rare-arch system calls (concentrate on the 
> > architectures that get tested most) - those are better enabled gradually 
> > anyway.
> 
> The way I read it he was complaining about my having parisc bits and
> asking for my branch to be merged before the parisc bits had been
> merged.  Which I credit as a fair complaint.  If I am going to depend on
> other people's trees I should wait.
> 
> At the same time when I am busy looking for every possible source of
> trouble and putting code into net-next to detect pending conflicts,
> and when maintainers complain when I ask for review that my patches
> conflict with their patches.  Being a contentious developer I am
> inclined to do something.

Right ... and the problem is that someone has to care, because the
conflict will show up in linux-next.  I think Stephen Rothwell would
appreciate us making his life easier rather than leaving it to him to
sort out the problems.

> Now that the reality has sunk in that it means waiting for other people's
> code to be merged before I request for my changes to be merged I don't
> think I will structure a tree that way again while I remember.

Right.   This is quite a common occurrence in SCSI (mostly changes
entangled with block or libata).  If you don't feel comfortable running
a postmerge tree, just send me the bits and I'll do it (after all it
works either way around: I can pull in the syscalls and depend on your
tree rather than vice versa).

James


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
  2011-05-24  7:26             ` James Bottomley
@ 2011-05-24  8:11               ` Eric W. Biederman
  0 siblings, 0 replies; 23+ messages in thread
From: Eric W. Biederman @ 2011-05-24  8:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ingo Molnar, netdev, linux-kernel, Geert Uytterhoeven,
	Linux Containers, Linus Torvalds

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Tue, 2011-05-24 at 00:03 -0700, Eric W. Biederman wrote:
>> Ingo Molnar <mingo@elte.hu> writes:
>> 
>> > I agree with Linus's notion in this thread though, a core kernel change should 
>> > generally not worry about hooking up rare-arch system calls (concentrate on the 
>> > architectures that get tested most) - those are better enabled gradually 
>> > anyway.
>> 
>> The way I read it he was complaining about my having parisc bits and
>> asking for my branch to be merged before the parisc bits had been
>> merged.  Which I credit as a fair complaint.  If I am going to depend on
>> other people's trees I should wait.
>> 
>> At the same time when I am busy looking for every possible source of
>> trouble and putting code into net-next to detect pending conflicts,
>> and when maintainers complain when I ask for review that my patches
>> conflict with their patches.  Being a contentious developer I am
                                         ^^^^^^^^^^^ conscientious
I didn't realize it was possible to make that typo.

>> inclined to do something.
>
> Right ... and the problem is that someone has to care, because the
> conflict will show up in linux-next.  I think Stephen Rothwell would
> appreciate us making his life easier rather than leaving it to him to
> sort out the problems.
>
>> Now that the reality has sunk in that it means waiting for other peoples
>> code to be merged before I request for my changes to be merged I don't
>> think I will structure a tree that way again while I remember.
>
> Right.   This is quite a common occurrence in SCSI (mostly changes
> entangled with block or libata).  If you don't feel comfortable running
> a postmerge tree, just send me the bits and I'll do it (after all it
> works either way around: I can pull in the syscalls and depend on your
> tree rather than vice versa).

Well for the moment I don't see too many problems.  I sent another pull
request to Linus earlier today now that your changes are in.  So I am
hoping either Linus will pull my tree or someone will educate me on what
Linus will accept.

Right now my tree is tested and in a good state.  Heck I'm running it
to send this email.  So I am reluctant to change anything without clear
feedback.

James, when you refer to a postmerge tree, what are the dynamics/semantics
usually associated with that?  Is this a tree that gets pulled a couple
of times: once with the non-conflicting bits, and again when the bits it
depends on have been merged?  Or is this a tree that gets pulled after
the merge window entirely?

Eric


end of thread, other threads:[~2011-05-27 20:18 UTC | newest]

Thread overview: 23+ messages
2011-05-23 21:05 [GIT PULL] Namespace file descriptors for 2.6.40 Eric W. Biederman
2011-05-25 21:05 ` C Anthony Risinger
2011-05-25 21:38   ` Serge E. Hallyn
2011-05-25 21:55     ` C Anthony Risinger
2011-05-25 22:11       ` Michał Mirosław
2011-05-25 23:40       ` Eric W. Biederman
2011-05-27 20:18         ` C Anthony Risinger
  -- strict thread matches above, loose matches on Subject: below --
2011-05-21 23:39 Eric W. Biederman
2011-05-21 23:42 ` Linus Torvalds
2011-05-22  0:33   ` Eric W. Biederman
     [not found]     ` <m1boyvpo9r.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2011-05-22  7:13       ` James Bottomley
2011-05-22  8:42         ` Ingo Molnar
2011-05-24  7:03           ` Eric W. Biederman
2011-05-24  7:16             ` Ingo Molnar
2011-05-25  0:34               ` Valdis.Kletnieks
2011-05-25  8:25                 ` Ingo Molnar
2011-05-25  8:35                   ` Geert Uytterhoeven
2011-05-25 12:47                     ` Ingo Molnar
2011-05-25 13:00                       ` Geert Uytterhoeven
2011-05-25 13:17                         ` Ingo Molnar
2011-05-25 15:22                           ` Geert Uytterhoeven
2011-05-24  7:26             ` James Bottomley
2011-05-24  8:11               ` Eric W. Biederman
