* Re: [RFC PATCH 05/27] containers: Open a socket inside a container
[not found] <m2o8z7t2w5.fsf@badgerous.net>
@ 2019-09-27 14:46 ` Eric W. Biederman
2019-09-28 22:29 ` Alun Evans
0 siblings, 1 reply; 5+ messages in thread
From: Eric W. Biederman @ 2019-09-27 14:46 UTC (permalink / raw)
To: Alun Evans; +Cc: linux-kernel

Alun Evans <alun@badgerous.net> writes:

> Hi Eric,
>
>
> On Tue, 19 Feb 2019, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> David Howells <dhowells@redhat.com> writes:
>>
>> > Provide a system call to open a socket inside of a container, using that
>> > container's network namespace. This allows netlink to be used to manage
>> > the container.
>> >
>> > 	fd = container_socket(int container_fd,
>> > 			      int domain, int type, int protocol);
>> >
>>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> Use a namespace file descriptor if you need this. So far we have not
>> added this system call as it is just a performance optimization. And it
>> has been too niche to matter.
>>
>> If that has changed we can add this separately from everything else
>> you are doing here.
>
> I think I've found the niche.
>
>
> I'm trying to use network namespaces from Go.

Yes. Go sucks for this.

> Since setns is thread
> specific, I'm forced to use this pattern:
>
>     runtime.LockOSThread()
>     defer runtime.UnlockOSThread()
>     …
>     err = netns.Set(newns)
>
>
> This is only safe recently:
> https://github.com/vishvananda/netns/issues/17#issuecomment-367325770
>
> - but is still less than ideal performance wise, as it locks out other
>   socket operations.
>
> The socketat() / socketns() would be ideal:
>
> https://lwn.net/Articles/406684/
> https://lwn.net/Articles/407495/
> https://lkml.org/lkml/2011/10/3/220
>
>
> One thing that is interesting, the LockOSThread works pretty well for
> receiving, since I can wrap it around the socket()/bind()/listen() at
> startup. Then accept() can run outside of the lock.
>
> It's creating new outbound tcp connections via socket()/connect() pairs
> that is the issue.

As I understand it you should be able to write socketat in go something like:

	runtime.LockOSThread()
	err = netns.Set(newns);
	fd = socket(...);
	err = netns.Set(defaultns);
	runtime.UnlockOSThread()

I have no real objections to a kernel system call doing that. It has
just never risen to the level where it was necessary to optimize
userspace yet.

Eric

^ permalink raw reply [flat|nested] 5+ messages in thread
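For readers following along in C, the sequence outlined in this message (switch namespace, create the socket, switch back) might look roughly like the sketch below. socket_in_netns() is a hypothetical helper name, not an existing kernel or libc interface. The sketch assumes Linux 3.17 or later for /proc/thread-self and a caller with CAP_SYS_ADMIN, and it shows the userspace workaround, not the proposed system call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a kernel or libc API): create a socket in the
 * network namespace referred to by netns_fd, then return the calling
 * thread to the namespace it started in.
 * Returns the new socket fd, or -1 with errno set.
 */
int socket_in_netns(int netns_fd, int domain, int type, int protocol)
{
	int home_fd, sock, saved_errno;

	/* Remember this thread's current network namespace. */
	home_fd = open("/proc/thread-self/ns/net", O_RDONLY | O_CLOEXEC);
	if (home_fd < 0)
		return -1;

	/* setns() changes the namespace of the calling thread only. */
	if (setns(netns_fd, CLONE_NEWNET) < 0) {
		saved_errno = errno;
		close(home_fd);
		errno = saved_errno;
		return -1;
	}

	sock = socket(domain, type, protocol);
	saved_errno = errno;

	/* Switch back; the socket keeps the namespace it was created in. */
	if (setns(home_fd, CLONE_NEWNET) < 0)
		abort();	/* thread would be stranded in the wrong namespace */

	close(home_fd);
	errno = saved_errno;
	return sock;
}
```

A caller would typically obtain netns_fd by opening a namespace file such as /proc/<pid>/ns/net. The three extra system calls around socket() are the overhead that a single socketns()-style call would remove.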
* Re: [RFC PATCH 05/27] containers: Open a socket inside a container
2019-09-27 14:46 ` [RFC PATCH 05/27] containers: Open a socket inside a container Eric W. Biederman
@ 2019-09-28 22:29 ` Alun Evans
2019-09-30 10:02 ` Eric W. Biederman
0 siblings, 1 reply; 5+ messages in thread
From: Alun Evans @ 2019-09-28 22:29 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: linux-kernel

On Fri 27 Sep '19 at 07:46 ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> Alun Evans <alun@badgerous.net> writes:
>
>> Hi Eric,
>>
>>
>> On Tue, 19 Feb 2019, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>>> David Howells <dhowells@redhat.com> writes:
>>>
>>> > Provide a system call to open a socket inside of a container, using that
>>> > container's network namespace. This allows netlink to be used to manage
>>> > the container.
>>> >
>>> > 	fd = container_socket(int container_fd,
>>> > 			      int domain, int type, int protocol);
>>> >
>>>
>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> Use a namespace file descriptor if you need this. So far we have not
>>> added this system call as it is just a performance optimization. And it
>>> has been too niche to matter.
>>>
>>> If that has changed we can add this separately from everything else
>>> you are doing here.
>>
>> I think I've found the niche.
>>
>>
>> I'm trying to use network namespaces from Go.
>
> Yes. Go sucks for this.

Haha... Neither confirm nor deny.

>> Since setns is thread
>> specific, I'm forced to use this pattern:
>>
>>     runtime.LockOSThread()
>>     defer runtime.UnlockOSThread()
>>     …
>>     err = netns.Set(newns)
>>
>>
>> This is only safe recently:
>> https://github.com/vishvananda/netns/issues/17#issuecomment-367325770
>>
>> - but is still less than ideal performance wise, as it locks out other
>>   socket operations.
>>
>> The socketat() / socketns() would be ideal:
>>
>> https://lwn.net/Articles/406684/
>> https://lwn.net/Articles/407495/
>> https://lkml.org/lkml/2011/10/3/220
>>
>>
>> One thing that is interesting, the LockOSThread works pretty well for
>> receiving, since I can wrap it around the socket()/bind()/listen() at
>> startup. Then accept() can run outside of the lock.
>>
>> It's creating new outbound tcp connections via socket()/connect() pairs
>> that is the issue.
>
> As I understand it you should be able to write socketat in go something like:
>
> 	runtime.LockOSThread()
> 	err = netns.Set(newns);
> 	fd = socket(...);
> 	err = netns.Set(defaultns);
> 	runtime.UnlockOSThread()

Yeah, this is currently what I'm having to do. It's painful because due
to the Go runtime model of a single OS netpoller thread, locking the OS
thread to the current goroutine blocks out the other goroutines doing
network I/O.

> I have no real objections to a kernel system call doing that. It has
> just never risen to the level where it was necessary to optimize
> userspace yet.

Would you be able to accept the patch from this thread with the
container API?

	fd = container_socket(int container_fd,
			      int domain, int type, int protocol);

I think that seems more coherent with the rest of the container world
than a follow up of https://lkml.org/lkml/2011/10/3/220 :

	int socketns(int namespace, int domain, int type, int protocol)

I could also put some up if required.

A.

--
Alun Evans.

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC PATCH 05/27] containers: Open a socket inside a container
2019-09-28 22:29 ` Alun Evans
@ 2019-09-30 10:02 ` Eric W. Biederman
0 siblings, 0 replies; 5+ messages in thread
From: Eric W. Biederman @ 2019-09-30 10:02 UTC (permalink / raw)
To: Alun Evans; +Cc: linux-kernel

Alun Evans <alun@badgerous.net> writes:

> On Fri 27 Sep '19 at 07:46 ebiederm@xmission.com (Eric W. Biederman) wrote:
>>
>> Alun Evans <alun@badgerous.net> writes:
>>
>>> Hi Eric,
>>>
>>>
>>> On Tue, 19 Feb 2019, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>
>>>> David Howells <dhowells@redhat.com> writes:
>>>>
>>>> > Provide a system call to open a socket inside of a container, using that
>>>> > container's network namespace. This allows netlink to be used to manage
>>>> > the container.
>>>> >
>>>> > 	fd = container_socket(int container_fd,
>>>> > 			      int domain, int type, int protocol);
>>>> >
>>>>
>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>
>>>> Use a namespace file descriptor if you need this. So far we have not
>>>> added this system call as it is just a performance optimization. And it
>>>> has been too niche to matter.
>>>>
>>>> If that has changed we can add this separately from everything else
>>>> you are doing here.
>>>
>>> I think I've found the niche.
>>>
>>>
>>> I'm trying to use network namespaces from Go.
>>
>> Yes. Go sucks for this.
>
> Haha... Neither confirm nor deny.
>
>>> Since setns is thread
>>> specific, I'm forced to use this pattern:
>>>
>>>     runtime.LockOSThread()
>>>     defer runtime.UnlockOSThread()
>>>     …
>>>     err = netns.Set(newns)
>>>
>>>
>>> This is only safe recently:
>>> https://github.com/vishvananda/netns/issues/17#issuecomment-367325770
>>>
>>> - but is still less than ideal performance wise, as it locks out other
>>>   socket operations.
>>>
>>> The socketat() / socketns() would be ideal:
>>>
>>> https://lwn.net/Articles/406684/
>>> https://lwn.net/Articles/407495/
>>> https://lkml.org/lkml/2011/10/3/220
>>>
>>>
>>> One thing that is interesting, the LockOSThread works pretty well for
>>> receiving, since I can wrap it around the socket()/bind()/listen() at
>>> startup. Then accept() can run outside of the lock.
>>>
>>> It's creating new outbound tcp connections via socket()/connect() pairs
>>> that is the issue.
>>
>> As I understand it you should be able to write socketat in go something like:
>>
>> 	runtime.LockOSThread()
>> 	err = netns.Set(newns);
>> 	fd = socket(...);
>> 	err = netns.Set(defaultns);
>> 	runtime.UnlockOSThread()
>
> Yeah, this is currently what I'm having to do. It's painful because due
> to the Go runtime model of a single OS netpoller thread, locking the OS
> thread to the current goroutine blocks out the other goroutines doing
> network I/O.

Just to be clear you know that only the setns and the socket calls need
to block out switching threads and all of those should be currently
quite fast.

Hmm. So this is a global Go lock and not simply locking the current
goroutine onto its current kernel thread? Yes that does sound quite
painful.

It would be very nice if Go could provide an idiom where a series of
calls could be fixed to a single kernel thread.

>> I have no real objections to a kernel system call doing that. It has
>> just never risen to the level where it was necessary to optimize
>> userspace yet.
>
> Would you be able to accept the patch from this thread with the
> container API?
>
> 	fd = container_socket(int container_fd,
> 			      int domain, int type, int protocol);
>
> I think that seems more coherent with the rest of the container world
> than a follow up of https://lkml.org/lkml/2011/10/3/220 :
>

Given container_socket implies the need to create a namespace of
namespaces. No.

Given that container_socket can't be used in iptools because it has a
different concept of container. No.

Given that no one has ever proposed solving the entire migration story
when they have wanted to define a container and thus all of this implies
breaking CRIU. No.

> 	int socketns(int netns_fd, int domain, int type, int protocol)
>

Yes please.

I suspect in the current world where system calls are much more
expensive (because of mitigations for speculative execution bugs) with a
little bit of timing we could come up with a reasonable case even for
non-Go runtimes.

To that end I would like to see performance numbers of at least a micro
benchmark in C. Just so we can quantify the improvement.

Eric

^ permalink raw reply [flat|nested] 5+ messages in thread
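A micro-benchmark of the kind requested here could start from something like the following sketch. It is an illustration only: bench_plain_socket() and bench_setns_socket() are hypothetical names, the setns path needs CAP_SYS_ADMIN plus two namespace file descriptors supplied by the caller, and no measured numbers are implied.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e9 + (b->tv_nsec - a->tv_nsec);
}

/*
 * Baseline: mean cost in ns of socket() + close() in the current
 * namespace.  Returns -1.0 if socket() fails.
 */
double bench_plain_socket(int domain, int iters)
{
	struct timespec t0, t1;
	int i, fd;

	assert(iters > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		fd = socket(domain, SOCK_STREAM, 0);
		if (fd < 0)
			return -1.0;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return elapsed_ns(&t0, &t1) / iters;
}

/*
 * Emulated socketns(): setns() into target_fd, socket(), setns() back to
 * home_fd, close().  Both fds are namespace files opened by the caller
 * (for example from /proc/<pid>/ns/net).  Returns the mean cost in ns,
 * or -1.0 if entering the namespace or creating the socket fails.
 */
double bench_setns_socket(int target_fd, int home_fd, int domain, int iters)
{
	struct timespec t0, t1;
	int i, fd;

	assert(iters > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		if (setns(target_fd, CLONE_NEWNET) < 0)
			return -1.0;
		fd = socket(domain, SOCK_STREAM, 0);
		if (setns(home_fd, CLONE_NEWNET) < 0)
			abort();	/* cannot return to the home namespace */
		if (fd < 0)
			return -1.0;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return elapsed_ns(&t0, &t1) / iters;
}
```

The difference between the two results approximates what a single-call socketns() could save per socket. Actual figures will depend on the kernel version and on which speculative-execution mitigations are enabled.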
* [RFC PATCH 00/27] Containers and using authenticated filesystems
@ 2019-02-15 16:07 David Howells
2019-02-15 16:07 ` [RFC PATCH 05/27] containers: Open a socket inside a container David Howells
0 siblings, 1 reply; 5+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC (permalink / raw)
To: keyrings, trond.myklebust, sfrench
Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
dhowells, linux-kernel
Here's a collection of patches that containerises the kernel keys and makes
it possible to separate keys by namespace. This can be extended to any
filesystem that uses request_key() to obtain the pertinent authentication
token on entry to VFS or socket methods.
I have this working with AFS and AF_RXRPC so far, but it could be extended
to other filesystems, such as NFS and CIFS.
The following changes are made:
(1) Add optional namespace tags to a key's index_key. This allows the
following:
(a) Automatic invalidation of all keys with that tag when the
namespace is removed.
(b) Mixing of keys with the same description, but different areas of
operation within a keyring.
(c) Sharing of cache keyrings, such as the DNS lookup cache.
(d) Diversion of upcalls based on namespace criteria.
(2) Provide each network namespace with a tag that can be used with (1).
This is used by the DNS query, rxrpc, nfs idmapper keys.
[!] Note that it might still be better to move these keyrings into the
network namespace.
(3) Provide key ACLs. These allow:
(a) The permissions can be split more finely, in particular separating
out Invalidate and Join.
(b) Permits to be granted to non-standard subjects. So, for instance,
Search permission could be granted to a container object, allowing
a search of the container keyring by a denizen of the container to
find a key that they can't otherwise see.
(4) Provide a kernel container object. Currently, this is created with a
system call and passed flags that indicate the namespaces to be
inherited or replaced. It might be better to actually use something
like fsconfig() to configure the container by setting key=val type
options.
The kernel container object provides the following facilities:
(a) request_key upcall interception. The manager of a container can
intercept requests made inside the container and, using a series
of filters, can cause the authkeys to be placed into keyrings that
serve as queues for one or more upcall processing programs. These
upcall programs use key notifications to monitor those keyrings.
(b) Per-container keyring. A keyring can be attached to the container
such that this is searched by a request_key() performed by a
denizen of the container after searching the thread, process and
session keyrings. The keyring and the keys contained therein must
be granted Search for that container.
This allows:
(i) Authenticated filesystems to be used transparently inside of
the container without any cooperation from the occupant
thereof. All the key maintenance can be done by the manager.
(ii) Keys to be made available to the denizens of a container (by
granting extra permissions to the container subject).
(c) Per-container ID that can be used in audit messages.
(d) Container object creation gives the manager a file descriptor that
can:
(i) Be passed to a dirfd parameter to a VFS syscall, such as
mkdirat(), allowing an operation to be done inside the
container.
(ii) Be passed to fsopen()/fsconfig() to indicate that the target
filesystem is going to be created inside a container, in that
container's namespaces.
(iii) Be passed to the move_mount() syscall as a destination for
setting the root filesystem inside a new mount namespace made
upon container creation.
(e) The ability to configure the container with namespaces or
whatever, and then fork a process into that container to 'boot'
it.
Three sample programs are provided:
(1) test-container. This:
- Creates a kernel container with a blank mount ns.
- Creates its root mount and moves it to the container root.
- Mounts /proc therein.
- Creates a keyring called "_container"
- Sets that as the container keyring.
- Grants Search permission to the container on that keyring.
- Removes owner permission on that keyring.
- Creates a sample user key "foobar" in the container keyring.
- Grants various permissions to the container on that key.
- Creates a keyring called "upcall"
- Intercepts "user" key upcalls from the container to there.
- Forks a process into the container
- Prints the container keyring ID if it can
- Exec's bash.
This program expects to be given the device name for a partition it
can mount as the root and expects it to contain things like /etc,
/bin, /sbin, /lib, /usr containing programs that can be run and /proc
to mount procfs upon. E.g.:
./test-container /dev/sda3
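As a rough illustration of the two key-creation steps in that list, the sketch below uses only the existing add_key(2) system call. It makes no attempt to show the container, ACL or upcall-interception steps, which depend on interfaces proposed in this series. The function names are hypothetical, and placing the keyring in the session keyring is an assumption based on where test-cont-grant expects to find it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/keyctl.h>	/* KEY_SPEC_SESSION_KEYRING */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the add_key(2) system call. */
static long add_key_raw(const char *type, const char *desc,
			const void *payload, size_t plen, int32_t keyring)
{
	return syscall(SYS_add_key, type, desc, payload, plen, keyring);
}

/*
 * Create a keyring called "_container" in the session keyring.
 * Returns the new key ID, or -1 with errno set.
 */
long make_container_keyring(void)
{
	return add_key_raw("keyring", "_container", NULL, 0,
			   KEY_SPEC_SESSION_KEYRING);
}

/*
 * Create a sample "user" key called "foobar" in the given keyring.
 * The payload is arbitrary illustration data.
 * Returns the new key ID, or -1 with errno set.
 */
long add_sample_user_key(int32_t keyring)
{
	static const char payload[] = "sample";

	return add_key_raw("user", "foobar", payload, sizeof(payload) - 1,
			   keyring);
}
```

Programs that link against libkeyutils would normally call its add_key() wrapper instead of the raw system call.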
(2) test-upcall. This is a service program that monitors the "upcall"
keyring created by test-container for authkeys appearing, which it
then hands off to /sbin/request-key. This:
- Opens /dev/watch_queue.
- Sets the size to 1 page.
- Sets a filter to watch for "Link creation" key events.
- Sets a watch on the upcall keyring.
- Polls the watch queue for events
- When an event comes in:
- Gets the authkey ID from the event buffer.
- Queries the authkey.
- Forks off a handler which:
- Moves the authkey to its thread keyring
- Sets up a new session keyring with the authkey in it.
- Execs /sbin/request-key.
This can be run in a shell that shares the session keyring with
test-container, from which it will find the upcall keyring.
Alternatively, the keyring ID can be provided on the command line:
./test-upcall [<upcall-keyring>]
It can be triggered from inside of the container with something like:
keyctl request2 user debug:e a @s
and something like:
ptrs h=4 t=2 m=2000003
NOTIFY[00000004-00000002] ty=0003 sy=0002 i=01000010
KEY 78543393 change=2 aux=141053003
Authentication key 141053003
- create 779280685
- uid=0 gid=0
- rings=0,0,798528519
- callout='a'
RQDebug keyid: 779280685
RQDebug desc: debug:e
RQDebug callout: a
RQDebug session keyring: 798528519
will appear on stdout/stderr from it and /sbin/request-key.
(3) test-cont-grant. This is a program to make the nominated key
available to a container's denizens. It:
- Grants search permission to the nominated key.
- Links the nominated key into the container keyring.
It can be run from outside of the container like so:
./test-cont-grant <key> [<container-keyring>]
If the keyring isn't given, it will look for one called "_container"
in the session keyring where test-container is expected to have placed
it.
With kAFS, it can be used as follows:
kinit dhowells@REDHAT.COM
kafs-aklog redhat.com
which would log into Kerberos and then get a key for accessing an AFS
cell called "redhat.com". This can be seen in the session keyring by
calling "keyctl show":
120378984 --alswrv 0 0 keyring: _ses
474754113 ---lswrv 0 65534 \_ keyring: _uid.0
64049961 --alswrv 0 0 \_ rxrpc: afs@redhat.com
78543393 --alswrv 0 0 \_ keyring: upcall
661655334 --alswrv 0 0 \_ keyring: _container
639103010 --alswrv 0 0 \_ user: foobar
Then doing:
./test-cont-grant 64049961
will result in:
120378984 --alswrv 0 0 keyring: _ses
474754113 ---lswrv 0 65534 \_ keyring: _uid.0
64049961 --alswrv 0 0 \_ rxrpc: afs@redhat.com
78543393 --alswrv 0 0 \_ keyring: upcall
661655334 --alswrv 0 0 \_ keyring: _container
639103010 --alswrv 0 0 \_ user: foobar
64049961 --alswrv 0 0 \_ rxrpc: afs@redhat.com
Inside the container, the cell could be mounted:
mount -t afs "%redhat.com:root.cell" /mnt
and then operations in /mnt will be done using the token that has been
made available. However, this can be overridden locally inside the
container by doing kinit and kafs-aklog there with a different user.
More to the point, the container manager could mount the container's
rootfs, say, over authenticated AFS and then attach the token to the
container and mount the rootfs into the container and the container's
inhabitant need not have any means to gain a kerberos login.
[?] I do wonder if the possibility to use container key searches for
direct mounts should be controlled by a mount option, say:
fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
where you have to have the container handle available.
[!] Note that test-cont-grant picks the container by name and does not
require the container handle when setting the key ACL - but the
name must come from the set of children of the current container.
The patches can be found here also:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=container
Note that this is dependent on the mount-api-viro, fsinfo, notifications
and keys-namespace branches.
David
---
David Howells (27):
containers: Rename linux/container.h to linux/container_dev.h
containers: Implement containers as kernel objects
containers: Provide /proc/containers
containers: Allow a process to be forked into a container
containers: Open a socket inside a container
containers, vfs: Allow syscall dirfd arguments to take a container fd
containers: Make fsopen() able to create a superblock in a container
containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS
vfs: Allow mounting to other namespaces
containers: Provide fs_context op for container setting
containers: Sample program for driving container objects
containers: Allow a daemon to intercept request_key upcalls in a container
keys: Provide a keyctl to query a request_key authentication key
keys: Break bits out of key_unlink()
keys: Make __key_link_begin() handle lockdep nesting
keys: Grant Link permission to possessers of request_key auth keys
keys: Add a keyctl to move a key between keyrings
keys: Find the least-recently used unseen key in a keyring.
containers: Sample: request_key upcall handling
container, keys: Add a container keyring
keys: Fix request_key() lack of Link perm check on found key
KEYS: Replace uid/gid/perm permissions checking with an ACL
KEYS: Provide KEYCTL_GRANT_PERMISSION
keys: Allow a container to be specified as a subject in a key's ACL
keys: Provide a way to ask for the container keyring
keys: Allow containers to be included in key ACLs by name
containers: Sample to grant access to a key in a container
arch/x86/entry/syscalls/syscall_32.tbl | 3
arch/x86/entry/syscalls/syscall_64.tbl | 3
arch/x86/ia32/sys_ia32.c | 2
certs/blacklist.c | 7
certs/system_keyring.c | 12
drivers/acpi/container.c | 2
drivers/base/container.c | 2
drivers/md/dm-crypt.c | 2
drivers/nvdimm/security.c | 2
fs/afs/security.c | 2
fs/afs/super.c | 18 +
fs/cifs/cifs_spnego.c | 25 +
fs/cifs/cifsacl.c | 28 +
fs/cifs/connect.c | 4
fs/crypto/keyinfo.c | 2
fs/ecryptfs/ecryptfs_kernel.h | 2
fs/ecryptfs/keystore.c | 2
fs/fs_context.c | 39 +
fs/fscache/object-list.c | 2
fs/fsopen.c | 54 ++
fs/namei.c | 45 +-
fs/namespace.c | 129 ++++-
fs/nfs/nfs4idmap.c | 29 +
fs/proc/root.c | 20 +
fs/ubifs/auth.c | 2
include/linux/container.h | 100 +++-
include/linux/container_dev.h | 25 +
include/linux/cred.h | 3
include/linux/fs_context.h | 5
include/linux/init_task.h | 1
include/linux/key-type.h | 2
include/linux/key.h | 122 +++--
include/linux/lsm_hooks.h | 20 +
include/linux/nsproxy.h | 7
include/linux/pid.h | 5
include/linux/proc_ns.h | 6
include/linux/sched.h | 3
include/linux/sched/task.h | 3
include/linux/security.h | 15 +
include/linux/socket.h | 3
include/linux/syscalls.h | 6
include/uapi/linux/container.h | 28 +
include/uapi/linux/keyctl.h | 85 +++
include/uapi/linux/mount.h | 4
init/Kconfig | 7
init/init_task.c | 3
ipc/mqueue.c | 10
kernel/Makefile | 2
kernel/container.c | 532 ++++++++++++++++++++
kernel/cred.c | 45 ++
kernel/exit.c | 1
kernel/fork.c | 111 ++++
kernel/namespaces.h | 15 +
kernel/nsproxy.c | 32 +
kernel/pid.c | 4
kernel/sys_ni.c | 5
lib/digsig.c | 2
net/ceph/ceph_common.c | 2
net/compat.c | 2
net/dns_resolver/dns_key.c | 12
net/dns_resolver/dns_query.c | 15 -
net/rxrpc/key.c | 16 -
net/socket.c | 34 +
samples/vfs/Makefile | 12
samples/vfs/test-cont-grant.c | 84 +++
samples/vfs/test-container.c | 382 ++++++++++++++
samples/vfs/test-upcall.c | 243 +++++++++
security/integrity/digsig.c | 31 -
security/integrity/digsig_asymmetric.c | 2
security/integrity/evm/evm_crypto.c | 2
security/integrity/ima/ima_mok.c | 13
security/integrity/integrity.h | 4
.../integrity/platform_certs/platform_keyring.c | 13
security/keys/Makefile | 2
security/keys/compat.c | 20 +
security/keys/container.c | 419 ++++++++++++++++
security/keys/encrypted-keys/encrypted.c | 2
security/keys/encrypted-keys/masterkey_trusted.c | 2
security/keys/gc.c | 2
security/keys/internal.h | 34 +
security/keys/key.c | 35 -
security/keys/keyctl.c | 176 +++++--
security/keys/keyring.c | 198 ++++++-
security/keys/permission.c | 446 +++++++++++++++--
security/keys/persistent.c | 27 +
security/keys/proc.c | 17 -
security/keys/process_keys.c | 102 +++-
security/keys/request_key.c | 70 ++-
security/keys/request_key_auth.c | 21 +
security/security.c | 12
security/selinux/hooks.c | 16 +
security/smack/smack_lsm.c | 3
92 files changed, 3696 insertions(+), 425 deletions(-)
create mode 100644 include/linux/container_dev.h
create mode 100644 include/uapi/linux/container.h
create mode 100644 kernel/container.c
create mode 100644 kernel/namespaces.h
create mode 100644 samples/vfs/test-cont-grant.c
create mode 100644 samples/vfs/test-container.c
create mode 100644 samples/vfs/test-upcall.c
create mode 100644 security/keys/container.c
^ permalink raw reply [flat|nested] 5+ messages in thread
* [RFC PATCH 05/27] containers: Open a socket inside a container
2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
@ 2019-02-15 16:07 ` David Howells
2019-02-19 16:41 ` Eric W. Biederman
0 siblings, 1 reply; 5+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC (permalink / raw)
To: keyrings, trond.myklebust, sfrench
Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
dhowells, linux-kernel

Provide a system call to open a socket inside of a container, using that
container's network namespace. This allows netlink to be used to manage
the container.

	fd = container_socket(int container_fd,
			      int domain, int type, int protocol);

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 include/linux/socket.h                 |    3 ++-
 include/linux/syscalls.h               |    2 ++
 kernel/sys_ni.c                        |    1 +
 net/compat.c                           |    2 +-
 net/socket.c                           |   34 +++++++++++++++++++++++++++-----
 7 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 8666693510f9..f4c9beff77a6 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -409,3 +409,4 @@
 395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
 396	i386	container_create	sys_container_create		__ia32_sys_container_create
 397	i386	fork_into_container	sys_fork_into_container		__ia32_sys_fork_into_container
+398	i386	container_socket	sys_container_socket		__ia32_sys_container_socket
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d40d4790fcb2..e20cdf7b5527 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -354,6 +354,7 @@
 343	common	sb_notify		__x64_sys_sb_notify
 344	common	container_create	__x64_sys_container_create
 345	common	fork_into_container	__x64_sys_fork_into_container
+346	common	container_socket	__x64_sys_container_socket

 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ab2041a00e01..154ac900a8a5 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -10,6 +10,7 @@
 #include <linux/compiler.h>		/* __user			*/
 #include <uapi/linux/socket.h>

+struct net;
 struct pid;
 struct cred;

@@ -376,7 +377,7 @@ extern int __sys_sendto(int fd, void __user *buff, size_t len,
 			int addr_len);
 extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
 			 int __user *upeer_addrlen, int flags);
-extern int __sys_socket(int family, int type, int protocol);
+extern int __sys_socket(struct net *net, int family, int type, int protocol);
 extern int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen);
 extern int __sys_connect(int fd, struct sockaddr __user *uservaddr,
 			 int addrlen);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 15e5cc704df3..547334c6ffc2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -947,6 +947,8 @@ asmlinkage long sys_container_create(const char __user *name, unsigned int flags
 				     unsigned long spare3, unsigned long spare4,
 				     unsigned long spare5);
 asmlinkage long sys_fork_into_container(int containerfd);
+asmlinkage long sys_container_socket(int containerfd,
+				     int domain, int type, int protocol);

 /*
  * Architecture-specific system calls
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a23ad529d548..ce9c5bb30e7f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -236,6 +236,7 @@ COND_SYSCALL(shmdt);
 /* net/socket.c */
 COND_SYSCALL(socket);
 COND_SYSCALL(socketpair);
+COND_SYSCALL(container_socket);
 COND_SYSCALL(bind);
 COND_SYSCALL(listen);
 COND_SYSCALL(accept);
diff --git a/net/compat.c b/net/compat.c
index 959d1c51826d..1b2db740fd33 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -856,7 +856,7 @@ COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)

 	switch (call) {
 	case SYS_SOCKET:
-		ret = __sys_socket(a0, a1, a[2]);
+		ret = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
 		break;
 	case SYS_BIND:
 		ret = __sys_bind(a0, compat_ptr(a1), a[2]);
diff --git a/net/socket.c b/net/socket.c
index 7d271a1d0c7e..7406580598b9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -80,6 +80,7 @@
 #include <linux/highmem.h>
 #include <linux/mount.h>
 #include <linux/fs_context.h>
+#include <linux/container.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/compat.h>
@@ -1326,9 +1327,9 @@ int sock_create_kern(struct net *net, int family, int type, int protocol, struct
 }
 EXPORT_SYMBOL(sock_create_kern);

-int __sys_socket(int family, int type, int protocol)
+int __sys_socket(struct net *net, int family, int type, int protocol)
 {
-	int retval;
+	long retval;
 	struct socket *sock;
 	int flags;

@@ -1346,7 +1347,7 @@ int __sys_socket(int family, int type, int protocol)
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
 		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

-	retval = sock_create(family, type, protocol, &sock);
+	retval = __sock_create(net, family, type, protocol, &sock, 0);
 	if (retval < 0)
 		return retval;

@@ -1355,9 +1356,32 @@ int __sys_socket(int family, int type, int protocol)

 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 {
-	return __sys_socket(family, type, protocol);
+	return __sys_socket(current->nsproxy->net_ns, family, type, protocol);
 }

+/*
+ * Create a socket inside a container.
+ */
+#ifdef CONFIG_CONTAINERS
+SYSCALL_DEFINE4(container_socket,
+		int, containerfd, int, family, int, type, int, protocol)
+{
+	struct fd f = fdget(containerfd);
+	long ret;
+
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (is_container_file(f.file)) {
+		struct container *c = f.file->private_data;
+
+		ret = __sys_socket(c->ns->net_ns, family, type, protocol);
+	}
+	fdput(f);
+	return ret;
+}
+#endif
+
 /*
  * Create a pair of connected sockets.
  */
@@ -2555,7 +2579,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)

 	switch (call) {
 	case SYS_SOCKET:
-		err = __sys_socket(a0, a1, a[2]);
+		err = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
 		break;
 	case SYS_BIND:
 		err = __sys_bind(a0, (struct sockaddr __user *)a1, a[2]);

^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [RFC PATCH 05/27] containers: Open a socket inside a container
  2019-02-15 16:07 ` [RFC PATCH 05/27] containers: Open a socket inside a container David Howells
@ 2019-02-19 16:41   ` Eric W. Biederman
  0 siblings, 0 replies; 5+ messages in thread
From: Eric W. Biederman @ 2019-02-19 16:41 UTC
To: David Howells
Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
    linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

David Howells <dhowells@redhat.com> writes:

> Provide a system call to open a socket inside of a container, using that
> container's network namespace.  This allows netlink to be used to manage
> the container.
>
> 	fd = container_socket(int container_fd,
> 			      int domain, int type, int protocol);
>

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Use a namespace file descriptor if you need this.  So far we have not
added this system call as it is just a performance optimization.  And it
has been too niche to matter.

If that has changed we can add this separately from everything else
you are doing here.

> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 +
>  include/linux/socket.h                 |    3 ++-
>  include/linux/syscalls.h               |    2 ++
>  kernel/sys_ni.c                        |    1 +
>  net/compat.c                           |    2 +-
>  net/socket.c                           |   34 +++++++++++++++++++++++++++-----
>  7 files changed, 37 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 8666693510f9..f4c9beff77a6 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -409,3 +409,4 @@
>  395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
>  396	i386	container_create	sys_container_create		__ia32_sys_container_create
>  397	i386	fork_into_container	sys_fork_into_container		__ia32_sys_fork_into_container
> +398	i386	container_socket	sys_container_socket		__ia32_sys_container_socket
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index d40d4790fcb2..e20cdf7b5527 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -354,6 +354,7 @@
>  343	common	sb_notify		__x64_sys_sb_notify
>  344	common	container_create	__x64_sys_container_create
>  345	common	fork_into_container	__x64_sys_fork_into_container
> +346	common	container_socket	__x64_sys_container_socket
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index ab2041a00e01..154ac900a8a5 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -10,6 +10,7 @@
>  #include <linux/compiler.h>		/* __user			*/
>  #include <uapi/linux/socket.h>
>
> +struct net;
>  struct pid;
>  struct cred;
>
> @@ -376,7 +377,7 @@ extern int __sys_sendto(int fd, void __user *buff, size_t len,
>  			int addr_len);
>  extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
>  			 int __user *upeer_addrlen, int flags);
> -extern int __sys_socket(int family, int type, int protocol);
> +extern int __sys_socket(struct net *net, int family, int type, int protocol);
>  extern int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen);
>  extern int __sys_connect(int fd, struct sockaddr __user *uservaddr,
>  			 int addrlen);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 15e5cc704df3..547334c6ffc2 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -947,6 +947,8 @@ asmlinkage long sys_container_create(const char __user *name, unsigned int flags
>  				     unsigned long spare3, unsigned long spare4,
>  				     unsigned long spare5);
>  asmlinkage long sys_fork_into_container(int containerfd);
> +asmlinkage long sys_container_socket(int containerfd,
> +				     int domain, int type, int protocol);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index a23ad529d548..ce9c5bb30e7f 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -236,6 +236,7 @@ COND_SYSCALL(shmdt);
>  /* net/socket.c */
>  COND_SYSCALL(socket);
>  COND_SYSCALL(socketpair);
> +COND_SYSCALL(container_socket);
>  COND_SYSCALL(bind);
>  COND_SYSCALL(listen);
>  COND_SYSCALL(accept);
> diff --git a/net/compat.c b/net/compat.c
> index 959d1c51826d..1b2db740fd33 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -856,7 +856,7 @@ COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
>
>  	switch (call) {
>  	case SYS_SOCKET:
> -		ret = __sys_socket(a0, a1, a[2]);
> +		ret = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
>  		break;
>  	case SYS_BIND:
>  		ret = __sys_bind(a0, compat_ptr(a1), a[2]);
> diff --git a/net/socket.c b/net/socket.c
> index 7d271a1d0c7e..7406580598b9 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -80,6 +80,7 @@
>  #include <linux/highmem.h>
>  #include <linux/mount.h>
>  #include <linux/fs_context.h>
> +#include <linux/container.h>
>  #include <linux/security.h>
>  #include <linux/syscalls.h>
>  #include <linux/compat.h>
> @@ -1326,9 +1327,9 @@ int sock_create_kern(struct net *net, int family, int type, int protocol, struct
>  }
>  EXPORT_SYMBOL(sock_create_kern);
>
> -int __sys_socket(int family, int type, int protocol)
> +int __sys_socket(struct net *net, int family, int type, int protocol)
>  {
> -	int retval;
> +	long retval;
>  	struct socket *sock;
>  	int flags;
>
> @@ -1346,7 +1347,7 @@ int __sys_socket(int family, int type, int protocol)
>  	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
>  		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
>
> -	retval = sock_create(family, type, protocol, &sock);
> +	retval = __sock_create(net, family, type, protocol, &sock, 0);
>  	if (retval < 0)
>  		return retval;
>
> @@ -1355,9 +1356,32 @@ int __sys_socket(int family, int type, int protocol)
>
>  SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
>  {
> -	return __sys_socket(family, type, protocol);
> +	return __sys_socket(current->nsproxy->net_ns, family, type, protocol);
>  }
>
> +/*
> + * Create a socket inside a container.
> + */
> +#ifdef CONFIG_CONTAINERS
> +SYSCALL_DEFINE4(container_socket,
> +		int, containerfd, int, family, int, type, int, protocol)
> +{
> +	struct fd f = fdget(containerfd);
> +	long ret;
> +
> +	if (!f.file)
> +		return -EBADF;
> +	ret = -EINVAL;
> +	if (is_container_file(f.file)) {
> +		struct container *c = f.file->private_data;
> +
> +		ret = __sys_socket(c->ns->net_ns, family, type, protocol);
> +	}
> +	fdput(f);
> +	return ret;
> +}
> +#endif
> +
>  /*
>   *	Create a pair of connected sockets.
>   */
> @@ -2555,7 +2579,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
>
>  	switch (call) {
>  	case SYS_SOCKET:
> -		err = __sys_socket(a0, a1, a[2]);
> +		err = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
>  		break;
>  	case SYS_BIND:
>  		err = __sys_bind(a0, (struct sockaddr __user *)a1, a[2]);
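
For readers following the thread, the "namespace file descriptor" route that the reply points to can be approximated in userspace C roughly as below. This is only a sketch under stated assumptions, not code from the patch or from anyone in the thread: socket_in_netns() is a hypothetical helper name, and it assumes a Linux system with /proc mounted, /proc/thread-self available, and a caller holding the privileges setns(2) requires (typically CAP_SYS_ADMIN). Because setns(2) only affects the calling thread, the helper saves and restores that thread's own network namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not part of the patch): create a
 * socket in the network namespace named by ns_path, for example
 * "/proc/<pid>/ns/net", then switch the calling thread back to its
 * original namespace.  Returns the socket fd, or -1 with errno set.
 */
int socket_in_netns(const char *ns_path, int domain, int type, int protocol)
{
	int orig_ns, target_ns, sock = -1, saved_errno = 0;

	/* Remember the namespace this thread is in right now. */
	orig_ns = open("/proc/thread-self/ns/net", O_RDONLY | O_CLOEXEC);
	if (orig_ns < 0)
		return -1;

	target_ns = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (target_ns < 0) {
		saved_errno = errno;
		close(orig_ns);
		errno = saved_errno;
		return -1;
	}

	if (setns(target_ns, CLONE_NEWNET) == 0) {
		/* The socket stays bound to the namespace it was created in. */
		sock = socket(domain, type, protocol);
		saved_errno = errno;

		/* Switch back; if this fails the thread is in the wrong
		 * namespace, so report an error rather than a socket. */
		if (setns(orig_ns, CLONE_NEWNET) != 0) {
			saved_errno = errno;
			if (sock >= 0)
				close(sock);
			sock = -1;
		}
	} else {
		saved_errno = errno;
	}

	close(target_ns);
	close(orig_ns);
	if (sock < 0)
		errno = saved_errno;
	return sock;
}
```

A caller would use it along the lines of socket_in_netns("/proc/1234/ns/net", AF_INET, SOCK_STREAM, 0), where 1234 is a placeholder pid. Each call costs two setns(2) transitions and two open(2) calls; that overhead is what a dedicated system call would remove.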
Thread overview: 5+ messages
[not found] <m2o8z7t2w5.fsf@badgerous.net>
2019-09-27 14:46 ` [RFC PATCH 05/27] containers: Open a socket inside a container Eric W. Biederman
2019-09-28 22:29 ` Alun Evans
2019-09-30 10:02 ` Eric W. Biederman
2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
2019-02-15 16:07 ` [RFC PATCH 05/27] containers: Open a socket inside a container David Howells
2019-02-19 16:41 ` Eric W. Biederman