* Interaction user namespace, /proc/1 ownership & cap_set
@ 2013-07-01 16:16 Daniel P. Berrange
From: Daniel P. Berrange @ 2013-07-01 16:16 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Richard Weinberger, Serge Hallyn, Eric W. Biederman
[-- Attachment #1: Type: text/plain, Size: 4762 bytes --]
I'm struggling to debug a strange problem with the interaction between user
namespaces, cap_set and ownership of files in /proc/1/.
I'm using a modified version (attached to this mail) of the demo program
userns_child_exec.c linked on https://lwn.net/Articles/532593/
$ gcc -Wall -o userns_child_exec userns_child_exec.c -lcap
The first, normal execution appears to work just fine (as root):
$ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash
Launching child init
# umount /proc/sys/fs/binfmt_misc
# umount /proc/sys/fs/binfmt_misc
# umount /proc/fs/nfsd
# umount /proc
# mount -t proc proc /proc/
# ls -al /proc/1/environ
-r--------. 1 root root 0 Jul 1 17:04 /proc/1/environ
My modification adds support for a '-c' arg that makes the program use
cap_set_proc() from libcap.so in order to remove the CAP_SYS_MODULE capability.
If I run the program with the '-c' arg present, then the files in
the /proc/1/ directory all end up owned by nfsnobody.nfsnobody
$ ./userns_child_exec -c -p -m -U -M '0 1000 1' -G '0 1000 1' bash
Launching child init
# umount /proc/sys/fs/binfmt_misc
# umount /proc/sys/fs/binfmt_misc
# umount /proc/fs/nfsd
# umount /proc
# mount -t proc proc /proc/
# ls -al /proc/1/environ
-r--------. 1 nfsnobody nfsnobody 0 Jul 1 17:01 /proc/1/environ
Why on earth would calling 'cap_set_proc()' to drop a capability cause
the user/group ownership of files in /proc/1/ to change ?
Any child processes launched from this point get correct ownership
on their /proc/NNN files - only /proc/1/ seems to be affected.
Via strace, we can see the libcap code only calls 3 syscalls:
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP, 0}) = 0
capset({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP, 0}) = 0
though, for added fun, when running the demo program via strace
the problem does not appear :-(
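For anyone who wants to take libcap out of the picture, the capget/capset sequence in the trace can be issued directly. This is only a minimal sketch, assuming Linux with the kernel UAPI headers installed; the helper names read_caps(), cap_is_effective() and drop_cap_raw() are made up for illustration and are not part of the attached program.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/capability.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: fetch the calling thread's capability sets with
   the raw capget() syscall. Returns 0 on success or -errno. */
static int
read_caps(struct __user_cap_header_struct *hdr,
          struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3])
{
    memset(hdr, 0, sizeof(*hdr));
    hdr->version = _LINUX_CAPABILITY_VERSION_3;
    hdr->pid = 0;                       /* 0 means the calling thread */
    if (syscall(SYS_capget, hdr, data) == -1)
        return -errno;
    return 0;
}

/* Hypothetical helper: 1 if 'cap' is in the effective set, 0 if not,
   -errno on failure. */
static int
cap_is_effective(int cap)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
    int r;

    assert(cap >= 0 && cap <= CAP_LAST_CAP);
    r = read_caps(&hdr, data);
    if (r < 0)
        return r;
    return (data[cap >> 5].effective & ((__u32) 1 << (cap & 31))) != 0;
}

/* Hypothetical helper: clear 'cap' from the effective, permitted and
   inheritable sets, like the -c option does via libcap.
   Returns 0 on success or -errno. */
static int
drop_cap_raw(int cap)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
    __u32 mask;
    int r;

    assert(cap >= 0 && cap <= CAP_LAST_CAP);
    mask = (__u32) 1 << (cap & 31);
    r = read_caps(&hdr, data);
    if (r < 0)
        return r;
    data[cap >> 5].effective &= ~mask;
    data[cap >> 5].permitted &= ~mask;
    data[cap >> 5].inheritable &= ~mask;
    if (syscall(SYS_capset, &hdr, data) == -1)
        return -errno;
    return 0;
}
```

Calling drop_cap_raw(CAP_SYS_MODULE) right before execvp() should produce the same capset() call as in the trace above, which makes it easier to tell whether libcap itself plays any role.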
On a slightly related topic, I've noticed that it is not possible to
invoke prctl(PR_CAPBSET_DROP) to clear the bounding set for processes
inside a container. The kernel code uses capable() instead of ns_capable().
Is this intended, or a missing conversion ?
Indeed, even ignoring namespaces for a minute, I'm curious as to why
CAP_SETPCAP is required at all for PR_CAPBSET_DROP ? Is it really
a security risk to allow a non-privileged user to remove bits from
the bounding set ? For KVM I'd like to be able to use PR_CAPBSET_DROP
to prevent a compromised KVM process from using any setuid program to
re-gain any kind of capabilities. Similarly I think a container admin
may well wish to make use of PR_CAPBSET_DROP to lock down applications
there.
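The bounding-set operation in question looks roughly like this from user space. A minimal sketch, assuming Linux and glibc's <sys/prctl.h>; the helper names bounding_set_has() and bounding_set_drop() are invented for illustration. Without CAP_SETPCAP the drop is expected to fail with EPERM, which is the restriction being asked about.

```c
#include <assert.h>
#include <errno.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Hypothetical helper: 1 if 'cap' is in the bounding set of the calling
   thread, 0 if it is not, -errno on failure. */
static int
bounding_set_has(int cap)
{
    int r;

    assert(cap >= 0);
    r = prctl(PR_CAPBSET_READ, (unsigned long) cap, 0UL, 0UL, 0UL);
    return (r == -1) ? -errno : r;
}

/* Hypothetical helper: remove 'cap' from the bounding set.
   Returns 0 on success, or -errno (-EPERM when the caller lacks
   CAP_SETPCAP). */
static int
bounding_set_drop(int cap)
{
    assert(cap >= 0);
    if (prctl(PR_CAPBSET_DROP, (unsigned long) cap, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}
```

A caller would typically check bounding_set_drop(CAP_SYS_MODULE) and treat -EPERM as "not allowed to lock this down here" rather than as a fatal error.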
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
[-- Attachment #2: userns_child_exec.c --]
[-- Type: text/plain, Size: 8120 bytes --]
/* userns_child_exec.c

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Create a child process that executes a shell command in new
   namespace(s); allow UID and GID mappings to be specified when
   creating a user namespace.
*/
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <sys/capability.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

struct child_args {
    char **argv;        /* Command to be executed by child, with arguments */
    int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
};

static int verbose;
static int dropcaps;

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
    fprintf(stderr, "Create a child process that executes a shell command "
            "in a new user namespace,\n"
            "and possibly also other new namespace(s).\n\n");
    fprintf(stderr, "Options can be:\n\n");
#define fpe(str) fprintf(stderr, "    %s", str);
    fpe("-c          Drop caps\n");
    fpe("-i          New IPC namespace\n");
    fpe("-m          New mount namespace\n");
    fpe("-n          New network namespace\n");
    fpe("-p          New PID namespace\n");
    fpe("-u          New UTS namespace\n");
    fpe("-U          New user namespace\n");
    fpe("-M uid_map  Specify UID map for user namespace\n");
    fpe("-G gid_map  Specify GID map for user namespace\n");
    fpe("            If -M or -G is specified, -U is required\n");
    fpe("-v          Display verbose messages\n");
    fpe("\n");
    fpe("Map strings for -M and -G consist of records of the form:\n");
    fpe("\n");
    fpe("    ID-inside-ns   ID-outside-ns   len\n");
    fpe("\n");
    fpe("A map string can contain multiple records, separated by commas;\n");
    fpe("the commas are replaced by newlines before writing to map files.\n");

    exit(EXIT_FAILURE);
}

/* Update the mapping file 'map_file', with the value provided in
   'mapping', a string that defines a UID or GID mapping. A UID or
   GID mapping consists of one or more newline-delimited records
   of the form:

       ID_inside-ns    ID-outside-ns   length

   Requiring the user to supply a string that contains newlines is
   of course inconvenient for command-line use. Thus, we permit the
   use of commas to delimit records in this string, and replace them
   with newlines before writing the string to the file. */

static void
update_map(char *mapping, char *map_file)
{
    int fd;
    size_t j, map_len;      /* 'map_len' is the length of 'mapping' */

    /* Replace commas in mapping string with newlines */

    map_len = strlen(mapping);
    for (j = 0; j < map_len; j++)
        if (mapping[j] == ',')
            mapping[j] = '\n';

    fd = open(map_file, O_RDWR);
    if (fd == -1) {
        fprintf(stderr, "open %s: %s\n", map_file, strerror(errno));
        exit(EXIT_FAILURE);
    }

    /* The whole map must be written in a single write() call */

    if (write(fd, mapping, map_len) != (ssize_t) map_len) {
        fprintf(stderr, "write %s: %s\n", map_file, strerror(errno));
        exit(EXIT_FAILURE);
    }

    close(fd);
}

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings. See
       the comment in main(). We wait for end of file on a pipe that will
       be closed by the parent process once it has updated the mappings. */

    close(args->pipe_fd[1]);    /* Close our descriptor for the write end
                                   of the pipe so that we see EOF when
                                   parent closes its descriptor */
    if (read(args->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr, "Failure in child: read from pipe returned != 0\n");
        exit(EXIT_FAILURE);
    }

    /* Set real and effective IDs to 0 inside the new user namespace */

    if (setreuid(0, 0) < 0)
        errExit("setreuid");
    if (setregid(0, 0) < 0)
        errExit("setregid");

    if (dropcaps) {
        cap_t caps;
        cap_value_t val[] = { CAP_SYS_MODULE };

        /* Remove CAP_SYS_MODULE from all three capability sets */

        caps = cap_get_proc();
        if (caps == NULL)
            errExit("cap_get_proc");
        if (cap_set_flag(caps, CAP_EFFECTIVE, 1, val, CAP_CLEAR) == -1 ||
            cap_set_flag(caps, CAP_PERMITTED, 1, val, CAP_CLEAR) == -1 ||
            cap_set_flag(caps, CAP_INHERITABLE, 1, val, CAP_CLEAR) == -1)
            errExit("cap_set_flag");
        if (cap_set_proc(caps) == -1)
            errExit("cap_set_proc");
        cap_free(caps);
    }

    /* Execute a shell command */

    fprintf(stderr, "Launching child init\n");
    execvp(args->argv[0], args->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child's stack */

int
main(int argc, char *argv[])
{
    int flags, opt;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    char map_path[PATH_MAX];

    /* Parse command-line options. The initial '+' character in
       the final getopt() argument prevents GNU-style permutation
       of command-line options. That's useful, since sometimes
       the 'command' to be executed by this program itself
       has command-line options. We don't want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    while ((opt = getopt(argc, argv, "+imnpucUM:G:v")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'c': dropcaps = 1;                 break;
        case 'v': verbose = 1;                  break;
        case 'M': uid_map = optarg;             break;
        case 'G': gid_map = optarg;             break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    /* -M or -G without -U is nonsensical */

    if ((uid_map != NULL || gid_map != NULL) &&
            !(flags & CLONE_NEWUSER))
        usage(argv[0]);

    args.argv = &argv[optind];

    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child's effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process's capabilities during execve()). */

    if (pipe(args.pipe_fd) == -1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == -1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %ld\n",
                argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                (long) child_pid);
        update_map(uid_map, map_path);
    }
    if (gid_map != NULL) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                (long) child_pid);
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\n", argv[0]);

    exit(EXIT_SUCCESS);
}

* Re: Interaction user namespace, /proc/1 ownership & cap_set
@ 2013-07-01 16:19 Daniel P. Berrange
From: Daniel P. Berrange @ 2013-07-01 16:19 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Richard Weinberger, Serge Hallyn, Eric W. Biederman

On Mon, Jul 01, 2013 at 05:16:25PM +0100, Daniel P. Berrange wrote:
> I'm struggling debugging a strange problem with interaction between user
> namespaces, cap_set and ownership of files in /proc/1/
> [...]

Oops, I should have mentioned that I'm using a 3.9.4 kernel. Basically the
Fedora 3.9.4-303 build, but with CONFIG_XFS_FS=n and CONFIG_USER_NS=y
set in the Kconfig.

Daniel

* Re: Interaction user namespace, /proc/1 ownership & cap_set
@ 2013-07-01 16:24 Richard Weinberger
From: Richard Weinberger @ 2013-07-01 16:24 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, Eric W. Biederman

On 01.07.2013 18:19, Daniel P. Berrange wrote:
> On Mon, Jul 01, 2013 at 05:16:25PM +0100, Daniel P. Berrange wrote:
>> I'm struggling debugging a strange problem with interaction between user
>> namespaces, cap_set and ownership of files in /proc/1/
>> [...]
>
> Opps, I should have mentioned that I'm using 3.9.4 kernel. Basically the
> Fedora 3.9.4-303 build, but with CONFIG_XFS_FS=n and CONFIG_USER_NS=y
> set in the Kconfig.

FWIW, I can reproduce the issue on v3.10 vanilla.

Thanks,
//richard

* Re: Interaction user namespace, /proc/1 ownership & cap_set
@ 2013-07-02 5:14 Gao feng
From: Gao feng @ 2013-07-02 5:14 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, Eric W. Biederman

On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
> I'm struggling debugging a strange problem with interaction between user
> namespaces, cap_set and ownership of files in /proc/1/
>

This problem occurs after we call setuid/setgid.

For example, a task whose pid is 1234 calls
setregid(10,10);
setreuid(10,10);

The uid/gid of /proc/1234 is 10:0
ll /proc/1234 -d
dr-xr-xr-x 8 uucp wheel 0 Jul 2 10:57 /proc/1234

The files under /proc/1234 end up with two kinds of ownership:
ll /proc/1234
dr-xr-xr-x 2 uucp wheel 0 Jul 2 10:58 attr
-rw-r--r-- 1 root root 0 Jul 2 10:58 autogroup
...
dr-xr-xr-x 5 uucp wheel 0 Jul 2 10:58 net
dr-x--x--x 2 root root 0 Jul 2 10:58 ns
...
dr-xr-xr-x 3 uucp wheel 0 Jul 2 10:58 task

I checked pid_revalidate and found that the owner of the files under /proc/<pid>
is set to GLOBAL_ROOT_UID if the task has executed setuid/setgid (task_dumpable is false).
Is this what we expect? Why?

With a user namespace, the owner of /proc/1/* is incorrect, and
after a task calls setuid/setgid in a user namespace, the owner of
/proc/<pid-of-this-task>/* is incorrect too.
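
The dumpable flag mentioned above can be toggled and the resulting /proc ownership inspected from user space. A minimal sketch, assuming Linux with /proc mounted; the helper names set_dumpable() and proc_entry_owner() are invented for illustration, and what owner a given kernel reports for a non-dumpable task is deliberately not asserted here.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the dumpable flag of the calling process and
   return the value the kernel reports afterwards, or -errno. */
static int
set_dumpable(int on)
{
    int r;

    if (prctl(PR_SET_DUMPABLE, on ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
        return -errno;
    r = prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
    return (r == -1) ? -errno : r;
}

/* Hypothetical helper: look up the owner of /proc/<pid>/<name>.
   Returns 0 and fills in *uid and *gid, or returns -errno. */
static int
proc_entry_owner(pid_t pid, const char *name, uid_t *uid, gid_t *gid)
{
    char path[128];
    struct stat st;
    int n;

    assert(name != NULL && uid != NULL && gid != NULL);
    n = snprintf(path, sizeof(path), "/proc/%ld/%s", (long) pid, name);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -ENAMETOOLONG;
    if (stat(path, &st) == -1)
        return -errno;
    *uid = st.st_uid;
    *gid = st.st_gid;
    return 0;
}
```

Comparing proc_entry_owner(getpid(), "environ", ...) before and after set_dumpable(0) as an unprivileged user is one way to observe the ownership switch described in this message.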

* Re: Interaction user namespace, /proc/1 ownership & cap_set
@ 2013-07-02 8:44 Eric W. Biederman
From: Eric W. Biederman @ 2013-07-02 8:44 UTC (permalink / raw)
To: Gao feng
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> [...]
> I checked the pre_revalidate and found the owner of the files under /proc/<pid>
> will be set to the GLOBAL_ROOT_UID if the task executed setuid/setgid(task_dumpable is false).
> Is this what we expected? why?

Expected yes. Perfect perhaps not.

That piece of code has not been examined to see if it is safe to use
make_kuid(task_user_ns(task), 0), instead of GLOBAL_ROOT_UID.

> For user namespace,the owner of /proc/1/* is incorrect and
> after task call setuid/gid in user namespace, the owner of /proc/<pid-of-this-task>/* is incorrect
> too.

From the current semantics of dumpable, GLOBAL_ROOT_UID is correct.

Please double check but I believe /proc/self should continue to work,
despite this.

The practical question is whether anything can go wrong if we allow
the root of the user namespace of the process to read it. Especially since
several permission changes can happen, a process may stop being dumpable
before we enter the user namespace.

So it is not immediately clear that relaxing the dumpable rules is safe.

Eric
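
To illustrate what mapping "uid 0 of the task's user namespace" to an outside ID means, here is a small user-space analogue that walks uid_map-style records of the form used with -M earlier in the thread. This is only a sketch of the idea, not the kernel's make_kuid() implementation; the function name map_id_to_parent() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: translate 'id' as seen inside a user namespace to
   the ID outside it, using map text made of whitespace-separated records
   "ID-inside-ns ID-outside-ns len". Returns 0 and stores the result in
   *out, or returns -1 if 'id' is not covered by any record. */
static int
map_id_to_parent(const char *map, unsigned long id, unsigned long *out)
{
    unsigned long inside, outside, len;
    int used;

    assert(map != NULL && out != NULL);
    while (sscanf(map, " %lu %lu %lu%n", &inside, &outside, &len, &used) == 3) {
        if (id >= inside && id - inside < len) {
            *out = outside + (id - inside);
            return 0;
        }
        map += used;            /* Advance to the next record */
    }
    return -1;
}
```

With the map "0 1000 1" from the original report, ID 0 inside the namespace translates to 1000 outside, and every other ID is unmapped, which is why unmapped owners show up as the overflow user (nfsnobody on that system).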

* Re: Interaction user namespace, /proc/1 ownership & cap_set
@ 2013-07-02 8:56 Richard Weinberger
From: Richard Weinberger @ 2013-07-02 8:56 UTC (permalink / raw)
To: Eric W. Biederman
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

On 02.07.2013 10:44, Eric W. Biederman wrote:
> [...]
> From the current semantics of dumpable GLOBAL_ROOT_UID is correct.
>
> Please double check but I believe /proc/self should continue to work,
> despite this.

/proc/self is not an option. systemd (in particular some of its tools with
pid != 1) reads /proc/1/environ to find out what environment variables PID 1
got, in order to detect LXC and other virtualization environments.
With userns enabled this check fails and systemd goes nuts because it thinks
that it lives on top of a "real" Linux.

Thanks,
//richard
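
The kind of check described here boils down to scanning the NUL-separated contents of /proc/1/environ for one variable. A minimal sketch of that scan on an in-memory buffer; the function name environ_lookup() is hypothetical, and the variable name "container" used in the note below is an assumption that is not stated in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search a buffer in /proc/<pid>/environ format
   (NAME=value records, each terminated by a NUL byte) for 'name'.
   Returns a pointer to the NUL-terminated value inside 'buf', or NULL
   if the variable is absent or its record is not terminated. */
static const char *
environ_lookup(const char *buf, size_t len, const char *name)
{
    const char *p, *end;
    size_t nlen;

    assert(buf != NULL && name != NULL);
    nlen = strlen(name);
    p = buf;
    end = buf + len;
    while (p < end) {
        size_t elen = strnlen(p, (size_t) (end - p));

        if (elen == (size_t) (end - p))
            break;              /* Last record lacks its NUL; ignore it */
        if (elen > nlen && p[nlen] == '=' && memcmp(p, name, nlen) == 0)
            return p + nlen + 1;
        p += elen + 1;          /* Skip the record and its terminator */
    }
    return NULL;
}
```

A caller would read /proc/1/environ into a buffer first and then call, for example, environ_lookup(buf, nread, "container"). When the file is owned by an unmapped user, the open() or read() fails before this scan is ever reached, which matches the failure described above.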

* Re: Interaction user namespace, /proc/1 ownership & cap_set
@ 2013-07-02 9:25 Daniel P. Berrange
From: Daniel P. Berrange @ 2013-07-02 9:25 UTC (permalink / raw)
To: Richard Weinberger
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, Eric W. Biederman

On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
> On 02.07.2013 10:44, Eric W. Biederman wrote:
> > Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> >
> >> This problem is occured after we call setuid/gid.
> >>
> >> for example, a task whose pid is 1234 calls
> >> setregid(10,10);
> >> setreuid(10,10);

It seems to get reset to the right values (0:0) when we execve()
the init binary though. This doesn't happen if we have invoked
the capset() syscall in between the setregid & the execve() calls.

> > [...]
> > Please double check but I believe /proc/self should continue to work,
> > despite this.
>
> /proc/self is not an option. systemd (in particular some of it's tools with pid != 1) read from /proc/1/environ to find out
> what environment variables it got to detect LXC and other visualization environments.
> With userns enabled this check fails and systemd goes nuts because it thinks that it lives on top of a "real" Linux.

I don't even see how /proc/self would solve this, since it is just a
symlink pointing to /proc/1 in this scenario, so the ownership of files
at /proc/1/XXXX would still be wrong.

This isn't really a systemd-specific problem either; I think any app would
expect to be able to read its own files under /proc/$PID/

Daniel
[parent not found: <20130702092554.GD2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <20130702092554.GD2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-07-02 9:45 ` Richard Weinberger
2013-07-02 9:57 ` Eric W. Biederman
1 sibling, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2013-07-02 9:45 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, Eric W. Biederman

Am 02.07.2013 11:25, schrieb Daniel P. Berrange:
> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>>
>>>> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
>>>>> I'm struggling debugging a strange problem with interaction between user
>>>>> namespaces, cap_set and ownership of files in /proc/1/
>>>>>
>>>>
>>>> This problem is occured after we call setuid/gid.
>>>>
>>>> for example, a task whose pid is 1234 calls
>>>> setregid(10,10);
>>>> setreuid(10,10);
>
> It seems to get reset to the right values (0:0) when we execve()
> the init binary though. This doesn't happen if we have invoked
> the capset() syscall in between the setregid & the execve() calls.
>
>>>>
>>>>
>>>> The uid/gid of the /proc/1234 is 10:0
>>>> ll /proc/1234 -d
>>>> dr-xr-xr-x 8 uucp wheel 0 Jul 2 10:57 /proc/1234
>>>>
>>>> the uid/gid of the files under /proc/1234 are two kinds...
>>>> ll /proc/1234
>>>> dr-xr-xr-x 2 uucp wheel 0 Jul 2 10:58 attr
>>>> -rw-r--r-- 1 root root 0 Jul 2 10:58 autogroup
>>>> ...
>>>> dr-xr-xr-x 5 uucp wheel 0 Jul 2 10:58 net
>>>> dr-x--x--x 2 root root 0 Jul 2 10:58 ns
>>>> ...
>>>> dr-xr-xr-x 3 uucp wheel 0 Jul 2 10:58 task
>>>>
>>>> I checked the pre_revalidate and found the owner of the files under /proc/<pid>
>>>> will be set to the GLOBAL_ROOT_UID if the task executed setuid/setgid(task_dumpable is false).
>>>> Is this what we expected? why?
>>>
>>> Expected yes. Perfect perhaps not.
>>>
>>> That piece of code has not been examined to see if it is safe to use
>>> make_kuid(task_user_ns(task), 0), instead of GLOBAL_ROOT_UID.
>>>
>>>> For user namespace,the owner of /proc/1/* is incorrect and
>>>> after task call setuid/gid in user namespace, the owner of /proc/<pid-of-this-task>/* is incorrect
>>>> too.
>>>
>>> From the current semantics of dumpable GLOBAL_ROOT_UID is correct.
>>>
>>> Please double check but I believe /proc/self should continue to work,
>>> despite this.
>>
>> /proc/self is not an option. systemd (in particular some of it's tools with pid != 1) read from /proc/1/environ to find out
>> what environment variables it got to detect LXC and other visualization environments.
>> With userns enabled this check fails and systemd goes nuts because it thinks that it lives on top of a "real" Linux.
>
> I don't even see how /proc/self would solve this, since it
> is just a symlink pointing to /proc/1 in this scenario, so
> the ownership of files at /proc/1/XXXX would still be wrong.

Yep.

> This isn't really a systemd specific problem either, I think
> any app would expect to be able to read its own files under
> /proc/$PID/

True. I was not blaming systemd. In my case systemd is just the program
which suffers from the issue.

Thanks,
//richard
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <20130702092554.GD2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-07-02 9:45 ` Richard Weinberger
@ 2013-07-02 9:57 ` Eric W. Biederman
[not found] ` <87ehbhthbl.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
1 sibling, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2013-07-02 9:57 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

"Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
>> > Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>> >
>> >> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
>> >>> I'm struggling debugging a strange problem with interaction between user
>> >>> namespaces, cap_set and ownership of files in /proc/1/
>> >>>
>> >>
>> >> This problem is occured after we call setuid/gid.
>> >>
>> >> for example, a task whose pid is 1234 calls
>> >> setregid(10,10);
>> >> setreuid(10,10);
>
> It seems to get reset to the right values (0:0) when we execve()
> the init binary though. This doesn't happen if we have invoked
> the capset() syscall in between the setregid & the execve() calls.

Yes, execve() should reset the dumpable state.

I took a quick look and I don't see a way around set_dumpable calls in
setup_new_exec. Why the process remains undumpable after exec is worth
investigating. That logic should not be user namespace specific
however.

>> >> The uid/gid of the /proc/1234 is 10:0
>> >> ll /proc/1234 -d
>> >> dr-xr-xr-x 8 uucp wheel 0 Jul 2 10:57 /proc/1234
>> >>
>> >> the uid/gid of the files under /proc/1234 are two kinds...
>> >> ll /proc/1234
>> >> dr-xr-xr-x 2 uucp wheel 0 Jul 2 10:58 attr
>> >> -rw-r--r-- 1 root root 0 Jul 2 10:58 autogroup
>> >> ...
>> >> dr-xr-xr-x 5 uucp wheel 0 Jul 2 10:58 net
>> >> dr-x--x--x 2 root root 0 Jul 2 10:58 ns
>> >> ...
>> >> dr-xr-xr-x 3 uucp wheel 0 Jul 2 10:58 task
>> >>
>> >> I checked the pre_revalidate and found the owner of the files under /proc/<pid>
>> >> will be set to the GLOBAL_ROOT_UID if the task executed setuid/setgid(task_dumpable is false).
>> >> Is this what we expected? why?
>> >
>> > Expected yes. Perfect perhaps not.
>> >
>> > That piece of code has not been examined to see if it is safe to use
>> > make_kuid(task_user_ns(task), 0), instead of GLOBAL_ROOT_UID.
>> >
>> >> For user namespace,the owner of /proc/1/* is incorrect and
>> >> after task call setuid/gid in user namespace, the owner of /proc/<pid-of-this-task>/* is incorrect
>> >> too.
>> >
>> > From the current semantics of dumpable GLOBAL_ROOT_UID is correct.
>> >
>> > Please double check but I believe /proc/self should continue to work,
>> > despite this.
>>
>> /proc/self is not an option. systemd (in particular some of it's
>> tools with pid != 1) read from /proc/1/environ to find out what
>> environment variables it got to detect LXC and other visualization
>> environments. With userns enabled this check fails and systemd goes
>> nuts because it thinks that it lives on top of a "real" Linux.

How odd. Last I was paying attention it was the selinux policy that you
could only access your own proc files, because of the way ptrace was
limited.

As for systemd doing the wrong thing, it sounds like Richard has found a
fertile source of imperfections.

> I don't even see how /proc/self would solve this, since it
> is just a symlink pointing to /proc/1 in this scenario, so
> the ownership of files at /proc/1/XXXX would still be wrong.
>
> This isn't really a systemd specific problem either, I think
> any app would expect to be able to read its own files under
> /proc/$PID/

I meant there is a special case in the permission check for accessing
your own files as you must do when going through /proc/self.

It is worth verifying that special case for accessing your own files
continues to work even when you are in a user namespace.

Eric
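
One way to run the check Eric suggests is a small test program that tries to read the caller's own environ both through /proc/self and through /proc/<pid>. What follows is a minimal sketch, assuming Linux with /proc mounted; the helper names try_read() and can_read_own_environ() are hypothetical and do not come from the thread. It only reports whether the reads succeed and says nothing about why one might fail.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to read up to len bytes from path.
 * Returns the number of bytes read, or -1 if open() or read() fails. */
ssize_t try_read(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}

/* Hypothetical helper: read our own environ via /proc/self and via
 * /proc/<pid>. Returns 1 if both reads succeed, 0 otherwise. */
int can_read_own_environ(void)
{
    char path[64];
    char buf[256];

    snprintf(path, sizeof(path), "/proc/%ld/environ", (long)getpid());

    return try_read("/proc/self/environ", buf, sizeof(buf)) >= 0 &&
           try_read(path, buf, sizeof(buf)) >= 0;
}
```

Calling can_read_own_environ() from the container's init process, once without and once with the capset() step, would show whether the own-file case is affected.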
[parent not found: <87ehbhthbl.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <87ehbhthbl.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-07-02 10:07 ` Gao feng
[not found] ` <51D2A649.9030102-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
0 siblings, 1 reply; 16+ messages in thread
From: Gao feng @ 2013-07-02 10:07 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

On 07/02/2013 05:57 PM, Eric W. Biederman wrote:
> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>
>> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
>>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
>>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>>>
>>>>> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
>>>>>> I'm struggling debugging a strange problem with interaction between user
>>>>>> namespaces, cap_set and ownership of files in /proc/1/
>>>>>>
>>>>>
>>>>> This problem is occured after we call setuid/gid.
>>>>>
>>>>> for example, a task whose pid is 1234 calls
>>>>> setregid(10,10);
>>>>> setreuid(10,10);
>>
>> It seems to get reset to the right values (0:0) when we execve()
>> the init binary though. This doesn't happen if we have invoked
>> the capset() syscall in between the setregid & the execve() calls.
>
> Yes, execve() should reset the dumpable state.
>
> I took a quick look and I don't see a way around set_dumpable calls in
> setup_new_exec. Why the process remains undumpable after exec is worth
> investigating. That logic should not be user namespace specific
> however.
>

I think it's the install_exec_creds, it calls commit_creds to set process undumpable

        /* dumpability changes */
        if (!uid_eq(old->euid, new->euid) ||
            !gid_eq(old->egid, new->egid) ||
            !uid_eq(old->fsuid, new->fsuid) ||
            !gid_eq(old->fsgid, new->fsgid) ||
            !cred_cap_issubset(old, new)) {
                if (task->mm)
                        set_dumpable(task->mm, suid_dumpable);
                task->pdeath_signal = 0;
                smp_wmb();
        }
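
The dumpable flag discussed above can be observed from user space with prctl(). The following is a minimal sketch, assuming Linux with /proc mounted; the helper names get_dumpable(), set_dumpable_flag() and report() are hypothetical. It clears the flag by hand with PR_SET_DUMPABLE rather than through a credential change, so it imitates the effect of commit_creds() but does not exercise that kernel path.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Returns the calling process's dumpable flag, or -1 on error. */
int get_dumpable(void)
{
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/* Sets the dumpable flag to 0 or 1; returns 0 on success, -1 on error. */
int set_dumpable_flag(int value)
{
    return prctl(PR_SET_DUMPABLE, (unsigned long)value, 0, 0, 0);
}

/* Prints the dumpable flag next to the owner of /proc/self/environ. */
void report(const char *when)
{
    struct stat st;

    assert(when != NULL);

    if (stat("/proc/self/environ", &st) == 0)
        printf("%s: dumpable=%d owner=%ld:%ld\n", when, get_dumpable(),
               (long)st.st_uid, (long)st.st_gid);
    else
        printf("%s: dumpable=%d (stat failed)\n", when, get_dumpable());
}
```

Calling report("before"), then set_dumpable_flag(0), then report("after") prints the owner on either side of the change, which can be compared with the `ls -al /proc/1/environ` output quoted earlier in the thread.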
[parent not found: <51D2A649.9030102-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <51D2A649.9030102-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2013-07-02 16:35 ` Eric W. Biederman
[not found] ` <8761wsudgk.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2013-07-02 16:35 UTC (permalink / raw)
To: Gao feng
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:

> On 07/02/2013 05:57 PM, Eric W. Biederman wrote:
>> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
>>>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
>>>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>>>>
>>>>>> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
>>>>>>> I'm struggling debugging a strange problem with interaction between user
>>>>>>> namespaces, cap_set and ownership of files in /proc/1/
>>>>>>>
>>>>>>
>>>>>> This problem is occured after we call setuid/gid.
>>>>>>
>>>>>> for example, a task whose pid is 1234 calls
>>>>>> setregid(10,10);
>>>>>> setreuid(10,10);
>>>
>>> It seems to get reset to the right values (0:0) when we execve()
>>> the init binary though. This doesn't happen if we have invoked
>>> the capset() syscall in between the setregid & the execve() calls.
>>
>> Yes, execve() should reset the dumpable state.
>>
>> I took a quick look and I don't see a way around set_dumpable calls in
>> setup_new_exec. Why the process remains undumpable after exec is worth
>> investigating. That logic should not be user namespace specific
>> however.
>>
>
> I think it's the install_exec_creds, it calls commit_creds to set process undumpable
>
>         /* dumpability changes */
>         if (!uid_eq(old->euid, new->euid) ||
>             !gid_eq(old->egid, new->egid) ||
>             !uid_eq(old->fsuid, new->fsuid) ||
>             !gid_eq(old->fsgid, new->fsgid) ||
>             !cred_cap_issubset(old, new)) {
>                 if (task->mm)
>                         set_dumpable(task->mm, suid_dumpable);
>                 task->pdeath_signal = 0;
>                 smp_wmb();
>         }

That looks like it could do it. Especially if exec is increasing your
capabilities.

Eric
[parent not found: <8761wsudgk.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <8761wsudgk.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-07-02 16:45 ` Daniel P. Berrange
[not found] ` <20130702164514.GB2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 16+ messages in thread
From: Daniel P. Berrange @ 2013-07-02 16:45 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

On Tue, Jul 02, 2013 at 09:35:39AM -0700, Eric W. Biederman wrote:
> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>
> > On 07/02/2013 05:57 PM, Eric W. Biederman wrote:
> >> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >>> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
> >>>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
> >>>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> >>>>>
> >>>>>> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
> >>>>>>> I'm struggling debugging a strange problem with interaction between user
> >>>>>>> namespaces, cap_set and ownership of files in /proc/1/
> >>>>>>>
> >>>>>>
> >>>>>> This problem is occured after we call setuid/gid.
> >>>>>>
> >>>>>> for example, a task whose pid is 1234 calls
> >>>>>> setregid(10,10);
> >>>>>> setreuid(10,10);
> >>>
> >>> It seems to get reset to the right values (0:0) when we execve()
> >>> the init binary though. This doesn't happen if we have invoked
> >>> the capset() syscall in between the setregid & the execve() calls.
> >>
> >> Yes, execve() should reset the dumpable state.
> >>
> >> I took a quick look and I don't see a way around set_dumpable calls in
> >> setup_new_exec. Why the process remains undumpable after exec is worth
> >> investigating. That logic should not be user namespace specific
> >> however.
> >>
> >
> > I think it's the install_exec_creds, it calls commit_creds to set process undumpable
> >
> >         /* dumpability changes */
> >         if (!uid_eq(old->euid, new->euid) ||
> >             !gid_eq(old->egid, new->egid) ||
> >             !uid_eq(old->fsuid, new->fsuid) ||
> >             !gid_eq(old->fsgid, new->fsgid) ||
> >             !cred_cap_issubset(old, new)) {
> >                 if (task->mm)
> >                         set_dumpable(task->mm, suid_dumpable);
> >                 task->pdeath_signal = 0;
> >                 smp_wmb();
> >         }
>
> That looks like it could do it. Especially if exec is increasing your
> capabilities.

Ah, yes, that would explain it. My demo is removing the SYS_MODULE
capability, and then exec'ing the shell binary. Since we are uid==0,
and prctl(PR_CAPBSET_DROP) is not available inside the user namespace,
the rules for capabilities vs execve() call will cause the shell
binary to regain SYS_MODULE capability bit.

So the problem I'm seeing in libvirt is all a result of the fact
that we can't use PR_CAPBSET_DROP inside the user namespace. Given
that there's no point trying to drop any capabilities inside the
user namespace.

The only slight problem here is that we want to drop CAP_MKNOD so
that systemd can detect that it shouldn't attempt to run any units
which would rely on mknod.

Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
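
The PR_CAPBSET_DROP restriction Daniel describes can be probed directly. The following is a minimal sketch, assuming Linux with the kernel UAPI headers installed; the helper names in_bounding_set(), drop_from_bounding_set() and try_drop() are hypothetical. Going by Daniel's report, the drop would presumably fail with EPERM when run inside a user namespace on a kernel that still has the restriction, but the sketch only prints whatever the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <linux/capability.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Returns 1 if cap is in the bounding set, 0 if not, -1 on error. */
int in_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_READ, (unsigned long)cap, 0, 0, 0);
}

/* Tries to drop cap from the bounding set.
 * Returns 0 on success, -1 on failure with errno set by prctl(). */
int drop_from_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_DROP, (unsigned long)cap, 0, 0, 0);
}

/* Attempts the drop and prints the state before and after. */
void try_drop(int cap, const char *name)
{
    assert(name != NULL);

    printf("%s in bounding set before: %d\n", name, in_bounding_set(cap));
    if (drop_from_bounding_set(cap) != 0)
        printf("PR_CAPBSET_DROP(%s) failed: %s\n", name, strerror(errno));
    printf("%s in bounding set after: %d\n", name, in_bounding_set(cap));
}
```

For the systemd case mentioned above, the call would be try_drop(CAP_MKNOD, "CAP_MKNOD"), made before the execve() of the container's init.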
[parent not found: <20130702164514.GB2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <20130702164514.GB2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-07-02 17:12 ` Eric W. Biederman
[not found] ` <87k3l8sx6l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2013-07-02 17:12 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

"Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On Tue, Jul 02, 2013 at 09:35:39AM -0700, Eric W. Biederman wrote:
>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>
>> > On 07/02/2013 05:57 PM, Eric W. Biederman wrote:
>> >> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>> >>
>> >>> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
>> >>>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
>> >>>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>> >>>>>
>> >>>>>> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
>> >>>>>>> I'm struggling debugging a strange problem with interaction between user
>> >>>>>>> namespaces, cap_set and ownership of files in /proc/1/
>> >>>>>>>
>> >>>>>>
>> >>>>>> This problem is occured after we call setuid/gid.
>> >>>>>>
>> >>>>>> for example, a task whose pid is 1234 calls
>> >>>>>> setregid(10,10);
>> >>>>>> setreuid(10,10);
>> >>>
>> >>> It seems to get reset to the right values (0:0) when we execve()
>> >>> the init binary though. This doesn't happen if we have invoked
>> >>> the capset() syscall in between the setregid & the execve() calls.
>> >>
>> >> Yes, execve() should reset the dumpable state.
>> >>
>> >> I took a quick look and I don't see a way around set_dumpable calls in
>> >> setup_new_exec. Why the process remains undumpable after exec is worth
>> >> investigating. That logic should not be user namespace specific
>> >> however.
>> >>
>> >
>> > I think it's the install_exec_creds, it calls commit_creds to set process undumpable
>> >
>> >         /* dumpability changes */
>> >         if (!uid_eq(old->euid, new->euid) ||
>> >             !gid_eq(old->egid, new->egid) ||
>> >             !uid_eq(old->fsuid, new->fsuid) ||
>> >             !gid_eq(old->fsgid, new->fsgid) ||
>> >             !cred_cap_issubset(old, new)) {
>> >                 if (task->mm)
>> >                         set_dumpable(task->mm, suid_dumpable);
>> >                 task->pdeath_signal = 0;
>> >                 smp_wmb();
>> >         }
>>
>> That looks like it could do it. Especially if exec is increasing your
>> capabilities.
>
> Ah, yes, that would explain it. My demo is removing the SYS_MODULE
> capability, and then exec'ing the shell binary. Since we are uid==0,
> and prctl(PR_CAPBSET_DROP) is not available inside the user namespace,
> the rules for capabilities vs execve() call will cause the shell
> binary to regain SYS_MODULE capability bit.
>
> So the problem I'm seeing in libvirt is all a result of the fact
> that we can't use PR_CAPBSET_DROP inside the user namespace. Given
> that there's no point trying to drop any capabilities inside the
> user namespace.
>
> The only slight problem here is that we want to drop CAP_MKNOD so
> that systemd can detect that it shouldn't attempt to run any units
> which would rely on mknod.

I just looked at that and I don't see a justification for the
restriction.

Could you try the patch below and see if it fixes things for you?

Eric


From: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Date: Tue, 2 Jul 2013 10:04:54 -0700
Subject: [PATCH] userns: Allow PR_CAPBSET_DROP in a user namespace.

As the capabilities and capability bounding set are per user namespace
properties it is safe to allow changing them with just CAP_SETPCAP
permission in the user namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 security/commoncap.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 4d787e6..fd9b08f 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -843,7 +843,7 @@ int cap_task_setnice(struct task_struct *p, int nice)
  */
 static long cap_prctl_drop(struct cred *new, unsigned long cap)
 {
-       if (!capable(CAP_SETPCAP))
+       if (!ns_capable(current_user_ns(), CAP_SETPCAP))
                return -EPERM;
        if (!cap_valid(cap))
                return -EINVAL;
-- 
1.7.5.4
[parent not found: <87k3l8sx6l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <87k3l8sx6l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-07-02 20:24 ` Richard Weinberger
2013-07-09 10:35 ` Richard Weinberger
2013-07-12 10:04 ` Daniel P. Berrange
2 siblings, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2013-07-02 20:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

Am 02.07.2013 19:12, schrieb Eric W. Biederman:
> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>
>> On Tue, Jul 02, 2013 at 09:35:39AM -0700, Eric W. Biederman wrote:
>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>>
>>>> On 07/02/2013 05:57 PM, Eric W. Biederman wrote:
>>>>> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>>>>>
>>>>>> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
>>>>>>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
>>>>>>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>>>>>>>
>>>>>>>>> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
>>>>>>>>>> I'm struggling debugging a strange problem with interaction between user
>>>>>>>>>> namespaces, cap_set and ownership of files in /proc/1/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This problem is occured after we call setuid/gid.
>>>>>>>>>
>>>>>>>>> for example, a task whose pid is 1234 calls
>>>>>>>>> setregid(10,10);
>>>>>>>>> setreuid(10,10);
>>>>>>
>>>>>> It seems to get reset to the right values (0:0) when we execve()
>>>>>> the init binary though. This doesn't happen if we have invoked
>>>>>> the capset() syscall in between the setregid & the execve() calls.
>>>>>
>>>>> Yes, execve() should reset the dumpable state.
>>>>>
>>>>> I took a quick look and I don't see a way around set_dumpable calls in
>>>>> setup_new_exec. Why the process remains undumpable after exec is worth
>>>>> investigating. That logic should not be user namespace specific
>>>>> however.
>>>>>
>>>>
>>>> I think it's the install_exec_creds, it calls commit_creds to set process undumpable
>>>>
>>>>         /* dumpability changes */
>>>>         if (!uid_eq(old->euid, new->euid) ||
>>>>             !gid_eq(old->egid, new->egid) ||
>>>>             !uid_eq(old->fsuid, new->fsuid) ||
>>>>             !gid_eq(old->fsgid, new->fsgid) ||
>>>>             !cred_cap_issubset(old, new)) {
>>>>                 if (task->mm)
>>>>                         set_dumpable(task->mm, suid_dumpable);
>>>>                 task->pdeath_signal = 0;
>>>>                 smp_wmb();
>>>>         }
>>>
>>> That looks like it could do it. Especially if exec is increasing your
>>> capabilities.
>>
>> Ah, yes, that would explain it. My demo is removing the SYS_MODULE
>> capability, and then exec'ing the shell binary. Since we are uid==0,
>> and prctl(PR_CAPBSET_DROP) is not available inside the user namespace,
>> the rules for capabilities vs execve() call will cause the shell
>> binary to regain SYS_MODULE capability bit.
>>
>> So the problem I'm seeing in libvirt is all a result of the fact
>> that we can't use PR_CAPBSET_DROP inside the user namespace. Given
>> that there's no point trying to drop any capabilities inside the
>> user namespace.
>>
>> The only slight problem here is that we want to drop CAP_MKNOD so
>> that systemd can detect that it shouldn't attempt to run any units
>> which would rely on mknod.
>
> I just looked at that and I don't see a justification for the
> restriction.
>
> Could you try the patch below and see if it fixes things for you?

With the patch applied my test program is able to drop its caps
(using libcap-ng) and does not regain them upon execve.
Also reading from /proc/1/environ works. :)

> Eric
>
>
> From: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Date: Tue, 2 Jul 2013 10:04:54 -0700
> Subject: [PATCH] userns: Allow PR_CAPBSET_DROP in a user namespace.
>
> As the capabilities and capability bounding set are per user namespace
> properties it is safe to allow changing them with just CAP_SETPCAP
> permission in the user namespace.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Tested-by: Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org>

> ---
>  security/commoncap.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4d787e6..fd9b08f 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -843,7 +843,7 @@ int cap_task_setnice(struct task_struct *p, int nice)
>   */
>  static long cap_prctl_drop(struct cred *new, unsigned long cap)
>  {
> -       if (!capable(CAP_SETPCAP))
> +       if (!ns_capable(current_user_ns(), CAP_SETPCAP))
>                 return -EPERM;
>         if (!cap_valid(cap))
>                 return -EINVAL;
>

Thanks,
//richard
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <87k3l8sx6l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-07-02 20:24 ` Richard Weinberger
@ 2013-07-09 10:35 ` Richard Weinberger
2013-07-12 10:04 ` Daniel P. Berrange
2 siblings, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2013-07-09 10:35 UTC (permalink / raw)
To: Eric W. Biederman
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

Am 02.07.2013 19:12, schrieb Eric W. Biederman:
> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>
>> On Tue, Jul 02, 2013 at 09:35:39AM -0700, Eric W. Biederman wrote:
>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>>
>>>> On 07/02/2013 05:57 PM, Eric W. Biederman wrote:
>>>>> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>>>>>
>>>>>> On Tue, Jul 02, 2013 at 10:56:37AM +0200, Richard Weinberger wrote:
>>>>>>> Am 02.07.2013 10:44, schrieb Eric W. Biederman:
>>>>>>>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>>>>>>>>
>>>>>>>>> On 07/02/2013 12:16 AM, Daniel P. Berrange wrote:
>>>>>>>>>> I'm struggling debugging a strange problem with interaction between user
>>>>>>>>>> namespaces, cap_set and ownership of files in /proc/1/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This problem is occured after we call setuid/gid.
>>>>>>>>>
>>>>>>>>> for example, a task whose pid is 1234 calls
>>>>>>>>> setregid(10,10);
>>>>>>>>> setreuid(10,10);
>>>>>>
>>>>>> It seems to get reset to the right values (0:0) when we execve()
>>>>>> the init binary though. This doesn't happen if we have invoked
>>>>>> the capset() syscall in between the setregid & the execve() calls.
>>>>>
>>>>> Yes, execve() should reset the dumpable state.
>>>>>
>>>>> I took a quick look and I don't see a way around set_dumpable calls in
>>>>> setup_new_exec. Why the process remains undumpable after exec is worth
>>>>> investigating. That logic should not be user namespace specific
>>>>> however.
>>>>>
>>>>
>>>> I think it's the install_exec_creds, it calls commit_creds to set process undumpable
>>>>
>>>>         /* dumpability changes */
>>>>         if (!uid_eq(old->euid, new->euid) ||
>>>>             !gid_eq(old->egid, new->egid) ||
>>>>             !uid_eq(old->fsuid, new->fsuid) ||
>>>>             !gid_eq(old->fsgid, new->fsgid) ||
>>>>             !cred_cap_issubset(old, new)) {
>>>>                 if (task->mm)
>>>>                         set_dumpable(task->mm, suid_dumpable);
>>>>                 task->pdeath_signal = 0;
>>>>                 smp_wmb();
>>>>         }
>>>
>>> That looks like it could do it. Especially if exec is increasing your
>>> capabilities.
>>
>> Ah, yes, that would explain it. My demo is removing the SYS_MODULE
>> capability, and then exec'ing the shell binary. Since we are uid==0,
>> and prctl(PR_CAPBSET_DROP) is not available inside the user namespace,
>> the rules for capabilities vs execve() call will cause the shell
>> binary to regain SYS_MODULE capability bit.
>>
>> So the problem I'm seeing in libvirt is all a result of the fact
>> that we can't use PR_CAPBSET_DROP inside the user namespace. Given
>> that there's no point trying to drop any capabilities inside the
>> user namespace.
>>
>> The only slight problem here is that we want to drop CAP_MKNOD so
>> that systemd can detect that it shouldn't attempt to run any units
>> which would rely on mknod.
>
> I just looked at that and I don't see a justification for the
> restriction.
>
> Could you try the patch below and see if it fixes things for you?
>
> Eric
>
>
> From: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Date: Tue, 2 Jul 2013 10:04:54 -0700
> Subject: [PATCH] userns: Allow PR_CAPBSET_DROP in a user namespace.
>
> As the capabilities and capability bounding set are per user namespace
> properties it is safe to allow changing them with just CAP_SETPCAP
> permission in the user namespace.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  security/commoncap.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4d787e6..fd9b08f 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -843,7 +843,7 @@ int cap_task_setnice(struct task_struct *p, int nice)
>   */
>  static long cap_prctl_drop(struct cred *new, unsigned long cap)
>  {
> -       if (!capable(CAP_SETPCAP))
> +       if (!ns_capable(current_user_ns(), CAP_SETPCAP))
>                 return -EPERM;
>         if (!cap_valid(cap))
>                 return -EINVAL;
>

Is this fix already on its way to mainline?

Thanks,
//richard
* Re: Interaction user namespace, /proc/1 ownership & cap_set
[not found] ` <87k3l8sx6l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-07-02 20:24 ` Richard Weinberger
2013-07-09 10:35 ` Richard Weinberger
@ 2013-07-12 10:04 ` Daniel P. Berrange
2 siblings, 0 replies; 16+ messages in thread
From: Daniel P. Berrange @ 2013-07-12 10:04 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

On Tue, Jul 02, 2013 at 10:12:34AM -0700, Eric W. Biederman wrote:
> "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>
> > Ah, yes, that would explain it. My demo is removing the SYS_MODULE
> > capability, and then exec'ing the shell binary. Since we are uid==0,
> > and prctl(PR_CAPBSET_DROP) is not available inside the user namespace,
> > the rules for capabilities vs execve() call will cause the shell
> > binary to regain SYS_MODULE capability bit.
> >
> > So the problem I'm seeing in libvirt is all a result of the fact
> > that we can't use PR_CAPBSET_DROP inside the user namespace. Given
> > that, there's no point trying to drop any capabilities inside the
> > user namespace.
> >
> > The only slight problem here is that we want to drop CAP_MKNOD so
> > that systemd can detect that it shouldn't attempt to run any units
> > which would rely on mknod.
>
> I just looked at that and I don't see a justification for the
> restriction.
>
> Could you try the patch below and see if it fixes things for you?
>
> Eric
>
>
> From: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Date: Tue, 2 Jul 2013 10:04:54 -0700
> Subject: [PATCH] userns: Allow PR_CAPBSET_DROP in a user namespace.
>
> As the capabilities and capability bounding set are per user namespace
> properties it is safe to allow changing them with just CAP_SETPCAP
> permission in the user namespace.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  security/commoncap.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4d787e6..fd9b08f 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -843,7 +843,7 @@ int cap_task_setnice(struct task_struct *p, int nice)
>   */
>  static long cap_prctl_drop(struct cred *new, unsigned long cap)
>  {
> -	if (!capable(CAP_SETPCAP))
> +	if (!ns_capable(current_user_ns(), CAP_SETPCAP))
>  		return -EPERM;
>  	if (!cap_valid(cap))
>  		return -EINVAL;

Yes, that works in my testing with libvirt. Feel free to add

Tested-by: Daniel P. Berrange <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
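[Editorial sketch, not part of the original message.] The quoted explanation of why the shell regains SYS_MODULE can be illustrated with a small model of the execve() capability rules for uid 0, as documented in capabilities(7) for kernels of that era (before ambient capabilities): the file's inheritable and permitted sets are treated as full, so the new permitted set is P(inheritable) | P(bounding). Everything below is a simplified illustration under those assumptions: the struct, the function names and the DEMO_ constants are hypothetical, the numeric values mirror <linux/capability.h>, and securebits, setuid binaries and file capabilities are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Capability numbers as in <linux/capability.h>; redefined under
 * DEMO_ names so this model needs no kernel headers. */
enum { DEMO_CAP_SYS_MODULE = 16, DEMO_CAP_MKNOD = 27 };

typedef uint64_t capset_t;

/* Hypothetical model of the per-task capability sets. */
struct task_caps {
	capset_t inheritable;
	capset_t permitted;
	capset_t effective;
	capset_t bounding;
};

static capset_t cap_bit(int cap)
{
	assert(cap >= 0 && cap < 64);
	return (capset_t)1 << cap;
}

/* Model of execve() by uid 0 of an ordinary binary: the file
 * inheritable and permitted sets count as "all capabilities" and the
 * file effective bit counts as set. */
static struct task_caps exec_as_root(struct task_caps old)
{
	const capset_t file_full = ~(capset_t)0;
	struct task_caps next;

	next.bounding = old.bounding;
	next.inheritable = old.inheritable;
	next.permitted = (old.inheritable & file_full) |
			 (file_full & old.bounding);
	next.effective = next.permitted;
	return next;
}
```

In this model, clearing the SYS_MODULE bit from the inheritable, permitted and effective sets while leaving it in the bounding set means exec_as_root() hands the bit back; clearing it from the bounding set as well keeps it gone across exec. That is why PR_CAPBSET_DROP has to work inside the user namespace for the drop to stick.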
Thread overview: 16+ messages
2013-07-01 16:16 Interaction user namespace, /proc/1 ownership & cap_set Daniel P. Berrange
[not found] ` <20130701161625.GQ15954-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-07-01 16:19 ` Daniel P. Berrange
[not found] ` <20130701161946.GR15954-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-07-01 16:24 ` Richard Weinberger
2013-07-02 5:14 ` Gao feng
[not found] ` <51D261D3.3030002-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-07-02 8:44 ` Eric W. Biederman
[not found] ` <87wqp9uz9a.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-07-02 8:56 ` Richard Weinberger
[not found] ` <51D295C5.1080003-/L3Ra7n9ekc@public.gmane.org>
2013-07-02 9:25 ` Daniel P. Berrange
[not found] ` <20130702092554.GD2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-07-02 9:45 ` Richard Weinberger
2013-07-02 9:57 ` Eric W. Biederman
[not found] ` <87ehbhthbl.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-07-02 10:07 ` Gao feng
[not found] ` <51D2A649.9030102-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-07-02 16:35 ` Eric W. Biederman
[not found] ` <8761wsudgk.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-07-02 16:45 ` Daniel P. Berrange
[not found] ` <20130702164514.GB2524-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-07-02 17:12 ` Eric W. Biederman
[not found] ` <87k3l8sx6l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-07-02 20:24 ` Richard Weinberger
2013-07-09 10:35 ` Richard Weinberger
2013-07-12 10:04 ` Daniel P. Berrange