* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Mimi Zohar @ 2019-05-10 11:49 UTC (permalink / raw)
To: Roberto Sassu, Rob Landley, viro
Cc: linux-security-module, linux-integrity, initramfs, linux-api,
linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan
In-Reply-To: <bf0d02fc-d6ce-ef1d-bb7d-7ca14432c6fd@huawei.com>
On Fri, 2019-05-10 at 08:56 +0200, Roberto Sassu wrote:
> On 5/9/2019 8:34 PM, Rob Landley wrote:
> > On 5/9/19 6:24 AM, Roberto Sassu wrote:
> >> The difference with another proposal
> >> (https://lore.kernel.org/patchwork/cover/888071/) is that xattrs can be
> >> included in an image without changing the image format, as opposed to
> >> defining a new one. As seen from the discussion, if a new format has to be
> >> defined, it should fix the issues of the existing format, which requires
> >> more time.
> >
> > So you've explicitly chosen _not_ to address Y2038 while you're there.
>
> Can you be more specific?
Right, this patch set avoids incrementing the CPIO magic number and
the resulting changes required (eg. increasing the timestamp field
size), by including a file with the security xattrs in the CPIO. In
either case, including the security xattrs in the initramfs header or
as a separate file, the initramfs, itself, needs to be signed.
Mimi
^ permalink raw reply
* [PATCH v11 0/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Dmitry V. Levin @ 2019-05-10 15:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, Andy Lutomirski, Kees Cook, Elvira Khabirova,
Eugene Syromyatnikov, Benjamin Herrenschmidt, Greentime Hu,
Helge Deller, James E.J. Bottomley, James Hogan, Michael Ellerman,
Paul Burton, Paul Mackerras, Ralf Baechle, Richard Kuo,
Shuah Khan, Vincent Chen, linux-api, linux-hexagon, linux-kernel,
linux-kselftest
[Andrew, could you take this patchset into your tree, please?
Besides the patch for hexagon, all patches in this series have
Acked-by or Reviewed-by tags already.
I have been waiting and pinging the hexagon maintainer since November
without any visible effect. The last Acked-by from the hexagon maintainer
in linux.git was in October and the last Signed-off-by was in July. Since
that time not a single change affecting hexagon was able to attract
attention of the hexagon maintainer, so I don't think it's worth waiting
any longer.]
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its tracer
has no reliable means to find out that the syscall was, in fact,
a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
Common practice is to keep track of the sequence of ptrace-stops in order
not to mix the two syscall-stops up. But it is not as simple as it looks;
for example, strace had a (just recently fixed) long-standing bug where
attaching strace to a tracee that is performing the execve system call
led to the tracer identifying the following syscall-exit-stop as
syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
and process_vm_readv become unavailable when the process dumpable flag
is cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
PTRACE_GET_SYSCALL_INFO returns the following structure:
struct ptrace_syscall_info {
__u8 op; /* PTRACE_SYSCALL_INFO_* */
__u32 arch __attribute__((__aligned__(sizeof(__u32))));
__u64 instruction_pointer;
__u64 stack_pointer;
union {
struct {
__u64 nr;
__u64 args[6];
} entry;
struct {
__s64 rval;
__u8 is_error;
} exit;
struct {
__u64 nr;
__u64 args[6];
__u32 ret_data;
} seccomp;
};
};
The structure was chosen according to [2], except for the following
changes:
* seccomp substructure was added as a superset of entry substructure;
* the type of nr field was changed from int to __u64 because syscall
numbers are, as a practical matter, 64 bits;
* stack_pointer field was added along with instruction_pointer field
since it is readily available and can save the tracer from extra
PTRACE_GETREGS/PTRACE_GETREGSET calls;
* arch is always initialized to aid with tracing system calls
* such as execve();
* instruction_pointer and stack_pointer are always initialized
so they could be easily obtained for non-syscall stops;
* a boolean is_error field was added along with rval field, this way
the tracer can more reliably distinguish a return value
from an error value.
strace has been ported to PTRACE_GET_SYSCALL_INFO.
Starting with release 4.26, strace uses PTRACE_GET_SYSCALL_INFO API
as the preferred mechanism of obtaining syscall information.
[1] https://lore.kernel.org/lkml/CA+55aFzcSVmdDj9Lh_gdbz1OzHyEm6ZrGPBDAJnywm2LF_eVyg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAObL_7GM0n80N7J_DFw_eQyfLyzq+sf4y2AvsCCV88Tb3AwEHA@mail.gmail.com/
---
Notes:
v11:
* Added more Acked-by.
* Rebased back to mainline as the prerequisite syscall_get_arch patchset
has already been merged via audit tree.
v10:
* Added more Acked-by.
v9:
* Rebased to linux-next again due to syscall_get_arguments() signature change.
v8:
* Moved syscall_get_arch() specific patches to a separate patchset
which is now merged into audit/next tree.
* Rebased to linux-next.
* Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK,
narrowing down the set of architectures supported by this implementation
back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because
I failed to get all syscall_get_*(), instruction_pointer(),
and user_stack_pointer() functions implemented on some niche
architectures. This leaves the following architectures out:
alpha, h8300, m68k, microblaze, and unicore32.
v7:
* Rebased to v5.0-rc1.
* 5 arch-specific preparatory patches out of 25 have been merged
into v5.0-rc1 via arch trees.
v6:
* Added syscall_get_arguments and syscall_set_arguments wrappers
to asm-generic/syscall.h, requested by Geert.
* Changed PTRACE_GET_SYSCALL_INFO return code: do not take trailing paddings
into account, use the end of the last field of the structure being written.
* Changed struct ptrace_syscall_info:
* remove .frame_pointer field, is is not needed and not portable;
* make .arch field explicitly aligned, remove no longer needed
padding before .arch field;
* remove trailing pads, they are no longer needed.
v5:
* Merged separate series and patches into the single series.
* Changed PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
* Changed struct ptrace_syscall_info: generalized instruction_pointer,
stack_pointer, and frame_pointer fields by moving them from
ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
and initializing them for all stops.
* Added PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
instruction_pointer when the tracee is in a signal stop.
* Patched all remaining architectures to provide all necessary
syscall_get_* functions.
* Made available for all architectures: do not conditionalize on
CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
are implemented on all architectures.
* Added a test for PTRACE_GET_SYSCALL_INFO to selftests/ptrace.
v4:
* Revisited PTRACE_EVENT_SECCOMP support:
do not introduce task_struct.ptrace_event, use child->last_siginfo->si_code instead.
* Implemented PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
ptrace_syscall_info.{entry,exit}.
v3:
* Changed struct ptrace_syscall_info.
* Added PTRACE_EVENT_SECCOMP support by adding ptrace_event to task_struct.
* Added proper defines for ptrace_syscall_info.op values.
* Renamed PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
and moved them to uapi.
v2:
* Stopped using task->ptrace.
* Replaced entry_info.is_compat with entry_info.arch, used syscall_get_arch().
* Used addr argument of sys_ptrace to get expected size of the struct;
return full size of the struct.
Dmitry V. Levin (6):
nds32: fix asm/syscall.h # acked
hexagon: define syscall_get_error() and syscall_get_return_value() # waiting for ack since November
mips: define syscall_get_error() # acked
parisc: define syscall_get_error() # acked
powerpc: define syscall_get_error() # acked
selftests/ptrace: add a test case for PTRACE_GET_SYSCALL_INFO # acked
Elvira Khabirova (1):
ptrace: add PTRACE_GET_SYSCALL_INFO request # reviewed
arch/hexagon/include/asm/syscall.h | 14 +
arch/mips/include/asm/syscall.h | 6 +
arch/nds32/include/asm/syscall.h | 27 +-
arch/parisc/include/asm/syscall.h | 7 +
arch/powerpc/include/asm/syscall.h | 10 +
include/linux/tracehook.h | 9 +-
include/uapi/linux/ptrace.h | 35 +++
kernel/ptrace.c | 101 ++++++-
tools/testing/selftests/ptrace/.gitignore | 1 +
tools/testing/selftests/ptrace/Makefile | 2 +-
.../selftests/ptrace/get_syscall_info.c | 271 ++++++++++++++++++
11 files changed, 468 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/get_syscall_info.c
--
ldv
^ permalink raw reply
* [PATCH v11 6/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Dmitry V. Levin @ 2019-05-10 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Kees Cook,
linux-api-u79uwXL29TY76Z2rM5mHXA, Eugene Syromyatnikov,
Oleg Nesterov, Andy Lutomirski,
strace-devel-3+4lAyCyj6AWlMsSdNXQLw
In-Reply-To: <20190510152640.GA28529-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
From: Elvira Khabirova <lineprinter-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its tracer
has no reliable means to find out that the syscall was, in fact,
a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
Common practice is to keep track of the sequence of ptrace-stops in order
not to mix the two syscall-stops up. But it is not as simple as it looks;
for example, strace had a (just recently fixed) long-standing bug where
attaching strace to a tracee that is performing the execve system call
led to the tracer identifying the following syscall-exit-stop as
syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
and process_vm_readv become unavailable when the process dumpable flag
is cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid,
void *addr, void *data);
...
PTRACE_GET_SYSCALL_INFO
Retrieve information about the syscall that caused the stop.
The information is placed into the buffer pointed by "data"
argument, which should be a pointer to a buffer of type
"struct ptrace_syscall_info".
The "addr" argument contains the size of the buffer pointed to
by "data" argument (i.e., sizeof(struct ptrace_syscall_info)).
The return value contains the number of bytes available
to be written by the kernel.
If the size of data to be written by the kernel exceeds the size
specified by "addr" argument, the output is truncated.
Co-authored-by: Dmitry V. Levin <ldv-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
Reviewed-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Cc: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Eugene Syromyatnikov <esyr-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: strace-devel-3+4lAyCyj6AWlMsSdNXQLw@public.gmane.org
Signed-off-by: Elvira Khabirova <lineprinter-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
Signed-off-by: Dmitry V. Levin <ldv-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
---
Notes:
v11: unchanged
v10: unchanged
v9:
* Rebased to linux-next again due to syscall_get_arguments() signature change.
v8:
* Rebased to linux-next.
* Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK,
narrowing down the set of architectures supported by this implementation
back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because
I failed to get all syscall_get_*(), instruction_pointer(),
and user_stack_pointer() functions implemented on some niche
architectures. This leaves the following architectures out:
alpha, h8300, m68k, microblaze, and unicore32.
v7: unchanged
v6:
* Changed PTRACE_GET_SYSCALL_INFO return code after discussion with Oleg:
do not take trailing paddings into account, use the end of the last field
of the structure being written.
* Changed struct ptrace_syscall_info after discussion in lkml:
* removed .frame_pointer field, is is not needed and not portable;
* made .arch field explicitly aligned, removed no longer needed
padding before .arch field;
* removed trailing pads, they are no longer needed.
* Added Reviewed-by
from https://lore.kernel.org/lkml/20181210141107.GB4177-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org/
and https://lore.kernel.org/lkml/CAGXu5j+t1LqRC7KCHkdYhv6icgf01Lk6v=fAhPWGys=1g49=Qg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
v5:
* Changed PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
* Changed struct ptrace_syscall_info: generalized instruction_pointer,
stack_pointer, and frame_pointer fields by moving them from
ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
and initializing them for all stops.
* Added PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
instruction_pointer when the tracee is in a signal stop.
* Made available for all architectures: do not conditionalize on
CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
are implemented on all architectures.
v4:
* Revisited PTRACE_EVENT_SECCOMP support:
do not introduce task_struct.ptrace_event, use child->last_siginfo->si_code instead.
* Implemented PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
ptrace_syscall_info.{entry,exit}.
v3:
* Changed struct ptrace_syscall_info.
* Added PTRACE_EVENT_SECCOMP support by adding ptrace_event to task_struct.
* Added proper defines for ptrace_syscall_info.op values.
* Renamed PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
and moved them to uapi.
v2:
* Stopped using task->ptrace.
* Replaced entry_info.is_compat with entry_info.arch, used syscall_get_arch().
* Used addr argument of sys_ptrace to get expected size of the struct;
return full size of the struct.
include/linux/tracehook.h | 9 ++--
include/uapi/linux/ptrace.h | 35 +++++++++++++
kernel/ptrace.c | 101 +++++++++++++++++++++++++++++++++++-
3 files changed, 141 insertions(+), 4 deletions(-)
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index df20f8bdbfa3..6bc7a3d58e2f 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -57,13 +57,15 @@ struct linux_binprm;
/*
* ptrace report for syscall entry and exit looks identical.
*/
-static inline int ptrace_report_syscall(struct pt_regs *regs)
+static inline int ptrace_report_syscall(struct pt_regs *regs,
+ unsigned long message)
{
int ptrace = current->ptrace;
if (!(ptrace & PT_PTRACED))
return 0;
+ current->ptrace_message = message;
ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
/*
@@ -76,6 +78,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
current->exit_code = 0;
}
+ current->ptrace_message = 0;
return fatal_signal_pending(current);
}
@@ -101,7 +104,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
static inline __must_check int tracehook_report_syscall_entry(
struct pt_regs *regs)
{
- return ptrace_report_syscall(regs);
+ return ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_ENTRY);
}
/**
@@ -126,7 +129,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
if (step)
user_single_step_report(regs);
else
- ptrace_report_syscall(regs);
+ ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_EXIT);
}
/**
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..a71b6e3b03eb 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,41 @@ struct seccomp_metadata {
__u64 flags; /* Output: filter's flags */
};
+#define PTRACE_GET_SYSCALL_INFO 0x420e
+#define PTRACE_SYSCALL_INFO_NONE 0
+#define PTRACE_SYSCALL_INFO_ENTRY 1
+#define PTRACE_SYSCALL_INFO_EXIT 2
+#define PTRACE_SYSCALL_INFO_SECCOMP 3
+
+struct ptrace_syscall_info {
+ __u8 op; /* PTRACE_SYSCALL_INFO_* */
+ __u32 arch __attribute__((__aligned__(sizeof(__u32))));
+ __u64 instruction_pointer;
+ __u64 stack_pointer;
+ union {
+ struct {
+ __u64 nr;
+ __u64 args[6];
+ } entry;
+ struct {
+ __s64 rval;
+ __u8 is_error;
+ } exit;
+ struct {
+ __u64 nr;
+ __u64 args[6];
+ __u32 ret_data;
+ } seccomp;
+ };
+};
+
+/*
+ * These values are stored in task->ptrace_message
+ * by tracehook_report_syscall_* to describe the current syscall-stop.
+ */
+#define PTRACE_EVENTMSG_SYSCALL_ENTRY 1
+#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
+
/* Read signals from a shared (process wide) queue */
#define PTRACE_PEEKSIGINFO_SHARED (1 << 0)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 6f357f4fc859..de3817de6327 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,8 @@
#include <linux/compat.h>
#include <linux/sched/signal.h>
+#include <asm/syscall.h> /* for syscall_get_* */
+
/*
* Access another process' address space via ptrace.
* Source/target buffer must be kernel space,
@@ -879,7 +881,100 @@ static int ptrace_regset(struct task_struct *task, int req, unsigned int type,
* to ensure no machine forgets it.
*/
EXPORT_SYMBOL_GPL(task_user_regset_view);
-#endif
+
+static unsigned long
+ptrace_get_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ unsigned long args[ARRAY_SIZE(info->entry.args)];
+ int i;
+
+ info->op = PTRACE_SYSCALL_INFO_ENTRY;
+ info->entry.nr = syscall_get_nr(child, regs);
+ syscall_get_arguments(child, regs, args);
+ for (i = 0; i < ARRAY_SIZE(args); i++)
+ info->entry.args[i] = args[i];
+
+ /* args is the last field in struct ptrace_syscall_info.entry */
+ return offsetofend(struct ptrace_syscall_info, entry.args);
+}
+
+static unsigned long
+ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ /*
+ * As struct ptrace_syscall_info.entry is currently a subset
+ * of struct ptrace_syscall_info.seccomp, it makes sense to
+ * initialize that subset using ptrace_get_syscall_info_entry().
+ * This can be reconsidered in the future if these structures
+ * diverge significantly enough.
+ */
+ ptrace_get_syscall_info_entry(child, regs, info);
+ info->op = PTRACE_SYSCALL_INFO_SECCOMP;
+ info->seccomp.ret_data = child->ptrace_message;
+
+ /* ret_data is the last field in struct ptrace_syscall_info.seccomp */
+ return offsetofend(struct ptrace_syscall_info, seccomp.ret_data);
+}
+
+static unsigned long
+ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ info->op = PTRACE_SYSCALL_INFO_EXIT;
+ info->exit.rval = syscall_get_error(child, regs);
+ info->exit.is_error = !!info->exit.rval;
+ if (!info->exit.is_error)
+ info->exit.rval = syscall_get_return_value(child, regs);
+
+ /* is_error is the last field in struct ptrace_syscall_info.exit */
+ return offsetofend(struct ptrace_syscall_info, exit.is_error);
+}
+
+static int
+ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
+ void __user *datavp)
+{
+ struct pt_regs *regs = task_pt_regs(child);
+ struct ptrace_syscall_info info = {
+ .op = PTRACE_SYSCALL_INFO_NONE,
+ .arch = syscall_get_arch(child),
+ .instruction_pointer = instruction_pointer(regs),
+ .stack_pointer = user_stack_pointer(regs),
+ };
+ unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry);
+ unsigned long write_size;
+
+ /*
+ * This does not need lock_task_sighand() to access
+ * child->last_siginfo because ptrace_freeze_traced()
+ * called earlier by ptrace_check_attach() ensures that
+ * the tracee cannot go away and clear its last_siginfo.
+ */
+ switch (child->last_siginfo ? child->last_siginfo->si_code : 0) {
+ case SIGTRAP | 0x80:
+ switch (child->ptrace_message) {
+ case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+ actual_size = ptrace_get_syscall_info_entry(child, regs,
+ &info);
+ break;
+ case PTRACE_EVENTMSG_SYSCALL_EXIT:
+ actual_size = ptrace_get_syscall_info_exit(child, regs,
+ &info);
+ break;
+ }
+ break;
+ case SIGTRAP | (PTRACE_EVENT_SECCOMP << 8):
+ actual_size = ptrace_get_syscall_info_seccomp(child, regs,
+ &info);
+ break;
+ }
+
+ write_size = min(actual_size, user_size);
+ return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size;
+}
+#endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
int ptrace_request(struct task_struct *child, long request,
unsigned long addr, unsigned long data)
@@ -1096,6 +1191,10 @@ int ptrace_request(struct task_struct *child, long request,
ret = __put_user(kiov.iov_len, &uiov->iov_len);
break;
}
+
+ case PTRACE_GET_SYSCALL_INFO:
+ ret = ptrace_get_syscall_info(child, addr, datavp);
+ break;
#endif
case PTRACE_SECCOMP_GET_FILTER:
--
ldv
--
Strace-devel mailing list
Strace-devel-3+4lAyCyj6AWlMsSdNXQLw@public.gmane.org
https://lists.strace.io/mailman/listinfo/strace-devel
^ permalink raw reply related
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Jann Horn @ 2019-05-10 20:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Aleksa Sarai, Andy Lutomirski, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Andrew Morton,
Alexei Starovoitov, Kees Cook, Christian Brauner, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, containers, linux-fsdevel, Linux API
In-Reply-To: <87o94d6aql.fsf@xmission.com>
On Tue, May 07, 2019 at 07:38:58PM -0500, Eric W. Biederman wrote:
> Jann Horn <jannh@google.com> writes:
> > In my opinion, CVE-2019-5736 points out two different problems:
> >
> > The big problem: The __ptrace_may_access() logic has a special-case
> > short-circuit for "introspection" that you can't opt out of;
>
> Once upon a time in a galaxy far far away I fixed a bug where we missing
> ptrace_may_access checks on various proc files and systems using selinux
> stopped working. At the time selinux did not allow ptrace like access
> to yourself. The "introspection" special case was the quick and simple
> work-around.
>
> There is nothing fundamental in having the "introspection" special case
> except that various lsms have probably grown to depend upon it being
> there. I expect without difficulty we could move the check down
> into the various lsms. Which would get that check out of the core
> kernel code.
Oh, if that's an option, that would be great, I think.
But this means, for example, that a non-root, non-dumpable process can't
open /proc/self/maps anymore, or open /proc/self/fd/*, and things like
that, without making itself dumpable. I would be surprised if there is
no code out there that relies on that.
>From what I can tell, without the introspection special case,
introspection would fail in the following cases (assuming that the
process is not capable and isn't using sys_setfs[ug]id()):
- ruid/euid/suid are not all the same
- rgid/egid/sgid are not all the same
- process is not dumpable
I think that there probably should be some way for a non-dumpable
process to look at its own procfs entries? If we could start from a
clean slate, I'd propose an opt-in flag to openat() for that, but
since we don't have a clean slate, I'd be afraid of breaking things
with that. But maybe I'm just being overly careful here?
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Jann Horn @ 2019-05-10 20:41 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
Arnd Bergmann, David Howells, Eric Biederman, Andrew Morton,
Alexei Starovoitov, Kees Cook, Christian Brauner, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, containers, linux-fsdevel, Linux API
In-Reply-To: <20190506191735.nmzf7kwfh7b6e2tf@yavin>
On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > In my opinion, CVE-2019-5736 points out two different problems:
> >
> > The big problem: The __ptrace_may_access() logic has a special-case
> > short-circuit for "introspection" that you can't opt out of; this
> > makes it possible to open things in procfs that are related to the
> > current process even if the credentials of the process wouldn't permit
> > accessing another process like it. I think the proper fix to deal with
> > this would be to add a prctl() flag for "set whether introspection is
> > allowed for this process", and if userspace has manually un-set that
> > flag, any introspection special-case logic would be skipped.
>
> We could do PR_SET_DUMPABLE=3 for this, I guess?
Hmm... I'd make it a new prctl() command, since introspection is
somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
think the introspection flag should be per-thread.
> > An additional problem: /proc/*/exe can be used to open a file for
> > writing; I think it may have been Andy Lutomirski who pointed out some
> > time ago that it would be nice if you couldn't use /proc/*/fd/* to
> > re-open files with more privileges, which is sort of the same thing.
>
> This is something I'm currently working on a series for, which would
> boil down to some restrictions on how re-opening of file descriptors
> works through procfs.
Ah, nice!
> However, execveat() of a procfs magiclink is a bit hard to block --
> there is no way for userspace to to represent a file being "open for
> execute" so they are all "open for execute" by default and blocking it
> outright seems a bit extreme (though I actually hope to eventually add
> the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
> why I've reserved some bits).
(For what it's worth, I'm mostly concerned about read vs write, not
really about execute, since execute really is just another form of
reading in my opinion.)
> (Thinking more about it, there is an argument that I should include the
> above patch into this series so that we can block re-opening of fds
> opened through resolveat(2) without explicit flags from the outset.)
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Rob Landley @ 2019-05-10 20:46 UTC (permalink / raw)
To: Mimi Zohar, Roberto Sassu, viro
Cc: linux-security-module, linux-integrity, initramfs, linux-api,
linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan
In-Reply-To: <1557488971.10635.102.camel@linux.ibm.com>
On 5/10/19 6:49 AM, Mimi Zohar wrote:
> On Fri, 2019-05-10 at 08:56 +0200, Roberto Sassu wrote:
>> On 5/9/2019 8:34 PM, Rob Landley wrote:
>>> On 5/9/19 6:24 AM, Roberto Sassu wrote:
>
>>>> The difference with another proposal
>>>> (https://lore.kernel.org/patchwork/cover/888071/) is that xattrs can be
>>>> included in an image without changing the image format, as opposed to
>>>> defining a new one. As seen from the discussion, if a new format has to be
>>>> defined, it should fix the issues of the existing format, which requires
>>>> more time.
>>>
>>> So you've explicitly chosen _not_ to address Y2038 while you're there.
>>
>> Can you be more specific?
>
> Right, this patch set avoids incrementing the CPIO magic number and
> the resulting changes required (eg. increasing the timestamp field
> size), by including a file with the security xattrs in the CPIO. In
> either case, including the security xattrs in the initramfs header or
> as a separate file, the initramfs, itself, needs to be signed.
The /init binary in the initramfs runs as root and launches all other processes
on the system. Presumably it can write any xattrs it wants to, and doesn't need
any extra permissions granted to it to do so. But as soon as you start putting
xattrs on _other_ files within the initramfs that are _not_ necessarily running
as PID 1, _that's_ when the need to sign the initramfs comes in?
Presumably the signing occurs on the gzipped file. How does that affect the cpio
parsing _after_ it's decompressed? Why would that be part of _this_ patch?
Rob
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Andy Lutomirski @ 2019-05-10 21:20 UTC (permalink / raw)
To: Jann Horn
Cc: Aleksa Sarai, Andy Lutomirski, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, Linux Containers, linux-fsdevel
In-Reply-To: <20190510204141.GB253532@google.com>
On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > In my opinion, CVE-2019-5736 points out two different problems:
> > >
> > > The big problem: The __ptrace_may_access() logic has a special-case
> > > short-circuit for "introspection" that you can't opt out of; this
> > > makes it possible to open things in procfs that are related to the
> > > current process even if the credentials of the process wouldn't permit
> > > accessing another process like it. I think the proper fix to deal with
> > > this would be to add a prctl() flag for "set whether introspection is
> > > allowed for this process", and if userspace has manually un-set that
> > > flag, any introspection special-case logic would be skipped.
> >
> > We could do PR_SET_DUMPABLE=3 for this, I guess?
>
> Hmm... I'd make it a new prctl() command, since introspection is
> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> think the introspection flag should be per-thread.
I've lost track of the context here, but it seems to me that
mitigating attacks involving accidental following of /proc links
shouldn't depend on dumpability. What's the actual problem this is
trying to solve again?
^ permalink raw reply
* Re: [PATCH v2 1/3] fs: add ksys_lsetxattr() wrapper
From: Jann Horn @ 2019-05-10 21:28 UTC (permalink / raw)
To: Roberto Sassu
Cc: viro, linux-security-module, linux-integrity, initramfs,
linux-api, linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
james.w.mcmechan
In-Reply-To: <20190509112420.15671-2-roberto.sassu@huawei.com>
On Thu, May 09, 2019 at 01:24:18PM +0200, Roberto Sassu wrote:
> Similarly to commit 03450e271a16 ("fs: add ksys_fchmod() and do_fchmodat()
> helpers and ksys_chmod() wrapper; remove in-kernel calls to syscall"), this
> patch introduces the ksys_lsetxattr() helper to avoid in-kernel calls to
> the sys_lsetxattr() syscall.
>
> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
[...]
> +int ksys_lsetxattr(const char __user *pathname,
> + const char __user *name, const void __user *value,
> + size_t size, int flags)
> +{
> + return path_setxattr(pathname, name, value, size, flags, 0);
> +}
Instead of exposing ksys_lsetxattr(), wouldn't it be cleaner to use
kern_path() and vfs_setxattr(), or something like that? Otherwise you're
adding more code that has to cast between kernel and user pointers.
^ permalink raw reply
* Re: [PATCH v2 3/3] initramfs: introduce do_readxattrs()
From: Jann Horn @ 2019-05-10 21:33 UTC (permalink / raw)
To: Roberto Sassu
Cc: viro, linux-security-module, linux-integrity, initramfs,
linux-api, linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
james.w.mcmechan
In-Reply-To: <20190509112420.15671-4-roberto.sassu@huawei.com>
On Thu, May 09, 2019 at 01:24:20PM +0200, Roberto Sassu wrote:
> This patch adds support for an alternative method to add xattrs to files in
> the rootfs filesystem. Instead of extracting them directly from the ram
> disk image, they are extracted from a regular file called .xattr-list, that
> can be added by any ram disk generator available today.
[...]
> +struct path_hdr {
> + char p_size[10]; /* total size including p_size field */
> + char p_data[]; /* <path>\0<xattrs> */
> +};
> +
> +static int __init do_readxattrs(void)
> +{
> + struct path_hdr hdr;
> + char str[sizeof(hdr.p_size) + 1];
> + unsigned long file_entry_size;
> + size_t size, name_buf_size, total_size;
> + struct kstat st;
> + int ret, fd;
> +
> + ret = vfs_lstat(XATTR_LIST_FILENAME, &st);
> + if (ret < 0)
> + return ret;
> +
> + total_size = st.size;
> +
> + fd = ksys_open(XATTR_LIST_FILENAME, O_RDONLY, 0);
> + if (fd < 0)
> + return fd;
> +
> + while (total_size) {
> + size = ksys_read(fd, (char *)&hdr, sizeof(hdr));
[...]
> + ksys_close(fd);
> +
> + if (ret < 0)
> + error("Unable to parse xattrs");
> +
> + return ret;
> +}
Please use something like filp_open()+kernel_read()+fput() instead of
ksys_open()+ksys_read()+ksys_close(). I understand that some of the init
code needs to use the syscall wrappers because no equivalent VFS
functions are available, but please use the VFS functions when that's
easy to do.
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Mimi Zohar @ 2019-05-10 22:38 UTC (permalink / raw)
To: Rob Landley, Roberto Sassu, viro
Cc: linux-security-module, linux-integrity, initramfs, linux-api,
linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan
In-Reply-To: <3a9d717e-0e12-9d62-a3cf-afb7a5dbf166@landley.net>
On Fri, 2019-05-10 at 15:46 -0500, Rob Landley wrote:
> On 5/10/19 6:49 AM, Mimi Zohar wrote:
> > On Fri, 2019-05-10 at 08:56 +0200, Roberto Sassu wrote:
> >> On 5/9/2019 8:34 PM, Rob Landley wrote:
> >>> On 5/9/19 6:24 AM, Roberto Sassu wrote:
> >
> >>>> The difference with another proposal
> >>>> (https://lore.kernel.org/patchwork/cover/888071/) is that xattrs can be
> >>>> included in an image without changing the image format, as opposed to
> >>>> defining a new one. As seen from the discussion, if a new format has to be
> >>>> defined, it should fix the issues of the existing format, which requires
> >>>> more time.
> >>>
> >>> So you've explicitly chosen _not_ to address Y2038 while you're there.
> >>
> >> Can you be more specific?
> >
> > Right, this patch set avoids incrementing the CPIO magic number and
> > the resulting changes required (eg. increasing the timestamp field
> > size), by including a file with the security xattrs in the CPIO. In
> > either case, including the security xattrs in the initramfs header or
> > as a separate file, the initramfs, itself, needs to be signed.
>
> The /init binary in the initramfs runs as root and launches all other processes
> on the system. Presumably it can write any xattrs it wants to, and doesn't need
> any extra permissions granted to it to do so. But as soon as you start putting
> xattrs on _other_ files within the initramfs that are _not_ necessarily running
> as PID 1, _that's_ when the need to sign the initramfs comes in?
>
> Presumably the signing occurs on the gzipped file. How does that affect the cpio
> parsing _after_ it's decompressed? Why would that be part of _this_ patch?
The signing and verification of the initramfs is a separate issue, not
part of this patch set. The only reason for mentioning it here was to
say that both methods of including the security xattrs require the
initramfs be signed. Just as the kernel image needs to be signed and
verified, the initramfs should be too.
Mimi
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Jann Horn @ 2019-05-10 22:55 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Aleksa Sarai, Al Viro, Jeff Layton, J. Bruce Fields,
Arnd Bergmann, David Howells, Eric Biederman, Andrew Morton,
Alexei Starovoitov, Kees Cook, Christian Brauner, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, Linux Containers, linux-fsdevel,
Linux API <linux-api@
In-Reply-To: <CALCETrW2nn=omqJb4p+m-BDsCOhg+YZQ3ELd4BdhODV3G44gfA@mail.gmail.com>
On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
> On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > > In my opinion, CVE-2019-5736 points out two different problems:
> > > >
> > > > The big problem: The __ptrace_may_access() logic has a special-case
> > > > short-circuit for "introspection" that you can't opt out of; this
> > > > makes it possible to open things in procfs that are related to the
> > > > current process even if the credentials of the process wouldn't permit
> > > > accessing another process like it. I think the proper fix to deal with
> > > > this would be to add a prctl() flag for "set whether introspection is
> > > > allowed for this process", and if userspace has manually un-set that
> > > > flag, any introspection special-case logic would be skipped.
> > >
> > > We could do PR_SET_DUMPABLE=3 for this, I guess?
> >
> > Hmm... I'd make it a new prctl() command, since introspection is
> > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> > think the introspection flag should be per-thread.
>
> I've lost track of the context here, but it seems to me that
> mitigating attacks involving accidental following of /proc links
> shouldn't depend on dumpability. What's the actual problem this is
> trying to solve again?
The one actual security problem that I've seen related to this is
CVE-2019-5736. There is a write-up of it at
<https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
under "Successful approach", but it goes more or less as follows:
A container is running that doesn't use user namespaces (because for
some reason I don't understand, apparently some people still do that).
An evil process is running inside the container with UID 0 (as in,
GLOBAL_ROOT_UID); so if the evil process inside the container was able
to reach root-owned files on the host filesystem, it could write into
them.
The container engine wants to spawn a new process inside the container.
It forks off a child that joins the container's namespaces (including
PID and mount namespaces), and then the child calls execve() on some
path in the container.
The attacker replaces the executable in the container with a symlink
to /proc/self/exe and replaces a library inside the container with a
malicious one.
When the container engine calls execve(), intending to run an executable
inside the container, it instead goes through ptrace_may_access() using
the introspection short-circuit and re-executes its own executable
through the jumped symlink /proc/self/exe (which is normally unreachable
for the container). After the execve(), the process loads an evil
library from inside the container and is under the control of the
container.
Now the container controls a process whose /proc/self/exe is a jumped
symlink to a host executable, and the container can write into it.
Some container engines are now using an extremely ugly hack to work
around this - whenever they want to enter a container, they copy the
host binary into a new memfd and execute that to avoid exposing the
original host binary to containers:
<https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
In my opinion, the problems here are:
- Apparently some people run untrusted containers without user
namespaces. It would be really nice if people could not do that.
(Probably the biggest problem here.)
- ptrace_may_access() has a short-circuit that permits a process to
unintentionally look at itself even if it has dropped privileges -
here, it permits the execve("/proc/self/exe", ...) that would
normally be blocked by the check for CAP_SYS_PTRACE if the process
is nondumpable.
- You can use /proc/*/exe to get a writable fd.
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Christian Brauner @ 2019-05-10 23:36 UTC (permalink / raw)
To: Jann Horn
Cc: Andy Lutomirski, Aleksa Sarai, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, Linux Containers, linux-fsdevel, Linux API
In-Reply-To: <20190510225527.GA59914@google.com>
On Sat, May 11, 2019 at 12:55 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
> > On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > > > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > > > In my opinion, CVE-2019-5736 points out two different problems:
> > > > >
> > > > > The big problem: The __ptrace_may_access() logic has a special-case
> > > > > short-circuit for "introspection" that you can't opt out of; this
> > > > > makes it possible to open things in procfs that are related to the
> > > > > current process even if the credentials of the process wouldn't permit
> > > > > accessing another process like it. I think the proper fix to deal with
> > > > > this would be to add a prctl() flag for "set whether introspection is
> > > > > allowed for this process", and if userspace has manually un-set that
> > > > > flag, any introspection special-case logic would be skipped.
> > > >
> > > > We could do PR_SET_DUMPABLE=3 for this, I guess?
> > >
> > > Hmm... I'd make it a new prctl() command, since introspection is
> > > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> > > think the introspection flag should be per-thread.
> >
> > I've lost track of the context here, but it seems to me that
> > mitigating attacks involving accidental following of /proc links
> > shouldn't depend on dumpability. What's the actual problem this is
> > trying to solve again?
>
> The one actual security problem that I've seen related to this is
> CVE-2019-5736. There is a write-up of it at
> <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> under "Successful approach", but it goes more or less as follows:
>
> A container is running that doesn't use user namespaces (because for
> some reason I don't understand, apparently some people still do that).
> An evil process is running inside the container with UID 0 (as in,
> GLOBAL_ROOT_UID); so if the evil process inside the container was able
> to reach root-owned files on the host filesystem, it could write into
> them.
>
> The container engine wants to spawn a new process inside the container.
> It forks off a child that joins the container's namespaces (including
> PID and mount namespaces), and then the child calls execve() on some
> path in the container.
> The attacker replaces the executable in the container with a symlink
> to /proc/self/exe and replaces a library inside the container with a
> malicious one.
> When the container engine calls execve(), intending to run an executable
> inside the container, it instead goes through ptrace_may_access() using
> the introspection short-circuit and re-executes its own executable
> through the jumped symlink /proc/self/exe (which is normally unreachable
> for the container). After the execve(), the process loads an evil
> library from inside the container and is under the control of the
> container.
> Now the container controls a process whose /proc/self/exe is a jumped
> symlink to a host executable, and the container can write into it.
>
> Some container engines are now using an extremely ugly hack to work
> around this - whenever they want to enter a container, they copy the
> host binary into a new memfd and execute that to avoid exposing the
> original host binary to containers:
> <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
>
>
> In my opinion, the problems here are:
>
> - Apparently some people run untrusted containers without user
> namespaces. It would be really nice if people could not do that.
> (Probably the biggest problem here.)
I know I sound like a broken record since I've been going on about this
forever together with a lot of other people but honestly,
the fact that people are running untrusted workloads in privileged containers
is the real issue here.
Aleksa is a good friend of mine and we have discussed this a lot so I hope
he doesn't hate me for saying this again: it is crazy that there are container
runtimes out there that promise (or at least do not state the opposite)
containers without user namespaces or containers with user namespaces
that allow to map the host root id to anything can be safe. They cannot.
Even if this /proc/*/exe thing is somehow blocked there
are other ways of escaping from a privileged container.
We (i.e. LXC) literally do not accept CVEs for privileged containers
because we do not consider them safe by design.
It seems to me to be heading in the wrong direction to keep up the
illusion that with enough effort we can make this all nice and safe.
Yes, the userspace memfd hack we came up with is as ugly as a security
patch can be but if you make promises you can't keep you better be
prepared to pay the price when things start to fall apart.
So if this part of the patch is just needed to handle this do we really
want to do all that tricky work or is there more to gain from this that
makes it worth it.
Christian
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Aleksa Sarai @ 2019-05-11 15:49 UTC (permalink / raw)
To: Christian Brauner
Cc: Jann Horn, Andy Lutomirski, Aleksa Sarai, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Linus Torvalds,
Linux Containers, linux-fsdevel, Linux API
In-Reply-To: <CAHrFyr5vjTZfgtMsHwr6iwVVFxVsU3UCOiEq=FM-rjr0kPGHUw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3202 bytes --]
On 2019-05-11, Christian Brauner <christian@brauner.io> wrote:
> > In my opinion, the problems here are:
> >
> > - Apparently some people run untrusted containers without user
> > namespaces. It would be really nice if people could not do that.
> > (Probably the biggest problem here.)
>
> I know I sound like a broken record since I've been going on about this
> forever together with a lot of other people but honestly,
> the fact that people are running untrusted workloads in privileged containers
> is the real issue here.
I completely agree. It's a shit-show, and it's caused by bad defaults in
Docker and (now) podman. To be fair, they both now support rootless
containers but the default is still privileged containers.
They do support user namespaces (though it should be noted that LXD's
support is much nicer from a security standpoint) but unless it's the
default the support is almost pointless. In the case of Docker it can
lead to some usability issues when you enable it (which I believe is the
main justification for it not being the default).
> Aleksa is a good friend of mine and we have discussed this a lot so I hope
> he doesn't hate me for saying this again: it is crazy that there are container
> runtimes out there that promise (or at least do not state the opposite)
> containers without user namespaces or containers with user namespaces
> that allow to map the host root id to anything can be safe. They cannot.
Yeah, the fact that we (runc) don't scream from the rooftops that this
setup is insecure is definitely a problem. I have mentioned this
whenever I've had a chance, but the fact that the most popular runtimes
(which use runc) don't use user namespaces compounds the issue. I'm
willing to bet that >90% of users of runc-based runtimes don't use user
namespaces at all, and this is all down to bad defaults.
There are also some other misfeatures we have in runc that we're
basically forced to support because some users use them, and we can't
really break entire projects (even though it's the projects' fault they
have an insecure setup).
> It seems to me to be heading in the wrong direction to keep up the
> illusion that with enough effort we can make this all nice and safe.
> Yes, the userspace memfd hack we came up with is as ugly as a security
> patch can be but if you make promises you can't keep you better be
> prepared to pay the price when things start to fall apart.
> So if this part of the patch is just needed to handle this do we really
> want to do all that tricky work or is there more to gain from this that
> makes it worth it.
I dropped this patch in v7, I don't think it's required for the
overarching feature. Looking back on it, it doesn't make much sense
given the context that privileged containers are unsafe in the first
place.
I do think that being able to block introspection might be a useful
hardening feature though. During attachment it would be nice to be sure
that nothing will be able to touch the attaching process's /proc/$pid --
even itself.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Andy Lutomirski @ 2019-05-11 17:00 UTC (permalink / raw)
To: Jann Horn
Cc: Andy Lutomirski, Aleksa Sarai, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, Linux Containers, linux-fsdevel
In-Reply-To: <20190510225527.GA59914@google.com>
> On May 10, 2019, at 3:55 PM, Jann Horn <jannh@google.com> wrote:
>
>> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
>>> On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
>>>
>>>> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
>>>>> On 2019-05-06, Jann Horn <jannh@google.com> wrote:
>>>>> In my opinion, CVE-2019-5736 points out two different problems:
>>>>>
>>>>> The big problem: The __ptrace_may_access() logic has a special-case
>>>>> short-circuit for "introspection" that you can't opt out of; this
>>>>> makes it possible to open things in procfs that are related to the
>>>>> current process even if the credentials of the process wouldn't permit
>>>>> accessing another process like it. I think the proper fix to deal with
>>>>> this would be to add a prctl() flag for "set whether introspection is
>>>>> allowed for this process", and if userspace has manually un-set that
>>>>> flag, any introspection special-case logic would be skipped.
>>>>
>>>> We could do PR_SET_DUMPABLE=3 for this, I guess?
>>>
>>> Hmm... I'd make it a new prctl() command, since introspection is
>>> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
>>> think the introspection flag should be per-thread.
>>
>> I've lost track of the context here, but it seems to me that
>> mitigating attacks involving accidental following of /proc links
>> shouldn't depend on dumpability. What's the actual problem this is
>> trying to solve again?
>
> The one actual security problem that I've seen related to this is
> CVE-2019-5736. There is a write-up of it at
> <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> under "Successful approach", but it goes more or less as follows:
>
> A container is running that doesn't use user namespaces (because for
> some reason I don't understand, apparently some people still do that).
> An evil process is running inside the container with UID 0 (as in,
> GLOBAL_ROOT_UID); so if the evil process inside the container was able
> to reach root-owned files on the host filesystem, it could write into
> them.
>
> The container engine wants to spawn a new process inside the container.
> It forks off a child that joins the container's namespaces (including
> PID and mount namespaces), and then the child calls execve() on some
> path in the container.
I think that, at this point, the task should be considered owned by the container. Maybe we should have a better API than execve() to execute a program in a safer way, but fiddling with dumpability seems like a band-aid. In fact, the process is arguably pwned even *before* execve.
A better “spawn” API should fix this. In the mean time, I think it should be assumed that, if you join a container’s namespaces, you are at its mercy.
> The attacker replaces the executable in the container with a symlink
> to /proc/self/exe and replaces a library inside the container with a
> malicious one.
Cute.
> When the container engine calls execve(), intending to run an executable
> inside the container, it instead goes through ptrace_may_access() using
> the introspection short-circuit and re-executes its own executable
> through the jumped symlink /proc/self/exe (which is normally unreachable
> for the container). After the execve(), the process loads an evil
> library from inside the container and is under the control of the
> container.
> Now the container controls a process whose /proc/self/exe is a jumped
> symlink to a host executable, and the container can write into it.
>
> Some container engines are now using an extremely ugly hack to work
> around this - whenever they want to enter a container, they copy the
> host binary into a new memfd and execute that to avoid exposing the
> original host binary to containers:
> <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
>
>
> In my opinion, the problems here are:
>
> - Apparently some people run untrusted containers without user
> namespaces. It would be really nice if people could not do that.
> (Probably the biggest problem here.)
> - ptrace_may_access() has a short-circuit that permits a process to
> unintentionally look at itself even if it has dropped privileges -
> here, it permits the execve("/proc/self/exe", ...) that would
> normally be blocked by the check for CAP_SYS_PTRACE if the process
> is nondumpable.
I don’t see this as a problem. Dumpable is about protecting a task from others, not about protecting a task against itself.
> - You can use /proc/*/exe to get a writable fd.
This is IMO the real bug.
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Linus Torvalds @ 2019-05-11 17:21 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jann Horn, Andy Lutomirski, Aleksa Sarai, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linux Containers, linux-fsdevel
In-Reply-To: <C60DC580-854D-478D-AF23-5F29FB7C3E50@amacapital.net>
On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> A better “spawn” API should fix this.
Andy, stop with the "spawn would be better".
Spawn is garbage. It's garbage because it's fundamentally too
inflexible, and it's garbage because it is quite complex to try to
work around the inflexibility by having those complex "action pointer
arrays" to make up for its failings.
And spawn() would fundamentally have all the same permission issues
that you now point to execve() as having, so it doesn't even *solve*
anything.
You've said this whole "spawn would fix things" thing before, and it's
wrong. Spawn isn't better. Really. If fixes absolutely zero things,
and the only reason for spawn existing is because VMS and NT had that
broken and inflexible model.
There's at least one paper from some MS people about how "spawn()" is
wonderful, and maybe you bought into the garbage from that. But that
paper is about how they hate fork(), not because of execve(). And if
you hate fork, use "vfork()" instead (preferably with an immediate
call to a non-returning function in the child to avoid the stack
re-use issue that makes it so simple to screw up vfork() in hard to
debug ways).
execve() is a _fine_ model. That's not the problem in this whole issue
at all - never was, and never will be.
The problem in this discussion is (a) having privileges you shouldn't
have and (b) having other interfaces that make it easyish to change
the filesystem layout to confuse those entities with privileges.
So the reason the open flags can be problematic is exactly because
they effectively change filesystem layout. And no, it's not just
AT_THIS_ROOT, although that's the obvious one. Things like "you can't
follow symlinks" can also effectively change the layout: imagine if
you have a PATH-like lookup model, and you end up having symlinks as
part of the standard filesystem layout.. Now a "don't follow symlinks"
can turn the *standard* executable into something that isn't found,
and then you might end up executing something else instead (think root
having '.' as the last entry in path, which some people used to
suggest as the fix for the completely bad "first entry" case)..
Notice? None of the real problems are about execve or would be solved
by any spawn API. You just think that because you've apparently been
talking to too many MS people that think fork (and thus indirectly
execve()) is bad process management.
Linus
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Aleksa Sarai @ 2019-05-11 17:26 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jann Horn, Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
Arnd Bergmann, David Howells, Eric Biederman, Andrew Morton,
Alexei Starovoitov, Kees Cook, Christian Brauner, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, Linux Containers, linux-fsdevel
In-Reply-To: <C60DC580-854D-478D-AF23-5F29FB7C3E50@amacapital.net>
[-- Attachment #1: Type: text/plain, Size: 3017 bytes --]
On 2019-05-11, Andy Lutomirski <luto@amacapital.net> wrote:
> >> I've lost track of the context here, but it seems to me that
> >> mitigating attacks involving accidental following of /proc links
> >> shouldn't depend on dumpability. What's the actual problem this is
> >> trying to solve again?
> >
> > The one actual security problem that I've seen related to this is
> > CVE-2019-5736. There is a write-up of it at
> > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> > under "Successful approach", but it goes more or less as follows:
> >
> > A container is running that doesn't use user namespaces (because for
> > some reason I don't understand, apparently some people still do that).
> > An evil process is running inside the container with UID 0 (as in,
> > GLOBAL_ROOT_UID); so if the evil process inside the container was able
> > to reach root-owned files on the host filesystem, it could write into
> > them.
> >
> > The container engine wants to spawn a new process inside the container.
> > It forks off a child that joins the container's namespaces (including
> > PID and mount namespaces), and then the child calls execve() on some
> > path in the container.
>
> I think that, at this point, the task should be considered owned by
> the container. Maybe we should have a better API than execve() to
> execute a program in a safer way, but fiddling with dumpability seems
> like a band-aid. In fact, the process is arguably pwned even *before*
> execve.
Yeah, execve is just the vector (though in this case it's done in order
to clear mm->dumpable). An earlier CVE (CVE-2016-9962) was very similar
but was attacking a dirfd that runc had open into the container (LXC had
a very similar bug too) -- setting !mm->dumpable was one of the
workarounds we had for this.
> A better “spawn” API should fix this. In the mean time, I think it
> should be assumed that, if you join a container’s namespaces, you are
> at its mercy.
This is generally how we treat containers as runtime authors, but it's
not a trivial thing to get right. In many cases the kernel APIs are
working against you -- Christian and myself have written a fair few
patches to fix holes in the kernel APIs so we can avoid these kinds of
assumptions.
But yes, one of the most risky parts of a container runtime is when
you're attaching to a running container because all of the helpful
introspection APIs in /proc/ suddenly become a security nightmare. A
better "spawn a process in these namespaces" API might help improve the
situation (or at least, I hope it would).
> > - You can use /proc/*/exe to get a writable fd.
>
> This is IMO the real bug.
I will try to send an RFC of the patchset I have for this next week or
so. Funnily enough, currently /proc/*/exe has the write bit set in its
"mode" (my series fixes this).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Linus Torvalds @ 2019-05-11 17:26 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jann Horn, Andy Lutomirski, Aleksa Sarai, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linux Containers, linux-fsdevel
In-Reply-To: <CAHk-=wh1JJD_RabMaFfinsAQp1vHGJOQ1rKqihafY=r7yHc8sQ@mail.gmail.com>
On Sat, May 11, 2019 at 1:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.
Side note: a good policy has been (and remains) to make suid binaries
not be dynamically linked. And in the absence of that, the dynamic
linker at least resets the library path when it notices itself being
dynamic, and it certainly doesn't inherit any open flags from the
non-trusted environment.
And by the same logic, a suid interpreter must *definitely* should not
inherit any execve() flags from the non-trusted environment. So I
think Aleksa's patch to use the passed-in open flags is *exactly* the
wrong thing to do for security reasons. It doesn't close holes, it
opens them.
Linus
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Aleksa Sarai @ 2019-05-11 17:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Jann Horn, Andy Lutomirski, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linux Containers, linux-fsdevel
In-Reply-To: <CAHk-=whOL-NBso8X5S8s597yZEOMBoU8chkMFVTi8b-ff2qARg@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1371 bytes --]
On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, May 11, 2019 at 1:21 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Notice? None of the real problems are about execve or would be solved
> > by any spawn API. You just think that because you've apparently been
> > talking to too many MS people that think fork (and thus indirectly
> > execve()) is bad process management.
>
> Side note: a good policy has been (and remains) to make suid binaries
> not be dynamically linked. And in the absence of that, the dynamic
> linker at least resets the library path when it notices itself being
> dynamic, and it certainly doesn't inherit any open flags from the
> non-trusted environment.
>
> And by the same logic, a suid interpreter must *definitely* should not
> inherit any execve() flags from the non-trusted environment. So I
> think Aleksa's patch to use the passed-in open flags is *exactly* the
> wrong thing to do for security reasons. It doesn't close holes, it
> opens them.
Yup, I've dropped the patch for the next version. (To be honest, I'm not
sure why I included any of the other flags -- the only one that would've
been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Linus Torvalds @ 2019-05-11 17:43 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Andy Lutomirski, Jann Horn, Andy Lutomirski, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linux Containers, linux-fsdevel
In-Reply-To: <20190511173113.qhqmv5q5f74povix@yavin>
On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> Yup, I've dropped the patch for the next version. (To be honest, I'm not
> sure why I included any of the other flags -- the only one that would've
> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
I do wonder if we could try to just set AT_NO_MAGICLINKS
unconditionally for execve() (and certainly for the suid case).
I'd rather try to do these things across the board, than have "suid
binaries are treated specially" if at all possible.
The main use case for having /proc/<pid>/exe thing is for finding open
file descriptors, and for 'ps' kind of use, or to find the startup
directory when people don't populate the execve() environment fully
(ie "readlink(/proc/self/exe)" is afaik pretty common.
Sadly, googling for
execve /proc/self/exe
does actually find hits, including one that implies that chrome does
exactly that. So it might not be possible.
Somewhat odd, but it does just confirm the whole "users will at some
point do everything in their power to use every odd special case,
intended or not".
Linus
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Christian Brauner @ 2019-05-11 17:48 UTC (permalink / raw)
To: Linus Torvalds, Aleksa Sarai
Cc: Andy Lutomirski, Jann Horn, Andy Lutomirski, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linux Containers, linux-fsdevel, Linux API
In-Reply-To: <CAHk-=wgo-X9pDbVf8khfDsgEKn3wSvLJkB890OxHL+42Hosypw@mail.gmail.com>
On May 11, 2019 7:43:44 PM GMT+02:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>>
>> Yup, I've dropped the patch for the next version. (To be honest, I'm
>not
>> sure why I included any of the other flags -- the only one that
>would've
>> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
>
>I do wonder if we could try to just set AT_NO_MAGICLINKS
>unconditionally for execve() (and certainly for the suid case).
>
>I'd rather try to do these things across the board, than have "suid
>binaries are treated specially" if at all possible.
>
>The main use case for having /proc/<pid>/exe thing is for finding open
>file descriptors, and for 'ps' kind of use, or to find the startup
>directory when people don't populate the execve() environment fully
>(ie "readlink(/proc/self/exe)" is afaik pretty common.
>
>Sadly, googling for
>
> execve /proc/self/exe
>
>does actually find hits, including one that implies that chrome does
>exactly that. So it might not be possible.
>
>Somewhat odd, but it does just confirm the whole "users will at some
>point do everything in their power to use every odd special case,
>intended or not".
>
> Linus
Sadly I have to admit that we are using this.
Also, execveat on glibc is implemented via
/proc/self/fd/<nr> on kernels that do not
have a proper execveat.
See fexecve...
Christian
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Aleksa Sarai @ 2019-05-11 18:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Jann Horn, Andy Lutomirski, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linux Containers, linux-fsdevel
In-Reply-To: <CAHk-=wgo-X9pDbVf8khfDsgEKn3wSvLJkB890OxHL+42Hosypw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2050 bytes --]
On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > Yup, I've dropped the patch for the next version. (To be honest, I'm not
> > sure why I included any of the other flags -- the only one that would've
> > been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
>
> I do wonder if we could try to just set AT_NO_MAGICLINKS
> unconditionally for execve() (and certainly for the suid case).
>
> I'd rather try to do these things across the board, than have "suid
> binaries are treated specially" if at all possible.
>
> The main use case for having /proc/<pid>/exe thing is for finding open
> file descriptors, and for 'ps' kind of use, or to find the startup
> directory when people don't populate the execve() environment fully
> (ie "readlink(/proc/self/exe)" is afaik pretty common.
>
> Sadly, googling for
>
> execve /proc/self/exe
>
> does actually find hits, including one that implies that chrome does
> exactly that. So it might not be possible.
>
> Somewhat odd, but it does just confirm the whole "users will at some
> point do everything in their power to use every odd special case,
> intended or not".
*sheepishly* Actually we use this in runc very liberally.
It's done because we need to run namespace-related code but runc is
written in Go so (long story short) we re-exec ourselves in order to
run some __attribute__((constructor)) code which sets up the namespaces
and then lets the Go runtime boot.
I suspect just writing everything in C would've been orders of magnitude
simpler, but I wasn't around when that decision was made. :P
Also as Christian mentioned, fexecve(3) in glibc is implemented using
/proc/self/fd on old kernels (then again, if we change the behaviour on
new kernels it won't matter because glibc uses execveat(AT_EMPTY_PATH)
if it's available).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Andy Lutomirski @ 2019-05-11 22:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jann Horn, Andy Lutomirski, Aleksa Sarai, Al Viro, Jeff Layton,
J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linux Containers, linux-fsdevel
In-Reply-To: <CAHk-=wh1JJD_RabMaFfinsAQp1vHGJOQ1rKqihafY=r7yHc8sQ@mail.gmail.com>
> On May 11, 2019, at 10:21 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> A better “spawn” API should fix this.
>
> Andy, stop with the "spawn would be better".
It doesn’t have to be spawn per se. But the current situation sucks.
>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.
>
>
I’ve literally never spoken to an MS person about it.
What container managers and init systems *want* is a way to drop
privileges, change namespaces, etc and then run something in a
controlled way so that the intermediate states aren’t dangerous. An
API for this could be spawn-like or exec-like — that particular
distinction is beside the point. Having personally written code that
mucks with namepsaces, I've wanted two particular abilities that are
both quite awkward:
a) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary in the container's
filesystem, all atomically enough that I don't need to worry about
accidentally leaking privileges into the container. A
super-duper-non-dumpable mode would kind of allow this, but I'd worry
that there's some other hole besides ptrace() and /proc/self.
b) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary that is *not* in the
container's filesystem. This happens, for example, if the container's
mount namespace has no exec mounts at all. We don't have a fantastic
way to do this at all right now due to /proc/self/exe.
Regardless, the actual CVE at hand would have been nicely avoided if
writing to /proc/self/exe didn’t work, and I see no reason we can’t
make that happen.
I suppose we could also consider a change to disable /proc/self/exe if
it's not reachable from /proc/self/root. By "disable", I mean that
readlink() should maybe still work, but actually trying to open it
could probably fail safely.
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Andy Lutomirski @ 2019-05-11 22:44 UTC (permalink / raw)
To: Roberto Sassu
Cc: Al Viro, LSM List, linux-integrity, initramfs, Linux API,
Linux FS Devel, LKML, Mimi Zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, H. Peter Anvin,
Arnd Bergmann, Rob Landley, james.w.mcmechan
In-Reply-To: <20190509112420.15671-1-roberto.sassu@huawei.com>
On Thu, May 9, 2019 at 4:27 AM Roberto Sassu <roberto.sassu@huawei.com> wrote:
>
> This patch set aims at solving the following use case: appraise files from
> the initial ram disk. To do that, IMA checks the signature/hash from the
> security.ima xattr. Unfortunately, this use case cannot be implemented
> currently, as the CPIO format does not support xattrs.
>
> This proposal consists in marshaling pathnames and xattrs in a file called
> .xattr-list. They are unmarshaled by the CPIO parser after all files have
> been extracted.
>
> The difference from v1 (https://lkml.org/lkml/2018/11/22/1182) is that all
> xattrs are stored in a single file and not per file (solves the file name
> limitation issue, as it is not necessary to add a suffix to files
> containing xattrs).
>
> The difference with another proposal
> (https://lore.kernel.org/patchwork/cover/888071/) is that xattrs can be
> included in an image without changing the image format, as opposed to
> defining a new one. As seen from the discussion, if a new format has to be
> defined, it should fix the issues of the existing format, which requires
> more time.
I read some of those emails. ISTM that adding TAR support should be
seriously considered. Sure, it's baroque, but it's very, very well
supported, and it does exactly what we need.
--Andy
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Rob Landley @ 2019-05-12 4:04 UTC (permalink / raw)
To: Andy Lutomirski, Roberto Sassu
Cc: Al Viro, LSM List, linux-integrity, initramfs, Linux API,
Linux FS Devel, LKML, Mimi Zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, H. Peter Anvin,
Arnd Bergmann, james.w.mcmechan
In-Reply-To: <CALCETrXy7gqmmy37=nrMAisGadZ+qbjZjXtWFF8Crq86xNpsfA@mail.gmail.com>
On 5/11/19 5:44 PM, Andy Lutomirski wrote:
> I read some of those emails. ISTM that adding TAR support should be
> seriously considered. Sure, it's baroque, but it's very, very well
> supported, and it does exactly what we need.
Which means you now have two parsers supported in parallel forevermore, and are
reversing the design decision initially made when this went in without new info.
Also, I just did a tar implementation for toybox: It took me a month to debug it
(_not_ starting from scratch but from a submission), I only just added sparse
file support (because something in the android build was generating a sparse
file), there are historical tarballs I know it won't extract (I'm just testing
against what the current one produces with the default flags), and I haven't
even started on xattr support yet.
Instead I was experimenting with corner cases like "S records replace the
prefix[] field starting at byte 386 with an offset/length pair array, but
prefix[] starts at 345, do those first 41 bytes still function as a prefix and
is there any circumstance under which existing tar binaries will populate them?
Also, why does every instance of an S record generated by gnu/tar end with a
gratuitous length zero segment?"
"cpio -H newc" is a _much_ simpler format. And posix no longer specifies
_either_ format usefully, hasn't for years. From toybox tar's header comment:
* For the command, see
* http://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html
* For the modern file format, see
*
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#tag_20_92_13_06
* https://en.wikipedia.org/wiki/Tar_(computing)#File_format
* https://www.gnu.org/software/tar/manual/html_node/Tar-Internals.html
And no, that isn't _enough_ information, you still have to "tar | hd" a lot and
squint. (There's no current spec, it's pieced together from multiple sources
because posix abdicated responsibility for this to Jorg Schilling.)
Rob
P.S. Yes that gnu/dammit page starts with a "this will be deleted" comment which
according to archive.org has been there for over a dozen years.
P.P.S. Sadly, if you want an actually standardized standard format where
implementations adhere to the standard: IETF RFC 1991 was published in 1996 and
remains compatible with files an archivers in service. Or we could stick with
cpio and make minor changes to it, since we have to remain backwards compatible
with it _anyway_....
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Rob Landley @ 2019-05-12 4:12 UTC (permalink / raw)
To: Andy Lutomirski, Roberto Sassu
Cc: Al Viro, LSM List, linux-integrity, initramfs, Linux API,
Linux FS Devel, LKML, Mimi Zohar, silviu.vlasceanu,
dmitry.kasatkin, takondra, kamensky, H. Peter Anvin,
Arnd Bergmann, james.w.mcmechan
In-Reply-To: <4aee6e10-0eec-1d76-af66-dc8c7b68b766@landley.net>
On 5/11/19 11:04 PM, Rob Landley wrote:
> P.P.S. Sadly, if you want an actually standardized standard format where
> implementations adhere to the standard: IETF RFC 1991 was published in 1996 and
Nope, darn it, checked my notes and that wasn't it. I thought zip had an RFC,
it's just zlib, deflate, and gzip, and that's not the number of any of them.
I still think sticking with a lightly modified cpio makes the most sense,
just... in band signalling that _doesn't_ solve the y2038 problem, the file size
limit, or address sparse files seems kinda silly.
Rob
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox