* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Andy Lutomirski @ 2019-04-15 23:58 UTC (permalink / raw)
To: Jonathan Kowalski
Cc: Andy Lutomirski, Aleksa Sarai, Enrico Weigelt, metux IT consult,
Christian Brauner, Linus Torvalds, Al Viro, Jann Horn,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel
In-Reply-To: <CAGLj2rEOfW=3wis3YWomWK1AVDdCX6DMqbvSHXba3-x=kpkcNA@mail.gmail.com>
On Mon, Apr 15, 2019 at 2:26 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
>
> On Mon, Apr 15, 2019 at 9:34 PM Andy Lutomirski <luto@kernel.org> wrote:
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes. And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit. I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable. I suppose it would be nifty if there were a way for a
>
> Hmm, isn't what you're describing roughly what systemd-run -t does? It
> will serialize the argument list, ask PID 1 to create a transient unit
> (go through the polkit stuff), and then set the stdout/stderr and
> stdin of the service to your tty, make it the controlling terminal of
> the process and
> reset it. So I guess it should work with sudo/su just fine too.
>
> There is also s6-sudod (and a s6-sudoc client to it) that works in a
> similar fashion, though it's a lot less fancy.
Cute. Now we just distros to work out the kinks and to ship these as
sudo and su :)
>
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec(). It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs". This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
>
> My idea was that this intent could be supplied at clone time, you
> could attach ptrace access modes to a pidfd (we could make those a bit
> granular, perhaps) and any API that takes PIDs and checks against the
> caller's ptrace access mode could instead derive so from the pidfd.
> Since killing is a bit convoluted due to setuid binaries, that should
> work if one is CAP_KILL capable in the owning userns of the task, and
> if not that, has permissions to kill and the target has NNP set.
This CAP_KILL trick makes me nervous. This particular permission is
really quite powerful, and it would need some analysis to conclude
that it's not *more* powerful than CAP_KILL.
> This
> would allow you to bind kill privileges in a way that is compatible
> with both worlds, the upshot being NNP allows for the functionality to
> be available to a lot more of userspace. Ofcourse, this would require
> a new clone version, possibly with taking a clone2 struct which sets a
> few parameters for the process and the flags for the pidfd.
>
> Another point is that you have a pidfd_open (or something else) that
> can create multiple pidfds from a pidfd obtained at clone time and
> create pidfds with varying level of rights. It can also work by taking
> a TID to open a pidfd for an external task (and then for all the
> rights you wish to acquire on it, check against your ambient
> authority).
Indeed.
>
> (Actually, in general, having FMODE_* style bits spanning all methods
> a file descriptor can take (through system calls), with the type of
> object as key (class containing a set), and be able to enable/disable
> them and seal them would be a useful addition, this all happening at
> the struct file level instead of inode level sealing in memfds).
At the risk of saying a dirty word, the Windows API works quite a bit
like this :)
^ permalink raw reply
* [PATCH linux-next v10 6/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Dmitry V. Levin @ 2019-04-15 23:44 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Kees Cook,
linux-api-u79uwXL29TY76Z2rM5mHXA, Eugene Syromyatnikov,
Oleg Nesterov, Andy Lutomirski,
strace-devel-3+4lAyCyj6AWlMsSdNXQLw
In-Reply-To: <20190415234307.GA9364-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
From: Elvira Khabirova <lineprinter-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its tracer
has no reliable means to find out that the syscall was, in fact,
a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
Common practice is to keep track of the sequence of ptrace-stops in order
not to mix the two syscall-stops up. But it is not as simple as it looks;
for example, strace had a (just recently fixed) long-standing bug where
attaching strace to a tracee that is performing the execve system call
led to the tracer identifying the following syscall-exit-stop as
syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
and process_vm_readv become unavailable when the process dumpable flag
is cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid,
void *addr, void *data);
...
PTRACE_GET_SYSCALL_INFO
Retrieve information about the syscall that caused the stop.
The information is placed into the buffer pointed by "data"
argument, which should be a pointer to a buffer of type
"struct ptrace_syscall_info".
The "addr" argument contains the size of the buffer pointed to
by "data" argument (i.e., sizeof(struct ptrace_syscall_info)).
The return value contains the number of bytes available
to be written by the kernel.
If the size of data to be written by the kernel exceeds the size
specified by "addr" argument, the output is truncated.
Co-authored-by: Dmitry V. Levin <ldv-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
Reviewed-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Cc: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Eugene Syromyatnikov <esyr-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: strace-devel-3+4lAyCyj6AWlMsSdNXQLw@public.gmane.org
Signed-off-by: Elvira Khabirova <lineprinter-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
Signed-off-by: Dmitry V. Levin <ldv-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
---
Notes:
v10: unchanged
v9:
* Rebased to linux-next again due to syscall_get_arguments() signature change.
v8:
* Rebased to linux-next.
* Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK,
narrowing down the set of architectures supported by this implementation
back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because
I failed to get all syscall_get_*(), instruction_pointer(),
and user_stack_pointer() functions implemented on some niche
architectures. This leaves the following architectures out:
alpha, h8300, m68k, microblaze, and unicore32.
v7: unchanged
v6:
* Change PTRACE_GET_SYSCALL_INFO return code: do not take trailing paddings
into account, use the end of the last field of the structure being written.
* Change struct ptrace_syscall_info:
* remove .frame_pointer field, is is not needed and not portable;
* make .arch field explicitly aligned, remove no longer needed
padding before .arch field;
* remove trailing pads, they are no longer needed.
v5:
* Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
* Change struct ptrace_syscall_info: generalize instruction_pointer,
stack_pointer, and frame_pointer fields by moving them from
ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
and initializing them for all stops.
* Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
instruction_pointer when the tracee is in a signal stop.
* Make available for all architectures: do not conditionalize on
CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
are implemented on all architectures.
v4:
* Do not introduce task_struct.ptrace_event,
use child->last_siginfo->si_code instead.
* Implement PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
ptrace_syscall_info.{entry,exit}.
v3:
* Change struct ptrace_syscall_info.
* Support PTRACE_EVENT_SECCOMP by adding ptrace_event to task_struct.
* Add proper defines for ptrace_syscall_info.op values.
* Rename PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
* and move them to uapi.
v2:
* Do not use task->ptrace.
* Replace entry_info.is_compat with entry_info.arch, use syscall_get_arch().
* Use addr argument of sys_ptrace to get expected size of the struct;
return full size of the struct.
include/linux/tracehook.h | 9 ++--
include/uapi/linux/ptrace.h | 35 ++++++++++++
kernel/ptrace.c | 103 +++++++++++++++++++++++++++++++++++-
3 files changed, 143 insertions(+), 4 deletions(-)
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index df20f8bdbfa3..6bc7a3d58e2f 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -57,13 +57,15 @@ struct linux_binprm;
/*
* ptrace report for syscall entry and exit looks identical.
*/
-static inline int ptrace_report_syscall(struct pt_regs *regs)
+static inline int ptrace_report_syscall(struct pt_regs *regs,
+ unsigned long message)
{
int ptrace = current->ptrace;
if (!(ptrace & PT_PTRACED))
return 0;
+ current->ptrace_message = message;
ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
/*
@@ -76,6 +78,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
current->exit_code = 0;
}
+ current->ptrace_message = 0;
return fatal_signal_pending(current);
}
@@ -101,7 +104,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
static inline __must_check int tracehook_report_syscall_entry(
struct pt_regs *regs)
{
- return ptrace_report_syscall(regs);
+ return ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_ENTRY);
}
/**
@@ -126,7 +129,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
if (step)
user_single_step_report(regs);
else
- ptrace_report_syscall(regs);
+ ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_EXIT);
}
/**
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..a71b6e3b03eb 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,41 @@ struct seccomp_metadata {
__u64 flags; /* Output: filter's flags */
};
+#define PTRACE_GET_SYSCALL_INFO 0x420e
+#define PTRACE_SYSCALL_INFO_NONE 0
+#define PTRACE_SYSCALL_INFO_ENTRY 1
+#define PTRACE_SYSCALL_INFO_EXIT 2
+#define PTRACE_SYSCALL_INFO_SECCOMP 3
+
+struct ptrace_syscall_info {
+ __u8 op; /* PTRACE_SYSCALL_INFO_* */
+ __u32 arch __attribute__((__aligned__(sizeof(__u32))));
+ __u64 instruction_pointer;
+ __u64 stack_pointer;
+ union {
+ struct {
+ __u64 nr;
+ __u64 args[6];
+ } entry;
+ struct {
+ __s64 rval;
+ __u8 is_error;
+ } exit;
+ struct {
+ __u64 nr;
+ __u64 args[6];
+ __u32 ret_data;
+ } seccomp;
+ };
+};
+
+/*
+ * These values are stored in task->ptrace_message
+ * by tracehook_report_syscall_* to describe the current syscall-stop.
+ */
+#define PTRACE_EVENTMSG_SYSCALL_ENTRY 1
+#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
+
/* Read signals from a shared (process wide) queue */
#define PTRACE_PEEKSIGINFO_SHARED (1 << 0)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 6f357f4fc859..f68c0d44461e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,10 @@
#include <linux/compat.h>
#include <linux/sched/signal.h>
+#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
+#include <asm/syscall.h> /* For syscall_get_* */
+#endif
+
/*
* Access another process' address space via ptrace.
* Source/target buffer must be kernel space,
@@ -879,7 +883,100 @@ static int ptrace_regset(struct task_struct *task, int req, unsigned int type,
* to ensure no machine forgets it.
*/
EXPORT_SYMBOL_GPL(task_user_regset_view);
-#endif
+
+static unsigned long
+ptrace_get_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ unsigned long args[ARRAY_SIZE(info->entry.args)];
+ int i;
+
+ info->op = PTRACE_SYSCALL_INFO_ENTRY;
+ info->entry.nr = syscall_get_nr(child, regs);
+ syscall_get_arguments(child, regs, args);
+ for (i = 0; i < ARRAY_SIZE(args); i++)
+ info->entry.args[i] = args[i];
+
+ /* args is the last field in struct ptrace_syscall_info.entry */
+ return offsetofend(struct ptrace_syscall_info, entry.args);
+}
+
+static unsigned long
+ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ /*
+ * As struct ptrace_syscall_info.entry is currently a subset
+ * of struct ptrace_syscall_info.seccomp, it makes sense to
+ * initialize that subset using ptrace_get_syscall_info_entry().
+ * This can be reconsidered in the future if these structures
+ * diverge significantly enough.
+ */
+ ptrace_get_syscall_info_entry(child, regs, info);
+ info->op = PTRACE_SYSCALL_INFO_SECCOMP;
+ info->seccomp.ret_data = child->ptrace_message;
+
+ /* ret_data is the last field in struct ptrace_syscall_info.seccomp */
+ return offsetofend(struct ptrace_syscall_info, seccomp.ret_data);
+}
+
+static unsigned long
+ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ info->op = PTRACE_SYSCALL_INFO_EXIT;
+ info->exit.rval = syscall_get_error(child, regs);
+ info->exit.is_error = !!info->exit.rval;
+ if (!info->exit.is_error)
+ info->exit.rval = syscall_get_return_value(child, regs);
+
+ /* is_error is the last field in struct ptrace_syscall_info.exit */
+ return offsetofend(struct ptrace_syscall_info, exit.is_error);
+}
+
+static int
+ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
+ void __user *datavp)
+{
+ struct pt_regs *regs = task_pt_regs(child);
+ struct ptrace_syscall_info info = {
+ .op = PTRACE_SYSCALL_INFO_NONE,
+ .arch = syscall_get_arch(child),
+ .instruction_pointer = instruction_pointer(regs),
+ .stack_pointer = user_stack_pointer(regs),
+ };
+ unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry);
+ unsigned long write_size;
+
+ /*
+ * This does not need lock_task_sighand() to access
+ * child->last_siginfo because ptrace_freeze_traced()
+ * called earlier by ptrace_check_attach() ensures that
+ * the tracee cannot go away and clear its last_siginfo.
+ */
+ switch (child->last_siginfo ? child->last_siginfo->si_code : 0) {
+ case SIGTRAP | 0x80:
+ switch (child->ptrace_message) {
+ case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+ actual_size = ptrace_get_syscall_info_entry(child, regs,
+ &info);
+ break;
+ case PTRACE_EVENTMSG_SYSCALL_EXIT:
+ actual_size = ptrace_get_syscall_info_exit(child, regs,
+ &info);
+ break;
+ }
+ break;
+ case SIGTRAP | (PTRACE_EVENT_SECCOMP << 8):
+ actual_size = ptrace_get_syscall_info_seccomp(child, regs,
+ &info);
+ break;
+ }
+
+ write_size = min(actual_size, user_size);
+ return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size;
+}
+#endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
int ptrace_request(struct task_struct *child, long request,
unsigned long addr, unsigned long data)
@@ -1096,6 +1193,10 @@ int ptrace_request(struct task_struct *child, long request,
ret = __put_user(kiov.iov_len, &uiov->iov_len);
break;
}
+
+ case PTRACE_GET_SYSCALL_INFO:
+ ret = ptrace_get_syscall_info(child, addr, datavp);
+ break;
#endif
case PTRACE_SECCOMP_GET_FILTER:
--
ldv
--
Strace-devel mailing list
Strace-devel-3+4lAyCyj6AWlMsSdNXQLw@public.gmane.org
https://lists.strace.io/mailman/listinfo/strace-devel
^ permalink raw reply related
* [PATCH linux-next v10 0/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Dmitry V. Levin @ 2019-04-15 23:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Elvira Khabirova, Eugene Syromyatnikov, Oleg Nesterov,
Andy Lutomirski, Benjamin Herrenschmidt, Greentime Hu,
Helge Deller, James E.J. Bottomley, James Hogan, Kees Cook,
Michael Ellerman, Paul Burton, Paul Mackerras, Ralf Baechle,
Richard Kuo, Shuah Khan, Vincent Chen, linux-api, linux-hexagon,
linux-kernel, linux-kselftest
[Andrew, could you take this patchset into your tree, please?]
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its tracer
has no reliable means to find out that the syscall was, in fact,
a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
Common practice is to keep track of the sequence of ptrace-stops in order
not to mix the two syscall-stops up. But it is not as simple as it looks;
for example, strace had a (just recently fixed) long-standing bug where
attaching strace to a tracee that is performing the execve system call
led to the tracer identifying the following syscall-exit-stop as
syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
and process_vm_readv become unavailable when the process dumpable flag
is cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
PTRACE_GET_SYSCALL_INFO returns the following structure:
struct ptrace_syscall_info {
__u8 op; /* PTRACE_SYSCALL_INFO_* */
__u32 arch __attribute__((__aligned__(sizeof(__u32))));
__u64 instruction_pointer;
__u64 stack_pointer;
union {
struct {
__u64 nr;
__u64 args[6];
} entry;
struct {
__s64 rval;
__u8 is_error;
} exit;
struct {
__u64 nr;
__u64 args[6];
__u32 ret_data;
} seccomp;
};
};
The structure was chosen according to [2], except for the following
changes:
* seccomp substructure was added as a superset of entry substructure;
* the type of nr field was changed from int to __u64 because syscall
numbers are, as a practical matter, 64 bits;
* stack_pointer field was added along with instruction_pointer field
since it is readily available and can save the tracer from extra
PTRACE_GETREGS/PTRACE_GETREGSET calls;
* arch is always initialized to aid with tracing system calls
* such as execve();
* instruction_pointer and stack_pointer are always initialized
so they could be easily obtained for non-syscall stops;
* a boolean is_error field was added along with rval field, this way
the tracer can more reliably distinguish a return value
from an error value.
strace has been ported to PTRACE_GET_SYSCALL_INFO.
Starting with release 4.26, strace uses PTRACE_GET_SYSCALL_INFO API
as the preferred mechanism of obtaining syscall information.
[1] https://lore.kernel.org/lkml/CA+55aFzcSVmdDj9Lh_gdbz1OzHyEm6ZrGPBDAJnywm2LF_eVyg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAObL_7GM0n80N7J_DFw_eQyfLyzq+sf4y2AvsCCV88Tb3AwEHA@mail.gmail.com/
---
Notes:
v10: added more Acked-by.
v9:
* Rebased to linux-next again due to syscall_get_arguments() signature change.
v8:
* Moved syscall_get_arch() specific patches to a separate patchset
which is now merged into audit/next tree.
* Rebased to linux-next.
* Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK,
narrowing down the set of architectures supported by this implementation
back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because
I failed to get all syscall_get_*(), instruction_pointer(),
and user_stack_pointer() functions implemented on some niche
architectures. This leaves the following architectures out:
alpha, h8300, m68k, microblaze, and unicore32.
v7:
* Rebased to v5.0-rc1.
* 5 arch-specific preparatory patches out of 25 have been merged
into v5.0-rc1 via arch trees.
v6:
* Add syscall_get_arguments and syscall_set_arguments wrappers
to asm-generic/syscall.h, requested by Geert.
* Change PTRACE_GET_SYSCALL_INFO return code: do not take trailing paddings
into account, use the end of the last field of the structure being written.
* Change struct ptrace_syscall_info:
* remove .frame_pointer field, is is not needed and not portable;
* make .arch field explicitly aligned, remove no longer needed
padding before .arch field;
* remove trailing pads, they are no longer needed.
v5:
* Merge separate series and patches into the single series.
* Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
* Change struct ptrace_syscall_info: generalize instruction_pointer,
stack_pointer, and frame_pointer fields by moving them from
ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
and initializing them for all stops.
* Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
instruction_pointer when the tracee is in a signal stop.
* Patch all remaining architectures to provide all necessary
syscall_get_* functions.
* Make available for all architectures: do not conditionalize on
CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
are implemented on all architectures.
* Add a test for PTRACE_GET_SYSCALL_INFO to selftests/ptrace.
v4:
* Do not introduce task_struct.ptrace_event,
use child->last_siginfo->si_code instead.
* Implement PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
ptrace_syscall_info.{entry,exit}.
v3:
* Change struct ptrace_syscall_info.
* Support PTRACE_EVENT_SECCOMP by adding ptrace_event to task_struct.
* Add proper defines for ptrace_syscall_info.op values.
* Rename PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
* and move them to uapi.
v2:
* Do not use task->ptrace.
* Replace entry_info.is_compat with entry_info.arch, use syscall_get_arch().
* Use addr argument of sys_ptrace to get expected size of the struct;
return full size of the struct.
Dmitry V. Levin (6):
nds32: fix asm/syscall.h # acked
hexagon: define syscall_get_error() and syscall_get_return_value() # waiting for ack since November
mips: define syscall_get_error() # acked
parisc: define syscall_get_error() # acked
powerpc: define syscall_get_error() # waiting for ack since early December
selftests/ptrace: add a test case for PTRACE_GET_SYSCALL_INFO # acked
Elvira Khabirova (1):
ptrace: add PTRACE_GET_SYSCALL_INFO request # reviewed
arch/hexagon/include/asm/syscall.h | 14 +
arch/mips/include/asm/syscall.h | 6 +
arch/nds32/include/asm/syscall.h | 27 +-
arch/parisc/include/asm/syscall.h | 7 +
arch/powerpc/include/asm/syscall.h | 10 +
include/linux/tracehook.h | 9 +-
include/uapi/linux/ptrace.h | 35 +++
kernel/ptrace.c | 103 ++++++-
tools/testing/selftests/ptrace/.gitignore | 1 +
tools/testing/selftests/ptrace/Makefile | 2 +-
.../selftests/ptrace/get_syscall_info.c | 271 ++++++++++++++++++
11 files changed, 470 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/get_syscall_info.c
--
ldv
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Jonathan Kowalski @ 2019-04-15 21:27 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Aleksa Sarai, Enrico Weigelt, metux IT consult, Christian Brauner,
Linus Torvalds, Al Viro, Jann Horn, David Howells, Linux API,
LKML, Serge E. Hallyn, Arnd Bergmann, Eric W. Biederman,
Kees Cook, Thomas Gleixner, Michael Kerrisk, Andrew Morton,
Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CALCETrWxMnaPvwicqkMLswMynWvJVteazD-bFv3ZnBKWp-1joQ@mail.gmail.com>
On Mon, Apr 15, 2019 at 9:34 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult <lkml@metux.net> wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
> >
>
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes. And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit. I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable. I suppose it would be nifty if there were a way for a
Hmm, isn't what you're describing roughly what systemd-run -t does? It
will serialize the argument list, ask PID 1 to create a transient unit
(go through the polkit stuff), and then set the stdout/stderr and
stdin of the service to your tty, make it the controlling terminal of
the process and
reset it. So I guess it should work with sudo/su just fine too.
There is also s6-sudod (and a s6-sudoc client to it) that works in a
similar fashion, though it's a lot less fancy.
> process, by mutual agreement, to reparent itself to an unrelated
> process.
>
> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.
My idea was that this intent could be supplied at clone time, you
could attach ptrace access modes to a pidfd (we could make those a bit
granular, perhaps) and any API that takes PIDs and checks against the
caller's ptrace access mode could instead derive so from the pidfd.
Since killing is a bit convoluted due to setuid binaries, that should
work if one is CAP_KILL capable in the owning userns of the task, and
if not that, has permissions to kill and the target has NNP set. This
would allow you to bind kill privileges in a way that is compatible
with both worlds, the upshot being NNP allows for the functionality to
be available to a lot more of userspace. Ofcourse, this would require
a new clone version, possibly with taking a clone2 struct which sets a
few parameters for the process and the flags for the pidfd.
Another point is that you have a pidfd_open (or something else) that
can create multiple pidfds from a pidfd obtained at clone time and
create pidfds with varying level of rights. It can also work by taking
a TID to open a pidfd for an external task (and then for all the
rights you wish to acquire on it, check against your ambient
authority).
(Actually, in general, having FMODE_* style bits spanning all methods
a file descriptor can take (through system calls), with the type of
object as key (class containing a set), and be able to enable/disable
them and seal them would be a useful addition, this all happening at
the struct file level instead of inode level sealing in memfds).
>
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc. The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.
>
> Sharing the parent's VM, signal set, fd table, etc, should all be
> options, but they should default to *off*.
Historical note: Plan 9's rfork has RFC* flags for resources like
namespace/env/fd table, which means supplying those means you start
with an clean/empty view of that resource.
>
> (Many other operating systems allow one to create a process and gain a
> capability to do all kinds of things to that process. It's a
> generally good idea.)
>
> --Andy
^ permalink raw reply
* Re: [PATCH 15/17] fpga: dfl: fme: add power management support
From: Alan Tull @ 2019-04-15 21:17 UTC (permalink / raw)
To: Wu Hao
Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
Xu Yilun
In-Reply-To: <20190412025019.GA27997@hao-dev>
On Thu, Apr 11, 2019 at 10:06 PM Wu Hao <hao.wu@intel.com> wrote:
>
> On Thu, Apr 11, 2019 at 03:07:35PM -0500, Alan Tull wrote:
> > On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:
> >
> > Hi Hao,
> >
> > >
> > > This patch adds support for power management private feature under
> > > FPGA Management Engine (FME), sysfs interfaces are introduced for
> > > different power management functions, users could use these sysfs
> > > interface to get current number of consumed power, throttling
> >
> > How about
> > s/number/measurement/
> > ?
>
> Sounds better. : )
>
> >
> > > thresholds, threshold status and other information, and configure
> > > different value for throttling thresholds too.
> > >
> > > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > > ---
> > > Documentation/ABI/testing/sysfs-platform-dfl-fme | 56 +++++
> > > drivers/fpga/dfl-fme-main.c | 257 +++++++++++++++++++++++
> > > 2 files changed, 313 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > index d3aeb88..4b6448f 100644
> > > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > @@ -100,3 +100,59 @@ Description: Read-only. Read this file to get the policy of temperature
> > > threshold1. It only supports two value (policy):
> > > 0 - AP2 state (90% throttling)
> > > 1 - AP1 state (50% throttling)
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/consumed
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-only. It returns current power consumed by FPGA.
> >
> > What are the units?
> >
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-Write. Read/Write this file to get/set current power
> > > + threshold1 in Watts.
> >
> > Perhaps document error codes here and for threshold2 below.
> >
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-Write. Read/Write this file to get/set current power
> > > + threshold2 in Watts.
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1_status
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-only. It returns 1 if power consumption reaches the
> > > + threshold1, otherwise 0.
> >
> > I'm used to things like this requiring user to reset the status, so it
> > may be worth making it explicit that it will return to zero if
> > consumption drops below threshold if that's what's happening here.
> > If it's correct, perhaps could just say something like 'returns 1 if
> > power consumption is currently at or above threshold1, otherwise 0'
> >
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2_status
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-only. It returns 1 if power consumption reaches the
> > > + threshold2, otherwise 0.
> >
> > Same here.
>
> Sure, will fix all above comments in this sysfs doc.
>
> >
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/ltr
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-only. Read this file to get current Latency Tolerance
> > > + Reporting (ltr) value, it's only valid for integrated
> > > + solution as it blocks CPU on low power state.
> >
> > If we're not on the integrated solution, it returns a value but it is
> > not really real?
>
> Currently only integrated solution is implementing this private feature, other
> devices e.g. Intel PAC card is not using this private feature, so user will
> not see these sysfs interfaces at all.
OK then perhaps the "it's only valid for integrated solution as it
blocks CPU on low power state" explanation doesn't need to be here and
can lead to confusion.
>
> If in the future, other devices want
>
> >
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/xeon_limit
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-only. Read this file to get power limit for xeon, it
> > > + is only valid for integrated solution.
> > > +
> > > +What: /sys/bus/platform/devices/dfl-fme.0/power_mgmt/fpga_limit
> > > +Date: March 2019
> > > +KernelVersion: 5.2
> > > +Contact: Wu Hao <hao.wu@intel.com>
> > > +Description: Read-only. Read this file to get power limit for fpga, it
> > > + is only valid for integrated solution.
> > > diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> > > index 449a17d..dafa6580 100644
> > > --- a/drivers/fpga/dfl-fme-main.c
> > > +++ b/drivers/fpga/dfl-fme-main.c
> > > @@ -415,6 +415,259 @@ static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
> > > .uinit = fme_thermal_mgmt_uinit,
> > > };
> > >
> > > +#define FME_PWR_STATUS 0x8
> > > +#define FME_LATENCY_TOLERANCE BIT_ULL(18)
> > > +#define PWR_CONSUMED GENMASK_ULL(17, 0)
> > > +
> > > +#define FME_PWR_THRESHOLD 0x10
> > > +#define PWR_THRESHOLD1 GENMASK_ULL(6, 0) /* in Watts */
> > > +#define PWR_THRESHOLD2 GENMASK_ULL(14, 8) /* in Watts */
> > > +#define PWR_THRESHOLD_MAX 0x7f
> > > +#define PWR_THRESHOLD1_STATUS BIT_ULL(16)
> > > +#define PWR_THRESHOLD2_STATUS BIT_ULL(17)
> > > +
> > > +#define FME_PWR_XEON_LIMIT 0x18
> > > +#define XEON_PWR_LIMIT GENMASK_ULL(14, 0)
> > > +#define XEON_PWR_EN BIT_ULL(15)
> > > +#define FME_PWR_FPGA_LIMIT 0x20
> > > +#define FPGA_PWR_LIMIT GENMASK_ULL(14, 0)
> > > +#define FPGA_PWR_EN BIT_ULL(15)
> > > +
> > > +#define POWER_ATTR(_name, _mode, _show, _store) \
> > > +struct device_attribute power_attr_##_name = \
> > > + __ATTR(_name, _mode, _show, _store)
> > > +
> > > +#define POWER_ATTR_RO(_name, _show) \
> > > + POWER_ATTR(_name, 0444, _show, NULL)
> > > +
> > > +#define POWER_ATTR_RW(_name, _show, _store) \
> > > + POWER_ATTR(_name, 0644, _show, _store)
> >
> > Are these #defines necessary? Seems like you could just use DEVICE_ATTR*
>
> Actually it adds a prefix power_attr_xxx there to avoid name conflicts with
> other ones from different private features, e.g. for the thermal threshold.
Ah yes, I see it now, thanks.
>
> >
> > > +
> > > +static ssize_t pwr_consumed_show(struct device *dev,
> > > + struct device_attribute *attr, char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_STATUS);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > + (unsigned int)FIELD_GET(PWR_CONSUMED, v));
> > > +}
> > > +static POWER_ATTR_RO(consumed, pwr_consumed_show);
> > > +
> > > +static ssize_t pwr_threshold1_show(struct device *dev,
> > > + struct device_attribute *attr, char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > + (unsigned int)FIELD_GET(PWR_THRESHOLD1, v));
> > > +}
> > > +
> > > +static ssize_t pwr_threshold1_store(struct device *dev,
> > > + struct device_attribute *attr,
> > > + const char *buf, size_t count)
> > > +{
> > > + struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > > + void __iomem *base;
> > > + u8 threshold;
> > > + int ret;
> > > + u64 v;
> > > +
> > > + ret = kstrtou8(buf, 0, &threshold);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (threshold > PWR_THRESHOLD_MAX)
> > > + return -EINVAL;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + mutex_lock(&pdata->lock);
> > > + v = readq(base + FME_PWR_THRESHOLD);
> > > + v &= ~PWR_THRESHOLD1;
> > > + v |= FIELD_PREP(PWR_THRESHOLD1, threshold);
> > > + writeq(v, base + FME_PWR_THRESHOLD);
> > > + mutex_unlock(&pdata->lock);
> > > +
> > > + return count;
> > > +}
> > > +static POWER_ATTR_RW(threshold1, pwr_threshold1_show, pwr_threshold1_store);
> > > +
> > > +static ssize_t pwr_threshold2_show(struct device *dev,
> > > + struct device_attribute *attr, char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > + (unsigned int)FIELD_GET(PWR_THRESHOLD2, v));
> > > +}
> > > +
> > > +static ssize_t pwr_threshold2_store(struct device *dev,
> > > + struct device_attribute *attr,
> > > + const char *buf, size_t count)
> > > +{
> > > + struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > > + void __iomem *base;
> > > + u8 threshold;
> > > + int ret;
> > > + u64 v;
> > > +
> > > + ret = kstrtou8(buf, 0, &threshold);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (threshold > PWR_THRESHOLD_MAX)
> > > + return -EINVAL;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + mutex_lock(&pdata->lock);
> > > + v = readq(base + FME_PWR_THRESHOLD);
> > > + v &= ~PWR_THRESHOLD2;
> > > + v |= FIELD_PREP(PWR_THRESHOLD2, threshold);
> > > + writeq(v, base + FME_PWR_THRESHOLD);
> > > + mutex_unlock(&pdata->lock);
> > > +
> > > + return count;
> > > +}
> > > +static POWER_ATTR_RW(threshold2, pwr_threshold2_show, pwr_threshold2_store);
> > > +
> > > +static ssize_t pwr_threshold1_status_show(struct device *dev,
> > > + struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > + (unsigned int)FIELD_GET(PWR_THRESHOLD1_STATUS, v));
> > > +}
> > > +static POWER_ATTR_RO(threshold1_status, pwr_threshold1_status_show);
> > > +
> > > +static ssize_t pwr_threshold2_status_show(struct device *dev,
> > > + struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > + (unsigned int)FIELD_GET(PWR_THRESHOLD2_STATUS, v));
> > > +}
> > > +static POWER_ATTR_RO(threshold2_status, pwr_threshold2_status_show);
> > > +
> > > +static ssize_t ltr_show(struct device *dev,
> > > + struct device_attribute *attr, char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_STATUS);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > + (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> > > +}
> > > +static POWER_ATTR_RO(ltr, ltr_show);
> > > +
> > > +static ssize_t xeon_limit_show(struct device *dev,
> > > + struct device_attribute *attr, char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u16 xeon_limit = 0;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_XEON_LIMIT);
> > > +
> > > + if (FIELD_GET(XEON_PWR_EN, v))
> > > + xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n", xeon_limit);
> > > +}
> > > +static POWER_ATTR_RO(xeon_limit, xeon_limit_show);
> > > +
> > > +static ssize_t fpga_limit_show(struct device *dev,
> > > + struct device_attribute *attr, char *buf)
> > > +{
> > > + void __iomem *base;
> > > + u16 fpga_limit = 0;
> > > + u64 v;
> > > +
> > > + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > + v = readq(base + FME_PWR_FPGA_LIMIT);
> > > +
> > > + if (FIELD_GET(FPGA_PWR_EN, v))
> > > + fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> > > +
> > > + return scnprintf(buf, PAGE_SIZE, "%u\n", fpga_limit);
> > > +}
> > > +static POWER_ATTR_RO(fpga_limit, fpga_limit_show);
> > > +
> > > +static struct attribute *power_mgmt_attrs[] = {
> > > + &power_attr_consumed.attr,
> > > + &power_attr_threshold1.attr,
> > > + &power_attr_threshold2.attr,
> > > + &power_attr_threshold1_status.attr,
> > > + &power_attr_threshold2_status.attr,
> > > + &power_attr_xeon_limit.attr,
> > > + &power_attr_fpga_limit.attr,
> > > + &power_attr_ltr.attr,
> >
> > This is a nit, but I would expect to see these listed in the same
> > order as their show/store functions above. So ltr_attr would come
> > between threshold2_status_attr and xeon_limit_attr.
>
> Sure, it does make sense.
>
> >
> > > + NULL,
> > > +};
> > > +
> > > +static struct attribute_group power_mgmt_attr_group = {
> > > + .attrs = power_mgmt_attrs,
> > > + .name = "power_mgmt",
> > > +};
> > > +
> > > +static int fme_power_mgmt_init(struct platform_device *pdev,
> > > + struct dfl_feature *feature)
> > > +{
> > > + return sysfs_create_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > > +}
> > > +
> > > +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> > > + struct dfl_feature *feature)
> > > +{
> > > + sysfs_remove_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > > +}
> > > +
> > > +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> > > + {.id = FME_FEATURE_ID_POWER_MGMT,},
> > > + {0,}
> > > +};
> > > +
> > > +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> > > + .init = fme_power_mgmt_init,
> > > + .uinit = fme_power_mgmt_uinit,
> > > +};
> > > +
> > > static struct dfl_feature_driver fme_feature_drvs[] = {
> > > {
> > > .id_table = fme_hdr_id_table,
> > > @@ -429,6 +682,10 @@ static struct dfl_feature_driver fme_feature_drvs[] = {
> > > .ops = &fme_thermal_mgmt_ops,
> > > },
> > > {
> > > + .id_table = fme_power_mgmt_id_table,
> > > + .ops = &fme_power_mgmt_ops,
> > > + },
> > > + {
> > > .ops = NULL,
> > > },
> > > };
> > > --
> > > 2.7.4
> > >
>
> Thanks a lot for the review and comments. :)
>
> Hao
>
> >
> > Thanks,
> > Alan
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Andy Lutomirski @ 2019-04-15 20:29 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Enrico Weigelt, metux IT consult, Christian Brauner,
Linus Torvalds, Al Viro, Jann Horn, David Howells, Linux API,
LKML, Serge E. Hallyn, Andrew Lutomirski, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <20190415195911.z7b7miwsj67ha54y@yavin>
On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2019-04-15, Enrico Weigelt, metux IT consult <lkml@metux.net> wrote:
> > > This patchset makes it possible to retrieve pid file descriptors at
> > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > clone() system call as previously discussed.
> >
> > Sorry, for highjacking this thread, but I'm curious on what things to
> > consider when introducing new CLONE_* flags.
> >
> > The reason I'm asking is:
> >
> > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > processes can change their own namespace at will. For that, certain
> > traditional unix'ish things have to be disabled, most notably suid.
> > As forbidding suid can be helpful in other scenarios, too, I thought
> > about making this its own feature. Doing that switch on clone() seems
> > a nice place for that, IMHO.
>
> Just spit-balling -- is no_new_privs not sufficient for this usecase?
> Not granting privileges such as setuid during execve(2) is the main
> point of that flag.
>
I would personally *love* it if distros started setting no_new_privs
for basically all processes. And pidfd actually gets us part of the
way toward a straightforward way to make sudo and su still work in a
no_new_privs world: su could call into a daemon that would spawn the
privileged task, and su would get a (read-only!) pidfd back and then
wait for the fd and exit. I suppose that, done naively, this might
cause some odd effects with respect to tty handling, but I bet it's
solveable. I suppose it would be nifty if there were a way for a
process, by mutual agreement, to reparent itself to an unrelated
process.
Anyway, clone(2) is an enormous mess. Surely the right solution here
is to have a whole new process creation API that takes a big,
extensible struct as an argument, and supports *at least* the full
abilities of posix_spawn() and ideally covers all the use cases for
fork() + do stuff + exec(). It would be nifty if this API also had a
way to say "add no_new_privs and therefore enable extra functionality
that doesn't work without no_new_privs". This functionality would
include things like returning a future extra-privileged pidfd that
gives ptrace-like access.
As basic examples, the improved process creation API should take a
list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
from, fds to close (or, maybe even better, a list of fds to *not*
close), a list of rlimit changes to make, a list of signal changes to
make, the ability to set sid, pgrp, uid, gid (as in
setresuid/setresgid), the ability to do capset() operations, etc. The
posix_spawn() API, for all that it's rather complicated, covers a
bunch of the basics pretty well.
Sharing the parent's VM, signal set, fd table, etc, should all be
options, but they should default to *off*.
(Many other operating systems allow one to create a process and gain a
capability to do all kinds of things to that process. It's a
generally good idea.)
--Andy
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Aleksa Sarai @ 2019-04-15 19:59 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult
Cc: Christian Brauner, torvalds, viro, jannh, dhowells, linux-api,
linux-kernel, serge, luto, arnd, ebiederm, keescook, tglx,
mtk.manpages, akpm, oleg, joel, dancol
In-Reply-To: <dc05ffe3-c2ff-8b3e-d181-e0cc620bf91d@metux.net>
[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]
On 2019-04-15, Enrico Weigelt, metux IT consult <lkml@metux.net> wrote:
> > This patchset makes it possible to retrieve pid file descriptors at
> > process creation time by introducing the new flag CLONE_PIDFD to the
> > clone() system call as previously discussed.
>
> Sorry, for highjacking this thread, but I'm curious on what things to
> consider when introducing new CLONE_* flags.
>
> The reason I'm asking is:
>
> I'm working on implementing plan9-like fs namespaces, where unprivileged
> processes can change their own namespace at will. For that, certain
> traditional unix'ish things have to be disabled, most notably suid.
> As forbidding suid can be helpful in other scenarios, too, I thought
> about making this its own feature. Doing that switch on clone() seems
> a nice place for that, IMHO.
Just spit-balling -- is no_new_privs not sufficient for this usecase?
Not granting privileges such as setuid during execve(2) is the main
point of that flag.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 2/4] clone: add CLONE_PIDFD
From: Daniel Colascione @ 2019-04-15 19:39 UTC (permalink / raw)
To: Jonathan Kowalski
Cc: Oleg Nesterov, Christian Brauner, Linus Torvalds, Al Viro,
Jann Horn, David Howells, Linux API, linux-kernel,
Serge E. Hallyn, Andy Lutomirski, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner,
Michael Kerrisk-manpages, Andrew Morton, Aleksa Sarai,
Joel Fernandes
In-Reply-To: <CAGLj2rHpZ92txk2=yf7dtHFcic2H0bGB0=FzEo7w2rNpC2zxDA@mail.gmail.com>
On Mon, Apr 15, 2019 at 10:15 AM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> > Why else do we want pidfd?
>
> Apart from what others have already pointed out, there are two other
> things I am looking forward to:
Everything that Christian, Joel, and Jonathan have said is right.
If I can wax philosophical for a bit (as I've been accused to doing
:-)), there's a lot of value in consistency itself, a "more than the
sum of its parts" effect that arises from modeling all kernel-resource
handles as file descriptors. You get lifecycle consistency, API
consistency (e.g., dup, close), introspection consistency (via
/proc/pid/fd and friends), wait consistency, IPC consistency, and tons
of other benefits from using a file descriptor. The alternatives tend
to be very ugly: one of SysV IPC's* biggest mistakes, for example, was
having users manage its resources via non-file-descriptor kernel
handles. The process is, I think, the last major class of kernel
resource that users can't manipulate via file descriptor. Even if
using pidfds didn't provide strong immediate and direct benefits, it'd
*still* be worth moving to a file descriptor resource handle model for
the sake of making the system interface regular and uniform.
* Does anyone know *why* the SysV people didn't use FDs? The
consistency argument I'm making was just as relevant then as it is
now.
^ permalink raw reply
* Re: [RFC PATCH v1 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Steve Grubb @ 2019-04-15 18:47 UTC (permalink / raw)
To: Jan Kara
Cc: Mickaël Salaün, linux-kernel, Al Viro, James Morris,
Jonathan Corbet, Kees Cook, Matthew Garrett, Michael Kerrisk,
Mickaël Salaün, Mimi Zohar, Philippe Trébuchet,
Shuah Khan, Thibaut Sautereau, Vincent Strubel, Yves-Alexis Perez,
kernel-hardening, linux-api, linux-security-module
In-Reply-To: <20181212144306.GA19945@quack2.suse.cz>
Hello,
On Wednesday, December 12, 2018 9:43:06 AM EDT Jan Kara wrote:
> On Wed 12-12-18 09:17:08, Mickaël Salaün wrote:
> > When the O_MAYEXEC flag is passed, sys_open() may be subject to
> > additional restrictions depending on a security policy implemented by an
> > LSM through the inode_permission hook.
> >
> > The underlying idea is to be able to restrict scripts interpretation
> > according to a policy defined by the system administrator. For this to
> > be possible, script interpreters must use the O_MAYEXEC flag
> > appropriately. To be fully effective, these interpreters also need to
> > handle the other ways to execute code (for which the kernel can't help):
> > command line parameters (e.g., option -e for Perl), module loading
> > (e.g., option -m for Python), stdin, file sourcing, environment
> > variables, configuration files... According to the threat model, it may
> > be acceptable to allow some script interpreters (e.g. Bash) to interpret
> > commands from stdin, may it be a TTY or a pipe, because it may not be
> > enough to (directly) perform syscalls.
> >
> > A simple security policy implementation is available in a following
> > patch for Yama.
> >
> > This is an updated subset of the patch initially written by Vincent
> > Strubel for CLIP OS:
> > https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d
> > 6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch This patch has
> > been used for more than 10 years with customized script interpreters.
> > Some examples can be found here:
> > https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYE
> > XEC
> >
> > Signed-off-by: Mickaël Salaün <mic@digikod.net>
> > Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
> > Signed-off-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
> > Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
>
> ...
>
> > diff --git a/fs/open.c b/fs/open.c
> > index 0285ce7dbd51..75479b79a58f 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -974,6 +974,10 @@ static inline int build_open_flags(int flags,
> > umode_t mode, struct open_flags *o>
> > if (flags & O_APPEND)
> >
> > acc_mode |= MAY_APPEND;
> >
> > + /* Check execution permissions on open. */
> > + if (flags & O_MAYEXEC)
> > + acc_mode |= MAY_OPENEXEC;
> > +
> >
> > op->acc_mode = acc_mode;
> >
> > op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
>
> I don't feel experienced enough in security to tell whether we want this
> functionality or not. But if we do this, shouldn't we also set FMODE_EXEC
> on the resulting struct file? That way also security_file_open() can be
> used to arbitrate such executable opens and in particular
> fanotify permission event FAN_OPEN_EXEC will get properly generated which I
> guess is desirable (support for it is sitting in my tree waiting for the
> merge window) - adding some audit people involved in FAN_OPEN_EXEC to CC.
> Just an idea...
Late in replying. But I think it's important to have a deep look into the
issue.
TL;DR - This is a gentle man's handshake. It won't _really_ solve the
problem.
This flag that is being proposed means that you would have to patch all
interpreters to use it. If you are sure that upstreams will accept that, why
not just change the policy to interpreters shouldn't execute anything unless
the execute bit is set? That is simpler and doesn't need a kernel change. And
setting the execute bit is an auditable event.
The bottom line is that any interpreter has to become a security policy
enforcement point whether by indicating it wants to execute by setting a flag
or by refusing to use a file without execute bit set. But this just moves the
problem to one that is harder to fix. Why in the world does any programming
language allow programs to be loaded via stdin?
It is possible to wget a program and pipe it into python which subsequently
pulls down an ELF shared object and runs it all without touching disk via
memfd_create (e.g. SnakeEater). This is all direct to memory execution. And
direct to memory bypasses anti-virus, selinux, IMA, application whitelisting,
and other integrity schemes.
So, to fix this problem, you really need to not allow any programs to load via
stdin so that everything that executes has to touch disk. This way you can
get a fanotify event and see the application and vote yes/no on allowing it.
And this will be particularly harder with the memfd_create fix for the runc
container breakout. Prior to that, there were very few uses of that system
call. Now it may be very common which means finding malicious use just got
harder to spot.
But assuming these problems got fixed, then we have yet another place to look.
Many interpreters allow you to specify a command to run via arguments. Some
have a small buffer and some allow lengthy programs to be entered as an
argument. One strategy might be that an attacker can bootstrap a lengthier
program across the network. Python for example allows loading modules across
a network. All you need to put in the commandline is the override for the
module loader and a couple modules to import. It then loads the modules
remotely. Getting rid of this hole will likely lead to some unhappy people -
meaning it can't be fixed.
And even if we get that fixed, we have one last hole to plug. Shells. One can
simply start a shell and paste their program into the shell and then execute
it. You can easily do this with bash or python or any language that has a
REPL (read–eval–print loop). To fix this means divorcing the notion of a
language from a REPL. Production systems really do not need a Python shell,
they need the interpreter. I doubt that this would be popular. But fixing each
of these issues is what it would take to prevent unknown software from
running. Not going this far leaves holes.
Best Regards,
-Steve
^ permalink raw reply
* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: Adhemerval Zanella @ 2019-04-15 17:22 UTC (permalink / raw)
To: hpa, Florian Weimer; +Cc: libc-alpha, linux-api, linuxppc-dev
In-Reply-To: <A278227C-039B-49F4-B80D-650B785AE225@zytor.com>
On 15/04/2019 12:53, hpa@zytor.com wrote:
> On April 12, 2019 12:50:41 AM PDT, Florian Weimer <fweimer@redhat.com> wrote:
>> * Adhemerval Zanella:
>>
>>> On 11/04/2019 08:07, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>>
>>>>> This allows us to adjust the baud rates to non-standard values
>> using termios
>>>>> interfaces without to resorting to add new headers and use a
>> different API
>>>>> (ioctl).
>>>>
>>>> How much symbol versioning will be required for this change?
>>>
>>> I think all interfaces that have termios as input for sparc and mips
>>> (tcgetattr, tcsetattr, cfmakeraw, cfgetispeed, cfgetospeed,
>> cfsetispeed,
>>> cfsetospeed, cfsetspeed).
>>>
>>> Alpha will also need to use termios1 for pre-4.20 kernels.
>>
>> So only new symbol versions there? Hmm.
>>
>>>>> As Peter Anvin has indicated, he create a POC [1] with the
>> aforementioned
>>>>> new interfaces. It has not been rebased against master, more
>> specially against
>>>>> my termios refactor to simplify the multiple architecture header
>> definitions,
>>>>> but I intend to use as a base.
>>>>
>>>> Reference [1] is still missing. 8-(
>>>
>>> Oops... it is
>> https://git.zytor.com/users/hpa/glibc/termbaud.git/log/?h=wip.termbaud
>>
>> This doesn't really illuminate things. “Drop explicit baud setting
>> interfaces in favor of cfenc|decspeed()” removes the new symbol version
>> for the cf* functions.
>>
>> My gut feeling is that it's safer to add new interfaces, based on the
>> actual kernel/userspace interface, rather than trying to fix up
>> existing
>> interfaces with symbol versioning. The main reason is that code
>> involving serial interfaces is difficult to test, so it will take years
>> until we find the last application broken by the glibc interface bump.
>>
>> I don't feel strongly about this. This came out of a request for
>> enabling TCGETS2 support downstream. If I can't fix this upstream, I
>> will just reject that request.
>>
>> Thanks,
>> Florian
>
> New interfaces are only necessary for the handful of architectures that don't have the speed fields *and* to space to put them in.
Based on your WIP, it seems that both sparc and mips could be adapted.
Do we still have glibc supported architecture that would require compat
symbols?
>
> Using symbol versioning doesn't really help much since the real problem is that struct termios can be passed around in userspace, and the interfaces between user space libraries don't have any versioning. However, my POC code deals with that too by only seeing BOTHER when necessary, so if the structure is extended garbage in the extra fields will be ignored unless new baud rates are in use.
Yeah, we discussed this earlier and if recall correctly it was not settled
that all architectures would allow the use to extra space for the new
fields. It seems the case, which makes the adaptation for termios2 even easier.
The question I have for kernel side is whether termios2 is fully compatible
with termios, meaning that if there is conner cases we need to handle in
userland.
>
> My POC code deals with Alpha as well by falling back to the old interfaces if necessary and possible, otherwise return error.
>
> Exporting termios2 to user space feels a bit odd at this stage as it would only be usable as a fallback on old glibc. Call it kernel_termios2 at least. ioctls using struct termios *must* be changed to kernel_termios anyway!
>
I still prefer to avoid export it to userland and make it usable through
default termios, as your wip does. My understanding is new interfaces
should be semantic equal to current one with the only deviation that
non-standard baudrates will handled as its values. The only issue I
can foresee is if POSIX starts to export new bauds value.
^ permalink raw reply
* Re: [PATCH 2/4] clone: add CLONE_PIDFD
From: Jonathan Kowalski @ 2019-04-15 17:15 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Linus Torvalds, Al Viro, Jann Horn,
David Howells, Linux API, linux-kernel, Serge E. Hallyn,
Andy Lutomirski, Arnd Bergmann, Eric W. Biederman, Kees Cook,
Thomas Gleixner, Michael Kerrisk-manpages, Andrew Morton,
Aleksa Sarai, Joel Fernandes, Daniel Colascione
In-Reply-To: <20190415132416.GB22204@redhat.com>
On Mon, Apr 15, 2019 at 2:25 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 04/15, Christian Brauner wrote:
> >
> > > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> > >
> > > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > > return ERR_PTR(-EINVAL);
> > >
> > > at the start of copy_process() ?
> > >
> > > Then it can do
> > >
> > > if (clone_flags & CLONE_PIDFD) {
> > > retval = pidfd_create(pid, &pidfdf);
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > retval = put_user(retval, parent_tidptr)
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > }
> >
> > Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> > let us return the pid and the pidfd in one go and we can also start
> > pidfd numbering at 0.
>
> Christian, sorry if it was already discussed, but I can't force myself to
> read all the previous discussions ;)
>
> If we forget about CONFIG_PROC_FS, why do we really want to create a file?
>
>
> Suppose we add a global u64 counter incremented by copy_process and reported
> in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
> *parent_tidptr. Let's denote this counter as UNIQ_PID.
>
> Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
> can do
>
> kill_by_pid_uniq(int pid, u64 uniq_pid)
> {
> pidfd = open("/proc/$pid", O_DIRECTORY);
>
> status = openat(pidfd, "status");
> u64 this_uniq_pid = ... read UNIQ_PID from status ...;
>
> if (uniq_pid != this_uniq_pid)
> return;
>
> pidfd_send_signal(pidfd);
> }
>
> Why else do we want pidfd?
Apart from what others have already pointed out, there are two other
things I am looking forward to:
* Currently, when ptracing from a thread, waitpid means that I need to
block or constantly loop over with pauses to receive the ptrace
related results, since ptrace is thread directed (and to be able to
poll other event sources as well, eg. to receive further commands over
a pipe/message passing fd), and related waitpid responses only arrive
to the attached thread. The waitfd patchset was rejected on the
grounds that one could use a separate thread to do the waitpid while
polling from the attached thread or a new thread, but due to ptrace
this is false. pidfds would allow for this to work (this does mean
we'd also need to be able to return one at ATTACH/SEIZE time, though).
Note that waitid and other variants throw away a lot of needed
information.
* Descriptors mean you can optionally choose to bind your privileges
to the file descriptor and then pass it around to others. They do not
work this way now but the choice of such an extension has been kept
open. One of the examples is binding one's CAP_KILL capability and
then pass it to another process, so that it can freely signal the said
process (and only that), or be able to optionally poke holes in the
restrictions imposed by PID namespaces (possibly in the future), etc.
>
> Oleg.
>
^ permalink raw reply
* Re: [PATCH 2/4] clone: add CLONE_PIDFD
From: Joel Fernandes @ 2019-04-15 16:25 UTC (permalink / raw)
To: Christian Brauner
Cc: Oleg Nesterov, torvalds, viro, jannh, dhowells, linux-api,
linux-kernel, serge, luto, arnd, ebiederm, keescook, tglx,
mtk.manpages, akpm, cyphar, dancol
In-Reply-To: <20190415135246.d6pvyf3pkt3sbh6t@brauner.io>
On Mon, Apr 15, 2019 at 03:52:48PM +0200, Christian Brauner wrote:
> On Mon, Apr 15, 2019 at 03:24:16PM +0200, Oleg Nesterov wrote:
> > On 04/15, Christian Brauner wrote:
> > >
> > > > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> > > >
> > > > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > > > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > > > return ERR_PTR(-EINVAL);
> > > >
> > > > at the start of copy_process() ?
> > > >
> > > > Then it can do
> > > >
> > > > if (clone_flags & CLONE_PIDFD) {
> > > > retval = pidfd_create(pid, &pidfdf);
> > > > if (retval < 0)
> > > > goto bad_fork_free_pid;
> > > > retval = put_user(retval, parent_tidptr)
> > > > if (retval < 0)
> > > > goto bad_fork_free_pid;
> > > > }
> > >
> > > Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> > > let us return the pid and the pidfd in one go and we can also start
> > > pidfd numbering at 0.
> >
> > Christian, sorry if it was already discussed, but I can't force myself to
> > read all the previous discussions ;)
> >
> > If we forget about CONFIG_PROC_FS, why do we really want to create a file?
> >
> >
> > Suppose we add a global u64 counter incremented by copy_process and reported
> > in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
> > *parent_tidptr. Let's denote this counter as UNIQ_PID.
> >
> > Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
> > can do
> >
> > kill_by_pid_uniq(int pid, u64 uniq_pid)
> > {
> > pidfd = open("/proc/$pid", O_DIRECTORY);
> >
> > status = openat(pidfd, "status");
> > u64 this_uniq_pid = ... read UNIQ_PID from status ...;
> >
> > if (uniq_pid != this_uniq_pid)
> > return;
> >
> > pidfd_send_signal(pidfd);
> > }
> >
> > Why else do we want pidfd?
>
> I think this was thrown around at one point but this is rather
> inelegant imho. It basically makes a process unique by using a
> combination of two identifiers. You end up with a similar concept but
> you make it way less flexible and extensible imho. With pidfds you can
> not care about pids at all if you don't want to. The UNIQ_PID thing
> would require you to always juggle two identifiers.
>
> Your example would also only work if CONFIG_PROC_FS is set (Not sure if
> that's what you meant by "forget about CONFIG_PROC_FS")? Say, you get
> a pid from clone() and your UNIQ_PID thing. Then you still can't
> reliably kill a process because pidfd_send_signal() is not useable since
> you can't get pidfds. And if you go the kill way you end up with the same
> problem. Yes, you could solve this by probably extending syscalls to
> take a UNIQ_PID argument but that seems very inelegant.
>
> The UNIQ_PID implementation would also require being tracked in the
> kernel either in task_struct or struct pid potentially and thus would
> probably add more infrastructure in the kernel. We don't need any of
> that if we simply rely on pidfds.
>
> Most of all, the pidfd concept allows one way more flexibility in
> extending it. For example, Joel is working on a patchset to make pidfds
> pollable so you can get information about a process death by polling
> them. We also want to be able to potentially wait on a process with
> waitid(W_PIDFD) or similar as suggested by Linus in earlier threads. At
> that point you end up in a similar situation as tgkill() where you pass
> a tgid and a pid already to make sure that the pid you pass has the tgid
> as thread-group leader. That is all way simpler with pidfds.
I agree the pidfd file descriptor approach is simpler than dealing with 2
pids and is needed for the poll notification support I posted.
Also in the future it allows for a pidfd to sent over IPC to another
process using binder or unix domain sockets.
thanks,
- Joel
^ permalink raw reply
* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: hpa @ 2019-04-15 15:54 UTC (permalink / raw)
To: Florian Weimer; +Cc: libc-alpha, adhemerval.zanella, linux-api, linuxppc-dev
In-Reply-To: <87r2aaz74f.fsf@oldenburg2.str.redhat.com>
On April 9, 2019 11:50:24 PM PDT, Florian Weimer <fweimer@redhat.com> wrote:
>* hpa:
>
>> PowerPC doesn't need it at all.
>
>Because the definition would be the same as struct termios?
>
>I think we should still define struct termios2 using the struct termios
>definition, and also define TCGETS2 and TCSETS2. This way,
>applications
>can use the struct termios2 type without affecting portability to
>POWER.
>
>Thanks,
>Florian
Agreed. There is just no reason to have a new set of ioctl numbers for PowerPC; just make them aliases.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
^ permalink raw reply
* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: hpa @ 2019-04-15 15:53 UTC (permalink / raw)
To: Florian Weimer, Adhemerval Zanella; +Cc: libc-alpha, linux-api, linuxppc-dev
In-Reply-To: <87lg0fbr1q.fsf@oldenburg2.str.redhat.com>
On April 12, 2019 12:50:41 AM PDT, Florian Weimer <fweimer@redhat.com> wrote:
>* Adhemerval Zanella:
>
>> On 11/04/2019 08:07, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> This allows us to adjust the baud rates to non-standard values
>using termios
>>>> interfaces without to resorting to add new headers and use a
>different API
>>>> (ioctl).
>>>
>>> How much symbol versioning will be required for this change?
>>
>> I think all interfaces that have termios as input for sparc and mips
>> (tcgetattr, tcsetattr, cfmakeraw, cfgetispeed, cfgetospeed,
>cfsetispeed,
>> cfsetospeed, cfsetspeed).
>>
>> Alpha will also need to use termios1 for pre-4.20 kernels.
>
>So only new symbol versions there? Hmm.
>
>>>> As Peter Anvin has indicated, he create a POC [1] with the
>aforementioned
>>>> new interfaces. It has not been rebased against master, more
>specially against
>>>> my termios refactor to simplify the multiple architecture header
>definitions,
>>>> but I intend to use as a base.
>>>
>>> Reference [1] is still missing. 8-(
>>
>> Oops... it is
>https://git.zytor.com/users/hpa/glibc/termbaud.git/log/?h=wip.termbaud
>
>This doesn't really illuminate things. “Drop explicit baud setting
>interfaces in favor of cfenc|decspeed()” removes the new symbol version
>for the cf* functions.
>
>My gut feeling is that it's safer to add new interfaces, based on the
>actual kernel/userspace interface, rather than trying to fix up
>existing
>interfaces with symbol versioning. The main reason is that code
>involving serial interfaces is difficult to test, so it will take years
>until we find the last application broken by the glibc interface bump.
>
>I don't feel strongly about this. This came out of a request for
>enabling TCGETS2 support downstream. If I can't fix this upstream, I
>will just reject that request.
>
>Thanks,
>Florian
New interfaces are only necessary for the handful of architectures that don't have the speed fields *and* to space to put them in.
Using symbol versioning doesn't really help much since the real problem is that struct termios can be passed around in userspace, and the interfaces between user space libraries don't have any versioning. However, my POC code deals with that too by only seeing BOTHER when necessary, so if the structure is extended garbage in the extra fields will be ignored unless new baud rates are in use.
My POC code deals with Alpha as well by falling back to the old interfaces if necessary and possible, otherwise return error.
Exporting termios2 to user space feels a bit odd at this stage as it would only be usable as a fallback on old glibc. Call it kernel_termios2 at least. ioctls using struct termios *must* be changed to kernel_termios anyway!
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Serge E. Hallyn @ 2019-04-15 15:50 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult
Cc: Christian Brauner, torvalds, viro, jannh, dhowells, linux-api,
linux-kernel, serge, luto, arnd, ebiederm, keescook, tglx,
mtk.manpages, akpm, oleg, cyphar, joel, dancol
In-Reply-To: <dc05ffe3-c2ff-8b3e-d181-e0cc620bf91d@metux.net>
On Mon, Apr 15, 2019 at 12:08:09PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 14.04.19 22:14, Christian Brauner wrote:
>
> Hi folks,
>
> > This patchset makes it possible to retrieve pid file descriptors at
> > process creation time by introducing the new flag CLONE_PIDFD to the
> > clone() system call as previously discussed.
>
> Sorry, for highjacking this thread, but I'm curious on what things to
> consider when introducing new CLONE_* flags.
>
> The reason I'm asking is:
>
> I'm working on implementing plan9-like fs namespaces, where unprivileged
> processes can change their own namespace at will. For that, certain
Is there any place where we can see previous discussion about this?
> traditional unix'ish things have to be disabled, most notably suid.
If you have to disable suid anyway, then is there any reason why the
existing ability to do this in a private user namespace, with only
your own uid mapped (which you can do without any privilege) does
not suffice? That was actually one of the main design goals of user
namespaces, to be able to clone(CLONE_NEWUSER), map your current uid,
then clone(CLONE_NEWNS) and bind mount at will.
> As forbidding suid can be helpful in other scenarios, too, I thought
> about making this its own feature. Doing that switch on clone() seems
> a nice place for that, IMHO.
>
> As there might be potentially even more CLONE_* flags in the future,
> and the bitmask size is limited, this raises the question on how to
> proceed with those flag additions in the future.
>
> What's your thoughts on that ?
>
>
> --mtx
>
> --
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> info@metux.net -- +49-151-27565287
^ permalink raw reply
* Re: [PATCH v8 02/16] sched/core: Add bucket local max tracking
From: Patrick Bellasi @ 2019-04-15 14:51 UTC (permalink / raw)
To: linux-kernel, linux-pm, linux-api
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190402104153.25404-3-patrick.bellasi@arm.com>
On 02-Apr 11:41, Patrick Bellasi wrote:
> Because of bucketization, different task-specific clamp values are
> tracked in the same bucket. For example, with 20% bucket size and
> assuming to have:
> Task1: util_min=25%
> Task2: util_min=35%
> both tasks will be refcounted in the [20..39]% bucket and always boosted
> only up to 20% thus implementing a simple floor aggregation normally
> used in histograms.
>
> In systems with only few and well-defined clamp values, it would be
> useful to track the exact clamp value required by a task whenever
> possible. For example, if a system requires only 23% and 47% boost
> values then it's possible to track the exact boost required by each
> task using only 3 buckets of ~33% size each.
>
> Introduce a mechanism to max aggregate the requested clamp values of
> RUNNABLE tasks in the same bucket. Keep it simple by resetting the
> bucket value to its base value only when a bucket becomes inactive.
> Allow a limited and controlled overboosting margin for tasks recounted
> in the same bucket.
>
> In systems where the boost values are not known in advance, it is still
> possible to control the maximum acceptable overboosting margin by tuning
> the number of clamp groups. For example, 20 groups ensure a 5% maximum
> overboost.
>
> Remove the rq bucket initialization code since a correct bucket value
> is now computed when a task is refcounted into a CPU's rq.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> --
> Changes in v8:
> Message-ID: <20190313193916.GQ2482@worktop.programming.kicks-ass.net>
> - split this code out from the previous patch
> ---
> kernel/sched/core.c | 46 ++++++++++++++++++++++++++-------------------
> 1 file changed, 27 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 032211b72110..6e1beae5f348 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -778,6 +778,11 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> * When a task is enqueued on a rq, the clamp bucket currently defined by the
> * task's uclamp::bucket_id is refcounted on that rq. This also immediately
> * updates the rq's clamp value if required.
> + *
> + * Tasks can have a task-specific value requested from user-space, track
> + * within each bucket the maximum value for tasks refcounted in it.
> + * This "local max aggregation" allows to track the exact "requested" value
> + * for each bucket when all its RUNNABLE tasks require the same clamp.
> */
> static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> unsigned int clamp_id)
> @@ -789,8 +794,15 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> bucket = &uc_rq->bucket[uc_se->bucket_id];
> bucket->tasks++;
>
> + /*
> + * Local max aggregation: rq buckets always track the max
> + * "requested" clamp value of its RUNNABLE tasks.
> + */
> + if (uc_se->value > bucket->value)
> + bucket->value = uc_se->value;
> +
> if (uc_se->value > READ_ONCE(uc_rq->value))
> - WRITE_ONCE(uc_rq->value, bucket->value);
> + WRITE_ONCE(uc_rq->value, uc_se->value);
> }
>
> /*
> @@ -815,6 +827,12 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> if (likely(bucket->tasks))
> bucket->tasks--;
>
> + /*
> + * Keep "local max aggregation" simple and accept to (possibly)
> + * overboost some RUNNABLE tasks in the same bucket.
> + * The rq clamp bucket value is reset to its base value whenever
> + * there are no more RUNNABLE tasks refcounting it.
> + */
> if (likely(bucket->tasks))
> return;
>
> @@ -824,8 +842,14 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> * e.g. due to future modification, warn and fixup the expected value.
> */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> - if (bucket->value >= rq_clamp)
> + if (bucket->value >= rq_clamp) {
> + /*
> + * Reset clamp bucket value to its nominal value whenever
> + * there are anymore RUNNABLE tasks refcounting it.
> + */
> + bucket->value = uclamp_bucket_base_value(bucket->value);
While running tests on Android, I found that the snipped above should
be better done in uclamp_rq_inc_id() for two main reasons:
1. because of the early return in this function, we skip the reset in
case the task is not the last one running on a CPU, thus
triggering the above SCHED_WARN_ON
2. since a non active bucket is already ignored, we don't care about
resetting its "local max value"
Will move the "max local update" in uclamp_rq_inc_id() where something
like:
/*
* Local max aggregation: rq buckets always track the max
* "requested" clamp value of its RUNNABLE tasks.
*/
if (bucket->tasks == 1 || uc_se->value > bucket->value)
bucket->value = uc_se->value;
Should grant to always have an updated local max every time a bucket
is active.
> WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> + }
> }
>
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> @@ -855,25 +879,9 @@ static void __init init_uclamp(void)
> unsigned int clamp_id;
> int cpu;
>
> - for_each_possible_cpu(cpu) {
> - struct uclamp_bucket *bucket;
> - struct uclamp_rq *uc_rq;
> - unsigned int bucket_id;
> -
> + for_each_possible_cpu(cpu)
> memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
>
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> - uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
> -
> - bucket_id = 1;
> - while (bucket_id < UCLAMP_BUCKETS) {
> - bucket = &uc_rq->bucket[bucket_id];
> - bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
> - ++bucket_id;
> - }
> - }
> - }
> -
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
>
> --
> 2.20.1
>
--
#include <best/regards.h>
Patrick Bellasi
^ permalink raw reply
* Re: [PATCH 2/4] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-15 13:52 UTC (permalink / raw)
To: Oleg Nesterov
Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
joel, dancol
In-Reply-To: <20190415132416.GB22204@redhat.com>
On Mon, Apr 15, 2019 at 03:24:16PM +0200, Oleg Nesterov wrote:
> On 04/15, Christian Brauner wrote:
> >
> > > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> > >
> > > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > > return ERR_PTR(-EINVAL);
> > >
> > > at the start of copy_process() ?
> > >
> > > Then it can do
> > >
> > > if (clone_flags & CLONE_PIDFD) {
> > > retval = pidfd_create(pid, &pidfdf);
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > retval = put_user(retval, parent_tidptr)
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > }
> >
> > Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> > let us return the pid and the pidfd in one go and we can also start
> > pidfd numbering at 0.
>
> Christian, sorry if it was already discussed, but I can't force myself to
> read all the previous discussions ;)
>
> If we forget about CONFIG_PROC_FS, why do we really want to create a file?
>
>
> Suppose we add a global u64 counter incremented by copy_process and reported
> in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
> *parent_tidptr. Let's denote this counter as UNIQ_PID.
>
> Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
> can do
>
> kill_by_pid_uniq(int pid, u64 uniq_pid)
> {
> pidfd = open("/proc/$pid", O_DIRECTORY);
>
> status = openat(pidfd, "status");
> u64 this_uniq_pid = ... read UNIQ_PID from status ...;
>
> if (uniq_pid != this_uniq_pid)
> return;
>
> pidfd_send_signal(pidfd);
> }
>
> Why else do we want pidfd?
I think this was thrown around at one point but this is rather
inelegant imho. It basically makes a process unique by using a
combination of two identifiers. You end up with a similar concept but
you make it way less flexible and extensible imho. With pidfds you can
not care about pids at all if you don't want to. The UNIQ_PID thing
would require you to always juggle two identifiers.
Your example would also only work if CONFIG_PROC_FS is set (Not sure if
that's what you meant by "forget about CONFIG_PROC_FS")? Say, you get
a pid from clone() and your UNIQ_PID thing. Then you still can't
reliably kill a process because pidfd_send_signal() is not useable since
you can't get pidfds. And if you go the kill way you end up with the same
problem. Yes, you could solve this by probably extending syscalls to
take a UNIQ_PID argument but that seems very inelegant.
The UNIQ_PID implementation would also require being tracked in the
kernel either in task_struct or struct pid potentially and thus would
probably add more infrastructure in the kernel. We don't need any of
that if we simply rely on pidfds.
Most of all, the pidfd concept allows one way more flexibility in
extending it. For example, Joel is working on a patchset to make pidfds
pollable so you can get information about a process death by polling
them. We also want to be able to potentially wait on a process with
waitid(W_PIDFD) or similar as suggested by Linus in earlier threads. At
that point you end up in a similar situation as tgkill() where you pass
a tgid and a pid already to make sure that the pid you pass has the tgid
as thread-group leader. That is all way simpler with pidfds.
^ permalink raw reply
* Re: [PATCH 2/4] clone: add CLONE_PIDFD
From: Oleg Nesterov @ 2019-04-15 13:24 UTC (permalink / raw)
To: Christian Brauner
Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
joel, dancol
In-Reply-To: <20190415114204.ydczeuwmi74wfsuv@brauner.io>
On 04/15, Christian Brauner wrote:
>
> > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> >
> > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > return ERR_PTR(-EINVAL);
> >
> > at the start of copy_process() ?
> >
> > Then it can do
> >
> > if (clone_flags & CLONE_PIDFD) {
> > retval = pidfd_create(pid, &pidfdf);
> > if (retval < 0)
> > goto bad_fork_free_pid;
> > retval = put_user(retval, parent_tidptr)
> > if (retval < 0)
> > goto bad_fork_free_pid;
> > }
>
> Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> let us return the pid and the pidfd in one go and we can also start
> pidfd numbering at 0.
Christian, sorry if it was already discussed, but I can't force myself to
read all the previous discussions ;)
If we forget about CONFIG_PROC_FS, why do we really want to create a file?
Suppose we add a global u64 counter incremented by copy_process and reported
in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
*parent_tidptr. Let's denote this counter as UNIQ_PID.
Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
can do
kill_by_pid_uniq(int pid, u64 uniq_pid)
{
pidfd = open("/proc/$pid", O_DIRECTORY);
status = openat(pidfd, "status");
u64 this_uniq_pid = ... read UNIQ_PID from status ...;
if (uniq_pid != this_uniq_pid)
return;
pidfd_send_signal(pidfd);
}
Why else do we want pidfd?
Oleg.
^ permalink raw reply
* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: Florian Weimer @ 2019-04-15 12:28 UTC (permalink / raw)
To: Adhemerval Zanella; +Cc: libc-alpha, hpa, linux-api, linuxppc-dev
In-Reply-To: <87lg0fbr1q.fsf@oldenburg2.str.redhat.com>
* Florian Weimer:
> My gut feeling is that it's safer to add new interfaces, based on the
> actual kernel/userspace interface, rather than trying to fix up existing
> interfaces with symbol versioning. The main reason is that code
> involving serial interfaces is difficult to test, so it will take years
> until we find the last application broken by the glibc interface bump.
>
> I don't feel strongly about this. This came out of a request for
> enabling TCGETS2 support downstream. If I can't fix this upstream, I
> will just reject that request.
Any further comments on this matter?
Thanks,
Florian
^ permalink raw reply
* Re: [PATCH 2/4] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-15 11:42 UTC (permalink / raw)
To: Oleg Nesterov
Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
joel, dancol
In-Reply-To: <20190415105209.GA22204@redhat.com>
On Mon, Apr 15, 2019 at 12:52:09PM +0200, Oleg Nesterov wrote:
> On 04/14, Christian Brauner wrote:
> >
> > @@ -2260,6 +2363,10 @@ long _do_fork(unsigned long clone_flags,
> > }
> >
> > put_pid(pid);
> > +
> > + if (clone_flags & CLONE_PIDFD)
> > + nr = pidfd;
> > +
>
> Well, this doesn't look nice ...
>
> CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
>
> if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> (CLONE_PIDFD|CLONE_PARENT_SETTID))
> return ERR_PTR(-EINVAL);
>
> at the start of copy_process() ?
>
> Then it can do
>
> if (clone_flags & CLONE_PIDFD) {
> retval = pidfd_create(pid, &pidfdf);
> if (retval < 0)
> goto bad_fork_free_pid;
> retval = put_user(retval, parent_tidptr)
> if (retval < 0)
> goto bad_fork_free_pid;
> }
Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
let us return the pid and the pidfd in one go and we can also start
pidfd numbering at 0.
^ permalink raw reply
* Re: bug#35275: spelling mistake manual pages
From: Michael Kerrisk (man-pages) @ 2019-04-15 11:08 UTC (permalink / raw)
To: Paul Eggert
Cc: Vikas Talan, 35275-done, majordomo, lkml, linux-kernel-announce,
Linux API, netdev, libc-alpha-subscribe, libc-help-subscribe,
manpages, listar, linux-man
In-Reply-To: <8970d433-8c7e-23d8-7d04-b9344adbf3ad@cs.ucla.edu>
But also, the report is simply wrong.
On Mon, 15 Apr 2019 at 07:34, Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> That report appears to be about the glibc manual. Please report glibc bugs as
> described here:
>
> https://sourceware.org/glibc/wiki/FilingBugs
>
> Also, please use plain text rather than images to describe bugs in a text file.
^ permalink raw reply
* Re: [PATCH 2/4] clone: add CLONE_PIDFD
From: Oleg Nesterov @ 2019-04-15 10:52 UTC (permalink / raw)
To: Christian Brauner
Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
joel, dancol
In-Reply-To: <20190414201436.19502-3-christian@brauner.io>
On 04/14, Christian Brauner wrote:
>
> @@ -2260,6 +2363,10 @@ long _do_fork(unsigned long clone_flags,
> }
>
> put_pid(pid);
> +
> + if (clone_flags & CLONE_PIDFD)
> + nr = pidfd;
> +
Well, this doesn't look nice ...
CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
(CLONE_PIDFD|CLONE_PARENT_SETTID))
return ERR_PTR(-EINVAL);
at the start of copy_process() ?
Then it can do
if (clone_flags & CLONE_PIDFD) {
retval = pidfd_create(pid, &pidfdf);
if (retval < 0)
goto bad_fork_free_pid;
retval = put_user(retval, parent_tidptr)
if (retval < 0)
goto bad_fork_free_pid;
}
Oleg.
^ permalink raw reply
* Re: [PATCH 0/4] clone: add CLONE_PIDFD
From: Enrico Weigelt, metux IT consult @ 2019-04-15 10:16 UTC (permalink / raw)
To: Christian Brauner, torvalds, viro, jannh, dhowells, linux-api,
linux-kernel
Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
oleg, cyphar, joel, dancol
In-Reply-To: <20190414201436.19502-1-christian@brauner.io>
On 14.04.19 22:14, Christian Brauner wrote:
Hi,
> When clone is called with CLONE_PIDFD a pidfd instead of a pid will be
> returned.
Uh, changing the return type by some flags produces a very strange
feeling in my stomach. Isn't there any other way for fetching a pidfs
for some pid ?
> To make it possible for users of CLONE_PIDFD to apply standard
> error checking that is common all across userspace, file descriptor
> numbering for pidfds starts at 1 and not 0.
Feels even more strange. I'm used to the kernel always picking the
lowest available pid. In some cases, one really wants to have fd 0.
Even though the actual usecases for doing that w/ a pidfd might be
pretty limited, I don't feel good w/ breaking this old habit.
--mtx
--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
^ permalink raw reply
* RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Enrico Weigelt, metux IT consult @ 2019-04-15 10:08 UTC (permalink / raw)
To: Christian Brauner, torvalds, viro, jannh, dhowells, linux-api,
linux-kernel
Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
oleg, cyphar, joel, dancol
In-Reply-To: <20190414201436.19502-1-christian@brauner.io>
On 14.04.19 22:14, Christian Brauner wrote:
Hi folks,
> This patchset makes it possible to retrieve pid file descriptors at
> process creation time by introducing the new flag CLONE_PIDFD to the
> clone() system call as previously discussed.
Sorry, for highjacking this thread, but I'm curious on what things to
consider when introducing new CLONE_* flags.
The reason I'm asking is:
I'm working on implementing plan9-like fs namespaces, where unprivileged
processes can change their own namespace at will. For that, certain
traditional unix'ish things have to be disabled, most notably suid.
As forbidding suid can be helpful in other scenarios, too, I thought
about making this its own feature. Doing that switch on clone() seems
a nice place for that, IMHO.
As there might be potentially even more CLONE_* flags in the future,
and the bitmask size is limited, this raises the question on how to
proceed with those flag additions in the future.
What's your thoughts on that ?
--mtx
--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
^ permalink raw reply
* Re: [PATCH] fanotify: Make wait for permission events interruptible
From: Jan Kara @ 2019-04-15 9:59 UTC (permalink / raw)
To: linux-fsdevel
Cc: Orion Poplawski, Vivek Trivedi, Amir Goldstein, linux-api, LKML,
Eric W. Biederman, Jan Kara
In-Reply-To: <20190321151142.17104-1-jack@suse.cz>
On Thu 21-03-19 16:11:42, Jan Kara wrote:
> Switch waiting for response to fanotify permission events interruptible.
> This allows e.g. the system to be suspended while there are some
> fanotify permission events pending (which is reportedly pretty common
> when for example AV solution is in use). However just making the wait
> interruptible can result in e.g. open(2) returning -EINTR where
> previously such error code never happened in practice. To avoid
> confusion of userspace due to this error code, return -ERESTARTNOINTR
> instead.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/notify/fanotify/fanotify.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> Orion, can you give this patch some testing with your usecase? Also if anybody
> sees any issue with returning -ERESTARTNOINTR I have missed, please speak up.
Ping Orion? Did you have any chance to give this patch a try? Does it fix
hibernation issues you observe without causing issues with bash and other
programs? I'd like to queue this patch for the coming merge window but
I'd like to see some testing results showing that it actually helps
anything... Thanks!
Honza
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 6b9c27548997..eb790853844b 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -92,10 +92,17 @@ static int fanotify_get_response(struct fsnotify_group *group,
>
> pr_debug("%s: group=%p event=%p\n", __func__, group, event);
>
> - ret = wait_event_killable(group->fanotify_data.access_waitq,
> - event->state == FAN_EVENT_ANSWERED);
> + ret = wait_event_interruptible(group->fanotify_data.access_waitq,
> + event->state == FAN_EVENT_ANSWERED);
> /* Signal pending? */
> if (ret < 0) {
> + /*
> + * Force restarting a syscall so that this is mostly invisible
> + * for userspace which is not prepared for handling EINTR e.g.
> + * from open(2).
> + */
> + if (ret == -ERESTARTSYS)
> + ret = -ERESTARTNOINTR;
> spin_lock(&group->notification_lock);
> /* Event reported to userspace and no answer yet? */
> if (event->state == FAN_EVENT_REPORTED) {
> --
> 2.16.4
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox