Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [RFC] Add critical process prctl
From: Andy Lutomirski @ 2019-09-10 16:56 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: Tim Murray, Suren Baghdasaryan, LKML, Linux API
In-Reply-To: <20190905005313.126823-1-dancol@google.com>

On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
>
> A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> meaning that if the task ever exits, the kernel panics. This facility
> is intended for use by low-level core system processes that cannot
> gracefully restart without a reboot. This prctl allows these processes
> to ensure that the system restarts when they die regardless of whether
> the rest of userspace is operational.

The kind of panic produced by init crashing is awful -- logs don't get
written, etc.  I'm wondering if you would be better off with a new
watchdog-like device that, when closed, kills the system in a
configurable way (e.g. after a certain amount of time, while still
logging something and having a decent chance of getting the logs
written out.)  This could plausibly even be an extension to the
existing /dev/watchdog API.

--Andy

^ permalink raw reply

* Re: [RFC] Add critical process prctl
From: Daniel Colascione @ 2019-09-10 17:42 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Tim Murray, Suren Baghdasaryan, LKML, Linux API
In-Reply-To: <CALCETrU2Wycgdfo8vLZQUnx1J9ro=6ddSkP37BhsfBkKL1mbMA@mail.gmail.com>

On Tue, Sep 10, 2019 at 9:57 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> > meaning that if the task ever exits, the kernel panics. This facility
> > is intended for use by low-level core system processes that cannot
> > gracefully restart without a reboot. This prctl allows these processes
> > to ensure that the system restarts when they die regardless of whether
> > the rest of userspace is operational.
>
> The kind of panic produced by init crashing is awful -- logs don't get
> written, etc.

True today --- but that's a separate problem, and one that can be
solved in a few ways, e.g., pre-registering log buffers to be
incorporated into any kexec kernel memory dumps. If a system aiming
for reliability can't diagnose panics, that's a problem with or
without my patch.

> I'm wondering if you would be better off with a new
> watchdog-like device that, when closed, kills the system in a
> configurable way (e.g. after a certain amount of time, while still
> logging something and having a decent chance of getting the logs
> written out.)  This could plausibly even be an extension to the
> existing /dev/watchdog API.

There are lots of approaches that work today: a few people have
suggested just having init watch processes, perhaps with pidfds. What
I worry about is increasing the length (both in terms of time and
complexity) of the critical path between something going wrong in a
critical process and the system getting back into a known-good state.
A panic at the earliest moment we know that a marked-critical process
has become doomed seems like the most reliable approach, especially
since alternatives can get backed up behind things like file
descriptor closing and various forms of scheduling delay.

^ permalink raw reply

* Re: [RFC] Add critical process prctl
From: Andy Lutomirski @ 2019-09-10 18:15 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Andy Lutomirski, Tim Murray, Suren Baghdasaryan, LKML, Linux API
In-Reply-To: <CAKOZuevMiomDQwzrHVb4qU6nhKOiENWsEmFhVKrBvjVNa0ff+w@mail.gmail.com>

On Tue, Sep 10, 2019 at 10:43 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Tue, Sep 10, 2019 at 9:57 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
> > >
> > > A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> > > meaning that if the task ever exits, the kernel panics. This facility
> > > is intended for use by low-level core system processes that cannot
> > > gracefully restart without a reboot. This prctl allows these processes
> > > to ensure that the system restarts when they die regardless of whether
> > > the rest of userspace is operational.
> >
> > The kind of panic produced by init crashing is awful -- logs don't get
> > written, etc.
>
> True today --- but that's a separate problem, and one that can be
> solved in a few ways, e.g., pre-registering log buffers to be
> incorporated into any kexec kernel memory dumps. If a system aiming
> for reliability can't diagnose panics, that's a problem with or
> without my patch.

It's been a problem for years and years and no one has convincingly
fixed it.  But the particular type of failure you're handling is
unlike most panics: no locks are held, nothing is corrupt, and the
kernel is generally functional.

>
> > I'm wondering if you would be better off with a new
> > watchdog-like device that, when closed, kills the system in a
> > configurable way (e.g. after a certain amount of time, while still
> > logging something and having a decent chance of getting the logs
> > written out.)  This could plausibly even be an extension to the
> > existing /dev/watchdog API.
>
> There are lots of approaches that work today: a few people have
> suggested just having init watch processes, perhaps with pidfds. What
> I worry about is increasing the length (both in terms of time and
> complexity) of the critical path between something going wrong in a
> critical process and the system getting back into a known-good state.
> A panic at the earliest moment we know that a marked-critical process
> has become doomed seems like the most reliable approach, especially
> since alternatives can get backed up behind things like file
> descriptor closing and various forms of scheduling delay.

I think this all depends on exactly what types of failures you care
about.  If the kernel is dead (actually crashed, deadlocked, or merely
livelocked or otherwise failing to make progress) then you have no
particular guarantee that your critical task will make it to do_exit()
in the first place.  Otherwise, I see no real reason that you should
panic immediately in do_exit() rather than waiting the tiny amount of
time it would take for a watchdog driver to notice that the descriptor
was closed.

So, if I were designing this, I think I would want to use a watchdog.
Program it to die immediately if the descriptor is closed and also
program it to die if the descriptor isn't pinged periodically.  The
latter catches the case where the system is failing to make progress.
And "die" can mean "notify a logging daemon and give it five seconds
to do it's thing and declare it's done; panic or reboot after five
seconds if it doesn't declare that it's done."

--Andy

^ permalink raw reply

* Re: [PATCH v6 1/3] Documentation: fpga: dfl: add descriptions for thermal/power management interfaces
From: Guenter Roeck @ 2019-09-10 19:29 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-hwmon, jdelvare,
	gregkh, Xu Yilun
In-Reply-To: <1568094640-4920-2-git-send-email-hao.wu@intel.com>

On Tue, Sep 10, 2019 at 01:50:38PM +0800, Wu Hao wrote:
> From: Xu Yilun <yilun.xu@intel.com>
> 
> This patch add introductions to thermal/power interfaces. They are
> implemented as hwmon sysfs interfaces by thermal/power private
> feature drivers.
> 
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>

Acked-by: Guenter Roeck <linux@roeck-us.net>

> ---
>  Documentation/fpga/dfl.rst | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
> index 6fa483f..094fc8a 100644
> --- a/Documentation/fpga/dfl.rst
> +++ b/Documentation/fpga/dfl.rst
> @@ -108,6 +108,16 @@ More functions are exposed through sysfs
>       error reporting sysfs interfaces allow user to read errors detected by the
>       hardware, and clear the logged errors.
>  
> + Power management (dfl_fme_power hwmon)
> +     power management hwmon sysfs interfaces allow user to read power management
> +     information (power consumption, thresholds, threshold status, limits, etc.)
> +     and configure power thresholds for different throttling levels.
> +
> + Thermal management (dfl_fme_thermal hwmon)
> +     thermal management hwmon sysfs interfaces allow user to read thermal
> +     management information (current temperature, thresholds, threshold status,
> +     etc.).
> +
>  
>  FIU - PORT
>  ==========
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* [RFC PATCH 1/4] rseq: Fix: Reject unknown flags on rseq unregister
From: Mathieu Desnoyers @ 2019-09-11  0:27 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Paul Turner
  Cc: linux-kernel, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	H . Peter Anvin, linux-api

It is preferrable to reject unknown flags within rseq unregistration
rather than to ignore them. It is an oversight caused by the fact that
the check for unknown flags is after the rseq unregister flag check.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Paul Turner <pjt@google.com>
Cc: linux-api@vger.kernel.org
---
 kernel/rseq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 27c48eb7de40..a4f86a9d6937 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -310,6 +310,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	int ret;
 
 	if (flags & RSEQ_FLAG_UNREGISTER) {
+		if (flags & ~RSEQ_FLAG_UNREGISTER)
+			return -EINVAL;
 		/* Unregister rseq for current thread. */
 		if (current->rseq != rseq || !current->rseq)
 			return -EINVAL;
-- 
2.17.1

^ permalink raw reply related

* [RFC PATCH 2/4] rseq: Fix: Unregister rseq for CLONE_TLS
From: Mathieu Desnoyers @ 2019-09-11  0:27 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Paul Turner
  Cc: linux-kernel, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	H . Peter Anvin, Dmitry Vyukov, linux-api
In-Reply-To: <20190911002744.8690-1-mathieu.desnoyers@efficios.com>

It has been reported by Google that rseq is not behaving properly
with respect to clone when CLONE_VM is used without CLONE_THREAD.
It keeps the prior thread's rseq TLS registered when the TLS of the
thread has moved, so the kernel deals with the wrong TLS.

The approach of clearing the per task-struct rseq registration
on clone with CLONE_THREAD flag is incomplete. It does not cover
the use-case of clone with CLONE_VM set, but without CLONE_THREAD.

Looking more closely at each of the clone flags:

- CLONE_THREAD,
- CLONE_VM,
- CLONE_SETTLS.

It appears that the flag we really want to track is CLONE_SETTLS, which
moves the location of the TLS for the child, which makes the rseq
registration point to the wrong TLS.

Suggested-by: "H . Peter Anvin" <hpa@zytor.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Paul Turner <pjt@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: linux-api@vger.kernel.org
---
 include/linux/sched.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..deb4154dbf11 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1919,11 +1919,11 @@ static inline void rseq_migrate(struct task_struct *t)
 
 /*
  * If parent process has a registered restartable sequences area, the
- * child inherits. Only applies when forking a process, not a thread.
+ * child inherits. Unregister rseq for a clone with CLONE_TLS set.
  */
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 {
-	if (clone_flags & CLONE_THREAD) {
+	if (clone_flags & CLONE_TLS) {
 		t->rseq = NULL;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
-- 
2.17.1

^ permalink raw reply related

* [RFC PATCH 3/4] rseq: Introduce unreg_clone_flags
From: Mathieu Desnoyers @ 2019-09-11  0:27 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Paul Turner
  Cc: linux-kernel, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	H . Peter Anvin, Dmitry Vyukov, linux-api
In-Reply-To: <20190911002744.8690-1-mathieu.desnoyers@efficios.com>

Considering that some custom libc could possibly choose not to use
CLONE_SETTLS, we should allow the libc to override the choice of clone
flags meant to unregister rseq. This is a policy decision which should
not be made by the kernel.

Therefore, introduce a new RSEQ_FLAG_UNREG_CLONE_FLAGS, which makes the
rseq system call expect an additional 5th argument: a mask of all the
clone flags which may each ensure rseq is unregistered upon clone.

So even if CLONE_SETTLS is eventually replaced by some other flag in the
future, the libc will be able to adapt and pass this new flag upon rseq
registration as well.

The default when RSEQ_FLAG_UNREG_CLONE_FLAGS is unset is to unregister
rseq on clone with CLONE_SETTLS.

Suggested-by: "H . Peter Anvin" <hpa@zytor.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Paul Turner <pjt@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: linux-api@vger.kernel.org
---
 include/linux/sched.h     |  9 +++++++--
 include/linux/syscalls.h  |  2 +-
 include/uapi/linux/rseq.h |  1 +
 kernel/rseq.c             | 14 +++++++++++---
 4 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index deb4154dbf11..c8faa6f8493d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1138,6 +1138,7 @@ struct task_struct {
 	 * with respect to preemption.
 	 */
 	unsigned long rseq_event_mask;
+	int rseq_unreg_clone_flags;
 #endif
 
 	struct tlbflush_unmap_batch	tlb_ubc;
@@ -1919,18 +1920,21 @@ static inline void rseq_migrate(struct task_struct *t)
 
 /*
  * If parent process has a registered restartable sequences area, the
- * child inherits. Unregister rseq for a clone with CLONE_TLS set.
+ * child inherits, except if it has been required to be explicitly
+ * unregistered when any of the rseq_unreg_clone_flags are passed to clone.
  */
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 {
-	if (clone_flags & CLONE_TLS) {
+	if (clone_flags & t->rseq_unreg_clone_flags) {
 		t->rseq = NULL;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
+		t->rseq_unreg_clone_flags = 0;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_event_mask = current->rseq_event_mask;
+		t->rseq_unreg_clone_flags = current->rseq_unreg_clone_flags;
 	}
 }
 
@@ -1939,6 +1943,7 @@ static inline void rseq_execve(struct task_struct *t)
 	t->rseq = NULL;
 	t->rseq_sig = 0;
 	t->rseq_event_mask = 0;
+	t->rseq_unreg_clone_flags = 0;
 }
 
 #else
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 88145da7d140..6a242cfcc360 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -987,7 +987,7 @@ asmlinkage long sys_pkey_free(int pkey);
 asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
-			 int flags, uint32_t sig);
+			 int flags, uint32_t sig, int unreg_clone_flags);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 9a402fdb60e9..d71e3c6b7fdb 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -20,6 +20,7 @@ enum rseq_cpu_id_state {
 
 enum rseq_flags {
 	RSEQ_FLAG_UNREGISTER = (1 << 0),
+	RSEQ_FLAG_UNREG_CLONE_FLAGS = (1 << 1),
 };
 
 enum rseq_cs_flags_bit {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index a4f86a9d6937..c59b8d3dc275 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -304,8 +304,8 @@ void rseq_syscall(struct pt_regs *regs)
 /*
  * sys_rseq - setup restartable sequences for caller thread.
  */
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
-		int, flags, u32, sig)
+SYSCALL_DEFINE5(rseq, struct rseq __user *, rseq, u32, rseq_len,
+		int, flags, u32, sig, int, unreg_clone_flags)
 {
 	int ret;
 
@@ -324,12 +324,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 			return ret;
 		current->rseq = NULL;
 		current->rseq_sig = 0;
+		current->rseq_unreg_clone_flags = 0;
 		return 0;
 	}
 
-	if (unlikely(flags))
+	if (unlikely(flags & ~RSEQ_FLAG_UNREG_CLONE_FLAGS))
 		return -EINVAL;
 
+	if (!(flags & RSEQ_FLAG_UNREG_CLONE_FLAGS))
+		unreg_clone_flags = CLONE_SETTLS;
+
 	if (current->rseq) {
 		/*
 		 * If rseq is already registered, check whether
@@ -338,6 +342,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		 */
 		if (current->rseq != rseq || rseq_len != sizeof(*rseq))
 			return -EINVAL;
+		if ((flags & RSEQ_FLAG_UNREG_CLONE_FLAGS) &&
+		    current->rseq_unreg_clone_flags != unreg_clone_flags)
+			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
 		/* Already registered. */
@@ -355,6 +362,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		return -EFAULT;
 	current->rseq = rseq;
 	current->rseq_sig = sig;
+	current->rseq_unreg_clone_flags = unreg_clone_flags;
 	/*
 	 * If rseq was previously inactive, and has just been
 	 * registered, ensure the cpu_id_start and cpu_id fields
-- 
2.17.1

^ permalink raw reply related

* Re: [RFC PATCH 2/4] rseq: Fix: Unregister rseq for CLONE_TLS
From: Mathieu Desnoyers @ 2019-09-11  0:31 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Paul Turner
  Cc: linux-kernel, paulmck, Boqun Feng, H. Peter Anvin, Dmitry Vyukov,
	linux-api
In-Reply-To: <20190911002744.8690-2-mathieu.desnoyers@efficios.com>

Of course, this patch title should read:

  rseq: Fix: Unregister rseq for CLONE_SETTLS

----- On Sep 11, 2019, at 1:27 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

 
> /*
>  * If parent process has a registered restartable sequences area, the
> - * child inherits. Only applies when forking a process, not a thread.
> + * child inherits. Unregister rseq for a clone with CLONE_TLS set.

and here CLONE_SETTLS as well.

>  */
> static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
> {
> -	if (clone_flags & CLONE_THREAD) {
> +	if (clone_flags & CLONE_TLS) {

.. and here.

Thanks,

Mathieu

> 		t->rseq = NULL;
> 		t->rseq_sig = 0;
> 		t->rseq_event_mask = 0;
> --
> 2.17.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Aleksa Sarai @ 2019-09-11 10:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-ia64, linux-sh, Alexander Shishkin, Rasmus Villemoes,
	Alexei Starovoitov, linux-kernel, David Howells, linux-kselftest,
	sparclinux, Jiri Olsa, linux-arch, linux-s390, Tycho Andersen,
	Aleksa Sarai, Shuah Khan, Ingo Molnar, linux-arm-kernel,
	linux-mips, linux-xtensa, Kees Cook, Arnd Bergmann, Jann Horn,
	linuxppc-dev, linux-m68k, Al Viro, Andy Lutomirski, Shuah Khan
In-Reply-To: <20190905105749.GW2386@hirez.programming.kicks-ass.net>


[-- Attachment #1.1: Type: text/plain, Size: 3879 bytes --]

On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 05, 2019 at 11:43:05AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > > +/**
> > > > > + * copy_struct_to_user: copy a struct to user space
> > > > > + * @dst:   Destination address, in user space.
> > > > > + * @usize: Size of @dst struct.
> > > > > + * @src:   Source address, in kernel space.
> > > > > + * @ksize: Size of @src struct.
> > > > > + *
> > > > > + * Copies a struct from kernel space to user space, in a way that guarantees
> > > > > + * backwards-compatibility for struct syscall arguments (as long as future
> > > > > + * struct extensions are made such that all new fields are *appended* to the
> > > > > + * old struct, and zeroed-out new fields have the same meaning as the old
> > > > > + * struct).
> > > > > + *
> > > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > > > > + * The recommended usage is something like the following:
> > > > > + *
> > > > > + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > > > + *   {
> > > > > + *      int err;
> > > > > + *      struct foo karg = {};
> > > > > + *
> > > > > + *      // do something with karg
> > > > > + *
> > > > > + *      err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > > + *      if (err)
> > > > > + *        return err;
> > > > > + *
> > > > > + *      // ...
> > > > > + *   }
> > > > > + *
> > > > > + * There are three cases to consider:
> > > > > + *  * If @usize == @ksize, then it's copied verbatim.
> > > > > + *  * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> > > > > + *    older user space. In order to avoid user space getting incomplete
> > > > > + *    information (new fields might be important), all trailing bytes in @src
> > > > > + *    (@ksize - @usize) must be zerored
> > > > 
> > > > s/zerored/zero/, right?
> > > 
> > > It should've been "zeroed".
> > 
> > That reads wrong to me; that way it reads like this function must take
> > that action and zero out the 'rest'; which is just wrong.
> > 
> > This function must verify those bytes are zero, not make them zero.
> > 
> > > > >                                          , otherwise -EFBIG is returned.
> > > > 
> > > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> > > 
> > > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > > struct passed to user-space. I would personally have preferred EMSGSIZE
> > > instead of EFBIG, but felt using the existing error codes would be less
> > > confusing.
> > 
> > Sadly a recent commit:
> > 
> >   1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code")
> > 
> > Made the situation even 'worse'.
> 
> And thinking more about things; I'm not convinced the above patch is
> actually right.
> 
> Do we really want to simply truncate all the attributes of the task?
> 
> And should we not at least set sched_flags when there are non-default
> clamp values applied?
> 
> See; that is I think the primary bug that had chrt failing; we tried to
> publish the default clamp values as !0.

I just saw this patch in -rc8 -- should I even attempt to port
sched_getattr(2) to copy_struct_to_user()? I agree that publishing a
default non-zero value is a mistake -- once you do that, old user space
will either get confused or lose information.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Florian Weimer @ 2019-09-11 18:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Carlos O'Donell, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel, linux-api
In-Reply-To: <20190807142726.2579-2-mathieu.desnoyers@efficios.com>

* Mathieu Desnoyers:

> +#ifdef SHARED
> +  if (rtld_active ())
> +    {
> +      /* Register rseq ABI to the kernel.   */
> +      (void) rseq_register_current_thread ();
> +    }
> +#else

I think this will need *another* check for the inner libc in an audit
module.  See what we do in malloc.  __libc_multiple_libcs is supposed to
cover that, but it's unfortunately not reliable.

I believe without that additional check, the first audit module we load
will claim rseq, and the main program will not have control over the
rseq area.  Reversed roles would be desirable here.

In October, I hope to fix __libc_multiple_libcs, and then you can use
just that.  (We have a Fedora patch that is supposed to fix it, I need
to documen the mechanism and upstream it.)

> +/* Advertise Restartable Sequences registration ownership across
> +   application and shared libraries.
> +
> +   Libraries and applications must check whether this variable is zero or
> +   non-zero if they wish to perform rseq registration on their own. If it
> +   is zero, it means restartable sequence registration is not handled, and
> +   the library or application is free to perform rseq registration. In
> +   that case, the library or application is taking ownership of rseq
> +   registration, and may set __rseq_handled to 1. It may then set it back
> +   to 0 after it completes unregistering rseq.
> +
> +   If __rseq_handled is found to be non-zero, it means that another
> +   library (or the application) is currently handling rseq registration.
> +
> +   Typical use of __rseq_handled is within library constructors and
> +   destructors, or at program startup.  */
> +
> +int __rseq_handled;

Are there any programs that use __rseq_handled *today*?

I'm less convinced that we actually need this.  I don't think we have
ever done anything like that before, and I don't think it's necessary.
Any secondary rseq library just needs to note if it could perform
registration, and if it failed to do so, do not perform unregistration
in a pthread destructor callback.

Sure, there's the matter of pthread destructor ordering, but that
problem is not different from any other singleton (thread-local or not),
and the fix for the last time this has come up (TLS destructors vs
dlclose) was to upgrade glibc.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Carlos O'Donell @ 2019-09-11 19:00 UTC (permalink / raw)
  To: Florian Weimer, Mathieu Desnoyers
  Cc: Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <8736h2sn8y.fsf@oldenburg2.str.redhat.com>

On 9/11/19 2:26 PM, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> +#ifdef SHARED
>> +  if (rtld_active ())
>> +    {
>> +      /* Register rseq ABI to the kernel.   */
>> +      (void) rseq_register_current_thread ();
>> +    }
>> +#else
> 
> I think this will need *another* check for the inner libc in an audit
> module.  See what we do in malloc.  __libc_multiple_libcs is supposed to
> cover that, but it's unfortunately not reliable.
> 
> I believe without that additional check, the first audit module we load
> will claim rseq, and the main program will not have control over the
> rseq area.  Reversed roles would be desirable here.
> 
> In October, I hope to fix __libc_multiple_libcs, and then you can use
> just that.  (We have a Fedora patch that is supposed to fix it, I need
> to documen the mechanism and upstream it.)

This is a technical issue we can resolve.

>> +/* Advertise Restartable Sequences registration ownership across
>> +   application and shared libraries.
>> +
>> +   Libraries and applications must check whether this variable is zero or
>> +   non-zero if they wish to perform rseq registration on their own. If it
>> +   is zero, it means restartable sequence registration is not handled, and
>> +   the library or application is free to perform rseq registration. In
>> +   that case, the library or application is taking ownership of rseq
>> +   registration, and may set __rseq_handled to 1. It may then set it back
>> +   to 0 after it completes unregistering rseq.
>> +
>> +   If __rseq_handled is found to be non-zero, it means that another
>> +   library (or the application) is currently handling rseq registration.
>> +
>> +   Typical use of __rseq_handled is within library constructors and
>> +   destructors, or at program startup.  */
>> +
>> +int __rseq_handled;
> 
> Are there any programs that use __rseq_handled *today*?
> 
> I'm less convinced that we actually need this.  I don't think we have
> ever done anything like that before, and I don't think it's necessary.
> Any secondary rseq library just needs to note if it could perform
> registration, and if it failed to do so, do not perform unregistration
> in a pthread destructor callback.
> 
> Sure, there's the matter of pthread destructor ordering, but that
> problem is not different from any other singleton (thread-local or not),
> and the fix for the last time this has come up (TLS destructors vs
> dlclose) was to upgrade glibc.

This is a braoder issue.

Mathieu,

It would be easier to merge the patch set if it were just an unconditional
registration like we do for set_robust_list().

What's your thought there?

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Florian Weimer @ 2019-09-11 19:08 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Mathieu Desnoyers, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel, linux-api
In-Reply-To: <7db64714-3dc5-b322-1edc-736b08ee7d63@redhat.com>

* Carlos O'Donell:

> It would be easier to merge the patch set if it were just an unconditional
> registration like we do for set_robust_list().

Note that this depends on the in-tree system call numbers list, which I
still need to finish according to Joseph's specifications.

(We have something that should work for us as long as we can get large
machines from the lab, but I agree that it's not very useful if glibc
bot-cycle time is roughly one business day.)

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Carlos O'Donell @ 2019-09-11 19:45 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel, linux-api
In-Reply-To: <87ef0mr6qj.fsf@oldenburg2.str.redhat.com>

On 9/11/19 3:08 PM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> It would be easier to merge the patch set if it were just an unconditional
>> registration like we do for set_robust_list().
> 
> Note that this depends on the in-tree system call numbers list, which I
> still need to finish according to Joseph's specifications.

Which work is this? Do you have a URL reference to WIP?

> (We have something that should work for us as long as we can get large
> machines from the lab, but I agree that it's not very useful if glibc
> bot-cycle time is roughly one business day.)

Yeah, we have to discuss how to accelerate this.

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Florian Weimer @ 2019-09-11 19:54 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Mathieu Desnoyers, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel, linux-api
In-Reply-To: <4a6f6326-ea82-e031-0fe0-7263ed97e797@redhat.com>

* Carlos O'Donell:

> On 9/11/19 3:08 PM, Florian Weimer wrote:
>> * Carlos O'Donell:
>> 
>>> It would be easier to merge the patch set if it were just an unconditional
>>> registration like we do for set_robust_list().
>> 
>> Note that this depends on the in-tree system call numbers list, which I
>> still need to finish according to Joseph's specifications.
>
> Which work is this? Do you have a URL reference to WIP?

  <https://sourceware.org/ml/libc-alpha/2019-05/msg00630.html>
  <https://sourceware.org/ml/libc-alpha/2019-06/msg00015.html>

I think realistically this is needed for the Y2038 work as well if we
want to support building glibc with older kernel headers.  “glibc 2.31
will have Y2038 support and rseq support, but only if it runs on a
current and it happens to have been built against sufficiently recent
kernel headers” is a bit difficult to explain.  The current kernel part
is easy enough to understand, but the impact of the kernel headers on
the feature set has always been tough to explain.  Especially if you
factor in vendor kernels with system call backports.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Florian Weimer @ 2019-09-11 19:58 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Mathieu Desnoyers, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel, linux-api
In-Reply-To: <877e6er4ls.fsf@oldenburg2.str.redhat.com>

* Florian Weimer:

> * Carlos O'Donell:
>
>> On 9/11/19 3:08 PM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>> 
>>>> It would be easier to merge the patch set if it were just an unconditional
>>>> registration like we do for set_robust_list().
>>> 
>>> Note that this depends on the in-tree system call numbers list, which I
>>> still need to finish according to Joseph's specifications.
>>
>> Which work is this? Do you have a URL reference to WIP?
>
>   <https://sourceware.org/ml/libc-alpha/2019-05/msg00630.html>
>   <https://sourceware.org/ml/libc-alpha/2019-06/msg00015.html>

Sorry, there was also this:

  <https://sourceware.org/ml/libc-alpha/2019-05/msg00629.html>

I also posted a build-many-glibcs.py patch at some point with an
automatic table update, but that still had the massive bot-cycle time.

My current line of thinking is to implement some --use-compilers flag,
so that you can build a fresh glibc checkout against an old, pre-built
compilers for the system call tables update, and then use the patched
glibc sources for one (and hopefully final) bot-cycle.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Rich Felker @ 2019-09-11 20:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Carlos O'Donell, Mathieu Desnoyers, Joseph Myers,
	Szabolcs Nagy, libc-alpha, Thomas Gleixner, Ben Maurer,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Will Deacon,
	Dave Watson, Paul Turner, linux-kernel, linux-api
In-Reply-To: <877e6er4ls.fsf@oldenburg2.str.redhat.com>

On Wed, Sep 11, 2019 at 09:54:23PM +0200, Florian Weimer wrote:
> * Carlos O'Donell:
> 
> > On 9/11/19 3:08 PM, Florian Weimer wrote:
> >> * Carlos O'Donell:
> >> 
> >>> It would be easier to merge the patch set if it were just an unconditional
> >>> registration like we do for set_robust_list().
> >> 
> >> Note that this depends on the in-tree system call numbers list, which I
> >> still need to finish according to Joseph's specifications.
> >
> > Which work is this? Do you have a URL reference to WIP?
> 
>   <https://sourceware.org/ml/libc-alpha/2019-05/msg00630.html>
>   <https://sourceware.org/ml/libc-alpha/2019-06/msg00015.html>
> 
> I think realistically this is needed for the Y2038 work as well if we
> want to support building glibc with older kernel headers.  “glibc 2..31
> will have Y2038 support and rseq support, but only if it runs on a
> current and it happens to have been built against sufficiently recent
> kernel headers” is a bit difficult to explain.  The current kernel part
> is easy enough to understand, but the impact of the kernel headers on
> the feature set has always been tough to explain.  Especially if you
> factor in vendor kernels with system call backports.

I'm in favor of in-tree syscall numbers list. If you don't want O(n)
per-arch work, though, you could just define the 'base number' for
each arch and use the fact that all the new syscalls share a common
numbering (i.e. base + constant depending only on syscall). I think
including the list with glibc is more robust though, and would
eliminate the need to check for definitions of older (pre-unification)
syscalls glibc wants to use.

Rich

^ permalink raw reply

* Re: [RFC PATCH 3/4] rseq: Introduce unreg_clone_flags
From: Mathieu Desnoyers @ 2019-09-13 14:26 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Paul Turner
  Cc: linux-kernel, paulmck, Boqun Feng, H. Peter Anvin, Dmitry Vyukov,
	linux-api
In-Reply-To: <20190911002744.8690-3-mathieu.desnoyers@efficios.com>

----- On Sep 10, 2019, at 8:27 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> Considering that some custom libc could possibly choose not to use
> CLONE_SETTLS, we should allow the libc to override the choice of clone
> flags meant to unregister rseq. This is a policy decision which should
> not be made by the kernel.
> 
> Therefore, introduce a new RSEQ_FLAG_UNREG_CLONE_FLAGS, which makes the
> rseq system call expect an additional 5th argument: a mask of all the
> clone flags which may each ensure rseq is unregistered upon clone.
> 
> So even if CLONE_SETTLS is eventually replaced by some other flag in the
> future, the libc will be able to adapt and pass this new flag upon rseq
> registration as well.
> 
> The default when RSEQ_FLAG_UNREG_CLONE_FLAGS is unset is to unregister
> rseq on clone with CLONE_SETTLS.

Now that I think about this a bit more, perhaps it's better to just
clear on clone CLONE_SETTLS, and wait until a libc actually requires this
unreg_clone_flags override before introducing it.

Anyway, a "custom" libc could unregister / re-register around clone()
as a stopgap solution.

So I am tempted to drop this specific patch for now until actual libc
requiring it show up.

Thoughts ?

Thanks,

Mathieu

> 
> Suggested-by: "H . Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "H . Peter Anvin" <hpa@zytor.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: linux-api@vger.kernel.org
> ---
> include/linux/sched.h     |  9 +++++++--
> include/linux/syscalls.h  |  2 +-
> include/uapi/linux/rseq.h |  1 +
> kernel/rseq.c             | 14 +++++++++++---
> 4 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index deb4154dbf11..c8faa6f8493d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1138,6 +1138,7 @@ struct task_struct {
> 	 * with respect to preemption.
> 	 */
> 	unsigned long rseq_event_mask;
> +	int rseq_unreg_clone_flags;
> #endif
> 
> 	struct tlbflush_unmap_batch	tlb_ubc;
> @@ -1919,18 +1920,21 @@ static inline void rseq_migrate(struct task_struct *t)
> 
> /*
>  * If parent process has a registered restartable sequences area, the
> - * child inherits. Unregister rseq for a clone with CLONE_TLS set.
> + * child inherits, except if it has been required to be explicitly
> + * unregistered when any of the rseq_unreg_clone_flags are passed to clone.
>  */
> static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
> {
> -	if (clone_flags & CLONE_TLS) {
> +	if (clone_flags & t->rseq_unreg_clone_flags) {
> 		t->rseq = NULL;
> 		t->rseq_sig = 0;
> 		t->rseq_event_mask = 0;
> +		t->rseq_unreg_clone_flags = 0;
> 	} else {
> 		t->rseq = current->rseq;
> 		t->rseq_sig = current->rseq_sig;
> 		t->rseq_event_mask = current->rseq_event_mask;
> +		t->rseq_unreg_clone_flags = current->rseq_unreg_clone_flags;
> 	}
> }
> 
> @@ -1939,6 +1943,7 @@ static inline void rseq_execve(struct task_struct *t)
> 	t->rseq = NULL;
> 	t->rseq_sig = 0;
> 	t->rseq_event_mask = 0;
> +	t->rseq_unreg_clone_flags = 0;
> }
> 
> #else
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 88145da7d140..6a242cfcc360 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -987,7 +987,7 @@ asmlinkage long sys_pkey_free(int pkey);
> asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> 			  unsigned mask, struct statx __user *buffer);
> asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
> -			 int flags, uint32_t sig);
> +			 int flags, uint32_t sig, int unreg_clone_flags);
> asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
> asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
> 			       int to_dfd, const char __user *to_path,
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index 9a402fdb60e9..d71e3c6b7fdb 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -20,6 +20,7 @@ enum rseq_cpu_id_state {
> 
> enum rseq_flags {
> 	RSEQ_FLAG_UNREGISTER = (1 << 0),
> +	RSEQ_FLAG_UNREG_CLONE_FLAGS = (1 << 1),
> };
> 
> enum rseq_cs_flags_bit {
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index a4f86a9d6937..c59b8d3dc275 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -304,8 +304,8 @@ void rseq_syscall(struct pt_regs *regs)
> /*
>  * sys_rseq - setup restartable sequences for caller thread.
>  */
> -SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
> -		int, flags, u32, sig)
> +SYSCALL_DEFINE5(rseq, struct rseq __user *, rseq, u32, rseq_len,
> +		int, flags, u32, sig, int, unreg_clone_flags)
> {
> 	int ret;
> 
> @@ -324,12 +324,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> rseq_len,
> 			return ret;
> 		current->rseq = NULL;
> 		current->rseq_sig = 0;
> +		current->rseq_unreg_clone_flags = 0;
> 		return 0;
> 	}
> 
> -	if (unlikely(flags))
> +	if (unlikely(flags & ~RSEQ_FLAG_UNREG_CLONE_FLAGS))
> 		return -EINVAL;
> 
> +	if (!(flags & RSEQ_FLAG_UNREG_CLONE_FLAGS))
> +		unreg_clone_flags = CLONE_SETTLS;
> +
> 	if (current->rseq) {
> 		/*
> 		 * If rseq is already registered, check whether
> @@ -338,6 +342,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> rseq_len,
> 		 */
> 		if (current->rseq != rseq || rseq_len != sizeof(*rseq))
> 			return -EINVAL;
> +		if ((flags & RSEQ_FLAG_UNREG_CLONE_FLAGS) &&
> +		    current->rseq_unreg_clone_flags != unreg_clone_flags)
> +			return -EINVAL;
> 		if (current->rseq_sig != sig)
> 			return -EPERM;
> 		/* Already registered. */
> @@ -355,6 +362,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> rseq_len,
> 		return -EFAULT;
> 	current->rseq = rseq;
> 	current->rseq_sig = sig;
> +	current->rseq_unreg_clone_flags = unreg_clone_flags;
> 	/*
> 	 * If rseq was previously inactive, and has just been
> 	 * registered, ensure the cpu_id_start and cpu_id fields
> --
> 2.17.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [PATCH for 5.3 1/3] rseq: Fix: Reject unknown flags on rseq unregister
From: Mathieu Desnoyers @ 2019-09-13 15:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, stable

It is preferrable to reject unknown flags within rseq unregistration
rather than to ignore them. It is an oversight caused by the fact that
the check for unknown flags is after the rseq unregister flag check.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Paul Turner <pjt@google.com>
Cc: linux-api@vger.kernel.org
Cc: <stable@vger.kernel.org>
---
 kernel/rseq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 27c48eb7de40..a4f86a9d6937 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -310,6 +310,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	int ret;
 
 	if (flags & RSEQ_FLAG_UNREGISTER) {
+		if (flags & ~RSEQ_FLAG_UNREGISTER)
+			return -EINVAL;
 		/* Unregister rseq for current thread. */
 		if (current->rseq != rseq || !current->rseq)
 			return -EINVAL;
-- 
2.17.1

^ permalink raw reply related

* [PATCH for 5.3 2/3] rseq: Fix: Unregister rseq for CLONE_SETTLS
From: Mathieu Desnoyers @ 2019-09-13 15:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, Dmitry Vyukov,
	linux-api, stable
In-Reply-To: <20190913151220.3105-1-mathieu.desnoyers@efficios.com>

It has been reported by Google that rseq is not behaving properly
with respect to clone when CLONE_VM is used without CLONE_THREAD.
It keeps the prior thread's rseq TLS registered when the TLS of the
thread has moved, so the kernel deals with the wrong TLS.

The approach of clearing the per task-struct rseq registration
on clone with CLONE_THREAD flag is incomplete. It does not cover
the use-case of clone with CLONE_VM set, but without CLONE_THREAD.

Looking more closely at each of the clone flags:

- CLONE_THREAD,
- CLONE_VM,
- CLONE_SETTLS.

It appears that the flag we really want to track is CLONE_SETTLS, which
moves the location of the TLS for the child, making the rseq
registration point to the wrong TLS.

Suggested-by: "H . Peter Anvin" <hpa@zytor.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Paul Turner <pjt@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: linux-api@vger.kernel.org
Cc: <stable@vger.kernel.org>
---
 include/linux/sched.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..76bf55b5cccf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1919,11 +1919,11 @@ static inline void rseq_migrate(struct task_struct *t)
 
 /*
  * If parent process has a registered restartable sequences area, the
- * child inherits. Only applies when forking a process, not a thread.
+ * child inherits. Unregister rseq for a clone with CLONE_SETTLS set.
  */
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 {
-	if (clone_flags & CLONE_THREAD) {
+	if (clone_flags & CLONE_SETTLS) {
 		t->rseq = NULL;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Mathieu Desnoyers @ 2019-09-13 15:58 UTC (permalink / raw)
  To: carlos
  Cc: Florian Weimer, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel, linux-api
In-Reply-To: <7db64714-3dc5-b322-1edc-736b08ee7d63@redhat.com>

----- On Sep 11, 2019, at 3:00 PM, carlos carlos@redhat.com wrote:

> On 9/11/19 2:26 PM, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>> 
>>> +#ifdef SHARED
>>> +  if (rtld_active ())
>>> +    {
>>> +      /* Register rseq ABI to the kernel.   */
>>> +      (void) rseq_register_current_thread ();
>>> +    }
>>> +#else
>> 
>> I think this will need *another* check for the inner libc in an audit
>> module.  See what we do in malloc.  __libc_multiple_libcs is supposed to
>> cover that, but it's unfortunately not reliable.
>> 
>> I believe without that additional check, the first audit module we load
>> will claim rseq, and the main program will not have control over the
>> rseq area.  Reversed roles would be desirable here.
>> 
>> In October, I hope to fix __libc_multiple_libcs, and then you can use
>> just that.  (We have a Fedora patch that is supposed to fix it, I need
>> to documen the mechanism and upstream it.)
> 
> This is a technical issue we can resolve.

I'm unsure whether there are changes I need to do in my rseq patchset, or
if this is a separate issue that will be fixed separately before glibc 2.31
is out, which would then update the rseq bits accordingly ?

> 
>>> +/* Advertise Restartable Sequences registration ownership across
>>> +   application and shared libraries.
>>> +
>>> +   Libraries and applications must check whether this variable is zero or
>>> +   non-zero if they wish to perform rseq registration on their own. If it
>>> +   is zero, it means restartable sequence registration is not handled, and
>>> +   the library or application is free to perform rseq registration. In
>>> +   that case, the library or application is taking ownership of rseq
>>> +   registration, and may set __rseq_handled to 1. It may then set it back
>>> +   to 0 after it completes unregistering rseq.
>>> +
>>> +   If __rseq_handled is found to be non-zero, it means that another
>>> +   library (or the application) is currently handling rseq registration.
>>> +
>>> +   Typical use of __rseq_handled is within library constructors and
>>> +   destructors, or at program startup.  */
>>> +
>>> +int __rseq_handled;
>> 
>> Are there any programs that use __rseq_handled *today*?

No, because I told all open source project developers asking whether they
can use rseq to wait until we agree on _this_ precise user-level ABI
(__rseq_abi and __rseq_handled). Until we agree on this, there _can_
be no users, unless they are willing to suffer conflicts when their
application/program is linked against an updated glibc.

>> 
>> I'm less convinced that we actually need this.  I don't think we have
>> ever done anything like that before, and I don't think it's necessary.
>> Any secondary rseq library just needs to note if it could perform
>> registration, and if it failed to do so, do not perform unregistration
>> in a pthread destructor callback.

If that secondary rseq library happens to try to perform registration within
its library constructor (before glibc has performed the __rseq_abi TLS
registration), we end up in a situation where the secondary library takes
ownership of rseq, even though libc would require ownership. This is a
scenario we want to avoid.

Making sure libc reserves ownership through __rseq_handled (which is
a non-TLS variable that can be accessed early in the program lifetime)
protects against this.

>> 
>> Sure, there's the matter of pthread destructor ordering, but that
>> problem is not different from any other singleton (thread-local or not),
>> and the fix for the last time this has come up (TLS destructors vs
>> dlclose) was to upgrade glibc.
> 
> This is a braoder issue.
> 
> Mathieu,
> 
> It would be easier to merge the patch set if it were just an unconditional
> registration like we do for set_robust_list().
> 
> What's your thought there?

I don't expect set_robust_list was really useful without glibc support.
In the current case, rseq can be used by applications and libraries even
with older glibc. My goal is to enable such use and not wait for years
before end-users upgrade their glibc before rseq can be used.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)
From: Florian Weimer @ 2019-09-14  1:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <1137395748.2754.1568390288746.JavaMail.zimbra@efficios.com>

* Mathieu Desnoyers:

> I'm unsure whether there are changes I need to do in my rseq patchset, or
> if this is a separate issue that will be fixed separately before glibc 2.31
> is out, which would then update the rseq bits accordingly ?

Someone else (perhaps me) has to fix __libc_multiple_libcs.  Then you
can use it instead/in addition to the rtld_active check (depending on
the semantics we agree upon for __libc_multiple_libcs).

Fixing __libc_multiple_libcs may also address the early initialization
issue because for that to be always correct, we need to run the
initialization code before ELF constructors.

>>> I'm less convinced that we actually need this.  I don't think we have
>>> ever done anything like that before, and I don't think it's necessary.
>>> Any secondary rseq library just needs to note if it could perform
>>> registration, and if it failed to do so, do not perform unregistration
>>> in a pthread destructor callback.
>
> If that secondary rseq library happens to try to perform registration within
> its library constructor (before glibc has performed the __rseq_abi TLS
> registration), we end up in a situation where the secondary library takes
> ownership of rseq, even though libc would require ownership. This is a
> scenario we want to avoid.

We can avoid that if we run the glibc initialization before user code
(except IFUNC resolvers).  glibc itself doesn't have to do the
initialization from an ELF constructor.

> Making sure libc reserves ownership through __rseq_handled (which is
> a non-TLS variable that can be accessed early in the program lifetime)
> protects against this.

If that's it's only purpose, I don't think it's necessary.  If the
kernel can fail the second registration attempt, that would be all the
information the alternative rseq implementation needs (plus the matter
of destruction).

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH for 5.3 2/3] rseq: Fix: Unregister rseq for CLONE_SETTLS
From: Mathieu Desnoyers @ 2019-09-14 14:21 UTC (permalink / raw)
  To: Thomas Gleixner, Neel Natu
  Cc: linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin,
	Paul Turner, Dmitry Vyukov, linux-api, stable
In-Reply-To: <20190913151220.3105-2-mathieu.desnoyers@efficios.com>

There is an ongoing discussion on the choice of flag we want to care
about here. Therefore, please don't pull this patch until we reach an
agreement.

Thanks,

Mathieu

----- On Sep 13, 2019, at 11:12 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> It has been reported by Google that rseq is not behaving properly
> with respect to clone when CLONE_VM is used without CLONE_THREAD.
> It keeps the prior thread's rseq TLS registered when the TLS of the
> thread has moved, so the kernel deals with the wrong TLS.
> 
> The approach of clearing the per task-struct rseq registration
> on clone with CLONE_THREAD flag is incomplete. It does not cover
> the use-case of clone with CLONE_VM set, but without CLONE_THREAD.
> 
> Looking more closely at each of the clone flags:
> 
> - CLONE_THREAD,
> - CLONE_VM,
> - CLONE_SETTLS.
> 
> It appears that the flag we really want to track is CLONE_SETTLS, which
> moves the location of the TLS for the child, making the rseq
> registration point to the wrong TLS.
> 
> Suggested-by: "H . Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "H . Peter Anvin" <hpa@zytor.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: linux-api@vger.kernel.org
> Cc: <stable@vger.kernel.org>
> ---
> include/linux/sched.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9f51932bd543..76bf55b5cccf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1919,11 +1919,11 @@ static inline void rseq_migrate(struct task_struct *t)
> 
> /*
>  * If parent process has a registered restartable sequences area, the
> - * child inherits. Only applies when forking a process, not a thread.
> + * child inherits. Unregister rseq for a clone with CLONE_SETTLS set.
>  */
> static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
> {
> -	if (clone_flags & CLONE_THREAD) {
> +	if (clone_flags & CLONE_SETTLS) {
> 		t->rseq = NULL;
> 		t->rseq_sig = 0;
> 		t->rseq_event_mask = 0;
> --
> 2.17.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH for 5.3 2/3] rseq: Fix: Unregister rseq for CLONE_SETTLS
From: Mathieu Desnoyers @ 2019-09-16 20:26 UTC (permalink / raw)
  To: Thomas Gleixner, Neel Natu
  Cc: linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin,
	Paul Turner, Dmitry Vyukov, linux-api, stable
In-Reply-To: <819646407.3304.1568470889470.JavaMail.zimbra@efficios.com>

----- On Sep 14, 2019, at 10:21 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> There is an ongoing discussion on the choice of flag we want to care
> about here. Therefore, please don't pull this patch until we reach an
> agreement.

Following discussion with Neel Natu (Google) and Paul Turner (Google),
I plan to modify this patch, and unregister RSEQ on clone CLONE_VM for the
following reasons:

1) CLONE_THREAD requires CLONE_SIGHAND, which requires CLONE_VM to be
   set. Therefore, just checking for CLONE_VM covers all CLONE_THREAD uses,

2) There is the possibility of an unlikely scenario where CLONE_SETTLS is used
   without CLONE_VM. In order to be an issue, it would require that the rseq
   TLS is in a shared memory area.

   I do not plan on adding CLONE_SETTLS to the set of clone flags which
   unregister RSEQ, because it would require that we also unregister RSEQ
   on set_thread_area(2) and arch_prctl(2) ARCH_SET_FS for completeness.
   So rather than doing a partial solution, it appears better to let user-space
   explicitly perform rseq unregistration across clone if needed in scenarios
   where CLONE_VM is not set.

Thoughts ?

Thanks,

Mathieu


> 
> Thanks,
> 
> Mathieu
> 
> ----- On Sep 13, 2019, at 11:12 AM, Mathieu Desnoyers
> mathieu.desnoyers@efficios.com wrote:
> 
>> It has been reported by Google that rseq is not behaving properly
>> with respect to clone when CLONE_VM is used without CLONE_THREAD.
>> It keeps the prior thread's rseq TLS registered when the TLS of the
>> thread has moved, so the kernel deals with the wrong TLS.
>> 
>> The approach of clearing the per task-struct rseq registration
>> on clone with CLONE_THREAD flag is incomplete. It does not cover
>> the use-case of clone with CLONE_VM set, but without CLONE_THREAD.
>> 
>> Looking more closely at each of the clone flags:
>> 
>> - CLONE_THREAD,
>> - CLONE_VM,
>> - CLONE_SETTLS.
>> 
>> It appears that the flag we really want to track is CLONE_SETTLS, which
>> moves the location of the TLS for the child, making the rseq
>> registration point to the wrong TLS.
>> 
>> Suggested-by: "H . Peter Anvin" <hpa@zytor.com>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
>> Cc: Boqun Feng <boqun.feng@gmail.com>
>> Cc: "H . Peter Anvin" <hpa@zytor.com>
>> Cc: Paul Turner <pjt@google.com>
>> Cc: Dmitry Vyukov <dvyukov@google.com>
>> Cc: linux-api@vger.kernel.org
>> Cc: <stable@vger.kernel.org>
>> ---
>> include/linux/sched.h | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 9f51932bd543..76bf55b5cccf 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1919,11 +1919,11 @@ static inline void rseq_migrate(struct task_struct *t)
>> 
>> /*
>>  * If parent process has a registered restartable sequences area, the
>> - * child inherits. Only applies when forking a process, not a thread.
>> + * child inherits. Unregister rseq for a clone with CLONE_SETTLS set.
>>  */
>> static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
>> {
>> -	if (clone_flags & CLONE_THREAD) {
>> +	if (clone_flags & CLONE_SETTLS) {
>> 		t->rseq = NULL;
>> 		t->rseq_sig = 0;
>> 		t->rseq_event_mask = 0;
>> --
>> 2.17.1
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [PATCH for 5.4 0/3] Restartable Sequences Fixes
From: Mathieu Desnoyers @ 2019-09-17 18:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, stable,
	Mathieu Desnoyers

Hi,

Here is a small set of rseq fixes aiming Linux 5.4. Those should be
backported to stable kernels >= 4.18.

Thanks,

Mathieu

Mathieu Desnoyers (3):
  rseq: Fix: Reject unknown flags on rseq unregister
  rseq: Fix: Unregister rseq for clone CLONE_VM
  rseq/selftests: Fix: Namespace gettid() for compatibility with glibc
    2.30

 include/linux/sched.h                     |  4 ++--
 kernel/rseq.c                             |  2 ++
 tools/testing/selftests/rseq/param_test.c | 18 ++++++++++--------
 3 files changed, 14 insertions(+), 10 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH for 5.4 1/3] rseq: Fix: Reject unknown flags on rseq unregister
From: Mathieu Desnoyers @ 2019-09-17 18:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, stable,
	Mathieu Desnoyers
In-Reply-To: <20190917182959.16333-1-mathieu.desnoyers@efficios.com>

It is preferrable to reject unknown flags within rseq unregistration
rather than to ignore them. It is an oversight caused by the fact that
the check for unknown flags is after the rseq unregister flag check.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Paul Turner <pjt@google.com>
Cc: linux-api@vger.kernel.org
Cc: <stable@vger.kernel.org>	# v4.18+
---
 kernel/rseq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 27c48eb7de40..a4f86a9d6937 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -310,6 +310,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	int ret;
 
 	if (flags & RSEQ_FLAG_UNREGISTER) {
+		if (flags & ~RSEQ_FLAG_UNREGISTER)
+			return -EINVAL;
 		/* Unregister rseq for current thread. */
 		if (current->rseq != rseq || !current->rseq)
 			return -EINVAL;
-- 
2.17.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox