Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Florian Weimer @ 2019-06-12 14:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <1329439108.43041.1560348962006.JavaMail.zimbra@efficios.com>

* Mathieu Desnoyers:

>> It's the registration from libc.so which needs some care.  In
>> particular, we must not override an existing registration.
>
> OK, so it could check if __rseq_abi.cpu_id is -1, and only
> perform registration if it is the case. Or do you have another
> approach in mind ?

No, __rseq_abi will not be shared with the outer libc, so the inner libc
will always see -1 there, even if the outer libc has performed
registration.

libio/vtables.c has some example what you can do:

  /* In case this libc copy is in a non-default namespace, we always
     need to accept foreign vtables because there is always a
     possibility that FILE * objects are passed across the linking
     boundary.  */
  {
    Dl_info di;
    struct link_map *l;
    if (!rtld_active ()
        || (_dl_addr (_IO_vtable_check, &di, &l, NULL) != 0
            && l->l_ns != LM_ID_BASE))
      return;
  }

_IO_vtable_check would have to be replaced with your own function; the
actual function doesn't really matter.

The rtld_active check covers the static dlopen case, where
rtld_active () is false in the inner libc.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Mathieu Desnoyers @ 2019-06-12 14:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <87wohzorj0.fsf@oldenburg2.str.redhat.com>

----- On Jun 6, 2019, at 1:57 PM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
[...]
> 
>>> The final remaining case is static dlopen.  There is a copy of ld.so on
>>> the dynamic side, but it is completely inactive and has never run.  I do
>>> not think we need to support that because multi-threading does not work
>>> reliably in this scenario, either.  However, we should skip rseq
>>> registration in a nested libc (see the rtld_active function).
>>
>> So for SHARED, if (!rtld_active ()), we should indeed leave the state of
>> __rseq_handled as it is, because we are within a nested inactive ld.so.
> 
> I think we should add __rseq_handled initialization to ld.so, so it will
> only run once, ever.

OK

> 
> It's the registration from libc.so which needs some care.  In
> particular, we must not override an existing registration.

OK, so it could check if __rseq_abi.cpu_id is -1, and only
perform registration if it is the case. Or do you have another
approach in mind ?

For the main thread, "nested" unregistration does not appear to be a
problem, because we rely on program exit() to implicitly unregister.

Thanks,

Mathieu

> 
> Thanks,
> Florian

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Mathieu Desnoyers @ 2019-06-12 14:00 UTC (permalink / raw)
  To: carlos, Florian Weimer
  Cc: Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <c16c9785-7f8c-430b-a4df-a53e47bf1600@redhat.com>

----- On Jun 10, 2019, at 4:43 PM, carlos carlos@redhat.com wrote:

> On 6/6/19 7:57 AM, Florian Weimer wrote:
>> Let me ask the key question again: Does it matter if code observes the
>> rseq area first without kernel support, and then with kernel support?
>> If we don't expect any problems immediately, we do not need to worry
>> much about the constructor ordering right now.  I expect that over time,
>> fixing this properly will become easier.
> 
> I just wanted to chime in and say that splitting this into:
> 
> * Ownership (__rseq_handled)
> 
> * Initialization (__rseq_abi)
> 
> Makes sense to me.
> 
> I agree we need an answer to this question of ownership but not yet
> initialized, to owned and initialized.
> 
> I like the idea of having __rseq_handled in ld.so.

Very good, so I'll implement this approach. Sorry for the delayed
feedback, I am traveling this week.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [tip:x86/core] Documentation/filesystems/proc.txt: Add arch_status file
From: tip-bot for Aubrey Li @ 2019-06-12 12:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: adobriyan, arjan, tglx, aubrey.li, peterz, mingo, tim.c.chen,
	linux-api, dave.hansen, hpa, ak, luto, linux-kernel, akpm
In-Reply-To: <20190606012236.9391-3-aubrey.li@linux.intel.com>

Commit-ID:  711486fd18596315d42cebaac3dba8c408f60a3d
Gitweb:     https://git.kernel.org/tip/711486fd18596315d42cebaac3dba8c408f60a3d
Author:     Aubrey Li <aubrey.li@linux.intel.com>
AuthorDate: Thu, 6 Jun 2019 09:22:36 +0800
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 12 Jun 2019 11:42:13 +0200

Documentation/filesystems/proc.txt: Add arch_status file

Add documentation for /proc/<pid>/arch_status file and the x86 specific
AVX512_elapsed_ms entry in it.

[ tglx: Massage changelog ]

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: peterz@infradead.org
Cc: hpa@zytor.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: dave.hansen@intel.com
Cc: arjan@linux.intel.com
Cc: adobriyan@gmail.com
Cc: aubrey.li@intel.com
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190606012236.9391-3-aubrey.li@linux.intel.com

---
 Documentation/filesystems/proc.txt | 40 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..a226061fa109 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc/<pid>/map_files - Information about memory mapped files
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+  3.12	/proc/<pid>/arch_status - Task architecture specific information
 
   4	Configuring procfs
   4.1	Mount options
@@ -1948,6 +1949,45 @@ patched.  If the patch is being enabled, then the task has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12 /proc/<pid>/arch_status - task architecture specific status
+-------------------------------------------------------------------
+When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
+architecture specific status of the task.
+
+Example
+-------
+ $ cat /proc/6753/arch_status
+ AVX512_elapsed_ms:      8
+
+Description
+-----------
+
+x86 specific entries:
+---------------------
+ AVX512_elapsed_ms:
+ ------------------
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+    1) The time which the task spent on the CPU without being scheduled
+       out. With CPU isolation and a single runnable task this can take
+       several seconds.
+
+    2) The time since the task was scheduled out last. Depending on the
+       reason for being scheduled out (time slice exhausted, syscall ...)
+       this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not. Precise information can be obtained
+  with performance counters.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user, but depends on the workload and the
+  scheduling scenario, it also could be a false negative mentioned above.
 
 ------------------------------------------------------------------------------
 Configuring procfs

^ permalink raw reply related

* [tip:x86/core] x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status
From: tip-bot for Aubrey Li @ 2019-06-12 12:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: adobriyan, mingo, luto, tim.c.chen, hpa, ak, peterz, dave.hansen,
	tglx, aubrey.li, linux-kernel, arjan, akpm, linux-api
In-Reply-To: <20190606012236.9391-2-aubrey.li@linux.intel.com>

Commit-ID:  0c608dad2a771c0a11b6d12148d1a8b975e015d4
Gitweb:     https://git.kernel.org/tip/0c608dad2a771c0a11b6d12148d1a8b975e015d4
Author:     Aubrey Li <aubrey.li@linux.intel.com>
AuthorDate: Thu, 6 Jun 2019 09:22:35 +0800
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 12 Jun 2019 11:42:13 +0200

x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status

AVX-512 components usage can result in turbo frequency drop. So it's useful
to expose AVX-512 usage elapsed time as a heuristic hint for user space job
schedulers to cluster the AVX-512 using tasks together.

Examples:
$ while [ 1 ]; do cat /proc/tid/arch_status | grep AVX512; sleep 1; done
AVX512_elapsed_ms:      4
AVX512_elapsed_ms:      8
AVX512_elapsed_ms:      4

This means that 4 milliseconds have elapsed since the tsks AVX512 usage was
detected when the task was scheduled out.

$ cat /proc/tid/arch_status | grep AVX512
AVX512_elapsed_ms:      -1

'-1' indicates that no AVX512 usage was recorded for this task.

The time exposed is not necessarily accurate when the arch_status file is
read as the AVX512 usage is only evaluated when a task is scheduled
out. Accurate usage information can be obtained with performance counters.

[ tglx: Massaged changelog ]

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: peterz@infradead.org
Cc: hpa@zytor.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: dave.hansen@intel.com
Cc: arjan@linux.intel.com
Cc: adobriyan@gmail.com
Cc: aubrey.li@intel.com
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190606012236.9391-2-aubrey.li@linux.intel.com

---
 arch/x86/Kconfig             |  1 +
 arch/x86/kernel/fpu/xstate.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..8a49b4b03f6b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -217,6 +217,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select PROC_PID_ARCH_STATUS		if PROC_FS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 3c36dd1784db..591ddde3b3e8 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -8,6 +8,8 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1240,3 +1242,48 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+			struct pid *pid, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and build option supported.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+
+	return 0;
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */

^ permalink raw reply related

* [tip:x86/core] proc: Add /proc/<pid>/arch_status
From: tip-bot for Aubrey Li @ 2019-06-12 12:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, adobriyan, mingo, dave.hansen, tim.c.chen, arjan,
	linux-kernel, tglx, ak, peterz, linux-api, luto, aubrey.li, hpa
In-Reply-To: <20190606012236.9391-1-aubrey.li@linux.intel.com>

Commit-ID:  68bc30bb9f33fc8d11e3d110d29e06490896a999
Gitweb:     https://git.kernel.org/tip/68bc30bb9f33fc8d11e3d110d29e06490896a999
Author:     Aubrey Li <aubrey.li@linux.intel.com>
AuthorDate: Thu, 6 Jun 2019 09:22:34 +0800
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 12 Jun 2019 11:42:13 +0200

proc: Add /proc/<pid>/arch_status

Exposing architecture specific per process information is useful for
various reasons. An example is the AVX512 usage on x86 which is important
for task placement for power/performance optimizations.

Adding this information to the existing /prcc/pid/status file would be the
obvious choise, but it has been agreed on that a explicit arch_status file
is better in separating the generic and architecture specific information.

[ tglx: Massage changelog ]

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: peterz@infradead.org
Cc: hpa@zytor.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: dave.hansen@intel.com
Cc: arjan@linux.intel.com
Cc: adobriyan@gmail.com
Cc: aubrey.li@intel.com
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Linux API <linux-api@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190606012236.9391-1-aubrey.li@linux.intel.com

---
 fs/proc/Kconfig         | 4 ++++
 fs/proc/base.c          | 6 ++++++
 include/linux/proc_fs.h | 9 +++++++++
 3 files changed, 19 insertions(+)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 62ee41b4bbd0..4c3dcb718961 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -98,3 +98,7 @@ config PROC_CHILDREN
 
 	  Say Y if you are running any user-space software which takes benefit from
 	  this interface. For example, rkt is such a piece of software.
+
+config PROC_PID_ARCH_STATUS
+	def_bool n
+	depends on PROC_FS
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9c8ca6cd3ce4..ec436c61eece 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3061,6 +3061,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_STACKLEAK_METRICS
 	ONE("stack_depth", S_IRUGO, proc_stack_depth),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3449,6 +3452,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_LIVEPATCH
 	ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..a705aa2d03f9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -75,6 +75,15 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
 						    void *data);
 extern struct pid *tgid_pidfd_to_pid(const struct file *file);
 
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * The architecture which selects CONFIG_PROC_PID_ARCH_STATUS must
+ * provide proc_pid_arch_status() definition.
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+			struct pid *pid, struct task_struct *task);
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)

^ permalink raw reply related

* Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Pavel Machek @ 2019-06-12 11:37 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api,
	Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao,
	Brian Geffon, jannh, oleg, christian, hdanton, lizeb
In-Reply-To: <20190612111920.evedpmre63ivnxkz@butterfly.localdomain>

[-- Attachment #1: Type: text/plain, Size: 1969 bytes --]

Hi!

> > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > information required to make the reclaim decision is not known to the app.
> > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > must be able to initiate reclaim on its own without any app involvement.
> > > To solve the concern, this patch introduces new syscall -
> > > 
> > >     struct pr_madvise_param {
> > >             int size;               /* the size of this structure */
> > >             int cookie;             /* reserved to support atomicity */
> > >             int nr_elem;            /* count of below arrary fields */
> > >             int __user *hints;      /* hints for each range */
> > >             /* to store result of each operation */
> > >             const struct iovec __user *results;
> > >             /* input address ranges */
> > >             const struct iovec __user *ranges;
> > >     };
> > >     
> > >     int process_madvise(int pidfd, struct pr_madvise_param *u_param,
> > >                             unsigned long flags);
> > 
> > That's quite a complex interface.
> > 
> > Could we simply have feel_free_to_swap_out(int pid) syscall? :-).
> 
> I wonder for how long we'll go on with adding new syscalls each time we need
> some amendment to existing interfaces. Yes, clone6(), I'm looking at
> you :(.
> 
> In case of process_madvise() keep in mind it will be focused not only on
> MADV_COLD, but also, potentially, on other MADV_ flags as well. I can
> hardly imagine we'll add one syscall per each flag.

Use case described above talked about whole-process-at-a-time usage,
so I'm asking if simpler interface/code is enough. If there's
motivation for more complex version, it should be described here...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Oleksandr Natalenko @ 2019-06-12 11:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api,
	Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao,
	Brian Geffon, jannh, oleg, christian, hdanton, lizeb
In-Reply-To: <20190612105945.GA16442@amd>

On Wed, Jun 12, 2019 at 12:59:45PM +0200, Pavel Machek wrote:
> > - Problem
> > 
> > Naturally, cached apps were dominant consumers of memory on the system.
> > However, they were not significant consumers of swap even though they are
> > good candidate for swap. Under investigation, swapping out only begins
> > once the low zone watermark is hit and kswapd wakes up, but the overall
> > allocation rate in the system might trip lmkd thresholds and cause a cached
> > process to be killed(we measured performance swapping out vs. zapping the
> > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > even though we use zram which is much faster than real storage) so kill
> > from lmkd will often satisfy the high zone watermark, resulting in very
> > few pages actually being moved to swap.
> 
> Is it still faster to swap-in the application than to restart it?

It's the same type of question I was addressing earlier in the remote
KSM discussion: making applications aware of all the memory management stuff
or delegate the decision to some supervising task.

In this case, we cannot rewrite all the application to handle imaginary
SIGRESTART (or whatever you invent to handle restarts gracefully). SIGTERM
may require more memory to finish stuff to not lose your data (and I guess
you don't want to lose your data, right?), and SIGKILL is pretty much
destructive.

Offloading proactive memory management to a process that knows how to do
it allows to handle not only throwaway containers/microservices, but also
usual desktop/mobile workflow.

> > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > information required to make the reclaim decision is not known to the app.
> > Instead, it is known to a centralized userspace daemon, and that daemon
> > must be able to initiate reclaim on its own without any app involvement.
> > To solve the concern, this patch introduces new syscall -
> > 
> >     struct pr_madvise_param {
> >             int size;               /* the size of this structure */
> >             int cookie;             /* reserved to support atomicity */
> >             int nr_elem;            /* count of below arrary fields */
> >             int __user *hints;      /* hints for each range */
> >             /* to store result of each operation */
> >             const struct iovec __user *results;
> >             /* input address ranges */
> >             const struct iovec __user *ranges;
> >     };
> >     
> >     int process_madvise(int pidfd, struct pr_madvise_param *u_param,
> >                             unsigned long flags);
> 
> That's quite a complex interface.
> 
> Could we simply have feel_free_to_swap_out(int pid) syscall? :-).

I wonder for how long we'll go on with adding new syscalls each time we need
some amendment to existing interfaces. Yes, clone6(), I'm looking at
you :(.

In case of process_madvise() keep in mind it will be focused not only on
MADV_COLD, but also, potentially, on other MADV_ flags as well. I can
hardly imagine we'll add one syscall per each flag.

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply

* Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Pavel Machek @ 2019-06-12 10:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, lizeb
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2114 bytes --]

Hi!

> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.

Is it still faster to swap-in the application than to restart it?


> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
> 
>     struct pr_madvise_param {
>             int size;               /* the size of this structure */
>             int cookie;             /* reserved to support atomicity */
>             int nr_elem;            /* count of below arrary fields */
>             int __user *hints;      /* hints for each range */
>             /* to store result of each operation */
>             const struct iovec __user *results;
>             /* input address ranges */
>             const struct iovec __user *ranges;
>     };
>     
>     int process_madvise(int pidfd, struct pr_madvise_param *u_param,
>                             unsigned long flags);

That's quite a complex interface.

Could we simply have feel_free_to_swap_out(int pid) syscall? :-).

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-06-12  9:32 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Florian Weimer, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <031bc55d8dcdcf4f031e6ff27c33fd52c61d33a5.camel@intel.com>

On Tue, Jun 11, 2019 at 12:31:34PM -0700, Yu-cheng Yu wrote:
> On Tue, 2019-06-11 at 12:41 +0100, Dave Martin wrote:
> > On Mon, Jun 10, 2019 at 07:24:43PM +0200, Florian Weimer wrote:
> > > * Yu-cheng Yu:
> > > 
> > > > To me, looking at PT_GNU_PROPERTY and not trying to support anything is a
> > > > logical choice.  And it breaks only a limited set of toolchains.
> > > > 
> > > > I will simplify the parser and leave this patch as-is for anyone who wants
> > > > to
> > > > back-port.  Are there any objections or concerns?
> > > 
> > > Red Hat Enterprise Linux 8 does not use PT_GNU_PROPERTY and is probably
> > > the largest collection of CET-enabled binaries that exists today.
> > 
> > For clarity, RHEL is actively parsing these properties today?
> > 
> > > My hope was that we would backport the upstream kernel patches for CET,
> > > port the glibc dynamic loader to the new kernel interface, and be ready
> > > to run with CET enabled in principle (except that porting userspace
> > > libraries such as OpenSSL has not really started upstream, so many
> > > processes where CET is particularly desirable will still run without
> > > it).
> > > 
> > > I'm not sure if it is a good idea to port the legacy support if it's not
> > > part of the mainline kernel because it comes awfully close to creating
> > > our own private ABI.
> > 
> > I guess we can aim to factor things so that PT_NOTE scanning is
> > available as a fallback on arches for which the absence of
> > PT_GNU_PROPERTY is not authoritative.
> 
> We can probably check PT_GNU_PROPERTY first, and fallback (based on ld-linux
> version?) to PT_NOTE scanning?

For arm64, we can check for PT_GNU_PROPERTY and then give up
unconditionally.

For x86, we would fall back to PT_NOTE scanning, but this will add a bit
of cost to binaries that don't have NT_GNU_PROPERTY_TYPE_0.  The ld.so
version doesn't tell you what ELF ABI a given executable conforms to.

Since this sounds like it's largely a distro-specific issue, maybe there
could be a Kconfig option to turn the fallback PT_NOTE scanning on?

Cheers
---Dave

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: David Howells @ 2019-06-12  8:55 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Andy Lutomirski, Casey Schaufler, Andy Lutomirski,
	Al Viro, USB list, LSM List, Greg Kroah-Hartman, raven,
	Linux FS Devel, Linux API, linux-block, keyrings, LKML,
	Paul Moore
In-Reply-To: <cf3f4865-b6d7-7303-0212-960439e0c119@tycho.nsa.gov>

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> 2) If notifications can be triggered by read-like operations (as in fanotify,
> for example), then a "read" can be turned into a "write" flow through a
> notification.

I don't think any of the things can be classed as "read-like" operations.  At
the moment, there are the following groups:

 (1) Addition of objects (eg. key_link, mount).

 (2) Modifications to things (eg. keyctl_write, remount).

 (3) Removal of objects (eg. key_unlink, unmount, fput+FMODE_NEED_UNMOUNT).

 (4) I/O or hardware errors (eg. USB device add/remove, EDQUOT, ENOSPC).

I have not currently defined any access events.

I've been looking at the possibility of having epoll generate events this way,
but that's not as straightforward as I'd hoped and fanotify could potentially
use it also, but in both those cases, the process is already getting the
events currently by watching for them using synchronous waiting syscalls.
Instead this would generate an event to say it had happened.

David

^ permalink raw reply

* Re: [PATCH 09/10] usb: Add USB subsystem notifications [ver #3]
From: Felipe Balbi @ 2019-06-12  6:58 UTC (permalink / raw)
  To: Alan Stern
  Cc: Mathias Nyman, Greg Kroah-Hartman, David Howells, viro, linux-usb,
	raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <Pine.LNX.4.44L0.1906110950440.1535-100000@iolanthe.rowland.org>


Hi,

Alan Stern <stern@rowland.harvard.edu> writes:

> On Tue, 11 Jun 2019, Felipe Balbi wrote:
>
>> >> >> > So for "severe" issues, yes, we should do this, but perhaps not for all
>> >> >> > of the "normal" things we see when a device is yanked out of the system
>> >> >> > and the like.
>> >> >> 
>> >> >> Then what counts as a "severe" issue?  Anything besides enumeration 
>> >> >> failure?
>> >> >
>> >> > Not that I can think of at the moment, other than the other recently
>> >> > added KOBJ_CHANGE issue.  I'm sure we have other "hard failure" issues
>> >> > in the USB stack that people will want exposed over time.
>> >> 
>> >> From an XHCI standpoint, Transaction Errors might be one thing. They
>> >> happen rarely and are a strong indication that the bus itself is
>> >> bad. Either bad cable, misbehaving PHYs, improper power management, etc.
>> >
>> > Don't you also get transaction errors if the user unplugs a device in 
>> > the middle of a transfer?  That's not the sort of thing we want to sent 
>> > notifications about.
>> 
>> Mathias, do we get Transaction Error if user removes cable during a
>> transfer? I thought we would just get Port Status Change with CC bit
>> cleared, no?
>
> Even if xHCI doesn't give Transaction Errors when a cable is unplugged 
> during a transfer, other host controllers do.  Sometimes quite a lot -- 
> they continue to occur until the kernel polls the parent hub's 
> interrupt ep and learns that the port is disconnected, which can take 
> up to 250 ms.

my comment was specific about XHCI. It even started with "From an XHCI
standpoint" :-)

-- 
balbi

^ permalink raw reply

* Re: [PATCH 1/7] signal.h: Define SIGINFO on all architectures
From: Arseny Maslennikov @ 2019-06-11 22:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Jiri Slaby, Ingo Molnar, Peter Zijlstra,
	linux-kernel, Vladimir D . Seleznev, linux-api
In-Reply-To: <87o933507j.fsf@xmission.com>

[-- Attachment #1: Type: text/plain, Size: 4419 bytes --]

On Tue, Jun 11, 2019 at 03:36:00PM -0500, Eric W. Biederman wrote:
> Arseny Maslennikov <ar@cs.msu.ru> writes:
> 
> > This complementary patch defines SIGINFO as a synonym for SIGPWR
> > on every architecture supported by the kernel.
> > The particular signal number chosen does not really matter and is only
> > required for the related tty functionality to work properly,
> > so if it does not suite expectations, any suggestions are warmly
> > welcome.
> >
> > SIGPWR looks like a nice candidate for this role, because it is
> > defined on every supported arch; it is currently only used to inform
> > PID 1 of power failures, and daemons that care about low-level
> > events do not tend to have a controlling terminal.
> >
> > However, on sparcs SIGPWR is a synonym for SIGLOST, a signal unique
> > to that architecture, with a narrow set of intended uses that do not
> > combine well with interactively requesting status.
> > SIGLOST is not used by any kernel code at the moment.
> > I'm not sure there is a more reasonable alternative right now.
> 
> Is the name SIGINFO already well established.
> 
> It just is a little bit confusing with struct siginfo.
> 

I thought about this, but the alternatives are clunkier.
SIGKBINFO? Implies there actually is a keyboard, most ttys are
virtual/software nowadays. Misplaced emphasis.
SIGSTATUS? Too generic.
SIGTINFO? This could work, like conveys the nature and all, but many
applications already do #ifdef SIGINFO; if we reuse the name they will
suddenly just work, and this helps adoption.
SIGTINFO does decipher to "terminal info", this sounds too ambiguous.

Fortunately, from UNIX tradition signal names are defined in caps, and
given there is much more related written discussion than spoken, this
stands out enough to my taste. I'd say, it stands out a lot.

> At least on x86 it looks like we have signals 32 and 33 that are
> reserved and not used for anything.  Is there a reason you have
> not picked one of those?

Quoting signal(7) from man-pages:
       The Linux kernel supports a range of 33 different real-time
       signals, numbered 32  to  64.   However,  the  glibc  POSIX
       threads  implementation  internally  uses two (for NPTL) or
       three   (for   LinuxThreads)   real-time    signals    (see
       pthreads(7)),  and  adjusts  the value of SIGRTMIN suitably
       (to 34 or 35).

This excerpt gives the cue that signals 32 and 33 are not really unused,
and I did not want to cause further adjustments to SIGRTMIN/SIGRTMAX,
since this means we have to recompile the world with the new, changed
SIGRTMIN/MAX...

If we'd change RTMIN, this makes all the rt signal numbers shift to the
right and can cause havoc if we do not recompile the world.

We could probably go with decrementing RTMAX and reusing the value
(effectively reserving a signal number from the end), but aliasing PWR
seems much easier and less harmful, and easier for apps to adopt.
If we find a strong reason not to reuse PWR, this may be considered as a
plan B, but the idea to allocate even more precious signals seems too
wasteful to me for now.

> 
> Also should this be a realtime signal with signal information
> or a non-realtime signal?

I see no requirement for this to be a realtime signal.

Non-realtime signals like PWR or INT also have at least some signal
information which can be accessed with handlers installed with
sigaction(2). [1] has a small example.
The most obvious piece of info is that signals sent by n_tty
(si_code==SI_KERNEL) are easy to distinguish from signals sent by
processes (si_code==SI_USER)

> 
> I don't expect there is much to encode except that the user is asking
> for information.  I half wonder if it could be done as a different
> si_code to SIGWINCH.  But of course that doesn't work because it is
> not a real time signal so does not queue more than one siginfo. (Sigh).
> 

Signals are just a huge mess (not that this is news...)  :)

> I just would like to see that we have a clear concept of how this new
> signal plays into all of the signal handling bits.
> 
> Added linux-api because this is fundamentally extending the linux-api,
> and we probably want man-page updates etc.
> 
> Eric
> 

[1]https://github.com/porrided/tty-kb-status-userspace/blob/d32028056c85b279cf8d5f43478b5419d090637c/siginfo-test/siginfo.c

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 1/7] signal.h: Define SIGINFO on all architectures
From: Eric W. Biederman @ 2019-06-11 20:36 UTC (permalink / raw)
  To: Arseny Maslennikov
  Cc: Greg Kroah-Hartman, Jiri Slaby, Ingo Molnar, Peter Zijlstra,
	linux-kernel, Vladimir D . Seleznev, linux-api
In-Reply-To: <20190605081906.28938-2-ar@cs.msu.ru>

Arseny Maslennikov <ar@cs.msu.ru> writes:

> This complementary patch defines SIGINFO as a synonym for SIGPWR
> on every architecture supported by the kernel.
> The particular signal number chosen does not really matter and is only
> required for the related tty functionality to work properly,
> so if it does not suite expectations, any suggestions are warmly
> welcome.
>
> SIGPWR looks like a nice candidate for this role, because it is
> defined on every supported arch; it is currently only used to inform
> PID 1 of power failures, and daemons that care about low-level
> events do not tend to have a controlling terminal.
>
> However, on sparcs SIGPWR is a synonym for SIGLOST, a signal unique
> to that architecture, with a narrow set of intended uses that do not
> combine well with interactively requesting status.
> SIGLOST is not used by any kernel code at the moment.
> I'm not sure there is a more reasonable alternative right now.

Is the name SIGINFO already well established.

It just is a little bit confusing with struct siginfo.

At least on x86 it looks like we have signals 32 and 33 that are
reserved and not used for anything.  Is there a reason you have
not picked one of those?

Also should this be a realtime signal with signal information
or a non-realtime signal?

I don't expect there is much to encode except that the user is asking
for information.  I half wonder if it could be done as a different
si_code to SIGWINCH.  But of course that doesn't work because it is
not a real time signal so does not queue more than one siginfo. (Sigh).

I just would like to see that we have a clear concept of how this new
signal plays into all of the signal handling bits.

Added linux-api because this is fundamentally extending the linux-api,
and we probably want man-page updates etc.

Eric

>
> Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
> ---
>  arch/arm/include/uapi/asm/signal.h     | 1 +
>  arch/h8300/include/uapi/asm/signal.h   | 1 +
>  arch/ia64/include/uapi/asm/signal.h    | 1 +
>  arch/m68k/include/uapi/asm/signal.h    | 1 +
>  arch/mips/include/uapi/asm/signal.h    | 1 +
>  arch/parisc/include/uapi/asm/signal.h  | 1 +
>  arch/powerpc/include/uapi/asm/signal.h | 1 +
>  arch/s390/include/uapi/asm/signal.h    | 1 +
>  arch/sparc/include/uapi/asm/signal.h   | 2 ++
>  arch/x86/include/uapi/asm/signal.h     | 1 +
>  arch/xtensa/include/uapi/asm/signal.h  | 1 +
>  include/uapi/asm-generic/signal.h      | 1 +
>  12 files changed, 13 insertions(+)
>
> diff --git a/arch/arm/include/uapi/asm/signal.h b/arch/arm/include/uapi/asm/signal.h
> index 9b4185ba4f8a..b80b53a17267 100644
> --- a/arch/arm/include/uapi/asm/signal.h
> +++ b/arch/arm/include/uapi/asm/signal.h
> @@ -50,6 +50,7 @@ typedef unsigned long sigset_t;
>  #define SIGLOST		29
>  */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  #define	SIGUNUSED	31
>  
> diff --git a/arch/h8300/include/uapi/asm/signal.h b/arch/h8300/include/uapi/asm/signal.h
> index e15521037348..7a2b783af22b 100644
> --- a/arch/h8300/include/uapi/asm/signal.h
> +++ b/arch/h8300/include/uapi/asm/signal.h
> @@ -50,6 +50,7 @@ typedef unsigned long sigset_t;
>  #define SIGLOST		29
>  */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  #define	SIGUNUSED	31
>  
> diff --git a/arch/ia64/include/uapi/asm/signal.h b/arch/ia64/include/uapi/asm/signal.h
> index aa98ff1b9e22..b4c98cb17165 100644
> --- a/arch/ia64/include/uapi/asm/signal.h
> +++ b/arch/ia64/include/uapi/asm/signal.h
> @@ -45,6 +45,7 @@
>  #define SIGLOST		29
>  */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  /* signal 31 is no longer "unused", but the SIGUNUSED macro remains for backwards compatibility */
>  #define	SIGUNUSED	31
> diff --git a/arch/m68k/include/uapi/asm/signal.h b/arch/m68k/include/uapi/asm/signal.h
> index 915cc755a184..a0b4e4108cb8 100644
> --- a/arch/m68k/include/uapi/asm/signal.h
> +++ b/arch/m68k/include/uapi/asm/signal.h
> @@ -50,6 +50,7 @@ typedef unsigned long sigset_t;
>  #define SIGLOST		29
>  */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  #define	SIGUNUSED	31
>  
> diff --git a/arch/mips/include/uapi/asm/signal.h b/arch/mips/include/uapi/asm/signal.h
> index 53104b10aae2..975a6f0d3b0b 100644
> --- a/arch/mips/include/uapi/asm/signal.h
> +++ b/arch/mips/include/uapi/asm/signal.h
> @@ -43,6 +43,7 @@ typedef unsigned long old_sigset_t;		/* at least 32 bits */
>  #define SIGCHLD		18	/* Child status has changed (POSIX).  */
>  #define SIGCLD		SIGCHLD /* Same as SIGCHLD (System V).	*/
>  #define SIGPWR		19	/* Power failure restart (System V).  */
> +#define SIGINFO		SIGPWR	/* Keyboard status request (4.2 BSD). */
>  #define SIGWINCH	20	/* Window size change (4.3 BSD, Sun).  */
>  #define SIGURG		21	/* Urgent condition on socket (4.2 BSD).  */
>  #define SIGIO		22	/* I/O now possible (4.2 BSD).	*/
> diff --git a/arch/parisc/include/uapi/asm/signal.h b/arch/parisc/include/uapi/asm/signal.h
> index d38563a394f2..fe2e00d590ac 100644
> --- a/arch/parisc/include/uapi/asm/signal.h
> +++ b/arch/parisc/include/uapi/asm/signal.h
> @@ -22,6 +22,7 @@
>  #define SIGUSR2		17
>  #define SIGCHLD		18
>  #define SIGPWR		19
> +#define SIGINFO		SIGPWR
>  #define SIGVTALRM	20
>  #define SIGPROF		21
>  #define SIGIO		22
> diff --git a/arch/powerpc/include/uapi/asm/signal.h b/arch/powerpc/include/uapi/asm/signal.h
> index 85b0a7aa43e7..e7f3885905b4 100644
> --- a/arch/powerpc/include/uapi/asm/signal.h
> +++ b/arch/powerpc/include/uapi/asm/signal.h
> @@ -53,6 +53,7 @@ typedef struct {
>  #define SIGLOST		29
>  */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  #define	SIGUNUSED	31
>  
> diff --git a/arch/s390/include/uapi/asm/signal.h b/arch/s390/include/uapi/asm/signal.h
> index 9a14a611ed82..12ee62987971 100644
> --- a/arch/s390/include/uapi/asm/signal.h
> +++ b/arch/s390/include/uapi/asm/signal.h
> @@ -58,6 +58,7 @@ typedef unsigned long sigset_t;
>  #define SIGLOST         29
>  */
>  #define SIGPWR          30
> +#define SIGINFO         SIGPWR
>  #define SIGSYS		31
>  #define SIGUNUSED       31
>  
> diff --git a/arch/sparc/include/uapi/asm/signal.h b/arch/sparc/include/uapi/asm/signal.h
> index ff9505923b9a..b655163198bb 100644
> --- a/arch/sparc/include/uapi/asm/signal.h
> +++ b/arch/sparc/include/uapi/asm/signal.h
> @@ -71,6 +71,8 @@
>  #define SIGWINCH	28
>  #define SIGLOST		29
>  #define SIGPWR		SIGLOST
> +/* XXX: is it OK for SIGINFO to collide with LOST? */
> +#define SIGINFO		SIGPWR
>  #define SIGUSR1		30
>  #define SIGUSR2		31
>  
> diff --git a/arch/x86/include/uapi/asm/signal.h b/arch/x86/include/uapi/asm/signal.h
> index e5745d593dc7..1539bb28826c 100644
> --- a/arch/x86/include/uapi/asm/signal.h
> +++ b/arch/x86/include/uapi/asm/signal.h
> @@ -55,6 +55,7 @@ typedef unsigned long sigset_t;
>  #define SIGLOST		29
>  */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  #define	SIGUNUSED	31
>  
> diff --git a/arch/xtensa/include/uapi/asm/signal.h b/arch/xtensa/include/uapi/asm/signal.h
> index 005dec5bfde4..d644234305de 100644
> --- a/arch/xtensa/include/uapi/asm/signal.h
> +++ b/arch/xtensa/include/uapi/asm/signal.h
> @@ -65,6 +65,7 @@ typedef struct {
>  #define SIGPOLL		SIGIO
>  /* #define SIGLOST		29 */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  #define	SIGUNUSED	31
>  
> diff --git a/include/uapi/asm-generic/signal.h b/include/uapi/asm-generic/signal.h
> index 5c716a952cbe..9f9a1db0d43c 100644
> --- a/include/uapi/asm-generic/signal.h
> +++ b/include/uapi/asm-generic/signal.h
> @@ -43,6 +43,7 @@
>  #define SIGLOST		29
>  */
>  #define SIGPWR		30
> +#define SIGINFO		SIGPWR
>  #define SIGSYS		31
>  #define	SIGUNUSED	31

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Yu-cheng Yu @ 2019-06-11 19:31 UTC (permalink / raw)
  To: Dave Martin, Florian Weimer
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel
In-Reply-To: <20190611114109.GN28398@e103592.cambridge.arm.com>

On Tue, 2019-06-11 at 12:41 +0100, Dave Martin wrote:
> On Mon, Jun 10, 2019 at 07:24:43PM +0200, Florian Weimer wrote:
> > * Yu-cheng Yu:
> > 
> > > To me, looking at PT_GNU_PROPERTY and not trying to support anything is a
> > > logical choice.  And it breaks only a limited set of toolchains.
> > > 
> > > I will simplify the parser and leave this patch as-is for anyone who wants
> > > to
> > > back-port.  Are there any objections or concerns?
> > 
> > Red Hat Enterprise Linux 8 does not use PT_GNU_PROPERTY and is probably
> > the largest collection of CET-enabled binaries that exists today.
> 
> For clarity, RHEL is actively parsing these properties today?
> 
> > My hope was that we would backport the upstream kernel patches for CET,
> > port the glibc dynamic loader to the new kernel interface, and be ready
> > to run with CET enabled in principle (except that porting userspace
> > libraries such as OpenSSL has not really started upstream, so many
> > processes where CET is particularly desirable will still run without
> > it).
> > 
> > I'm not sure if it is a good idea to port the legacy support if it's not
> > part of the mainline kernel because it comes awfully close to creating
> > our own private ABI.
> 
> I guess we can aim to factor things so that PT_NOTE scanning is
> available as a fallback on arches for which the absence of
> PT_GNU_PROPERTY is not authoritative.

We can probably check PT_GNU_PROPERTY first, and fallback (based on ld-linux
version?) to PT_NOTE scanning?

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 25/27] mm/mmap: Add Shadow stack pages to memory accounting
From: Yu-cheng Yu @ 2019-06-11 19:22 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <1cfc7396-ca90-1933-34ad-b3d43ae52e08@intel.com>

On Tue, 2019-06-11 at 10:55 -0700, Dave Hansen wrote:
> On 6/6/19 1:06 PM, Yu-cheng Yu wrote:
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1703,6 +1703,9 @@ static inline int accountable_mapping(struct file
> > *file, vm_flags_t vm_flags)
> >  	if (file && is_file_hugepages(file))
> >  		return 0;
> >  
> > +	if (arch_copy_pte_mapping(vm_flags))
> > +		return 1;
> > +
> >  	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) ==
> > VM_WRITE;
> >  }
> >  
> > @@ -3319,6 +3322,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t
> > flags, long npages)
> >  		mm->stack_vm += npages;
> >  	else if (is_data_mapping(flags))
> >  		mm->data_vm += npages;
> > +	else if (arch_copy_pte_mapping(flags))
> > +		mm->data_vm += npages;
> >  }
> 
> This classifies shadow stack as data instead of stack.  That seems a wee
> bit counterintuitive.  Why did you make this choice?

I don't recall the reason; I will change it to stack and test it out.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 25/27] mm/mmap: Add Shadow stack pages to memory accounting
From: Dave Hansen @ 2019-06-11 17:55 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <20190606200646.3951-26-yu-cheng.yu@intel.com>

On 6/6/19 1:06 PM, Yu-cheng Yu wrote:
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1703,6 +1703,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
>  	if (file && is_file_hugepages(file))
>  		return 0;
>  
> +	if (arch_copy_pte_mapping(vm_flags))
> +		return 1;
> +
>  	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
>  }
>  
> @@ -3319,6 +3322,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
>  		mm->stack_vm += npages;
>  	else if (is_data_mapping(flags))
>  		mm->data_vm += npages;
> +	else if (arch_copy_pte_mapping(flags))
> +		mm->data_vm += npages;
>  }

This classifies shadow stack as data instead of stack.  That seems a wee
bit counterintuitive.  Why did you make this choice?

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Stephen Smalley @ 2019-06-11 14:32 UTC (permalink / raw)
  To: Andy Lutomirski, Casey Schaufler
  Cc: Andy Lutomirski, David Howells, Al Viro, USB list, LSM List,
	Greg Kroah-Hartman, raven, Linux FS Devel, Linux API, linux-block,
	keyrings, LKML, Paul Moore
In-Reply-To: <97BA9EB5-4E62-4E3A-BD97-CEC34F16FCFF@amacapital.net>

On 6/10/19 8:13 PM, Andy Lutomirski wrote:
> 
> 
>> On Jun 10, 2019, at 2:25 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
>>> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>> I think you really need to give an example of a coherent policy that
>>>>>>> needs this.
>>>>>> I keep telling you, and you keep ignoring what I say.
>>>>>>
>>>>>>> As it stands, your analogy seems confusing.
>>>>>> It's pretty simple. I have given both the abstract
>>>>>> and examples.
>>>>> You gave the /dev/null example, which is inapplicable to this patchset.
>>>> That addressed an explicit objection, and pointed out
>>>> an exception to a generality you had asserted, which was
>>>> not true. It's also a red herring regarding the current
>>>> discussion.
>>> This argument is pointless.
>>>
>>> Please humor me and just give me an example.  If you think you have
>>> already done so, feel free to repeat yourself.  If you have no
>>> example, then please just say so.
>>
>> To repeat the /dev/null example:
>>
>> Process A and process B both open /dev/null.
>> A and B can write and read to their hearts content
>> to/from /dev/null without ever once communicating.
>> The mutual accessibility of /dev/null in no way implies that
>> A and B can communicate. If A can set a watch on /dev/null,
>> and B triggers an event, there still has to be an access
>> check on the delivery of the event because delivering an event
>> to A is not an action on /dev/null, but on A.
>>
> 
> At discussed, this is an irrelevant straw man. This patch series does not produce events when this happens. I’m looking for a relevant example, please.
>>
>>
>>>   An unprivileged
>>> user can create a new userns and a new mount ns, but then they're
>>> modifying a whole different mount tree.
>>
>> Within those namespaces you can still have multiple users,
>> constrained be system access control policy.
> 
> And the one doing the mounting will be constrained by MAC and DAC policy, as always.  The namespace creator is, from the perspective of those processes, admin.
> 
>>
>>>
>>>>>>> Similarly, if someone
>>>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>>>> if the sender is local, whether the sender can send to that socket.
>>>>>>> We do not check whether the sender can send to the receiver.
>>>>>> Bzzzt! Smack sure does.
>>>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>>>> Process A sends a packet to process B.
>>>> If A has access to TopSecret data and B is not
>>>> allowed to see TopSecret data, the delivery should
>>>> be prevented. Is that nonsensical?
>>> It makes sense.  As I see it, the way that a sensible policy should do
>>> this is by making sure that there are no sockets, pipes, etc that
>>> Process A can write and that Process B can read.
>>
>> You can't explain UDP controls without doing the access check
>> on packet delivery. The sendmsg() succeeds when the packet leaves
>> the sender. There doesn't even have to be a socket bound to the
>> port. The only opportunity you have for control is on packet
>> delivery, which is the only point at which you can have the
>> information required.
> 
> Huh?  You sendmsg() from an address to an address.  My point is that, for most purposes, that’s all the information that’s needed.
> 
>>
>>> If you really want to prevent a malicious process with TopSecret data
>>> from sending it to a different process, then you can't use Linux on
>>> x86 or ARM.  Maybe that will be fixed some day, but you're going to
>>> need to use an extremely tight sandbox to make this work.
>>
>> I won't be commenting on that.
> 
> Then why is preventing this is an absolute requirement? It’s unattainable.
> 
>>
>>>
>>>>>>> The signal example is inapplicable.
>>>>>>  From a modeling viewpoint the actions are identical.
>>>>> This seems incorrect to me
>>>> What would be correct then? Some convoluted combination
>>>> of system entities that aren't owned or controlled by
>>>> any mechanism?
>>>>
>>> POSIX signal restrictions aren't there to prevent two processes from
>>> communicating.  They're there to prevent the sender from manipulating
>>> or crashing the receiver without appropriate privilege.
>>
>> POSIX signal restrictions have a long history. In the P10031e/2c
>> debates both communication and manipulation where seriously
>> considered. I would say both are true.
>>
>>>>> and, I think, to most everyone else reading this.
>>>> That's quite the assertion. You may even be correct.
>>>>
>>>>> Can you explain?
>>>>>
>>>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>>>> YES!!!!!!!!!!!!
>>>>
>>>> And when a process triggers a notification it is the subject
>>>> and the watching process is the object!
>>>>
>>>> Subject == active entity
>>>> Object  == passive entity
>>>>
>>>> Triggering an event is, like calling kill(), an action!
>>>>
>>> And here is where I disagree with your interpretation.  Triggering an
>>> event is a side effect of writing to the file.  There are *two*
>>> security relevant actions, not one, and they are:
>>>
>>> First, the write:
>>>
>>> Subject == the writer
>>> Action == write
>>> Object == the file
>>>
>>> Then the event, which could be modeled in a couple of ways:
>>>
>>> Subject == the file
>>
>> Files   are   not   subjects. They are passive entities.
>>
>>> Action == notify
>>> Object == the recipient
> 
> Great. Then use the variant below.
> 
>>>
>>> or
>>>
>>> Subject == the recipient
>>> Action == watch
>>> Object == the file
>>>
>>> By conflating these two actions into one, you've made the modeling
>>> very hard, and you start running into all these nasty questions like
>>> "who actually closed this open file"
>>
>> No, I've made the code more difficult.
>> You can not call
>> the file a subject. That is just wrong. It's not a valid
>> model.
> 
> You’ve ignored the “Action == watch” variant. Do you care to comment?

While I agree with this model in general, I will note two caveats when 
trying to apply this to watches/notifications:

1) The object on which the notification was triggered and the object on 
which the watch was placed are not necessarily the same and access to 
one might not imply access to the other,

2) If notifications can be triggered by read-like operations (as in 
fanotify, for example), then a "read" can be turned into a "write" flow 
through a notification.

Whether or not these caveats are applicable to the notifications in this 
series I am not clear.

^ permalink raw reply

* Re: [PATCH 09/10] usb: Add USB subsystem notifications [ver #3]
From: Alan Stern @ 2019-06-11 13:53 UTC (permalink / raw)
  To: Felipe Balbi
  Cc: Mathias Nyman, Greg Kroah-Hartman, David Howells, viro, linux-usb,
	raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <875zpcfxfk.fsf@linux.intel.com>

On Tue, 11 Jun 2019, Felipe Balbi wrote:

> >> >> > So for "severe" issues, yes, we should do this, but perhaps not for all
> >> >> > of the "normal" things we see when a device is yanked out of the system
> >> >> > and the like.
> >> >> 
> >> >> Then what counts as a "severe" issue?  Anything besides enumeration 
> >> >> failure?
> >> >
> >> > Not that I can think of at the moment, other than the other recently
> >> > added KOBJ_CHANGE issue.  I'm sure we have other "hard failure" issues
> >> > in the USB stack that people will want exposed over time.
> >> 
> >> From an XHCI standpoint, Transaction Errors might be one thing. They
> >> happen rarely and are a strong indication that the bus itself is
> >> bad. Either bad cable, misbehaving PHYs, improper power management, etc.
> >
> > Don't you also get transaction errors if the user unplugs a device in 
> > the middle of a transfer?  That's not the sort of thing we want to sent 
> > notifications about.
> 
> Mathias, do we get Transaction Error if user removes cable during a
> transfer? I thought we would just get Port Status Change with CC bit
> cleared, no?

Even if xHCI doesn't give Transaction Errors when a cable is unplugged 
during a transfer, other host controllers do.  Sometimes quite a lot -- 
they continue to occur until the kernel polls the parent hub's 
interrupt ep and learns that the port is disconnected, which can take 
up to 250 ms.

Alan Stern

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-06-11 11:41 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <87lfy9cq04.fsf@oldenburg2.str.redhat.com>

On Mon, Jun 10, 2019 at 07:24:43PM +0200, Florian Weimer wrote:
> * Yu-cheng Yu:
> 
> > To me, looking at PT_GNU_PROPERTY and not trying to support anything is a
> > logical choice.  And it breaks only a limited set of toolchains.
> >
> > I will simplify the parser and leave this patch as-is for anyone who wants to
> > back-port.  Are there any objections or concerns?
> 
> Red Hat Enterprise Linux 8 does not use PT_GNU_PROPERTY and is probably
> the largest collection of CET-enabled binaries that exists today.

For clarity, RHEL is actively parsing these properties today?

> My hope was that we would backport the upstream kernel patches for CET,
> port the glibc dynamic loader to the new kernel interface, and be ready
> to run with CET enabled in principle (except that porting userspace
> libraries such as OpenSSL has not really started upstream, so many
> processes where CET is particularly desirable will still run without
> it).
> 
> I'm not sure if it is a good idea to port the legacy support if it's not
> part of the mainline kernel because it comes awfully close to creating
> our own private ABI.

I guess we can aim to factor things so that PT_NOTE scanning is
available as a fallback on arches for which the absence of
PT_GNU_PROPERTY is not authoritative.

Can we argue that the lack of PT_GNU_PROPERTY is an ABI bug, fix it
for new binaries and hence limit the efforts we go to to support
theoretical binaries that lack the phdrs entry?

If we can make practical simplifications to the parsing, such as
limiting the maximum PT_NOTE size that we will search for the program
properties to 1K (say), or requiring NT_NOTE_GNU_PROPERTY_TYPE_0 to sit
by itself in a single PT_NOTE then that could help minimse the exec
overheads and the number of places for bugs to hide in the kernel.

What we can do here depends on what the tools currently do and what
binaries are out there.

Cheers
---Dave

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Pavel Machek @ 2019-06-11 10:33 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Dave Hansen, Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook
In-Reply-To: <e1543e7beb0eb55d6febcd847ccab9b219e60338.camel@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

On Mon 2019-06-10 08:47:45, Yu-cheng Yu wrote:
> On Sat, 2019-06-08 at 22:52 +0200, Pavel Machek wrote:
> > Hi!
> > 
> > > > I've no idea what the kernel should do; since you failed to answer the
> > > > question what happens when you point this to garbage.
> > > > 
> > > > Does it then fault or what?
> > > 
> > > Yeah, I think you'll fault with a rather mysterious CR2 value since
> > > you'll go look at the instruction that faulted and not see any
> > > references to the CR2 value.
> > > 
> > > I think this new MSR probably needs to get included in oops output when
> > > CET is enabled.
> > > 
> > > Why don't we require that a VMA be in place for the entire bitmap?
> > > Don't we need a "get" prctl function too in case something like a JIT is
> > > running and needs to find the location of this bitmap to set bits itself?
> > > 
> > > Or, do we just go whole-hog and have the kernel manage the bitmap
> > > itself. Our interface here could be:
> > > 
> > > 	prctl(PR_MARK_CODE_AS_LEGACY, start, size);
> > > 
> > > and then have the kernel allocate and set the bitmap for those code
> > > locations.
> > 
> > For the record, that sounds like a better interface than userspace knowing
> > about the bitmap formats...
> > 									Pavel
> 
> Initially we implemented the bitmap that way.  To manage the bitmap, every time
> the application issues a syscall for a .so it loads, and the kernel does
> copy_from_user() & copy_to_user() (or similar things).  If a system has a few
> legacy .so files and every application does the same, it can take a long time to
> boot up.

Loading .so is already many syscalls, I'd not expect measurable
performance there. Are you sure?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Florian Weimer @ 2019-06-11  7:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Andy Lutomirski, Peter Zijlstra, x86, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, linux-kernel, linux-doc, linux-mm,
	linux-arch, linux-api, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H.J. Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz
In-Reply-To: <92e56b28-0cd4-e3f4-867b-639d9b98b86c@intel.com>

* Dave Hansen:

> My assumption has always been that these large, potentially sparse
> hardware tables *must* be mmap()'d with MAP_NORESERVE specified.  That
> should keep them from being problematic with respect to overcommit.

MAP_NORESERVE pages still count towards the commit limit.  The flag only
disables checks at allocation time, for this particular allocation.  (At
least this was the behavior the last time I looked into this, I
believe.)

Not sure if this makes a difference here.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 09/10] usb: Add USB subsystem notifications [ver #3]
From: Felipe Balbi @ 2019-06-11  6:28 UTC (permalink / raw)
  To: Alan Stern, Mathias Nyman
  Cc: Greg Kroah-Hartman, David Howells, viro, linux-usb, raven,
	linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <Pine.LNX.4.44L0.1906071000260.1612-100000@iolanthe.rowland.org>

[-- Attachment #1: Type: text/plain, Size: 2262 bytes --]


Hi,

Alan Stern <stern@rowland.harvard.edu> writes:
>> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
>> > On Thu, Jun 06, 2019 at 10:55:24AM -0400, Alan Stern wrote:
>> >> On Thu, 6 Jun 2019, Greg Kroah-Hartman wrote:
>> >> 
>> >> > On Thu, Jun 06, 2019 at 10:24:18AM -0400, Alan Stern wrote:
>> >> > > On Thu, 6 Jun 2019, David Howells wrote:
>> >> > > 
>> >> > > > Add a USB subsystem notification mechanism whereby notifications about
>> >> > > > hardware events such as device connection, disconnection, reset and I/O
>> >> > > > errors, can be reported to a monitoring process asynchronously.
>> >> > > 
>> >> > > USB I/O errors covers an awfully large and vague field.  Do we really
>> >> > > want to include them?  I'm doubtful.
>> >> > 
>> >> > See the other patch on the linux-usb list that wanted to start adding
>> >> > KOBJ_CHANGE notifications about USB "i/o errors".
>> >> 
>> >> That patch wanted to add notifications only for enumeration failures
>> >> (assuming you're talking about the patch from Eugeniu Rosca), not I/O
>> >> errors in general.
>> >
>> > Yes, sorry, I was thinking that as a "I/O failed in enumeration" :)
>> >
>> >> > So for "severe" issues, yes, we should do this, but perhaps not for all
>> >> > of the "normal" things we see when a device is yanked out of the system
>> >> > and the like.
>> >> 
>> >> Then what counts as a "severe" issue?  Anything besides enumeration 
>> >> failure?
>> >
>> > Not that I can think of at the moment, other than the other recently
>> > added KOBJ_CHANGE issue.  I'm sure we have other "hard failure" issues
>> > in the USB stack that people will want exposed over time.
>> 
>> From an XHCI standpoint, Transaction Errors might be one thing. They
>> happen rarely and are a strong indication that the bus itself is
>> bad. Either bad cable, misbehaving PHYs, improper power management, etc.
>
> Don't you also get transaction errors if the user unplugs a device in 
> the middle of a transfer?  That's not the sort of thing we want to sent 
> notifications about.

Mathias, do we get Transaction Error if user removes cable during a
transfer? I thought we would just get Port Status Change with CC bit
cleared, no?

-- 
balbi

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-11  0:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <a329c4fa-adb0-09a4-7a8c-465f82e0e6c7@intel.com>



> On Jun 10, 2019, at 5:08 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 6/10/19 4:54 PM, Andy Lutomirski wrote:
>> Another benefit of kernel management: we could plausibly auto-clear
>> the bits corresponding to munmapped regions. Is this worth it?
> 
> I did it for MPX.  I think I even went to the trouble of zapping the
> whole pages that got unused.
> 
> But, MPX tables took 80% of the address space, worst-case.  This takes
> 0.003% :)  The only case it would really matter would be a task was
> long-running, used legacy executables/JITs, and was mapping/unmapping
> text all over the address space.  That seems rather unlikely.

Every wasted page still costs 4K plus page table overhead.  The worst case is a JIT that doesn’t clean up and leaks legacy bitmap memory all over. We can blame the JIT, but the actual attribution could be complicated.

It also matters when you unmap one thing, map something else, and are sad when the legacy bits are still set.

Admittedly, it’s a bit hard to imagine the exploit that takes advantage of this.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Andy Lutomirski @ 2019-06-11  0:13 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, Stephen Smalley, David Howells, Al Viro,
	USB list, LSM List, Greg Kroah-Hartman, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LKML, Paul Moore
In-Reply-To: <25d88489-9850-f092-205e-0a4fc292f41b@schaufler-ca.com>



> On Jun 10, 2019, at 2:25 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
>> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>> I think you really need to give an example of a coherent policy that
>>>>>> needs this.
>>>>> I keep telling you, and you keep ignoring what I say.
>>>>> 
>>>>>> As it stands, your analogy seems confusing.
>>>>> It's pretty simple. I have given both the abstract
>>>>> and examples.
>>>> You gave the /dev/null example, which is inapplicable to this patchset.
>>> That addressed an explicit objection, and pointed out
>>> an exception to a generality you had asserted, which was
>>> not true. It's also a red herring regarding the current
>>> discussion.
>> This argument is pointless.
>> 
>> Please humor me and just give me an example.  If you think you have
>> already done so, feel free to repeat yourself.  If you have no
>> example, then please just say so.
> 
> To repeat the /dev/null example:
> 
> Process A and process B both open /dev/null.
> A and B can write and read to their hearts content
> to/from /dev/null without ever once communicating.
> The mutual accessibility of /dev/null in no way implies that
> A and B can communicate. If A can set a watch on /dev/null,
> and B triggers an event, there still has to be an access
> check on the delivery of the event because delivering an event
> to A is not an action on /dev/null, but on A.
> 

At discussed, this is an irrelevant straw man. This patch series does not produce events when this happens. I’m looking for a relevant example, please.
> 
> 
>>  An unprivileged
>> user can create a new userns and a new mount ns, but then they're
>> modifying a whole different mount tree.
> 
> Within those namespaces you can still have multiple users,
> constrained be system access control policy.

And the one doing the mounting will be constrained by MAC and DAC policy, as always.  The namespace creator is, from the perspective of those processes, admin.

> 
>> 
>>>>>> Similarly, if someone
>>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>>> if the sender is local, whether the sender can send to that socket.
>>>>>> We do not check whether the sender can send to the receiver.
>>>>> Bzzzt! Smack sure does.
>>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>>> Process A sends a packet to process B.
>>> If A has access to TopSecret data and B is not
>>> allowed to see TopSecret data, the delivery should
>>> be prevented. Is that nonsensical?
>> It makes sense.  As I see it, the way that a sensible policy should do
>> this is by making sure that there are no sockets, pipes, etc that
>> Process A can write and that Process B can read.
> 
> You can't explain UDP controls without doing the access check
> on packet delivery. The sendmsg() succeeds when the packet leaves
> the sender. There doesn't even have to be a socket bound to the
> port. The only opportunity you have for control is on packet
> delivery, which is the only point at which you can have the
> information required.

Huh?  You sendmsg() from an address to an address.  My point is that, for most purposes, that’s all the information that’s needed.

> 
>> If you really want to prevent a malicious process with TopSecret data
>> from sending it to a different process, then you can't use Linux on
>> x86 or ARM.  Maybe that will be fixed some day, but you're going to
>> need to use an extremely tight sandbox to make this work.
> 
> I won't be commenting on that.

Then why is preventing this is an absolute requirement? It’s unattainable.

> 
>> 
>>>>>> The signal example is inapplicable.
>>>>> From a modeling viewpoint the actions are identical.
>>>> This seems incorrect to me
>>> What would be correct then? Some convoluted combination
>>> of system entities that aren't owned or controlled by
>>> any mechanism?
>>> 
>> POSIX signal restrictions aren't there to prevent two processes from
>> communicating.  They're there to prevent the sender from manipulating
>> or crashing the receiver without appropriate privilege.
> 
> POSIX signal restrictions have a long history. In the P10031e/2c
> debates both communication and manipulation where seriously
> considered. I would say both are true.
> 
>>>> and, I think, to most everyone else reading this.
>>> That's quite the assertion. You may even be correct.
>>> 
>>>> Can you explain?
>>>> 
>>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>>> YES!!!!!!!!!!!!
>>> 
>>> And when a process triggers a notification it is the subject
>>> and the watching process is the object!
>>> 
>>> Subject == active entity
>>> Object  == passive entity
>>> 
>>> Triggering an event is, like calling kill(), an action!
>>> 
>> And here is where I disagree with your interpretation.  Triggering an
>> event is a side effect of writing to the file.  There are *two*
>> security relevant actions, not one, and they are:
>> 
>> First, the write:
>> 
>> Subject == the writer
>> Action == write
>> Object == the file
>> 
>> Then the event, which could be modeled in a couple of ways:
>> 
>> Subject == the file
> 
> Files   are   not   subjects. They are passive entities.
> 
>> Action == notify
>> Object == the recipient

Great. Then use the variant below.

>> 
>> or
>> 
>> Subject == the recipient
>> Action == watch
>> Object == the file
>> 
>> By conflating these two actions into one, you've made the modeling
>> very hard, and you start running into all these nasty questions like
>> "who actually closed this open file"
> 
> No, I've made the code more difficult.
> You can not call
> the file a subject. That is just wrong. It's not a valid
> model.

You’ve ignored the “Action == watch” variant. Do you care to comment?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox