Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v13 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
From: Jann Horn @ 2019-04-05 20:27 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, H . Peter Anvin,
	Andi Kleen, tim.c.chen, Dave Hansen, Arjan van de Ven, aubrey.li,
	kernel list, Linux API, Alexey Dobriyan, Andrew Morton
In-Reply-To: <20190224044400.34975-2-aubrey.li@linux.intel.com>

On Fri, Apr 5, 2019 at 10:02 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
> AVX-512 components use could cause core turbo frequency drop. So
> it's useful to expose AVX-512 usage elapsed time as a heuristic hint
> for the user space job scheduler to cluster the AVX-512 using tasks
> together.
>
> Tensorflow example:
> $ while [ 1 ]; do cat /proc/pid/status | grep AVX; sleep 1; done
> AVX512_elapsed_ms:      4
> AVX512_elapsed_ms:      8
> AVX512_elapsed_ms:      4
>
> This means that 4 milliseconds have elapsed since the AVX512 usage
> of tensorflow task was detected when the task was scheduled out.
>
> Or:
> $ cat /proc/pid/status | grep AVX512_elapsed_ms
> AVX512_elapsed_ms:      -1

(Very nitpicky, feel free to ignore: If you change the /proc/pid to
/proc/tid in the commit message, it becomes clearer that this status
is really per-task/thread, not per-process/threadgroup.)

[...]
> +
> +/*
> + * Report the amount of time elapsed in millisecond since last AVX512
> + * use in the task.
> + */
> +static void avx512_status(struct seq_file *m, struct task_struct *task)
> +{
> +       unsigned long timestamp = task->thread.fpu.avx512_timestamp;

This is theoretically a data race, right? Should this have a READ_ONCE() on it?

Is there something that zeroes out the avx512_timestamp on
fork()/clone(), or will every child inherit the avx512 timestamp? As
far as I can tell, the timestamp is inherited; I think it would be
nicer to zero it out at that point. Either way, It might be worth
documenting this decision.

^ permalink raw reply

* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Thomas Gleixner @ 2019-04-05 19:32 UTC (permalink / raw)
  To: Aubrey Li
  Cc: mingo, Peter Zijlstra, H. Peter Anvin, ak, tim.c.chen,
	dave.hansen, arjan, aubrey.li, LKML, Linux API, Alexey Dobriyan,
	Andrew Morton
In-Reply-To: <20190224044400.34975-1-aubrey.li@linux.intel.com>

On Sun, 24 Feb 2019, Aubrey Li wrote:

> The architecture specific information of the running processes could
> be useful to the userland. Add support to examine process architecture
> specific information externally.
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>

This really lacks

Cc: Linux API <linux-api@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>

Cc'ed now.

> ---
>  fs/proc/array.c         | 5 +++++
>  include/linux/proc_fs.h | 2 ++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 9d428d5a0ac8..ea7a981f289c 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>  }
>  
> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
> +{
> +}
> +
>  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>  			struct pid *pid, struct task_struct *task)
>  {
> @@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>  	task_cpus_allowed(m, task);
>  	cpuset_task_status_allowed(m, task);
>  	task_context_switch_counts(m, task);
> +	arch_proc_pid_status(m, task);
>  	return 0;
>  }
>  
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index d0e1f1522a78..1de9ba1b064f 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
>  						    int (*show)(struct seq_file *, void *),
>  						    proc_write_t write,
>  						    void *data);
> +/* Add support for architecture specific output in /proc/pid/status */
> +extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
>  
>  #else /* CONFIG_PROC_FS */
>  
> -- 
> 2.17.1
> 
> 

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Carlos O'Donell @ 2019-04-05 15:40 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Michael Ellerman, Mathieu Desnoyers, Paul Burton, Will Deacon,
	Boqun Feng, Heiko Carstens, Vasily Gorbik, Martin Schwidefsky,
	Russell King, Benjamin Herrenschmidt, Paul Mackerras, carlos,
	Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Dave
In-Reply-To: <87y34o4xt3.fsf@oldenburg2.str.redhat.com>

On 4/5/19 5:16 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>> It is valuable that it be a trap, particularly for constant pools because
>> it means that a jump into the constant pool will trap.
> 
> Sorry, I don't understand why this matters in this context.  Would you
> please elaborate?

Sorry, I wasn't very clear.

My point is only that any accidental jumps, either with off-by-one (like you
fixed in gcc/glibc's signal unwinding most recently), result in a process fault
rather than executing RSEQ_SIG as a valid instruction *and then* continuing
onwards to the handler.

A process fault is achieved either by a trap, or an invalid instruction, or
a privileged insn (like suggested for MIPS in this thread).

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Peter Zijlstra @ 2019-04-05 11:49 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Alexey Dobriyan, mingo, linux-kernel, linux-api
In-Reply-To: <87ftqw3e0l.fsf@oldenburg2.str.redhat.com>

On Fri, Apr 05, 2019 at 01:08:58PM +0200, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:
> >
> >> > True; but I suppose glibc already does lots of that anyway, right? It
> >> > does contain the right information.
> >> 
> >> If I recall correctly my last investigation,
> >> /sys/devices/system/cpu/possible does not reflect the size of the
> >> affinity mask, either.
> >
> > Strictly speaking correct; the bitmap can be longer than the highest
> > possible cpu number, however the remainder would be 0-padding and could
> > thus be stripped without issue.
> 
> Doesn't the kernel still enforce the larget bitmap in sched_getaffinity,
> even if the bits are always zero?

Oh crap, you're right. That's unfortunate I suppose.

^ permalink raw reply

* Re: [PATCH ghak90 V5 09/10] audit: add support for containerid to network namespaces
From: Neil Horman @ 2019-04-05 11:32 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Paul Moore, linux-api, containers, LKML, dhowells,
	Linux-Audit Mailing List, netfilter-devel, ebiederm, simo, netdev,
	linux-fsdevel, Eric Paris, Serge Hallyn
In-Reply-To: <20190404214006.jcgmjb4u6iobu42s@madcap2.tricolour.ca>

On Thu, Apr 04, 2019 at 05:40:06PM -0400, Richard Guy Briggs wrote:
> On 2019-04-02 07:31, Neil Horman wrote:
> > On Mon, Apr 01, 2019 at 10:50:03AM -0400, Paul Moore wrote:
> > > On Fri, Mar 15, 2019 at 2:35 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > Audit events could happen in a network namespace outside of a task
> > > > context due to packets received from the net that trigger an auditing
> > > > rule prior to being associated with a running task.  The network
> > > > namespace could be in use by multiple containers by association to the
> > > > tasks in that network namespace.  We still want a way to attribute
> > > > these events to any potential containers.  Keep a list per network
> > > > namespace to track these audit container identifiiers.
> > > >
> > > > Add/increment the audit container identifier on:
> > > > - initial setting of the audit container identifier via /proc
> > > > - clone/fork call that inherits an audit container identifier
> > > > - unshare call that inherits an audit container identifier
> > > > - setns call that inherits an audit container identifier
> > > > Delete/decrement the audit container identifier on:
> > > > - an inherited audit container identifier dropped when child set
> > > > - process exit
> > > > - unshare call that drops a net namespace
> > > > - setns call that drops a net namespace
> > > >
> > > > See: https://github.com/linux-audit/audit-kernel/issues/92
> > > > See: https://github.com/linux-audit/audit-testsuite/issues/64
> > > > See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > ---
> > > >  include/linux/audit.h | 19 ++++++++++++
> > > >  kernel/audit.c        | 86 +++++++++++++++++++++++++++++++++++++++++++++++++--
> > > >  kernel/nsproxy.c      |  4 +++
> > > >  3 files changed, 106 insertions(+), 3 deletions(-)
> > > 
> > > ...
> > > 
> > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > index cf448599ef34..7fa3194f5342 100644
> > > > --- a/kernel/audit.c
> > > > +++ b/kernel/audit.c
> > > > @@ -72,6 +72,7 @@
> > > >  #include <linux/freezer.h>
> > > >  #include <linux/pid_namespace.h>
> > > >  #include <net/netns/generic.h>
> > > > +#include <net/net_namespace.h>
> > > >
> > > >  #include "audit.h"
> > > >
> > > > @@ -99,9 +100,13 @@
> > > >  /**
> > > >   * struct audit_net - audit private network namespace data
> > > >   * @sk: communication socket
> > > > + * @contid_list: audit container identifier list
> > > > + * @contid_list_lock audit container identifier list lock
> > > >   */
> > > >  struct audit_net {
> > > >         struct sock *sk;
> > > > +       struct list_head contid_list;
> > > > +       spinlock_t contid_list_lock;
> > > >  };
> > > >
> > > >  /**
> > > > @@ -275,8 +280,11 @@ struct audit_task_info init_struct_audit = {
> > > >  void audit_free(struct task_struct *tsk)
> > > >  {
> > > >         struct audit_task_info *info = tsk->audit;
> > > > +       struct nsproxy *ns = tsk->nsproxy;
> > > >
> > > >         audit_free_syscall(tsk);
> > > > +       if (ns)
> > > > +               audit_netns_contid_del(ns->net_ns, audit_get_contid(tsk));
> > > >         /* Freeing the audit_task_info struct must be performed after
> > > >          * audit_log_exit() due to need for loginuid and sessionid.
> > > >          */
> > > > @@ -376,6 +384,73 @@ static struct sock *audit_get_sk(const struct net *net)
> > > >         return aunet->sk;
> > > >  }
> > > >
> > > > +void audit_netns_contid_add(struct net *net, u64 contid)
> > > > +{
> > > > +       struct audit_net *aunet = net_generic(net, audit_net_id);
> > > > +       struct list_head *contid_list = &aunet->contid_list;
> > > > +       struct audit_contid *cont;
> > > > +
> > > > +       if (!audit_contid_valid(contid))
> > > > +               return;
> > > > +       if (!aunet)
> > > > +               return;
> > > 
> > > We should move the contid_list assignment below this check, or decide
> > > that aunet is always going to valid (?) and get rid of this check
> > > completely.
> > > 
> > I'm not sure why that would be needed.  Finding the net_id list is an operation
> > of a map relating net namespaces to lists, not contids to lists.  We could do
> > it, sure, but since they're unrelated operations, I don't think we experience
> > any slowdowns from doing it this way.
> > 
> > > > +       spin_lock(&aunet->contid_list_lock);
> > > > +       if (!list_empty(contid_list))
> > > 
> > > We don't need the list_empty() check here do we?  I think we can just
> > > call list_for_each_entry_rcu(), yes?
> > > 
> > This is true, the list_empty check is redundant, and the for loop will fall
> > through if the list is empty.
> > 
> > > > +               list_for_each_entry_rcu(cont, contid_list, list)
> > > > +                       if (cont->id == contid) {
> > > > +                               refcount_inc(&cont->refcount);
> > > > +                               goto out;
> > > > +                       }
> > > > +       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> > > 
> > > If you had to guess, what do you think is going to be more common:
> > > bumping the refcount of an existing entry in the list, or adding a new
> > > entry?  I'm asking because I always get a little nervous when doing
> > > allocations while holding a spinlock.  Yes, you are doing it with
> > > GFP_ATOMIC, but it still seems like something to try and avoid if this
> > > is going to approach 50%.  However, if the new entry is rare then the
> > > extra work of always doing the allocation before taking the lock and
> > > then freeing it afterwards might be a bad tradeoff.
> > > 
> > I think this is another way of asking, will multiple processes exist in the same
> > network namespace?  That is to say, will we expect a process to create a net
> > namespace, and then have other processes enter that namespace (thereby
> > triggering multiple calls to audit_netns_contid_add with the same net pointer
> > and cont id).  Given that the kubernetes notion of a pod is almost by definition
> > multiple containers sharing a network namespace, I think the answer is that we
> > will be strongly biased towards the refcount_inc case, rather than the kmalloc
> > case.  I could be wrong, but I think this is likely already in the optimized
> > order.
> 
> I had a stab at doing a GFP_KERNEL alloc before the spinlock and releasing it
> after if it wasn't needed (in Patch#1 below).  I also went one step further and
> converted it to kmem_cache (in Patch#2 below).
> 
> > > My gut feeling says we might do about as many allocations as refcount
> > > bumps, but I could be thinking about this wrong.
> > > 
> > > Moving the allocation outside the spinlock might also open the door to
> > > doing this as GFP_KERNEL, which is a good thing, but I haven't looked
> > > at the callers to see if that is possible (it may not be).  That's an
> > > exercise left to the patch author (if he hasn't done that already).
> 
> Both appear to work, but after successfully running both the contid test and
> audit_testsuite, once I start to push that test system a bit harder I end up
> with a deadlock warning.
> 
> I am having trouble understanding why since it happens both without and with
> the kmem_cache options, so it must be another part of the code that is
> triggering it.  The task_lock is being held at this point in
> audit_set_contid().  I haven't tried changing this alloc over to a GFP_ATOMIC
> to see if that prevents it, just as a debugging check...
> At this point, I'm inclined to leave it as it was without these two patches
> since it works and there doesn't seem to be an obvious best way given the
> uncertainty of the potential workloads.
> 
I would agree.  The trace below says that, when you use GFP_KERNEL, you are
allowing the memory subsystem to block the memory allocating task on memory
reclaim operations (effectively to sleep until kswapd reclaims sufficient memory
to grant the allocation), but kswapd wants to lock the task that is doing the
allocation with task_lock, which the task calling kmalloc(...,GFP_KEREL) already
holds, ergo, deadlock.

In short, you can't hold a tasks lock while preforming a sleeping operation
involving memory allocation. You either need to use GFP_ATOMIC, or release the
tasks lock before allocating memory with GFP_KERNEL.  Given the possible race
conditions surrounding releasing and re-aquiring the lock, I'm inclined to say
just use GFP_ATOMIC

Neil

> ======================================================
> WARNING: possible circular locking dependency detected
> 5.1.0-rc1-ghak90-audit-containerID.v6.3+ #80 Not tainted
> ------------------------------------------------------
> kswapd0/41 is trying to acquire lock:
> 0000000006a8c88b (&(&p->alloc_lock)->rlock){+.+.}, at: create_task_io_context+0xe9/0x150
> 
> but task is already holding lock:
> 0000000010aadb74 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x40
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (fs_reclaim){+.+.}:
>        __fs_reclaim_acquire+0x2c/0x40
>        kmem_cache_alloc_trace+0x30/0x230
>        audit_netns_contid_add.part.12+0xcf/0x210
>        audit_set_contid+0x18f/0x480
>        proc_contid_write+0x74/0x110
>        vfs_write+0xad/0x1b0
>        ksys_write+0x55/0xc0
>        do_syscall_64+0x60/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> -> #0 (&(&p->alloc_lock)->rlock){+.+.}:
>        lock_acquire+0xd4/0x1c0
>        _raw_spin_lock+0x2e/0x40
>        create_task_io_context+0xe9/0x150
>        generic_make_request_checks+0x8b4/0x970
>        generic_make_request+0xbb/0x570
>        submit_bio+0x6e/0x130
>        __swap_writepage+0x281/0x420
>        shmem_writepage+0x1d3/0x350
>        pageout.isra.54+0x1e9/0x3f0
>        shrink_page_list+0xa9f/0xe60
>        shrink_inactive_list+0x263/0x6e0
>        shrink_node_memcg+0x376/0x7c0
>        shrink_node+0xdd/0x470
>        balance_pgdat+0x26c/0x570
>        kswapd+0x191/0x4f0
>        kthread+0xf8/0x130
>        ret_from_fork+0x3a/0x50
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(fs_reclaim);
>                                lock(&(&p->alloc_lock)->rlock);
>                                lock(fs_reclaim);
>   lock(&(&p->alloc_lock)->rlock);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by kswapd0/41:
>  #0: 0000000010aadb74 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x40
> 
> > > paul moore
> 
> Patch#1:
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -388,7 +388,7 @@ void audit_netns_contid_add(struct net *net, u64 contid)
>  {
>         struct audit_net *aunet;
>         struct list_head *contid_list;
> -       struct audit_contid *cont;
> +       struct audit_contid *cont, *newcont;
>  
>         if (!net)
>                 return;
> @@ -398,18 +398,19 @@ void audit_netns_contid_add(struct net *net, u64 contid)
>         if (!aunet)
>                 return;
>         contid_list = &aunet->contid_list;
> +       newcont = kmalloc(sizeof(struct audit_contid), GFP_KERNEL);
>         spin_lock(&aunet->contid_list_lock);
>         list_for_each_entry_rcu(cont, contid_list, list)
>                 if (cont->id == contid) {
>                         refcount_inc(&cont->refcount);
> +                       kfree(newcont);
>                         goto out;
>                 }
> -       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> -       if (cont) {
> -               INIT_LIST_HEAD(&cont->list);
> -               cont->id = contid;
> -               refcount_set(&cont->refcount, 1);
> -               list_add_rcu(&cont->list, contid_list);
> +       if (newcont) {
> +               INIT_LIST_HEAD(&newcont->list);
> +               newcont->id = contid;
> +               refcount_set(&newcont->refcount, 1);
> +               list_add_rcu(&newcont->list, contid_list);
>         }
>  out:
>         spin_unlock(&aunet->contid_list_lock);
> 
> 
> Patch#2:
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -222,12 +222,16 @@ struct audit_reply {
>  };
>  
>  static struct kmem_cache *audit_task_cache;
> +static struct kmem_cache *audit_contid_cache;
>  
>  void __init audit_task_init(void)
>  {
>         audit_task_cache = kmem_cache_create("audit_task",
>                                              sizeof(struct audit_task_info),
>                                              0, SLAB_PANIC, NULL);
> +       audit_contid_cache = kmem_cache_create("audit_contid",
> +                                            sizeof(struct audit_contid),
> +                                            0, SLAB_PANIC, NULL);
>  }
>  
>  /**
> @@ -388,7 +392,7 @@ void audit_netns_contid_add(struct net *net, u64 contid)
>  {
>         struct audit_net *aunet;
>         struct list_head *contid_list;
> -       struct audit_contid *cont;
> +       struct audit_contid *cont, *newcont;
>  
>         if (!net)
>                 return;
> @@ -398,23 +402,32 @@ void audit_netns_contid_add(struct net *net, u64 contid)
>         if (!aunet)
>                 return;
>         contid_list = &aunet->contid_list;
> +       newcont = kmem_cache_alloc(audit_contid_cache, GFP_KERNEL);
>         spin_lock(&aunet->contid_list_lock);
>         list_for_each_entry_rcu(cont, contid_list, list)
>                 if (cont->id == contid) {
>                         refcount_inc(&cont->refcount);
> +                       kmem_cache_free(audit_contid_cache, newcont);
>                         goto out;
>                 }
> -       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> -       if (cont) {
> -               INIT_LIST_HEAD(&cont->list);
> -               cont->id = contid;
> -               refcount_set(&cont->refcount, 1);
> -               list_add_rcu(&cont->list, contid_list);
> +       if (newcont) {
> +               INIT_LIST_HEAD(&newcont->list);
> +               newcont->id = contid;
> +               refcount_set(&newcont->refcount, 1);
> +               list_add_rcu(&newcont->list, contid_list);
>         }
>  out:
>         spin_unlock(&aunet->contid_list_lock);
>  }
>  
> +static void audit_contid_free(struct rcu_head *rcu)
> +{
> +       struct audit_contid *cont;
> +
> +       cont = container_of(rcu, struct audit_contid, rcu);
> +       kmem_cache_free(audit_contid_cache, cont);
> +}
> +
>  void audit_netns_contid_del(struct net *net, u64 contid)
>  {
>         struct audit_net *aunet;
> @@ -434,7 +447,7 @@ void audit_netns_contid_del(struct net *net, u64 contid)
>                 if (cont->id == contid) {
>                         if (refcount_dec_and_test(&cont->refcount)) {
>                                 list_del_rcu(&cont->list);
> -                               kfree_rcu(cont, rcu);
> +                               call_rcu(&cont->rcu, audit_contid_free);
>                         }
>                         break;
>                 }
> 
> - RGB
> 
> --
> Richard Guy Briggs <rgb@redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635
> 

^ permalink raw reply

* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Florian Weimer @ 2019-04-05 11:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Alexey Dobriyan, mingo, linux-kernel, linux-api
In-Reply-To: <20190405110415.GP12232@hirez.programming.kicks-ass.net>

* Peter Zijlstra:

> On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:
>
>> > True; but I suppose glibc already does lots of that anyway, right? It
>> > does contain the right information.
>> 
>> If I recall correctly my last investigation,
>> /sys/devices/system/cpu/possible does not reflect the size of the
>> affinity mask, either.
>
> Strictly speaking correct; the bitmap can be longer than the highest
> possible cpu number, however the remainder would be 0-padding and could
> thus be stripped without issue.

Doesn't the kernel still enforce the larget bitmap in sched_getaffinity,
even if the bits are always zero?

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Peter Zijlstra @ 2019-04-05 11:04 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Alexey Dobriyan, mingo, linux-kernel, linux-api
In-Reply-To: <87wok83gfs.fsf@oldenburg2.str.redhat.com>

On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:

> > True; but I suppose glibc already does lots of that anyway, right? It
> > does contain the right information.
> 
> If I recall correctly my last investigation,
> /sys/devices/system/cpu/possible does not reflect the size of the
> affinity mask, either.

Strictly speaking correct; the bitmap can be longer than the highest
possible cpu number, however the remainder would be 0-padding and could
thus be stripped without issue.

So a bitmap sized to fit the max possible cpu number, should for all
practical purposes suffice.

^ permalink raw reply

* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Florian Weimer @ 2019-04-05 10:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Alexey Dobriyan, mingo, linux-kernel, linux-api
In-Reply-To: <20190404084249.GS4038@hirez.programming.kicks-ass.net>

* Peter Zijlstra:

> On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
>> Currently there is no easy way to get the number of CPUs on the system.

The size of the affinity mask is only related to the number of CPUs in
the system in such a way that the number of CPUs cannot be larger than
the number of bits in the affinity mask.

>> Glibc in particular shipped with 1024 CPUs support maximum at some point
>> which is quite surprising as glibc maitainers should know better.

This dates back to a time when the kernel was never going to support
more than 1024 CPUs.

A lot of distribution kernels still enforce a hard limit, which papers
over firmware bugs which tell the kernel that the system can be
hot-plugged to a ridiculous number of sockets/CPUs.

>> Another group dynamically grow buffer until cpumask fits. This is
>> inefficient as multiple system calls are done.
>> 
>> Nobody seems to parse "/sys/devices/system/cpu/possible".
>> Even if someone does, parsing sysfs is much slower than necessary.
>
> True; but I suppose glibc already does lots of that anyway, right? It
> does contain the right information.

If I recall correctly my last investigation,
/sys/devices/system/cpu/possible does not reflect the size of the
affinity mask, either.

>> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
>> This will make gettting CPU mask require at most 2 system calls
>> and will eliminate unnecessary code.
>> 
>> len=0 is chosen so that
>> * passing zeroes is the simplest thing
>> 
>> 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
>> 
>>   will simply do the right thing,
>> 
>> * old kernels returned -EINVAL unconditionally.
>> 
>> Note: glibc segfaults upon exiting from system call because it tries to
>> clear the rest of the buffer if return value is positive, so
>> applications will have to use syscall(3).
>> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).

Given that old kernels fail with EINVAL, that evidence is fairly
restricted.

I'm not sure if it's a good idea to overload this interface.  I expect
that users will want to call sched_getaffinity (the system call wrapper)
with cpusetsize == 0 to query the value, so there will be pressure on
glibc to remove the memset.  At that point we have an API that obscurely
fails with old glibc versions, but suceeds with newer ones, which isn't
great.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Peter Zijlstra @ 2019-04-05  9:26 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: mingo, linux-kernel, linux-api
In-Reply-To: <20190404180230.GA14800@avx2>

On Thu, Apr 04, 2019 at 09:02:30PM +0300, Alexey Dobriyan wrote:
> On Thu, Apr 04, 2019 at 10:42:49AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> > > Currently there is no easy way to get the number of CPUs on the system.
> > 
> > And this patch doesn't change that :-)
> 
> It does! Application or a library could do one idempotent system call
> in a constructor.

You said: "get the number of CPUs on the system", but what this call
actually returns is: "size of CPU bitmask", these two things are
distinctly not the same.

> > The point is that nr_cpu_ids is the length of the bitmap, but does not
> > contain information on how many CPUs are in the system. Consider the
> > case where the bitmap is sparse.
> 
> I understand that but how do you ship number of CPUs _and_ possible mask
> in one go?

You don't, possible mask is unrelated to number of CPUs and both are
unrelated to bitmap size.

In particular:

  nr_cpus <= nr_possible_cpus <= nr_cpumask_bits

In absurdum an architecture could choose to iterate its CPUs as
1024*cpuid, just for giggles. Then suppose it has two sockets, and 16
CPUs per socket and only the second socket is populated.

The we get:

  nr_cpumask_bits = 32k
  nr_possible_cpus = 32
  nr_cpus = 16

see what I mean?

Now, luckily we typically don't have crap like that, but I suspect that
with a little creative ACPI table magic we could actually get something
slightly less absurd but still non-trivial even on x86.

Also see how num_possible_cpus() is defined as
cpumask_weight(cpu_possible_mask), which is very much not nr_cpu_ids in
the generic case.

> > > Nobody seems to parse "/sys/devices/system/cpu/possible".
> > > Even if someone does, parsing sysfs is much slower than necessary.
> > 
> > True; but I suppose glibc already does lots of that anyway, right? It
> > does contain the right information.
> 
> sched_getaffinity(3) does sched_getaffinity(2) + memset()
> 
> sysconf(_SC_NPROCESSORS_ONLN) does "/sys/devices/system/cpu/online"
> 
> sysconf(_SC_NPROCESSORS_CONF) does readdir("/sys/devices/system/cpu")
> which is 5 syscalls. I'm not sure which cpumask readdir represents.

It counts the number of CPUs, which is strictly not the same as the
bitmap size.

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Florian Weimer @ 2019-04-05  9:16 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Michael Ellerman, Mathieu Desnoyers, Paul Burton, Will Deacon,
	Boqun Feng, Heiko Carstens, Vasily Gorbik, Martin Schwidefsky,
	Russell King, Benjamin Herrenschmidt, Paul Mackerras, carlos,
	Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Dave
In-Reply-To: <ce6f9db3-bf85-7aec-4bae-998e6fd629e1@redhat.com>

* Carlos O'Donell:

> On 4/2/19 3:08 AM, Florian Weimer wrote:
>> * Michael Ellerman:
>>
>>> I'm a bit vague on what we're trying to do here.
>>>
>>> But it seems like you want some sort of "eye catcher" prior to the branch?
>>>
>>> That value is a valid instruction on current CPUs (rlwimi.
>>> r5,r24,6,1,9), and even if it wasn't it could become one in future.
>>>
>>> If you change it to 0x8053530 that is both a valid instruction and is a
>>> nop (conditional trap immediate but with no conditions set).
>>
>> I think we need something that is very unlikely to appear in the
>> instruction stream.  It's just a marker.  The instruction will never be
>> executed, and it does not have to be a trap, either (I believe that a
>> standard trap instruction would be a bad choice).
>
> I assume you want to avoid a standard trap instruction because it would
> be common, and so not meet the intent of the RSEQ_SIG choice as being something
> that is *uncommon* right?

Ideally, RSEQ_SIG would be something that does not show up in the
instruction stream at all, so that it is a reliable marker for the start
of an rseq handler.  I assume the intent here is that the kernel
provides some validation on the program counter it reads from the rseq
area, so that we do not end up with some easily-abused gadget in every
process image.

> It is valuable that it be a trap, particularly for constant pools because
> it means that a jump into the constant pool will trap.

Sorry, I don't understand why this matters in this context.  Would you
please elaborate?

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH ghak90 V5 09/10] audit: add support for containerid to network namespaces
From: Paul Moore @ 2019-04-05  2:06 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Neil Horman, linux-api, containers, LKML, dhowells,
	Linux-Audit Mailing List, netfilter-devel, ebiederm, simo, netdev,
	linux-fsdevel, Eric Paris, Serge Hallyn
In-Reply-To: <20190404214006.jcgmjb4u6iobu42s@madcap2.tricolour.ca>

On Thu, Apr 4, 2019 at 5:40 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-04-02 07:31, Neil Horman wrote:
> > On Mon, Apr 01, 2019 at 10:50:03AM -0400, Paul Moore wrote:
> > > On Fri, Mar 15, 2019 at 2:35 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > Audit events could happen in a network namespace outside of a task
> > > > context due to packets received from the net that trigger an auditing
> > > > rule prior to being associated with a running task.  The network
> > > > namespace could be in use by multiple containers by association to the
> > > > tasks in that network namespace.  We still want a way to attribute
> > > > these events to any potential containers.  Keep a list per network
> > > > namespace to track these audit container identifiiers.
> > > >
> > > > Add/increment the audit container identifier on:
> > > > - initial setting of the audit container identifier via /proc
> > > > - clone/fork call that inherits an audit container identifier
> > > > - unshare call that inherits an audit container identifier
> > > > - setns call that inherits an audit container identifier
> > > > Delete/decrement the audit container identifier on:
> > > > - an inherited audit container identifier dropped when child set
> > > > - process exit
> > > > - unshare call that drops a net namespace
> > > > - setns call that drops a net namespace
> > > >
> > > > See: https://github.com/linux-audit/audit-kernel/issues/92
> > > > See: https://github.com/linux-audit/audit-testsuite/issues/64
> > > > See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > ---
> > > >  include/linux/audit.h | 19 ++++++++++++
> > > >  kernel/audit.c        | 86 +++++++++++++++++++++++++++++++++++++++++++++++++--
> > > >  kernel/nsproxy.c      |  4 +++
> > > >  3 files changed, 106 insertions(+), 3 deletions(-)

...

> > > > +               list_for_each_entry_rcu(cont, contid_list, list)
> > > > +                       if (cont->id == contid) {
> > > > +                               refcount_inc(&cont->refcount);
> > > > +                               goto out;
> > > > +                       }
> > > > +       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> > >
> > > If you had to guess, what do you think is going to be more common:
> > > bumping the refcount of an existing entry in the list, or adding a new
> > > entry?  I'm asking because I always get a little nervous when doing
> > > allocations while holding a spinlock.  Yes, you are doing it with
> > > GFP_ATOMIC, but it still seems like something to try and avoid if this
> > > is going to approach 50%.  However, if the new entry is rare then the
> > > extra work of always doing the allocation before taking the lock and
> > > then freeing it afterwards might be a bad tradeoff.
> > >
> > I think this is another way of asking, will multiple processes exist in the same
> > network namespace?  That is to say, will we expect a process to create a net
> > namespace, and then have other processes enter that namespace (thereby
> > triggering multiple calls to audit_netns_contid_add with the same net pointer
> > and cont id).  Given that the kubernetes notion of a pod is almost by definition
> > multiple containers sharing a network namespace, I think the answer is that we
> > will be strongly biased towards the refcount_inc case, rather than the kmalloc
> > case.  I could be wrong, but I think this is likely already in the optimized
> > order.
>
> I had a stab at doing a GFP_KERNEL alloc before the spinlock and releasing it
> after if it wasn't needed (in Patch#1 below).  I also went one step further and
> converted it to kmem_cache (in Patch#2 below).
>
> > > My gut feeling says we might do about as many allocations as refcount
> > > bumps, but I could be thinking about this wrong.
> > >
> > > Moving the allocation outside the spinlock might also open the door to
> > > doing this as GFP_KERNEL, which is a good thing, but I haven't looked
> > > at the callers to see if that is possible (it may not be).  That's an
> > > exercise left to the patch author (if he hasn't done that already).
>
> Both appear to work, but after successfully running both the contid test and
> audit_testsuite, once I start to push that test system a bit harder I end up
> with a deadlock warning.
>
> I am having trouble understanding why since it happens both without and with
> the kmem_cache options, so it must be another part of the code that is
> triggering it.  The task_lock is being held at this point in
> audit_set_contid().  I haven't tried changing this alloc over to a GFP_ATOMIC
> to see if that prevents it, just as a debugging check...
> At this point, I'm inclined to leave it as it was without these two patches
> since it works and there doesn't seem to be an obvious best way given the
> uncertainty of the potential workloads.

I'm definitely not a mm expert, but I wonder if you might be able to
get this working using GFP_NOFS, but I see just now that they
recommend you *not* use GFP_NOFS; even if this did solve the problem,
it's probably not the right approach.

I'm okay with leaving it as-is with GFP_ATOMIC.  My original response
to this was more of a question about optimizing for a given use case,
having something that works is far more important.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Paul Burton @ 2019-04-04 21:41 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Mathieu Desnoyers, Will Deacon, Boqun Feng, Heiko Carstens,
	Vasily Gorbik, Martin Schwidefsky, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, carlos,
	Florian Weimer, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Dave
In-Reply-To: <602718e0-7375-deb7-b6e6-2d17022173c5@redhat.com>

Hi Carlos / all,

On Thu, Apr 04, 2019 at 04:50:08PM -0400, Carlos O'Donell wrote:
> > > > +/* Signature required before each abort handler code.  */
> > > > +#define RSEQ_SIG 0x53053053
> > > 
> > > Why isn't this a mips-specific op code?
> > 
> > MIPS also has a literal pool just before the abort handler, and it
> > jumps over it. My understanding is that we can use any signature value
> > we want, and it does not need to be a valid instruction, similarly to ARM:
> > 
> > #define __RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown, \
> >                                  abort_label, version, flags, \
> >                                  start_ip, post_commit_offset, abort_ip) \
> >                  ".balign 32\n\t" \
> >                  __rseq_str(table_label) ":\n\t" \
> >                  ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
> >                  LONG " " U32_U64_PAD(__rseq_str(start_ip)) "\n\t" \
> >                  LONG " " U32_U64_PAD(__rseq_str(post_commit_offset)) "\n\t" \
> >                  LONG " " U32_U64_PAD(__rseq_str(abort_ip)) "\n\t" \
> >                  ".word " __rseq_str(RSEQ_SIG) "\n\t" \
> >                  __rseq_str(label) ":\n\t" \
> >                  teardown \
> >                  "b %l[" __rseq_str(abort_label) "]\n\t"
> > 
> > Perhaps Paul Burton can confirm this ?
> 
> Yes please.
> 
> You also want to avoid the value being a valid MIPS insn that's common.
> 
> Did you check that?

This does not decode as a standard MIPS instruction, though it does
decode for both the microMIPS (ori) & nanoMIPS (lwxs; sll) ISAs.

I imagine I copied the value from another architecture when porting, and
since it doesn't get executed it seemed fine.

One maybe nicer option along the same lines would be 0x72736571 or
0x71657372 (ASCII 'rseq') neither of which decode as a MIPS instruction.

> I think the order of preference is:
> 
> 1.  An uncommon insn (with random immediate values), in a literal pool, that is
>     not a useful ROP/JOP sequence (very uncommon)

For that option on MIPS we could do something like:

  sll $0, $0, 31     # effectively a nop, but looks weird

> 2a. A uncommon TRAP hopefully with some immediate data encoded (maybe uncommon)

Our break instruction has a 19b immediate in nanoMIPS (20b for microMIPS
& classic MIPS) so that could be something like:

  break 0x7273       # ASCII 'rs'

That's pretty unlikely to be seen in normal code, or the teq instruction
has a rarely used code field (4b in microMIPS, 5b in nanoMIPS, 10b in
classic MIPS) that's meaningless to hardware so something like this
would be possible:

  teq $0, $0, 0x8    # ASCII backspace

> 2b. A NOP to avoid affecting speculative execution (maybe uncommon)
> 
> With 2a/2b being roughly equivalent depending on speculative execution policy.

There are a bunch of potential odd looking nops possible, one of which
would be the sll I mentioned above.

Another option would be to use a priveleged instruction which userland
code can't execute & should normally never contain. That would decode as
a valid instruction & effectively behave like a trap instruction but
look very odd to anyone reading disassembled code. eg:

  mfc0 $0, 13        # Try to read the cause register; take SIGILL

In order to handle MIPS vs microMIPS vs nanoMIPS differences I'm
thinking it may be best to switch to one of these real instructions that
looks strange. The ugly part would be the nest of #ifdef's to deal with
endianness & ISA when defining it as a number...

Thanks,
    Paul

^ permalink raw reply

* Re: [PATCH ghak90 V5 09/10] audit: add support for containerid to network namespaces
From: Richard Guy Briggs @ 2019-04-04 21:40 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-api, containers, LKML, dhowells, Linux-Audit Mailing List,
	netfilter-devel, ebiederm, simo, netdev, linux-fsdevel,
	Eric Paris, Serge Hallyn
In-Reply-To: <20190402113150.GA17593@hmswarspite.think-freely.org>

On 2019-04-02 07:31, Neil Horman wrote:
> On Mon, Apr 01, 2019 at 10:50:03AM -0400, Paul Moore wrote:
> > On Fri, Mar 15, 2019 at 2:35 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > Audit events could happen in a network namespace outside of a task
> > > context due to packets received from the net that trigger an auditing
> > > rule prior to being associated with a running task.  The network
> > > namespace could be in use by multiple containers by association to the
> > > tasks in that network namespace.  We still want a way to attribute
> > > these events to any potential containers.  Keep a list per network
> > > namespace to track these audit container identifiiers.
> > >
> > > Add/increment the audit container identifier on:
> > > - initial setting of the audit container identifier via /proc
> > > - clone/fork call that inherits an audit container identifier
> > > - unshare call that inherits an audit container identifier
> > > - setns call that inherits an audit container identifier
> > > Delete/decrement the audit container identifier on:
> > > - an inherited audit container identifier dropped when child set
> > > - process exit
> > > - unshare call that drops a net namespace
> > > - setns call that drops a net namespace
> > >
> > > See: https://github.com/linux-audit/audit-kernel/issues/92
> > > See: https://github.com/linux-audit/audit-testsuite/issues/64
> > > See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > ---
> > >  include/linux/audit.h | 19 ++++++++++++
> > >  kernel/audit.c        | 86 +++++++++++++++++++++++++++++++++++++++++++++++++--
> > >  kernel/nsproxy.c      |  4 +++
> > >  3 files changed, 106 insertions(+), 3 deletions(-)
> > 
> > ...
> > 
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index cf448599ef34..7fa3194f5342 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -72,6 +72,7 @@
> > >  #include <linux/freezer.h>
> > >  #include <linux/pid_namespace.h>
> > >  #include <net/netns/generic.h>
> > > +#include <net/net_namespace.h>
> > >
> > >  #include "audit.h"
> > >
> > > @@ -99,9 +100,13 @@
> > >  /**
> > >   * struct audit_net - audit private network namespace data
> > >   * @sk: communication socket
> > > + * @contid_list: audit container identifier list
> > > + * @contid_list_lock audit container identifier list lock
> > >   */
> > >  struct audit_net {
> > >         struct sock *sk;
> > > +       struct list_head contid_list;
> > > +       spinlock_t contid_list_lock;
> > >  };
> > >
> > >  /**
> > > @@ -275,8 +280,11 @@ struct audit_task_info init_struct_audit = {
> > >  void audit_free(struct task_struct *tsk)
> > >  {
> > >         struct audit_task_info *info = tsk->audit;
> > > +       struct nsproxy *ns = tsk->nsproxy;
> > >
> > >         audit_free_syscall(tsk);
> > > +       if (ns)
> > > +               audit_netns_contid_del(ns->net_ns, audit_get_contid(tsk));
> > >         /* Freeing the audit_task_info struct must be performed after
> > >          * audit_log_exit() due to need for loginuid and sessionid.
> > >          */
> > > @@ -376,6 +384,73 @@ static struct sock *audit_get_sk(const struct net *net)
> > >         return aunet->sk;
> > >  }
> > >
> > > +void audit_netns_contid_add(struct net *net, u64 contid)
> > > +{
> > > +       struct audit_net *aunet = net_generic(net, audit_net_id);
> > > +       struct list_head *contid_list = &aunet->contid_list;
> > > +       struct audit_contid *cont;
> > > +
> > > +       if (!audit_contid_valid(contid))
> > > +               return;
> > > +       if (!aunet)
> > > +               return;
> > 
> > We should move the contid_list assignment below this check, or decide
> > that aunet is always going to valid (?) and get rid of this check
> > completely.
> > 
> I'm not sure why that would be needed.  Finding the net_id list is an operation
> of a map relating net namespaces to lists, not contids to lists.  We could do
> it, sure, but since they're unrelated operations, I don't think we experience
> any slowdowns from doing it this way.
> 
> > > +       spin_lock(&aunet->contid_list_lock);
> > > +       if (!list_empty(contid_list))
> > 
> > We don't need the list_empty() check here do we?  I think we can just
> > call list_for_each_entry_rcu(), yes?
> > 
> This is true, the list_empty check is redundant, and the for loop will fall
> through if the list is empty.
> 
> > > +               list_for_each_entry_rcu(cont, contid_list, list)
> > > +                       if (cont->id == contid) {
> > > +                               refcount_inc(&cont->refcount);
> > > +                               goto out;
> > > +                       }
> > > +       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> > 
> > If you had to guess, what do you think is going to be more common:
> > bumping the refcount of an existing entry in the list, or adding a new
> > entry?  I'm asking because I always get a little nervous when doing
> > allocations while holding a spinlock.  Yes, you are doing it with
> > GFP_ATOMIC, but it still seems like something to try and avoid if this
> > is going to approach 50%.  However, if the new entry is rare then the
> > extra work of always doing the allocation before taking the lock and
> > then freeing it afterwards might be a bad tradeoff.
> > 
> I think this is another way of asking, will multiple processes exist in the same
> network namespace?  That is to say, will we expect a process to create a net
> namespace, and then have other processes enter that namespace (thereby
> triggering multiple calls to audit_netns_contid_add with the same net pointer
> and cont id).  Given that the kubernetes notion of a pod is almost by definition
> multiple containers sharing a network namespace, I think the answer is that we
> will be strongly biased towards the refcount_inc case, rather than the kmalloc
> case.  I could be wrong, but I think this is likely already in the optimized
> order.

I had a stab at doing a GFP_KERNEL alloc before the spinlock and releasing it
after if it wasn't needed (in Patch#1 below).  I also went one step further and
converted it to kmem_cache (in Patch#2 below).

> > My gut feeling says we might do about as many allocations as refcount
> > bumps, but I could be thinking about this wrong.
> > 
> > Moving the allocation outside the spinlock might also open the door to
> > doing this as GFP_KERNEL, which is a good thing, but I haven't looked
> > at the callers to see if that is possible (it may not be).  That's an
> > exercise left to the patch author (if he hasn't done that already).

Both appear to work, but after successfully running both the contid test and
audit_testsuite, once I start to push that test system a bit harder I end up
with a deadlock warning.

I am having trouble understanding why since it happens both without and with
the kmem_cache options, so it must be another part of the code that is
triggering it.  The task_lock is being held at this point in
audit_set_contid().  I haven't tried changing this alloc over to a GFP_ATOMIC
to see if that prevents it, just as a debugging check...
At this point, I'm inclined to leave it as it was without these two patches
since it works and there doesn't seem to be an obvious best way given the
uncertainty of the potential workloads.

======================================================
WARNING: possible circular locking dependency detected
5.1.0-rc1-ghak90-audit-containerID.v6.3+ #80 Not tainted
------------------------------------------------------
kswapd0/41 is trying to acquire lock:
0000000006a8c88b (&(&p->alloc_lock)->rlock){+.+.}, at: create_task_io_context+0xe9/0x150

but task is already holding lock:
0000000010aadb74 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x40

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
       __fs_reclaim_acquire+0x2c/0x40
       kmem_cache_alloc_trace+0x30/0x230
       audit_netns_contid_add.part.12+0xcf/0x210
       audit_set_contid+0x18f/0x480
       proc_contid_write+0x74/0x110
       vfs_write+0xad/0x1b0
       ksys_write+0x55/0xc0
       do_syscall_64+0x60/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (&(&p->alloc_lock)->rlock){+.+.}:
       lock_acquire+0xd4/0x1c0
       _raw_spin_lock+0x2e/0x40
       create_task_io_context+0xe9/0x150
       generic_make_request_checks+0x8b4/0x970
       generic_make_request+0xbb/0x570
       submit_bio+0x6e/0x130
       __swap_writepage+0x281/0x420
       shmem_writepage+0x1d3/0x350
       pageout.isra.54+0x1e9/0x3f0
       shrink_page_list+0xa9f/0xe60
       shrink_inactive_list+0x263/0x6e0
       shrink_node_memcg+0x376/0x7c0
       shrink_node+0xdd/0x470
       balance_pgdat+0x26c/0x570
       kswapd+0x191/0x4f0
       kthread+0xf8/0x130
       ret_from_fork+0x3a/0x50

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&(&p->alloc_lock)->rlock);
                               lock(fs_reclaim);
  lock(&(&p->alloc_lock)->rlock);

 *** DEADLOCK ***

1 lock held by kswapd0/41:
 #0: 0000000010aadb74 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x40

> > paul moore

Patch#1:
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -388,7 +388,7 @@ void audit_netns_contid_add(struct net *net, u64 contid)
 {
        struct audit_net *aunet;
        struct list_head *contid_list;
-       struct audit_contid *cont;
+       struct audit_contid *cont, *newcont;
 
        if (!net)
                return;
@@ -398,18 +398,19 @@ void audit_netns_contid_add(struct net *net, u64 contid)
        if (!aunet)
                return;
        contid_list = &aunet->contid_list;
+       newcont = kmalloc(sizeof(struct audit_contid), GFP_KERNEL);
        spin_lock(&aunet->contid_list_lock);
        list_for_each_entry_rcu(cont, contid_list, list)
                if (cont->id == contid) {
                        refcount_inc(&cont->refcount);
+                       kfree(newcont);
                        goto out;
                }
-       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
-       if (cont) {
-               INIT_LIST_HEAD(&cont->list);
-               cont->id = contid;
-               refcount_set(&cont->refcount, 1);
-               list_add_rcu(&cont->list, contid_list);
+       if (newcont) {
+               INIT_LIST_HEAD(&newcont->list);
+               newcont->id = contid;
+               refcount_set(&newcont->refcount, 1);
+               list_add_rcu(&newcont->list, contid_list);
        }
 out:
        spin_unlock(&aunet->contid_list_lock);


Patch#2:
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -222,12 +222,16 @@ struct audit_reply {
 };
 
 static struct kmem_cache *audit_task_cache;
+static struct kmem_cache *audit_contid_cache;
 
 void __init audit_task_init(void)
 {
        audit_task_cache = kmem_cache_create("audit_task",
                                             sizeof(struct audit_task_info),
                                             0, SLAB_PANIC, NULL);
+       audit_contid_cache = kmem_cache_create("audit_contid",
+                                            sizeof(struct audit_contid),
+                                            0, SLAB_PANIC, NULL);
 }
 
 /**
@@ -388,7 +392,7 @@ void audit_netns_contid_add(struct net *net, u64 contid)
 {
        struct audit_net *aunet;
        struct list_head *contid_list;
-       struct audit_contid *cont;
+       struct audit_contid *cont, *newcont;
 
        if (!net)
                return;
@@ -398,23 +402,32 @@ void audit_netns_contid_add(struct net *net, u64 contid)
        if (!aunet)
                return;
        contid_list = &aunet->contid_list;
+       newcont = kmem_cache_alloc(audit_contid_cache, GFP_KERNEL);
        spin_lock(&aunet->contid_list_lock);
        list_for_each_entry_rcu(cont, contid_list, list)
                if (cont->id == contid) {
                        refcount_inc(&cont->refcount);
+                       kmem_cache_free(audit_contid_cache, newcont);
                        goto out;
                }
-       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
-       if (cont) {
-               INIT_LIST_HEAD(&cont->list);
-               cont->id = contid;
-               refcount_set(&cont->refcount, 1);
-               list_add_rcu(&cont->list, contid_list);
+       if (newcont) {
+               INIT_LIST_HEAD(&newcont->list);
+               newcont->id = contid;
+               refcount_set(&newcont->refcount, 1);
+               list_add_rcu(&newcont->list, contid_list);
        }
 out:
        spin_unlock(&aunet->contid_list_lock);
 }
 
+static void audit_contid_free(struct rcu_head *rcu)
+{
+       struct audit_contid *cont;
+
+       cont = container_of(rcu, struct audit_contid, rcu);
+       kmem_cache_free(audit_contid_cache, cont);
+}
+
 void audit_netns_contid_del(struct net *net, u64 contid)
 {
        struct audit_net *aunet;
@@ -434,7 +447,7 @@ void audit_netns_contid_del(struct net *net, u64 contid)
                if (cont->id == contid) {
                        if (refcount_dec_and_test(&cont->refcount)) {
                                list_del_rcu(&cont->list);
-                               kfree_rcu(cont, rcu);
+                               call_rcu(&cont->rcu, audit_contid_free);
                        }
                        break;
                }

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Carlos O'Donell @ 2019-04-04 20:50 UTC (permalink / raw)
  To: Mathieu Desnoyers, Paul Burton, Will Deacon, Boqun Feng,
	Heiko Carstens, Vasily Gorbik, Martin Schwidefsky, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: carlos, Florian Weimer, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Dave Watson, Paul Turner, Rich Felker, linux-kernel, linux-api
In-Reply-To: <1965431879.7576.1553529272844.JavaMail.zimbra@efficios.com>

On 3/25/19 11:54 AM, Mathieu Desnoyers wrote:
> Hi Carlos,
> 
> ----- On Mar 22, 2019, at 4:09 PM, Carlos O'Donell codonell@redhat.com wrote:
> 
> [...]
> 
> I took care of all your comments for an upcoming round of patches, except the
> following that remain open (see answer inline). I'm adding Linux maintainers
> for ARM, aarch64, MIPS, powerpc, s390, x86 in CC to discuss the choice of
> code signature prior to the abort handler for each of those architectures.

Thank you for kicking off this conversation.

Every architecture should have a reasonable RSEQ_SIG that applies to their
ISA with a comment about why that instruction was chosen. It should be a
conscious choice, without a default.

> * Support for automatically registering threads with the Linux rseq(2)
>    system call has been added.  This system call is implemented starting
>    from Linux 4.18.  The Restartable Sequences ABI accelerates user-space
>    operations on per-cpu data.  It allows user-space to perform updates
>    on per-cpu data without requiring heavy-weight atomic operations. See
>    https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
>    for further explanation.
> 
>    In order to be activated, it requires that glibc is built against
>    kernel headers that include this system call, and that glibc detects
>    availability of that system call at runtime.

Suggest:

* Support for automatically registering threads with the Linux rseq(2)
   system call has been added.  This system call is implemented starting
   from Linux 4.18.  The Restartable Sequences ABI accelerates user-space
   operations on per-cpu data.  It allows user-space to perform updates
   on per-cpu data without requiring heavy-weight atomic operations.
   Automatically registering threads allows all libraries, including libc,
   to make immediate use of the rseq(2) support by using the documented ABI.
   See 'man 2 rseq' for the details of the ABI shared between libc and the
   kernel.

> 
> For reference the assembly code I'm pointing at below can be found
> in the Linux selftests under:
> 
> tools/testing/selftests/rseq/rseq-*.h

OK.


>>> +++ b/sysdeps/unix/sysv/linux/arm/bits/rseq.h
> [...]
>>> +
>>> +/* Signature required before each abort handler code.  */
>>> +#define RSEQ_SIG 0x53053053
>>
>> Why isn't this an arm specific op code? Does the user have to mark this
>> up as part of a constant pool when placing it in front of the abort handler
>> to avoid disassembling the constant as code? This was at one point required
>> to get gdb to work properly.
>>
> 
> For arm, the abort is defined as:
> 
> #define __RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown,           \
>                                  abort_label, version, flags,            \
>                                  start_ip, post_commit_offset, abort_ip) \
>                  ".balign 32\n\t"                                        \
>                  __rseq_str(table_label) ":\n\t"                         \
>                  ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
>                  ".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
>                  ".word " __rseq_str(RSEQ_SIG) "\n\t"                    \
>                  __rseq_str(label) ":\n\t"                               \
>                  teardown                                                \
>                  "b %l[" __rseq_str(abort_label) "]\n\t"
> 
> Which contains a copy of the struct rseq_cs for that critical section
> close to the actual critical section, within the code, followed by the
> signature. The reason why we have a copy of the struct rseq_cs there is
> to speed up entry into the critical section by using the "adr" instruction
> to compute the address to store into __rseq_cs->rseq_cs.
> 
> AFAIU, a literal pool on ARM is defined as something which is always
> jumped over (never executed), which is the case here. We always have
> an unconditional branch instruction ("b") skipping over each
> RSEQ_ASM_DEFINE_ABORT().
> 
> Therefore, given that the signature is part of a literal pool on ARM,
> it can be any value we choose and should not need to be an actual valid
> instruction.
> 
> aarch64 defines the abort as:
> 
> #define RSEQ_ASM_DEFINE_ABORT(label, abort_label)                               \
>          "       b       222f\n"                                                 \
>          "       .inst   "       __rseq_str(RSEQ_SIG) "\n"                       \
>          __rseq_str(label) ":\n"                                                 \
>          "       b       %l[" __rseq_str(abort_label) "]\n"                      \
>          "222:\n"
> 
> Where the signature actually maps to a valid instruction. Considering that
> aarch64 also have literal pools, and we branch over the signature, I wonder
> why it's so important to ensure the signature is a valid trap instruction.
> Perhaps Will Deacon can enlighten us ?

In the event that you accidentally jump to it then you trap?

However, you want an *uncommon* trap insn.

I think the order of preference is:

1.  An uncommon insn (with random immediate values), in a literal pool, that is
     not a useful ROP/JOP sequence (very uncommon)
2a. A uncommon TRAP hopefully with some immediate data encoded (maybe uncommon)
2b. A NOP to avoid affecting speculative execution (maybe uncommon)

With 2a/2b being roughly equivalent depending on speculative execution policy.

>>> +/* Signature required before each abort handler code.  */
>>> +#define RSEQ_SIG 0x53053053
>>
>> Why isn't this a mips-specific op code?
> 
> MIPS also has a literal pool just before the abort handler, and it
> jumps over it. My understanding is that we can use any signature value
> we want, and it does not need to be a valid instruction, similarly to ARM:
> 
> #define __RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown, \
>                                  abort_label, version, flags, \
>                                  start_ip, post_commit_offset, abort_ip) \
>                  ".balign 32\n\t" \
>                  __rseq_str(table_label) ":\n\t" \
>                  ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
>                  LONG " " U32_U64_PAD(__rseq_str(start_ip)) "\n\t" \
>                  LONG " " U32_U64_PAD(__rseq_str(post_commit_offset)) "\n\t" \
>                  LONG " " U32_U64_PAD(__rseq_str(abort_ip)) "\n\t" \
>                  ".word " __rseq_str(RSEQ_SIG) "\n\t" \
>                  __rseq_str(label) ":\n\t" \
>                  teardown \
>                  "b %l[" __rseq_str(abort_label) "]\n\t"
> 
> Perhaps Paul Burton can confirm this ?

Yes please.

You also want to avoid the value being a valid MIPS insn that's common.

Did you check that?

> [...]
>>> +++ b/sysdeps/unix/sysv/linux/powerpc/bits/rseq.h
> [...]
>>> +/* Signature required before each abort handler code.  */
>>> +#define RSEQ_SIG 0x53053053
>>
>> Why isn't this an opcode specific to power?
> 
> On powerpc 32/64, the abort is placed in a __rseq_failure executable section:
> 
> #define RSEQ_ASM_DEFINE_ABORT(label, abort_label)                               \
>                  ".pushsection __rseq_failure, \"ax\"\n\t"                       \
>                  ".long " __rseq_str(RSEQ_SIG) "\n\t"                            \
>                  __rseq_str(label) ":\n\t"                                       \
>                  "b %l[" __rseq_str(abort_label) "]\n\t"                         \
>                  ".popsection\n\t"
> 
> That section only contains snippets of those trampolines. Arguably, it would be
> good if disassemblers could find valid instructions there. Boqun Feng could perhaps
> shed some light on this signature choice ? Now would be a good time to decide
> once and for all whether a valid instruction would be a better choice.

This seems questionable too.

> [...]
>>> +++ b/sysdeps/unix/sysv/linux/s390/bits/rseq.h
> [...]
>>> +
>>> +/* Signature required before each abort handler code.  */
>>> +#define RSEQ_SIG 0x53053053
>>
>> Why not a s390 specific value here?
> 
> s390 also has the abort handler in a __rseq_failure section:
> 
> #define RSEQ_ASM_DEFINE_ABORT(label, teardown, abort_label)             \
>                  ".pushsection __rseq_failure, \"ax\"\n\t"               \
>                  ".long " __rseq_str(RSEQ_SIG) "\n\t"                    \
>                  __rseq_str(label) ":\n\t"                               \
>                  teardown                                                \
>                  "j %l[" __rseq_str(abort_label) "]\n\t"                 \
>                  ".popsection\n\t"
> 
> Same question applies as powerpc: since disassemblers will try to decode
> that instruction, would it be better to define it as a valid one ?

Yes, I think it needs to be a valid uncommon insn or nop.

> [...]
>>> +++ b/sysdeps/unix/sysv/linux/x86/bits/rseq.h
> [...]
>>> +/* Signature required before each abort handler code.  */
>>> +#define RSEQ_SIG 0x53053053
>>
>> Why not an x86-specific op code?
> 
> On x86, we use this 4-byte signature as operand to a "no-op" instruction
> taking 4-byte immediate operand:

That makes perfect sense. Thanks.

So what is left to audit?

In summary:

- Why does AArch64 choose a trap?

- Is the current choice of 0x53053053 OK for MIPS? Does it map to a valid insn?

- What better choice is there for power? Pick a real uncommon insn or nop?

- What better choice is there for s390? Pick a real uncommon insn or nop?
   - Todays choice could become something special in the future since it's unassigned.

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Carlos O'Donell @ 2019-04-04 20:32 UTC (permalink / raw)
  To: Florian Weimer, Michael Ellerman
  Cc: Mathieu Desnoyers, Paul Burton, Will Deacon, Boqun Feng,
	Heiko Carstens, Vasily Gorbik, Martin Schwidefsky, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, carlos, Joseph Myers,
	Szabolcs Nagy, libc-alpha, Thomas Gleixner, Ben Maurer,
	Peter Zijlstra, Paul E. McKenney, Dave Watson, Paul Turner
In-Reply-To: <87pnq4zxyj.fsf@oldenburg2.str.redhat.com>

On 4/2/19 3:08 AM, Florian Weimer wrote:
> * Michael Ellerman:
> 
>> I'm a bit vague on what we're trying to do here.
>>
>> But it seems like you want some sort of "eye catcher" prior to the branch?
>>
>> That value is a valid instruction on current CPUs (rlwimi.
>> r5,r24,6,1,9), and even if it wasn't it could become one in future.
>>
>> If you change it to 0x8053530 that is both a valid instruction and is a
>> nop (conditional trap immediate but with no conditions set).
> 
> I think we need something that is very unlikely to appear in the
> instruction stream.  It's just a marker.  The instruction will never be
> executed, and it does not have to be a trap, either (I believe that a
> standard trap instruction would be a bad choice).

I assume you want to avoid a standard trap instruction because it would
be common, and so not meet the intent of the RSEQ_SIG choice as being something
that is *uncommon* right?

It is valuable that it be a trap, particularly for constant pools because
it means that a jump into the constant pool will trap.

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Carlos O'Donell @ 2019-04-04 20:15 UTC (permalink / raw)
  To: Michael Ellerman, Mathieu Desnoyers, Paul Burton, Will Deacon,
	Boqun Feng, Heiko Carstens, Vasily Gorbik, Martin Schwidefsky,
	Russell King, Benjamin Herrenschmidt, Paul Mackerras
  Cc: carlos, Florian Weimer, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Dave Watson, Paul Turner, Rich Felker, linux-kernel, linux-api
In-Reply-To: <87lg0tosfz.fsf@concordia.ellerman.id.au>

On 4/2/19 2:02 AM, Michael Ellerman wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> Hi Carlos,
>>
>> ----- On Mar 22, 2019, at 4:09 PM, Carlos O'Donell codonell@redhat.com wrote:
> ...
>>
>> [...]
>>>> +++ b/sysdeps/unix/sysv/linux/powerpc/bits/rseq.h
>> [...]
>>>> +/* Signature required before each abort handler code.  */
>>>> +#define RSEQ_SIG 0x53053053
>>>
>>> Why isn't this an opcode specific to power?
>>
>> On powerpc 32/64, the abort is placed in a __rseq_failure executable section:
>>
>> #define RSEQ_ASM_DEFINE_ABORT(label, abort_label)                               \
>>                  ".pushsection __rseq_failure, \"ax\"\n\t"                       \
>>                  ".long " __rseq_str(RSEQ_SIG) "\n\t"                            \
>>                  __rseq_str(label) ":\n\t"                                       \
>>                  "b %l[" __rseq_str(abort_label) "]\n\t"                         \
>>                  ".popsection\n\t"
>>
>> That section only contains snippets of those trampolines. Arguably, it would be
>> good if disassemblers could find valid instructions there. Boqun Feng could perhaps
>> shed some light on this signature choice ? Now would be a good time to decide
>> once and for all whether a valid instruction would be a better choice.
> 
> I'm a bit vague on what we're trying to do here.
> 
> But it seems like you want some sort of "eye catcher" prior to the branch?
> 
> That value is a valid instruction on current CPUs (rlwimi.
> r5,r24,6,1,9), and even if it wasn't it could become one in future.
> 
> If you change it to 0x8053530 that is both a valid instruction and is a
> nop (conditional trap immediate but with no conditions set).

The s390 IBM team needs to respond to this and I want to make sure they ACK
the NOP being used here because it impacts them directly.

I'd like to see Martin's opinion on this.

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [PATCH V32 27/27] tracefs: Restrict tracefs when the kernel is locked down
From: Matthew Garrett @ 2019-04-04 20:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: James Morris, LSM List, Linux Kernel Mailing List, David Howells,
	Linux API, Andy Lutomirski
In-Reply-To: <20190404093907.153f52a8@gandalf.local.home>

On Thu, Apr 4, 2019 at 6:39 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed,  3 Apr 2019 17:32:49 -0700
> Matthew Garrett <matthewgarrett@google.com> wrote:
>
>
> > +static void tracefs_destroy_inode(struct inode *inode)
> > +{
> > +     if S_ISREG(inode->i_mode)
>
> Can we please put parenthesis around the condition. I know that the
> macro has them, but no other place in the kernel plays such a trick.

Ha, I've been spending too much time in Go lately. Fixed.

^ permalink raw reply

* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Alexey Dobriyan @ 2019-04-04 18:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, linux-api
In-Reply-To: <20190404084249.GS4038@hirez.programming.kicks-ass.net>

On Thu, Apr 04, 2019 at 10:42:49AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> > Currently there is no easy way to get the number of CPUs on the system.
> 
> And this patch doesn't change that :-)

It does! Application or a library could do one idempotent system call
in a constructor.

> Still, it does the right thing and I like it.

Thanks.

> The point is that nr_cpu_ids is the length of the bitmap, but does not
> contain information on how many CPUs are in the system. Consider the
> case where the bitmap is sparse.

I understand that but how do you ship number of CPUs _and_ possible mask
in one go?

> > Applications are divided into 2 groups:
> > One group allocates buffer and call sched_getaffinity(2) once. It works
> > but either underallocate or overallocates and in the future such application
> > will become buggy as Linux will start working on even more SMP-ier systems.
> > 
> > Glibc in particular shipped with 1024 CPUs support maximum at some point
> > which is quite surprising as glibc maitainers should know better.
> > 
> > Another group dynamically grow buffer until cpumask fits. This is
> > inefficient as multiple system calls are done.
> > 
> > Nobody seems to parse "/sys/devices/system/cpu/possible".
> > Even if someone does, parsing sysfs is much slower than necessary.
> 
> True; but I suppose glibc already does lots of that anyway, right? It
> does contain the right information.

sched_getaffinity(3) does sched_getaffinity(2) + memset()

sysconf(_SC_NPROCESSORS_ONLN) does "/sys/devices/system/cpu/online"

sysconf(_SC_NPROCESSORS_CONF) does readdir("/sys/devices/system/cpu")
which is 5 syscalls. I'm not sure which cpumask readdir represents.

> > Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
> > This will make gettting CPU mask require at most 2 system calls
> > and will eliminate unnecessary code.
> > 
> > len=0 is chosen so that
> > * passing zeroes is the simplest thing
> > 
> > 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
> > 
> >   will simply do the right thing,
> > 
> > * old kernels returned -EINVAL unconditionally.
> > 
> > Note: glibc segfaults upon exiting from system call because it tries to
> > clear the rest of the buffer if return value is positive, so
> > applications will have to use syscall(3).
> > Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).
> 
> This also needs a manpage update. And I'm missing the libc people on Cc.

[nods]
Shipping "man/2/" with kernel is long overdue. :^)

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4942,6 +4942,9 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
> >  	int ret;
> >  	cpumask_var_t mask;
> >  
> > +	if (len == 0)
> > +		return nr_cpu_ids;
> > +
> >  	if ((len * BITS_PER_BYTE) < nr_cpu_ids)
> >  		return -EINVAL;
> >  	if (len & (sizeof(unsigned long)-1))

^ permalink raw reply

* Re: [PATCH V32 27/27] tracefs: Restrict tracefs when the kernel is locked down
From: Steven Rostedt @ 2019-04-04 13:39 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, dhowells, linux-api,
	luto, Matthew Garrett
In-Reply-To: <20190404003249.14356-28-matthewgarrett@google.com>

On Wed,  3 Apr 2019 17:32:49 -0700
Matthew Garrett <matthewgarrett@google.com> wrote:


> +static void tracefs_destroy_inode(struct inode *inode)
> +{
> +	if S_ISREG(inode->i_mode)

Can we please put parenthesis around the condition. I know that the
macro has them, but no other place in the kernel plays such a trick.

	if (S_ISREG(inode->i_mode))

Other than that, the rest looks good.

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

> +		kfree(inode->i_fop);
> +}
> +
>  static int tracefs_remount(struct super_block *sb, int *flags, char *data)
>  {
>  	int err;
> @@ -260,6 +288,7 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
>  
>  static const struct super_operations tracefs_super_operations = {
>  	.statfs		= simple_statfs,
> +	.destroy_inode  = tracefs_destroy_inode,
>  	.remount_fs	= tracefs_remount,
>  	.show_options	= tracefs_show_options,
>  };
> @@ -393,6 +422,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
>  {
>  	struct dentry *dentry;
>  	struct inode *inode;
> +	struct file_operations *proxy_fops;
>  
>  	if (!(mode & S_IFMT))
>  		mode |= S_IFREG;
> @@ -406,8 +436,16 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
>  	if (unlikely(!inode))
>  		return failed_creating(dentry);
>  
> +	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
> +	if (!proxy_fops)
> +		return failed_creating(dentry);
> +
> +	dentry->d_fsdata = fops ? (void *)fops :
> +		(void *)&tracefs_file_operations;
> +	memcpy(proxy_fops, dentry->d_fsdata, sizeof(struct file_operations));
> +	proxy_fops->open = default_open_file;
>  	inode->i_mode = mode;
> -	inode->i_fop = fops ? fops : &tracefs_file_operations;
> +	inode->i_fop = proxy_fops;
>  	inode->i_private = data;
>  	d_instantiate(dentry, inode);
>  	fsnotify_create(dentry->d_parent->d_inode, dentry);

^ permalink raw reply

* Re: [PATCH v4 1/6] arm64: HWCAP: add support for AT_HWCAP2
From: Dave Martin @ 2019-04-04 12:47 UTC (permalink / raw)
  To: Andrew Murray
  Cc: Mark Rutland, libc-alpha, Suzuki K Poulose, Szabolcs Nagy,
	Catalin Marinas, Will Deacon, Phil Blundell, linux-api,
	linux-arm-kernel
In-Reply-To: <20190404112549.GN53702@e119886-lin.cambridge.arm.com>

On Thu, Apr 04, 2019 at 12:25:50PM +0100, Andrew Murray wrote:
> On Wed, Apr 03, 2019 at 05:33:03PM +0100, Dave Martin wrote:
> > On Wed, Apr 03, 2019 at 05:06:23PM +0100, Andrew Murray wrote:
> > > On Wed, Apr 03, 2019 at 02:21:12PM +0100, Dave Martin wrote:
> > > > On Wed, Apr 03, 2019 at 11:56:23AM +0100, Andrew Murray wrote:
> > > > > As we will exhaust the first 32 bits of AT_HWCAP let's start
> > > > > exposing AT_HWCAP2 to userspace to give us up to 64 caps.
> > > > > 
> > > > > Whilst it's possible to use the remaining 32 bits of AT_HWCAP, we
> > > > > prefer to expand into AT_HWCAP2 in order to provide a consistent
> > > > > view to userspace between ILP32 and LP64. However internal to the
> > > > > kernel we prefer to continue to use the full space of elf_hwcap.
> > > > > 
> > > > > To reduce complexity and allow for future expansion, we now
> > > > > represent hwcaps in the kernel as ordinals and use a
> > > > > KERNEL_HWCAP_ prefix. This allows us to support automatic feature
> > > > > based module loading for all our hwcaps.
> > > > > 
> > > > > We introduce cpu_set_feature to set hwcaps which complements the
> > > > > existing cpu_have_feature helper. These helpers allow us to clean
> > > > > up existing direct uses of elf_hwcap and reduce any future effort
> > > > > required to move beyond 64 caps.
> > > > > 
> > > > > For convenience we also introduce cpu_{have,set}_named_feature which
> > > > > makes use of the cpu_feature macro to allow providing a hwcap name
> > > > > without a {KERNEL_}HWCAP_ prefix.
> > > > > 
> > > > > Signed-off-by: Andrew Murray <andrew.murray@arm.com>
> > > > 
> > > > [...]
> > > > 
> > > > > diff --git a/arch/arm64/include/asm/hwcap.h b/arch/arm64/include/asm/hwcap.h
> > > > > index 400b80b49595..1f38a2740f7a 100644
> > > > > --- a/arch/arm64/include/asm/hwcap.h
> > > > > +++ b/arch/arm64/include/asm/hwcap.h
> > > > > @@ -39,12 +39,61 @@
> > > > >  #define COMPAT_HWCAP2_SHA2	(1 << 3)
> > > > >  #define COMPAT_HWCAP2_CRC32	(1 << 4)
> > > > >  
> > > > > +/*
> > > > > + * For userspace we represent hwcaps as a collection of HWCAP{,2}_x bitfields
> > > > > + * as described in uapi/asm/hwcap.h. For the kernel we represent hwcaps as
> > > > > + * natural numbers (in a single range of size MAX_CPU_FEATURES) defined here
> > > > > + * with prefix KERNEL_HWCAP_ mapped to their HWCAP{,2}_x counterpart.
> > > > > + *
> > > > > + * Hwcaps should be set and tested within the kernel via the
> > > > > + * cpu_{set,have}_named_feature(feature) where feature is the unique suffix
> > > > > + * of KERNEL_HWCAP_{feature}.
> > > > > + */
> > > > > +#define __khwcap_feature(x)		ilog2(HWCAP_ ## x)
> > > > 
> > > > Hmm, I didn't spot this before, but we should probably include
> > > > <linux/log2.h>.  This isn't asm-friendly however.
> > > 
> > > Doh!
> > > 
> > > > 
> > > > <asm/hwcap.h> gets included (unnecessarily?) by arch/arm64/mm/proc.S and
> > > > arch/arm64/include/uapi/asm/ptrace.h.
> > > 
> > > I also can't see any reason why either of these files includes hwcap.h...
> > 
> > Maybe we could just drop that include from proc.S.
> > 
> > > > Rather than risk breaking a UAPI header, can we remove the ilog2() here
> > > > and add it back into cpu_feature() where it was originally?
> > > 
> > > No I don't think we can. 
> > 
> > Agreed: userspace may be relying (however unwisely) on getting the
> > hwcaps as a side-effect of <uapi/asm/ptrace.h>, so we can't do much
> > about that one without taking a risk.
> > 
> > > > There may be a reason why this didn't work that I've forgotten...
> > > 
> > > We need the UAPI HWCAP_xx's to be bitfields and we've decided that we should
> > > limit them to 32 bits. Thus UAPI HWCAP2_xx's will also live within the first
> > > 32 bits meaning that we can't distinguish between them based on their value.
> > > 
> > > This isn't ideal within the kernel, as it means if we store the value
> > > anywhere (such as struct arm64_cpu_capabilities) then we need to also store
> > > some additional information to identify if it's AT_HWCAP or AT_HWCAP2.
> > 
> > But we could keep shadow kernel #defines that (for hwcap2) are shifted
> > up by 32 bits?  This required anything that deals with hwcap numbers to
> > cope with them being giant numbers that fit in an unsigned long, not
> > just small intergers (which possibly doesn't work without core changes?)
> > 
> > > In some cases (automatic hwcap based module loading) it's not possible to work
> > > around this - which is why arm32 can only support this for their elf_hwcap2.
> > > The approach this series takes allows automatic module loading to work based
> > > on any hwcap.
> > > 
> > > The solutions I can come up with at the moment are:
> > > 
> > >  - hard code the mapping without ilog2, as follows, though this is error
> > >    prone
> > > 
> > > #define KERNEL_HWCAP_ASIMD              2
> > > 
> > >  - Move the #ifndef __ASSEMBLY__ in include/asm/hwcap.h above the definitions
> > >    of KERNEL_HWCAP_xx and include <linux/log2.h> under __ASSEMBLY__. This works
> > >    but we can't test for hwcaps in assembly - maybe this isn't a problem?
> > 
> > Since this is a kernel header, this is probably OK: is asm needs the
> > hwcaps, sooner or later someone will need to fix it.
> > 
> > Possibly there are out-of-tree drivers relying on using the hwcaps from
> > assembly, but that's probably their own problem.
> > 
> > So, either move the #ifndef for simplicity, or introduce the ordinals
> > into <uapi/asm/hwcap.h>:
> > 
> > #define __HWCAP_NR_FP		0
> > #define __HWCAP_NR_ASIMD	1
> > #define __HWCAP_NR_EVTSTRM	2
> > 
> > ...
> > 
> > #define HWCAP_FP	(1UL << __HWCAP_NR_FP)
> > #define HWCAP_ASIMD 	(1UL << __HWCAP_NR_ASIMD)
> > #define HWCAP_EVTSTRM	(1UL << __HWCAP_NR_EVTSTRM)
> > 
> > ...
> > 
> > #define __HWCAP2_NR_DCPODP	0
> > 
> > #define HWCAP2_DCPODP	(1UL << __HWCAP2_NR_DCPODP)
> > 
> > ...
> > 
> > then use the __HWCAP{,2}_NR_ constants directly place of the
> > KERNEL_HWCAP_ #defines, or define the KERNEL_HWCAP defined in terms of
> > them.
> 
> Though we'd still have to add 32 to the HWCAP2's for asm/hwcap.h (thus
> justifying the continued use of KERNEL_HWCAP).
>
> > This is a noisy approach though, and I'm not totally convinced it's
> > better.
> > 
> > What do you think?
> 
> The only downside to this, is that we create another set of defines per
> HWCAP (bringing it up to 3) - these feels excessive.

I can't really argue with that.

> I'll stick with moving the #ifndef for now.

Probably makes sense for now, if there's no other straightforward
solution.

Cheers
---Dave

^ permalink raw reply

* Re: [PATCH v4 1/6] arm64: HWCAP: add support for AT_HWCAP2
From: Andrew Murray @ 2019-04-04 11:25 UTC (permalink / raw)
  To: Dave Martin
  Cc: Mark Rutland, libc-alpha, Suzuki K Poulose, Szabolcs Nagy,
	Catalin Marinas, Will Deacon, Phil Blundell, linux-api,
	linux-arm-kernel
In-Reply-To: <20190403163302.GV3567@e103592.cambridge.arm.com>

On Wed, Apr 03, 2019 at 05:33:03PM +0100, Dave Martin wrote:
> On Wed, Apr 03, 2019 at 05:06:23PM +0100, Andrew Murray wrote:
> > On Wed, Apr 03, 2019 at 02:21:12PM +0100, Dave Martin wrote:
> > > On Wed, Apr 03, 2019 at 11:56:23AM +0100, Andrew Murray wrote:
> > > > As we will exhaust the first 32 bits of AT_HWCAP let's start
> > > > exposing AT_HWCAP2 to userspace to give us up to 64 caps.
> > > > 
> > > > Whilst it's possible to use the remaining 32 bits of AT_HWCAP, we
> > > > prefer to expand into AT_HWCAP2 in order to provide a consistent
> > > > view to userspace between ILP32 and LP64. However internal to the
> > > > kernel we prefer to continue to use the full space of elf_hwcap.
> > > > 
> > > > To reduce complexity and allow for future expansion, we now
> > > > represent hwcaps in the kernel as ordinals and use a
> > > > KERNEL_HWCAP_ prefix. This allows us to support automatic feature
> > > > based module loading for all our hwcaps.
> > > > 
> > > > We introduce cpu_set_feature to set hwcaps which complements the
> > > > existing cpu_have_feature helper. These helpers allow us to clean
> > > > up existing direct uses of elf_hwcap and reduce any future effort
> > > > required to move beyond 64 caps.
> > > > 
> > > > For convenience we also introduce cpu_{have,set}_named_feature which
> > > > makes use of the cpu_feature macro to allow providing a hwcap name
> > > > without a {KERNEL_}HWCAP_ prefix.
> > > > 
> > > > Signed-off-by: Andrew Murray <andrew.murray@arm.com>
> > > 
> > > [...]
> > > 
> > > > diff --git a/arch/arm64/include/asm/hwcap.h b/arch/arm64/include/asm/hwcap.h
> > > > index 400b80b49595..1f38a2740f7a 100644
> > > > --- a/arch/arm64/include/asm/hwcap.h
> > > > +++ b/arch/arm64/include/asm/hwcap.h
> > > > @@ -39,12 +39,61 @@
> > > >  #define COMPAT_HWCAP2_SHA2	(1 << 3)
> > > >  #define COMPAT_HWCAP2_CRC32	(1 << 4)
> > > >  
> > > > +/*
> > > > + * For userspace we represent hwcaps as a collection of HWCAP{,2}_x bitfields
> > > > + * as described in uapi/asm/hwcap.h. For the kernel we represent hwcaps as
> > > > + * natural numbers (in a single range of size MAX_CPU_FEATURES) defined here
> > > > + * with prefix KERNEL_HWCAP_ mapped to their HWCAP{,2}_x counterpart.
> > > > + *
> > > > + * Hwcaps should be set and tested within the kernel via the
> > > > + * cpu_{set,have}_named_feature(feature) where feature is the unique suffix
> > > > + * of KERNEL_HWCAP_{feature}.
> > > > + */
> > > > +#define __khwcap_feature(x)		ilog2(HWCAP_ ## x)
> > > 
> > > Hmm, I didn't spot this before, but we should probably include
> > > <linux/log2.h>.  This isn't asm-friendly however.
> > 
> > Doh!
> > 
> > > 
> > > <asm/hwcap.h> gets included (unnecessarily?) by arch/arm64/mm/proc.S and
> > > arch/arm64/include/uapi/asm/ptrace.h.
> > 
> > I also can't see any reason why either of these files includes hwcap.h...
> 
> Maybe we could just drop that include from proc.S.
> 
> > > Rather than risk breaking a UAPI header, can we remove the ilog2() here
> > > and add it back into cpu_feature() where it was originally?
> > 
> > No I don't think we can. 
> 
> Agreed: userspace may be relying (however unwisely) on getting the
> hwcaps as a side-effect of <uapi/asm/ptrace.h>, so we can't do much
> about that one without taking a risk.
> 
> > > There may be a reason why this didn't work that I've forgotten...
> > 
> > We need the UAPI HWCAP_xx's to be bitfields and we've decided that we should
> > limit them to 32 bits. Thus UAPI HWCAP2_xx's will also live within the first
> > 32 bits meaning that we can't distinguish between them based on their value.
> > 
> > This isn't ideal within the kernel, as it means if we store the value
> > anywhere (such as struct arm64_cpu_capabilities) then we need to also store
> > some additional information to identify if it's AT_HWCAP or AT_HWCAP2.
> 
> But we could keep shadow kernel #defines that (for hwcap2) are shifted
> up by 32 bits?  This required anything that deals with hwcap numbers to
> cope with them being giant numbers that fit in an unsigned long, not
> just small intergers (which possibly doesn't work without core changes?)
> 
> > In some cases (automatic hwcap based module loading) it's not possible to work
> > around this - which is why arm32 can only support this for their elf_hwcap2.
> > The approach this series takes allows automatic module loading to work based
> > on any hwcap.
> > 
> > The solutions I can come up with at the moment are:
> > 
> >  - hard code the mapping without ilog2, as follows, though this is error
> >    prone
> > 
> > #define KERNEL_HWCAP_ASIMD              2
> > 
> >  - Move the #ifndef __ASSEMBLY__ in include/asm/hwcap.h above the definitions
> >    of KERNEL_HWCAP_xx and include <linux/log2.h> under __ASSEMBLY__. This works
> >    but we can't test for hwcaps in assembly - maybe this isn't a problem?
> 
> Since this is a kernel header, this is probably OK: is asm needs the
> hwcaps, sooner or later someone will need to fix it.
> 
> Possibly there are out-of-tree drivers relying on using the hwcaps from
> assembly, but that's probably their own problem.
> 
> So, either move the #ifndef for simplicity, or introduce the ordinals
> into <uapi/asm/hwcap.h>:
> 
> #define __HWCAP_NR_FP		0
> #define __HWCAP_NR_ASIMD	1
> #define __HWCAP_NR_EVTSTRM	2
> 
> ...
> 
> #define HWCAP_FP	(1UL << __HWCAP_NR_FP)
> #define HWCAP_ASIMD 	(1UL << __HWCAP_NR_ASIMD)
> #define HWCAP_EVTSTRM	(1UL << __HWCAP_NR_EVTSTRM)
> 
> ...
> 
> #define __HWCAP2_NR_DCPODP	0
> 
> #define HWCAP2_DCPODP	(1UL << __HWCAP2_NR_DCPODP)
> 
> ...
> 
> then use the __HWCAP{,2}_NR_ constants directly place of the
> KERNEL_HWCAP_ #defines, or define the KERNEL_HWCAP defined in terms of
> them.

Though we'd still have to add 32 to the HWCAP2's for asm/hwcap.h (thus
justifying the continued use of KERNEL_HWCAP).

> 
> This is a noisy approach though, and I'm not totally convinced it's
> better.
> 
> What do you think?

The only downside to this, is that we create another set of defines per
HWCAP (bringing it up to 3) - these feels excessive.

I'll stick with moving the #ifndef for now.

Thanks,

Andrew Murray

> 
> Cheers
> ---Dave

^ permalink raw reply

* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Peter Zijlstra @ 2019-04-04  8:42 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: mingo, linux-kernel, linux-api
In-Reply-To: <20190403200809.GA13876@avx2>

On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> Currently there is no easy way to get the number of CPUs on the system.

And this patch doesn't change that :-) Still, it does the right thing
and I like it.

The point is that nr_cpu_ids is the length of the bitmap, but does not
contain information on how many CPUs are in the system. Consider the
case where the bitmap is sparse.

> Applications are divided into 2 groups:
> One group allocates buffer and call sched_getaffinity(2) once. It works
> but either underallocate or overallocates and in the future such application
> will become buggy as Linux will start working on even more SMP-ier systems.
> 
> Glibc in particular shipped with 1024 CPUs support maximum at some point
> which is quite surprising as glibc maitainers should know better.
> 
> Another group dynamically grow buffer until cpumask fits. This is
> inefficient as multiple system calls are done.
> 
> Nobody seems to parse "/sys/devices/system/cpu/possible".
> Even if someone does, parsing sysfs is much slower than necessary.

True; but I suppose glibc already does lots of that anyway, right? It
does contain the right information.

> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
> This will make gettting CPU mask require at most 2 system calls
> and will eliminate unnecessary code.
> 
> len=0 is chosen so that
> * passing zeroes is the simplest thing
> 
> 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
> 
>   will simply do the right thing,
> 
> * old kernels returned -EINVAL unconditionally.
> 
> Note: glibc segfaults upon exiting from system call because it tries to
> clear the rest of the buffer if return value is positive, so
> applications will have to use syscall(3).
> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).

This also needs a manpage update. And I'm missing the libc people on Cc.

> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> ---
> 
>  kernel/sched/core.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4942,6 +4942,9 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
>  	int ret;
>  	cpumask_var_t mask;
>  
> +	if (len == 0)
> +		return nr_cpu_ids;
> +
>  	if ((len * BITS_PER_BYTE) < nr_cpu_ids)
>  		return -EINVAL;
>  	if (len & (sizeof(unsigned long)-1))

^ permalink raw reply

* Re: [PATCH V32 11/27] x86: Lock down IO port access when the kernel is locked down
From: Thomas Gleixner @ 2019-04-04  7:49 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, dhowells, linux-api,
	luto, Matthew Garrett, Matthew Garrett, x86
In-Reply-To: <20190404003249.14356-12-matthewgarrett@google.com>

On Wed, 3 Apr 2019, Matthew Garrett wrote:

> From: Matthew Garrett <mjg59@srcf.ucam.org>
> 
> IO port access would permit users to gain access to PCI configuration
> registers, which in turn (on a lot of hardware) give access to MMIO
> register space. This would potentially permit root to trigger arbitrary
> DMA, so lock it down by default.
> 
> This also implicitly locks down the KDADDIO, KDDELIO, KDENABIO and
> KDDISABIO console ioctls.
> 
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Signed-off-by: David Howells <dhowells@redhat.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply

* Re: [PATCH V32 19/27] x86/mmiotrace: Lock down the testmmiotrace module
From: Thomas Gleixner @ 2019-04-04  7:47 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, dhowells, linux-api,
	luto, Matthew Garrett, Steven Rostedt, Ingo Molnar,
	H. Peter Anvin, x86
In-Reply-To: <20190404003249.14356-20-matthewgarrett@google.com>

On Wed, 3 Apr 2019, Matthew Garrett wrote:

> From: David Howells <dhowells@redhat.com>
> 
> The testmmiotrace module shouldn't be permitted when the kernel is locked
> down as it can be used to arbitrarily read and write MMIO space. This is
> a runtime check rather than buildtime in order to allow configurations
> where the same kernel may be run in both locked down or permissive modes
> depending on local policy.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Howells <dhowells@redhat.com
> Signed-off-by: Matthew Garrett <mjg59@google.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply

* Re: [PATCH V32 19/27] x86/mmiotrace: Lock down the testmmiotrace module
From: Steven Rostedt @ 2019-04-04  1:33 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, dhowells, linux-api,
	luto, Thomas Gleixner, Matthew Garrett, Ingo Molnar,
	H. Peter Anvin, x86
In-Reply-To: <20190404003249.14356-20-matthewgarrett@google.com>

On Wed,  3 Apr 2019 17:32:41 -0700
Matthew Garrett <matthewgarrett@google.com> wrote:

> From: David Howells <dhowells@redhat.com>
> 
> The testmmiotrace module shouldn't be permitted when the kernel is locked
> down as it can be used to arbitrarily read and write MMIO space. This is
> a runtime check rather than buildtime in order to allow configurations
> where the same kernel may be run in both locked down or permissive modes
> depending on local policy.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Howells <dhowells@redhat.com
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> cc: Thomas Gleixner <tglx@linutronix.de>
> cc: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

> cc: Ingo Molnar <mingo@kernel.org>
> cc: "H. Peter Anvin" <hpa@zytor.com>
> cc: x86@kernel.org
> ---
>  arch/x86/mm/testmmiotrace.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/mm/testmmiotrace.c b/arch/x86/mm/testmmiotrace.c
> index f6ae6830b341..9e8ad665f354 100644
> --- a/arch/x86/mm/testmmiotrace.c
> +++ b/arch/x86/mm/testmmiotrace.c
> @@ -115,6 +115,9 @@ static int __init init(void)
>  {
>  	unsigned long size = (read_far) ? (8 << 20) : (16 << 10);
>  
> +	if (kernel_is_locked_down("MMIO trace testing", LOCKDOWN_INTEGRITY))
> +		return -EPERM;
> +
>  	if (mmio_address == 0) {
>  		pr_err("you have to use the module argument mmio_address.\n");
>  		pr_err("DO NOT LOAD THIS MODULE UNLESS YOU REALLY KNOW WHAT YOU ARE DOING!\n");

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox