Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v7 04/27] x86/fpu/xstate: Introduce XSAVES system states
From: Dave Hansen @ 2019-06-06 21:18 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <20190606200646.3951-5-yu-cheng.yu@intel.com>

> +/*
> + * Helpers for changing XSAVES system states.
> + */
> +static inline void modify_fpu_regs_begin(void)
> +{
> +	fpregs_lock();
> +	if (test_thread_flag(TIF_NEED_FPU_LOAD))
> +		__fpregs_load_activate();
> +}
> +
> +static inline void modify_fpu_regs_end(void)
> +{
> +	fpregs_unlock();
> +}

These are massively under-commented and under-changelogged.  This looks
like it's intended to ensure that we have supervisor FPU state for this
task loaded before we go and run the MSRs that might be modifying it.

But, that seems broken.  If we have supervisor state, we can't always
defer the load until return to userspace, so we'll never?? have
TIF_NEED_FPU_LOAD.  That would certainly be true for cet_kernel_state.

It seems like we actually need three classes of XSAVE states:
1. User state
2. Supervisor state that affects user mode
3. Supervisor state that affects kernel mode

We can delay the load of 1 and 2, but not 3.

But I don't see any infrastructure for this.

^ permalink raw reply

* Re: [PATCH 10/10] Add sample notification program [ver #3]
From: Eugeniu Rosca @ 2019-06-06 21:21 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel, Eugeniu Rosca, Eugeniu Rosca
In-Reply-To: <155981421379.17513.13158528501056454772.stgit@warthog.procyon.org.uk>

Hi David,

On Thu, Jun 06, 2019 at 10:43:33AM +0100, David Howells wrote:
[..]
> diff --git a/samples/watch_queue/Makefile b/samples/watch_queue/Makefile
> new file mode 100644
> index 000000000000..42b694430d0f
> --- /dev/null
> +++ b/samples/watch_queue/Makefile
> @@ -0,0 +1,9 @@
> +# List of programs to build
> +hostprogs-y := watch_test
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_watch_test.o += -I$(objtree)/usr/include

How about arm64? Do you intend to enable cross-compilation?

> +
> +HOSTLOADLIBES_watch_test += -lkeyutils
> diff --git a/samples/watch_queue/watch_test.c b/samples/watch_queue/watch_test.c
> new file mode 100644
> index 000000000000..893a5380f792
> --- /dev/null
> +++ b/samples/watch_queue/watch_test.c
[..]

> +			asm ("lfence" : : : "memory" );
[..]
> +			asm("mfence" ::: "memory");

FWIW, trying to cross-compile it returned:

aarch64-linux-gnu-gcc -I../../usr/include -I../../include  watch_test.c   -o watch_test
/tmp/ccDXYynm.s: Assembler messages:
/tmp/ccDXYynm.s:471: Error: unknown mnemonic `lfence' -- `lfence'
/tmp/ccDXYynm.s:568: Error: unknown mnemonic `mfence' -- `mfence'
<builtin>: recipe for target 'watch_test' failed
make: *** [watch_test] Error 1

-- 
Best Regards,
Eugeniu.

^ permalink raw reply

* Re: [PATCH v19 1/3] proc: add /proc/<pid>/arch_status
From: Andrew Morton @ 2019-06-06 21:34 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	adobriyan, aubrey.li, linux-api, linux-kernel, Andy Lutomirski
In-Reply-To: <20190606012236.9391-1-aubrey.li@linux.intel.com>

On Thu,  6 Jun 2019 09:22:34 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:

> The architecture specific information of the running processes
> could be useful to the userland. Add /proc/<pid>/arch_status
> interface support to examine process architecture specific
> information externally.

I'll grab these for some testing.  I can merge them up if the x86
maintainers are OK with it all.  However I think it's best that these
be merged via an x86 tree, please.

^ permalink raw reply

* Re: [PATCH v3 1/2] fork: add clone3
From: Serge E. Hallyn @ 2019-06-06 21:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, linux-kernel, torvalds, jannh, keescook, fweimer, oleg,
	arnd, dhowells, Pavel Emelyanov, Andrew Morton, Adrian Reber,
	Andrei Vagin, linux-api
In-Reply-To: <20190604160944.4058-1-christian@brauner.io>

On Tue, Jun 04, 2019 at 06:09:43PM +0200, Christian Brauner wrote:
> This adds the clone3 system call.
> 
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone3().
> 
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
> 
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again. In general, clone3() is extensible and allows for the implementation
> of new features.
> 
> The idea is to keep clone3() very simple and close to the original clone(),
> specifically, to keep on supporting old clone()-based workloads.
> We know there have been various creative proposals how a new process
> creation syscall or even api is supposed to look like. Some people even
> going so far as to argue that the traditional fork()+exec() split should be
> abandoned in favor of an in-kernel version of spawn(). Independent of
> whether or not we personally think spawn() is a good idea this patchset has
> and does not want to have anything to do with this.
> One stance we take is that there's no real good alternative to
> clone()+exec() and we need and want to support this model going forward;
> independent of spawn().
> The following requirements guided clone3():
> - bump the number of available flags
> - move arguments that are currently passed as separate arguments
>   in clone() into a dedicated struct clone_args
>   - choose a struct layout that is easy to handle on 32 and on 64 bit
>   - choose a struct layout that is extensible
>   - give new flags that currently need to abuse another flag's dedicated
>     return argument in clone() their own dedicated return argument
>     (e.g. CLONE_PIDFD)
>   - use a separate kernel internal struct kernel_clone_args that is
>     properly typed according to current kernel conventions in fork.c and is
>     different from  the uapi struct clone_args
> - port _do_fork() to use kernel_clone_args so that all process creation
>   syscalls such as fork(), vfork(), clone(), and clone3() behave identical
>   (Arnd suggested, that we can probably also port do_fork() itself in a
>    separate patchset.)
> - ease of transition for userspace from clone() to clone3()
>   This very much means that we do *not* remove functionality that userspace
>   currently relies on as the latter is a good way of creating a syscall
>   that won't be adopted.
> - do not try to be clever or complex: keep clone3() as dumb as possible
> 
> In accordance with Linus suggestions (cf. [11]), clone3() has the following
> signature:
> 
> /* uapi */
> struct clone_args {
>         __aligned_u64 flags;
>         __aligned_u64 pidfd;
>         __aligned_u64 child_tid;
>         __aligned_u64 parent_tid;
>         __aligned_u64 exit_signal;
>         __aligned_u64 stack;
>         __aligned_u64 stack_size;
>         __aligned_u64 tls;
> };
> 
> /* kernel internal */
> struct kernel_clone_args {
>         u64 flags;
>         int __user *pidfd;
>         int __user *child_tid;
>         int __user *parent_tid;
>         int exit_signal;
>         unsigned long stack;
>         unsigned long stack_size;
>         unsigned long tls;
> };
> 
> long sys_clone3(struct clone_args __user *uargs, size_t size)
> 
> clone3() cleanly supports all of the supported flags from clone() and thus
> all legacy workloads.
> The advantage of sticking close to the old clone() is the low cost for
> userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> pthreads) are based on the clone() syscall. With the new clone3() syscall
> supporting all of the old workloads and opening up the ability to add new
> features should make switching to it for userspace more appealing. In
> essence, glibc can just write a simple wrapper to switch from clone() to
> clone3().
> 
> There has been some interest in this patchset already. We have received a
> patch from the CRIU corner for clone3() that would set the PID/TID of a
> restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
> 
> /* User visible differences to legacy clone() */
> - CLONE_DETACHED will cause EINVAL with clone3()
> - CSIGNAL is deprecated
>   It is superseeded by a dedicated "exit_signal" argument in struct
>   clone_args freeing up space for additional flags.
>   This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
> 
> /* References */
> [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
> [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
> [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
> [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
> [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
> [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
> [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Christian Brauner <christian@brauner.io>

Acked-by: Serge Hallyn <serge@hallyn.com>

> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Adrian Reber <adrian@lisas.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org
> ---
> v1:
> - Linus Torvalds <torvalds@linux-foundation.org>:
>   - redesign based on Linus proposal
>   - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> - Arnd Bergmann <arnd@arndb.de>:
>   - use a single copy_from_user() instead of multiple get_user() calls
>     since the latter have a constant overhead on some architectures
>   - a range of other tweaks and suggestions
> v2:
> - Linus Torvalds <torvalds@linux-foundation.org>,
>   Andrei Vagin <avagin@gmail.com>:
>   - replace CSIGNAL flag with dedicated exit_signal argument in struct
>     clone_args
> - Christian Brauner <christian@brauner.io>:
>   - improve naming for some struct clone_args members
> v3:
> - Arnd Bergmann <arnd@arndb.de>:
>   - replace memset with constructor for clarity and better object code
>   - call flag verification function clone3_flags_valid() on
>     kernel_clone_args instead of clone_args
>   - remove __ARCH_WANT_SYS_CLONE ifdefine around sys_clone3()
> - Christian Brauner <christian@brauner.io>:
>   - replace clone3_flags_valid() with clone3_args_valid() and call in
>     clone3() directly rather than in copy_clone_args_from_user()
>     This cleanly separates copying the args from userspace from the
>     verification whether those args are sane.
> - David Howells <dhowells@redhat.com>:
>   - align new struct member assignments with tabs
>   - replace CLONE_MAX by with a non-uapi exported CLONE_LEGACY_FLAGS and
>     define it as  0xffffffffULL for clarity
>   - make copy_clone_args_from_user() noinline
>   - avoid assigning to local variables from struct kernel_clone_args
>     members in cases where it makes sense
> ---
>  arch/x86/ia32/sys_ia32.c   |  12 ++-
>  include/linux/sched/task.h |  17 +++-
>  include/linux/syscalls.h   |   4 +
>  include/uapi/linux/sched.h |  16 +++
>  kernel/fork.c              | 201 ++++++++++++++++++++++++++++---------
>  5 files changed, 199 insertions(+), 51 deletions(-)

^ permalink raw reply

* Re: [RFC][PATCH 00/10] Mount, FS, Block and Keyrings notifications [ver #3]
From: Andy Lutomirski @ 2019-06-06 21:54 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Casey Schaufler, Stephen Smalley, Al Viro,
	Greg Kroah-Hartman, USB list, raven, Linux FS Devel, Linux API,
	linux-block, keyrings, LSM List, LKML, Paul Moore
In-Reply-To: <23611.1559855827@warthog.procyon.org.uk>



> On Jun 6, 2019, at 2:17 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Andy Lutomirski <luto@kernel.org> wrote:
> 
>>>> You are allowing arbitrary information flow between T and W above.  Who
>>>> cares about notifications?
>>> 
>>> I do. If Watched object is /dev/null no data flow is possible.
>>> There are many objects on a modern Linux system for which this
>>> is true. Even if it's "just a file" the existence of one path
>>> for data to flow does not justify ignoring the rules for other
>>> data paths.
>> 
>> Aha!
>> 
>> Even ignoring security, writes to things like /dev/null should
>> probably not trigger notifications to people who are watching
>> /dev/null.  (There are probably lots of things like this: /dev/zero,
>> /dev/urandom, etc.)
> 
> Even writes to /dev/null might generate access notifications; leastways,
> vfs_read() will call fsnotify_access() afterwards on success.

Hmm. I can see this being an issue, but I guess not with your patch set.

> 
> Whether or not you can set marks on open device files is another matter.
> 
>> David, are there any notification types that have this issue in your
>> patchset?  If so, is there a straightforward way to fix it?
> 
> I'm not sure what issue you're referring to specifically.  Do you mean whether
> writes to device files generate notifications?

I mean: are there cases where some action generates a notification but does not otherwise have an effect visible to the users who can receive the notification. It looks like the answer is probably “no”, which is good.

Casey, is this good enough for you, or is there still an issue?

^ permalink raw reply

* Re: [PATCH v7 04/27] x86/fpu/xstate: Introduce XSAVES system states
From: Andy Lutomirski @ 2019-06-06 22:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <0a2f8b9b-b96b-06c8-bae0-b78b2ca3b727@intel.com>



On Jun 6, 2019, at 2:18 PM, Dave Hansen <dave.hansen@intel.com> wrote:

>> +/*
>> + * Helpers for changing XSAVES system states.
>> + */
>> +static inline void modify_fpu_regs_begin(void)
>> +{
>> +    fpregs_lock();
>> +    if (test_thread_flag(TIF_NEED_FPU_LOAD))
>> +        __fpregs_load_activate();
>> +}
>> +
>> +static inline void modify_fpu_regs_end(void)
>> +{
>> +    fpregs_unlock();
>> +}
> 
> These are massively under-commented and under-changelogged.  This looks
> like it's intended to ensure that we have supervisor FPU state for this
> task loaded before we go and run the MSRs that might be modifying it.
> 
> But, that seems broken.  If we have supervisor state, we can't always
> defer the load until return to userspace, so we'll never?? have
> TIF_NEED_FPU_LOAD.  That would certainly be true for cet_kernel_state.

Ugh. I was sort of imagining that we would treat supervisor state completely separately from user state.  But can you maybe give examples of exactly what you mean?

> 
> It seems like we actually need three classes of XSAVE states:
> 1. User state

This is FPU, XMM, etc, right?

> 2. Supervisor state that affects user mode

User CET?


> 3. Supervisor state that affects kernel mode

Like supervisor CET?  If we start doing supervisor shadow stack, the context switches will be real fun.  We may need to handle this in asm.

Where does PKRU fit in?  Maybe we can treat it as #3?

—Andy

^ permalink raw reply

* Re: [PATCH v7 04/27] x86/fpu/xstate: Introduce XSAVES system states
From: Dave Hansen @ 2019-06-06 22:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <5EE146A8-6C8C-4C5D-B7C0-AB8AD1012F1E@amacapital.net>



On 6/6/19 3:04 PM, Andy Lutomirski wrote:
>> But, that seems broken.  If we have supervisor state, we can't 
>> always defer the load until return to userspace, so we'll never?? 
>> have TIF_NEED_FPU_LOAD.  That would certainly be true for 
>> cet_kernel_state.
> 
> Ugh. I was sort of imagining that we would treat supervisor state
 completely separately from user state.  But can you maybe give
examples of exactly what you mean?
> 
>> It seems like we actually need three classes of XSAVE states: 1. 
>> User state
> 
> This is FPU, XMM, etc, right?

Yep.

>> 2. Supervisor state that affects user mode
> 
> User CET?

Yep.

>> 3. Supervisor state that affects kernel mode
> 
> Like supervisor CET?  If we start doing supervisor shadow stack, the 
> context switches will be real fun.  We may need to handle this in 
> asm.

Yeah, that's what I was thinking.

I have the feeling Yu-cheng's patches don't comprehend this since
Sebastian's patches went in after he started working on shadow stacks.

> Where does PKRU fit in?  Maybe we can treat it as #3?

I thought Sebastian added specific PKRU handling to make it always
eager.  It's actually user state that affect kernel mode. :)

^ permalink raw reply

* Re: [PATCH v7 04/27] x86/fpu/xstate: Introduce XSAVES system states
From: Yu-cheng Yu @ 2019-06-06 22:10 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel
In-Reply-To: <4effb749-0cdc-6a49-6352-7b2d4aa7d866@intel.com>

On Thu, 2019-06-06 at 15:08 -0700, Dave Hansen wrote:
> 
> On 6/6/19 3:04 PM, Andy Lutomirski wrote:
> > > But, that seems broken.  If we have supervisor state, we can't 
> > > always defer the load until return to userspace, so we'll never?? 
> > > have TIF_NEED_FPU_LOAD.  That would certainly be true for 
> > > cet_kernel_state.
> > 
> > Ugh. I was sort of imagining that we would treat supervisor state
> 
>  completely separately from user state.  But can you maybe give
> examples of exactly what you mean?
> > 
> > > It seems like we actually need three classes of XSAVE states: 1. 
> > > User state
> > 
> > This is FPU, XMM, etc, right?
> 
> Yep.
> 
> > > 2. Supervisor state that affects user mode
> > 
> > User CET?
> 
> Yep.
> 
> > > 3. Supervisor state that affects kernel mode
> > 
> > Like supervisor CET?  If we start doing supervisor shadow stack, the 
> > context switches will be real fun.  We may need to handle this in 
> > asm.
> 
> Yeah, that's what I was thinking.
> 
> I have the feeling Yu-cheng's patches don't comprehend this since
> Sebastian's patches went in after he started working on shadow stacks.
> 
> > Where does PKRU fit in?  Maybe we can treat it as #3?
> 
> I thought Sebastian added specific PKRU handling to make it always
> eager.  It's actually user state that affect kernel mode. :)

For CET user states, we need to restore before making changes.  If they are not
being changed (i.e. signal handling and syscalls), then they are restored only
before going back to user-mode.

For CET kernel states, we only need to make small changes in the way similar to
the PKRU handling, right?  We'll address it when sending CET kernel-mode
patches.

I will put in more comments as suggested by Dave in earlier emails.

Yu-cheng

^ permalink raw reply

* Re: [RFC][PATCH 00/10] Mount, FS, Block and Keyrings notifications [ver #3]
From: David Howells @ 2019-06-06 22:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Andy Lutomirski, Casey Schaufler, Stephen Smalley,
	Al Viro, Greg Kroah-Hartman, USB list, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LSM List, LKML, Paul Moore
In-Reply-To: <AD7898AE-B92C-4DE6-B895-7116FEDB3091@amacapital.net>

Andy Lutomirski <luto@amacapital.net> wrote:

> I mean: are there cases where some action generates a notification but does
> not otherwise have an effect visible to the users who can receive the
> notification. It looks like the answer is probably “no”, which is good.

mount_notify().  You can get a notification that someone altered the mount
topology (eg. by mounting something).  A process receiving a notification
could then use fsinfo(), say, to reread the mount topology tree, find out
where the new mount is and wander over there to have a look - assuming they
have the permissions for pathwalk to succeed.

David

^ permalink raw reply

* Re: [RFC][PATCH 00/10] Mount, FS, Block and Keyrings notifications [ver #3]
From: Andy Lutomirski @ 2019-06-06 22:42 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Casey Schaufler, Stephen Smalley, Al Viro,
	Greg Kroah-Hartman, USB list, raven, Linux FS Devel, Linux API,
	linux-block, keyrings, LSM List, LKML, Paul Moore
In-Reply-To: <30567.1559860681@warthog.procyon.org.uk>



> On Jun 6, 2019, at 3:38 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
>> I mean: are there cases where some action generates a notification but does
>> not otherwise have an effect visible to the users who can receive the
>> notification. It looks like the answer is probably “no”, which is good.
> 
> mount_notify().  You can get a notification that someone altered the mount
> topology (eg. by mounting something).  A process receiving a notification
> could then use fsinfo(), say, to reread the mount topology tree, find out
> where the new mount is and wander over there to have a look - assuming they
> have the permissions for pathwalk to succeed.
> 
> 

They can call fsinfo() anyway, or just read /proc/self/mounts. As far as I’m concerned, if you have CAP_SYS_ADMIN over a mount namespace and LSM policy lets you mount things, the of course you can get information to basically anyone who can use that mount namespace.

^ permalink raw reply

* Re: [RFC][PATCH 00/10] Mount, FS, Block and Keyrings notifications [ver #3]
From: David Howells @ 2019-06-06 22:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Andy Lutomirski, Casey Schaufler, Stephen Smalley,
	Al Viro, Greg Kroah-Hartman, USB list, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LSM List, LKML, Paul Moore
In-Reply-To: <D2BD8FEB-5DF5-449B-AF81-83BA65E0E643@amacapital.net>

Andy Lutomirski <luto@amacapital.net> wrote:

> They can call fsinfo() anyway, or just read /proc/self/mounts. As far as I’m
> concerned, if you have CAP_SYS_ADMIN over a mount namespace and LSM policy
> lets you mount things, the of course you can get information to basically
> anyone who can use that mount namespace.

And automounts?  You don't need CAP_SYS_ADMIN to trigger one of those, but
they still generate events.  On the other hand, you need CSA to mount
something that has automounts in the first place, and if you're particularly
concerned about security, you probably don't want the processes you might be
suspicious of having access to things that contain automounts (typically
network filesystems).

David

^ permalink raw reply

* Re: [PATCH 10/10] Add sample notification program [ver #3]
From: David Howells @ 2019-06-06 22:52 UTC (permalink / raw)
  To: Eugeniu Rosca
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel, Eugeniu Rosca
In-Reply-To: <20190606212140.GA25664@vmlxhi-102.adit-jv.com>

Eugeniu Rosca <erosca@de.adit-jv.com> wrote:

> How about arm64? Do you intend to enable cross-compilation?

There's no guarantee that a cross-compiler can actually build userspace apps,
though I haven't intended to encode anything against that in the Makefile.

> > +			asm ("lfence" : : : "memory" );
> [..]
> > +			asm("mfence" ::: "memory");
> 
> FWIW, trying to cross-compile it returned:
> 
> aarch64-linux-gnu-gcc -I../../usr/include -I../../include  watch_test.c   -o watch_test
> /tmp/ccDXYynm.s: Assembler messages:
> /tmp/ccDXYynm.s:471: Error: unknown mnemonic `lfence' -- `lfence'
> /tmp/ccDXYynm.s:568: Error: unknown mnemonic `mfence' -- `mfence'

Yeah.  I need to use C-standard __atomic_* stuff.

David

^ permalink raw reply

* Re: [RFC]: Convention for naming syscall revisions
From: Theodore Ts'o @ 2019-06-06 23:54 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-api, linux-kernel
In-Reply-To: <20190606154224.7lln4zp6v3ey4icq@brauner.io>

On Thu, Jun 06, 2019 at 05:42:25PM +0200, Christian Brauner wrote:
> Hey everyone,
> 
> I hope this is not going to start a trash fire.
> 
> While working on a new clone version I tried to find out what the
> current naming conventions for syscall revisions is. I was told and
> seemed to be able to confirm through the syscall list that revisions of
> syscalls are for the most part (for examples see [1]) named after the
> number of arguments and not for the number of revisions. But some also
> seem to escape that logic (e.g. clone2).

There are also examples which show that it's a revision number:

      preadv2, pwritev2, mlock2, sync_file_range2

immediately come to mind.  It's also important to note that in some
cases, we do something very different (look aht the stat and fstat
variants), and that in some cases the number of parameters for a
system call vary between architectures (because of system call
argument passing limitations), and this gets papered over by glibc.

So we can define what the historical pattern, but there might be a big
difference between what might make sense as an internal naming
convention, and the names that we want to expose to userspace
application programmers --- especially if the number of arguments at
the syscall level might be different (on some architectures) than at
the C library level.

					- Ted

^ permalink raw reply

* Re: [PATCH v7 04/27] x86/fpu/xstate: Introduce XSAVES system states
From: Andy Lutomirski @ 2019-06-07  1:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <4effb749-0cdc-6a49-6352-7b2d4aa7d866@intel.com>



> On Jun 6, 2019, at 3:08 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> 
> 
> On 6/6/19 3:04 PM, Andy Lutomirski wrote:
>>> But, that seems broken.  If we have supervisor state, we can't 
>>> always defer the load until return to userspace, so we'll never?? 
>>> have TIF_NEED_FPU_LOAD.  That would certainly be true for 
>>> cet_kernel_state.
>> 
>> Ugh. I was sort of imagining that we would treat supervisor state
> completely separately from user state.  But can you maybe give
> examples of exactly what you mean?

I was imagining a completely separate area in memory for supervisor states.  I guess this might defeat the modified optimization and is probably a bad idea.

>> 
>>> It seems like we actually need three classes of XSAVE states: 1. 
>>> User state
>> 
>> This is FPU, XMM, etc, right?
> 
> Yep.
> 
>>> 2. Supervisor state that affects user mode
>> 
>> User CET?
> 
> Yep.
> 
>>> 3. Supervisor state that affects kernel mode
>> 
>> Like supervisor CET?  If we start doing supervisor shadow stack, the 
>> context switches will be real fun.  We may need to handle this in 
>> asm.
> 
> Yeah, that's what I was thinking.
> 
> I have the feeling Yu-cheng's patches don't comprehend this since
> Sebastian's patches went in after he started working on shadow stacks.

Do we need to have TIF_LOAD_FPU mean “we need to load *some* of the xsave state”?  If so, maybe a bunch of the accessors should have their interfaces reviewed to make sure they’re sill sensible.

> 
>> Where does PKRU fit in?  Maybe we can treat it as #3?
> 
> I thought Sebastian added specific PKRU handling to make it always
> eager.  It's actually user state that affect kernel mode. :)

Indeed.  But, if we document a taxonomy of states, we should make sure it fits in. I guess it’s like supervisor CET except that user code can can also read and write it.

We should probably have self tests that make sure that the correct states, and nothing else, show up in ptrace and signal states, and that trying to write supervisor CET via ptrace and sigreturn is properly rejected.

Just to double check my mental model: it’s okay to XSAVES twice to the same buffer with disjoint RFBM as long as we do something intelligent with XSTATE_BV afterwards, right?  Because, unless we split up the buffers, I think we will have to do this when we context switch while TIF_LOAD_FPU is set.

Are there performance numbers for how the time needed to XRSTORS everything versus the time to XRSTORS supervisor CET and then separately XRSTORS the FPU?  This may affect whether we want context switches to have the new task eagerly or lazily restored.

Hmm. I wonder if we need some way for a selftest to reliably trigger TIF_LOAD_FPU.

—Andy

^ permalink raw reply

* Re: [PATCH] uapi: avoid namespace conflict in linux/posix_types.h
From: Florian Weimer @ 2019-06-07  4:28 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Arnd Bergmann, linux-api, linux-arch, netdev, Laura Abbott,
	Paul Burton, Deepa Dinamani, linux-kernel, torvalds
In-Reply-To: <alpine.DEB.2.21.1905072249570.19308@digraph.polyomino.org.uk>

* Joseph Myers:

> What happened with this patch (posted 19 March)?  I found today that we 
> can't use Linux 5.1 headers in glibc testing because the namespace issues 
> are still present in the headers as of the release.

This regression fix still hasn't been merged into Linus' tree.  What is
going on here?

This might seem rather minor, but the namespace testing is actually
relevant in practice.  It prevents accidental clashes with C/C++
identifiers in user code.

If this fairly central UAPI header is not made namespace-clean again,
then we need to duplicate information from more UAPI headers in glibc,
and I don't think that's something we'd want to do.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 09/10] usb: Add USB subsystem notifications [ver #3]
From: Felipe Balbi @ 2019-06-07  6:40 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Alan Stern
  Cc: David Howells, viro, linux-usb, raven, linux-fsdevel, linux-api,
	linux-block, keyrings, linux-security-module, linux-kernel
In-Reply-To: <20190606153150.GB28997@kroah.com>

[-- Attachment #1: Type: text/plain, Size: 1771 bytes --]


Hi,

Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
> On Thu, Jun 06, 2019 at 10:55:24AM -0400, Alan Stern wrote:
>> On Thu, 6 Jun 2019, Greg Kroah-Hartman wrote:
>> 
>> > On Thu, Jun 06, 2019 at 10:24:18AM -0400, Alan Stern wrote:
>> > > On Thu, 6 Jun 2019, David Howells wrote:
>> > > 
>> > > > Add a USB subsystem notification mechanism whereby notifications about
>> > > > hardware events such as device connection, disconnection, reset and I/O
>> > > > errors, can be reported to a monitoring process asynchronously.
>> > > 
>> > > USB I/O errors covers an awfully large and vague field.  Do we really
>> > > want to include them?  I'm doubtful.
>> > 
>> > See the other patch on the linux-usb list that wanted to start adding
>> > KOBJ_CHANGE notifications about USB "i/o errors".
>> 
>> That patch wanted to add notifications only for enumeration failures
>> (assuming you're talking about the patch from Eugeniu Rosca), not I/O
>> errors in general.
>
> Yes, sorry, I was thinking that as a "I/O failed in enumeration" :)
>
>> > So for "severe" issues, yes, we should do this, but perhaps not for all
>> > of the "normal" things we see when a device is yanked out of the system
>> > and the like.
>> 
>> Then what counts as a "severe" issue?  Anything besides enumeration 
>> failure?
>
> Not that I can think of at the moment, other than the other recently
> added KOBJ_CHANGE issue.  I'm sure we have other "hard failure" issues
> in the USB stack that people will want exposed over time.

From an XHCI standpoint, Transaction Errors might be one thing. They
happen rarely and are a strong indication that the bus itself is
bad. Either bad cable, misbehaving PHYs, improper power management, etc.

-- 
balbi

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v7 05/27] x86/fpu/xstate: Add XSAVES system states for shadow stack
From: Peter Zijlstra @ 2019-06-07  7:07 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200646.3951-6-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:06:24PM -0700, Yu-cheng Yu wrote:
> Intel Control-flow Enforcement Technology (CET) introduces the
> following MSRs.
> 
>     MSR_IA32_U_CET (user-mode CET settings),
>     MSR_IA32_PL3_SSP (user-mode shadow stack),
>     MSR_IA32_PL0_SSP (kernel-mode shadow stack),
>     MSR_IA32_PL1_SSP (Privilege Level 1 shadow stack),
>     MSR_IA32_PL2_SSP (Privilege Level 2 shadow stack).
> 
> Introduce them into XSAVES system states.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/fpu/types.h            | 22 +++++++++++++++++++++
>  arch/x86/include/asm/fpu/xstate.h           |  4 +++-
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/fpu/xstate.c                | 10 ++++++++++
>  4 files changed, 37 insertions(+), 1 deletion(-)

And yet, no changes to msr-index.h !?

^ permalink raw reply

* Re: [PATCH v7 15/27] mm: Handle shadow stack page fault
From: Peter Zijlstra @ 2019-06-07  7:30 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200646.3951-16-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:06:34PM -0700, Yu-cheng Yu wrote:

> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 75d9d68a6de7..ffcc0be7cadc 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1188,4 +1188,12 @@ static inline bool arch_has_pfn_modify_check(void)
>  #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>  
> +#ifndef CONFIG_ARCH_HAS_SHSTK
> +#define pte_set_vma_features(pte, vma) pte
> +#define arch_copy_pte_mapping(vma_flags) false

static inline pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma)
{
	return pte;
}

static inline bool arch_copy_pte_mapping(unsigned long vm_flags)
{
	return false;
}

Please, this way we retain function prototype checking.

> +#else
> +pte_t pte_set_vma_features(pte_t pte, struct vm_area_struct *vma);
> +bool arch_copy_pte_mapping(vm_flags_t vm_flags);
> +#endif

^ permalink raw reply

* Re: [PATCH v7 18/27] mm: Introduce do_mmap_locked()
From: Peter Zijlstra @ 2019-06-07  7:43 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200646.3951-19-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:06:37PM -0700, Yu-cheng Yu wrote:
> There are a few places that need do_mmap() with mm->mmap_sem held.
> Create an in-line function for that.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  include/linux/mm.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 398f1e1c35e5..7cf014604848 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2411,6 +2411,24 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
>  static inline void mm_populate(unsigned long addr, unsigned long len) {}
>  #endif
>  
> +static inline unsigned long do_mmap_locked(unsigned long addr,
> +	unsigned long len, unsigned long prot, unsigned long flags,
> +	vm_flags_t vm_flags)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long populate;
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap(NULL, addr, len, prot, flags, vm_flags, 0,
> +		       &populate, NULL);

Funny thing how do_mmap() takes a file pointer as first argument and
this thing explicitly NULLs that. That more or less invalidates the name
do_mmap_locked().

> +	up_write(&mm->mmap_sem);
> +
> +	if (populate)
> +		mm_populate(addr, populate);
> +
> +	return addr;
> +}

^ permalink raw reply

* Re: [PATCH v7 18/27] mm: Introduce do_mmap_locked()
From: Peter Zijlstra @ 2019-06-07  7:47 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190607074322.GP3419@hirez.programming.kicks-ass.net>

On Fri, Jun 07, 2019 at 09:43:22AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 06, 2019 at 01:06:37PM -0700, Yu-cheng Yu wrote:
> > There are a few places that need do_mmap() with mm->mmap_sem held.
> > Create an in-line function for that.
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > ---
> >  include/linux/mm.h | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 398f1e1c35e5..7cf014604848 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2411,6 +2411,24 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
> >  static inline void mm_populate(unsigned long addr, unsigned long len) {}
> >  #endif
> >  
> > +static inline unsigned long do_mmap_locked(unsigned long addr,
> > +	unsigned long len, unsigned long prot, unsigned long flags,
> > +	vm_flags_t vm_flags)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	unsigned long populate;
> > +
> > +	down_write(&mm->mmap_sem);
> > +	addr = do_mmap(NULL, addr, len, prot, flags, vm_flags, 0,
> > +		       &populate, NULL);
> 
> Funny thing how do_mmap() takes a file pointer as first argument and
> this thing explicitly NULLs that. That more or less invalidates the name
> do_mmap_locked().
> 
> > +	up_write(&mm->mmap_sem);
> > +
> > +	if (populate)
> > +		mm_populate(addr, populate);
> > +
> > +	return addr;
> > +}

You also don't retain that last @uf argument.

I'm thikning you're better off adding a helper to the cet.c file; call
it cet_mmap() or whatever.

^ permalink raw reply

* Re: [PATCH v7 23/27] x86/cet/shstk: ELF header parsing of Shadow Stack
From: Peter Zijlstra @ 2019-06-07  7:54 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200646.3951-24-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:06:42PM -0700, Yu-cheng Yu wrote:

> +#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
> +int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
> +{
> +	int r;
> +	uint32_t property;

Flip those two lines around.

> +
> +	r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
> +			     &property);
> +
> +	memset(&current->thread.cet, 0, sizeof(struct cet_status));

It seems to me that memset would be better placed before
get_gnu_property().

> +	if (r)
> +		return r;
> +
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {

	if (r || !cpu_feature_enabled())
		return r;

> +		if (property & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
> +			r = cet_setup_shstk();
> +		if (r < 0)
> +			return r;
> +	}
> +	return r;

and loose the indent.

> +}
> +#endif
> -- 
> 2.17.1
> 

^ permalink raw reply

* Re: [PATCH v7 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Peter Zijlstra @ 2019-06-07  7:58 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200646.3951-23-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:06:41PM -0700, Yu-cheng Yu wrote:
> An ELF file's .note.gnu.property indicates features the executable file
> can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> GNU_PROPERTY_X86_FEATURE_1_SHSTK.
> 
> With this patch, if an arch needs to setup features from ELF properties,
> it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and a specific
> arch_setup_property().
> 
> For example, for X86_64:
> 
> int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
> {
> 	int r;
> 	uint32_t property;
> 
> 	r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
> 			     &property);
> 	...
> }
> 
> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

Did HJ write this patch as suggested by that SoB chain? If so, you lost
a From: line on top, if not, the SoB thing is invalid.

^ permalink raw reply

* Re: [PATCH v7 07/14] x86/cet/ibt: Add arch_prctl functions for IBT
From: Peter Zijlstra @ 2019-06-07  8:07 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200926.4029-8-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:09:19PM -0700, Yu-cheng Yu wrote:

> +static int handle_bitmap(unsigned long arg2)
> +{
> +	unsigned long addr, size;
> +
> +	if (get_user(addr, (unsigned long __user *)arg2) ||
> +	    get_user(size, (unsigned long __user *)arg2 + 1))
> +		return -EFAULT;
> +
> +	return cet_setup_ibt_bitmap(addr, size);
> +}


> +	/*
> +	 * Allocate legacy bitmap and return address & size to user.
> +	 */
> +	case ARCH_X86_CET_SET_LEGACY_BITMAP:
> +		return handle_bitmap(arg2);

AFAICT it does exactly the opposite of that comment; it gets the address
and size from userspace and doesn't allocate anything at all.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Peter Zijlstra @ 2019-06-07  8:08 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190606200926.4029-4-yu-cheng.yu@intel.com>

On Thu, Jun 06, 2019 at 01:09:15PM -0700, Yu-cheng Yu wrote:
> Indirect Branch Tracking (IBT) provides an optional legacy code bitmap
> that allows execution of legacy, non-IBT compatible library by an
> IBT-enabled application.  When set, each bit in the bitmap indicates
> one page of legacy code.
> 
> The bitmap is allocated and setup from the application.

> +int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size)
> +{
> +	u64 r;
> +
> +	if (!current->thread.cet.ibt_enabled)
> +		return -EINVAL;
> +
> +	if (!PAGE_ALIGNED(bitmap) || (size > TASK_SIZE_MAX))
> +		return -EINVAL;
> +
> +	current->thread.cet.ibt_bitmap_addr = bitmap;
> +	current->thread.cet.ibt_bitmap_size = size;
> +
> +	/*
> +	 * Turn on IBT legacy bitmap.
> +	 */
> +	modify_fpu_regs_begin();
> +	rdmsrl(MSR_IA32_U_CET, r);
> +	r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> +	wrmsrl(MSR_IA32_U_CET, r);
> +	modify_fpu_regs_end();
> +
> +	return 0;
> +}

So you just program a random user supplied address into the hardware.
What happens if there's not actually anything at that address or the
user munmap()s the data after doing this?

^ permalink raw reply

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Renzo Davoli @ 2019-06-07  9:40 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner
In-Reply-To: <5d44edf655e193789823094d1b4301fd@suse.de>

Hi Roman,

On Thu, Jun 06, 2019 at 10:11:57PM +0200, Roman Penyaev wrote:
> Hi Renzo,
> On 2019-06-03 17:00, Renzo Davoli wrote:
> > Please, have a look of the README.md page here:
> > https://github.com/virtualsquare/vuos
> Is that similar to what user-mode linux does?  I mean the principle.

let us write this proportion:
user-mode-linux / umvu = linux / namespace

In a comparison between user-mode linux and umvu,
while the way to get the system call requests is the same (ptrace)
the goal is different.

user-mode linux catches all the system calls, none of them is forwarded to
the real kernel: it uses a linux kernel compiled as a process to give processes
the illusion to live in another machine.

umvu catches all the system calls and then it decides if the syscall must be forwarded
to the kernel (maybe modified) or entirely processed at user-level by the hypervisor
(by means of specific plug-in modules like vufuse for file systems, vudev for devices
 and so on).
While the "illusion" of user-mode linux is a global illusion, the "illusion" provided by
umvu is limited and configurable. After a "mount" of a filesystem using vufuse,
the file system tree is the same *but* the subtree of the mountpoint.
The illusion is limited to the subtree as only the system call requests for paths inside
the subtree are processed by umvu and its modules.

It is similar to a namespace implemented at user level.
w.r.t. namespaces:
* umvu does not change the attack surface of the kernel (it is just a virtualization
		-- a.k.a. illusion -- provided by a user process to other user processes)
* umvu can provide features not currently supported by the kernel (e.g. a file system
		organization unavailable as kernel code, networking stacks at user level etc.)
* ...

umvu is an implementation of vuos concepts using ptrace. In a future maybe it will be
possibile to reimplement the same idea of partial virtual machines using other
syscall tracing/filtering tools.
> 
> > I am not trying to port some tools to use user-space implemented
> > stacks or device
> > drivers/emulators, I am seeking to a general purpose approach.
> 
> You still intersect *each* syscall, why not to do the same for epoll_wait()
> and replace events with correct value?  Seems you do something similar
> already
> in a vu_wrap_poll.c: wo_epoll_wait(), right?
> 
> Don't get me wrong, I really want to understand whether everything really
> looks so bad without proposed change. It seems not, because the whole
> principle
> is based on intersection of each syscall, thus one more one less - it does
> not
> become more clean and especially does not look like a generic purpose
> solution,
> which you seek.  I may be wrong.
Your comments are precious. Thank you as I see that you have browsed into my code
to have a better view of the problem.

umvu is a modular tool. The executable of umvu is a dispatcher between the
system call requests coming from the user processes and modules (loaded at
run time as dynamic plug-in libraries)

+-----------------+         +------+      +---------------------------------+
+processes running|<------->| umvu |<---->| module (e.g. vufuse/vudev/vunet)|
+  "inside" umvu  |         +------+      +---------------------------------+
+-----------------+

Each module "registers" to umvu its "responsabilities"
It can register:
* a pathname (it will receive the syscall requests for that subtree)
* an address_family (all the syscall for sockets of that AF)
* major/minor numbers of a char or block device
* a systam call number
* ....
(each module can register more items)

The problem is not in the dialogue between umvu and the user processes
(<---> on the left in the diagram above) but between umvu and its modules
(<---> on the right). 
(wi_epoll_wait, wd_epoll_wait, wo_epoll_wait are the three wrappers used
 respectively before, during and after epoll_wait in the dialogue on the left with the
user processes).

When a user process generates a "read" syscall request and umvu discovers that
the fd is managed by vufuse, it forwards to vufuse a "read" request
having the same signature of the "read" system call (plus a trailing fdprivate arg for
syscalls using a fd. This arg can be used to speed up virtualization but can be safely
ignored).

If for the same "read" request the file descriptor is managed by vunet,
it is forwarded to vunet (actually it is converted to "recvmsg": if fd is a socket
recvmesg manages all read/recv/recvfrom/recvmsg, umvu tends to simplify the API 
by unifying similar system calls).

But what about poll/epoll/ppoll/select/pselect?
umvu takes care of all the system call requests but it needs a clean way to ask
modules some feedback when the expected events happen.

I think the clean way to do it is illustrated in test_modules/unrealsock.c
This is a test module: it registers the address families AF_INET, AF_INET6 and AF_NETLINK
then it forwards all the requests to the system calls.
In this way when this module is loaded all system calls requests 
related to sockets of the mentioned families are sent to unrealsock 
which uses the system call.

When unrealsock is loaded
   vu_insmod unrealsock
nothing apparently happens.
It is possible to run ssh, ask for ip addresses using "ip addr" etc.
The difference is that all the system call requests "pass through" umvu
and are sent to the real kernel by unrealsock.

We use modules like unrealsock to test umvu features in an independent manner (without
specific modules and submodules).

All the stuff related the virtualization of poll/ppoll/select/pselect/epoll is managed at
line 77:
		vu_syscall_handler(s, epoll_ctl) = epoll_ctl;

It says: "dear module if I need to get informed by you about an event on a file descriptor
I'll call an epoll_ctl". That's all.

Here it works as I am diverting the request to the system call which is able to generate
EPOLL_PRI and all other flavours of EPOLL_*.
When I want to write a module able to virtualize a network stack I need to write my epoll_ctl 
so I need a way to generate EPOLL_PRI when I receive an OOB packet, EPOLL_IN must be set 
when the stack receives a packet and reset when the buffer gets empty, etc.

For sure I could teach people aiming to write a module for umvu that there is a variable in
proxima centaury (I mean an helper function using data.ptr of struct epoll_event) and
then if they generate an EPOLL_IN it will be processed as if it were an EPOLL_PRI.
why?

Let us reverse the question.
Why not giving linux programmers the ability to have file descriptors on which arbitrary 
EPOLL events can be generated as they need/please? eventfd/EFD_VPOLL is (in my opinion) 
a simple way to implement this feature, we can converge on a different API.
It is not a Copernican revolution of the code. It seems to me that it is not a dangerous
feature (I could be wrong, please tell me!). It is possible to generate signals normally
sent by the kernel (e.g. SEGV), why it is not possible to generate an EPOLL_PRI?
There is at least one happy user of the new feature (me) and other may come.
It is just one (tiny) more degree of freedom.

Many kernel features were added for one usage (e.g. seccomp was created to lend/rent processing
power in a safe way) and then it was discovered that they are useful in other cases (e.g.
sandboxes for browsers).

Again thanks to everybody who will have read this message up to this point ;-)

	renzo

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox