Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Thomas Gleixner @ 2019-08-16 20:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <b719199a-ed91-610b-38bc-015a0749f600@kernel.org>

On Fri, 16 Aug 2019, Andy Lutomirski wrote:
> On 8/15/19 9:38 AM, Dmitry Safonov wrote:
> > As it has been discussed on timens RFC, adding a new conditional branch
> > `if (inside_time_ns)` on VDSO for all processes is undesirable.
> > It will add a penalty for everybody as branch predictor may mispredict
> > the jump. Also there are instruction cache lines wasted on cmp/jmp.
> > 
> > Those effects of introducing time namespace are very much unwanted
> > having in mind how much work have been spent on micro-optimisation
> > vdso code.
> > 
> > The propose is to allocate a second vdso code with dynamically
> > patched out (disabled by static_branch) timens code on boot time.
> > 
> > Allocate another vdso and copy original code.
> 
> 
> I'm unconvinced that any of this magic is wise.  I think you should make a
> special timens vvar page that causes the normal fastpath to fail (using a
> special vclock mode, a special seq value, or a special "last" value) and then
> make the failure path detect that timens is in use and use the timens path.

My initial suggestion still stands. Do that at compile time. It really does
not matter whether we have another 2 or 3 variants of vdso binaries.

Use it and be done with it. No special magic, just straight forward
decisions to use a timens capable VDSO or not.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCHv6 22/36] x86/vdso: Add offsets page in vvar
From: Dmitry Safonov @ 2019-08-16 20:20 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Safonov
  Cc: linux-kernel, Andrei Vagin, Adrian Reber, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86, Andrei Vagin
In-Reply-To: <alpine.DEB.2.21.1908152117231.1908@nanos.tec.linutronix.de>

Hi Thomas,

On 8/15/19 8:21 PM, Thomas Gleixner wrote:
> On Thu, 15 Aug 2019, Dmitry Safonov wrote:
>> ---
>>  arch/Kconfig                          |  5 +++
>>  arch/x86/Kconfig                      |  1 +
>>  arch/x86/entry/vdso/vdso-layout.lds.S |  9 ++++-
>>  arch/x86/entry/vdso/vdso2c.c          |  3 ++
>>  arch/x86/entry/vdso/vma.c             | 12 +++++++
>>  arch/x86/include/asm/vdso.h           |  1 +
>>  init/Kconfig                          |  1 +
>>  lib/vdso/gettimeofday.c               | 47 +++++++++++++++++++++++++++
> 
> This needs to be split into the generic lib/vdso part and then x86 making
> use of it.

Ok

>> +#ifdef CONFIG_TIME_NS
> 
> This should be COMPILE_WITH_TIME_NS and not CONFIG_TIME_NS
> 
>> +extern u8 timens_page
>> +	__attribute__((visibility("hidden")));
>> +
>> +notrace static __always_inline void clk_to_ns(clockid_t clk, struct __kernel_timespec *ts)
> 
> This needs notrace because?

Heh, well it's alive from the time it was in arch/x86.
I believe, functions there had it since commit 23adec554a76 ("x86: add
notrace annotations to vsyscall"). Probably, lib/vdso is compiled
without mcount in the Makefile somewhere.
Will drop.

[..]
>> +	/*
>> +	 * The kernel allows to set a negative offset only if the current clock
>> +	 * value in a namespace is positive, so the result tv_sec can't be
>> +	 * negative here.
>> +	 */
>> +	ts->tv_nsec += offset64->tv_nsec;
>> +	ts->tv_sec += offset64->tv_sec;
>> +	if (ts->tv_nsec >= NSEC_PER_SEC) {
>> +		ts->tv_nsec -= NSEC_PER_SEC;
>> +		ts->tv_sec++;
>> +	}
>> +	if (ts->tv_nsec < 0) {
>> +		ts->tv_nsec += NSEC_PER_SEC;
>> +		ts->tv_sec--;
>> +	}
> 
> That's broken for 32bit user space on 64bit hosts. On LE due to
> misalignment and on BE because 32bit will read always 0.

Ugh, will look into that.

Thanks,
          Dmitry

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Thomas Gleixner @ 2019-08-16 20:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <20190816195233.vzqqbqrivnooohq6@ast-mbp.dhcp.thefacebook.com>

Alexei,

On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> It's both of the above when 'systemd' is not taken literally.
> To earlier Thomas's point: the use case is not only about systemd.
> There are other containers management systems.

<SNIP>

> These daemons need to drop privileges to make the system safer == less
> prone to corruption due to bugs in themselves. Not necessary security
> bugs.

Let's take a step back.

While real usecases are helpful to understand a design decision, the design
needs to be usecase independent.

The kernel provides mechanisms, not policies. My impression of this whole
discussion is that it is policy driven. That's the wrong approach.

So let's look at the mechanisms which we have at hand:

 1) Capabilities
 
 2) SUID and dropping priviledges

 3) Seccomp and LSM

Now the real interesting questions are:

 A) What kind of restrictions does BPF allow? Is it a binary on/off or is
    there a more finegrained control of BPF functionality?

    TBH, I can't tell.

 B) Depending on the answer to #A what is the control possibility for
    #1/#2/#3 ?

Answering those questions gives us a real scope of what can be achieved
independent of use cases and wishful thought out policies.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-16 21:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Andy Lutomirski, Song Liu, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <B0364660-AD6A-4E5C-B04F-3B6DA78B4BBE@amacapital.net>

On Thu, Aug 15, 2019 at 05:54:59PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> 
> >> 
> >> I'm not sure why you draw the line for VMs -- they're just as buggy
> >> as anything else. Regardless, I reject this line of thinking: yes,
> >> all software is buggy, but that isn't a reason to give up.
> > 
> > hmm. are you saying you want kernel community to work towards
> > making containers (namespaces) being able to run arbitrary code
> > downloaded from the internet?
> 
> Yes.
> 
> As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)

exactly: "meltdown", "spectre", "sigh".
Side channels effectively stalled the work on secure containers.
And killed projects like sandstorm.
Why work on improving kaslr when there are several ways to
get kernel addresses through hw bugs? Patch mouse holes when roof is leaking ?
In case of unprivileged bpf I'm confident that all known holes are patched.
Until disclosures stop happening with the frequency they do now the time
of bpf developers is better spent on something other than unprivileged bpf.

> I’m suggesting that you engage the security community ...
> .. so that normal users can use bpf filtering

yes, but not soon. unfortunately.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Christian Brauner @ 2019-08-16 22:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu, Networking,
	bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190816214542.inpt6p655whc2ejw@ast-mbp.dhcp.thefacebook.com>

On Fri, Aug 16, 2019 at 02:45:44PM -0700, Alexei Starovoitov wrote:
> On Thu, Aug 15, 2019 at 05:54:59PM -0700, Andy Lutomirski wrote:
> > 
> > 
> > > On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > 
> > >> 
> > >> I'm not sure why you draw the line for VMs -- they're just as buggy
> > >> as anything else. Regardless, I reject this line of thinking: yes,
> > >> all software is buggy, but that isn't a reason to give up.
> > > 
> > > hmm. are you saying you want kernel community to work towards
> > > making containers (namespaces) being able to run arbitrary code
> > > downloaded from the internet?
> > 
> > Yes.

If I may weigh in here too: Yes. In fact, we already do that large
scale!

> > 
> > As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)
> 
> exactly: "meltdown", "spectre", "sigh".
> Side channels effectively stalled the work on secure containers.
> And killed projects like sandstorm.

If I may, Sandstorm's death has very likely nothing to do with
Meltdown/Spectre etc. since that should've also killed Qemu, Crosvm and
all the others in one fell swoop. It's also not a very good example (no
offense, Andy :)) and probably a bit dated.
We have LXD which is a full-fledged container manager that runs
*unprivileged system* containers on a large scale and is very much
alive. That is it runs systemd, openrc, what have you, i.e. simply
unmodifed distro images at this point.

It's used to run Linux workloads on all Chromebooks and in a bunch of
other places. Since its inception we did not have a single
*unprivileged* container to init userns/host breakout.
At this point in time the really bad CVEs are almost exclusively against
*privileged* containers (see this year's leading nomination for
container CVE grand mal of the year: CVE-2019-5736) which were never and
will never be considered root safe despite everyone pretending
otherwise.

> Why work on improving kaslr when there are several ways to
> get kernel addresses through hw bugs? Patch mouse holes when roof is leaking ?
> In case of unprivileged bpf I'm confident that all known holes are patched.
> Until disclosures stop happening with the frequency they do now the time
> of bpf developers is better spent on something other than unprivileged bpf.
> 
> > I’m suggesting that you engage the security community ...
> > .. so that normal users can use bpf filtering
> 
> yes, but not soon. unfortunately.

Tbh, I do not have a strong opinion on unprivileged bpf at this moment.
If the community thinks that the bits and pieces are not in place or the
timing is not right that's fine with me.
What we should make sure though is that we don't end up with something
halfbaked. And this /dev/bpf thing very much feels like a hack. If
unprivileged bpf is not a thing why bother with /dev/bpf or CAP_BPF at
all.

(The one usecase I'd care about is to extend seccomp to do pointer-based
syscall filtering. Whether or not that'd require (unprivileged) ebpf is
up for discussion at KSummit.)

Christian

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Dmitry Safonov @ 2019-08-16 22:47 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Dmitry Safonov, linux-kernel, Adrian Reber, Andrei Vagin,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1908162208190.1923@nanos.tec.linutronix.de>

Hi Andy, Thomas,

thank you very much for your time and the reviews, appreciate that.

On 8/16/19 9:10 PM, Thomas Gleixner wrote:
> On Fri, 16 Aug 2019, Andy Lutomirski wrote:
[..]
>> I'm unconvinced that any of this magic is wise.  I think you should make a
>> special timens vvar page that causes the normal fastpath to fail (using a
>> special vclock mode, a special seq value, or a special "last" value) and then
>> make the failure path detect that timens is in use and use the timens path.

I see. That's so clever, it haven't come on my mind.
Hmm, is that better because of the price of 5-byte NOP?
I'm a bit afraid to complicate that seq/vclock logic more..

So, what I'm driving at is would you change your mind if timens still
had boot-time dynamic patching but without introducing NOP?

We've got the point that you want to have no penalty at all for host
tasks [on RFC reply] by introducing `if` as trashing cache and branch
predictor, but I wasn't sure if NOP is also unacceptable.

At that moment we had a "plan B" with something that was half-wittedly
called "retcalls". The basic idea is that all that the timens brings
into vdso are calls clk_to_ns(), which are all placed in tails of
functions. So, if we could just memcpy() function returns in host vdso
over introduced time-ns tail calls - it would be a very same code that
lied before. There is a draft of those [1], that actually works on x86
on both mine and Andrei's machines.

Consulting with Andrei, I've decided that we better stick to
static_branchs as they are well known and have already backends for
other architectures. We probably mistakenly decided that a price of NOP
on scalar machines is negligible and would be acceptable.

Would those self-invented "retcalls" be something that could be reviewed
and potentially accepted in further iterations?

[1]
https://github.com/0x7f454c46/linux/commit/ab0eeb646f43#diff-c22e1e73e7367f371e1f12e3877ea12f

> My initial suggestion still stands. Do that at compile time. It really does
> not matter whether we have another 2 or 3 variants of vdso binaries.
> 
> Use it and be done with it. No special magic, just straight forward
> decisions to use a timens capable VDSO or not.

I believe that was something we did in version 1 of the patches set.
It doesn't sound like a rocket science to do, but it resulted in a
couple of ugly patches.

The post-attempt notes about downsides of doing it compile-time are:

1. There is additional .so for each vdso: 64-bit, ia32, x32. The same
for every architecture to-be supported. It adds rules in Makefiles. [2]
2. If we still intend to keep setns() working without exec(), function
entries on both host/namespace vdso should be aligned to each other [3].
That results in a patch to vdso2c to generate offsets [4, 5] and in
linker magic to align another vdso [6].
3. As unexpected consequence, we also need to align local functions on
vdso [7].

So, it might be all related to my lack of skills, but it seems to bring
some big amount of complexity into build process. And in my point of
view, major issue is that it would not scale easily when the day will
come and there will be a need to introduce another vdso.so. As I didn't
want to be the guy that happens to be remembered as "he wrote this
unmaintainable pile of garbage", I've taken dynamic patching approach
that is done once a boot time.

Regardless, we both with Andrei want to improve the patches set and make
it acceptable and easy to maintain in future. I hope, that our effort to
do that is visible through evolution of patches. And we're very glad
that we've constructive critics and such patient maintainers.
So, if I'm mistaken in those points about compile-time vdso(s), or you
have in mind a plan how-to avoid those, I'd appreciate and rework it to
that direction.

[2] lkml.kernel.org/r/20190206001107.16488-14-dima@arista.com
[3] lkml.kernel.org/r/20190206001107.16488-15-dima@arista.com
[4] lkml.kernel.org/r/20190206001107.16488-16-dima@arista.com
[5] lkml.kernel.org/r/20190206001107.16488-17-dima@arista.com
[6] lkml.kernel.org/r/20190206001107.16488-19-dima@arista.com
[7] lkml.kernel.org/r/20190206001107.16488-20-dima@arista.com

Thanks,
          Dmitry

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-17 15:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <alpine.DEB.2.21.1908162211270.1923@nanos.tec.linutronix.de>

On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
> Alexei,
> 
> On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> > It's both of the above when 'systemd' is not taken literally.
> > To earlier Thomas's point: the use case is not only about systemd.
> > There are other containers management systems.
> 
> <SNIP>
> 
> > These daemons need to drop privileges to make the system safer == less
> > prone to corruption due to bugs in themselves. Not necessary security
> > bugs.
> 
> Let's take a step back.
> 
> While real usecases are helpful to understand a design decision, the design
> needs to be usecase independent.
> 
> The kernel provides mechanisms, not policies. My impression of this whole
> discussion is that it is policy driven. That's the wrong approach.

not sure what you mean by 'policy driven'.
Proposed CAP_BPF is a policy?

My desire to do kernel.unprivileged_bpf_disabled=1 is driven by
text in Documentation/x86/mds.rst which says:
"There is one exception, which is untrusted BPF. The functionality of
untrusted BPF is limited, but it needs to be thoroughly investigated
whether it can be used to create such a construct."

commit 6a9e52927251 ("x86/speculation/mds: Add mds_clear_cpu_buffers()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>

The way I read this text:
- there is a concern that mds is exploitable via bpf
- there is a desire to investigate to address this concern

I'm committed to help with the investigation.

In the mean time I propose a path to do
kernel.unprivileged_bpf_disabled=1 which is CAP_BPF.

Can kernel.unprivileged_bpf_disabled=1 be used now?
Yes, but it will weaken overall system security because things that
use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
to move to stronger CAP_SYS_ADMIN.

With CAP_BPF both load and attach would happen under CAP_BPF
instead of CAP_SYS_ADMIN.

> So let's look at the mechanisms which we have at hand:
> 
>  1) Capabilities
>  
>  2) SUID and dropping priviledges
> 
>  3) Seccomp and LSM
> 
> Now the real interesting questions are:
> 
>  A) What kind of restrictions does BPF allow? Is it a binary on/off or is
>     there a more finegrained control of BPF functionality?
> 
>     TBH, I can't tell.
> 
>  B) Depending on the answer to #A what is the control possibility for
>     #1/#2/#3 ?

Can any of the mechanisms 1/2/3 address the concern in mds.rst?

I believe Andy wants to expand the attack surface when
kernel.unprivileged_bpf_disabled=0
Before that happens I'd like the community to work on addressing the text above.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-17 15:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu, Networking,
	bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190816222252.a7zizw7azkxnv3ot@wittgenstein>

On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
> 
> (The one usecase I'd care about is to extend seccomp to do pointer-based
> syscall filtering. Whether or not that'd require (unprivileged) ebpf is
> up for discussion at KSummit.)

Kees have been always against using ebpf in seccomp. I believe he still
holds this opinion. Until he changes his mind let's stop bringing seccomp
as a use case for unpriv bpf.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Christian Brauner @ 2019-08-17 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu, Networking,
	bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190817150843.4vsmzpwpcvzndjld@ast-mbp>

On August 17, 2019 5:08:45 PM GMT+02:00, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
>> 
>> (The one usecase I'd care about is to extend seccomp to do
>pointer-based
>> syscall filtering. Whether or not that'd require (unprivileged) ebpf
>is
>> up for discussion at KSummit.)
>
>Kees have been always against using ebpf in seccomp. I believe he still
>holds this opinion. Until he changes his mind let's stop bringing
>seccomp
>as a use case for unpriv bpf.

That's why I said "whether or not".
For the record, I do prefer a non-unpriv-ebpf way.
It's still something that will most surely come up in the discussion though.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-17 15:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu, Networking,
	bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <61B88085-9FBB-41E6-9783-324E445E428D@ubuntu.com>

On Sat, Aug 17, 2019 at 05:16:53PM +0200, Christian Brauner wrote:
> On August 17, 2019 5:08:45 PM GMT+02:00, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
> >> 
> >> (The one usecase I'd care about is to extend seccomp to do
> >pointer-based
> >> syscall filtering. Whether or not that'd require (unprivileged) ebpf
> >is
> >> up for discussion at KSummit.)
> >
> >Kees have been always against using ebpf in seccomp. I believe he still
> >holds this opinion. Until he changes his mind let's stop bringing
> >seccomp
> >as a use case for unpriv bpf.
> 
> That's why I said "whether or not".
> For the record, I do prefer a non-unpriv-ebpf way.
> It's still something that will most surely come up in the discussion though.

It's very un-kernely way to defer to in-person meetings.
If there is anything to discuss please discuss it on the public mailing list.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Christian Brauner @ 2019-08-17 15:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu, Networking,
	bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190817153652.zfcsklt474j72dzm@ast-mbp.dhcp.thefacebook.com>

On August 17, 2019 5:36:54 PM GMT+02:00, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>On Sat, Aug 17, 2019 at 05:16:53PM +0200, Christian Brauner wrote:
>> On August 17, 2019 5:08:45 PM GMT+02:00, Alexei Starovoitov
><alexei.starovoitov@gmail.com> wrote:
>> >On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
>> >> 
>> >> (The one usecase I'd care about is to extend seccomp to do
>> >pointer-based
>> >> syscall filtering. Whether or not that'd require (unprivileged)
>ebpf
>> >is
>> >> up for discussion at KSummit.)
>> >
>> >Kees have been always against using ebpf in seccomp. I believe he
>still
>> >holds this opinion. Until he changes his mind let's stop bringing
>> >seccomp
>> >as a use case for unpriv bpf.
>> 
>> That's why I said "whether or not".
>> For the record, I do prefer a non-unpriv-ebpf way.
>> It's still something that will most surely come up in the discussion
>though.
>
>It's very un-kernely way to defer to in-person meetings.
>If there is anything to discuss please discuss it on the public mailing
>list.

https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2019-July/006699.html

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-17 15:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Jordan Glover, Andy Lutomirski,
	Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190817150245.xxzxqjpvgqsxmloe@ast-mbp>

> On Aug 17, 2019, at 8:02 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 
> Can any of the mechanisms 1/2/3 address the concern in mds.rst?
> 

seccomp() can. It’s straightforward to use seccomp to disable bpf() outright for a process tree.  In this regard, bpf() isn’t particularly unique — it’s a system call that exposes some attack surface and that isn’t required by most programs for basic functionality.

At LPC this year, there will be a discussion about seccomp improvements that will, among other things, offer fiber-grained control. It’s quite likely, for example, that seccomp will soon be able to enable and disable specific map types or attach types.  The exact mechanism isn’t decided yet,  but I think everyone expects that this is mostly a design problem, not an implementation problem.

This is off topic for the current thread, but it could be useful to allow bpf programs to be loaded from files directly (i.e. pass an fd to a file into bpf() to load the program), which would enable LSMs to check that the file is appropriately labeled. This would dramatically raise the bar for exploitation of verifier bugs or speculation attacks, since anyone trying to exploit it would need to get the bpf payload through LSM policy first.

> I believe Andy wants to expand the attack surface when
> kernel.unprivileged_bpf_disabled=0
> Before that happens I'd like the community to work on addressing the text above.
> 

Not by much. BPF maps are already largely exposed to unprivileged code (when unprivileged_bpf_disabled=0).  The attack surface is there, and they’re arguably even more exposed than they should be.  My patch 1 earlier was about locking these interfaces down.

Similarly, my suggestions about reworking cgroup attach and program load don’t actually allow fully unprivileged users to run arbitrary bpf() programs [0] — under my proposal, to attach a bpf cgroup program, you need a delegated cgroup. The mechanism could be extended by a requirement that a privileged cgroup manager explicitly enable certain attach types for a delegated subtree.

A cgroup knob to turn unprivileged bpf on and off for tasks in the cgroup might actually be quite useful.

[0] on some thought, the test run mechanism should probably remain root-only.

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Thomas Gleixner @ 2019-08-18 16:21 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Andy Lutomirski, Dmitry Safonov, linux-kernel, Adrian Reber,
	Andrei Vagin, Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <483678c7-7687-5445-f09e-e45e9460d559@gmail.com>

Dmitry,

On Fri, 16 Aug 2019, Dmitry Safonov wrote:
> On 8/16/19 9:10 PM, Thomas Gleixner wrote:
> > On Fri, 16 Aug 2019, Andy Lutomirski wrote:
> [..]
> >> I'm unconvinced that any of this magic is wise.  I think you should make a
> >> special timens vvar page that causes the normal fastpath to fail (using a
> >> special vclock mode, a special seq value, or a special "last" value) and then
> >> make the failure path detect that timens is in use and use the timens path.
> 
> I see. That's so clever, it haven't come on my mind.
> Hmm, is that better because of the price of 5-byte NOP?
> I'm a bit afraid to complicate that seq/vclock logic more..

See below.

> > My initial suggestion still stands. Do that at compile time. It really does
> > not matter whether we have another 2 or 3 variants of vdso binaries.
> > 
> > Use it and be done with it. No special magic, just straight forward
> > decisions to use a timens capable VDSO or not.
> 
> I believe that was something we did in version 1 of the patches set.
> It doesn't sound like a rocket science to do, but it resulted in a
> couple of ugly patches.
> 
> The post-attempt notes about downsides of doing it compile-time are:
> 
> 1. There is additional .so for each vdso: 64-bit, ia32, x32. The same
> for every architecture to-be supported. It adds rules in Makefiles. [2]
> 2. If we still intend to keep setns() working without exec(), function
> entries on both host/namespace vdso should be aligned to each other [3].
> That results in a patch to vdso2c to generate offsets [4, 5] and in
> linker magic to align another vdso [6].
> 3. As unexpected consequence, we also need to align local functions on
> vdso [7].
> 
> So, it might be all related to my lack of skills, but it seems to bring
> some big amount of complexity into build process. And in my point of
> view, major issue is that it would not scale easily when the day will
> come and there will be a need to introduce another vdso.so. As I didn't
> want to be the guy that happens to be remembered as "he wrote this
> unmaintainable pile of garbage", I've taken dynamic patching approach
> that is done once a boot time.

Fair enough. Did not think about that part.
 
> Regardless, we both with Andrei want to improve the patches set and make
> it acceptable and easy to maintain in future. I hope, that our effort to
> do that is visible through evolution of patches. And we're very glad
> that we've constructive critics and such patient maintainers.

I'm happy to review well written stuff which makes progress and takes
review comments into account or the submitter discusses them for
resolution.

> So, if I'm mistaken in those points about compile-time vdso(s), or you
> have in mind a plan how-to avoid those, I'd appreciate and rework it to
> that direction.

Coming back to Andy's idea. Create your time namespace page as an exact
copy of the vdso data page. When the page is created do:

     	 memset(p->vdso_data, 0, sizeof(p->vdso_data));
     	 p->vdso_data[0].clock_mode = CLOCK_TIMENS;
	 p->vdso_data[0].seq = 1;
 
	 /* Store the namespace offsets in basetime */
	 p->vdso_data[0].basetime[CLOCK_MONOTONIC].sec = myns->mono_sec;
	 p->vdso_data[0].basetime[CLOCK_MONOTONIC].nsec = myns->mono_nsec;
	 p->vdso_data[0].basetime[CLOCK_BOOTTIME].sec = myns->boot_sec;
	 p->vdso_data[0].basetime[CLOCK_BOOTTIME].nsec = myns->boot_nsec;

     	 p->vdso_data[1].clock_mode = CLOCK_TIMENS;
	 p->vdso_data[1].seq = 1;

For a normal task the VVAR pages are installed in the normal ordering:

       VVAR
       PVCLOCK
       HVCLOCK
       TIMENS	<- Not really required

Now for a timens task you install the pages in the following order

       TIMENS
       PVCLOCK
       HVCLOCK
       VVAR

The check for vdso_data->clock_mode is in the unlikely path of the now open
coded seq begin magic. So for the non-timens case most of the time 'seq' is
even, so the branch is not taken.

If 'seq' is odd, i.e. a concurrent update is in progress, the extra check
for vdso_data->clock_mode is a non-issue. The task is spin waiting for the
update to finish and for 'seq' to become even anyway.

Patch below. I tested this with the normal order and by installing a
'timens' page unconditionally for all processes. I'll reply with the timens
testing hacks so you can see what I did.

The test results are pretty good.

    	 	 Base (upstream)  + VDSO patch	+ timens page

MONO		 30ns 		    30ns 	  32ns
REAL		 30ns		    30ns	  32ns
BOOT		 30ns		    30ns	  32ns
MONOCOARSE	  7ns		     8ns	  10ns
REALCOARSE	  7ns		     8ns	  10ns
TAI		 30ns		    30ns	  32ns
MONORAW		 30ns		    30ns	  32ns

So except for the coarse clocks there is no change when the timens page is
not used, i.e. the regular VVAR page is at the proper place. But that's on
one machine, a different one showed an effect in the noise range. I'm not
worried about that as the VDSO behaviour varies depending on micro
architecture anyway.

With timens enabled the performance hit (cache hot microbenchmark) is
somewhere in the range of 5-7% when looking at the perf counters
numbers. The hit for the coarse accessors is larger obviously because the
overhead is constant time.

I did a quick comparison of the array vs. switch case (what you used for
your clk_to_ns() helper). The switch case is slower.

So I rather go for the array based approach. It's simpler code and the
I-cache footprint is smaller and no conditional branches involved.

That means your timens_to_host() and host_to_timens() conversion functions
should just use that special VDSO page and do the same array based
unconditional add/sub of the clock specific offset.

Thanks,

	tglx

8<-----------------
Subject: lib/vdso: Prepare for time namespace support
From: Thomas Gleixner <tglx@linutronix.de>
Date: Sun, 18 Aug 2019 16:53:06 +0200

To support time namespaces in the vdso with a minimal impact on regular non
time namespace affected tasks, the namespace handling needs to be hidden in
a slow path.

The most obvious place is vdso_seq_begin(). If a task belongs to a time
namespace then the VVAR page which contains the system wide vdso data is
replaced with a namespace specific page which has the same layout as the
VVAR page. That page has vdso_data->seq set to 1 to enforce the slow path
and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time
namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the vdso data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.

If VDSO time namespace support is disabled the whole magic is compiled out.

Initial testing shows that the disabled case is almost identical to the
host case which does not take the slow timens path. With the special timens
page installed the performance hit is constant time and in the range of
5-7%.

For the vdso functions which are not using the sequence count an
unconditional check for vdso_data->clock_mode is added which switches to
the real vdso when the clock_mode is VCLOCK_TIMENS.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/time.h    |    6 ++
 include/vdso/datapage.h |   19 ++++++-
 lib/vdso/Kconfig        |    6 ++
 lib/vdso/gettimeofday.c |  128 ++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 155 insertions(+), 4 deletions(-)

--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -96,4 +96,10 @@ static inline bool itimerspec64_valid(co
  */
 #define time_after32(a, b)	((s32)((u32)(b) - (u32)(a)) < 0)
 #define time_before32(b, a)	time_after32(a, b)
+
+struct timens_offset {
+	s64	sec;
+	u64	nsec;
+};
+
 #endif
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -21,6 +21,8 @@
 #define CS_RAW		1
 #define CS_BASES	(CS_RAW + 1)
 
+#define VCLOCK_TIMENS	UINT_MAX
+
 /**
  * struct vdso_timestamp - basetime per clock_id
  * @sec:	seconds
@@ -48,6 +50,7 @@ struct vdso_timestamp {
  * @mult:		clocksource multiplier
  * @shift:		clocksource shift
  * @basetime[clock_id]:	basetime per clock_id
+ * @offset[clock_id]:	time namespace offset per clock_id
  * @tz_minuteswest:	minutes west of Greenwich
  * @tz_dsttime:		type of DST correction
  * @hrtimer_res:	hrtimer resolution
@@ -55,6 +58,17 @@ struct vdso_timestamp {
  *
  * vdso_data will be accessed by 64 bit and compat code at the same time
  * so we should be careful before modifying this structure.
+ *
+ * @basetime is used to store the base time for the system wide time getter
+ * VVAR page.
+ *
+ * @offset is used by the special time namespace VVAR pages which are
+ * installed instead of the real VVAR page. These namespace pages must set
+ * @seq to 1 and @clock_mode to VLOCK_TIMENS to force the code into the
+ * time namespace slow path. The namespace aware functions retrieve the
+ * real system wide VVAR page, read host time and add the per clock offset.
+ * For clocks which are not affected by time namespace adjustement the
+ * offset must be zero.
  */
 struct vdso_data {
 	u32			seq;
@@ -65,7 +79,10 @@ struct vdso_data {
 	u32			mult;
 	u32			shift;
 
-	struct vdso_timestamp	basetime[VDSO_BASES];
+	union {
+		struct vdso_timestamp	basetime[VDSO_BASES];
+		struct timens_offset	offset[VDSO_BASES];
+	};
 
 	s32			tz_minuteswest;
 	s32			tz_dsttime;
--- a/lib/vdso/Kconfig
+++ b/lib/vdso/Kconfig
@@ -33,4 +33,10 @@ config CROSS_COMPILE_COMPAT_VDSO
 	  If a 64 bit compiler (i.e. x86_64) can compile the VDSO for
 	  32 bit, it does not need to define this parameter.
 
+config VDSO_TIMENS
+	bool
+	help
+	  Selected by architectures which support time namespaces in the
+	  VDSO
+
 endif
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -38,6 +38,51 @@ u64 vdso_calc_delta(u64 cycles, u64 last
 }
 #endif
 
+#ifndef CONFIG_VDSO_TIMENS
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
+{
+	return NULL;
+}
+#endif
+
+static int do_hres_ns(const struct vdso_data *vdns, clockid_t clk,
+		      struct __kernel_timespec *ts)
+{
+	const struct vdso_data *vd = __arch_get_timens_vdso_data(vdns);
+	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
+	const struct timens_offset *offs = &vdns->offset[clk];
+	u64 cycles, last, ns;
+	s64 sec;
+	u32 seq;
+
+	do {
+		seq = vdso_read_begin(vd);
+		cycles = __arch_get_hw_counter(vd->clock_mode);
+		ns = vdso_ts->nsec;
+		last = vd->cycle_last;
+		if (unlikely((s64)cycles < 0))
+			return -1;
+
+		ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult);
+		ns >>= vd->shift;
+		sec = vdso_ts->sec;
+	} while (unlikely(vdso_read_retry(vd, seq)));
+
+	/* Add the namespace offset */
+	sec += offs->sec;
+	ns += offs->nsec;
+
+	/*
+	 * Do this outside the loop: a race inside the loop could result
+	 * in __iter_div_u64_rem() being extremely slow.
+	 */
+	ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return 0;
+}
+
 static int do_hres(const struct vdso_data *vd, clockid_t clk,
 		   struct __kernel_timespec *ts)
 {
@@ -46,7 +91,28 @@ static int do_hres(const struct vdso_dat
 	u32 seq;
 
 	do {
-		seq = vdso_read_begin(vd);
+		/*
+		 * Open coded to handle VCLOCK_TIMENS. Time namespace
+		 * enabled tasks have a special VVAR page installed which
+		 * has vd->seq set to 1 and vd->clock_mode set to
+		 * VCLOCK_TIMENS. For non time namespace affected tasks
+		 * this does not affect performance because if vd->seq is
+		 * odd, i.e. a concurrent update is in progress the extra
+		 * check for vd->clock_mode is just a few extra
+		 * instructions while spin waiting for vd->seq to become
+		 * even again.
+		 */
+		while (1) {
+			seq = READ_ONCE(vd->seq);
+			if (likely(!(seq & 1)))
+				break;
+			if (IS_ENABLED(CONFIG_VDSO_TIMENS) &&
+			    vd->clock_mode == VCLOCK_TIMENS)
+				return do_hres_ns(vd, clk, ts);
+			cpu_relax();
+		}
+		smp_rmb();
+
 		cycles = __arch_get_hw_counter(vd->clock_mode);
 		ns = vdso_ts->nsec;
 		last = vd->cycle_last;
@@ -68,6 +134,34 @@ static int do_hres(const struct vdso_dat
 	return 0;
 }
 
+static void do_coarse_ns(const struct vdso_data *vdns, clockid_t clk,
+			 struct __kernel_timespec *ts)
+{
+	const struct vdso_data *vd = __arch_get_timens_vdso_data(vdns);
+	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
+	const struct timens_offset *offs = &vdns->offset[clk];
+	u64 nsec;
+	s64 sec;
+	s32 seq;
+
+	do {
+		seq = vdso_read_begin(vd);
+		sec = vdso_ts->sec;
+		nsec = vdso_ts->nsec;
+	} while (unlikely(vdso_read_retry(vd, seq)));
+
+	/* Add the namespace offset */
+	sec += offs->sec;
+	nsec += offs->nsec;
+
+	/*
+	 * Do this outside the loop: a race inside the loop could result
+	 * in __iter_div_u64_rem() being extremely slow.
+	 */
+	ts->tv_sec = sec + __iter_div_u64_rem(nsec, NSEC_PER_SEC, &nsec);
+	ts->tv_nsec = nsec;
+}
+
 static void do_coarse(const struct vdso_data *vd, clockid_t clk,
 		      struct __kernel_timespec *ts)
 {
@@ -75,7 +169,23 @@ static void do_coarse(const struct vdso_
 	u32 seq;
 
 	do {
-		seq = vdso_read_begin(vd);
+		/*
+		 * Open coded to handle VCLOCK_TIMENS. See comment in
+		 * do_hres().
+		 */
+		while (1) {
+			seq = READ_ONCE(vd->seq);
+			if (likely(!(seq & 1)))
+				break;
+			if (IS_ENABLED(CONFIG_VDSO_TIMENS) &&
+			    vd->clock_mode == VCLOCK_TIMENS) {
+				do_coarse_ns(vd, clk, ts);
+				return;
+			}
+			cpu_relax();
+		}
+		smp_rmb();
+
 		ts->tv_sec = vdso_ts->sec;
 		ts->tv_nsec = vdso_ts->nsec;
 	} while (unlikely(vdso_read_retry(vd, seq)));
@@ -156,6 +266,10 @@ static __maybe_unused int
 	}
 
 	if (unlikely(tz != NULL)) {
+		if (IS_ENABLED(CONFIG_VDSO_TIMENS) &&
+		    vd->clock_mode == VCLOCK_TIMENS)
+			vd = __arch_get_timens_vdso_data(vd);
+
 		tz->tz_minuteswest = vd[CS_HRES_COARSE].tz_minuteswest;
 		tz->tz_dsttime = vd[CS_HRES_COARSE].tz_dsttime;
 	}
@@ -167,7 +281,12 @@ static __maybe_unused int
 static __maybe_unused time_t __cvdso_time(time_t *time)
 {
 	const struct vdso_data *vd = __arch_get_vdso_data();
-	time_t t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec);
+	time_t t;
+
+	if (IS_ENABLED(CONFIG_VDSO_TIMENS) && vd->clock_mode == VCLOCK_TIMENS)
+		vd = __arch_get_timens_vdso_data(vd);
+
+	t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec);
 
 	if (time)
 		*time = t;
@@ -189,6 +308,9 @@ int __cvdso_clock_getres_common(clockid_
 	if (unlikely((u32) clock >= MAX_CLOCKS))
 		return -1;
 
+	if (IS_ENABLED(CONFIG_VDSO_TIMENS) && vd->clock_mode == VCLOCK_TIMENS)
+		vd = __arch_get_timens_vdso_data(vd);
+
 	hrtimer_res = READ_ONCE(vd[CS_HRES_COARSE].hrtimer_res);
 	/*
 	 * Convert the clockid to a bitmask and use it to check which

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Thomas Gleixner @ 2019-08-18 16:24 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Andy Lutomirski, Dmitry Safonov, linux-kernel, Adrian Reber,
	Andrei Vagin, Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1908171709360.1923@nanos.tec.linutronix.de>

On Sun, 18 Aug 2019, Thomas Gleixner wrote:
> 
> Patch below. I tested this with the normal order and by installing a
> 'timens' page unconditionally for all processes. I'll reply with the timens
> testing hacks so you can see what I did.

First hack...

Thanks,

	tglx

8<-----------------

Subject: x86/vdso: Hack to enable time name space  for testing
From: Thomas Gleixner <tglx@linutronix.de>
Date: Sun, 18 Aug 2019 16:42:26 +0200

Select CONFIG_VDSO_TIMENS and prepare for the extra magic time namespace
vvar page. The fault handler is not handling it yet as the path is never
taken (hopefully)

Not-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig                         |    1 +
 arch/x86/entry/vdso/vdso-layout.lds.S    |    3 ++-
 arch/x86/entry/vdso/vdso2c.c             |    3 +++
 arch/x86/include/asm/vdso.h              |    1 +
 arch/x86/include/asm/vdso/gettimeofday.h |    9 +++++++++
 5 files changed, 16 insertions(+), 1 deletion(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -224,6 +224,7 @@ config X86
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
 	select PROC_PID_ARCH_STATUS		if PROC_FS
+	select VDSO_TIMENS
 
 config INSTRUCTION_DECODER
 	def_bool y
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -16,7 +16,7 @@ SECTIONS
 	 * segment.
 	 */
 
-	vvar_start = . - 3 * PAGE_SIZE;
+	vvar_start = . - 4 * PAGE_SIZE;
 	vvar_page = vvar_start;
 
 	/* Place all vvars at the offsets in asm/vvar.h. */
@@ -28,6 +28,7 @@ SECTIONS
 
 	pvclock_page = vvar_start + PAGE_SIZE;
 	hvclock_page = vvar_start + 2 * PAGE_SIZE;
+	timens_page = vvar_start + 3 * PAGE_SIZE;
 
 	. = SIZEOF_HEADERS;
 
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -75,12 +75,14 @@ enum {
 	sym_vvar_page,
 	sym_pvclock_page,
 	sym_hvclock_page,
+	sym_timens_page,
 };
 
 const int special_pages[] = {
 	sym_vvar_page,
 	sym_pvclock_page,
 	sym_hvclock_page,
+	sym_timens_page,
 };
 
 struct vdso_sym {
@@ -93,6 +95,7 @@ struct vdso_sym required_syms[] = {
 	[sym_vvar_page] = {"vvar_page", true},
 	[sym_pvclock_page] = {"pvclock_page", true},
 	[sym_hvclock_page] = {"hvclock_page", true},
+	[sym_timens_page] = {"timens_page", true},
 	{"VDSO32_NOTE_MASK", true},
 	{"__kernel_vsyscall", true},
 	{"__kernel_sigreturn", true},
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -21,6 +21,7 @@ struct vdso_image {
 	long sym_vvar_page;
 	long sym_pvclock_page;
 	long sym_hvclock_page;
+	long sym_timens_page;
 	long sym_VDSO32_NOTE_MASK;
 	long sym___kernel_sigreturn;
 	long sym___kernel_rt_sigreturn;
--- a/arch/x86/include/asm/vdso/gettimeofday.h
+++ b/arch/x86/include/asm/vdso/gettimeofday.h
@@ -265,6 +265,15 @@ static __always_inline const struct vdso
 	return __vdso_data;
 }
 
+/* HACK .... */
+#define VDSO_TIMENS_PAGEOFFSET		(3 * PAGE_SIZE)
+
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
+{
+	return (void *)vd + VDSO_TIMENS_PAGEOFFSET;
+}
+
 /*
  * x86 specific delta calculation.
  *

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Thomas Gleixner @ 2019-08-18 16:29 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Andy Lutomirski, Dmitry Safonov, linux-kernel, Adrian Reber,
	Andrei Vagin, Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1908181823010.1923@nanos.tec.linutronix.de>

On Sun, 18 Aug 2019, Thomas Gleixner wrote:

> On Sun, 18 Aug 2019, Thomas Gleixner wrote:
> > 
> > Patch below. I tested this with the normal order and by installing a
> > 'timens' page unconditionally for all processes. I'll reply with the timens
> > testing hacks so you can see what I did.
> 
> First hack...

And the second one.

Thanks,

	tglx

8<-----------------

Subject: x86/vdso: Hack to test the time namespace path
From: Thomas Gleixner <tglx@linutronix.de>
Date: Sun, 18 Aug 2019 16:49:00 +0200

Install a special TIMENS vvar page which forces the VDSO to take the time
namespace path for testing.

Not-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vma.c |   34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -84,6 +84,33 @@ static int vdso_mremap(const struct vm_s
 	return 0;
 }
 
+/* Hack for testing */
+static struct page *vdso_timens_page;
+
+static int __init init_vdso_timens(void)
+{
+	struct vdso_data *vdata;
+	void *va;
+
+	vdso_timens_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!vdso_timens_page)
+		return -ENOMEM;
+
+	/* Hack: vdso data is at offset 0x80 in the page ... */
+	va = page_address(vdso_timens_page);
+	vdata = (struct vdso_data *)(va + 0x80);
+
+	vdata[0].seq = 1;
+	vdata[0].clock_mode = UINT_MAX;
+	vdata[1].seq = 1;
+	vdata[1].clock_mode = UINT_MAX;
+
+	/* All offsets are zero */
+
+	return 0;
+}
+subsys_initcall(init_vdso_timens);
+
 static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
 		      struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -106,7 +133,7 @@ static vm_fault_t vvar_fault(const struc
 	if (sym_offset == 0)
 		return VM_FAULT_SIGBUS;
 
-	if (sym_offset == image->sym_vvar_page) {
+	if (sym_offset == image->sym_timens_page) {
 		return vmf_insert_pfn(vma, vmf->address,
 				__pa_symbol(&__vvar_page) >> PAGE_SHIFT);
 	} else if (sym_offset == image->sym_pvclock_page) {
@@ -123,6 +150,11 @@ static vm_fault_t vvar_fault(const struc
 		if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK))
 			return vmf_insert_pfn(vma, vmf->address,
 					vmalloc_to_pfn(tsc_pg));
+	} else if (sym_offset == image->sym_vvar_page) {
+		unsigned long pfn;
+
+		pfn = page_to_pfn(vdso_timens_page);
+		return vmf_insert_pfn(vma, vmf->address, pfn);
 	}
 
 	return VM_FAULT_SIGBUS;

^ permalink raw reply

* Re: [PATCH v5 0/9] FPGA DFL updates
From: Wu Hao @ 2019-08-19  5:31 UTC (permalink / raw)
  To: gregkh, mdf, linux-fpga; +Cc: linux-kernel, linux-api, linux-doc, atull
In-Reply-To: <1565578204-13969-1-git-send-email-hao.wu@intel.com>

On Mon, Aug 12, 2019 at 10:49:55AM +0800, Wu Hao wrote:
> Hi Greg,
> 
> This is v5 patchset which adds more features to FPGA DFL. Marjor changes
> against v4 are sysfs related code rework to address comments on v4.
> 
> Please help to take a look. Thanks!

Hi Greg,

Did you get a chance to take a look at this new version patchset? :)

Thanks
Hao

> 
> Main changes from v4:
>   - convert code to use dev_groups for sysfs entries (#2, #3, #4, #6, #8).
>   - clean up for empty init function after remove sysfs add/remove (#1).
>   - introduce is_visible for sysfs groups (#3, #4, #6, #8).
>   - remove revision sysfs entries (#4, #6, #8).
>   - improve naming on shared functions (#5).
>   - reorganize sysfs entries for port and fme error reporting (#6, #8).
> 
> Main changes from v3:
>   - drop avx512 partail reconfiguration patch for now.
>   - split dfl_fpga_cdev_config_port to 2 functions *_release/assign_port
>     (#1).
>   - split __dfl_fpga_cdev_config_port_vf into 2 functions with locking
>     added (#2).
>   - improve description in sysfs doc to avoid misunderstanding (#3).
>   - switch to boolean in sysfs entry store function (#3).
>   - remove dev_dbg in init/uinit callback function (#7, #9, #11).
>   - remove uinit callback which does does nothing (#8, #9)
> 
> Main changes from v2:
>   - update kernel version/date in sysfs doc (patch #4, #5, #8, #10, #11).
>   - add back Documentation patch (patch #12).
> 
> Main changes from v1:
>   - remove DRV/MODULE_VERSION modifications. (patch #1, #3, #4, #6)
>   - remove argsz from new ioctls. (patch #2)
>   - replace sysfs_create/remove_* with device_add/remove_* for sysfs entries.
>     (patch #5, #8, #11)
> 
> Wu Hao (9):
>   fpga: dfl: make init callback optional
>   fpga: dfl: fme: convert platform_driver to use dev_groups
>   fpga: dfl: afu: convert platform_driver to use dev_groups
>   fpga: dfl: afu: add userclock sysfs interfaces.
>   fpga: dfl: afu: expose __afu_port_enable/disable function.
>   fpga: dfl: afu: add error reporting support.
>   fpga: dfl: afu: add STP (SignalTap) support
>   fpga: dfl: fme: add global error reporting support
>   Documentation: fpga: dfl: add descriptions for virtualization and new
>     interfaces.
> 
>  Documentation/ABI/testing/sysfs-platform-dfl-fme  |  62 ++++
>  Documentation/ABI/testing/sysfs-platform-dfl-port |  53 ++++
>  Documentation/fpga/dfl.rst                        | 105 +++++++
>  drivers/fpga/Makefile                             |   3 +-
>  drivers/fpga/dfl-afu-error.c                      | 230 ++++++++++++++
>  drivers/fpga/dfl-afu-main.c                       | 230 +++++++++++---
>  drivers/fpga/dfl-afu.h                            |   9 +
>  drivers/fpga/dfl-fme-error.c                      | 359 ++++++++++++++++++++++
>  drivers/fpga/dfl-fme-main.c                       |  42 +--
>  drivers/fpga/dfl-fme.h                            |   3 +
>  drivers/fpga/dfl.c                                |  10 +-
>  drivers/fpga/dfl.h                                |   9 +
>  12 files changed, 1041 insertions(+), 74 deletions(-)
>  create mode 100644 drivers/fpga/dfl-afu-error.c
>  create mode 100644 drivers/fpga/dfl-fme-error.c
> 
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Thomas Gleixner @ 2019-08-19  9:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <20190817150245.xxzxqjpvgqsxmloe@ast-mbp>

Alexei,

On Sat, 17 Aug 2019, Alexei Starovoitov wrote:
> On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
> > On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> > While real usecases are helpful to understand a design decision, the design
> > needs to be usecase independent.
> > 
> > The kernel provides mechanisms, not policies. My impression of this whole
> > discussion is that it is policy driven. That's the wrong approach.
> 
> not sure what you mean by 'policy driven'.
> Proposed CAP_BPF is a policy?

I was referring to the discussion as a whole.

> Can kernel.unprivileged_bpf_disabled=1 be used now?
> Yes, but it will weaken overall system security because things that
> use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
> to move to stronger CAP_SYS_ADMIN.
> 
> With CAP_BPF both load and attach would happen under CAP_BPF
> instead of CAP_SYS_ADMIN.

I'm not arguing against that.

> > So let's look at the mechanisms which we have at hand:
> > 
> >  1) Capabilities
> >  
> >  2) SUID and dropping priviledges
> > 
> >  3) Seccomp and LSM
> > 
> > Now the real interesting questions are:
> > 
> >  A) What kind of restrictions does BPF allow? Is it a binary on/off or is
> >     there a more finegrained control of BPF functionality?
> > 
> >     TBH, I can't tell.
> > 
> >  B) Depending on the answer to #A what is the control possibility for
> >     #1/#2/#3 ?
> 
> Can any of the mechanisms 1/2/3 address the concern in mds.rst?

Well, that depends. As with any other security policy which is implemented
via these mechanisms, the policy can be strict enough to prevent it by not
allowing certain operations. The more fine-grained the control is, it
allows the administrator who implements the policy to remove the
'dangerous' parts from an untrusted user.

So really question #A is important for this. Is BPF just providing a binary
ON/OFF knob or does it allow to disable/enable certain aspects of BPF
functionality in a more fine grained way? If the latter, then it might be
possible to control functionality which might be abused for exploits of
some sorts (including MDS) in a way which allows other parts of BBF to be
exposed to less priviledged contexts.

> I believe Andy wants to expand the attack surface when
> kernel.unprivileged_bpf_disabled=0
> Before that happens I'd like the community to work on addressing the text above.

Well, that text above can be removed when the BPF wizards are entirely sure
that BPF cannot be abused to exploit stuff. 

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Dmitry Safonov @ 2019-08-19 14:15 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Dmitry Safonov, linux-kernel, Adrian Reber, Andrei Vagin,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1908171709360.1923@nanos.tec.linutronix.de>

Hi Thomas,

On 8/18/19 5:21 PM, Thomas Gleixner wrote:
[..]
> I'm happy to review well written stuff which makes progress and takes
> review comments into account or the submitter discusses them for
> resolution.

Thanks again for both your and Andy time!

[..]
> Coming back to Andy's idea. Create your time namespace page as an exact
> copy of the vdso data page. When the page is created do:
> 
>      	 memset(p->vdso_data, 0, sizeof(p->vdso_data));
>      	 p->vdso_data[0].clock_mode = CLOCK_TIMENS;
> 	 p->vdso_data[0].seq = 1;
>  
> 	 /* Store the namespace offsets in basetime */
> 	 p->vdso_data[0].basetime[CLOCK_MONOTONIC].sec = myns->mono_sec;
> 	 p->vdso_data[0].basetime[CLOCK_MONOTONIC].nsec = myns->mono_nsec;
> 	 p->vdso_data[0].basetime[CLOCK_BOOTTIME].sec = myns->boot_sec;
> 	 p->vdso_data[0].basetime[CLOCK_BOOTTIME].nsec = myns->boot_nsec;
> 
>      	 p->vdso_data[1].clock_mode = CLOCK_TIMENS;
> 	 p->vdso_data[1].seq = 1;
> 
> For a normal task the VVAR pages are installed in the normal ordering:
> 
>        VVAR
>        PVCLOCK
>        HVCLOCK
>        TIMENS	<- Not really required
> 
> Now for a timens task you install the pages in the following order
> 
>        TIMENS
>        PVCLOCK
>        HVCLOCK
>        VVAR
> 
> The check for vdso_data->clock_mode is in the unlikely path of the now open
> coded seq begin magic. So for the non-timens case most of the time 'seq' is
> even, so the branch is not taken.
> 
> If 'seq' is odd, i.e. a concurrent update is in progress, the extra check
> for vdso_data->clock_mode is a non-issue. The task is spin waiting for the
> update to finish and for 'seq' to become even anyway.
> 
> Patch below. I tested this with the normal order and by installing a
> 'timens' page unconditionally for all processes. I'll reply with the timens
> testing hacks so you can see what I did.
> 
> The test results are pretty good.
> 
>     	 	 Base (upstream)  + VDSO patch	+ timens page
> 
> MONO		 30ns 		    30ns 	  32ns
> REAL		 30ns		    30ns	  32ns
> BOOT		 30ns		    30ns	  32ns
> MONOCOARSE	  7ns		     8ns	  10ns
> REALCOARSE	  7ns		     8ns	  10ns
> TAI		 30ns		    30ns	  32ns
> MONORAW		 30ns		    30ns	  32ns
> 
> So except for the coarse clocks there is no change when the timens page is
> not used, i.e. the regular VVAR page is at the proper place. But that's on
> one machine, a different one showed an effect in the noise range. I'm not
> worried about that as the VDSO behaviour varies depending on micro
> architecture anyway.
> 
> With timens enabled the performance hit (cache hot microbenchmark) is
> somewhere in the range of 5-7% when looking at the perf counters
> numbers. The hit for the coarse accessors is larger obviously because the
> overhead is constant time.
> 
> I did a quick comparison of the array vs. switch case (what you used for
> your clk_to_ns() helper). The switch case is slower.
> 
> So I rather go for the array based approach. It's simpler code and the
> I-cache footprint is smaller and no conditional branches involved.
> 
> That means your timens_to_host() and host_to_timens() conversion functions
> should just use that special VDSO page and do the same array based
> unconditional add/sub of the clock specific offset.

I was a bit scarred that clock_mode change would result in some complex
logic, but your patch showed me that it's definitely not so black as I
was painting it.
Will rework the patches set with Andrei based on your and Andy's
suggestions and patches.

Thanks,
          Dmitry

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Thomas Gleixner @ 2019-08-19 14:44 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Andy Lutomirski, Dmitry Safonov, linux-kernel, Adrian Reber,
	Andrei Vagin, Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <37f08bfa-0ef8-6df9-e119-e010cdeb9a5a@gmail.com>

Dmitry,

On Mon, 19 Aug 2019, Dmitry Safonov wrote:
> On 8/18/19 5:21 PM, Thomas Gleixner wrote:
> > That means your timens_to_host() and host_to_timens() conversion functions
> > should just use that special VDSO page and do the same array based
> > unconditional add/sub of the clock specific offset.
> 
> I was a bit scarred that clock_mode change would result in some complex
> logic, but your patch showed me that it's definitely not so black as I
> was painting it.

Right. It took me a while to find the right spot which does not affect the
non-timens path and at the same time gives a reasonable result for the
timens case.

One thing occured to me while doing that vvar_fault() hack for testing. For
the timens case it will hit

     if (sym_offset == image->sym_vvar_page) {

first, which is then installing the special vvar page.

It's clear that the code will hit the next fault immediately when trying to
access the real vvar page at the timens offset. So it might be sensible to
map that one in one go to avoid the immediate second page fault. But that
should be a separate patch after the initial 'functional' one.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-19 17:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <alpine.DEB.2.21.1908191103130.1923@nanos.tec.linutronix.de>

On Mon, Aug 19, 2019 at 11:15:11AM +0200, Thomas Gleixner wrote:
> Alexei,
> 
> On Sat, 17 Aug 2019, Alexei Starovoitov wrote:
> > On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
> > > On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> > > While real usecases are helpful to understand a design decision, the design
> > > needs to be usecase independent.
> > > 
> > > The kernel provides mechanisms, not policies. My impression of this whole
> > > discussion is that it is policy driven. That's the wrong approach.
> > 
> > not sure what you mean by 'policy driven'.
> > Proposed CAP_BPF is a policy?
> 
> I was referring to the discussion as a whole.
>  
> > Can kernel.unprivileged_bpf_disabled=1 be used now?
> > Yes, but it will weaken overall system security because things that
> > use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
> > to move to stronger CAP_SYS_ADMIN.
> > 
> > With CAP_BPF both load and attach would happen under CAP_BPF
> > instead of CAP_SYS_ADMIN.
> 
> I'm not arguing against that.
> 
> > > So let's look at the mechanisms which we have at hand:
> > > 
> > >  1) Capabilities
> > >  
> > >  2) SUID and dropping priviledges
> > > 
> > >  3) Seccomp and LSM
> > > 
> > > Now the real interesting questions are:
> > > 
> > >  A) What kind of restrictions does BPF allow? Is it a binary on/off or is
> > >     there a more finegrained control of BPF functionality?
> > > 
> > >     TBH, I can't tell.
> > > 
> > >  B) Depending on the answer to #A what is the control possibility for
> > >     #1/#2/#3 ?
> > 
> > Can any of the mechanisms 1/2/3 address the concern in mds.rst?
> 
> Well, that depends. As with any other security policy which is implemented
> via these mechanisms, the policy can be strict enough to prevent it by not
> allowing certain operations. The more fine-grained the control is, it
> allows the administrator who implements the policy to remove the
> 'dangerous' parts from an untrusted user.
> 
> So really question #A is important for this. Is BPF just providing a binary
> ON/OFF knob or does it allow to disable/enable certain aspects of BPF
> functionality in a more fine grained way? If the latter, then it might be
> possible to control functionality which might be abused for exploits of
> some sorts (including MDS) in a way which allows other parts of BBF to be
> exposed to less priviledged contexts.

I see. So the kernel.unprivileged_bpf_disabled knob is binary and I think it's
the right mechanism to expose to users.
Having N knobs for every map/prog type won't decrease attack surface.
In the other email Andy's quoting seccomp man page...
Today seccomp cannot really look into bpf_attr syscall args, but even
if it could it won't secure the system.
Examples:
1.
spectre v2 is using bpf in-kernel interpreter in speculative way.
The mere presence of interpreter as part of kernel .text makes the exploit
easier to do. That was the reason to do CONFIG_BPF_JIT_ALWAYS_ON.
For this case even kernel.unprivileged_bpf_disabled=1 was hopeless.

2.
var4 doing store hazard. It doesn't matter which program type is used.
load/store instructions are the same across program types.

3.
prog_array was used as part of var1. I guess it was simply more
convenient for Jann to do it this way :) All other map types
have the same out-of-bounds speculation issue.

In general side channels are cpu bugs that are exploited via sequences
of cpu instructions. In that sense bpf infra provides these instructions.
So all program types and all maps have the same level of 'side channel risk'.

> > I believe Andy wants to expand the attack surface when
> > kernel.unprivileged_bpf_disabled=0
> > Before that happens I'd like the community to work on addressing the text above.
> 
> Well, that text above can be removed when the BPF wizards are entirely sure
> that BPF cannot be abused to exploit stuff. 

Myself and Daniel looked at it in detail. I think we understood
MDS mechanism well enough. Right now we're fairly confident that
combination of existing mechanisms we did for var4 and
verifier speculative analysis protect us from MDS.
The thing is that every new cpu bug is looked at through the bpf lenses.
Can it be exploited through bpf? Complexity of side channels
is growing. Can the most recent swapgs be exploited ?
What if we kprobe+bpf somewhere ?
I don't think there is an issue, but we will never be 'entirely sure'.
Even if myself and Daniel are sure the concern will stay.
Unprivileged bpf as a whole is the concern due to side channels.
The number of them are not yet disclosed. Who is going to analyze them?
imo the only answer to that is kernel.unprivileged_bpf_disabled=1
which together with CONFIG_BPF_JIT_ALWAYS_ON is secure enough.
The other option is to sprinkle every bpf load/store with lfence
which will make execution so slow that it will be unusable.
Which is effectively the same as unprivileged_bpf_disabled=1.

There are other things we can do. Like kasan-style shadow memory
for bpf execution. Auto re-JITing the code after it's running.
We can do lfences everywhere for some time then re-JIT
when kasan-ed shadow memory shows only clean memory accesses.
The beauty of BPF that it is analyze-able and JIT-able instruction set.
The verifier speculative analysis is an example that the kernel
can analyze the speculative execution path that cpu will
take before the code starts executing.
Unprivileged bpf can made absolutely secure. It can be
made more secure than the rest of the kernel.
But today we should just go with unprivileged_bpf_disabled=1
and CAP_BPF.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-19 17:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Jordan Glover, Andy Lutomirski,
	Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190819172718.jwnvvotssxwhc7m6@ast-mbp.dhcp.thefacebook.com>



> On Aug 19, 2019, at 10:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Mon, Aug 19, 2019 at 11:15:11AM +0200, Thomas Gleixner wrote:
>> Alexei,
>> 
>>> On Sat, 17 Aug 2019, Alexei Starovoitov wrote:
>>>> On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
>>>> On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
>>>> While real usecases are helpful to understand a design decision, the design
>>>> needs to be usecase independent.
>>>> 
>>>> The kernel provides mechanisms, not policies. My impression of this whole
>>>> discussion is that it is policy driven. That's the wrong approach.
>>> 
>>> not sure what you mean by 'policy driven'.
>>> Proposed CAP_BPF is a policy?
>> 
>> I was referring to the discussion as a whole.
>> 
>>> Can kernel.unprivileged_bpf_disabled=1 be used now?
>>> Yes, but it will weaken overall system security because things that
>>> use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
>>> to move to stronger CAP_SYS_ADMIN.
>>> 
>>> With CAP_BPF both load and attach would happen under CAP_BPF
>>> instead of CAP_SYS_ADMIN.
>> 
>> I'm not arguing against that.
>> 
>>>> So let's look at the mechanisms which we have at hand:
>>>> 
>>>> 1) Capabilities
>>>> 
>>>> 2) SUID and dropping priviledges
>>>> 
>>>> 3) Seccomp and LSM
>>>> 
>>>> Now the real interesting questions are:
>>>> 
>>>> A) What kind of restrictions does BPF allow? Is it a binary on/off or is
>>>>    there a more finegrained control of BPF functionality?
>>>> 
>>>>    TBH, I can't tell.
>>>> 
>>>> B) Depending on the answer to #A what is the control possibility for
>>>>    #1/#2/#3 ?
>>> 
>>> Can any of the mechanisms 1/2/3 address the concern in mds.rst?
>> 
>> Well, that depends. As with any other security policy which is implemented
>> via these mechanisms, the policy can be strict enough to prevent it by not
>> allowing certain operations. The more fine-grained the control is, it
>> allows the administrator who implements the policy to remove the
>> 'dangerous' parts from an untrusted user.
>> 
>> So really question #A is important for this. Is BPF just providing a binary
>> ON/OFF knob or does it allow to disable/enable certain aspects of BPF
>> functionality in a more fine grained way? If the latter, then it might be
>> possible to control functionality which might be abused for exploits of
>> some sorts (including MDS) in a way which allows other parts of BBF to be
>> exposed to less priviledged contexts.
> 
> I see. So the kernel.unprivileged_bpf_disabled knob is binary and I think it's
> the right mechanism to expose to users.
> Having N knobs for every map/prog type won't decrease attack surface.
> In the other email Andy's quoting seccomp man page...
> Today seccomp cannot really look into bpf_attr syscall args, but even
> if it could it won't secure the system.
> Examples:
> 1.
> spectre v2 is using bpf in-kernel interpreter in speculative way.
> The mere presence of interpreter as part of kernel .text makes the exploit
> easier to do. That was the reason to do CONFIG_BPF_JIT_ALWAYS_ON.
> For this case even kernel.unprivileged_bpf_disabled=1 was hopeless.
> 
> 2.
> var4 doing store hazard. It doesn't matter which program type is used.
> load/store instructions are the same across program types.
> 
> 3.
> prog_array was used as part of var1. I guess it was simply more
> convenient for Jann to do it this way :) All other map types
> have the same out-of-bounds speculation issue.
> 
> In general side channels are cpu bugs that are exploited via sequences
> of cpu instructions. In that sense bpf infra provides these instructions.
> So all program types and all maps have the same level of 'side channel risk'.
> 
>>> I believe Andy wants to expand the attack surface when
>>> kernel.unprivileged_bpf_disabled=0
>>> Before that happens I'd like the community to work on addressing the text above.
>> 
>> Well, that text above can be removed when the BPF wizards are entirely sure
>> that BPF cannot be abused to exploit stuff. 
> 
> Myself and Daniel looked at it in detail. I think we understood
> MDS mechanism well enough. Right now we're fairly confident that
> combination of existing mechanisms we did for var4 and
> verifier speculative analysis protect us from MDS.
> The thing is that every new cpu bug is looked at through the bpf lenses.
> Can it be exploited through bpf? Complexity of side channels
> is growing. Can the most recent swapgs be exploited ?
> What if we kprobe+bpf somewhere ?
> I don't think there is an issue, but we will never be 'entirely sure'.
> Even if myself and Daniel are sure the concern will stay.
> Unprivileged bpf as a whole is the concern due to side channels.
> The number of them are not yet disclosed. Who is going to analyze them?
> imo the only answer to that is kernel.unprivileged_bpf_disabled=1
> which together with CONFIG_BPF_JIT_ALWAYS_ON is secure enough.
> The other option is to sprinkle every bpf load/store with lfence
> which will make execution so slow that it will be unusable.
> Which is effectively the same as unprivileged_bpf_disabled=1.
> 
> There are other things we can do. Like kasan-style shadow memory
> for bpf execution. Auto re-JITing the code after it's running.
> We can do lfences everywhere for some time then re-JIT
> when kasan-ed shadow memory shows only clean memory accesses.
> The beauty of BPF that it is analyze-able and JIT-able instruction set.
> The verifier speculative analysis is an example that the kernel
> can analyze the speculative execution path that cpu will
> take before the code starts executing.
> Unprivileged bpf can made absolutely secure. It can be
> made more secure than the rest of the kernel.
> But today we should just go with unprivileged_bpf_disabled=1

I’m still okay with this.

> and CAP_BPF.
> 

I think this needs more design work.  I’m halfway through writing up an actual proposal. I’ll send it soon.

^ permalink raw reply

* Re: [PATCH v5 0/9] FPGA DFL updates
From: Greg KH @ 2019-08-19 20:51 UTC (permalink / raw)
  To: Wu Hao; +Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull
In-Reply-To: <20190819053133.GA31244@hao-dev>

On Mon, Aug 19, 2019 at 01:31:33PM +0800, Wu Hao wrote:
> On Mon, Aug 12, 2019 at 10:49:55AM +0800, Wu Hao wrote:
> > Hi Greg,
> > 
> > This is v5 patchset which adds more features to FPGA DFL. Marjor changes
> > against v4 are sysfs related code rework to address comments on v4.
> > 
> > Please help to take a look. Thanks!
> 
> Hi Greg,
> 
> Did you get a chance to take a look at this new version patchset? :)

I'm not the FPGA maintainer, what about the review from the other one
first?  :)

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Joel Fernandes @ 2019-08-19 21:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Daniel Gruss, kernel list, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, Daniel Colascione, fmayer, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Kees Cook, kernel-team, Linux API,
	linux-doc, linux-fsdevel, Linux-MM, Mike Rapoport
In-Reply-To: <20190814075601.GO17933@dhcp22.suse.cz>

On Wed, Aug 14, 2019 at 09:56:01AM +0200, Michal Hocko wrote:
[snip]
> > > > Can this be used to observe which library pages other processes are
> > > > accessing, even if you don't have access to those processes, as long
> > > > as you can map the same libraries? I realize that there are already a
> > > > bunch of ways to do that with side channels and such; but if you're
> > > > adding an interface that allows this by design, it seems to me like
> > > > something that should be gated behind some sort of privilege check.
> > >
> > > Hmm, you need to be priviledged to get the pfn now and without that you
> > > cannot get to any page so the new interface is weakening the rules.
> > > Maybe we should limit setting the idle state to processes with the write
> > > status. Or do you think that even observing idle status is useful for
> > > practical side channel attacks? If yes, is that a problem of the
> > > profiler which does potentially dangerous things?
> > 
> > I suppose read-only access isn't a real problem as long as the
> > profiler isn't writing the idle state in a very tight loop... but I
> > don't see a usecase where you'd actually want that? As far as I can
> > tell, if you can't write the idle state, being able to read it is
> > pretty much useless.
> > 
> > If the profiler only wants to profile process-private memory, then
> > that should be implementable in a safe way in principle, I think, but
> > since Joel said that they want to profile CoW memory as well, I think
> > that's inherently somewhat dangerous.
> 
> I cannot really say how useful that would be but I can see that
> implementing ownership checks would be really non-trivial for
> shared pages. Reducing the interface to exclusive pages would make it
> easier as you noted but less helpful.
> 
> Besides that the attack vector shouldn't be really much different from
> the page cache access, right? So essentially can_do_mincore model.
> 
> I guess we want to document that page idle tracking should be used with
> care because it potentially opens a side channel opportunity if used
> on sensitive data.

I have been thinking of this, and discussing with our heap profiler folks.
Not being able to track shared pages would be a limitation, but I don't see
any way forward considering this security concern so maybe we have to
limit what we can do.

I will look into implementing this without doing the rmap but still make it
work on shared pages from the point of view of the process being tracked. It
just would no longer through the PTEs of *other* processes sharing the page.

My current thought is to just rely on the PTE accessed bit, and not use the
PageIdle flag at all. But we'd still set the PageYoung flag so that the
reclaim code still sees the page as accessed. The reason I feel like avoiding
the PageIdle flag is:

1. It looks like mark_page_accessed() can be called from other paths which
can also result in some kind of side-channel issue if a page was shared.

2. I don't think I need the PageIdle flag since the access bit alone should
let me know, although it could be a bit slower. Since previously, I did not
need to check every PTE and if the PageIdle flag was already cleared, then
the page was declared as idle.

At least this series resulted in a bug fix and a tonne of learning, so thank
you everyone!

Any other thoughts?

thanks,

 - Joel

^ permalink raw reply

* [PATCH V40 00/29] Add kernel lockdown functionality
From: Matthew Garrett @ 2019-08-20  0:17 UTC (permalink / raw)
  To: jmorris; +Cc: linux-security-module, linux-kernel, linux-api

After chatting with James in person, I'm resending the full set with the
fixes merged in in order to avoid any bisect issues. There should be no
functional changes other than avoiding build failures with some configs,
and fixing the oops in tracefs.

^ permalink raw reply

* [PATCH V40 01/29] security: Support early LSMs
From: Matthew Garrett @ 2019-08-20  0:17 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
	Matthew Garrett, Kees Cook, Casey Schaufler, Stephen Rothwell
In-Reply-To: <20190820001805.241928-1-matthewgarrett@google.com>

The lockdown module is intended to allow for kernels to be locked down
early in boot - sufficiently early that we don't have the ability to
kmalloc() yet. Add support for early initialisation of some LSMs, and
then add them to the list of names when we do full initialisation later.
Early LSMs are initialised in link order and cannot be overridden via
boot parameters, and cannot make use of kmalloc() (since the allocator
isn't initialised yet).

(Fixed by Stephen Rothwell to include a stub to fix builds when
!CONFIG_SECURITY)

Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
---
 include/asm-generic/vmlinux.lds.h |  8 ++++-
 include/linux/lsm_hooks.h         |  6 ++++
 include/linux/security.h          |  6 ++++
 init/main.c                       |  1 +
 security/security.c               | 50 ++++++++++++++++++++++++++-----
 5 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 088987e9a3ea..c1807d14daa3 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -208,8 +208,13 @@
 			__start_lsm_info = .;				\
 			KEEP(*(.lsm_info.init))				\
 			__end_lsm_info = .;
+#define EARLY_LSM_TABLE()	. = ALIGN(8);				\
+			__start_early_lsm_info = .;			\
+			KEEP(*(.early_lsm_info.init))			\
+			__end_early_lsm_info = .;
 #else
 #define LSM_TABLE()
+#define EARLY_LSM_TABLE()
 #endif
 
 #define ___OF_TABLE(cfg, name)	_OF_TABLE_##cfg(name)
@@ -609,7 +614,8 @@
 	ACPI_PROBE_TABLE(irqchip)					\
 	ACPI_PROBE_TABLE(timer)						\
 	EARLYCON_TABLE()						\
-	LSM_TABLE()
+	LSM_TABLE()							\
+	EARLY_LSM_TABLE()
 
 #define INIT_TEXT							\
 	*(.init.text .init.text.*)					\
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 47f58cfb6a19..b02e8bb6654d 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -2104,12 +2104,18 @@ struct lsm_info {
 };
 
 extern struct lsm_info __start_lsm_info[], __end_lsm_info[];
+extern struct lsm_info __start_early_lsm_info[], __end_early_lsm_info[];
 
 #define DEFINE_LSM(lsm)							\
 	static struct lsm_info __lsm_##lsm				\
 		__used __section(.lsm_info.init)			\
 		__aligned(sizeof(unsigned long))
 
+#define DEFINE_EARLY_LSM(lsm)						\
+	static struct lsm_info __early_lsm_##lsm			\
+		__used __section(.early_lsm_info.init)			\
+		__aligned(sizeof(unsigned long))
+
 #ifdef CONFIG_SECURITY_SELINUX_DISABLE
 /*
  * Assuring the safety of deleting a security module is up to
diff --git a/include/linux/security.h b/include/linux/security.h
index 659071c2e57c..c5dd90981c98 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -195,6 +195,7 @@ int unregister_lsm_notifier(struct notifier_block *nb);
 
 /* prototypes */
 extern int security_init(void);
+extern int early_security_init(void);
 
 /* Security operations */
 int security_binder_set_context_mgr(struct task_struct *mgr);
@@ -423,6 +424,11 @@ static inline int security_init(void)
 	return 0;
 }
 
+static inline int early_security_init(void)
+{
+	return 0;
+}
+
 static inline int security_binder_set_context_mgr(struct task_struct *mgr)
 {
 	return 0;
diff --git a/init/main.c b/init/main.c
index 66a196c5e4c3..598effd29a0a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -569,6 +569,7 @@ asmlinkage __visible void __init start_kernel(void)
 	boot_cpu_init();
 	page_address_init();
 	pr_notice("%s", linux_banner);
+	early_security_init();
 	setup_arch(&command_line);
 	mm_init_cpumask(&init_mm);
 	setup_command_line(command_line);
diff --git a/security/security.c b/security/security.c
index f493db0bf62a..ef4a0111c8b4 100644
--- a/security/security.c
+++ b/security/security.c
@@ -33,6 +33,7 @@
 
 /* How many LSMs were built into the kernel? */
 #define LSM_COUNT (__end_lsm_info - __start_lsm_info)
+#define EARLY_LSM_COUNT (__end_early_lsm_info - __start_early_lsm_info)
 
 struct security_hook_heads security_hook_heads __lsm_ro_after_init;
 static ATOMIC_NOTIFIER_HEAD(lsm_notifier_chain);
@@ -277,6 +278,8 @@ static void __init ordered_lsm_parse(const char *order, const char *origin)
 static void __init lsm_early_cred(struct cred *cred);
 static void __init lsm_early_task(struct task_struct *task);
 
+static int lsm_append(const char *new, char **result);
+
 static void __init ordered_lsm_init(void)
 {
 	struct lsm_info **lsm;
@@ -323,6 +326,26 @@ static void __init ordered_lsm_init(void)
 	kfree(ordered_lsms);
 }
 
+int __init early_security_init(void)
+{
+	int i;
+	struct hlist_head *list = (struct hlist_head *) &security_hook_heads;
+	struct lsm_info *lsm;
+
+	for (i = 0; i < sizeof(security_hook_heads) / sizeof(struct hlist_head);
+	     i++)
+		INIT_HLIST_HEAD(&list[i]);
+
+	for (lsm = __start_early_lsm_info; lsm < __end_early_lsm_info; lsm++) {
+		if (!lsm->enabled)
+			lsm->enabled = &lsm_enabled_true;
+		prepare_lsm(lsm);
+		initialize_lsm(lsm);
+	}
+
+	return 0;
+}
+
 /**
  * security_init - initializes the security framework
  *
@@ -330,14 +353,18 @@ static void __init ordered_lsm_init(void)
  */
 int __init security_init(void)
 {
-	int i;
-	struct hlist_head *list = (struct hlist_head *) &security_hook_heads;
+	struct lsm_info *lsm;
 
 	pr_info("Security Framework initializing\n");
 
-	for (i = 0; i < sizeof(security_hook_heads) / sizeof(struct hlist_head);
-	     i++)
-		INIT_HLIST_HEAD(&list[i]);
+	/*
+	 * Append the names of the early LSM modules now that kmalloc() is
+	 * available
+	 */
+	for (lsm = __start_early_lsm_info; lsm < __end_early_lsm_info; lsm++) {
+		if (lsm->enabled)
+			lsm_append(lsm->name, &lsm_names);
+	}
 
 	/* Load LSMs in specified order. */
 	ordered_lsm_init();
@@ -384,7 +411,7 @@ static bool match_last_lsm(const char *list, const char *lsm)
 	return !strcmp(last, lsm);
 }
 
-static int lsm_append(char *new, char **result)
+static int lsm_append(const char *new, char **result)
 {
 	char *cp;
 
@@ -422,8 +449,15 @@ void __init security_add_hooks(struct security_hook_list *hooks, int count,
 		hooks[i].lsm = lsm;
 		hlist_add_tail_rcu(&hooks[i].list, hooks[i].head);
 	}
-	if (lsm_append(lsm, &lsm_names) < 0)
-		panic("%s - Cannot get early memory.\n", __func__);
+
+	/*
+	 * Don't try to append during early_security_init(), we'll come back
+	 * and fix this up afterwards.
+	 */
+	if (slab_is_available()) {
+		if (lsm_append(lsm, &lsm_names) < 0)
+			panic("%s - Cannot get early memory.\n", __func__);
+	}
 }
 
 int call_lsm_notifier(enum lsm_event event, void *data)
-- 
2.23.0.rc1.153.gdeed80330f-goog

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox