Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-15 18:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <20190815172856.yoqvgu2yfrgbkowu@ast-mbp.dhcp.thefacebook.com>

On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > process. Also unprivileged user process can start systemd --user process with any
> > command they like.
>
> systemd itself is trusted. It's the same binary whether it runs as pid=1
> or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> CAP_BPF is a natural step in the same direction.
>

I have the feeling that we're somehow speaking different languages.
What, precisely, do you mean when you say "systemd itself is trusted"?
 Do you mean "the administrator trusts that the /lib/systemd/systemd
binary is not malicious"?  Do you mean "the administrator trusts that
the running systemd process is not malicious"?

On a regular Linux desktop or server box, passing CAP_NET_ADMIN, your
envisioned CAP_BPF, or /dev/bpf as in this patchset through to a
systemd --user binary would be a gaping security hole.  You are
welcome to do it on your own systemd, but if a distro did it, it would
be a major error.

If you want IPAddressDeny= to work in a user systemd unit (i.e.
/etc/systemd/user/*), then I think you have two choices.  You could
have an API by which systemd --user can ask a privileged helper to
assist (which has all the challenges you mentioned but is definitely
*possible*) or the kernel bpf() interfaces need to be designed so
that, in the absence of kernel bugs, they are safe to use from an
unprivileged process.  By "safe", I mean "would not expose the system
to attack if the kernel's implementation of the bpf() ABI were
perfect".

My suggestions upthread for incrementally making bpf() depend less on
privilege would accomplish this goal.  It would be entirely reasonable
to say that, even with those changes, bpf() is still a large attack
surface and access to it should be restricted, and having a capability
or other mechanism to explicitly grant access to the
hopefully-secure-but-plausibly-buggy parts of bpf() would make sense.
But you rejected that idea and said you "realized that [changing all
the capable() checks is] perfect as-is" without much explanation,
which makes it hard to understand where you're coming from.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Jordan Glover @ 2019-08-15 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190815172856.yoqvgu2yfrgbkowu@ast-mbp.dhcp.thefacebook.com>

On Thursday, August 15, 2019 5:28 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
>
> > On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> >
> > > On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> > >
> > > > If eBPF is genuinely not usable by programs that are not fully trusted
> > > > by the admin, then no kernel changes at all are needed. Programs that
> > > > want to reduce their own privileges can easily fork() a privileged
> > > > subprocess or run a little helper to which they delegate BPF
> > > > operations. This is far more flexible than anything that will ever be
> > > > in the kernel because it allows the helper to verify that the rest of
> > > > the program is doing exactly what it's supposed to and restrict eBPF
> > > > operations to exactly the subset that is needed. So a container
> > > > manager or network manager that drops some provilege could have a
> > > > little bpf-helper that manages its BPF XDP, firewalling, etc
> > > > configuration. The two processes would talk over a socketpair.
> > >
> > > there were three projects that tried to delegate bpf operations.
> > > All of them failed.
> > > bpf operational workflow is much more complex than you're imagining.
> > > fork() also doesn't work for all cases.
> > > I gave this example before: consider multiple systemd-like deamons
> > > that need to do bpf operations that want to pass this 'bpf capability'
> > > to other deamons written by other teams. Some of them will start
> > > non-root, but still need to do bpf. They will be rpm installed
> > > and live upgraded while running.
> > > We considered to make systemd such centralized bpf delegation
> > > authority too. It didn't work. bpf in kernel grows quickly.
> > > libbpf part grows independently. llvm keeps evolving.
> > > All of them are being changed while system overall has to stay
> > > operational. Centralized approach breaks apart.
> > >
> > > > The interesting cases you're talking about really do involved
> > > > unprivileged or less privileged eBPF, though. Let's see:
> > > > systemd --user: systemd --user is not privileged at all. There's no
> > > > issue of reducing privilege, since systemd --user doesn't have any
> > > > privilege to begin with. But systemd supports some eBPF features, and
> > > > presumably it would like to support them in the systemd --user case.
> > > > This is unprivileged eBPF.
> > >
> > > Let's disambiguate the terminology.
> > > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > > I think that was a mistake.
> > > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > > This is not unprivileged.
> > > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> > > There is a huge difference between the two.
> > > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > > today for many reasons mentioned earlier.
> > > The /dev/bpf is about 'less privileged'.
> > > Less privileged than root. We need to split part of full root capability
> > > into bpf capability. So that most of the root can be dropped.
> > > This is very similar to what cap_net_admin does.
> > > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > > Same thing for cap_bpf.
> > > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > > this is the same thing. Two interfaces to achieve the same result.
> >
> > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > process. Also unprivileged user process can start systemd --user process with any
> > command they like.
>
> systemd itself is trusted. It's the same binary whether it runs as pid=1
> or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> CAP_BPF is a natural step in the same direction.

The point was that systemd will run any arbitrary command you'll throw at it and you want
to automatically attach CAP_BPF to it. AmbientCapabilities is not valid option for
systemd --user instance (otherwise it would be nuts).

I think we may have misunderstanding here. Did you mean systemd "system" service with
"User=xxx" option instead of "systemd --user" service? It would make sense then.

^ permalink raw reply

* Re: [PATCHv6 22/36] x86/vdso: Add offsets page in vvar
From: Thomas Gleixner @ 2019-08-15 19:21 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Dmitry Safonov, Andrei Vagin, Adrian Reber,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86, Andrei Vagin
In-Reply-To: <20190815163836.2927-23-dima@arista.com>

On Thu, 15 Aug 2019, Dmitry Safonov wrote:
> ---
>  arch/Kconfig                          |  5 +++
>  arch/x86/Kconfig                      |  1 +
>  arch/x86/entry/vdso/vdso-layout.lds.S |  9 ++++-
>  arch/x86/entry/vdso/vdso2c.c          |  3 ++
>  arch/x86/entry/vdso/vma.c             | 12 +++++++
>  arch/x86/include/asm/vdso.h           |  1 +
>  init/Kconfig                          |  1 +
>  lib/vdso/gettimeofday.c               | 47 +++++++++++++++++++++++++++

This needs to be split into the generic lib/vdso part and then x86 making
use of it.

> +#ifdef CONFIG_TIME_NS

This should be COMPILE_WITH_TIME_NS and not CONFIG_TIME_NS

> +extern u8 timens_page
> +	__attribute__((visibility("hidden")));
> +
> +notrace static __always_inline void clk_to_ns(clockid_t clk, struct __kernel_timespec *ts)

This needs notrace because?

> +{
> +	struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
> +	struct timespec64 *offset64;
> +
> +	switch (clk) {
> +	case CLOCK_MONOTONIC:
> +	case CLOCK_MONOTONIC_COARSE:
> +	case CLOCK_MONOTONIC_RAW:
> +		offset64 = &timens->monotonic;
> +		break;
> +	case CLOCK_BOOTTIME:
> +		offset64 = &timens->boottime;
> +		break;
> +	default:
> +		return;
> +	}
> +
> +	/*
> +	 * The kernel allows to set a negative offset only if the current clock
> +	 * value in a namespace is positive, so the result tv_sec can't be
> +	 * negative here.
> +	 */
> +	ts->tv_nsec += offset64->tv_nsec;
> +	ts->tv_sec += offset64->tv_sec;
> +	if (ts->tv_nsec >= NSEC_PER_SEC) {
> +		ts->tv_nsec -= NSEC_PER_SEC;
> +		ts->tv_sec++;
> +	}
> +	if (ts->tv_nsec < 0) {
> +		ts->tv_nsec += NSEC_PER_SEC;
> +		ts->tv_sec--;
> +	}

That's broken for 32bit user space on 64bit hosts. On LE due to
misalignment and on BE because 32bit will read always 0.

> +}
> +#else
> +notrace static __always_inline void clk_to_ns(clockid_t clk, struct __kernel_timespec *ts) {}
> +#endif
> +
>  static int do_hres(const struct vdso_data *vd, clockid_t clk,
>  		   struct __kernel_timespec *ts)
>  {
> @@ -65,6 +108,8 @@ static int do_hres(const struct vdso_data *vd, clockid_t clk,
>  	ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
>  	ts->tv_nsec = ns;
>  
> +	clk_to_ns(clk, ts);
> +
>  	return 0;
>  }
>  
> @@ -79,6 +124,8 @@ static void do_coarse(const struct vdso_data *vd, clockid_t clk,
>  		ts->tv_sec = vdso_ts->sec;
>  		ts->tv_nsec = vdso_ts->nsec;
>  	} while (unlikely(vdso_read_retry(vd, seq)));
> +
> +	clk_to_ns(clk, ts);
>  }
>  
>  static __maybe_unused int
> -- 
> 2.22.0
> 
> 

^ permalink raw reply

* Re: [PATCHv6 28/36] posix-clocks: Add align for timens_offsets
From: Thomas Gleixner @ 2019-08-15 19:22 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <20190815163836.2927-29-dima@arista.com>

On Thu, 15 Aug 2019, Dmitry Safonov wrote:

> Align offsets so that time namespace will work for ia32 applications on
> x86_64 host.

That's true for any 64 bit arch which supports 32bit user space and should
be folded into the patch which introduces the offset store.

> +/*
> + * Time offsets need align as they're placed on VVAR page,
> + * which is used by x86_64 and ia32 VDSO code.
> + * On ia32 offset::tv_sec (u64) has align(4), so re-align offsets
> + * to the same positions as 64-bit offsets.

This is generic code. Please do not add x86'isms here. The alignement
problem is more or less the same for any 64bit arch which supports 32bit
user space. And it's even worse on BE.

> + * On 64-bit big-endian systems VDSO should convert to timespec64
> + * to timespec ...

What?

> ... because of a padding occurring between the fields.

There is no padding between the fields.

32bit BE (powerpc)

struct timespec64 {
	time64_t                   tv_sec;               /*     0     8 */
	long int                   tv_nsec;              /*     8     4 */

tv_nsec is directly after tv_sec

};

64bit LE and BE (x86, powerpc64)

struct timespec64 {
	time64_t                   tv_sec;               /*     0     8 */
	long int                   tv_nsec;              /*     8     8 */
};

The problem for BE is that the 64bit host uses long int to store
tv_nsec. So the 32bit userspace will always read 0 because it reads byte
2/3 as seen from the 64 host side.

So using struct timespec64 for the offset is wrong. You really need to open
code that offset storage if you don't want to end up with weird workarounds
for BE.

Something like this:

struct timens_offs {
	  time64_t	tv_sec;
	  s64		tv_nsec;
};

Then your offset store becomes:

struct timens_offsets {
	struct timens_offs	monotonic;
	struct timens_offs	boottime;
};

which needs tweaks to your conversion functions:

static inline void timens_add_monotonic(struct timespec64 *ts)
{
	struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
	struct timens_offs *mo = &ns_offsets->monotonic;

	if (ns_offsets) {
		set_normalized_timespec64(ts, ts->tv_sec + mo->tv_sec,
                                	  ts->tv_nsec + mo->tv_nsec);
	}
}

And for your to host conversion you need:

	case CLOCK_MONOTONIC:
		mo = &ns_offsets->monotonic;
		offset = ktime_set(mo->tv_sec, mo->tv_nsec);
		break;

Similar changes are needed in the VDSO and the proc interface
obviously. Then this works for any arch without magic BE fixups. You get
the idea.

And ideally you change that storage to:

struct timens_offs {
	  time64_t	tv_sec;
	  s64		tv_nsec;
	  ktime_t	nsecs;
};

and do the conversion once in the proc write. Then your to host conversion
can use 'nsecs' and spare the multiplication on every invocation.

	case CLOCK_MONOTONIC:
    		offset = ns_offsets.monotonic.nsecs;

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Kees Cook @ 2019-08-15 19:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <20190813215823.3sfbakzzjjykyng2@ast-mbp>

On Tue, Aug 13, 2019 at 02:58:25PM -0700, Alexei Starovoitov wrote:
> agree that containers (namespaces) reduce amount of trust necessary
> for apps to run, but the end goal is not security though.

Unsurprisingly, I totally disagree: this is the very definition of
improved "security": reduced attack surface, confined trust, etc.

> Linux has become a single user system.

I hope this is just hyperbole, because it's not true in reality. I agree
that the vast majority of Linux devices are single-user-at-a-time
systems now (rather than the "shell servers" of yore), but the system
still has to be expected to confine users from each other, root, and the
hardware. Switching users on Chrome OS or a distro laptop, etc is still
very much expected to _mean_ something.

> If user can ssh into the host they can become root.
> If arbitrary code can run on the host it will be break out of any sandbox.
> Containers are not providing the level of security that is enough
> to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.

I'm not sure why you draw the line for VMs -- they're just as buggy
as anything else. Regardless, I reject this line of thinking: yes,
all software is buggy, but that isn't a reason to give up. In fact,
we should be trying very hard to create safe code (*insert arguments
for sane languages and toolchains here*).

If you look at software safety as a binary, you will always be
disappointed. If you look at it as it manifests in the real world,
then there is some perspective to be had. Reachability of flaws becomes
a major factor; exploit chain length becomes a factor. There are very
real impacts to be had from security hardening, sandboxing, etc. Of
course nothing is perfect, but the current state of the world isn't
as you describe. (And I say this with the knowledge of how long
the lifetime of bugs are in the kernel.)

> Containers are used to make production systems safer.

Yes.

> Some people call it more 'secure', but it's clearly not secure for
> arbitrary code

Perhaps it's just a language issue. "More secure" and "safer" mean
mostly the same thing to me. I tend to think "safer" is actually
a superset that includes things that wreck the user experience but
aren't actually in the privilege manipulation realm. In the traditional
"security" triad of confidentiality, integrity, and availability, I tend
to weigh availability less highly, but a bug that stops someone from
doing their work but doesn't wreck data, let them switch users, etc,
is still considered a "security" issue by many folks. The fewer bugs
someone is exposed to improves their security, safety, whatever. The
easiest way to do that is confinement and its associated attack surface
reduction. tl;dr: security and safety are very use-case-specific
continuum, not a binary state.

> When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> It's been a constant source of pain. The constant blinding, randomization,
> verifier speculative analysis, all spectre v1, v2, v4 mitigations
> are simply not worth it. It's a lot of complex kernel code without users.
> There is not a single use case to allow arbitrary malicious bpf
> program to be loaded and executed.

The world isn't binary (safe code/malicious code), and we need to build
systems that can be used safely even when things go wrong. Yes, probably
no one has a system that _intentionally_ feeds eBPF into the kernel from
a web form. But there is probably someone who does it unintentionally,
or has a user login exposed on a system where unpriv BPF is enabled. The
point is to create primitives as safely as possible so when things DO
go wrong, they fail safe instead of making things worse.

I'm all for a "less privileged than root" API for eBPF, but I get worried
when I see "security" being treated as a binary state. Especially when
it is considered an always-failed state. :)

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15 23:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jordan Glover, Daniel Colascione, Song Liu, Kees Cook, Networking,
	bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CALCETrUv+g+cb79FJ1S4XuV0K=kowFkPXpzoC99svoOfs4-Kvg@mail.gmail.com>

On Thu, Aug 15, 2019 at 11:36:43AM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > > process. Also unprivileged user process can start systemd --user process with any
> > > command they like.
> >
> > systemd itself is trusted. It's the same binary whether it runs as pid=1
> > or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> > Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> > CAP_BPF is a natural step in the same direction.
> >
> 
> I have the feeling that we're somehow speaking different languages.
> What, precisely, do you mean when you say "systemd itself is trusted"?
>  Do you mean "the administrator trusts that the /lib/systemd/systemd
> binary is not malicious"?  Do you mean "the administrator trusts that
> the running systemd process is not malicious"?

please see
https://github.com/systemd/systemd/commit/4c1567f29aeb60a6741874bca8a8e3a0bd69ed01
I'm not advocating for or against this approach.
Call it 'security hole' or 'better security'.
There are two categories of people for any feature like this.
My point that there is a demand to use bpf for non-root and CAP_NET_ADMIN
level of privileges is acceptable.
Another option is to relax all of bpf to CAP_NET_ADMIN instead of CAP_SYS_ADMIN.
But CAP_BPF is clearly better way.

> My suggestions upthread for incrementally making bpf() depend less on
> privilege would accomplish this goal.

As I pointed out countless times it would make the system overall _less_ secure.
One of the goals here is to do sysctl kernel.unprivileged_bpf_disabled=1 to
make it _more_ secure.

^ permalink raw reply

* Re: [PATCHv6 30/36] selftest/timens: Add Time Namespace test for supported clocks
From: shuah @ 2019-08-15 23:18 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Dmitry Safonov, Adrian Reber, Andrei Vagin, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Thomas Gleixner,
	Vincenzo Frascino, containers, criu, linux-api, x86, shuah
In-Reply-To: <20190815163836.2927-31-dima@arista.com>

Hi Dmitry,

Thanks for the patch.

On 8/15/19 10:38 AM, Dmitry Safonov wrote:
> A test to check that all supported clocks work on host and inside
> a new time namespace. Use both ways to get time: through VDSO and
> by entering the kernel with implicit syscall.
> 
> Introduce a new timens directory in selftests framework for
> the next timens tests.
> 
> Co-developed-by: Andrei Vagin <avagin@openvz.org>
> Signed-off-by: Andrei Vagin <avagin@openvz.org>
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>   tools/testing/selftests/Makefile          |   1 +
>   tools/testing/selftests/timens/.gitignore |   1 +
>   tools/testing/selftests/timens/Makefile   |   5 +
>   tools/testing/selftests/timens/config     |   1 +
>   tools/testing/selftests/timens/log.h      |  26 +++
>   tools/testing/selftests/timens/timens.c   | 185 ++++++++++++++++++++++
>   tools/testing/selftests/timens/timens.h   |  63 ++++++++
>   7 files changed, 282 insertions(+)
>   create mode 100644 tools/testing/selftests/timens/.gitignore
>   create mode 100644 tools/testing/selftests/timens/Makefile
>   create mode 100644 tools/testing/selftests/timens/config
>   create mode 100644 tools/testing/selftests/timens/log.h
>   create mode 100644 tools/testing/selftests/timens/timens.c
>   create mode 100644 tools/testing/selftests/timens/timens.h
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 25b43a8c2b15..6fc63b84a857 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -47,6 +47,7 @@ TARGETS += splice
>   TARGETS += static_keys
>   TARGETS += sync
>   TARGETS += sysctl
> +TARGETS += timens

How long does this test run for? Does this test need to be run
as root? If yes, please add a root check and skip the test.

What does the test output looks like for skip and pass/fail cases?

thanks,
-- Shuah

^ permalink raw reply

* Re: [PATCHv6 35/36] selftests/timens: Add a simple perf test for clock_gettime()
From: shuah @ 2019-08-15 23:29 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Dmitry Safonov, Andrei Vagin, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov,
	Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
	x86, shuah
In-Reply-To: <20190815163836.2927-36-dima@arista.com>

On 8/15/19 10:38 AM, Dmitry Safonov wrote:
> From: Andrei Vagin <avagin@gmail.com>
> 
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> Co-developed-by: Dmitry Safonov <dima@arista.com>
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>   tools/testing/selftests/timens/.gitignore     |   2 +
>   tools/testing/selftests/timens/Makefile       |  10 +-
>   tools/testing/selftests/timens/gettime_perf.c | 101 +++++++++++
>   .../selftests/timens/gettime_perf_cold.c      | 160 ++++++++++++++++++
>   4 files changed, 271 insertions(+), 2 deletions(-)
>   create mode 100644 tools/testing/selftests/timens/gettime_perf.c
>   create mode 100644 tools/testing/selftests/timens/gettime_perf_cold.c
> 
> diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
> index 3b7eda8f35ce..16292e4d08a5 100644
> --- a/tools/testing/selftests/timens/.gitignore
> +++ b/tools/testing/selftests/timens/.gitignore
> @@ -1,4 +1,6 @@
>   clock_nanosleep
> +gettime_perf
> +gettime_perf_cold
>   procfs
>   timens
>   timer
> diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
> index ae1ffd24cc43..97e0460eaf48 100644
> --- a/tools/testing/selftests/timens/Makefile
> +++ b/tools/testing/selftests/timens/Makefile
> @@ -1,6 +1,12 @@
> -TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
> +TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf
> +
> +uname_M := $(shell uname -m 2>/dev/null || echo not)
> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
> +ifeq ($(ARCH),x86_64)
> +TEST_GEN_PROGS += gettime_perf_cold
> +endif
>   
>   CFLAGS := -Wall -Werror
> -LDFLAGS := -lrt
> +LDFLAGS := -lrt -ldl
>   
>   include ../lib.mk
> diff --git a/tools/testing/selftests/timens/gettime_perf.c b/tools/testing/selftests/timens/gettime_perf.c
> new file mode 100644
> index 000000000000..f7d7832c0293
> --- /dev/null
> +++ b/tools/testing/selftests/timens/gettime_perf.c
> @@ -0,0 +1,101 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <sched.h>
> +#include <time.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <dlfcn.h>
> +
> +#include "log.h"
> +#include "timens.h"
> +
> +//#define TEST_SYSCALL
> +

How is this supposed to be used? When does TEST_SYSCALL
get defined?

> +typedef int (*vgettime_t)(clockid_t, struct timespec *);
> +
> +vgettime_t vdso_clock_gettime;
> +
> +static void fill_function_pointers(void)
> +{
> +	void *vdso = dlopen("linux-vdso.so.1",
> +			    RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
> +	if (!vdso)
> +		vdso = dlopen("linux-gate.so.1",
> +			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
> +	if (!vdso) {
> +		pr_err("[WARN]\tfailed to find vDSO\n");
> +		return;
> +	}
> +
> +	vdso_clock_gettime = (vgettime_t)dlsym(vdso, "__vdso_clock_gettime");
> +	if (!vdso_clock_gettime)
> +		pr_err("Warning: failed to find clock_gettime in vDSO\n");
> +
> +}
> +
> +static void test(clock_t clockid, char *clockstr, bool in_ns)
> +{
> +	struct timespec tp, start;
> +	long i = 0;
> +	const int timeout = 3;
> +
> +#ifndef TEST_SYSCALL
> +	vdso_clock_gettime(clockid, &start);
> +#else
> +	syscall(__NR_clock_gettime, clockid, &start);
> +#endif

Hmm. This doesn't look right. Does this test need to be compiled
with TEST_SYSCALL. Please find a way to do this without ifdef.

thanks,
-- Shuah

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15 23:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Song Liu, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <201908151203.FE87970@keescook>

On Thu, Aug 15, 2019 at 12:46:41PM -0700, Kees Cook wrote:
> On Tue, Aug 13, 2019 at 02:58:25PM -0700, Alexei Starovoitov wrote:
> > agree that containers (namespaces) reduce amount of trust necessary
> > for apps to run, but the end goal is not security though.
> 
> Unsurprisingly, I totally disagree: this is the very definition of
> improved "security": reduced attack surface, confined trust, etc.

there are different ways to define the meaning of the word "security".
Of course containers reduce attack surface, etc.
The 'attack surface' as a mitigation from malicious users is not always the goal
of running containers. Ask yourself why containers are used in the datacenters
where only root can ssh into a server, only signed packages can
ever be installed, no browsers running, and no remote code is ever downloaded?

> > Linux has become a single user system.
> 
> I hope this is just hyperbole, because it's not true in reality. I agree
> that the vast majority of Linux devices are single-user-at-a-time
> systems now (rather than the "shell servers" of yore), but the system
> still has to be expected to confine users from each other, root, and the
> hardware. Switching users on Chrome OS or a distro laptop, etc is still
> very much expected to _mean_ something.

of course.

> 
> > If user can ssh into the host they can become root.
> > If arbitrary code can run on the host it will be break out of any sandbox.
> > Containers are not providing the level of security that is enough
> > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> 
> I'm not sure why you draw the line for VMs -- they're just as buggy
> as anything else. Regardless, I reject this line of thinking: yes,
> all software is buggy, but that isn't a reason to give up.

hmm. are you saying you want kernel community to work towards
making containers (namespaces) being able to run arbitrary code
downloaded from the internet?
In other words the problems solved by user space sandboxing, gvisor, etc
should be solved by the kernel?
I really don't think it's a good idea.

> If you look at software safety as a binary, you will always be
> disappointed. If you look at it as it manifests in the real world,
> then there is some perspective to be had. Reachability of flaws becomes
> a major factor; exploit chain length becomes a factor. There are very
> real impacts to be had from security hardening, sandboxing, etc. Of
> course nothing is perfect, but the current state of the world isn't
> as you describe. (And I say this with the knowledge of how long
> the lifetime of bugs are in the kernel.)

No arguing here. Security today is mainly the number of layers.
Hardening at all levels, sanboxing do help.
namespaces is one of the layers provided by the kernel.
The point that the kernel did its job already.
All other security layers are in user space.
Looking for bugs at every layer is still must have.
In the kernel, systemd, qemu, OS, browsers, etc.
Containers provide one level of security. VMs have another.

> > Some people call it more 'secure', but it's clearly not secure for
> > arbitrary code
> 
> Perhaps it's just a language issue. "More secure" and "safer" mean
> mostly the same thing to me. I tend to think "safer" is actually
> a superset that includes things that wreck the user experience but
> aren't actually in the privilege manipulation realm. In the traditional
> "security" triad of confidentiality, integrity, and availability, I tend
> to weigh availability less highly, but a bug that stops someone from
> doing their work but doesn't wreck data, let them switch users, etc,
> is still considered a "security" issue by many folks. The fewer bugs
> someone is exposed to improves their security, safety, whatever. The
> easiest way to do that is confinement and its associated attack surface
> reduction. tl;dr: security and safety are very use-case-specific
> continuum, not a binary state.

yep

> 
> > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > It's been a constant source of pain. The constant blinding, randomization,
> > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > are simply not worth it. It's a lot of complex kernel code without users.
> > There is not a single use case to allow arbitrary malicious bpf
> > program to be loaded and executed.
> 
> The world isn't binary (safe code/malicious code), and we need to build
> systems that can be used safely even when things go wrong. Yes, probably
> no one has a system that _intentionally_ feeds eBPF into the kernel from
> a web form. But there is probably someone who does it unintentionally,
> or has a user login exposed on a system where unpriv BPF is enabled. The
> point is to create primitives as safely as possible so when things DO
> go wrong, they fail safe instead of making things worse.
> 
> I'm all for a "less privileged than root" API for eBPF, but I get worried
> when I see "security" being treated as a binary state. Especially when
> it is considered an always-failed state. :)

'security as always failed state' ? hmm.
not sure where this impression came from.
One of the goals here is to do sysctl kernel.unprivileged_bpf_disabled=1
which will make the system overall _more_ secure.
I hope we can agree on that.

^ permalink raw reply

* Re: [PATCHv6 16/36] fd/proc: Respect boottime inside time namespace for /proc/uptime
From: Randy Dunlap @ 2019-08-16  0:46 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Dmitry Safonov, Adrian Reber, Andrei Vagin, Andy Lutomirski,
	Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
	x86
In-Reply-To: <20190815163836.2927-17-dima@arista.com>

On 8/15/19 9:38 AM, Dmitry Safonov wrote:
> Co-developed-by: Andrei Vagin <avagin@openvz.org>
> Signed-off-by: Andrei Vagin <avagin@openvz.org>
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>  fs/proc/uptime.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
> index a4c2791ab70b..5a1b228964fb 100644
> --- a/fs/proc/uptime.c
> +++ b/fs/proc/uptime.c

Please fix $Subject (s/fd/fs/)

thanks.
-- 
~Randy

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-16  0:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kees Cook, Andy Lutomirski, Song Liu, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190815234622.t65oxm5mtfzy6fhg@ast-mbp.dhcp.thefacebook.com>

> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>> 
>> I'm not sure why you draw the line for VMs -- they're just as buggy
>> as anything else. Regardless, I reject this line of thinking: yes,
>> all software is buggy, but that isn't a reason to give up.
> 
> hmm. are you saying you want kernel community to work towards
> making containers (namespaces) being able to run arbitrary code
> downloaded from the internet?

Yes.

As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)

To be clear, Sandstorm did not allow creation of a userns by the untrusted code, and Sandstorm would have heavily restricted bpf(), but that should only be necessary because of the possibility of kernel bugs, not because of the overall design.

Alexei, I’m trying to encourage you to aim for something even better than you have now. Right now, if you grant a user various very strong capabilities, that user’s systemd can use bpf network filters.  Your proposal would allow this with a different, but still very strong, set of capabilities. There’s nothing wrong with this per se, but I think you can aim much higher:

CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over the system, *by design*.  I’m suggesting that you engage the security community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for something better: make it so that a normal Linux distro would be willing to relax its settings enough so that normal users can use bpf filtering in the systemd units and maybe eventually use even more bpf() capabilities. And let’s make is to that mainstream container managers (that use userns!) will be willing (as an option) to delegate bpf() to their containers. We’re happy to help design, review, and even write code, but we need you to be willing to work with us to make a design that seems like it will work and then to wait long enough to merge it for us to think about it, try to poke holes in it, and convince ourselves and each other that it has a good chance of being sound.

Obviously there will be many cases where an unprivileged program should *not* be able to use bpf() IP filtering, but let’s make it so that enabling these advanced features does not automatically give away the keys to the kingdom.

(Sandstorm still exists but is no longer as actively developed, sadly.)

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-08-16  5:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <B0364660-AD6A-4E5C-B04F-3B6DA78B4BBE@amacapital.net>



> On Aug 15, 2019, at 5:54 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> 
>>> 
>>> I'm not sure why you draw the line for VMs -- they're just as buggy
>>> as anything else. Regardless, I reject this line of thinking: yes,
>>> all software is buggy, but that isn't a reason to give up.
>> 
>> hmm. are you saying you want kernel community to work towards
>> making containers (namespaces) being able to run arbitrary code
>> downloaded from the internet?
> 
> Yes.
> 
> As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)
> 
> To be clear, Sandstorm did not allow creation of a userns by the untrusted code, and Sandstorm would have heavily restricted bpf(), but that should only be necessary because of the possibility of kernel bugs, not because of the overall design.
> 
> Alexei, I’m trying to encourage you to aim for something even better than you have now. Right now, if you grant a user various very strong capabilities, that user’s systemd can use bpf network filters.  Your proposal would allow this with a different, but still very strong, set of capabilities. There’s nothing wrong with this per se, but I think you can aim much higher:
> 
> CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over the system, *by design*.  I’m suggesting that you engage the security community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for something better: make it so that a normal Linux distro would be willing to relax its settings enough so that normal users can use bpf filtering in the systemd units and maybe eventually use even more bpf() capabilities. And let’s make is to that mainstream container managers (that use userns!) will be willing (as an option) to delegate bpf() to their containers. We’re happy to help design, review, and even write code, but we need you to be willing to work with us to make a design that seems like it will work and then to wait long enough to merge it for us to think about it, try to poke holes in it, and convince ourselves and each other that it has a good chance of being sound.
> 
> Obviously there will be many cases where an unprivileged program should *not* be able to use bpf() IP filtering, but let’s make it so that enabling these advanced features does not automatically give away the keys to the kingdom.
> 
> (Sandstorm still exists but is no longer as actively developed, sadly.)

I am trying to understand different perspectives here. 

Disclaimer: Alexei and I both work for Facebook. But he may disagree 
with everything I am about to say below, because we haven't sync'ed 
about this for a while. :)

I think there are two types of use cases here: 

    1. CAP_BPF_ADMIN: one big key to all sys_bpf(). 
    2. CAP_BPF: subset of sys_bpf() that is safe for containers.

IIUC, currently, CAP_BPF_ADMIN is (almost) same as CAP_SYS_ADMIN. 
And there aren't many real world use cases for CAP_BPF. 

The /dev/bpf patch tries to separate CAP_BPF_ADMIN from CAP_SYS_ADMIN.
On the other hand, Andy would like to introduce CAP_BPF and build
amazing use cases around it (chicken-egg problem). 

Did I misunderstand anything?

If not, I think these two use cases do not really conflict with each
other, and we probably need both of them. Then, the next question is 
do we really need both/either of them. Maybe having two separate 
discussions would make it easier?


The following are some questions I am trying to understand for 
the two cases. 

For CAP_BPF_ADMIN (or /dev/bpf):
Can we just use CAP_NET_ADMIN? It is safer than CAP_SYS_ADMIN, and
reuse existing CAP_ should be easier than introducing a new one? 

For CAP_BPF: 
Do we really need it for the containers? Is it possible to implement 
all container use cases with SUID? At this moment, I think SUID is 
the right way to go for this use case, because this is likely to 
start with a small set of functionalities. We can introduce CAP_BPF
when the container use case is too complicated for SUID. 


I hope some of these questions/thoughts would make some sense?

Thanks,
Song

^ permalink raw reply

* Re: [PATCHv6 01/36] ns: Introduce Time Namespace
From: Andrei Vagin @ 2019-08-16  6:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Andrei Vagin,
	Adrian Reber, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1908151908230.1908@nanos.tec.linutronix.de>

On Thu, Aug 15, 2019 at 07:19:12PM +0200, Thomas Gleixner wrote:
> Dmitry,
> 
> On Thu, 15 Aug 2019, Dmitry Safonov wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 420567d1519a..97b7737f5aba 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -12898,6 +12898,8 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
> >  S:	Maintained
> >  F:	fs/timerfd.c
> >  F:	include/linux/timer*
> > +F:	include/linux/time_namespace.h
> > +F:	kernel/time_namespace.c
> 
> Shouldn't this be kernel/time/namespace.c so all that stuff is lumped
> together. No strong opinion though.

Sure, we can move this in kernel/time. I don't remember why I decided to
place it in kernel/.

> 
> > +++ b/kernel/time_namespace.c
> > +static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
> > +{
> > +	return inc_ucount(ns, current_euid(), UCOUNT_TIME_NAMESPACES);
> > +}
> > +
> > +static void dec_time_namespaces(struct ucounts *ucounts)
> > +{
> > +	dec_ucount(ucounts, UCOUNT_TIME_NAMESPACES);
> > +}
> > +
> > +static struct time_namespace *create_time_ns(void)
> > +{
> > +	struct time_namespace *time_ns;
> > +
> > +	time_ns = kmalloc(sizeof(struct time_namespace), GFP_KERNEL);
> 
> Shouldn't this use kzalloc()? There are tons of members in that struct.

I don't think so. All of other members are initialized in clone_time_ns.
Maybe we don't need this helper. When I wrote this code, I looked at
kernel/utsname.c. I will think what we can do here to make this code
more clear.

> 
> > +	if (time_ns) {
> > +		kref_init(&time_ns->kref);
> > +		time_ns->initialized = false;
> 
> And you spare this one.

This one should be initialized in clone_time_ns too.
> 
> > +	}
> > +	return time_ns;
> > +}
> > +
> > +/*
> > + * Clone a new ns copying @old_ns, setting refcount to 1
> > + * @old_ns: namespace to clone
> > + * Return the new ns or ERR_PTR.
> 
> If you use kernel-doc style then please use te proper syntax
> 
> /*
>  * clone_time_ns - Clone a time namespace
>  * @old_ns:	Namespace to clone
>  *
>  * Clone @old_ns and set the clone refcount to 1
>  *
>  * Return: The new namespace or ERR_PTR.
>  */

Will fix. Thanks!

> 
> > + */
> > +static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
> > +					  struct time_namespace *old_ns)
> > +{
> > +	struct time_namespace *ns;
> > +	struct ucounts *ucounts;
> > +	int err;
> > +
> > +	err = -ENOSPC;
> > +	ucounts = inc_time_namespaces(user_ns);
> > +	if (!ucounts)
> > +		goto fail;
> > +
> > +	err = -ENOMEM;
> > +	ns = create_time_ns();
> > +	if (!ns)
> > +		goto fail_dec;
> > +
> > +	err = ns_alloc_inum(&ns->ns);
> > +	if (err)
> > +		goto fail_free;
> > +
> > +	ns->ucounts = ucounts;
> > +	ns->ns.ops = &timens_operations;
> > +	ns->user_ns = get_user_ns(user_ns);
> > +	return ns;
> > +
> > +fail_free:
> > +	kfree(ns);
> > +fail_dec:
> > +	dec_time_namespaces(ucounts);
> > +fail:
> > +	return ERR_PTR(err);
> > +}
> > +
> > +/*
> > + * Add a reference to old_ns, or clone it if @flags specify CLONE_NEWTIME.
> > + * In latter case, changes to the time of this process won't be seen by parent,
> > + * and vice versa.
> 
> Ditto

Will fix.

> 
> > + */
> > +struct time_namespace *copy_time_ns(unsigned long flags,
> > +	struct user_namespace *user_ns, struct time_namespace *old_ns)
> > +{
> > +	if (!(flags & CLONE_NEWTIME))
> > +		return get_time_ns(old_ns);
> > +
> > +	return clone_time_ns(user_ns, old_ns);
> > +}
> > +
> > +void free_time_ns(struct kref *kref)
> > +{
> > +	struct time_namespace *ns;
> > +
> > +	ns = container_of(kref, struct time_namespace, kref);
> > +	dec_time_namespaces(ns->ucounts);
> > +	put_user_ns(ns->user_ns);
> > +	ns_free_inum(&ns->ns);
> > +	kfree(ns);
> > +}
> > +
> > +static struct time_namespace *to_time_ns(struct ns_common *ns)
> > +{
> > +	return container_of(ns, struct time_namespace, ns);
> > +}
> > +
> > +static struct ns_common *timens_get(struct task_struct *task)
> > +{
> > +	struct time_namespace *ns = NULL;
> > +	struct nsproxy *nsproxy;
> > +
> > +	task_lock(task);
> > +	nsproxy = task->nsproxy;
> > +	if (nsproxy) {
> > +		ns = nsproxy->time_ns;
> > +		get_time_ns(ns);
> > +	}
> > +	task_unlock(task);
> > +
> > +	return ns ? &ns->ns : NULL;
> > +}
> > +
> > +static struct ns_common *timens_for_children_get(struct task_struct *task)
> > +{
> > +	struct time_namespace *ns = NULL;
> > +	struct nsproxy *nsproxy;
> > +
> > +	task_lock(task);
> > +	nsproxy = task->nsproxy;
> > +	if (nsproxy) {
> > +		ns = nsproxy->time_ns_for_children;
> > +		get_time_ns(ns);
> > +	}
> > +	task_unlock(task);
> > +
> > +	return ns ? &ns->ns : NULL;
> > +}
> > +
> > +static void timens_put(struct ns_common *ns)
> > +{
> > +	put_time_ns(to_time_ns(ns));
> > +}
> > +
> > +static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
> > +{
> > +	struct time_namespace *ns = to_time_ns(new);
> > +
> > +	if (!current_is_single_threaded())
> > +		return -EUSERS;
> > +
> > +	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > +	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > +		return -EPERM;
> > +
> > +	get_time_ns(ns);
> > +	get_time_ns(ns);
> 
> Why is this a double get?

Because we change nsproxy->time_ns and nsproxy->time_ns_for_children.

Probably, I need to reorgonize the code this way:

	get_time_ns(ns);
	put_time_ns(nsproxy->time_ns);
	nsproxy->time_ns = ns;

	get_time_ns(ns);
	put_time_ns(nsproxy->time_ns_for_children);
	nsproxy->time_ns_for_children = ns;
> 
> > +	put_time_ns(nsproxy->time_ns);
> > +	put_time_ns(nsproxy->time_ns_for_children);
> > +	nsproxy->time_ns = ns;
> > +	nsproxy->time_ns_for_children = ns;
> > +	ns->initialized = true;
> > +	return 0;
> > +}
> > +
> > +int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
> > +{
> > +	struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
> > +	struct time_namespace *ns = to_time_ns(nsc);
> > +
> > +	if (nsproxy->time_ns == nsproxy->time_ns_for_children)
> > +		return 0;
> > +
> > +	get_time_ns(ns);
> > +	put_time_ns(nsproxy->time_ns);
> > +	nsproxy->time_ns = ns;
> > +	ns->initialized = true;
> 
> Isn't that one initialized already?

When we discussed time namespaces last year, we decided that clock
offsets can be set only before any task enters a namespace.

When a namespace is created, ns->initialized is set to false. When a
task enters the namespace, ns->initialized is set to true.

Namespace clock offsets can be changed only if ns->initialized is false.

A new task can be forked to a specified time namespace or it can call
setns, so we set ns->initialized to true in timens_on_fork and
timens_install.

> 
> > +
> > +	return 0;
> > +}
> > +
> > +static struct user_namespace *timens_owner(struct ns_common *ns)
> > +{
> > +	return to_time_ns(ns)->user_ns;
> > +}
> > +
> > +const struct proc_ns_operations timens_operations = {
> > +	.name		= "time",
> > +	.type		= CLONE_NEWTIME,
> > +	.get		= timens_get,
> > +	.put		= timens_put,
> > +	.install	= timens_install,
> > +	.owner		= timens_owner,
> > +};
> > +
> > +const struct proc_ns_operations timens_for_children_operations = {
> > +	.name		= "time_for_children",
> > +	.type		= CLONE_NEWTIME,
> > +	.get		= timens_for_children_get,
> > +	.put		= timens_put,
> > +	.install	= timens_install,
> > +	.owner		= timens_owner,
> > +};
> > +
> > +struct time_namespace init_time_ns = {
> > +	.kref = KREF_INIT(3),
> > +	.user_ns = &init_user_ns,
> > +	.ns.inum = PROC_TIME_INIT_INO,
> > +	.ns.ops = &timens_operations,
> > +};
> 
> Inconsisten formatting. The above static initializers are nicely
> tabular. This on not.

Will fix. Thanks.

Thanks,
Andrei
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply

* Re: [PATCHv6 30/36] selftest/timens: Add Time Namespace test for supported clocks
From: Andrei Vagin @ 2019-08-16  6:20 UTC (permalink / raw)
  To: shuah
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov,
	Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
	x86
In-Reply-To: <02add700-b626-a1b4-09e1-1e4d5cd242f2@kernel.org>

On Thu, Aug 15, 2019 at 05:18:21PM -0600, shuah wrote:
> Hi Dmitry,
> 
> Thanks for the patch.
> 
> On 8/15/19 10:38 AM, Dmitry Safonov wrote:
> > A test to check that all supported clocks work on host and inside
> > a new time namespace. Use both ways to get time: through VDSO and
> > by entering the kernel with implicit syscall.
> > 
> > Introduce a new timens directory in selftests framework for
> > the next timens tests.
> > 
> > Co-developed-by: Andrei Vagin <avagin@openvz.org>
> > Signed-off-by: Andrei Vagin <avagin@openvz.org>
> > Signed-off-by: Dmitry Safonov <dima@arista.com>
> > ---
> >   tools/testing/selftests/Makefile          |   1 +
> >   tools/testing/selftests/timens/.gitignore |   1 +
> >   tools/testing/selftests/timens/Makefile   |   5 +
> >   tools/testing/selftests/timens/config     |   1 +
> >   tools/testing/selftests/timens/log.h      |  26 +++
> >   tools/testing/selftests/timens/timens.c   | 185 ++++++++++++++++++++++
> >   tools/testing/selftests/timens/timens.h   |  63 ++++++++
> >   7 files changed, 282 insertions(+)
> >   create mode 100644 tools/testing/selftests/timens/.gitignore
> >   create mode 100644 tools/testing/selftests/timens/Makefile
> >   create mode 100644 tools/testing/selftests/timens/config
> >   create mode 100644 tools/testing/selftests/timens/log.h
> >   create mode 100644 tools/testing/selftests/timens/timens.c
> >   create mode 100644 tools/testing/selftests/timens/timens.h
> > 
> > diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> > index 25b43a8c2b15..6fc63b84a857 100644
> > --- a/tools/testing/selftests/Makefile
> > +++ b/tools/testing/selftests/Makefile
> > @@ -47,6 +47,7 @@ TARGETS += splice
> >   TARGETS += static_keys
> >   TARGETS += sync
> >   TARGETS += sysctl
> > +TARGETS += timens
> 
> How long does this test run for?

On my laptop, it needs 30 miliseconds.

> Does this test need to be run
> as root? If yes, please add a root check and skip the test.

Yes, it needs to be as root. We will add this check. Thanks.

> 
> What does the test output looks like for skip and pass/fail cases?

[avagin@laptop timens]$ ./timens 
not ok 1 # SKIP Time namespaces are not supported

[root@fc24 timens]# ./timens 
1..10
ok 1 Passed for CLOCK_BOOTTIME (syscall)
ok 2 Passed for CLOCK_BOOTTIME (vdso)
ok 3 Passed for CLOCK_BOOTTIME_ALARM (syscall)
ok 4 Passed for CLOCK_BOOTTIME_ALARM (vdso)
ok 5 Passed for CLOCK_MONOTONIC (syscall)
ok 6 Passed for CLOCK_MONOTONIC (vdso)
ok 7 Passed for CLOCK_MONOTONIC_COARSE (syscall)
ok 8 Passed for CLOCK_MONOTONIC_COARSE (vdso)
ok 9 Passed for CLOCK_MONOTONIC_RAW (syscall)
ok 10 Passed for CLOCK_MONOTONIC_RAW (vdso)
# Pass 10 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0

Thanks,
Andrei

> 
> thanks,
> -- Shuah

^ permalink raw reply

* Re: [PATCHv6 01/36] ns: Introduce Time Namespace
From: Thomas Gleixner @ 2019-08-16  6:34 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Andrei Vagin,
	Adrian Reber, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <20190816061119.GA14312@gmail.com>

Andrei,

On Thu, 15 Aug 2019, Andrei Vagin wrote:
> On Thu, Aug 15, 2019 at 07:19:12PM +0200, Thomas Gleixner wrote:
> > > +static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
> > > +{
> > > +	struct time_namespace *ns = to_time_ns(new);
> > > +
> > > +	if (!current_is_single_threaded())
> > > +		return -EUSERS;
> > > +
> > > +	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > > +	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > > +		return -EPERM;
> > > +
> > > +	get_time_ns(ns);
> > > +	get_time_ns(ns);
> > 
> > Why is this a double get?
> 
> Because we change nsproxy->time_ns and nsproxy->time_ns_for_children.
> 
> Probably, I need to reorgonize the code this way:
> 
> 	get_time_ns(ns);
> 	put_time_ns(nsproxy->time_ns);
> 	nsproxy->time_ns = ns;
> 
> 	get_time_ns(ns);
> 	put_time_ns(nsproxy->time_ns_for_children);
> 	nsproxy->time_ns_for_children = ns;

That's better. A few comments about refcounting might be useful as well.
 
> > > +	put_time_ns(nsproxy->time_ns);
> > > +	put_time_ns(nsproxy->time_ns_for_children);
> > > +	nsproxy->time_ns = ns;
> > > +	nsproxy->time_ns_for_children = ns;
> > > +	ns->initialized = true;
> > > +	return 0;
> > > +}
> > > +
> > > +int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
> > > +{
> > > +	struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
> > > +	struct time_namespace *ns = to_time_ns(nsc);
> > > +
> > > +	if (nsproxy->time_ns == nsproxy->time_ns_for_children)
> > > +		return 0;

Doesn't this need to take a refcount on fork? Maybe not, but see below.

> > > +
> > > +	get_time_ns(ns);
> > > +	put_time_ns(nsproxy->time_ns);
> > > +	nsproxy->time_ns = ns;
> > > +	ns->initialized = true;
> > 
> > Isn't that one initialized already?
> 
> When we discussed time namespaces last year, we decided that clock
> offsets can be set only before any task enters a namespace.
> 
> When a namespace is created, ns->initialized is set to false. When a
> task enters the namespace, ns->initialized is set to true.

Right. I'm probably just hopelessly confused by this nsproxy->time_ns
vs. nsproxy->time_ns_for_children->ns thing.
 
> Namespace clock offsets can be changed only if ns->initialized is false.
> 
> A new task can be forked to a specified time namespace or it can call
> setns, so we set ns->initialized to true in timens_on_fork and
> timens_install.

Some comments explaining that logic above would be really helpful.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCHv6 28/36] posix-clocks: Add align for timens_offsets
From: Thomas Gleixner @ 2019-08-16  6:36 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Dmitry Safonov, Adrian Reber, Andrei Vagin,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1908152010230.1908@nanos.tec.linutronix.de>

On Thu, 15 Aug 2019, Thomas Gleixner wrote:
> So using struct timespec64 for the offset is wrong. You really need to open
> code that offset storage if you don't want to end up with weird workarounds
> for BE.
> 
> Something like this:
> 
> struct timens_offs {
> 	  time64_t	tv_sec;
> 	  s64		tv_nsec;

Actually that should use 'int' for the tv_nsec part.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Jordan Glover @ 2019-08-16  9:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190815230808.2o2qe7a72cwdce2m@ast-mbp.dhcp.thefacebook.com>

On Thursday, August 15, 2019 11:08 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Aug 15, 2019 at 11:36:43AM -0700, Andy Lutomirski wrote:
>
> > On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
> > alexei.starovoitov@gmail.com wrote:
> >
> > > On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > >
> > > > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > > > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > > > process. Also unprivileged user process can start systemd --user process with any
> > > > command they like.
> > >
> > > systemd itself is trusted. It's the same binary whether it runs as pid=1
> > > or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> > > Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> > > CAP_BPF is a natural step in the same direction.
> >
> > I have the feeling that we're somehow speaking different languages.
> > What, precisely, do you mean when you say "systemd itself is trusted"?
> > Do you mean "the administrator trusts that the /lib/systemd/systemd
> > binary is not malicious"? Do you mean "the administrator trusts that
> > the running systemd process is not malicious"?
>
> please see
> https://github.com/systemd/systemd/commit/4c1567f29aeb60a6741874bca8a8e3a0bd69ed01
> I'm not advocating for or against this approach.
> Call it 'security hole' or 'better security'.
> There are two categories of people for any feature like this.
> My point that there is a demand to use bpf for non-root and CAP_NET_ADMIN
> level of privileges is acceptable.
> Another option is to relax all of bpf to CAP_NET_ADMIN instead of CAP_SYS_ADMIN.
> But CAP_BPF is clearly better way.
>

Do you realize it's not possible to grant CAP_NET_ADMIN or any other CAP in
"systemd --user" service? Trying to do so will fail with:
"Failed to apply ambient capabilities (before UID change): Operation not permitted"

I think it's crucial to clear that point to avoid confusion in this discussion
where people are talking about different things.

On the other hand running "systemd --system" service with:

User=nobody
AmbientCapabilities=CAP_NET_ADMIN

is perfectly legit and clears some security concerns as only privileged user
can start such service.

Jordan

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Thomas Gleixner @ 2019-08-16  9:59 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <fkD3fs46a1YnR4lh0tEG-g3tDnDcyZuzji7bAUR9wujPLLl75ZhI8Yk-H1jZpSugO7qChVeCwxAMmxLdeoF2QFS3ZzuYlh7zmeZOmhDJxww=@protonmail.ch>

On Fri, 16 Aug 2019, Jordan Glover wrote:
> "systemd --user" service? Trying to do so will fail with:
> "Failed to apply ambient capabilities (before UID change): Operation not permitted"
> 
> I think it's crucial to clear that point to avoid confusion in this discussion
> where people are talking about different things.
> 
> On the other hand running "systemd --system" service with:
> 
> User=nobody
> AmbientCapabilities=CAP_NET_ADMIN
> 
> is perfectly legit and clears some security concerns as only privileged user
> can start such service.

While we are at it, can we please stop looking at this from a systemd only
perspective. There is a world outside of systemd.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Jordan Glover @ 2019-08-16 11:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <alpine.DEB.2.21.1908161158490.1873@nanos.tec.linutronix.de>

On Friday, August 16, 2019 9:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:

> On Fri, 16 Aug 2019, Jordan Glover wrote:
>
> > "systemd --user" service? Trying to do so will fail with:
> > "Failed to apply ambient capabilities (before UID change): Operation not permitted"
> > I think it's crucial to clear that point to avoid confusion in this discussion
> > where people are talking about different things.
> > On the other hand running "systemd --system" service with:
> > User=nobody
> > AmbientCapabilities=CAP_NET_ADMIN
> > is perfectly legit and clears some security concerns as only privileged user
> > can start such service.
>
> While we are at it, can we please stop looking at this from a systemd only
> perspective. There is a world outside of systemd.
>
> Thanks,
>
> tglx

If you define:

"systemd --user" == unprivileged process started by unprivileged user
"systemd --system" == process started by privileged user but run as another
user which keeps some of parent user privileges and drops others

you can get rid of "systemd" from the equation.

"systemd --user" was the example provided by Alexei when asked about the usecase
but his description didn't match what it does so it's not obvious what the real
usecase is. I'm sure there can be many more examples and systemd isn't important
here in particular beside to understand this specific example.

Jordan

^ permalink raw reply

* Re: [PATCH v8 22/27] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-08-16 13:55 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190813205225.12032-23-yu-cheng.yu@intel.com>

On Tue, Aug 13, 2019 at 01:52:20PM -0700, Yu-cheng Yu wrote:
> An ELF file's .note.gnu.property indicates features the executable file
> can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> GNU_PROPERTY_X86_FEATURE_1_SHSTK.
> 
> With this patch, if an arch needs to setup features from ELF properties,
> it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and specific
> arch_parse_property() and arch_setup_property().
> 
> For example, for X86_64:
> 
> int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
> {
> 	int r;
> 	uint32_t property;
> 
> 	r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
> 			     &property);
> 	...
> }
> 
> This patch is derived from code provided by H.J. Lu <hjl.tools@gmail.com>.

This is a nice simplification over the previous version, but I'm still
wondering whether it would be better to follow others folks' suggestions
and simply iterate over all the properties found, calling an arch
function for each note that the core doesn't care about.

Something like the following pseudocode:

include/x86/elf.h:
	int arch_elf_property(p)
	{
		if (p->pr_type == GNU_PROPERTY_X86_FEATURE_1_AND)
			return elf_property_x86_feature_1_and(p);
		else
			return 0;
	}

binfmt_elf.c:
	while (p = find next property)
		arch_elf_property(p);


This would also be more efficient when more than one property needs to
be extracted, since it ensures the file is only read once.

Anyway, comments below...

> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  fs/Kconfig.binfmt        |   3 +
>  fs/Makefile              |   1 +
>  fs/binfmt_elf.c          |  20 +++++
>  fs/gnu_property.c        | 178 +++++++++++++++++++++++++++++++++++++++
>  include/linux/elf.h      |  11 +++
>  include/uapi/linux/elf.h |  14 +++
>  6 files changed, 227 insertions(+)
>  create mode 100644 fs/gnu_property.c
> 
> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
> index 62dc4f577ba1..d2cfe0729a73 100644
> --- a/fs/Kconfig.binfmt
> +++ b/fs/Kconfig.binfmt
> @@ -36,6 +36,9 @@ config COMPAT_BINFMT_ELF
>  config ARCH_BINFMT_ELF_STATE
>  	bool
>  
> +config ARCH_USE_GNU_PROPERTY
> +	bool
> +
>  config BINFMT_ELF_FDPIC
>  	bool "Kernel support for FDPIC ELF binaries"
>  	default y if !BINFMT_ELF
> diff --git a/fs/Makefile b/fs/Makefile
> index d60089fd689b..939b1eb7e8cc 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -44,6 +44,7 @@ obj-$(CONFIG_BINFMT_ELF)	+= binfmt_elf.o
>  obj-$(CONFIG_COMPAT_BINFMT_ELF)	+= compat_binfmt_elf.o
>  obj-$(CONFIG_BINFMT_ELF_FDPIC)	+= binfmt_elf_fdpic.o
>  obj-$(CONFIG_BINFMT_FLAT)	+= binfmt_flat.o
> +obj-$(CONFIG_ARCH_USE_GNU_PROPERTY) += gnu_property.o
>  
>  obj-$(CONFIG_FS_MBCACHE)	+= mbcache.o
>  obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.o
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index d4e11b2e04f6..a4e87fcb10a8 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -852,6 +852,21 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  			}
>  	}
>  
> +	if (interpreter) {
> +		retval = arch_parse_property(&loc->interp_elf_ex,
> +					     interp_elf_phdata,
> +					     interpreter, true,
> +					     &arch_state);
> +	} else {
> +		retval = arch_parse_property(&loc->elf_ex,
> +					     elf_phdata,
> +					     bprm->file, false,
> +					     &arch_state);
> +	}
> +
> +	if (retval)
> +		goto out_free_dentry;
> +
>  	/*
>  	 * Allow arch code to reject the ELF at this point, whilst it's
>  	 * still possible to return an error to the code that invoked
> @@ -1080,6 +1095,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  		goto out_free_dentry;
>  	}
>  
> +	retval = arch_setup_property(&arch_state);
> +
> +	if (retval < 0)
> +		goto out_free_dentry;
> +
>  	if (interpreter) {
>  		unsigned long interp_map_addr = 0;
>  
> diff --git a/fs/gnu_property.c b/fs/gnu_property.c
> new file mode 100644
> index 000000000000..b22b43f4d6a0
> --- /dev/null
> +++ b/fs/gnu_property.c
> @@ -0,0 +1,178 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Extract an ELF file's .note.gnu.property.
> + *
> + * The path from the ELF header to .note.gnu.property is:
> + *	elfhdr->elf_phdr->elf_note.
> + *
> + * .note.gnu.property layout:
> + *
> + *	struct elf_note {
> + *		u32 n_namesz; --> sizeof(n_name[]); always (4)
> + *		u32 n_ndescsz;--> sizeof(property[])
> + *		u32 n_type;   --> always NT_GNU_PROPERTY_TYPE_0 (5)
> + *	};
> + *	char n_name[4]; --> always 'GNU\0'
> + *
> + *	struct {
> + *		struct gnu_property {
> + *			u32 pr_type;
> + *			u32 pr_datasz;
> + *		};
> + *		u8 pr_data[pr_datasz];
> + *	}[];
> + */

Do we need all this comment?  We already have Elf{32,64}_Nhdr and
struct gnu_property in <uapi/elf.h>.

> +
> +#include <linux/elf.h>
> +#include <linux/slab.h>
> +#include <linux/fs.h>
> +#include <linux/string.h>
> +#include <linux/compat.h>
> +
> +/*
> + * Search a note's payload for 'pr_type'.
> + */
> +static int check_note_payload(void *buf, unsigned long len, u32 pr_type,
> +			      u32 *property)
> +{
> +	u32 pr_type_max = 0;
> +
> +	*property = 0;
> +
> +	while (len > 0) {
> +		struct gnu_property *pr = buf;
> +		unsigned long pr_len;
> +
> +		if (sizeof(*pr) > len)

checkpatch? (space required between sizeof and "(")

> +			return 0;

Shouldn't this be an error?
I'd have thought we should return 0 only if the property is found.

> +
> +		pr_len = sizeof(*pr) + pr->pr_datasz;

Overflow?

> +
> +		if (pr_len > len)
> +			return -ENOEXEC;

These seem to be the same class of error, i.e., trailing garbage in the
note, so why don't we return the same thing in both cases?

Maybe

	if (sizeof (*pr) > len ||
	    pr->pr_datasz > len - sizeof (*pr))
		return -ENOEXEC;

	pr_len = sizeof (*pr) + pr->pr_datasz;

> +		/* property types are in ascending order */
> +		if ((pr_type_max != 0) && (pr->pr_type > pr_type_max))
> +			return 0;

Redundant ().  The first part of the condition may be redundant too.
We also don't check for pr->pr_type == pr_type_max (which is presume
is also not supposed to happen).

Do we really need to check this anyway?  I presume this rule is only in
the spec to facilitate binary search (which the spec then defeats by
having a variable property size).

If we consider the ELF file invalid when this check fails, shouldn't
this be -ENOEXEC?

> +		if (pr->pr_type > pr_type)
> +			return 0;
> +
> +		if ((pr->pr_type == pr_type) &&
> +		    (pr->pr_datasz >= sizeof(u32))) {
> +			*property = *(u32 *)(buf + sizeof(*pr));
> +			return 0;
> +		}

Shouldn't pr->pr_datasz be exactly == sizeof (u32)?

> +
> +		if (pr->pr_type > pr_type_max)
> +			pr_type_max = pr->pr_type;

All these checks have my head spinning... if we ignore the ordering
requirement, can't we reduce it all to

	if (pr->pr_type == pr_type) {
		if (pr->pr_datasz != sizeof (u32))
			return -ENOEXEC;

		*property = *(u32 *)(buf + sizeof (*pr));
		return 0;
	}

> +

Do we need to up to the appropriate alignment after each property?

> +		buf += pr_len;
> +		len -= pr_len;
> +	}
> +
> +	return 0;

-ENOENT?

> +}
> +
> +/*
> + * Look at an ELF file's NT_GNU_PROPERTY for the property of pr_type.
> + *
> + * Input:
> + *	buf: the buffer containing the whole note.
> + *	len: size of buf.
> + *	align: alignment of the note's payload.
> + *	pr_type: the property type.
> + *
> + * Output:
> + *	The property found.
> + *
> + * Return:
> + *	Zero or error.
> + */
> +static int check_note(void *buf, unsigned long len, int align,
> +			  u32 pr_type, u32 *property)

check_note_payload() and check_note() are somewhat misleadingly named,
since they don't just check.

Maybe call them gnu_property_type_0_extract_property(),
note_extract_property()?

Admittedly the first of those names would be super-verbose :/

> +{
> +	struct elf_note *n = buf;
> +	char *note_name = buf + sizeof(*n);
> +	unsigned long payload_offset;
> +	unsigned long payload_len;
> +
> +	if (len < sizeof(*n) + 4)
> +		return -ENOEXEC;
> +
> +	if ((n->n_namesz != 4) || strncmp("GNU", note_name, 3))
> +		return -ENOEXEC;

Should that be , n->n_namesz? (or , sizeof ("GNU"))?

Also, no check on n->n_type?

Alternatively, we could just not bother to check the note header:
this was found via PT_GNU_PROPERTY, so if it's not a properly
formatted NT_GNU_PROPERTY_TYPE_0 then the file is garbage anyway
and it doesn't matter exactly what we do.

Personally I would check it though.

> +
> +	payload_offset = round_up(sizeof(*n) + n->n_namesz, align);
> +	payload_len = n->n_descsz;
> +
> +	if (payload_offset + payload_len > len)

May this overflow on 32-bit?

What about:

	if (payload_offset > len ||
	    payload_len > len - payload_offset)

> +		return -ENOEXEC;
> +
> +	buf += payload_offset;
> +	len -= payload_offset;
> +
> +	return check_note_payload(buf, len, pr_type, property);
> +}
> +
> +#define find_note(phdr, nr_phdrs, align, pos, len) { \
> +	int cnt; \
> +	\
> +	for (cnt = 0; cnt < nr_phdrs; cnt++) { \

Just to avoid future surprises:

(nr_phdrs)

> +		if ((phdr)[cnt].p_align != align) \

Similarly:

(align)

> +			continue; \
> +		if ((phdr)[cnt].p_type == PT_GNU_PROPERTY) { \
> +			pos = (phdr)[cnt].p_offset; \
> +			len = (phdr)[cnt].p_filesz; \
> +		} \
> +	} \
> +}
> +
> +int get_gnu_property(void *ehdr, void *phdr, struct file *file,
> +		     u32 pr_type, u32 *property)
> +{
> +	Elf64_Ehdr *ehdr64 = ehdr;
> +	Elf32_Ehdr *ehdr32 = ehdr;
> +	void *buf;
> +	int align;
> +	loff_t pos = 0;
> +	unsigned long len = 0;
> +	int err = 0;
> +
> +	/*
> +	 * Find PT_GNU_PROPERTY from ELF program headers.
> +	 */
> +	if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {

Can we trust e_ident[EI_CLASS] to tell us how big the header is?

We don't check that anywhere AFAICT.  For the ELF interpreter in
particular, we kmalloc() the appropriate header size determined by
e_machine, so a malicious binary could have e_machine = EM_I386 with
e_ident[ELFCLASS] == ELFCLASSS64, causing a buffer overrun here.

For the main elf header, we get away with it because the bprm->buf[] is
statically allocated as BINPRM_BUF_SIZE and zero-padded in the case of a
short read.

We could pass in the header size explicitly here, or otherwise
validate that e_ident[ELFCLASS] is sane before calling in.

> +		align = 8;
> +		find_note((Elf64_Phdr *)phdr, ehdr64->e_phnum, align, pos, len);
> +	} else if (ehdr32->e_ident[EI_CLASS] == ELFCLASS32) {
> +		align = 4;
> +		find_note((Elf32_Phdr *)phdr, ehdr32->e_phnum, align, pos, len);
> +	}

Maybe make the name of find_note upper case, or pass pos and len by
reference.  Otherwise, this looks a bit like a function call -- in
which case pos and len couldn't be modified.

> +
> +	/*
> +	 * Read in the whole note.  PT_GNU_PROPERTY
> +	 * is not expected to be larger than a page.
> +	 */
> +	if (len == 0)
> +		return 0;
> +
> +	if (len > PAGE_SIZE)
> +		return -ENOEXEC;

Add a comment explaining the rationale?

> +
> +	buf = kmalloc(len, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	err = kernel_read(file, buf, len, &pos);
> +	if (err < len) {
> +		if (err >= 0)
> +			err = -EIO;
> +		goto out;
> +	}
> +
> +	err = check_note(buf, len, align, pr_type, property);
> +out:
> +	kfree(buf);
> +	return err;
> +}

[...]

Cheers
---Dave

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Andy Lutomirski @ 2019-08-16 15:23 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Dmitry Safonov, Adrian Reber, Andrei Vagin, Arnd Bergmann,
	Christian Brauner, Cyrill Gorcunov, Eric W. Biederman,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov,
	Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
	containers, criu, linux-api, x86
In-Reply-To: <20190815163836.2927-24-dima@arista.com>

On 8/15/19 9:38 AM, Dmitry Safonov wrote:
> As it has been discussed on timens RFC, adding a new conditional branch
> `if (inside_time_ns)` on VDSO for all processes is undesirable.
> It will add a penalty for everybody as branch predictor may mispredict
> the jump. Also there are instruction cache lines wasted on cmp/jmp.
> 
> Those effects of introducing time namespace are very much unwanted
> having in mind how much work have been spent on micro-optimisation
> vdso code.
> 
> The propose is to allocate a second vdso code with dynamically
> patched out (disabled by static_branch) timens code on boot time.
> 
> Allocate another vdso and copy original code.


I'm unconvinced that any of this magic is wise.  I think you should make 
a special timens vvar page that causes the normal fastpath to fail 
(using a special vclock mode, a special seq value, or a special "last" 
value) and then make the failure path detect that timens is in use and 
use the timens path.


--Andy

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-16 19:52 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Thomas Gleixner, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <lGGTLXBsX3V6p1Z4TkdzAjxbNywaPS2HwX5WLleAkmXNcnKjTPpWnP6DnceSsy8NKt5NBRBbuoAb0woKTcDhJXVoFb7Ygk3Skfj8j6rVfMQ=@protonmail.ch>

On Fri, Aug 16, 2019 at 11:33:57AM +0000, Jordan Glover wrote:
> On Friday, August 16, 2019 9:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Fri, 16 Aug 2019, Jordan Glover wrote:
> >
> > > "systemd --user" service? Trying to do so will fail with:
> > > "Failed to apply ambient capabilities (before UID change): Operation not permitted"
> > > I think it's crucial to clear that point to avoid confusion in this discussion
> > > where people are talking about different things.
> > > On the other hand running "systemd --system" service with:
> > > User=nobody
> > > AmbientCapabilities=CAP_NET_ADMIN
> > > is perfectly legit and clears some security concerns as only privileged user
> > > can start such service.
> >
> > While we are at it, can we please stop looking at this from a systemd only
> > perspective. There is a world outside of systemd.
> >
> > Thanks,
> >
> > tglx
> 
> If you define:
> 
> "systemd --user" == unprivileged process started by unprivileged user
> "systemd --system" == process started by privileged user but run as another
> user which keeps some of parent user privileges and drops others
> 
> you can get rid of "systemd" from the equation.
> 
> "systemd --user" was the example provided by Alexei when asked about the usecase
> but his description didn't match what it does so it's not obvious what the real
> usecase is. I'm sure there can be many more examples and systemd isn't important
> here in particular beside to understand this specific example.

It's both of the above when 'systemd' is not taken literally.
To earlier Thomas's point: the use case is not only about systemd.
There are other containers management systems.
I've used 'systemd-like' terminology as an attempt to explain that such
daemons are trusted signed binaries that can be run as pid=1.
Sometimes it's the later:
"process started by privileged user but run as another user which keeps
some of parent user privileges and drops others".
Sometimes capability delegation to another container management daemon
is too cumbersome, so it's easier to use suid bit on that other daemon.
So it will become like the former:
"sort-of unprivileged process started by unprivileged user."
where daemon has suid and drops most of the capabilities as it starts.
Let's not focus on the model being good or bad security wise.
The point that those are the use cases that folks are thinking about.
That secondary daemon can be full root just fine.
All outer and inner daemons can be root.
These daemons need to drop privileges to make the system safer ==
less prone to corruption due to bugs in themselves. Not necessary security bugs.

^ permalink raw reply

* Re: [PATCH v8 04/27] x86/fpu/xstate: Introduce XSAVES system states
From: Thomas Gleixner @ 2019-08-16 19:56 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Ingo Molnar, linux-kernel, linux-doc,
	linux-mm, linux-arch, linux-api, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel
In-Reply-To: <20190813205225.12032-5-yu-cheng.yu@intel.com>

On Tue, 13 Aug 2019, Yu-cheng Yu wrote:
> +/*
> + * On context switches, XSAVE states are not restored until returning
> + * to user-mode.  FPU registers need to be restored before any changes,
> + * and protected by fpregs_lock()/fpregs_unlock().

I really had to read this comment twice to figure out what it means.

> + */
> +static inline void modify_fpu_regs_begin(void)

Please use a proper name space. fpu_regs_....

> +{
> +	fpregs_lock();
> +	if (test_thread_flag(TIF_NEED_FPU_LOAD))
> +		__fpregs_load_activate();
> +}
> +
> +static inline void modify_fpu_regs_end(void)
> +{
> +	fpregs_unlock();
> +}

Also why are those inlines in this particular patch? I see no relation at all.

>  /*
>   * MXCSR and XCR definitions:
>   */
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index 9ded9532257d..970bbd303cfb 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -21,9 +21,6 @@
>  #define XSAVE_YMM_SIZE	    256
>  #define XSAVE_YMM_OFFSET    (XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET)
>  
> -/* Supervisor features */
> -#define XFEATURE_MASK_SUPERVISOR (XFEATURE_MASK_PT)
> -
>  /* All currently supported features */
>  #define SUPPORTED_XFEATURES_MASK (XFEATURE_MASK_FP | \
>  				  XFEATURE_MASK_SSE | \
> @@ -42,6 +39,7 @@
>  #endif
>  
>  extern u64 xfeatures_mask_user;
> +extern u64 xfeatures_mask_all;
>  extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
>  
>  extern void __init update_regset_xstate_info(unsigned int size,
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 12c70840980e..31d3cd70b5df 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -294,12 +294,16 @@ void fpu__drop(struct fpu *fpu)
>   * Clear FPU registers by setting them up from
>   * the init fpstate:
>   */
> -static inline void copy_init_fpstate_to_fpregs(void)
> +static inline void copy_init_fpstate_to_fpregs(u64 features_mask)
>  {
>  	fpregs_lock();
>  
> +	/*
> +	 * Only XSAVES user states are copied.
> +	 * System states are preserved.

Fits nicely in one line and aside of that this comment is blatantly
wrong. See that caller:

> +		copy_init_fpstate_to_fpregs(xfeatures_mask_all);

xfeatures_mask_all includes xfeatures_mask_system unless I'm missing
something.

> +	 */
>  	if (use_xsave())
> -		copy_kernel_to_xregs(&init_fpstate.xsave, -1);
> +		copy_kernel_to_xregs(&init_fpstate.xsave, features_mask);
>  	else if (static_cpu_has(X86_FEATURE_FXSR))
>  		copy_kernel_to_fxregs(&init_fpstate.fxsave);

The change of this function should also be split out into a separate
patch. This one is way too big to be reviewable.

>  	else
> @@ -318,7 +322,21 @@ static inline void copy_init_fpstate_to_fpregs(void)
>   * Called by sys_execve(), by the signal handler code and by various
>   * error paths.
>   */
> -void fpu__clear(struct fpu *fpu)
> +void fpu__clear_user_states(struct fpu *fpu)
> +{
> +	WARN_ON_FPU(fpu != &current->thread.fpu); /* Almost certainly an anomaly */

1) Please do not use tail comments. They break the reading flow.

2) Please do not comment the obvious. Put comments where they make sense. I
   know you copied it, but that does not make it any better.

> +	fpu__drop(fpu);
> +
> +	/*
> +	 * Make sure fpstate is cleared and initialized.
> +	 */
> +	fpu__initialize(fpu);
> +	if (static_cpu_has(X86_FEATURE_FPU))
> +		copy_init_fpstate_to_fpregs(xfeatures_mask_user);
> +}
> +
> +void fpu__clear_all(struct fpu *fpu)
>  {
>  	WARN_ON_FPU(fpu != &current->thread.fpu); /* Almost certainly an anomaly */
> @@ -329,7 +347,7 @@ void fpu__clear(struct fpu *fpu)
>  	 */
>  	fpu__initialize(fpu);
>  	if (static_cpu_has(X86_FEATURE_FPU))
> -		copy_init_fpstate_to_fpregs();
> +		copy_init_fpstate_to_fpregs(xfeatures_mask_all);
>  }
>  
>  /*
> diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
> index 73fed33e5bda..0a0ba584a533 100644
> --- a/arch/x86/kernel/fpu/init.c
> +++ b/arch/x86/kernel/fpu/init.c
> @@ -217,16 +217,6 @@ static void __init fpu__init_system_xstate_size_legacy(void)
>  	fpu_user_xstate_size = fpu_kernel_xstate_size;
>  }
>  
> -/*
> - * Find supported xfeatures based on cpu features and command-line input.
> - * This must be called after fpu__init_parse_early_param() is called and
> - * xfeatures_mask is enumerated.
> - */
> -u64 __init fpu__get_supported_xfeatures_mask(void)
> -{
> -	return SUPPORTED_XFEATURES_MASK;
> -}
> -
>  /* Legacy code to initialize eager fpu mode. */
>  static void __init fpu__init_system_ctx_switch(void)
>  {
> diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
> index 8a63f07cf400..4ecf1764a971 100644
> --- a/arch/x86/kernel/fpu/signal.c
> +++ b/arch/x86/kernel/fpu/signal.c
> @@ -285,7 +285,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
>  			 IS_ENABLED(CONFIG_IA32_EMULATION));
>  
>  	if (!buf) {
> -		fpu__clear(fpu);
> +		fpu__clear_user_states(fpu);
>  		return 0;
>  	}
>  
> @@ -407,7 +407,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
>  
>  err_out:
>  	if (ret)
> -		fpu__clear(fpu);
> +		fpu__clear_user_states(fpu);
>  	return ret;
>  }
>  
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index d560e8861a3c..9fbe73c546df 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -61,9 +61,19 @@ static short xsave_cpuid_features[] __initdata = {
>   */
>  u64 xfeatures_mask_user __read_mostly;
>  
> +/*
> + * Supported XSAVES system states.
> + */
> +static u64 xfeatures_mask_system __read_mostly;
> +
> +/*
> + * Combined XSAVES system and user states.
> + */
> +u64 xfeatures_mask_all __read_mostly;
> +
>  static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
>  static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
> -static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask_user)*8];
> +static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask_all)*8];

  [sizeof(...) * 8]

>  
>  /*
>   * The XSAVE area of kernel can be in standard or compacted format;
> @@ -79,7 +89,7 @@ unsigned int fpu_user_xstate_size;
>   */
>  int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name)
>  {
> -	u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_user;
> +	u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_all;
>  
>  	if (unlikely(feature_name)) {
>  		long xfeature_idx, max_idx;
> @@ -158,7 +168,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
>  	 * None of the feature bits are in init state. So nothing else
>  	 * to do for us, as the memory layout is up to date.
>  	 */
> -	if ((xfeatures & xfeatures_mask_user) == xfeatures_mask_user)
> +	if ((xfeatures & xfeatures_mask_all) == xfeatures_mask_all)
>  		return;
>  
>  	/*
> @@ -213,28 +223,27 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
>   */
>  void fpu__init_cpu_xstate(void)
>  {
> -	if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_user)
> +	if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_all)
>  		return;
>  	/*
>  	 * XCR_XFEATURE_ENABLED_MASK sets the features that are managed
>  	 * by XSAVE{C, OPT} and XRSTOR.  Only XSAVE user states can be
>  	 * set here.
>  	 */
> -
> -	xfeatures_mask_user &= ~XFEATURE_MASK_SUPERVISOR;
> -
>  	cr4_set_bits(X86_CR4_OSXSAVE);
>  	xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user);
> +
> +	/*
> +	 * MSR_IA32_XSS controls which system (not user) states are

We know that system state is not including user state. Please stop
documenting the obvious.

> +	 * to be managed by XSAVES.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_XSAVES))
> +		wrmsrl(MSR_IA32_XSS, xfeatures_mask_system);
>  }
>  
> -/*
> - * Note that in the future we will likely need a pair of
> - * functions here: one for user xstates and the other for
> - * system xstates.  For now, they are the same.
> - */
>  static int xfeature_enabled(enum xfeature xfeature)
>  {
> -	return !!(xfeatures_mask_user & BIT_ULL(xfeature));
> +	return !!(xfeatures_mask_all & BIT_ULL(xfeature));
>  }
>  
>  /*
> @@ -340,7 +349,7 @@ static int xfeature_is_aligned(int xfeature_nr)
>   */
>  static void __init setup_xstate_comp(void)
>  {
> -	unsigned int xstate_comp_sizes[sizeof(xfeatures_mask_user)*8];
> +	unsigned int xstate_comp_sizes[sizeof(xfeatures_mask_all)*8];

See above

>  	int i;
>  
>  	/*
> @@ -413,7 +422,7 @@ static void __init setup_init_fpu_buf(void)
>  	print_xstate_features();
>  
>  	if (boot_cpu_has(X86_FEATURE_XSAVES))
> -		init_fpstate.xsave.header.xcomp_bv = BIT_ULL(63) | xfeatures_mask_user;
> +		init_fpstate.xsave.header.xcomp_bv = BIT_ULL(63) | xfeatures_mask_all;
>  
>  	/*
>  	 * Init all the features state with header.xfeatures being 0x0
> @@ -436,7 +445,7 @@ static int xfeature_uncompacted_offset(int xfeature_nr)
>  	 * format. Checking a system state's uncompacted offset is
>  	 * an error.
>  	 */
> -	if (XFEATURE_MASK_SUPERVISOR & BIT_ULL(xfeature_nr)) {
> +	if (~xfeatures_mask_user & BIT_ULL(xfeature_nr)) {

Sigh. Why can't this use xfeatures_mask_system? That would be too obvious.

>  		WARN_ONCE(1, "No fixed offset for xstate %d\n", xfeature_nr);
>  		return -1;
>  	}
> @@ -608,15 +617,12 @@ static void do_extra_xstate_size_checks(void)
>  
>  
>  /*
> - * Get total size of enabled xstates in XCR0/xfeatures_mask_user.
> + * Get total size of enabled xstates in XCR0 | IA32_XSS.
>   *
>   * Note the SDM's wording here.  "sub-function 0" only enumerates
>   * the size of the *user* states.  If we use it to size a buffer
>   * that we use 'XSAVES' on, we could potentially overflow the
>   * buffer because 'XSAVES' saves system states too.
> - *
> - * Note that we do not currently set any bits on IA32_XSS so
> - * 'XCR0 | IA32_XSS == XCR0' for now.
>   */
>  static unsigned int __init get_xsaves_size(void)
>  {
> @@ -698,6 +704,7 @@ static int __init init_xstate_size(void)
>   */
>  static void fpu__init_disable_system_xstate(void)
>  {
> +	xfeatures_mask_all = 0;
>  	xfeatures_mask_user = 0;
>  	cr4_clear_bits(X86_CR4_OSXSAVE);
>  	setup_clear_cpu_cap(X86_FEATURE_XSAVE);
> @@ -733,10 +740,23 @@ void __init fpu__init_system_xstate(void)
>  		return;
>  	}
>  
> +	/*
> +	 * Find user states supported by the processor.
> +	 * Only these bits can be set in XCR0.
> +	 */
>  	cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
>  	xfeatures_mask_user = eax + ((u64)edx << 32);
>  
> -	if ((xfeatures_mask_user & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {
> +	/*
> +	 * Find system states supported by the processor.
> +	 * Only these bits can be set in IA32_XSS MSR.
> +	 */
> +	cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
> +	xfeatures_mask_system = ecx + ((u64)edx << 32);
> +
> +	xfeatures_mask_all = xfeatures_mask_user | xfeatures_mask_system;
> +
> +	if ((xfeatures_mask_all & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {

xfeatures_mask_all is wrong here. FPSSE is clearly user state.

>  		/*
>  		 * This indicates that something really unexpected happened
>  		 * with the enumeration.  Disable XSAVE and try to continue
> @@ -751,10 +771,12 @@ void __init fpu__init_system_xstate(void)
>  	 */
>  	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
>  		if (!boot_cpu_has(xsave_cpuid_features[i]))
> -			xfeatures_mask_user &= ~BIT_ULL(i);
> +			xfeatures_mask_all &= ~BIT_ULL(i);
>  	}
>  
> -	xfeatures_mask_user &= fpu__get_supported_xfeatures_mask();
> +	xfeatures_mask_all &= SUPPORTED_XFEATURES_MASK;
> +	xfeatures_mask_user &= xfeatures_mask_all;
> +	xfeatures_mask_system &= xfeatures_mask_all;
>  
>  	/* Enable xstate instructions to be able to continue with initialization: */
>  	fpu__init_cpu_xstate();
> @@ -766,7 +788,7 @@ void __init fpu__init_system_xstate(void)
>  	 * Update info used for ptrace frames; use standard-format size and no
>  	 * system xstates:
>  	 */
> -	update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_user & ~XFEATURE_MASK_SUPERVISOR);
> +	update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_user);
>

And exactly this hunk shows that the whole refactoring approach is wrong
from the very beginning. I stared at that in the previous patch already and
had the feeling that it's bogus.

Just doing a s/xfeatures_mask/xfeatures_mask_user/g really does not make
any sense. Simply because the current code assumes that xfeatures_mask ==
xfeatures_mask_all. So if a global rename is the right approach then
s/xfeatures_mask/xfeatures_mask_all/ and not that completely backwards
rename to _user.

That refactoring wants to be done in the following steps:

   1) Introduce xfeatures_mask_user and initialize it with

       xfeatures_mask_user = xfeatures_mask ^ ~XFEATURE_MASK_SUPERVISOR;

   2) Fix up the usage sites in reviewable chunks. It does not matter
      whether that could be folded into a larger all in one patch. What
      matters is that it makes sense and is reviewable.

   3) Change the signature of copy_init_fpstate_to_fpregs() so it takes a
      mask and fix up the call sites accordingly. Without the bogus comment
      of course.

   4) Introduce xfeatures_mask_system and eventually needed helper functions.

   5) Change the affected usage sites

Details may be slightly different but you get the idea.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v8 04/27] x86/fpu/xstate: Introduce XSAVES system states
From: Yu-cheng Yu @ 2019-08-16 19:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: x86, H. Peter Anvin, Ingo Molnar, linux-kernel, linux-doc,
	linux-mm, linux-arch, linux-api, Arnd Bergmann, Andy Lutomirski,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel
In-Reply-To: <alpine.DEB.2.21.1908161703010.1923@nanos.tec.linutronix.de>

On Fri, 2019-08-16 at 21:56 +0200, Thomas Gleixner wrote:
> On Tue, 13 Aug 2019, Yu-cheng Yu wrote:
> > +/*
> > + * On context switches, XSAVE states are not restored until returning
> > + * to user-mode.  FPU registers need to be restored before any changes,
> > + * and protected by fpregs_lock()/fpregs_unlock().
> 
> I really had to read this comment twice to figure out what it means.
> 
> > + */
> > +static inline void modify_fpu_regs_begin(void)
> 
> Please use a proper name space. fpu_regs_....
> 
> > +{
> > +	fpregs_lock();
> > +	if (test_thread_flag(TIF_NEED_FPU_LOAD))
> > +		__fpregs_load_activate();
> > +}
> > +
> > +static inline void modify_fpu_regs_end(void)
> > +{
> > +	fpregs_unlock();
> > +}
> 
> Also why are those inlines in this particular patch? I see no relation at all.
> 
> >  /*
> >   * MXCSR and XCR definitions:
> >   */
> > diff --git a/arch/x86/include/asm/fpu/xstate.h
> > b/arch/x86/include/asm/fpu/xstate.h
> > index 9ded9532257d..970bbd303cfb 100644
> > --- a/arch/x86/include/asm/fpu/xstate.h
> > +++ b/arch/x86/include/asm/fpu/xstate.h
> > @@ -21,9 +21,6 @@
> >  #define XSAVE_YMM_SIZE	    256
> >  #define XSAVE_YMM_OFFSET    (XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET)
> >  
> > -/* Supervisor features */
> > -#define XFEATURE_MASK_SUPERVISOR (XFEATURE_MASK_PT)
> > -
> >  /* All currently supported features */
> >  #define SUPPORTED_XFEATURES_MASK (XFEATURE_MASK_FP | \
> >  				  XFEATURE_MASK_SSE | \
> > @@ -42,6 +39,7 @@
> >  #endif
> >  
> >  extern u64 xfeatures_mask_user;
> > +extern u64 xfeatures_mask_all;
> >  extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
> >  
> >  extern void __init update_regset_xstate_info(unsigned int size,
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index 12c70840980e..31d3cd70b5df 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -294,12 +294,16 @@ void fpu__drop(struct fpu *fpu)
> >   * Clear FPU registers by setting them up from
> >   * the init fpstate:
> >   */
> > -static inline void copy_init_fpstate_to_fpregs(void)
> > +static inline void copy_init_fpstate_to_fpregs(u64 features_mask)
> >  {
> >  	fpregs_lock();
> >  
> > +	/*
> > +	 * Only XSAVES user states are copied.
> > +	 * System states are preserved.
> 
> Fits nicely in one line and aside of that this comment is blatantly
> wrong. See that caller:
> 
> > +		copy_init_fpstate_to_fpregs(xfeatures_mask_all);
> 
> xfeatures_mask_all includes xfeatures_mask_system unless I'm missing
> something.
> 
> > +	 */
> >  	if (use_xsave())
> > -		copy_kernel_to_xregs(&init_fpstate.xsave, -1);
> > +		copy_kernel_to_xregs(&init_fpstate.xsave, features_mask);
> >  	else if (static_cpu_has(X86_FEATURE_FXSR))
> >  		copy_kernel_to_fxregs(&init_fpstate.fxsave);
> 
> The change of this function should also be split out into a separate
> patch. This one is way too big to be reviewable.
> 
> >  	else
> > @@ -318,7 +322,21 @@ static inline void copy_init_fpstate_to_fpregs(void)
> >   * Called by sys_execve(), by the signal handler code and by various
> >   * error paths.
> >   */
> > -void fpu__clear(struct fpu *fpu)
> > +void fpu__clear_user_states(struct fpu *fpu)
> > +{
> > +	WARN_ON_FPU(fpu != &current->thread.fpu); /* Almost certainly an
> > anomaly */
> 
> 1) Please do not use tail comments. They break the reading flow.
> 
> 2) Please do not comment the obvious. Put comments where they make sense. I
>    know you copied it, but that does not make it any better.
> 
> > +	fpu__drop(fpu);
> > +
> > +	/*
> > +	 * Make sure fpstate is cleared and initialized.
> > +	 */
> > +	fpu__initialize(fpu);
> > +	if (static_cpu_has(X86_FEATURE_FPU))
> > +		copy_init_fpstate_to_fpregs(xfeatures_mask_user);
> > +}
> > +
> > +void fpu__clear_all(struct fpu *fpu)
> >  {
> >  	WARN_ON_FPU(fpu != &current->thread.fpu); /* Almost certainly an
> > anomaly */
> > @@ -329,7 +347,7 @@ void fpu__clear(struct fpu *fpu)
> >  	 */
> >  	fpu__initialize(fpu);
> >  	if (static_cpu_has(X86_FEATURE_FPU))
> > -		copy_init_fpstate_to_fpregs();
> > +		copy_init_fpstate_to_fpregs(xfeatures_mask_all);
> >  }
> >  
> >  /*
> > diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
> > index 73fed33e5bda..0a0ba584a533 100644
> > --- a/arch/x86/kernel/fpu/init.c
> > +++ b/arch/x86/kernel/fpu/init.c
> > @@ -217,16 +217,6 @@ static void __init
> > fpu__init_system_xstate_size_legacy(void)
> >  	fpu_user_xstate_size = fpu_kernel_xstate_size;
> >  }
> >  
> > -/*
> > - * Find supported xfeatures based on cpu features and command-line input.
> > - * This must be called after fpu__init_parse_early_param() is called and
> > - * xfeatures_mask is enumerated.
> > - */
> > -u64 __init fpu__get_supported_xfeatures_mask(void)
> > -{
> > -	return SUPPORTED_XFEATURES_MASK;
> > -}
> > -
> >  /* Legacy code to initialize eager fpu mode. */
> >  static void __init fpu__init_system_ctx_switch(void)
> >  {
> > diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
> > index 8a63f07cf400..4ecf1764a971 100644
> > --- a/arch/x86/kernel/fpu/signal.c
> > +++ b/arch/x86/kernel/fpu/signal.c
> > @@ -285,7 +285,7 @@ static int __fpu__restore_sig(void __user *buf, void
> > __user *buf_fx, int size)
> >  			 IS_ENABLED(CONFIG_IA32_EMULATION));
> >  
> >  	if (!buf) {
> > -		fpu__clear(fpu);
> > +		fpu__clear_user_states(fpu);
> >  		return 0;
> >  	}
> >  
> > @@ -407,7 +407,7 @@ static int __fpu__restore_sig(void __user *buf, void
> > __user *buf_fx, int size)
> >  
> >  err_out:
> >  	if (ret)
> > -		fpu__clear(fpu);
> > +		fpu__clear_user_states(fpu);
> >  	return ret;
> >  }
> >  
> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > index d560e8861a3c..9fbe73c546df 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -61,9 +61,19 @@ static short xsave_cpuid_features[] __initdata = {
> >   */
> >  u64 xfeatures_mask_user __read_mostly;
> >  
> > +/*
> > + * Supported XSAVES system states.
> > + */
> > +static u64 xfeatures_mask_system __read_mostly;
> > +
> > +/*
> > + * Combined XSAVES system and user states.
> > + */
> > +u64 xfeatures_mask_all __read_mostly;
> > +
> >  static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX -
> > 1] = -1};
> >  static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX -
> > 1] = -1};
> > -static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask_user)*8];
> > +static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask_all)*8];
> 
>   [sizeof(...) * 8]
> 
> >  
> >  /*
> >   * The XSAVE area of kernel can be in standard or compacted format;
> > @@ -79,7 +89,7 @@ unsigned int fpu_user_xstate_size;
> >   */
> >  int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name)
> >  {
> > -	u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_user;
> > +	u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_all;
> >  
> >  	if (unlikely(feature_name)) {
> >  		long xfeature_idx, max_idx;
> > @@ -158,7 +168,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
> >  	 * None of the feature bits are in init state. So nothing else
> >  	 * to do for us, as the memory layout is up to date.
> >  	 */
> > -	if ((xfeatures & xfeatures_mask_user) == xfeatures_mask_user)
> > +	if ((xfeatures & xfeatures_mask_all) == xfeatures_mask_all)
> >  		return;
> >  
> >  	/*
> > @@ -213,28 +223,27 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
> >   */
> >  void fpu__init_cpu_xstate(void)
> >  {
> > -	if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_user)
> > +	if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_all)
> >  		return;
> >  	/*
> >  	 * XCR_XFEATURE_ENABLED_MASK sets the features that are managed
> >  	 * by XSAVE{C, OPT} and XRSTOR.  Only XSAVE user states can be
> >  	 * set here.
> >  	 */
> > -
> > -	xfeatures_mask_user &= ~XFEATURE_MASK_SUPERVISOR;
> > -
> >  	cr4_set_bits(X86_CR4_OSXSAVE);
> >  	xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user);
> > +
> > +	/*
> > +	 * MSR_IA32_XSS controls which system (not user) states are
> 
> We know that system state is not including user state. Please stop
> documenting the obvious.
> 
> > +	 * to be managed by XSAVES.
> > +	 */
> > +	if (boot_cpu_has(X86_FEATURE_XSAVES))
> > +		wrmsrl(MSR_IA32_XSS, xfeatures_mask_system);
> >  }
> >  
> > -/*
> > - * Note that in the future we will likely need a pair of
> > - * functions here: one for user xstates and the other for
> > - * system xstates.  For now, they are the same.
> > - */
> >  static int xfeature_enabled(enum xfeature xfeature)
> >  {
> > -	return !!(xfeatures_mask_user & BIT_ULL(xfeature));
> > +	return !!(xfeatures_mask_all & BIT_ULL(xfeature));
> >  }
> >  
> >  /*
> > @@ -340,7 +349,7 @@ static int xfeature_is_aligned(int xfeature_nr)
> >   */
> >  static void __init setup_xstate_comp(void)
> >  {
> > -	unsigned int xstate_comp_sizes[sizeof(xfeatures_mask_user)*8];
> > +	unsigned int xstate_comp_sizes[sizeof(xfeatures_mask_all)*8];
> 
> See above
> 
> >  	int i;
> >  
> >  	/*
> > @@ -413,7 +422,7 @@ static void __init setup_init_fpu_buf(void)
> >  	print_xstate_features();
> >  
> >  	if (boot_cpu_has(X86_FEATURE_XSAVES))
> > -		init_fpstate.xsave.header.xcomp_bv = BIT_ULL(63) |
> > xfeatures_mask_user;
> > +		init_fpstate.xsave.header.xcomp_bv = BIT_ULL(63) |
> > xfeatures_mask_all;
> >  
> >  	/*
> >  	 * Init all the features state with header.xfeatures being 0x0
> > @@ -436,7 +445,7 @@ static int xfeature_uncompacted_offset(int xfeature_nr)
> >  	 * format. Checking a system state's uncompacted offset is
> >  	 * an error.
> >  	 */
> > -	if (XFEATURE_MASK_SUPERVISOR & BIT_ULL(xfeature_nr)) {
> > +	if (~xfeatures_mask_user & BIT_ULL(xfeature_nr)) {
> 
> Sigh. Why can't this use xfeatures_mask_system? That would be too obvious.
> 
> >  		WARN_ONCE(1, "No fixed offset for xstate %d\n",
> > xfeature_nr);
> >  		return -1;
> >  	}
> > @@ -608,15 +617,12 @@ static void do_extra_xstate_size_checks(void)
> >  
> >  
> >  /*
> > - * Get total size of enabled xstates in XCR0/xfeatures_mask_user.
> > + * Get total size of enabled xstates in XCR0 | IA32_XSS.
> >   *
> >   * Note the SDM's wording here.  "sub-function 0" only enumerates
> >   * the size of the *user* states.  If we use it to size a buffer
> >   * that we use 'XSAVES' on, we could potentially overflow the
> >   * buffer because 'XSAVES' saves system states too.
> > - *
> > - * Note that we do not currently set any bits on IA32_XSS so
> > - * 'XCR0 | IA32_XSS == XCR0' for now.
> >   */
> >  static unsigned int __init get_xsaves_size(void)
> >  {
> > @@ -698,6 +704,7 @@ static int __init init_xstate_size(void)
> >   */
> >  static void fpu__init_disable_system_xstate(void)
> >  {
> > +	xfeatures_mask_all = 0;
> >  	xfeatures_mask_user = 0;
> >  	cr4_clear_bits(X86_CR4_OSXSAVE);
> >  	setup_clear_cpu_cap(X86_FEATURE_XSAVE);
> > @@ -733,10 +740,23 @@ void __init fpu__init_system_xstate(void)
> >  		return;
> >  	}
> >  
> > +	/*
> > +	 * Find user states supported by the processor.
> > +	 * Only these bits can be set in XCR0.
> > +	 */
> >  	cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
> >  	xfeatures_mask_user = eax + ((u64)edx << 32);
> >  
> > -	if ((xfeatures_mask_user & XFEATURE_MASK_FPSSE) !=
> > XFEATURE_MASK_FPSSE) {
> > +	/*
> > +	 * Find system states supported by the processor.
> > +	 * Only these bits can be set in IA32_XSS MSR.
> > +	 */
> > +	cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
> > +	xfeatures_mask_system = ecx + ((u64)edx << 32);
> > +
> > +	xfeatures_mask_all = xfeatures_mask_user | xfeatures_mask_system;
> > +
> > +	if ((xfeatures_mask_all & XFEATURE_MASK_FPSSE) !=
> > XFEATURE_MASK_FPSSE) {
> 
> xfeatures_mask_all is wrong here. FPSSE is clearly user state.
> 
> >  		/*
> >  		 * This indicates that something really unexpected happened
> >  		 * with the enumeration.  Disable XSAVE and try to continue
> > @@ -751,10 +771,12 @@ void __init fpu__init_system_xstate(void)
> >  	 */
> >  	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
> >  		if (!boot_cpu_has(xsave_cpuid_features[i]))
> > -			xfeatures_mask_user &= ~BIT_ULL(i);
> > +			xfeatures_mask_all &= ~BIT_ULL(i);
> >  	}
> >  
> > -	xfeatures_mask_user &= fpu__get_supported_xfeatures_mask();
> > +	xfeatures_mask_all &= SUPPORTED_XFEATURES_MASK;
> > +	xfeatures_mask_user &= xfeatures_mask_all;
> > +	xfeatures_mask_system &= xfeatures_mask_all;
> >  
> >  	/* Enable xstate instructions to be able to continue with
> > initialization: */
> >  	fpu__init_cpu_xstate();
> > @@ -766,7 +788,7 @@ void __init fpu__init_system_xstate(void)
> >  	 * Update info used for ptrace frames; use standard-format size and
> > no
> >  	 * system xstates:
> >  	 */
> > -	update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_user
> > & ~XFEATURE_MASK_SUPERVISOR);
> > +	update_regset_xstate_info(fpu_user_xstate_size,
> > xfeatures_mask_user);
> > 
> 
> And exactly this hunk shows that the whole refactoring approach is wrong
> from the very beginning. I stared at that in the previous patch already and
> had the feeling that it's bogus.
> 
> Just doing a s/xfeatures_mask/xfeatures_mask_user/g really does not make
> any sense. Simply because the current code assumes that xfeatures_mask ==
> xfeatures_mask_all. So if a global rename is the right approach then
> s/xfeatures_mask/xfeatures_mask_all/ and not that completely backwards
> rename to _user.
> 
> That refactoring wants to be done in the following steps:
> 
>    1) Introduce xfeatures_mask_user and initialize it with
> 
>        xfeatures_mask_user = xfeatures_mask ^ ~XFEATURE_MASK_SUPERVISOR;
> 
>    2) Fix up the usage sites in reviewable chunks. It does not matter
>       whether that could be folded into a larger all in one patch. What
>       matters is that it makes sense and is reviewable.
> 
>    3) Change the signature of copy_init_fpstate_to_fpregs() so it takes a
>       mask and fix up the call sites accordingly. Without the bogus comment
>       of course.
> 
>    4) Introduce xfeatures_mask_system and eventually needed helper functions.
> 
>    5) Change the affected usage sites
> 
> Details may be slightly different but you get the idea.

I will work on it.  Thanks!

Yu-cheng

^ permalink raw reply

* Re: [PATCHv6 23/36] x86/vdso: Allocate timens vdso
From: Thomas Gleixner @ 2019-08-16 20:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
	Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
	Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <b719199a-ed91-610b-38bc-015a0749f600@kernel.org>

On Fri, 16 Aug 2019, Andy Lutomirski wrote:
> On 8/15/19 9:38 AM, Dmitry Safonov wrote:
> > As it has been discussed on timens RFC, adding a new conditional branch
> > `if (inside_time_ns)` on VDSO for all processes is undesirable.
> > It will add a penalty for everybody as branch predictor may mispredict
> > the jump. Also there are instruction cache lines wasted on cmp/jmp.
> > 
> > Those effects of introducing time namespace are very much unwanted
> > having in mind how much work have been spent on micro-optimisation
> > vdso code.
> > 
> > The propose is to allocate a second vdso code with dynamically
> > patched out (disabled by static_branch) timens code on boot time.
> > 
> > Allocate another vdso and copy original code.
> 
> 
> I'm unconvinced that any of this magic is wise.  I think you should make a
> special timens vvar page that causes the normal fastpath to fail (using a
> special vclock mode, a special seq value, or a special "last" value) and then
> make the failure path detect that timens is in use and use the timens path.

My initial suggestion still stands. Do that at compile time. It really does
not matter whether we have another 2 or 3 variants of vdso binaries.

Use it and be done with it. No special magic, just straight forward
decisions to use a timens capable VDSO or not.

Thanks,

	tglx

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox