Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH RESEND v11 7/8] open: openat2(2) syscall
From: Daniel Colascione @ 2019-08-24 20:17 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Christian Brauner,
	Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, containers
In-Reply-To: <20190820033406.29796-8-cyphar@cyphar.com>

On Mon, Aug 19, 2019 at 8:37 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> The most obvious syscall to add support for the new LOOKUP_* scoping
> flags would be openat(2). However, there are a few reasons why this is
> not the best course of action:
>
>  * The new LOOKUP_* flags are intended to be security features, and
>    openat(2) will silently ignore all unknown flags. This means that
>    users would need to avoid foot-gunning themselves constantly when
>    using this interface if it were part of openat(2). This can be fixed
>    by having userspace libraries handle this for users[1], but should be
>    avoided if possible.
>
>  * Resolution scoping feels like a different operation to the existing
>    O_* flags. And since openat(2) has limited flag space, it seems to be
>    quite wasteful to clutter it with 5 flags that are all
>    resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
>    its entire purpose is to error out if you encounter a trailing
>    symlink -- not to scope resolution.
>
>  * Other systems would be able to reimplement this syscall allowing for
>    cross-OS standardisation rather than being hidden amongst O_* flags
>    which may result in it not being used by all the parties that might
>    want to use it (file servers, web servers, container runtimes, etc).
>
>  * It gives us the opportunity to iterate on the O_PATH interface. In
>    particular, the new @how->upgrade_mask field for fd re-opening is
>    only possible because we have a clean slate without needing to re-use
>    the ACC_MODE flag design nor the existing openat(2) @mode semantics.
>
> To this end, we introduce the openat2(2) syscall. It provides all of the
> features of openat(2) through the @how->flags argument, but also
> also provides a new @how->resolve argument which exposes RESOLVE_* flags
> that map to our new LOOKUP_* flags. It also eliminates the long-standing
> ugliness of variadic-open(2) by embedding it in a struct.
>
> In order to allow for userspace to lock down their usage of file
> descriptor re-opening, openat2(2) has the ability for users to disallow
> certain re-opening modes through @how->upgrade_mask. At the moment,
> there is no UPGRADE_NOEXEC. The open_how struct is padded to 64 bytes
> for future extensions (all of the reserved bits must be zeroed).

Why pad the structure when new functionality (perhaps accommodated via
a larger structure) could be signaled by passing a new flag? Adding
reserved fields to a structure with a size embedded in the ABI makes a
lot of sense --- e.g., pthread_mutex_t can't grow. But this structure
can grow, so the reservation seems needless to me.

^ permalink raw reply

* Re: RFC: very rough draft of a bpf permission model
From: Andy Lutomirski @ 2019-08-23 23:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Borkmann, Song Liu, Kees Cook, Networking,
	bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn,
	Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <20190822232620.p5tql4rrlzlk35z7@ast-mbp.dhcp.thefacebook.com>

On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> You're proposing all of the above in addition to CAP_BPF, right?
> Otherwise I don't see how it addresses the use cases I kept
> explaining for the last few weeks.

None of my proposal is intended to exclude changes like CAP_BPF to
make privileged bpf() operations need less privilege.  But I think
it's very hard to evaluate CAP_BPF without both a full description of
exactly what CAP_BPF would do and what at least one full example of a
user would look like.

I also think that users who want CAP_BPF should look at manipulating
their effective capability set instead.  A daemon that wants to use
bpf() but otherwise minimize the chance of accidentally causing a
problem can use capset() to clear its effective and inheritable masks.
Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN
or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its
effective set again.  This works in current kernels and is generally
good practice.

Aside from this, and depending on exactly what CAP_BPF would be, I
have some further concerns.  Looking at your example in this email:

> Here is another example of use case that CAP_BPF is solving:
> The daemon X is started by pid=1 and currently runs as root.
> It loads a bunch of tracing progs and attaches them to kprobes
> and tracepoints. It also loads cgroup-bpf progs and attaches them
> to cgroups. All progs are collecting data about the system and
> logging it for further analysis.

This needs more than just bpf().  Creating a perf kprobe event
requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't
attach a bpf program.  And the privilege to attach bpf programs to
cgroups without any DAC or MAC checks (which is what the current API
does) is an extremely broad privilege that is not that much weaker
than CAP_SYS_ADMIN or CAP_NET_ADMIN.  Also:

> This tracing bpf is looking into kernel memory
> and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
> The system is not going to crash because of BPF,
> but it can easily crash because of simple coding bugs in the user
> space bits of that daemon.

The BPF verifier and interpreter, taken in isolation, may be extremely
safe, but attaching BPF programs to various hooks can easily take down
the system, deliberately or by accident.  A handler, especially if it
can access user memory or otherwise fault, will explode if attached to
an inappropriate kprobe, hw_breakpoint, or function entry trace event.
(I and the other maintainers consider this to be a bug if it happens,
and we'll fix it, but these bugs definitely exist.)  A cgroup-bpf hook
that blocks all network traffic will effectively kill a machine,
especially if it's a server.  A bpf program that runs excessively
slowly attached to a high-frequency hook will kill the system, too.
(I bet a buggy bpf program that calls bpf_probe_read() on an unmapped
address repeatedly could be make extremely slow.  Page faults take
thousands to tens of thousands of cycles.)  A bpf firewall rule that's
wrong can cut a machine off from the network -- I've killed machines
using iptables more than once, and bpf isn't magically safer.

Something finer-grained can mitigate some of this.  CAP_BPF as I think
you're imagining it will not.

I'm wondering if something like CAP_TRACING would make sense.
CAP_TRACING would allow operations that can reveal kernel memory and
other secret kernel state but that do not, by design, allow modifying
system behavior.  So, for example, CAP_TRACING would allow privileged
perf_event_open() operations and privileged bpf verifier usage.  But
it would not allow cgroup-bpf unless further restrictions were added,
and it would not allow the *_BY_ID operations, as those can modify
other users' bpf programs' behavior.

(To get CAP_TRACING to work with cgroup-bpf, there could be a flag to
attach a "tracing" bpf program to a cgroup.  This program would run in
addition to normal or MULTI programs, but it would not be allowed to
return a rejection result.)

^ permalink raw reply

* Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW
From: Peter Zijlstra @ 2019-08-23 14:02 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <20190813205225.12032-12-yu-cheng.yu@intel.com>

On Tue, Aug 13, 2019 at 01:52:09PM -0700, Yu-cheng Yu wrote:

> +static inline pte_t pte_move_flags(pte_t pte, pteval_t from, pteval_t to)
> +{
> +	if (pte_flags(pte) & from)
> +		pte = pte_set_flags(pte_clear_flags(pte, from), to);
> +	return pte;
> +}

Aside of the whole conditional thing (I agree it would be better to have
this unconditionally); the function doesn't really do as advertised.

That is, if @from is clear, it doesn't endeavour to make sure @to is
also clear.

Now it might be sufficient, but in that case it really needs a comment
and or different name.

An implementation that actually moves the bit is something like:

	pteval_t a,b;

	a = native_pte_value(pte);
	b = (a >> from_bit) & 1;
	a &= ~((1ULL << from_bit) | (1ULL << to_bit));
	a |= b << to_bit;
	return make_native_pte(a);

^ permalink raw reply

* Re: RFC: very rough draft of a bpf permission model
From: Alexei Starovoitov @ 2019-08-22 23:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Chenbo Feng
In-Reply-To: <CALCETrWU4xJh4UBg0BboCwdGrgj+dUShsH5ETpiRgEpXJTEfQA@mail.gmail.com>

On Thu, Aug 22, 2019 at 08:17:54AM -0700, Andy Lutomirski wrote:
> BPF security strawman, v0.1
> 
> This is very rough.  Most of this, especially the API details, needs
> work before it's ready to implement.  The whole concept also needs
> review.
> 
> = Goals =
> 
> The overall goal is to make it possible to use eBPF without having
> what is effectively administrator access.  For example, an eBPF user
> should not be able to directly tamper with other processes (unless
> this permission is explicitly granted) and should not be able to
> read or write other users' eBPF maps.
> 
> It should be possible to use eBPF inside a user namespace without breaking
> the userns security model.
> 
> Due to the risk of speculation attacks and such being carried out via
> eBPF, it should not become possible to use too much of eBPF without the
> administrator's permission.  (NB: it is already possible to use
> *classic* BPF without any permission, and classic BPF is translated
> internally to eBPF, so this goal can only be met to a limited extent.)

agree with the goals.

> = Definitions =
> 
> Global capability: A capability bit in the caller's effective mask, so
> long as the caller is in the root user namespace.  Tasks in non-root
> user namespaces never have global capabilibies.  This is what capable()
> checks.
> 
> Namespace capability: A capability over a specific user namespace.
> Tasks in a user namespace have all the capabilities in their effective
> mask over their user namespace.  A namespace capability generally
> indicates that the capability applies to the user namespace itself and
> to all non-user namespaces that live in the user namespace.  For
> example, CAP_NET_ADMIN means that you can configure all networks
> namespaces in the current user namespace.  This is what ns_capable()
> checks.

definitions make sense too.

> Anything that requires a global capability will not work in a non-root
> user namespace.
> 
> = unprivileged_bpf_disabled =
> 
> Nothing in here supercedes unprivileged_bpf_disabled.  If
> unprivileged_bpf_disabled = 1, then these proposals should not allow anything
> that is disallowed today.  The idea is to make unprivileged_bpf_disabled=0
> both safer and more useful.

... a bunch of new features skipped for brevity...

You're proposing all of the above in addition to CAP_BPF, right?
Otherwise I don't see how it addresses the use cases I kept
explaining for the last few weeks.

I don't mind additional features if people who propose them
actively help to maintain that new code and address inevitable
side channel issues in the new code.
But first things first.

Here is another example of use case that CAP_BPF is solving:
The daemon X is started by pid=1 and currently runs as root.
It loads a bunch of tracing progs and attaches them to kprobes
and tracepoints. It also loads cgroup-bpf progs and attaches them
to cgroups. All progs are collecting data about the system and
logging it for further analysis.
There can be different bugs (not security bugs) in the daemon.
Simple coding bugs, but due to processing running as root they
may make the system inoperable. There is a strong desire to
drop privileges for this daemon. Let it do all BPF things the
way it does today and drop root, since other operations do not
require root.
Essentially a bunch of daemons run as root only because
they need bpf. This tracing bpf is looking into kernel memory
and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
The system is not going to crash because of BPF,
but it can easily crash because of simple coding bugs in the user
space bits of that daemon.

Flagging functions is not going to help this case.
bpf_probe_read is necessary.
pointer-to-integer-conversions is also necessary.
bypass hardening features is also necessary for speed,
since this data collection is 24/7.
cgroup.subtree_control idea can help some of it, but not all.

I still think that CAP_BPF is the best way to split this root privilege
universe into smaller 'bpf piece'. Just like CAP_NET_ADMIN splits
all of root into networking specific privileges.

Potentially we can go sysctl_perf_event_paranoid approach, but
it's less flexible, since it's single sysctl for the whole system.

Loading progs via FD instead of memory is something that android folks
proposed some time ago. The need is real. Whether it's going to be
loading via FD or some other form of signing the program is TBD.
imo this is orthogonal.

I hope I answered all points of your proposal.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-22 22:48 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Chenbo Feng
In-Reply-To: <98fee747-795a-ff10-fa98-10ddb5afcc03@iogearbox.net>

On Thu, Aug 22, 2019 at 04:17:43PM +0200, Daniel Borkmann wrote:
> 
> > > Hence unprivileged bpf is actually something that can be deprecated.
> 
> There is actually a publicly known use-case on unprivileged bpf wrt
> socket filters, see the SO_ATTACH_BPF on sockets section as an example:
> 
>   https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/
> 
> If I'd have to take a good guess, I'd think it's major use-case is in
> SO_ATTACH_REUSEPORT_EBPF in the wild, I don't think the sysctl can be
> outright flipped or deprecated w/o breaking existing applications unless
> it's cleanly modeled into some sort of customizable CAP_BPF* type policy
> (more below) where this would be the lowest common denominator.

The cloudflare use case is the perfect example that a lot of program types
are used together.
Do people use SO_ATTACH_BPF ? Of course.
All program types are used by somebody. Before accepting them we had long
conversations with authors to understand that the use cases are real.
Some progs are probably used less than others by now.
Like cls_bpf without exts_integrated is probably not used at all.
We still have to support it, of course.
That cloudflare example demonstrates that kernel.unprivileged_bpf_disabled=1
is a reality. Companies that care about security already switched it on.
Different bits in cloudflare setup need different level of capabilities.
Some (like SO_ATTACH_BPF) need unpriv. Another need CAP_NET_ADMIN.
But common demoninator for them all is still CAP_SYS_ADMIN.
And that's why the system as a whole is not as safe as it could have
been with CAP_BPF. The system needs root in many places.
Folks going out of the way to reduce that SYS_ADMIN to something less.
The example with systemd and NET_ADMIN is just one of them.

^ permalink raw reply

* RFC: very rough draft of a bpf permission model
From: Andy Lutomirski @ 2019-08-22 15:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <CALCETrUWQbPK3Pc6P5i_UqHPXJmZVyvuYXfq+VRtD6A3emaRhw@mail.gmail.com>

BPF security strawman, v0.1

This is very rough.  Most of this, especially the API details, needs
work before it's ready to implement.  The whole concept also needs
review.

= Goals =

The overall goal is to make it possible to use eBPF without having
what is effectively administrator access.  For example, an eBPF user
should not be able to directly tamper with other processes (unless
this permission is explicitly granted) and should not be able to
read or write other users' eBPF maps.

It should be possible to use eBPF inside a user namespace without breaking
the userns security model.

Due to the risk of speculation attacks and such being carried out via
eBPF, it should not become possible to use too much of eBPF without the
administrator's permission.  (NB: it is already possible to use
*classic* BPF without any permission, and classic BPF is translated
internally to eBPF, so this goal can only be met to a limited extent.)

= Definitions =

Global capability: A capability bit in the caller's effective mask, so
long as the caller is in the root user namespace.  Tasks in non-root
user namespaces never have global capabilibies.  This is what capable()
checks.

Namespace capability: A capability over a specific user namespace.
Tasks in a user namespace have all the capabilities in their effective
mask over their user namespace.  A namespace capability generally
indicates that the capability applies to the user namespace itself and
to all non-user namespaces that live in the user namespace.  For
example, CAP_NET_ADMIN means that you can configure all networks
namespaces in the current user namespace.  This is what ns_capable()
checks.

Anything that requires a global capability will not work in a non-root
user namespace.

= unprivileged_bpf_disabled =

Nothing in here supercedes unprivileged_bpf_disabled.  If
unprivileged_bpf_disabled = 1, then these proposals should not allow anything
that is disallowed today.  The idea is to make unprivileged_bpf_disabled=0
both safer and more useful.

= Test runs =

Global CAP_SYS_ADMIN is needed to test-run a program.  Test-running a program
exposes its own attack surface.  It's also the only way to run a program at
all if you merely have permission to load the program but not to attach it
anywhere.  Some of the proposed changes below will make it possible to load
most program types without

= Access to programs and maps =

There are two basic security concerns when accessing programs and maps:
the attack surface against the kernel and the ability to access other
people's maps.

Unprivileged processes may read a map if they have an FMODE_READ descriptor
for the map.  Unprivileged processes may write a map if they have an
FMODE_WRITE descriptor to the map.  Unprivileged processes may open a
persistent map with a mode consistent with the permissions in bpffs.

Unprivileged processes may create a bpffs inode for an existing map
if the have an RW file descriptor for the map.  (This is a change to
current behavior.  Daniel, Alexei thought the current behavior was
intentional.  Do you recall whether this is the case?)

The _BY_ID map APIs inherently have no concept of ownership of maps.  These
APIs will continue to require global CAP_SYS_ADMIN.

The small number of things that currently require the _BY_ID APIs, e.g.,
reading maps of maps, can be addressed if needed with new APIs that
return fds instead of ids.  Otherwise using them will continue to require
global capabilities.

Unprivileged processes may create exactly the set of maps that they can
create today.  Future proposals may extend this by a variety of means;
this current proposal makes no changes.

= Program loading =

Loading a program carries the following risks:

 - It exposes the attack surface in the program verifier itself.  That is
   possible, although unlikely, that merely verifying a malicious program
   could crash or otherwise cause a kernel malfunction.

 - It exposes the attack surface of insufficient checks in the verifier.
   That is, a verifier bug could allow a malicious program that is dangerous
   when run.

 - It exposes all of the functions that the program type can call.
   Some functions, e.g. bpf_probe_read(), should require privilege to call.

 - It exposes resource attacks.  Currently, privileged users can load programs
   that use more resources than unprivileged users can load.

 - It exposes pointer-to-integer conversions.  This requires global
   capabilities.

 - The program could contain speculation attack gadgets.

 - Loading a program is a prerequisite to attaching the program.

I propose the following:

Flag functions that require privilege as such.  Loading a program that calls
such a function will require a global capability.  The privileged functions are
mainly used for tracing, I think, and kernel tracing should require global
capabilities.

Loading a program that uses privileged verifier features (function calls or
pointer-to-integer-conversions) will continue to require privileges.

Loading a function that uses excessive resources can continue to require
global capabilities or it could use a new set of cgroup settings that
adjust the bpf complexity limits.

Loading a function that bypasses the various speculation attack hardening
features (e.g. constant blinding) requires global capabilities.

Other than this, bpf program types can have a new flag set to allow
them to be loaded without any privileges.  Some bpf program types
may need additional care, e.g. perf bpf events.  They can be attached
without privilege even in current kernels, and this might need to change.

(optional) Add an API to load a program where the program source comes from a
file specified by id instead of in memory.  This would allow LSMs to require
that bpf() programs be appropriately labeled.  If LSMs require use of this
API, it will make it much harder to exploit the verifier or speculation bugs.

As a possible future extension, a way to selectively grant the ability to
use specific program types without privilege could be useful.  This
could be done with a cgroup option, for example.

= Cgroup attach =

Cgroups have their own hierarchy that does not necessarily follow the
namespace hierarchy.  Unless cgroups integrate with namespaces in ways
that they currently don't, namespace capabilites cannot be used to grant
permission to operate on cgroups.

I propose that attaching and detaching bpf programs to cgroups use a
permission model similar to the model for changing non-bpf cgroup
settings.  In particular, each bpf_attach_type will get a new file in a
cgroup directory.  So there will be
/sys/fs/cgroup/cgroup_name/bpf.inet_ingress, bpf.inet_egress, etc.

A new API will be added to bpf() to attach and detach programs.  The new
API will take an fd to the bpf.attach_type file instead of to the cgroup
directory.  It will require FMODE_WRITE.  This API will *not* require
any capability.

To prevent anyone with a delegated cgroup from automatically being
able to use all bpf program types, the new bpf.attach_type files
will be opt-in as part of the hierarchy.  This could be done by writing
"+bpf.*" or "+bpf.inet_ingress" to cgroup.subtree_control to make
all the bpf.attach_type files or just bpf.inet_ingress available
in descendents of the cgroup in question.  This could alternatively
be a new bpf.subtree_control file if that seems better.

The result of these changes will be that root can use the old
attach API or the new attach API.  Unprivileged programs cannot
use the old attach API.  Unprivileged programs can use the new
attach API if they are explicitly granted permission by all their
ancestor cgroup managers.


= Additional mitigations =

Optional: there may be cases where a user can load a bpf program
but can't attach or otherwise execute it.  Nonetheless, it's plausible
that such a program could be speculatively executed.  The kernel could
mitigate this by only marking a JITted bpf program executable when it
is first attached or test-run.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-22 15:16 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Lutomirski, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <98fee747-795a-ff10-fa98-10ddb5afcc03@iogearbox.net>

On Thu, Aug 22, 2019 at 7:17 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/7/19 7:24 AM, Andy Lutomirski wrote:
> > On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> >>> It tries to make the kernel respect the access modes for fds.  Without
> >>> this patch, there seem to be some holes: nothing looked at program fds
> >>> and, unless I missed something, you could take a readonly fd for a
> >>> program, pin the program, and reopen it RW.
> >>
> >> I think it's by design. iirc Daniel had a use case for something like this.
> >
> > That seems odd.  Daniel, can you elaborate?
>
> [ ... catching up late. ]
>
> Not from my side, the change was added by Chenbo back then for Android
> use-case to replace xt_qtaguid and xt_owner with BPF programs and to
> allow unprivileged applications to read maps. More on their architecture:
>
>    https://source.android.com/devices/tech/datausage/ebpf-traffic-monitor
>
>  From the cover-letter:
>
>    [...]
>    The network-control daemon (netd) creates and loads an eBPF object for
>    network packet filtering and analysis. It passes the object FD to an
>    unprivileged network monitor app (netmonitor), which is not allowed to
>    create, modify or load eBPF objects, but is allowed to read the traffic
>    stats from the map.
>    [...]

I suspect that this use case is, in fact, mostly broken in current
kernels.  An unprivileged process with a read-only fd to a bpf map can
BPF_OBJ_PIN the map and then re-open it read-write.  As far as I can
tell, the only thing mitigating this is that it won't work unless the
attacker has write access to some directory in bpffs.

> > Trusted by whom?  In a non-nested container, the container manager
> > *might* be trusted by the outside world.  In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted.  I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> [...] and if we opt-in with CAP_NET_ADMIN, for example, then it should
> ideally be possible for that container to install BPF programs for
> mangling, dropping, forwarding etc as long as it's only affecting it's
> /own/ netns like the rest of networking subsystem controls that work
> in that case. I would actually like to get to this at some point and
> make it more approachable as long as there is a way for an admin to
> /opt into it/ via policy (aka not by default).

For better or for worse, I think this would need a massive
re-architecting of the way bpf filtering works.  bpf filters attach to
cgroups, which aren't scoped to network namespaces at all.  So we need
a different permission model.

> Thinking out loud, I'd
> love some sort of a hybrid, that is, a mixture of CAP_BPF_ADMIN and
> customizable seccomp policy. Meaning, there could be several CAP_BPF
> type sub-policies e.g. from allowing everything (equivalent to the
> /dev/bpf on/off handle or CAP_SYS_ADMIN we have today) down to
> programmable user defined policy that can be tailored to specific
> needs like granting apps to BPF_OBJ_GET and BPF_MAP_LOOKUP elements
> or granting to load+mangle a specific subset of maps (e.g. BPF_MAP_TYPE_{ARRAY,
> HASH,LRU_HASH,LPM_TRIE}) and prog types (...) plus attaching them to
> their own netns, and if that is untrusted, then same restrictions/
> mitigations could be done by the verifier as with (current) unprivileged
> BPF, enabled via programmable policy as well. We wouldn't make any
> static/fixed assumptions, but allow users to define them based on their
> own use-cases. Haven't looked how feasible this would be, but something
> to take into consideration when we rework the current [admittedly
> suboptimal all-or-nothing] model we have. Is this something you had in
> mind as well for your wip proposal, Andy?
>

Hmm.  Fine-grained seccomp stuff like this is very much in scope for
the seccomp discussion that's happening at LPC this year.
Unfortunately, I'm not there, but I'm participating via the mailing
list.

I also finally finished typing a very rough draft of my bpf ideas.
I'll email them out momentarily in a separate email.  I think it
should come fairly close to doing what you want.

^ permalink raw reply

* Re: [PATCH v5 3/9] fpga: dfl: afu: convert platform_driver to use dev_groups
From: Moritz Fischer @ 2019-08-22 15:07 UTC (permalink / raw)
  To: Wu Hao; +Cc: gregkh, mdf, linux-fpga, linux-kernel, linux-api, linux-doc,
	atull
In-Reply-To: <1565578204-13969-4-git-send-email-hao.wu@intel.com>

Hi Hao,

On Mon, Aug 12, 2019 at 10:49:58AM +0800, Wu Hao wrote:
> This patch takes advantage of driver core which helps to create
> and remove sysfs attribute files, so there is no need to register
> sysfs entries manually in dfl-afu platform river code.
Same nit: s/river/driver
> 
> Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
>  drivers/fpga/dfl-afu-main.c | 69 +++++++++++++++++++++++----------------------
>  1 file changed, 36 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index e50c45e..e955149 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -282,24 +282,17 @@ static int port_get_id(struct platform_device *pdev)
>  	&dev_attr_power_state.attr,
>  	NULL,
>  };
> -ATTRIBUTE_GROUPS(port_hdr);
> +
> +static const struct attribute_group port_hdr_group = {
> +	.attrs = port_hdr_attrs,
> +};
>  
>  static int port_hdr_init(struct platform_device *pdev,
>  			 struct dfl_feature *feature)
>  {
> -	dev_dbg(&pdev->dev, "PORT HDR Init.\n");
> -
>  	port_reset(pdev);
>  
> -	return device_add_groups(&pdev->dev, port_hdr_groups);
> -}
> -
> -static void port_hdr_uinit(struct platform_device *pdev,
> -			   struct dfl_feature *feature)
> -{
> -	dev_dbg(&pdev->dev, "PORT HDR UInit.\n");
> -
> -	device_remove_groups(&pdev->dev, port_hdr_groups);
> +	return 0;
>  }
>  
>  static long
> @@ -330,7 +323,6 @@ static void port_hdr_uinit(struct platform_device *pdev,
>  
>  static const struct dfl_feature_ops port_hdr_ops = {
>  	.init = port_hdr_init,
> -	.uinit = port_hdr_uinit,
>  	.ioctl = port_hdr_ioctl,
>  };
>  
> @@ -361,32 +353,37 @@ static void port_hdr_uinit(struct platform_device *pdev,
>  	&dev_attr_afu_id.attr,
>  	NULL
>  };
> -ATTRIBUTE_GROUPS(port_afu);
>  
> -static int port_afu_init(struct platform_device *pdev,
> -			 struct dfl_feature *feature)
> +static umode_t port_afu_attrs_visible(struct kobject *kobj,
> +				      struct attribute *attr, int n)
>  {
> -	struct resource *res = &pdev->resource[feature->resource_index];
> -	int ret;
> -
> -	dev_dbg(&pdev->dev, "PORT AFU Init.\n");
> +	struct device *dev = kobj_to_dev(kobj);
>  
> -	ret = afu_mmio_region_add(dev_get_platdata(&pdev->dev),
> -				  DFL_PORT_REGION_INDEX_AFU, resource_size(res),
> -				  res->start, DFL_PORT_REGION_READ |
> -				  DFL_PORT_REGION_WRITE | DFL_PORT_REGION_MMAP);
> -	if (ret)
> -		return ret;
> +	/*
> +	 * sysfs entries are visible only if related private feature is
> +	 * enumerated.
> +	 */
> +	if (!dfl_get_feature_by_id(dev, PORT_FEATURE_ID_AFU))
> +		return 0;
>  
> -	return device_add_groups(&pdev->dev, port_afu_groups);
> +	return attr->mode;
>  }
>  
> -static void port_afu_uinit(struct platform_device *pdev,
> -			   struct dfl_feature *feature)
> +static const struct attribute_group port_afu_group = {
> +	.attrs      = port_afu_attrs,
> +	.is_visible = port_afu_attrs_visible,
> +};
> +
> +static int port_afu_init(struct platform_device *pdev,
> +			 struct dfl_feature *feature)
>  {
> -	dev_dbg(&pdev->dev, "PORT AFU UInit.\n");
Thanks.
> +	struct resource *res = &pdev->resource[feature->resource_index];
>  
> -	device_remove_groups(&pdev->dev, port_afu_groups);
> +	return afu_mmio_region_add(dev_get_platdata(&pdev->dev),
> +				   DFL_PORT_REGION_INDEX_AFU,
> +				   resource_size(res), res->start,
> +				   DFL_PORT_REGION_MMAP | DFL_PORT_REGION_READ |
> +				   DFL_PORT_REGION_WRITE);
>  }
>  
>  static const struct dfl_feature_id port_afu_id_table[] = {
> @@ -396,7 +393,6 @@ static void port_afu_uinit(struct platform_device *pdev,
>  
>  static const struct dfl_feature_ops port_afu_ops = {
>  	.init = port_afu_init,
> -	.uinit = port_afu_uinit,
>  };
>  
>  static struct dfl_feature_driver port_feature_drvs[] = {
> @@ -748,9 +744,16 @@ static int afu_remove(struct platform_device *pdev)
>  	return 0;
>  }
>  
> +static const struct attribute_group *afu_dev_groups[] = {
> +	&port_hdr_group,
> +	&port_afu_group,
> +	NULL
> +};
> +
>  static struct platform_driver afu_driver = {
>  	.driver	= {
> -		.name    = DFL_FPGA_FEATURE_DEV_PORT,
> +		.name	    = DFL_FPGA_FEATURE_DEV_PORT,
> +		.dev_groups = afu_dev_groups,
>  	},
>  	.probe   = afu_probe,
>  	.remove  = afu_remove,
> -- 
> 1.8.3.1
> 

Thanks,
Moritz

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Daniel Borkmann @ 2019-08-22 14:17 UTC (permalink / raw)
  To: Andy Lutomirski, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List, Chenbo Feng
In-Reply-To: <CALCETrXEHL3+NAY6P6vUj7Pvd9ZpZsYC6VCLXOaNxb90a_POGw@mail.gmail.com>

On 8/7/19 7:24 AM, Andy Lutomirski wrote:
> On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
>>> It tries to make the kernel respect the access modes for fds.  Without
>>> this patch, there seem to be some holes: nothing looked at program fds
>>> and, unless I missed something, you could take a readonly fd for a
>>> program, pin the program, and reopen it RW.
>>
>> I think it's by design. iirc Daniel had a use case for something like this.
> 
> That seems odd.  Daniel, can you elaborate?

[ ... catching up late. ]

Not from my side, the change was added by Chenbo back then for Android
use-case to replace xt_qtaguid and xt_owner with BPF programs and to
allow unprivileged applications to read maps. More on their architecture:

   https://source.android.com/devices/tech/datausage/ebpf-traffic-monitor

 From the cover-letter:

   [...]
   The network-control daemon (netd) creates and loads an eBPF object for
   network packet filtering and analysis. It passes the object FD to an
   unprivileged network monitor app (netmonitor), which is not allowed to
   create, modify or load eBPF objects, but is allowed to read the traffic
   stats from the map.
   [...]

Iuuc, netd would be in charge with the ability to r/w into maps and
pin them, but with the ability to to hand off some map fds as r/o to
unprivileged applications in order for them to query data.

>> Hence unprivileged bpf is actually something that can be deprecated.

There is actually a publicly known use-case on unprivileged bpf wrt
socket filters, see the SO_ATTACH_BPF on sockets section as an example:

   https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/

If I'd have to take a good guess, I'd think it's major use-case is in
SO_ATTACH_REUSEPORT_EBPF in the wild, I don't think the sysctl can be
outright flipped or deprecated w/o breaking existing applications unless
it's cleanly modeled into some sort of customizable CAP_BPF* type policy
(more below) where this would be the lowest common denominator.

> I hope not.  There are a couple setsockopt uses right now, and and
> seccomp will surely want it someday.  And the bpf-inside-container use
> case really is unprivileged bpf -- containers are, in many (most?)
> cases, explicitly not trusted by the host.
[...]
>> Inside containers and inside nested containers we need to start processes
>> that will use bpf. All of the processes are trusted.
> 
> Trusted by whom?  In a non-nested container, the container manager
> *might* be trusted by the outside world.  In a *nested* container,
> unless the inner container management is controlled from outside the
> outer container, it's not trusted.  I don't know much about how
> Facebook's containers work, but the LXC/LXD/Podman world is moving
> very strongly toward user namespaces and maximally-untrusted
> containers, and I think bpf() should work in that context.

[...] and if we opt-in with CAP_NET_ADMIN, for example, then it should
ideally be possible for that container to install BPF programs for
mangling, dropping, forwarding etc as long as it's only affecting it's
/own/ netns like the rest of networking subsystem controls that work
in that case. I would actually like to get to this at some point and
make it more approachable as long as there is a way for an admin to
/opt into it/ via policy (aka not by default). Thinking out loud, I'd
love some sort of a hybrid, that is, a mixture of CAP_BPF_ADMIN and
customizable seccomp policy. Meaning, there could be several CAP_BPF
type sub-policies e.g. from allowing everything (equivalent to the
/dev/bpf on/off handle or CAP_SYS_ADMIN we have today) down to
programmable user defined policy that can be tailored to specific
needs like granting apps to BPF_OBJ_GET and BPF_MAP_LOOKUP elements
or granting to load+mangle a specific subset of maps (e.g. BPF_MAP_TYPE_{ARRAY,
HASH,LRU_HASH,LPM_TRIE}) and prog types (...) plus attaching them to
their own netns, and if that is untrusted, then same restrictions/
mitigations could be done by the verifier as with (current) unprivileged
BPF, enabled via programmable policy as well. We wouldn't make any
static/fixed assumptions, but allow users to define them based on their
own use-cases. Haven't looked how feasible this would be, but something
to take into consideration when we rework the current [admittedly
suboptimal all-or-nothing] model we have. Is this something you had in
mind as well for your wip proposal, Andy?

Thanks,
Daniel

^ permalink raw reply

* [PATCH v14 6/6] sched/core: uclamp: always use enum uclamp_id for clamp_id values
From: Patrick Bellasi @ 2019-08-22 13:28 UTC (permalink / raw)
  To: linux-kernel, linux-pm, linux-api, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Michal Koutny,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini
In-Reply-To: <20190822132811.31294-1-patrick.bellasi@arm.com>

The supported clamp indexes are defined in enum clamp_id however, because
of the code logic in some of the first utilization clamping series version,
sometimes we needed to use unsigned int to represent indexes.

This is not more required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c  | 38 +++++++++++++++++++-------------------
 kernel/sched/sched.h |  2 +-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc2dc86a2abe..269c14ad4473 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -810,7 +810,7 @@ static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
 	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
 }
 
-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
 {
 	if (clamp_id == UCLAMP_MIN)
 		return 0;
@@ -826,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 }
 
 static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
 		  unsigned int clamp_value)
 {
 	/*
@@ -842,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
 	return uclamp_none(UCLAMP_MIN);
 }
 
-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
 				     unsigned int clamp_value)
 {
 	/* Reset max-clamp retention only on idle exit */
@@ -853,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
 }
 
 static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
-				 unsigned int clamp_value)
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
+				   unsigned int clamp_value)
 {
 	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
 	int bucket_id = UCLAMP_BUCKETS - 1;
@@ -874,7 +874,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
 }
 
 static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
 #ifdef CONFIG_UCLAMP_TASK_GROUP
@@ -906,7 +906,7 @@ uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
 	struct uclamp_se uc_max = uclamp_default[clamp_id];
@@ -918,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 	return uc_req;
 }
 
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_eff;
 
@@ -942,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
  * for each bucket when all its RUNNABLE tasks require the same clamp.
  */
 static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-				    unsigned int clamp_id)
+				    enum uclamp_id clamp_id)
 {
 	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -980,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
  * enforce the expected state and warn.
  */
 static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
-				    unsigned int clamp_id)
+				    enum uclamp_id clamp_id)
 {
 	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -1019,7 +1019,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
@@ -1034,7 +1034,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
@@ -1044,7 +1044,7 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 }
 
 static inline void
-uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+uclamp_update_active(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct rq_flags rf;
 	struct rq *rq;
@@ -1080,9 +1080,9 @@ static inline void
 uclamp_update_active_tasks(struct cgroup_subsys_state *css,
 			   unsigned int clamps)
 {
+	enum uclamp_id clamp_id;
 	struct css_task_iter it;
 	struct task_struct *p;
-	unsigned int clamp_id;
 
 	css_task_iter_start(css, 0, &it);
 	while ((p = css_task_iter_next(&it))) {
@@ -1190,7 +1190,7 @@ static int uclamp_validate(struct task_struct *p,
 static void __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr)
 {
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 
 	/*
 	 * On scheduling class change, reset to default clamps for tasks
@@ -1227,7 +1227,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
 
 static void uclamp_fork(struct task_struct *p)
 {
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 
 	for_each_clamp_id(clamp_id)
 		p->uclamp[clamp_id].active = false;
@@ -1249,7 +1249,7 @@ static void uclamp_fork(struct task_struct *p)
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 	int cpu;
 
 	mutex_init(&uclamp_mutex);
@@ -6924,7 +6924,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
 					    struct task_group *parent)
 {
 #ifdef CONFIG_UCLAMP_TASK_GROUP
-	int clamp_id;
+	enum uclamp_id clamp_id;
 
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&tg->uclamp_req[clamp_id],
@@ -7182,7 +7182,7 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 	struct uclamp_se *uc_parent = NULL;
 	struct uclamp_se *uc_se = NULL;
 	unsigned int eff[UCLAMP_CNT];
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 	unsigned int clamps;
 
 	css_for_each_descendant_pre(css, top_css) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5b343112a47b..00ff5b57e9cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2281,7 +2281,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */
 
 #ifdef CONFIG_UCLAMP_TASK
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id);
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
 
 static __always_inline
 unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
-- 
2.22.0

^ permalink raw reply related

* [PATCH v14 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes
From: Patrick Bellasi @ 2019-08-22 13:28 UTC (permalink / raw)
  To: linux-kernel, linux-pm, linux-api, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Michal Koutny,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini
In-Reply-To: <20190822132811.31294-1-patrick.bellasi@arm.com>

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04fc161e4dbe..fc2dc86a2abe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 		uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	/*
+	 * Lock the task and the rq where the task is (or was) queued.
+	 *
+	 * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+	 * price to pay to safely serialize util_{min,max} updates with
+	 * enqueues, dequeues and migration operations.
+	 * This is the same locking schema used by __set_cpus_allowed_ptr().
+	 */
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * Setting the clamp bucket is serialized by task_rq_lock().
+	 * If the task is not yet RUNNABLE and its task_struct is not
+	 * affecting a valid clamp bucket, the next time it's enqueued,
+	 * it will already see the updated clamp bucket value.
+	 */
+	if (!p->uclamp[clamp_id].active)
+		goto done;
+
+	uclamp_rq_dec_id(rq, p, clamp_id);
+	uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+	task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+			   unsigned int clamps)
+{
+	struct css_task_iter it;
+	struct task_struct *p;
+	unsigned int clamp_id;
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it))) {
+		for_each_clamp_id(clamp_id) {
+			if ((0x1 << clamp_id) & clamps)
+				uclamp_update_active(p, clamp_id);
+		}
+	}
+	css_task_iter_end(&it);
+}
+
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
@@ -7160,8 +7211,13 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
 			clamps |= (0x1 << clamp_id);
 		}
-		if (!clamps)
+		if (!clamps) {
 			css = css_rightmost_descendant(css);
+			continue;
+		}
+
+		/* Immediately update descendants RUNNABLE tasks */
+		uclamp_update_active_tasks(css, clamps);
 	}
 }
 
-- 
2.22.0

^ permalink raw reply related

* [PATCH v14 4/6] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
From: Patrick Bellasi @ 2019-08-22 13:28 UTC (permalink / raw)
  To: linux-kernel, linux-pm, linux-api, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Michal Koutny,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini
In-Reply-To: <20190822132811.31294-1-patrick.bellasi@arm.com>

When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. boosted not more than its TG effective protection and capped at
   least as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ca054ad3f3e..04fc161e4dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	struct uclamp_se uc_max;
+
+	/*
+	 * Tasks in autogroups or root task group will be
+	 * restricted by system defaults.
+	 */
+	if (task_group_is_autogroup(task_group(p)))
+		return uc_req;
+	if (task_group(p) == &root_task_group)
+		return uc_req;
+
+	uc_max = task_group(p)->uclamp[clamp_id];
+	if (uc_req.value > uc_max.value || !uc_req.user_defined)
+		return uc_max;
+#endif
+
+	return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 {
-	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
 	struct uclamp_se uc_max = uclamp_default[clamp_id];
 
 	/* System default restrictions always apply */
-- 
2.22.0

^ permalink raw reply related

* [PATCH v14 3/6] sched/core: uclamp: Propagate system defaults to root group
From: Patrick Bellasi @ 2019-08-22 13:28 UTC (permalink / raw)
  To: linux-kernel, linux-pm, linux-api, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Michal Koutny,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini
In-Reply-To: <20190822132811.31294-1-patrick.bellasi@arm.com>

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping supports already the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allows a parent cgroup to constraint (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update next time each
task will be enqueued.
Do that since we assume a more strict resource control is required when
cgroups are in use. This allows also to keep "effective" clamp values
updated in case we need to expose them to user-space.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8dab64247597..3ca054ad3f3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,10 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 		uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+	struct task_group *tg = &root_task_group;
+
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+		      sysctl_sched_uclamp_util_min, false);
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+		      sysctl_sched_uclamp_util_max, false);
+
+	rcu_read_lock();
+	cpu_util_update_eff(&root_task_group.css);
+	rcu_read_unlock();
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
 int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				void __user *buffer, size_t *lenp,
 				loff_t *ppos)
 {
+	bool update_root_tg = false;
 	int old_min, old_max;
 	int result;
 
@@ -1043,16 +1063,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	if (old_min != sysctl_sched_uclamp_util_min) {
 		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
 			      sysctl_sched_uclamp_util_min, false);
+		update_root_tg = true;
 	}
 	if (old_max != sysctl_sched_uclamp_util_max) {
 		uclamp_se_set(&uclamp_default[UCLAMP_MAX],
 			      sysctl_sched_uclamp_util_max, false);
+		update_root_tg = true;
 	}
 
+	if (update_root_tg)
+		uclamp_update_root_tg();
+
 	/*
-	 * Updating all the RUNNABLE task is expensive, keep it simple and do
-	 * just a lazy update at each next enqueue time.
+	 * We update all RUNNABLE tasks only when task groups are in use.
+	 * Otherwise, keep it simple and do just a lazy update at each next
+	 * task enqueue time.
 	 */
+
 	goto done;
 
 undo:
-- 
2.22.0

^ permalink raw reply related

* [PATCH v14 2/6] sched/core: uclamp: Propagate parent clamps
From: Patrick Bellasi @ 2019-08-22 13:28 UTC (permalink / raw)
  To: linux-kernel, linux-pm, linux-api, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Michal Koutny,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini
In-Reply-To: <20190822132811.31294-1-patrick.bellasi@arm.com>

In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation to ensure all user-space write
never fails while still always tracking the most restrictive values.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b610e1a4cda..8dab64247597 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1166,6 +1166,7 @@ static void __init init_uclamp(void)
 		uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 		root_task_group.uclamp_req[clamp_id] = uc_max;
+		root_task_group.uclamp[clamp_id] = uc_max;
 #endif
 	}
 }
@@ -6824,6 +6825,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&tg->uclamp_req[clamp_id],
 			      uclamp_none(clamp_id), false);
+		tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
 	}
 #endif
 }
@@ -7070,6 +7072,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+	struct cgroup_subsys_state *top_css = css;
+	struct uclamp_se *uc_parent = NULL;
+	struct uclamp_se *uc_se = NULL;
+	unsigned int eff[UCLAMP_CNT];
+	unsigned int clamp_id;
+	unsigned int clamps;
+
+	css_for_each_descendant_pre(css, top_css) {
+		uc_parent = css_tg(css)->parent
+			? css_tg(css)->parent->uclamp : NULL;
+
+		for_each_clamp_id(clamp_id) {
+			/* Assume effective clamps matches requested clamps */
+			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+			/* Cap effective clamps with parent's effective clamps */
+			if (uc_parent &&
+			    eff[clamp_id] > uc_parent[clamp_id].value) {
+				eff[clamp_id] = uc_parent[clamp_id].value;
+			}
+		}
+		/* Ensure protection is always capped by limit */
+		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+		/* Propagate most restrictive effective clamps */
+		clamps = 0x0;
+		uc_se = css_tg(css)->uclamp;
+		for_each_clamp_id(clamp_id) {
+			if (eff[clamp_id] == uc_se[clamp_id].value)
+				continue;
+			uc_se[clamp_id].value = eff[clamp_id];
+			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+			clamps |= (0x1 << clamp_id);
+		}
+		if (!clamps)
+			css = css_rightmost_descendant(css);
+	}
+}
 
 #define _POW10(exp) ((unsigned int)1e##exp)
 #define POW10(exp) _POW10(exp)
@@ -7133,6 +7174,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
 	 */
 	tg->uclamp_pct[clamp_id] = req.percent;
 
+	/* Update effective clamps to track the most restrictive value */
+	cpu_util_update_eff(of_css(of));
+
 	rcu_read_unlock();
 	mutex_unlock(&uclamp_mutex);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae1be61fb279..5b343112a47b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,8 @@ struct task_group {
 	unsigned int		uclamp_pct[UCLAMP_CNT];
 	/* Clamp values requested for a task group */
 	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+	/* Effective clamp values used for a task group */
+	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
 
 };
-- 
2.22.0

^ permalink raw reply related

* [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller
From: Patrick Bellasi @ 2019-08-22 13:28 UTC (permalink / raw)
  To: linux-kernel, linux-pm, linux-api, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Michal Koutny,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini
In-Reply-To: <20190822132811.31294-1-patrick.bellasi@arm.com>

The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better defined the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run at least at a
	      	 minimum frequency which corresponds to the uclamp.min
	      	 utilization

- uclamp.max: defines the maximum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run up to a
	      	 maximum frequency which corresponds to the uclamp.max
	      	 utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depends on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v14:
 Message-ID: <20190806161133.GA18532@blackbody.suse.cz>
 - move uclamp_mutex usage here from the following patch
---
 Documentation/admin-guide/cgroup-v2.rst |  34 +++++
 init/Kconfig                            |  22 +++
 kernel/sched/core.c                     | 188 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   8 +
 4 files changed, 248 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005aa981..5f1c266131b0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware that system management software may already
@@ -1016,6 +1023,33 @@ All time durations are in microseconds.
 	Shows pressure stall information for CPU. See
 	Documentation/accounting/psi.rst for details.
 
+  cpu.uclamp.min
+        A read-write single value file which exists on non-root cgroups.
+        The default is "0", i.e. no utilization boosting.
+
+        The requested minimum utilization (protection) as a percentage
+        rational number, e.g. 12.34 for 12.34%.
+
+        This interface allows reading and setting minimum utilization clamp
+        values similar to the sched_setattr(2). This minimum utilization
+        value is used to clamp the task specific minimum utilization clamp.
+
+        The requested minimum utilization (protection) is always capped by
+        the current value for the maximum utilization (limit), i.e.
+        `cpu.uclamp.max`.
+
+  cpu.uclamp.max
+        A read-write single value file which exists on non-root cgroups.
+        The default is "max". i.e. no utilization capping
+
+        The requested maximum utilization (limit) as a percentage rational
+        number, e.g. 98.76 for 98.76%.
+
+        This interface allows reading and setting maximum utilization clamp
+        values similar to the sched_setattr(2). This maximum utilization
+        value is used to clamp the task specific maximum utilization clamp.
+
+
 
 Memory
 ------
diff --git a/init/Kconfig b/init/Kconfig
index bd7d650d4a99..ac285cfa78b6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -928,6 +928,28 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config UCLAMP_TASK_GROUP
+	bool "Utilization clamping per group of tasks"
+	depends on CGROUP_SCHED
+	depends on UCLAMP_TASK
+	default n
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+	  When this option is enabled, the user can specify a min and max
+	  CPU bandwidth which is allowed for each single task in a group.
+	  The max bandwidth allows to clamp the maximum frequency a task
+	  can use, while the min bandwidth allows to define a minimum
+	  frequency a task will always use.
+
+	  When task group based utilization clamping is enabled, an eventually
+	  specified task-specific clamp value is constrained by the cgroup
+	  specified clamp value. Both minimum and maximum task clamping cannot
+	  be bigger than the corresponding clamping defined at task group level.
+
+	  If in doubt, say N.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6661852907b..7b610e1a4cda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -1010,10 +1022,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				loff_t *ppos)
 {
 	int old_min, old_max;
-	static DEFINE_MUTEX(mutex);
 	int result;
 
-	mutex_lock(&mutex);
+	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
 
@@ -1048,7 +1059,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
 done:
-	mutex_unlock(&mutex);
+	mutex_unlock(&uclamp_mutex);
 
 	return result;
 }
@@ -1137,6 +1148,8 @@ static void __init init_uclamp(void)
 	unsigned int clamp_id;
 	int cpu;
 
+	mutex_init(&uclamp_mutex);
+
 	for_each_possible_cpu(cpu) {
 		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
 		cpu_rq(cpu)->uclamp_flags = 0;
@@ -1149,8 +1162,12 @@ static void __init init_uclamp(void)
 
 	/* System defaults allow max clamp values for both indexes */
 	uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
-	for_each_clamp_id(clamp_id)
+	for_each_clamp_id(clamp_id) {
 		uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+		root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+	}
 }
 
 #else /* CONFIG_UCLAMP_TASK */
@@ -6798,6 +6815,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+					    struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	int clamp_id;
+
+	for_each_clamp_id(clamp_id) {
+		uclamp_se_set(&tg->uclamp_req[clamp_id],
+			      uclamp_none(clamp_id), false);
+	}
+#endif
+}
+
 static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
@@ -6821,6 +6851,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	alloc_uclamp_sched_group(tg, parent);
+
 	return tg;
 
 err:
@@ -7037,6 +7069,126 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+
+#define _POW10(exp) ((unsigned int)1e##exp)
+#define POW10(exp) _POW10(exp)
+
+struct uclamp_request {
+#define UCLAMP_PERCENT_SHIFT	2
+#define UCLAMP_PERCENT_SCALE	(100 * POW10(UCLAMP_PERCENT_SHIFT))
+	s64 percent;
+	u64 util;
+	int ret;
+};
+
+static inline struct uclamp_request
+capacity_from_percent(char *buf)
+{
+	struct uclamp_request req = {
+		.percent = UCLAMP_PERCENT_SCALE,
+		.util = SCHED_CAPACITY_SCALE,
+		.ret = 0,
+	};
+
+	buf = strim(buf);
+	if (strncmp("max", buf, 4)) {
+		req.ret = cgroup_parse_float(buf, UCLAMP_PERCENT_SHIFT,
+					     &req.percent);
+		if (req.ret)
+			return req;
+		if (req.percent > UCLAMP_PERCENT_SCALE) {
+			req.ret = -ERANGE;
+			return req;
+		}
+
+		req.util = req.percent << SCHED_CAPACITY_SHIFT;
+		req.util = DIV_ROUND_CLOSEST_ULL(req.util, UCLAMP_PERCENT_SCALE);
+	}
+
+	return req;
+}
+
+static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
+				size_t nbytes, loff_t off,
+				enum uclamp_id clamp_id)
+{
+	struct uclamp_request req;
+	struct task_group *tg;
+
+	req = capacity_from_percent(buf);
+	if (req.ret)
+		return req.ret;
+
+	mutex_lock(&uclamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(of_css(of));
+	if (tg->uclamp_req[clamp_id].value != req.util)
+		uclamp_se_set(&tg->uclamp_req[clamp_id], req.util, false);
+
+	/*
+	 * Because of not recoverable conversion rounding we keep track of the
+	 * exact requested value
+	 */
+	tg->uclamp_pct[clamp_id] = req.percent;
+
+	rcu_read_unlock();
+	mutex_unlock(&uclamp_mutex);
+
+	return nbytes;
+}
+
+static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes,
+				    loff_t off)
+{
+	return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MIN);
+}
+
+static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes,
+				    loff_t off)
+{
+	return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MAX);
+}
+
+static inline void cpu_uclamp_print(struct seq_file *sf,
+				    enum uclamp_id clamp_id)
+{
+	struct task_group *tg;
+	u64 util_clamp;
+	u64 percent;
+	u32 rem;
+
+	rcu_read_lock();
+	tg = css_tg(seq_css(sf));
+	util_clamp = tg->uclamp_req[clamp_id].value;
+	rcu_read_unlock();
+
+	if (util_clamp == SCHED_CAPACITY_SCALE) {
+		seq_puts(sf, "max\n");
+		return;
+	}
+
+	percent = tg->uclamp_pct[clamp_id];
+	percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), &rem);
+	seq_printf(sf, "%llu.%0*u\n", percent, UCLAMP_PERCENT_SHIFT, rem);
+}
+
+static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
+{
+	cpu_uclamp_print(sf, UCLAMP_MIN);
+	return 0;
+}
+
+static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
+{
+	cpu_uclamp_print(sf, UCLAMP_MAX);
+	return 0;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7381,6 +7533,20 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "uclamp.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_min_show,
+		.write = cpu_uclamp_min_write,
+	},
+	{
+		.name = "uclamp.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_max_show,
+		.write = cpu_uclamp_max_write,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7548,6 +7714,20 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "uclamp.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_min_show,
+		.write = cpu_uclamp_min_write,
+	},
+	{
+		.name = "uclamp.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_max_show,
+		.write = cpu_uclamp_max_write,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7111e3a1eeb4..ae1be61fb279 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -391,6 +391,14 @@ struct task_group {
 #endif
 
 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	/* The two decimal precision [%] value requested from user-space */
+	unsigned int		uclamp_pct[UCLAMP_CNT];
+	/* Clamp values requested for a task group */
+	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+#endif
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.22.0

^ permalink raw reply related

* [PATCH v14 0/6] Add utilization clamping support (CGroups API)
From: Patrick Bellasi @ 2019-08-22 13:28 UTC (permalink / raw)
  To: linux-kernel, linux-pm, linux-api, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Michal Koutny,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

Hi all, this is a respin of:

  https://lore.kernel.org/lkml/20190802090853.4810-1-patrick.bellasi@arm.com/

which introduces only a small fix suggested by Michal and adds his Reviewed-by.
Thanks Michal for your additional review!

The series is based on top of today's tip/sched/core:

  commit a46d14eca7b7 ("sched/fair: Use rq_lock/unlock in online_fair_sched_group")

Since there was only minor changes, I've kept Tejun ACK tag.

Cheers,
Patrick


Series Organization
===================

The full tree is available here:

   git://linux-arm.org/linux-pb.git   lkml/utilclamp_v14
   http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v14


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [1].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviors of each task as intended by user-space.
As such they are not ideally suited for task placement decisions.

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorized user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism which allows to "clamp" (i.e. filter) the
utilization generated by RT and FAIR tasks within a range defined by user-space.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.

The main use-cases for utilization clamping are:

 - boosting: better interactive response for small tasks which
   are affecting the user experience.

   Consider for example the case of a small control thread for an external
   accelerator (e.g. GPU, DSP, other devices). Here, from the task utilization
   the scheduler does not have a complete view of what the task's requirements
   are and, if it's a small utilization task, it keeps selecting a more energy
   efficient CPU, with smaller capacity and lower frequency, thus negatively
   impacting the overall time required to complete task activations.

 - capping: increase energy efficiency for background tasks not affecting the
   user experience.

   Since running on a lower capacity CPU at a lower frequency is more energy
   efficient, when the completion time is not a main goal, then capping the
   utilization considered for certain (maybe big) tasks can have positive
   effects, both on energy consumption and thermal headroom.
   This feature allows also to make RT tasks more energy friendly on mobile
   systems where running them on high capacity CPUs and at the maximum
   frequency is not required.

>From these two use-cases, it's worth noticing that frequency selection
biasing, introduced by patches 9 and 10 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is in helping the scheduler in making tasks placement decisions.

Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

 - boosting: try to run small/foreground tasks on higher-capacity CPUs to
   complete them faster despite being less energy efficient.

 - capping: try to run big/background tasks on low-capacity CPUs to save power
   and thermal headroom for more important tasks

This series does not present this additional usage of utilization clamping but
it's an integral part of the EAS feature set, where [2] is one of its main
components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] Energy Aware Scheduling
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
    Linux Plumbers Conference 2018
    https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (6):
  sched/core: uclamp: Extend CPU's cgroup controller
  sched/core: uclamp: Propagate parent clamps
  sched/core: uclamp: Propagate system defaults to root group
  sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  sched/core: uclamp: Update CPU's refcount on TG's clamp changes
  sched/core: uclamp: always use enum uclamp_id for clamp_id values

 Documentation/admin-guide/cgroup-v2.rst |  34 +++
 init/Kconfig                            |  22 ++
 kernel/sched/core.c                     | 375 ++++++++++++++++++++++--
 kernel/sched/sched.h                    |  12 +-
 4 files changed, 421 insertions(+), 22 deletions(-)

-- 
2.22.0

^ permalink raw reply

* Re: [PATCH v8 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)
From: Yu-cheng Yu @ 2019-08-21 14:46 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pa
In-Reply-To: <20190821102052.GD6752@zn.tnic>

On Wed, 2019-08-21 at 12:20 +0200, Borislav Petkov wrote:
> On Tue, Aug 13, 2019 at 01:52:00PM -0700, Yu-cheng Yu wrote:
> > Add CPU feature flags for Control-flow Enforcement Technology (CET).
> > 
> > [...]
> > diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-
> > deps.c
> > index b5353244749b..9bf35f081080 100644
> > --- a/arch/x86/kernel/cpu/cpuid-deps.c
> > +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> > @@ -68,6 +68,8 @@ static const struct cpuid_dep cpuid_deps[] = {
> >  	{ X86_FEATURE_CQM_MBM_TOTAL,	X86_FEATURE_CQM_LLC   },
> >  	{ X86_FEATURE_CQM_MBM_LOCAL,	X86_FEATURE_CQM_LLC   },
> >  	{ X86_FEATURE_AVX512_BF16,	X86_FEATURE_AVX512VL  },
> > +	{ X86_FEATURE_SHSTK,		X86_FEATURE_XSAVES    },
> > +	{ X86_FEATURE_IBT,		X86_FEATURE_XSAVES    },
> 
> This hunk needs re-tabbing after:
> 
> 1e0c08e3034d ("cpu/cpuid-deps: Add a tab to cpuid dependent features")

Thanks, I will fix it.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v8 03/27] x86/fpu/xstate: Change names to separate XSAVES system and user states
From: Borislav Petkov @ 2019-08-21 14:45 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pa
In-Reply-To: <20190813205225.12032-4-yu-cheng.yu@intel.com>

On Tue, Aug 13, 2019 at 01:52:01PM -0700, Yu-cheng Yu wrote:
> Control-flow Enforcement (CET) MSR contents are XSAVES system states.
> To support CET, introduce XSAVES system states first.
> 
> XSAVES is a "supervisor" instruction and, comparing to XSAVE, saves
> additional "supervisor" states that can be modified only from CPL 0.
> However, these states are per-task and not kernel's own.  Rename
> "supervisor" states to "system" states to clearly separate them from
> "user" states.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/fpu/internal.h |  4 +-
>  arch/x86/include/asm/fpu/xstate.h   | 20 +++----
>  arch/x86/kernel/fpu/init.c          |  2 +-
>  arch/x86/kernel/fpu/signal.c        | 10 ++--
>  arch/x86/kernel/fpu/xstate.c        | 86 ++++++++++++++---------------
>  5 files changed, 60 insertions(+), 62 deletions(-)

...

> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index e5cb67d67c03..d560e8861a3c 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -54,13 +54,16 @@ static short xsave_cpuid_features[] __initdata = {
>  };
>  
>  /*
> - * Mask of xstate features supported by the CPU and the kernel:
> + * XSAVES system states can only be modified from CPL 0 and saved by
> + * XSAVES.  The rest are user states.  The following is a mask of
> + * supported user state features derived from boot_cpu_has() and

...derived from detected CPUID feature flags and
SUPPORTED_XFEATURES_MASK.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Chao Yu @ 2019-08-21 13:15 UTC (permalink / raw)
  To: Daniel Rosenberg, Jaegeuk Kim, Chao Yu, Jonathan Corbet,
	linux-f2fs-devel
  Cc: linux-doc, linux-api, linux-kernel, linux-fsdevel, kernel-team
In-Reply-To: <20190723230529.251659-4-drosen@google.com>

On 2019-7-24 7:05, Daniel Rosenberg via Linux-f2fs-devel wrote:
> +static int f2fs_d_compare(const struct dentry *dentry, unsigned int len,
> + const char *str, const struct qstr *name)
> +{
> +	struct qstr qstr = {.name = str, .len = len };
> +
> +	if (!IS_CASEFOLDED(dentry->d_parent->d_inode)) {
> +		if (len != name->len)
> +			return -1;
> +		return memcmp(str, name, len);

66883da1eee8 ("ext4: fix dcache lookup of !casefolded directories")

memcmp(str, name->name, len);

^ permalink raw reply

* Re: [PATCH v8 02/27] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)
From: Borislav Petkov @ 2019-08-21 10:20 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pa
In-Reply-To: <20190813205225.12032-3-yu-cheng.yu@intel.com>

On Tue, Aug 13, 2019 at 01:52:00PM -0700, Yu-cheng Yu wrote:
> Add CPU feature flags for Control-flow Enforcement Technology (CET).
> 
> CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
> CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect branch tracking
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/cpufeatures.h | 2 ++
>  arch/x86/kernel/cpu/cpuid-deps.c   | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index e880f2408e29..122265ab46c1 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -334,6 +334,7 @@
>  #define X86_FEATURE_OSPKE		(16*32+ 4) /* OS Protection Keys Enable */
>  #define X86_FEATURE_WAITPKG		(16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
>  #define X86_FEATURE_AVX512_VBMI2	(16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
> +#define X86_FEATURE_SHSTK		(16*32+ 7) /* Shadow Stack */
>  #define X86_FEATURE_GFNI		(16*32+ 8) /* Galois Field New Instructions */
>  #define X86_FEATURE_VAES		(16*32+ 9) /* Vector AES */
>  #define X86_FEATURE_VPCLMULQDQ		(16*32+10) /* Carry-Less Multiplication Double Quadword */
> @@ -358,6 +359,7 @@
>  #define X86_FEATURE_MD_CLEAR		(18*32+10) /* VERW clears CPU buffers */
>  #define X86_FEATURE_TSX_FORCE_ABORT	(18*32+13) /* "" TSX_FORCE_ABORT */
>  #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
> +#define X86_FEATURE_IBT			(18*32+20) /* Indirect Branch Tracking */
>  #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
>  #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
>  #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
> diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
> index b5353244749b..9bf35f081080 100644
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -68,6 +68,8 @@ static const struct cpuid_dep cpuid_deps[] = {
>  	{ X86_FEATURE_CQM_MBM_TOTAL,	X86_FEATURE_CQM_LLC   },
>  	{ X86_FEATURE_CQM_MBM_LOCAL,	X86_FEATURE_CQM_LLC   },
>  	{ X86_FEATURE_AVX512_BF16,	X86_FEATURE_AVX512VL  },
> +	{ X86_FEATURE_SHSTK,		X86_FEATURE_XSAVES    },
> +	{ X86_FEATURE_IBT,		X86_FEATURE_XSAVES    },

This hunk needs re-tabbing after:

1e0c08e3034d ("cpu/cpuid-deps: Add a tab to cpuid dependent features")

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply

* Re: [PATCH v5 1/9] fpga: dfl: make init callback optional
From: Wu Hao @ 2019-08-21  5:12 UTC (permalink / raw)
  To: Moritz Fischer
  Cc: gregkh, linux-fpga, linux-kernel, linux-api, linux-doc, atull
In-Reply-To: <20190821032406.GA28625@archbox>

On Tue, Aug 20, 2019 at 08:24:06PM -0700, Moritz Fischer wrote:
> Hi,
> 
> On Mon, Aug 12, 2019 at 10:49:56AM +0800, Wu Hao wrote:
> > This patch makes init callback of sub features optional. With
> > this change, people don't need to prepare any empty init callback.
> > 
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> 
> Acked-by: Moritz Fischer <mdf@kernel.org>
> > ---
> >  drivers/fpga/dfl.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> > index c0512af..96a2b82 100644
> > --- a/drivers/fpga/dfl.c
> > +++ b/drivers/fpga/dfl.c
> > @@ -271,11 +271,13 @@ static int dfl_feature_instance_init(struct platform_device *pdev,
> >  				     struct dfl_feature *feature,
> >  				     struct dfl_feature_driver *drv)
> >  {
> > -	int ret;
> > +	int ret = 0;
> >  
> > -	ret = drv->ops->init(pdev, feature);
> > -	if (ret)
> > -		return ret;
> > +	if (drv->ops->init) {
> > +		ret = drv->ops->init(pdev, feature);
> > +		if (ret)
> > +			return ret;
> > +	}
> >  
> >  	feature->ops = drv->ops;
> 
> You could swap it around maybe like so:
> 
> int dfl_feature_instance_init() ...
> {
> 	feature->ops = drv->ops;
> 	if (drv->ops->init)
> 		return drv->ops->init(pdev, feature);
> 
> 	return 0;
> }
> 
> With the caveat that feature->ops gets always set ...
> 
> Your call.

Hi Moritz,

Thanks a lot for the review and comments. It does simplify the code,
will modify it.

Thanks
Hao

> 
> Thanks,
> Moritz

^ permalink raw reply

* Re: [PATCH v5 2/9] fpga: dfl: fme: convert platform_driver to use dev_groups
From: Moritz Fischer @ 2019-08-21  3:28 UTC (permalink / raw)
  To: Wu Hao; +Cc: gregkh, mdf, linux-fpga, linux-kernel, linux-api, linux-doc,
	atull
In-Reply-To: <1565578204-13969-3-git-send-email-hao.wu@intel.com>

Hi Hao,

On Mon, Aug 12, 2019 at 10:49:57AM +0800, Wu Hao wrote:
> This patch takes advantage of driver core which helps to create
> and remove sysfs attribute files, so there is no need to register
> sysfs entries manually in dfl-fme platform river code.
Nit: s/river/driver
> 
> Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
>  drivers/fpga/dfl-fme-main.c | 29 ++---------------------------
>  1 file changed, 2 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index f033f1c..bf8114d 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -129,30 +129,6 @@ static ssize_t socket_id_show(struct device *dev,
>  };
>  ATTRIBUTE_GROUPS(fme_hdr);
>  
> -static int fme_hdr_init(struct platform_device *pdev,
> -			struct dfl_feature *feature)
> -{
> -	void __iomem *base = feature->ioaddr;
> -	int ret;
> -
> -	dev_dbg(&pdev->dev, "FME HDR Init.\n");
> -	dev_dbg(&pdev->dev, "FME cap %llx.\n",
> -		(unsigned long long)readq(base + FME_HDR_CAP));
> -
> -	ret = device_add_groups(&pdev->dev, fme_hdr_groups);
> -	if (ret)
> -		return ret;
> -
> -	return 0;
> -}
> -
> -static void fme_hdr_uinit(struct platform_device *pdev,
> -			  struct dfl_feature *feature)
> -{
> -	dev_dbg(&pdev->dev, "FME HDR UInit.\n");
> -	device_remove_groups(&pdev->dev, fme_hdr_groups);
> -}
> -
>  static long fme_hdr_ioctl_release_port(struct dfl_feature_platform_data *pdata,
>  				       unsigned long arg)
>  {
> @@ -199,8 +175,6 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
>  };
>  
>  static const struct dfl_feature_ops fme_hdr_ops = {
> -	.init = fme_hdr_init,
> -	.uinit = fme_hdr_uinit,
>  	.ioctl = fme_hdr_ioctl,
>  };
>  
> @@ -361,7 +335,8 @@ static int fme_remove(struct platform_device *pdev)
>  
>  static struct platform_driver fme_driver = {
>  	.driver	= {
> -		.name    = DFL_FPGA_FEATURE_DEV_FME,
> +		.name       = DFL_FPGA_FEATURE_DEV_FME,
> +		.dev_groups = fme_hdr_groups,
>  	},
>  	.probe   = fme_probe,
>  	.remove  = fme_remove,
> -- 
> 1.8.3.1
> 
Thanks,
Moritz

^ permalink raw reply

* Re: [PATCH v5 1/9] fpga: dfl: make init callback optional
From: Moritz Fischer @ 2019-08-21  3:24 UTC (permalink / raw)
  To: Wu Hao; +Cc: gregkh, mdf, linux-fpga, linux-kernel, linux-api, linux-doc,
	atull
In-Reply-To: <1565578204-13969-2-git-send-email-hao.wu@intel.com>

Hi,

On Mon, Aug 12, 2019 at 10:49:56AM +0800, Wu Hao wrote:
> This patch makes init callback of sub features optional. With
> this change, people don't need to prepare any empty init callback.
> 
> Signed-off-by: Wu Hao <hao.wu@intel.com>

Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
>  drivers/fpga/dfl.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> index c0512af..96a2b82 100644
> --- a/drivers/fpga/dfl.c
> +++ b/drivers/fpga/dfl.c
> @@ -271,11 +271,13 @@ static int dfl_feature_instance_init(struct platform_device *pdev,
>  				     struct dfl_feature *feature,
>  				     struct dfl_feature_driver *drv)
>  {
> -	int ret;
> +	int ret = 0;
>  
> -	ret = drv->ops->init(pdev, feature);
> -	if (ret)
> -		return ret;
> +	if (drv->ops->init) {
> +		ret = drv->ops->init(pdev, feature);
> +		if (ret)
> +			return ret;
> +	}
>  
>  	feature->ops = drv->ops;

You could swap it around maybe like so:

int dfl_feature_instance_init() ...
{
	feature->ops = drv->ops;
	if (drv->ops->init)
		return drv->ops->init(pdev, feature);

	return 0;
}

With the caveat that feature->ops gets always set ...

Your call.

Thanks,
Moritz

^ permalink raw reply

* Re: [PATCH V40 16/29] acpi: Disable ACPI table override if the kernel is locked down
From: Rafael J. Wysocki @ 2019-08-20 22:08 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, linux-api,
	Linn Crosetto, David Howells, Matthew Garrett, Kees Cook,
	linux-acpi
In-Reply-To: <20190820001805.241928-17-matthewgarrett@google.com>

On Tuesday, August 20, 2019 2:17:52 AM CEST Matthew Garrett wrote:
> From: Linn Crosetto <lcrosetto@gmail.com>
> 
> >From the kernel documentation (initrd_table_override.txt):
> 
>   If the ACPI_INITRD_TABLE_OVERRIDE compile option is true, it is possible
>   to override nearly any ACPI table provided by the BIOS with an
>   instrumented, modified one.
> 
> When lockdown is enabled, the kernel should disallow any unauthenticated
> changes to kernel space.  ACPI tables contain code invoked by the kernel,
> so do not allow ACPI tables to be overridden if the kernel is locked down.
> 
> Signed-off-by: Linn Crosetto <lcrosetto@gmail.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> cc: linux-acpi@vger.kernel.org
> Signed-off-by: James Morris <jmorris@namei.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/acpi/tables.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index de974322a197..b7c29a11c0c1 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -20,6 +20,7 @@
>  #include <linux/memblock.h>
>  #include <linux/earlycpio.h>
>  #include <linux/initrd.h>
> +#include <linux/security.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_ACPI_CUSTOM_DSDT
> @@ -577,6 +578,11 @@ void __init acpi_table_upgrade(void)
>  	if (table_nr == 0)
>  		return;
>  
> +	if (security_locked_down(LOCKDOWN_ACPI_TABLES)) {
> +		pr_notice("kernel is locked down, ignoring table override\n");
> +		return;
> +	}
> +
>  	acpi_tables_addr =
>  		memblock_find_in_range(0, ACPI_TABLE_UPGRADE_MAX_PHYS,
>  				       all_tables_size, PAGE_SIZE);
> 

^ permalink raw reply

* Re: [PATCH V40 15/29] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
From: Rafael J. Wysocki @ 2019-08-20 22:08 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, linux-api,
	Josh Boyer, David Howells, Matthew Garrett, Kees Cook, Dave Young,
	linux-acpi
In-Reply-To: <20190820001805.241928-16-matthewgarrett@google.com>

On Tuesday, August 20, 2019 2:17:51 AM CEST Matthew Garrett wrote:
> From: Josh Boyer <jwboyer@redhat.com>
> 
> This option allows userspace to pass the RSDP address to the kernel, which
> makes it possible for a user to modify the workings of hardware. Reject
> the option when the kernel is locked down. This requires some reworking
> of the existing RSDP command line logic, since the early boot code also
> makes use of a command-line passed RSDP when locating the SRAT table
> before the lockdown code has been initialised. This is achieved by
> separating the command line RSDP path in the early boot code from the
> generic RSDP path, and then copying the command line RSDP into boot
> params in the kernel proper if lockdown is not enabled. If lockdown is
> enabled and an RSDP is provided on the command line, this will only be
> used when parsing SRAT (which shouldn't permit kernel code execution)
> and will be ignored in the rest of the kernel.
> 
> (Modified by Matthew Garrett in order to handle the early boot RSDP
> environment)
> 
> Signed-off-by: Josh Boyer <jwboyer@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> cc: Dave Young <dyoung@redhat.com>
> cc: linux-acpi@vger.kernel.org
> Signed-off-by: James Morris <jmorris@namei.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/x86/boot/compressed/acpi.c | 19 +++++++++++++------
>  arch/x86/include/asm/acpi.h     |  9 +++++++++
>  arch/x86/include/asm/x86_init.h |  2 ++
>  arch/x86/kernel/acpi/boot.c     |  5 +++++
>  arch/x86/kernel/x86_init.c      |  1 +
>  drivers/acpi/osl.c              | 14 +++++++++++++-
>  include/linux/acpi.h            |  6 ++++++
>  7 files changed, 49 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/acpi.c b/arch/x86/boot/compressed/acpi.c
> index ad84239e595e..e726e9b44bb1 100644
> --- a/arch/x86/boot/compressed/acpi.c
> +++ b/arch/x86/boot/compressed/acpi.c
> @@ -26,7 +26,7 @@ struct mem_vector immovable_mem[MAX_NUMNODES*2];
>   */
>  #define MAX_ADDR_LEN 19
>  
> -static acpi_physical_address get_acpi_rsdp(void)
> +static acpi_physical_address get_cmdline_acpi_rsdp(void)
>  {
>  	acpi_physical_address addr = 0;
>  
> @@ -215,10 +215,7 @@ acpi_physical_address get_rsdp_addr(void)
>  {
>  	acpi_physical_address pa;
>  
> -	pa = get_acpi_rsdp();
> -
> -	if (!pa)
> -		pa = boot_params->acpi_rsdp_addr;
> +	pa = boot_params->acpi_rsdp_addr;
>  
>  	if (!pa)
>  		pa = efi_get_rsdp_addr();
> @@ -240,7 +237,17 @@ static unsigned long get_acpi_srat_table(void)
>  	char arg[10];
>  	u8 *entry;
>  
> -	rsdp = (struct acpi_table_rsdp *)(long)boot_params->acpi_rsdp_addr;
> +	/*
> +	 * Check whether we were given an RSDP on the command line. We don't
> +	 * stash this in boot params because the kernel itself may have
> +	 * different ideas about whether to trust a command-line parameter.
> +	 */
> +	rsdp = (struct acpi_table_rsdp *)get_cmdline_acpi_rsdp();
> +
> +	if (!rsdp)
> +		rsdp = (struct acpi_table_rsdp *)(long)
> +			boot_params->acpi_rsdp_addr;
> +
>  	if (!rsdp)
>  		return 0;
>  
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index aac686e1e005..bc9693c9107e 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -117,6 +117,12 @@ static inline bool acpi_has_cpu_in_madt(void)
>  	return !!acpi_lapic;
>  }
>  
> +#define ACPI_HAVE_ARCH_SET_ROOT_POINTER
> +static inline void acpi_arch_set_root_pointer(u64 addr)
> +{
> +	x86_init.acpi.set_root_pointer(addr);
> +}
> +
>  #define ACPI_HAVE_ARCH_GET_ROOT_POINTER
>  static inline u64 acpi_arch_get_root_pointer(void)
>  {
> @@ -125,6 +131,7 @@ static inline u64 acpi_arch_get_root_pointer(void)
>  
>  void acpi_generic_reduced_hw_init(void);
>  
> +void x86_default_set_root_pointer(u64 addr);
>  u64 x86_default_get_root_pointer(void);
>  
>  #else /* !CONFIG_ACPI */
> @@ -138,6 +145,8 @@ static inline void disable_acpi(void) { }
>  
>  static inline void acpi_generic_reduced_hw_init(void) { }
>  
> +static inline void x86_default_set_root_pointer(u64 addr) { }
> +
>  static inline u64 x86_default_get_root_pointer(void)
>  {
>  	return 0;
> diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> index b85a7c54c6a1..d584128435cb 100644
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -134,10 +134,12 @@ struct x86_hyper_init {
>  
>  /**
>   * struct x86_init_acpi - x86 ACPI init functions
> + * @set_root_poitner:		set RSDP address
>   * @get_root_pointer:		get RSDP address
>   * @reduced_hw_early_init:	hardware reduced platform early init
>   */
>  struct x86_init_acpi {
> +	void (*set_root_pointer)(u64 addr);
>  	u64 (*get_root_pointer)(void);
>  	void (*reduced_hw_early_init)(void);
>  };
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 17b33ef604f3..04205ce127a1 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -1760,6 +1760,11 @@ void __init arch_reserve_mem_area(acpi_physical_address addr, size_t size)
>  	e820__update_table_print();
>  }
>  
> +void x86_default_set_root_pointer(u64 addr)
> +{
> +	boot_params.acpi_rsdp_addr = addr;
> +}
> +
>  u64 x86_default_get_root_pointer(void)
>  {
>  	return boot_params.acpi_rsdp_addr;
> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> index 50a2b492fdd6..d0b8f5585a73 100644
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -95,6 +95,7 @@ struct x86_init_ops x86_init __initdata = {
>  	},
>  
>  	.acpi = {
> +		.set_root_pointer	= x86_default_set_root_pointer,
>  		.get_root_pointer	= x86_default_get_root_pointer,
>  		.reduced_hw_early_init	= acpi_generic_reduced_hw_init,
>  	},
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index cc7507091dec..b7c3aeb175dd 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -26,6 +26,7 @@
>  #include <linux/list.h>
>  #include <linux/jiffies.h>
>  #include <linux/semaphore.h>
> +#include <linux/security.h>
>  
>  #include <asm/io.h>
>  #include <linux/uaccess.h>
> @@ -180,8 +181,19 @@ acpi_physical_address __init acpi_os_get_root_pointer(void)
>  	acpi_physical_address pa;
>  
>  #ifdef CONFIG_KEXEC
> -	if (acpi_rsdp)
> +	/*
> +	 * We may have been provided with an RSDP on the command line,
> +	 * but if a malicious user has done so they may be pointing us
> +	 * at modified ACPI tables that could alter kernel behaviour -
> +	 * so, we check the lockdown status before making use of
> +	 * it. If we trust it then also stash it in an architecture
> +	 * specific location (if appropriate) so it can be carried
> +	 * over further kexec()s.
> +	 */
> +	if (acpi_rsdp && !security_locked_down(LOCKDOWN_ACPI_TABLES)) {
> +		acpi_arch_set_root_pointer(acpi_rsdp);
>  		return acpi_rsdp;
> +	}
>  #endif
>  	pa = acpi_arch_get_root_pointer();
>  	if (pa)
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index d315d86844e4..268a4d91f54c 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -632,6 +632,12 @@ bool acpi_gtdt_c3stop(int type);
>  int acpi_arch_timer_mem_init(struct arch_timer_mem *timer_mem, int *timer_count);
>  #endif
>  
> +#ifndef ACPI_HAVE_ARCH_SET_ROOT_POINTER
> +static inline void acpi_arch_set_root_pointer(u64 addr)
> +{
> +}
> +#endif
> +
>  #ifndef ACPI_HAVE_ARCH_GET_ROOT_POINTER
>  static inline u64 acpi_arch_get_root_pointer(void)
>  {
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox