* Re: [PATCHv5 25/37] x86/vdso: Switch image on setns()/clone()
From: Andy Lutomirski @ 2019-08-01 21:39 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Andy Lutomirski, Dmitry Safonov, LKML, Dmitry Safonov,
Adrian Reber, Andrei Vagin, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, Ingo Molnar, Jann Horn,
Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, Linux Containers, criu,
Linux API
In-Reply-To: <4D0E6734-066D-4A72-A119-2FD6482F857D@zytor.com>
On Wed, Jul 31, 2019 at 11:09 PM <hpa@zytor.com> wrote:
>
> On July 31, 2019 10:34:26 PM PDT, Andy Lutomirski <luto@kernel.org> wrote:
> >On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
> >>
> >> As it has been discussed on timens RFC, adding a new conditional
> >branch
> >> `if (inside_time_ns)` on VDSO for all processes is undesirable.
> >> It will add a penalty for everybody as branch predictor may
> >mispredict
> >> the jump. Also there are instruction cache lines wasted on cmp/jmp.
> >
> >
> >>
> >> +#ifdef CONFIG_TIME_NS
> >> +int vdso_join_timens(struct task_struct *task)
> >> +{
> >> + struct mm_struct *mm = task->mm;
> >> + struct vm_area_struct *vma;
> >> +
> >> + if (down_write_killable(&mm->mmap_sem))
> >> + return -EINTR;
> >> +
> >> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >> + unsigned long size = vma->vm_end - vma->vm_start;
> >> +
> >> + if (vma_is_special_mapping(vma, &vvar_mapping) ||
> >> + vma_is_special_mapping(vma, &vdso_mapping))
> >> + zap_page_range(vma, vma->vm_start, size);
> >> + }
> >
> >This is, unfortunately, fundamentally buggy. If any thread is in the
> >vDSO or has the vDSO on the stack (due to a signal, for example), this
> >will crash it. I can think of three solutions:
> >
> >1. Say that you can't setns() if you have other mms and ignore the
> >signal issue. Anything with green threads will disapprove. It's also
> >rather gross.
> >
> >2. Make it so that you can flip the static branch safely. As in my
> >other email, you'll need to deal with CoW somehow,
> >
> >3. Make it so that you can't change timens, or at least that you can't
> >turn timens on or off, without execve() or fork().
> >
> >BTW, that static branch probably needs to be aligned to a cache line
> >or something similar to avoid all the nastiness with trying to poke
> >text that might be concurrently executing. This will be a mess.
>
> Since we are talking about different physical addresses I believe we should be okay as long as they don't cross page boundaries, and even if they do it can be managed with proper page invalidation sequencing – it's not like the problems of having to deal with XMC on live pages like in the kernel.
>
> Still, you really need each instruction sequence to be present, with the only difference being specific patch sites.
>
> Any fundamental reason this can't be strictly data driven? Seems odd to me if it couldn't, but I might be missing something obvious.
I think it can be. There are at least two places where vDSO slow
paths could hook without affecting fast paths: vclock_mode and the low
bit of the sequence number.
^ permalink raw reply
* Re: [PATCH V37 19/29] Lock down module params that specify hardware parameters (eg. ioport)
From: Matthew Garrett @ 2019-08-01 20:44 UTC (permalink / raw)
To: Jessica Yu
Cc: James Morris, LSM List, Linux Kernel Mailing List, Linux API,
David Howells, Alan Cox, Kees Cook
In-Reply-To: <20190801161933.GB5834@linux-8ccs>
On Thu, Aug 1, 2019 at 9:19 AM Jessica Yu <jeyu@kernel.org> wrote:
> Hm, I don't think the doing parameter ended up being used in this function?
Thanks for catching that, I'll fix.
^ permalink raw reply
* Re: [PATCH V37 04/29] Enforce module signatures if the kernel is locked down
From: Matthew Garrett @ 2019-08-01 20:42 UTC (permalink / raw)
To: Jessica Yu
Cc: James Morris, LSM List, Linux Kernel Mailing List, Linux API,
David Howells, Kees Cook
In-Reply-To: <20190801142157.GA5834@linux-8ccs>
On Thu, Aug 1, 2019 at 7:22 AM Jessica Yu <jeyu@kernel.org> wrote:
> Apologies if this was addressed in another patch in your series (I've
> only skimmed the first few), but what should happen if the kernel is
> locked down, but CONFIG_MODULE_SIG=n? Or shouldn't CONFIG_SECURITY_LOCKDOWN_LSM
> depend on CONFIG_MODULE_SIG? Otherwise I think we'll end up calling
> the empty !CONFIG_MODULE_SIG module_sig_check() stub even though
> lockdown is enabled.
Hm. Someone could certainly configure their kernel in that way. I'm
not sure that tying CONFIG_SECURITY_LOCKDOWN_LSM to CONFIG_MODULE_SIG
is the right solution, since the new LSM approach means that any other
LSM could also impose the same policy. Perhaps we should just document
this?
^ permalink raw reply
* Re: [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Eric Biggers @ 2019-08-01 18:46 UTC (permalink / raw)
To: Theodore Y. Ts'o, linux-fscrypt, linux-fsdevel, linux-ext4,
linux-f2fs-devel, linux-mtd, linux-api, linux-crypto, keyrings,
Paul Crowley, Satya Tangirala
In-Reply-To: <20190801183554.GA223822@gmail.com>
On Thu, Aug 01, 2019 at 11:35:56AM -0700, Eric Biggers wrote:
>
> "fscrypt lock" actually doesn't exist yet; it's a missing feature. My patch to
> the fscrypt tool adds it. So we get to decide on the semantics. We don't want
> to require root, though; so for v2 policy keys, the real semantics have to be
> that "fscrypt lock" registers the key for the user, and "fscrypt unlock"
> unregisters it for the user.
>
I meant the other way around, of course: "fscrypt unlock" registers the key for
the user, and "fscrypt lock" unregisters it for the user.
- Eric
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Eric Biggers @ 2019-08-01 18:35 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190801053108.GD2769@mit.edu>
On Thu, Aug 01, 2019 at 01:31:08AM -0400, Theodore Y. Ts'o wrote:
> On Wed, Jul 31, 2019 at 06:11:40PM -0700, Eric Biggers wrote:
> >
> > Well, it's either
> >
> > 1a. Remove the user's handle.
> > OR
> > 1b. Remove all users' handles. (FSCRYPT_REMOVE_KEY_FLAG_ALL_USERS)
> >
> > Then
> >
> > 2. If no handles remain, try to evict all inodes that use the key.
> >
> > By "purge all keys" do you mean step (2)? Note that it doesn't require root by
> > itself; root is only required to remove other users' handles (1b).
>
> No, I was talking about 1b. I'd argue that 1a and 1b should be
> different ioctl. 1b requires root, and 1a doesn't.
>
> And 1a should just mean, "I don't need to use the encrypted files any
> more". In the PAM passphrase case, when you are just logging out, 1a
> is what's needed, and success is just fine. pam_session won't *care*
> whether or not there are other users keeping the key in use.
But in both cases, we still need to do the same things if the last user is
removed: remove the master key secret, try to evict the inodes, and (if all
inodes could be evicted) unlink the key object from the filesystem-level
keyring. So the ioctls would be nearly the same; it's just the
"remove key user(s)" step that would be different.
So in my view, these are variants of the same action, which is why it's a flag.
Likewise, there aren't separate ioctls for v1 and v2 policy keys, since
adding/removing those are variants on the same actions, and it allows the ioctls
to be extended to v3 in the future if it ever becomes necessary.
Now, I don't have *that* much of an issue with making it an separate ioctl
FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS. We can make them call the same function
internally, with a bool argument to distinguish the two ioctls. It just seems
like an unnecessary complication.
>
> The problem with "fscrypt lock" is that the non-privileged user sort
> of wants to do REMOVE_FLAG_KEY_FLAG_FOR_ALL_USERS, but they doesn't
> have the privileges to do it, and they are hoping that removing their
> own key removes it the key from the system. That to me seems to be
> the fundamental disconnect. The "fscrypt unlock" and "fscrypt lock"
> commands comes from the v1 key management, and requires root. It's
> the translation to unprivileged mode where "fscrypt lock" seems to
> have problems.
"fscrypt lock" actually doesn't exist yet; it's a missing feature. My patch to
the fscrypt tool adds it. So we get to decide on the semantics. We don't want
to require root, though; so for v2 policy keys, the real semantics have to be
that "fscrypt lock" registers the key for the user, and "fscrypt unlock"
unregisters it for the user.
One could argue that because of this they should be named "fscrypt register_key"
and "fscrypt unregister_key". However, in the vast majority of cases these will
be run as a single user, with a key that is not shared with any other user.
[In fact, I recently changed the fscrypt tool to automatically set mode 0700 on
new encrypted directories, since most people found it surprising that their
unlocked encrypted files weren't private to them by default. So if someone
wants to share their encrypted directory with another user, they now also need
to explicitly chmod it, unless the root user is involved.]
So presenting the commands as locking/unlocking a directory is a useful
simplication that makes them much easier to use, IMO. People shouldn't need to
understand multi-user key registration in order to unlock/lock their personal
encrypted directories.
And if "fscrypt lock" does nevertheless hit the case where other users remain, I
think it would be most user-friendly to print a warning like this:
Directory not fully locked; other users have added the key.
Run 'sudo fscrypt lock --all-users DIR' if you want to force-lock the directory.
We *could* make the --all-users option a separate subcommand like
"fscrypt force_lock", but again to me it seems like a variant of the same thing.
> > > It seems to me that the EBUSY and EUSERS errors should be status bits
> > > which gets returned to the user in a bitfield --- and if the key has
> > > been removed, or the user's claim on the key's existence has been
> > > removed, the ioctl returns success.
> > >
> > > That way we don't have to deal with the semantic disconnect where some
> > > errors don't actually change system state, and other errors that *do*
> > > change system state (as in, the key gets removed, or the user's claim
> > > on the key gets removed), but still returns than error.
> > >
> >
> > Do you mean use a positive return value, or do you mean add an output field to
> > the struct passed to the ioctl?
>
> I meant adding an output field. I see EBUSY and EUSERS as status bits
> which *some* use cases might find useful. Other use cases, such as in
> the pam_keys session logout code, we won't care at *all* about those
> status reporting (or error codes). So if EBUSY and EUSERS are
> returned as errors, then it adds to complexity of those programs
> whichd don't care. (And even for those that do, it's still a bit more
> complex since they has to distinguish between EBUSY, EUSERS, or other
> errors --- in fact, *all* programs which use
> FS_IOC_REMOVE_ENCRYPTION_KEY will *have* to check for EBUSY and
> ESUSERS whether they care or not.)
>
I see it a little differently: I'd prefer for the API to be "secure by default",
i.e. it tries hard to really remove the key, and it returns an error if it
couldn't really be removed (but still did as much as possible). And if the
caller is fine with some case(s) where the key wasn't truly removed, then they
can just explicitly handle those case(s).
You're suggesting the opposite: the ioctl will return 0 if the key was
unregistered for current user only, or if some files are still in use; and if
someone cares whether the key was *really* removed, then they'd need to check
the additional status bits.
That's easier to misuse in the more important ways, in my view. Now, it's not a
huge deal, as the API provides the same information either way, and regardless
of which one we choose, I'll make sure it's used correctly in fscrypt, Android,
Chrome OS, etc...
> > Either way note that it doesn't really need to be a bitfield, since you can't
> > have both statuses at the same time. I.e. if there are still other users, we
> > couldn't have even gotten to checking for in-use files.
>
> That's actually an implementation detail, though, right? In theory,
> we could check to see if there are any in-use files, independently of
> whether there are any users or not.
>
Yes, but wouldn't people assume that if the bitfield is provided, that all the
bits are actually filled in? Remember that to determine the "in-use files" bit
we have to actually go through the inode list and try to evict all the inodes.
That's not really something we should be doing before the last user is removed.
- Eric
^ permalink raw reply
* Re: [PATCH bpf-next v10 10/10] landlock: Add user and kernel documentation for Landlock
From: Randy Dunlap @ 2019-08-01 17:49 UTC (permalink / raw)
To: Mickaël Salaün, Mickaël Salaün, linux-kernel
Cc: Alexander Viro, Alexei Starovoitov, Andrew Morton,
Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
Daniel Borkmann, David Drysdale, David S . Miller,
Eric W . Biederman, James Morris, Jann Horn, John Johansen,
Jonathan Corbet, Kees Cook, Michael Kerrisk, Paul Moore,
Sargun Dhillon, Serge E . Hallyn
In-Reply-To: <2ced8fc8-79a6-b0fb-70fe-6716fae92aa7@ssi.gouv.fr>
On 8/1/19 10:03 AM, Mickaël Salaün wrote:
>>> +Ptrace restrictions
>>> +-------------------
>>> +
>>> +A landlocked process has less privileges than a non-landlocked process and must
>>> +then be subject to additional restrictions when manipulating another process.
>>> +To be allowed to use :manpage:`ptrace(2)` and related syscalls on a target
>>> +process, a landlocked process must have a subset of the target process programs.
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> Maybe that last statement is correct, but it seems to me that it is missing something.
> What about this:
>
> To be allowed to trace a process (using :manpage:`ptrace(2)`), a
> landlocked tracer process must only be constrained by a subset (possibly
> empty) of the Landlock programs which are also applied to the tracee.
> This ensure that the tracer has less or the same constraints than the
ensures
> tracee, hence protecting against privilege escalation.
Yes, better. Thanks.
--
~Randy
^ permalink raw reply
* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Alexei Starovoitov @ 2019-08-01 17:35 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Mickaël Salaün, LKML, Alexander Viro,
Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
David Drysdale, David S . Miller, Eric W . Biederman,
James Morris, Jann Horn, John Johansen, Jonathan Corbet,
Kees Cook, Michael Kerrisk, Paul Moore
In-Reply-To: <59e8fab9-34df-0ebe-ca6b-8b34bf582b75@ssi.gouv.fr>
On Wed, Jul 31, 2019 at 09:11:10PM +0200, Mickaël Salaün wrote:
>
>
> On 31/07/2019 20:58, Alexei Starovoitov wrote:
> > On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
> > <mickael.salaun@ssi.gouv.fr> wrote:
> >>>> + for (i = 0; i < htab->n_buckets; i++) {
> >>>> + head = select_bucket(htab, i);
> >>>> + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> >>>> + landlock_inode_remove_map(*((struct inode **)l->key), map);
> >>>> + }
> >>>> + }
> >>>> + htab_map_free(map);
> >>>> +}
> >>>
> >>> user space can delete the map.
> >>> that will trigger inode_htab_map_free() which will call
> >>> landlock_inode_remove_map().
> >>> which will simply itereate the list and delete from the list.
> >>
> >> landlock_inode_remove_map() removes the reference to the map (being
> >> freed) from the inode (with an RCU lock).
> >
> > I'm going to ignore everything else for now and focus only on this bit,
> > since it's fundamental issue to address before this discussion can
> > go any further.
> > rcu_lock is not a spin_lock. I'm pretty sure you know this.
> > But you're arguing that it's somehow protecting from the race
> > I mentioned above?
> >
>
> I was just clarifying your comment to avoid misunderstanding about what
> is being removed.
>
> As said in the full response, there is currently a race but, if I add a
> bpf_map_inc() call when the map is referenced by inode->security, then I
> don't see how a race could occur because such added map could only be
> freed in a security_inode_free() (as long as it retains a reference to
> this inode).
then it will be a cycle and a map will never be deleted?
closing map_fd should delete a map. It cannot be alive if it's not
pinned in bpffs, there are no FDs that are holding it, and no progs using it.
So the map deletion will iterate over inodes that belong to this map.
In parallel security_inode_free() will be called that will iterate
over its link list that contains elements from different maps.
So the same link list is modified by two cpus.
Where is a lock that protects from concurrent links list manipulations?
> Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your da
ta, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.
Please get rid of this. It's absolutely not appropriate on public mailing list.
Next time I'd have to ignore emails that contain such disclaimers.
^ permalink raw reply
* Re: [PATCH bpf-next v10 10/10] landlock: Add user and kernel documentation for Landlock
From: Mickaël Salaün @ 2019-08-01 17:03 UTC (permalink / raw)
To: Randy Dunlap, Mickaël Salaün, linux-kernel
Cc: Alexander Viro, Alexei Starovoitov, Andrew Morton,
Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
Daniel Borkmann, David Drysdale, David S . Miller,
Eric W . Biederman, James Morris, Jann Horn, John Johansen,
Jonathan Corbet, Kees Cook, Michael Kerrisk, Paul Moore,
Sargun Dhillon, Serge E . Hallyn
In-Reply-To: <88e90c22-1b78-c2f2-8823-fa776265361c@infradead.org>
Thanks for this spelling fixes. Some comments:
On 31/07/2019 03:53, Randy Dunlap wrote:
> On 7/21/19 2:31 PM, Mickaël Salaün wrote:
>> This documentation can be built with the Sphinx framework.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: James Morris <jmorris@namei.org>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Serge E. Hallyn <serge@hallyn.com>
>> ---
>>
>> Changes since v9:
>> * update with expected attach type and expected attach triggers
>>
>> Changes since v8:
>> * remove documentation related to chaining and tagging according to this
>> patch series
>>
>> Changes since v7:
>> * update documentation according to the Landlock revamp
>>
>> Changes since v6:
>> * add a check for ctx->event
>> * rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
>> * rename Landlock version to ABI to better reflect its purpose and add a
>> dedicated changelog section
>> * update tables
>> * relax no_new_privs recommendations
>> * remove ABILITY_WRITE related functions
>> * reword rule "appending" to "prepending" and explain it
>> * cosmetic fixes
>>
>> Changes since v5:
>> * update the rule hierarchy inheritance explanation
>> * briefly explain ctx->arg2
>> * add ptrace restrictions
>> * explain EPERM
>> * update example (subtype)
>> * use ":manpage:"
>> ---
>> Documentation/security/index.rst | 1 +
>> Documentation/security/landlock/index.rst | 20 +++
>> Documentation/security/landlock/kernel.rst | 99 ++++++++++++++
>> Documentation/security/landlock/user.rst | 147 +++++++++++++++++++++
>> 4 files changed, 267 insertions(+)
>> create mode 100644 Documentation/security/landlock/index.rst
>> create mode 100644 Documentation/security/landlock/kernel.rst
>> create mode 100644 Documentation/security/landlock/user.rst
>
>
>> diff --git a/Documentation/security/landlock/kernel.rst b/Documentation/security/landlock/kernel.rst
>> new file mode 100644
>> index 000000000000..7d1e06d544bf
>> --- /dev/null
>> +++ b/Documentation/security/landlock/kernel.rst
>> @@ -0,0 +1,99 @@
>> +==============================
>> +Landlock: kernel documentation
>> +==============================
>> +
>> +eBPF properties
>> +===============
>> +
>> +To get an expressive language while still being safe and small, Landlock is
>> +based on eBPF. Landlock should be usable by untrusted processes and must
>> +therefore expose a minimal attack surface. The eBPF bytecode is minimal,
>> +powerful, widely used and designed to be used by untrusted applications. Thus,
>> +reusing the eBPF support in the kernel enables a generic approach while
>> +minimizing new code.
>> +
>> +An eBPF program has access to an eBPF context containing some fields used to
>> +inspect the current object. These arguments can be used directly (e.g. cookie)
>> +or passed to helper functions according to their types (e.g. inode pointer). It
>> +is then possible to do complex access checks without race conditions or
>> +inconsistent evaluation (i.e. `incorrect mirroring of the OS code and state
>> +<https://www.ndss-symposium.org/ndss2003/traps-and-pitfalls-practical-problems-system-call-interposition-based-security-tools/>`_).
>> +
>> +A Landlock hook describes a particular access type. For now, there is two
>
> there are two
>
>> +hooks dedicated to filesystem related operations: LANDLOCK_HOOK_FS_PICK and
>> +LANDLOCK_HOOK_FS_WALK. A Landlock program is tied to one hook. This makes it
>> +possible to statically check context accesses, potentially performed by such
>> +program, and hence prevents kernel address leaks and ensure the right use of
>
> ensures
>
>> +hook arguments with eBPF functions. Any user can add multiple Landlock
>> +programs per Landlock hook. They are stacked and evaluated one after the
>> +other, starting from the most recent program, as seccomp-bpf does with its
>> +filters. Underneath, a hook is an abstraction over a set of LSM hooks.
>> +
>> +
>> +Guiding principles
>> +==================
>> +
>> +Unprivileged use
>> +----------------
>> +
>> +* Landlock helpers and context should be usable by any unprivileged and
>> + untrusted program while following the system security policy enforced by
>> + other access control mechanisms (e.g. DAC, LSM).
>> +
>> +
>> +Landlock hook and context
>> +-------------------------
>> +
>> +* A Landlock hook shall be focused on access control on kernel objects instead
>> + of syscall filtering (i.e. syscall arguments), which is the purpose of
>> + seccomp-bpf.
>> +* A Landlock context provided by a hook shall express the minimal and more
>> + generic interface to control an access for a kernel object.
>> +* A hook shall guaranty that all the BPF function calls from a program are> + safe. Thus, the related Landlock context arguments shall always be of the
>> + same type for a particular hook. For example, a network hook could share
>> + helpers with a file hook because of UNIX socket. However, the same helpers
>> + may not be compatible for a file system handle and a net handle.
>> +* Multiple hooks may use the same context interface.
>> +
>> +
>> +Landlock helpers
>> +----------------
>> +
>> +* Landlock helpers shall be as generic as possible while at the same time being
>> + as simple as possible and following the syscall creation principles (cf.
>> + *Documentation/adding-syscalls.txt*).
>> +* The only behavior change allowed on a helper is to fix a (logical) bug to
>> + match the initial semantic.
>> +* Helpers shall be reentrant, i.e. only take inputs from arguments (e.g. from
>> + the BPF context), to enable a hook to use a cache. Future program options
>> + might change this cache behavior.
>> +* It is quite easy to add new helpers to extend Landlock. The main concern
>> + should be about the possibility to leak information from the kernel that may
>> + not be accessible otherwise (i.e. side-channel attack).
>> +
>> +
>> +Questions and answers
>> +=====================
>> +
>> +Why not create a custom hook for each kind of action?
>> +-----------------------------------------------------
>> +
>> +Landlock programs can handle these checks. Adding more exceptions to the
>> +kernel code would lead to more code complexity. A decision to ignore a kind of
>> +action can and should be done at the beginning of a Landlock program.
>> +
>> +
>> +Why a program does not return an errno or a kill code?
>> +------------------------------------------------------
>> +
>> +seccomp filters can return multiple kind of code, including an errno value or a
>
> kinds
>
>> +kill signal, which may be convenient for access control. Those return codes
>> +are hardwired in the userland ABI. Instead, Landlock's approach is to return a
>> +boolean to allow or deny an action, which is much simpler and more generic.
>> +Moreover, we do not really have a choice because, unlike to seccomp, Landlock
>> +programs are not enforced at the syscall entry point but may be executed at any
>> +point in the kernel (through LSM hooks) where an errno return code may not make
>> +sense. However, with this simple ABI and with the ability to call helpers,
>> +Landlock may gain features similar to seccomp-bpf in the future while being
>> +compatible with previous programs.
>> diff --git a/Documentation/security/landlock/user.rst b/Documentation/security/landlock/user.rst
>> new file mode 100644
>> index 000000000000..14c4f3b377bd
>> --- /dev/null
>> +++ b/Documentation/security/landlock/user.rst
>> @@ -0,0 +1,147 @@
>> +================================
>> +Landlock: userland documentation
>> +================================
>> +
>> +Landlock programs
>> +=================
>> +
>> +eBPF programs are used to create security programs. They are contained and can
>> +call only a whitelist of dedicated functions. Moreover, they can only loop
>> +under strict conditions, which protects from denial of service. More
>> +information on BPF can be found in *Documentation/networking/filter.txt*.
>> +
>> +
>> +Writing a program
>> +-----------------
>> +
>> +To enforce a security policy, a thread first needs to create a Landlock program.
>> +The easiest way to write an eBPF program depicting a security program is to write
>> +it in the C language. As described in *samples/bpf/README.rst*, LLVM can
>> +compile such programs. Files *samples/bpf/landlock1_kern.c* and those in
>> +*tools/testing/selftests/landlock/* can be used as examples.
>> +
>> +Once the eBPF program is created, the next step is to create the metadata
>> +describing the Landlock program. This metadata includes an expected attach type which
>> +contains the hook type to which the program is tied, and expected attach
>> +triggers which identify the actions for which the program should be run.
>> +
>> +A hook is a policy decision point which exposes the same context type for
>> +each program evaluation.
>> +
>> +A Landlock hook describes the kind of kernel object for which a program will be
>> +triggered to allow or deny an action. For example, the hook
>> +BPF_LANDLOCK_FS_PICK can be triggered every time a landlocked thread performs a
>> +set of action related to the filesystem (e.g. open, read, write, mount...).
>
> actions
>
>> +This actions are identified by the `triggers` bitfield.
>> +
>> +The next step is to fill a :c:type:`struct bpf_load_program_attr
>> +<bpf_load_program_attr>` with BPF_PROG_TYPE_LANDLOCK_HOOK, the expected attach
>> +type and other BPF program metadata. This bpf_attr must then be passed to the
>> +:manpage:`bpf(2)` syscall alongside the BPF_PROG_LOAD command. If everything
>> +is deemed correct by the kernel, the thread gets a file descriptor referring to
>> +this program.
>> +
>> +In the following code, the *insn* variable is an array of BPF instructions
>> +which can be extracted from an ELF file as is done in bpf_load_file() from
>> +*samples/bpf/bpf_load.c*.
>
> A little confusing. Is there a mixup of <insn> and <insns>?
Indeed, a typo was inserted with a rewrite of this part.
>
>> +
>> +.. code-block:: c
>> +
>> + int prog_fd;
>> + struct bpf_load_program_attr load_attr;
>> +
>> + memset(&load_attr, 0, sizeof(struct bpf_load_program_attr));
>> + load_attr.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK;
>> + load_attr.expected_attach_type = BPF_LANDLOCK_FS_PICK;
>> + load_attr.expected_attach_triggers = LANDLOCK_TRIGGER_FS_PICK_OPEN;
>> + load_attr.insns = insns;
>> + load_attr.insns_cnt = sizeof(insn) / sizeof(struct bpf_insn);
>> + load_attr.license = "GPL";
>> +
>> + prog_fd = bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz);
>> + if (prog_fd == -1)
>> + exit(1);
>> +
>> +
>> +Enforcing a program
>> +-------------------
>> +
>> +Once the Landlock program has been created or received (e.g. through a UNIX
>> +socket), the thread willing to sandbox itself (and its future children) should
>> +perform the following two steps.
>> +
>> +The thread should first request to never be allowed to get new privileges with a
>> +call to :manpage:`prctl(2)` and the PR_SET_NO_NEW_PRIVS option. More
>> +information can be found in *Documentation/prctl/no_new_privs.txt*.
>> +
>> +.. code-block:: c
>> +
>> + if (prctl(PR_SET_NO_NEW_PRIVS, 1, NULL, 0, 0))
>> + exit(1);
>> +
>> +A thread can apply a program to itself by using the :manpage:`seccomp(2)` syscall.
>> +The operation is SECCOMP_PREPEND_LANDLOCK_PROG, the flags must be empty and the
>> +*args* argument must point to a valid Landlock program file descriptor.
>> +
>> +.. code-block:: c
>> +
>> + if (seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &fd))
>> + exit(1);
>> +
>> +If the syscall succeeds, the program is now enforced on the calling thread and
>> +will be enforced on all its subsequently created children of the thread as
>> +well. Once a thread is landlocked, there is no way to remove this security
>> +policy, only stacking more restrictions is allowed. The program evaluation is
>> +performed from the newest to the oldest.
>> +
>> +When a syscall ask for an action on a kernel object, if this action is denied,
>
> asks
>
>> +then an EACCES errno code is returned through the syscall.
>> +
>> +
>> +.. _inherited_programs:
>> +
>> +Inherited programs
>> +------------------
>> +
>> +Every new thread resulting from a :manpage:`clone(2)` inherits Landlock program
>> +restrictions from its parent. This is similar to the seccomp inheritance as
>> +described in *Documentation/prctl/seccomp_filter.txt*.
>> +
>> +
>> +Ptrace restrictions
>> +-------------------
>> +
>> +A landlocked process has less privileges than a non-landlocked process and must
>> +then be subject to additional restrictions when manipulating another process.
>> +To be allowed to use :manpage:`ptrace(2)` and related syscalls on a target
>> +process, a landlocked process must have a subset of the target process programs.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Maybe that last statement is correct, but it seems to me that it is missing something.
What about this:
To be allowed to trace a process (using :manpage:`ptrace(2)`), a
landlocked tracer process must only be constrained by a subset (possibly
empty) of the Landlock programs which are also applied to the tracee.
This ensure that the tracer has less or the same constraints than the
tracee, hence protecting against privilege escalation.
>
>> +
>> +
>> +Landlock structures and constants
>> +=================================
>> +
>> +Hook types
>> +----------
>> +
>> +.. kernel-doc:: include/uapi/linux/landlock.h
>> + :functions: landlock_hook_type
>> +
>> +
>> +Contexts
>> +--------
>> +
>> +.. kernel-doc:: include/uapi/linux/landlock.h
>> + :functions: landlock_ctx_fs_pick landlock_ctx_fs_walk landlock_ctx_fs_get
>> +
>> +
>> +Triggers for fs_pick
>> +--------------------
>> +
>> +.. kernel-doc:: include/uapi/linux/landlock.h
>> + :functions: landlock_triggers
>> +
>> +
>> +Additional documentation
>> +========================
>> +
>> +See https://landlock.io
>>
>
>
--
Mickaël Salaün
ANSSI/SDE/ST/LAM
Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.
^ permalink raw reply
* Re: [PATCH V37 19/29] Lock down module params that specify hardware parameters (eg. ioport)
From: Jessica Yu @ 2019-08-01 16:19 UTC (permalink / raw)
To: Matthew Garrett
Cc: jmorris, linux-security-module, linux-kernel, linux-api,
David Howells, Alan Cox, Matthew Garrett, Kees Cook
In-Reply-To: <20190731221617.234725-20-matthewgarrett@google.com>
+++ Matthew Garrett [31/07/19 15:16 -0700]:
>From: David Howells <dhowells@redhat.com>
>
>Provided an annotation for module parameters that specify hardware
>parameters (such as io ports, iomem addresses, irqs, dma channels, fixed
>dma buffers and other types).
>
>Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
>Signed-off-by: David Howells <dhowells@redhat.com>
>Signed-off-by: Matthew Garrett <mjg59@google.com>
>Reviewed-by: Kees Cook <keescook@chromium.org>
>Cc: Jessica Yu <jeyu@kernel.org>
>---
> include/linux/security.h | 1 +
> kernel/params.c | 28 +++++++++++++++++++++++-----
> security/lockdown/lockdown.c | 1 +
> 3 files changed, 25 insertions(+), 5 deletions(-)
>
>diff --git a/include/linux/security.h b/include/linux/security.h
>index 8f7048395114..43fa3486522b 100644
>--- a/include/linux/security.h
>+++ b/include/linux/security.h
>@@ -113,6 +113,7 @@ enum lockdown_reason {
> LOCKDOWN_ACPI_TABLES,
> LOCKDOWN_PCMCIA_CIS,
> LOCKDOWN_TIOCSSERIAL,
>+ LOCKDOWN_MODULE_PARAMETERS,
> LOCKDOWN_INTEGRITY_MAX,
> LOCKDOWN_CONFIDENTIALITY_MAX,
> };
>diff --git a/kernel/params.c b/kernel/params.c
>index cf448785d058..f2779a76d39a 100644
>--- a/kernel/params.c
>+++ b/kernel/params.c
>@@ -12,6 +12,7 @@
> #include <linux/err.h>
> #include <linux/slab.h>
> #include <linux/ctype.h>
>+#include <linux/security.h>
>
> #ifdef CONFIG_SYSFS
> /* Protects all built-in parameters, modules use their own param_lock */
>@@ -96,13 +97,20 @@ bool parameq(const char *a, const char *b)
> return parameqn(a, b, strlen(a)+1);
> }
>
>-static void param_check_unsafe(const struct kernel_param *kp)
>+static bool param_check_unsafe(const struct kernel_param *kp,
>+ const char *doing)
Hm, I don't think the doing parameter ended up being used in this function?
> {
>+ if (kp->flags & KERNEL_PARAM_FL_HWPARAM &&
>+ security_locked_down(LOCKDOWN_MODULE_PARAMETERS))
>+ return false;
>+
> if (kp->flags & KERNEL_PARAM_FL_UNSAFE) {
> pr_notice("Setting dangerous option %s - tainting kernel\n",
> kp->name);
> add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> }
>+
>+ return true;
> }
>
> static int parse_one(char *param,
>@@ -132,8 +140,10 @@ static int parse_one(char *param,
> pr_debug("handling %s with %p\n", param,
> params[i].ops->set);
> kernel_param_lock(params[i].mod);
>- param_check_unsafe(¶ms[i]);
>- err = params[i].ops->set(val, ¶ms[i]);
>+ if (param_check_unsafe(¶ms[i], doing))
>+ err = params[i].ops->set(val, ¶ms[i]);
>+ else
>+ err = -EPERM;
> kernel_param_unlock(params[i].mod);
> return err;
> }
>@@ -541,6 +551,12 @@ static ssize_t param_attr_show(struct module_attribute *mattr,
> return count;
> }
>
>+#ifdef CONFIG_MODULES
>+#define mod_name(mod) ((mod)->name)
>+#else
>+#define mod_name(mod) "unknown"
>+#endif
>+
> /* sysfs always hands a nul-terminated string in buf. We rely on that. */
> static ssize_t param_attr_store(struct module_attribute *mattr,
> struct module_kobject *mk,
>@@ -553,8 +569,10 @@ static ssize_t param_attr_store(struct module_attribute *mattr,
> return -EPERM;
>
> kernel_param_lock(mk->mod);
>- param_check_unsafe(attribute->param);
>- err = attribute->param->ops->set(buf, attribute->param);
>+ if (param_check_unsafe(attribute->param, mod_name(mk->mod)))
>+ err = attribute->param->ops->set(buf, attribute->param);
>+ else
>+ err = -EPERM;
> kernel_param_unlock(mk->mod);
> if (!err)
> return len;
>diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
>index 00a3a6438dd2..5177938cfa0d 100644
>--- a/security/lockdown/lockdown.c
>+++ b/security/lockdown/lockdown.c
>@@ -28,6 +28,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
> [LOCKDOWN_ACPI_TABLES] = "modifying ACPI tables",
> [LOCKDOWN_PCMCIA_CIS] = "direct PCMCIA CIS storage",
> [LOCKDOWN_TIOCSSERIAL] = "reconfiguration of serial port IO",
>+ [LOCKDOWN_MODULE_PARAMETERS] = "unsafe module parameters",
> [LOCKDOWN_INTEGRITY_MAX] = "integrity",
> [LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
> };
>--
>2.22.0.770.g0f2c4a37fd-goog
>
^ permalink raw reply
* Re: [PATCH V37 04/29] Enforce module signatures if the kernel is locked down
From: Jessica Yu @ 2019-08-01 14:21 UTC (permalink / raw)
To: Matthew Garrett
Cc: jmorris, linux-security-module, linux-kernel, linux-api,
David Howells, Kees Cook
In-Reply-To: <20190731221617.234725-5-matthewgarrett@google.com>
+++ Matthew Garrett [31/07/19 15:15 -0700]:
>From: David Howells <dhowells@redhat.com>
>
>If the kernel is locked down, require that all modules have valid
>signatures that we can verify.
>
>I have adjusted the errors generated:
>
> (1) If there's no signature (ENODATA) or we can't check it (ENOPKG,
> ENOKEY), then:
>
> (a) If signatures are enforced then EKEYREJECTED is returned.
>
> (b) If there's no signature or we can't check it, but the kernel is
> locked down then EPERM is returned (this is then consistent with
> other lockdown cases).
>
> (2) If the signature is unparseable (EBADMSG, EINVAL), the signature fails
> the check (EKEYREJECTED) or a system error occurs (eg. ENOMEM), we
> return the error we got.
>
>Note that the X.509 code doesn't check for key expiry as the RTC might not
>be valid or might not have been transferred to the kernel's clock yet.
>
> [Modified by Matthew Garrett to remove the IMA integration. This will
> be replaced with integration with the IMA architecture policy
> patchset.]
>
>Signed-off-by: David Howells <dhowells@redhat.com>
>Signed-off-by: Matthew Garrett <matthewgarrett@google.com>
>Reviewed-by: Kees Cook <keescook@chromium.org>
>Cc: Jessica Yu <jeyu@kernel.org>
>---
> include/linux/security.h | 1 +
> kernel/module.c | 37 +++++++++++++++++++++++++++++-------
> security/lockdown/lockdown.c | 1 +
> 3 files changed, 32 insertions(+), 7 deletions(-)
>
>diff --git a/include/linux/security.h b/include/linux/security.h
>index 54a0532ec12f..8e70063074a1 100644
>--- a/include/linux/security.h
>+++ b/include/linux/security.h
>@@ -103,6 +103,7 @@ enum lsm_event {
> */
> enum lockdown_reason {
> LOCKDOWN_NONE,
>+ LOCKDOWN_MODULE_SIGNATURE,
> LOCKDOWN_INTEGRITY_MAX,
> LOCKDOWN_CONFIDENTIALITY_MAX,
> };
>diff --git a/kernel/module.c b/kernel/module.c
>index cd8df516666d..318209889e26 100644
>--- a/kernel/module.c
>+++ b/kernel/module.c
>@@ -2771,8 +2771,9 @@ static inline void kmemleak_load_module(const struct module *mod,
> #ifdef CONFIG_MODULE_SIG
> static int module_sig_check(struct load_info *info, int flags)
> {
>- int err = -ENOKEY;
>+ int err = -ENODATA;
> const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1;
>+ const char *reason;
> const void *mod = info->hdr;
>
> /*
>@@ -2787,16 +2788,38 @@ static int module_sig_check(struct load_info *info, int flags)
> err = mod_verify_sig(mod, info);
> }
>
>- if (!err) {
>+ switch (err) {
>+ case 0:
> info->sig_ok = true;
> return 0;
>- }
>
>- /* Not having a signature is only an error if we're strict. */
>- if (err == -ENOKEY && !is_module_sig_enforced())
>- err = 0;
>+ /* We don't permit modules to be loaded into trusted kernels
>+ * without a valid signature on them, but if we're not
>+ * enforcing, certain errors are non-fatal.
>+ */
>+ case -ENODATA:
>+ reason = "Loading of unsigned module";
>+ goto decide;
>+ case -ENOPKG:
>+ reason = "Loading of module with unsupported crypto";
>+ goto decide;
>+ case -ENOKEY:
>+ reason = "Loading of module with unavailable key";
>+ decide:
>+ if (is_module_sig_enforced()) {
>+ pr_notice("%s is rejected\n", reason);
>+ return -EKEYREJECTED;
>+ }
>
>- return err;
>+ return security_locked_down(LOCKDOWN_MODULE_SIGNATURE);
>+
>+ /* All other errors are fatal, including nomem, unparseable
>+ * signatures and signature check failures - even if signatures
>+ * aren't required.
>+ */
>+ default:
>+ return err;
>+ }
> }
> #else /* !CONFIG_MODULE_SIG */
> static int module_sig_check(struct load_info *info, int flags)
Hi Matthew!
Apologies if this was addressed in another patch in your series (I've
only skimmed the first few), but what should happen if the kernel is
locked down, but CONFIG_MODULE_SIG=n? Or shouldn't CONFIG_SECURITY_LOCKDOWN_LSM
depend on CONFIG_MODULE_SIG? Otherwise I think we'll end up calling
the empty !CONFIG_MODULE_SIG module_sig_check() stub even though
lockdown is enabled.
Thanks,
Jessica
>diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
>index d30c4d254b5f..2c53fd9f5c9b 100644
>--- a/security/lockdown/lockdown.c
>+++ b/security/lockdown/lockdown.c
>@@ -18,6 +18,7 @@ static enum lockdown_reason kernel_locked_down;
>
> static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
> [LOCKDOWN_NONE] = "none",
>+ [LOCKDOWN_MODULE_SIGNATURE] = "unsigned module loading",
> [LOCKDOWN_INTEGRITY_MAX] = "integrity",
> [LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
> };
>--
>2.22.0.770.g0f2c4a37fd-goog
>
^ permalink raw reply
* Re: [PATCH v12 0/6] Add utilization clamping support (CGroups API)
From: Patrick Bellasi @ 2019-08-01 11:25 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, linux-pm, linux-api, cgroups, Ingo Molnar,
Peter Zijlstra, Rafael J . Wysocki, Vincent Guittot, Viresh Kumar,
Paul Turner, Michal Koutny, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
Steve Muckle, Suren Baghdasaryan, Alessio Balsini
In-Reply-To: <20190729200606.GA136335@devbig004.ftw2.facebook.com>
On 29-Jul 13:06, Tejun Heo wrote:
> Hello,
Hi Tejun,
> Looks good to me. On cgroup side,
>
> Acked-by: Tejun Heo <tj@kernel.org>
Thanks!
Happy we converged toward something working for you.
I'll add the two small changes suggested by Michal and respin a v13
with your ACK on the series.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
^ permalink raw reply
* Re: [PATCH v12 3/6] sched/core: uclamp: Propagate system defaults to root group
From: Patrick Bellasi @ 2019-08-01 11:10 UTC (permalink / raw)
To: Michal Koutný
Cc: cgroups, linux-api, linux-kernel, linux-pm, Alessio Balsini,
Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
Joel Fernandes, Paul Turner, Steve Muckle, Suren Baghdasaryan,
Todd Kjos, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
Vincent Guittot, Viresh Kumar, Juri Lelli, Ingo Molnar
In-Reply-To: <20190725114126.GA4130@blackbody.suse.cz>
On 25-Jul 13:41, Michal Koutný wrote:
> On Thu, Jul 18, 2019 at 07:17:45PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > The clamp values are not tunable at the level of the root task group.
> > That's for two main reasons:
> >
> > - the root group represents "system resources" which are always
> > entirely available from the cgroup standpoint.
> >
> > - when tuning/restricting "system resources" makes sense, tuning must
> > be done using a system wide API which should also be available when
> > control groups are not.
> >
> > When a system wide restriction is available, cgroups should be aware of
> > its value in order to know exactly how much "system resources" are
> > available for the subgroups.
> IIUC, the global default would apply in uclamp_eff_get(), so this
> propagation isn't strictly necessary in order to apply to tasks (that's
> how it works under !CONFIG_UCLAMP_TASK_GROUP).
That's right.
> The reason is that effective value (which isn't exposed currently) in a
> group takes into account this global restriction, right?
Yep, well admittedly in this area things changed in a slightly confusing way.
Up to v10:
- effective values was exposed to userspace
- system defaults was enforced only at enqueue time
Now instead:
- effective values are not exposed anymore (because of Tejun request)
- system defaults are applied to the root group and propagated down
the hierarchy to all effective values
Both solutions are functionally correct but, in the first case, the
cgroup's effective values was not really reflecting what a task will
get while, in the current solution, we force update all effective
values while not exposing them anymore.
However, I think this solution is better in keeping information more
consistent and should create less confusion if in the future we decide
to expose effective values to user-space.
Thought?
> > @@ -1043,12 +1063,17 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> > [...]
> > + if (update_root_tg)
> > + uclamp_update_root_tg();
> > +
> > /*
> > * Updating all the RUNNABLE task is expensive, keep it simple and do
> > * just a lazy update at each next enqueue time.
> Since uclamp_update_root_tg() traverses down to
> uclamp_update_active_tasks() is this comment half true now?
Right, this comment is now wrong. We update all RUNNABLE tasks on
system default changes. However, despite the above command it's
difficult to say how much expensive that operation can be.
It really depends on how many RUNNABLE tasks we have, the number of
CPUs and also how many tasks are not already clamped by a more
restrictive "effective" value. Thus, for the time being, we can
consider speculation the above statement and add in a simple change if
in the future that should be reported as a real issue to justify a
lazy update.
The upside is that with the current implementation we have a more
strict control on task. Even long running tasks can be clamped on
sysadmin demand without waiting for them to sleep.
Does that makes sense?
If it does, I'll drop the above comment in v13.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
^ permalink raw reply
* Re: [PATCH v12 1/6] sched/core: uclamp: Extend CPU's cgroup controller
From: Patrick Bellasi @ 2019-08-01 10:40 UTC (permalink / raw)
To: Michal Koutný
Cc: cgroups, linux-api, linux-kernel, linux-pm, Alessio Balsini,
Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
Joel Fernandes, Paul Turner, Steve Muckle, Suren Baghdasaryan,
Todd Kjos, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
Vincent Guittot, Viresh Kumar, Juri Lelli, Ingo Molnar
In-Reply-To: <20190725114104.GA32159@blackbody.suse.cz>
On 25-Jul 13:41, Michal Koutný wrote:
> On Thu, Jul 18, 2019 at 07:17:43PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > +static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > [...]
> > +static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > [...]
> These two functions are almost identical yet not trivial. I think it
> wouldn be better to have the code at one place only and distinguish by
> the passed clamp_id.
Good point, since the removal of the boundary checks on values we now
have two identical methods. I'll factor our the common code in a
single function.
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
^ permalink raw reply
* Re: [PATCHv5 28/37] x86/vdso: Enable static branches for the timens vdso
From: Thomas Gleixner @ 2019-08-01 6:48 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Dmitry Safonov, LKML, Dmitry Safonov, Andrei Vagin, Adrian Reber,
Andrei Vagin, Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Vincenzo Frascino, Linux Containers, criu, Linux API
In-Reply-To: <CALCETrXL6pzDoHfn9Niw_CxNX-_W3=yotDYuqK+kxPhOLofmNA@mail.gmail.com>
On Wed, 31 Jul 2019, Andy Lutomirski wrote:
> On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
> > As it has been discussed on timens RFC, adding a new conditional branch
> > `if (inside_time_ns)` on VDSO for all processes is undesirable.
> >
> > Addressing those problems, there are two versions of VDSO's .so:
> > for host tasks (without any penalty) and for processes inside of time
> > namespace with clk_to_ns() that subtracts offsets from host's time.
> >
> > The timens code in vdso looks like this:
> >
> > if (timens_static_branch_unlikely()) {
> > clk_to_ns(clk, ts);
> > }
>
> I'm confused. Now we effectively have *three* versions: the vDSO
> without timens, and vDSO with timens but with it switched off, and the
> vDSO with timens on. This seems like too much.
The problem is that if you have a single VDSO, then omce one process joins
a time namespace _ALL_ other processes get an extra conditional with at
least one extra cache line as a bonus.
This has been discussed at Plumbers last year and the agreement was to
create VDSO plain (no namespace support) and VDSO extra (namespace
support).
The performance hit of the conditional + one etra cache line for tasks
which do not use a time name space is measurable.
That also avoids the whole mess of dealing with the static branch being
flipped. The time name space property is determined on fork/exec _before_
anything can touch the VDSO and depending on it the approriate version of
the VDSO is loaded.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCHv5 25/37] x86/vdso: Switch image on setns()/clone()
From: hpa @ 2019-08-01 6:09 UTC (permalink / raw)
To: Andy Lutomirski, Dmitry Safonov
Cc: LKML, Dmitry Safonov, Adrian Reber, Andrei Vagin, Arnd Bergmann,
Christian Brauner, Cyrill Gorcunov, Eric W. Biederman,
Ingo Molnar, Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov,
Shuah Khan, Thomas Gleixner, Vincenzo Frascino, Linux Containers,
criu, Linux API, X86 ML, Andrei Vagin
In-Reply-To: <CALCETrUpOhTCQkhB3S73LBFAiTp07PwXP32Q6Bn0m2LTqiw9hA@mail.gmail.com>
On July 31, 2019 10:34:26 PM PDT, Andy Lutomirski <luto@kernel.org> wrote:
>On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
>>
>> As it has been discussed on timens RFC, adding a new conditional
>branch
>> `if (inside_time_ns)` on VDSO for all processes is undesirable.
>> It will add a penalty for everybody as branch predictor may
>mispredict
>> the jump. Also there are instruction cache lines wasted on cmp/jmp.
>
>
>>
>> +#ifdef CONFIG_TIME_NS
>> +int vdso_join_timens(struct task_struct *task)
>> +{
>> + struct mm_struct *mm = task->mm;
>> + struct vm_area_struct *vma;
>> +
>> + if (down_write_killable(&mm->mmap_sem))
>> + return -EINTR;
>> +
>> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
>> + unsigned long size = vma->vm_end - vma->vm_start;
>> +
>> + if (vma_is_special_mapping(vma, &vvar_mapping) ||
>> + vma_is_special_mapping(vma, &vdso_mapping))
>> + zap_page_range(vma, vma->vm_start, size);
>> + }
>
>This is, unfortunately, fundamentally buggy. If any thread is in the
>vDSO or has the vDSO on the stack (due to a signal, for example), this
>will crash it. I can think of three solutions:
>
>1. Say that you can't setns() if you have other mms and ignore the
>signal issue. Anything with green threads will disapprove. It's also
>rather gross.
>
>2. Make it so that you can flip the static branch safely. As in my
>other email, you'll need to deal with CoW somehow,
>
>3. Make it so that you can't change timens, or at least that you can't
>turn timens on or off, without execve() or fork().
>
>BTW, that static branch probably needs to be aligned to a cache line
>or something similar to avoid all the nastiness with trying to poke
>text that might be concurrently executing. This will be a mess.
Since we are talking about different physical addresses I believe we should be okay as long as they don't cross page boundaries, and even if they do it can be managed with proper page invalidation sequencing – it's not like the problems of having to deal with XMC on live pages like in the kernel.
Still, you really need each instruction sequence to be present, with the only difference being specific patch sites.
Any fundamental reason this can't be strictly data driven? Seems odd to me if it couldn't, but I might be missing something obvious.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
^ permalink raw reply
* Re: [PATCHv5 25/37] x86/vdso: Switch image on setns()/clone()
From: Andy Lutomirski @ 2019-08-01 5:34 UTC (permalink / raw)
To: Dmitry Safonov
Cc: LKML, Dmitry Safonov, Adrian Reber, Andrei Vagin, Andy Lutomirski,
Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, Linux Containers, criu,
Linux API
In-Reply-To: <20190729215758.28405-26-dima@arista.com>
On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
>
> As it has been discussed on timens RFC, adding a new conditional branch
> `if (inside_time_ns)` on VDSO for all processes is undesirable.
> It will add a penalty for everybody as branch predictor may mispredict
> the jump. Also there are instruction cache lines wasted on cmp/jmp.
>
> +#ifdef CONFIG_TIME_NS
> +int vdso_join_timens(struct task_struct *task)
> +{
> + struct mm_struct *mm = task->mm;
> + struct vm_area_struct *vma;
> +
> + if (down_write_killable(&mm->mmap_sem))
> + return -EINTR;
> +
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + unsigned long size = vma->vm_end - vma->vm_start;
> +
> + if (vma_is_special_mapping(vma, &vvar_mapping) ||
> + vma_is_special_mapping(vma, &vdso_mapping))
> + zap_page_range(vma, vma->vm_start, size);
> + }
This is, unfortunately, fundamentally buggy. If any thread is in the
vDSO or has the vDSO on the stack (due to a signal, for example), this
will crash it. I can think of three solutions:
1. Say that you can't setns() if you have other mms and ignore the
signal issue. Anything with green threads will disapprove. It's also
rather gross.
2. Make it so that you can flip the static branch safely. As in my
other email, you'll need to deal with CoW somehow,
3. Make it so that you can't change timens, or at least that you can't
turn timens on or off, without execve() or fork().
BTW, that static branch probably needs to be aligned to a cache line
or something similar to avoid all the nastiness with trying to poke
text that might be concurrently executing. This will be a mess.
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Theodore Y. Ts'o @ 2019-08-01 5:31 UTC (permalink / raw)
To: linux-fscrypt, linux-fsdevel, linux-ext4, linux-f2fs-devel,
linux-mtd, linux-api, linux-crypto, keyrings, Paul Crowley,
Satya Tangirala
In-Reply-To: <20190801011140.GB687@sol.localdomain>
On Wed, Jul 31, 2019 at 06:11:40PM -0700, Eric Biggers wrote:
>
> Well, it's either
>
> 1a. Remove the user's handle.
> OR
> 1b. Remove all users' handles. (FSCRYPT_REMOVE_KEY_FLAG_ALL_USERS)
>
> Then
>
> 2. If no handles remain, try to evict all inodes that use the key.
>
> By "purge all keys" do you mean step (2)? Note that it doesn't require root by
> itself; root is only required to remove other users' handles (1b).
No, I was talking about 1b. I'd argue that 1a and 1b should be
different ioctl. 1b requires root, and 1a doesn't.
And 1a should just mean, "I don't need to use the encrypted files any
more". In the PAM passphrase case, when you are just logging out, 1a
is what's needed, and success is just fine. pam_session won't *care*
whether or not there are other users keeping the key in use.
The problem with "fscrypt lock" is that the non-privileged user sort
of wants to do REMOVE_FLAG_KEY_FLAG_FOR_ALL_USERS, but they doesn't
have the privileges to do it, and they are hoping that removing their
own key removes it the key from the system. That to me seems to be
the fundamental disconnect. The "fscrypt unlock" and "fscrypt lock"
commands comes from the v1 key management, and requires root. It's
the translation to unprivileged mode where "fscrypt lock" seems to
have problems.
> > What about having "fscrypt lock" call FS_IOC_GET_ENCRYPTION_KEY_STATUS
> > and print a warning message saying, "we can't lock it because N other
> > users who have registered a key". I'd argue fscrypt should do this
> > regardless of whether or not FS_IOC_REMOVE_ENCRYPTION_KEY returns
> > EUSERS or not.
>
> Shouldn't "fscrypt lock" still remove the user's handle, as opposed to refuse to
> do anything, though? So it would still need to callh
> FS_IOC_REMOVE_ENCRYPTION_KEY, and could get the status from it rather than also
> needing to call FS_IOC_GET_ENCRYPTION_KEY_STATUS.
>
> Though, FS_IOC_GET_ENCRYPTION_KEY_STATUS would be needed if we wanted to show
> the specific count of other users.
So my perspective is that the ioctl's should have very clean
semantics, and errors should be consistent with how the Unix system
calls and error reporting.
If we need to make "fscrypt lock" and "fscrypt unlock" have semantics
that are more consistent with previous user interface choices, then
fscrypt can use FS_IOC_GET_ENCRYPTION_KEY_STATUS to print the warning
before it calls FS_IOC_REMOVE_ENCRYPTION_KEY --- with "fscrypt purge_keys"
calling something like FS_IOC_REMOVE_ALL_USER_ENCRYPTION_KEYS.
> > It seems to me that the EBUSY and EUSERS errors should be status bits
> > which gets returned to the user in a bitfield --- and if the key has
> > been removed, or the user's claim on the key's existence has been
> > removed, the ioctl returns success.
> >
> > That way we don't have to deal with the semantic disconnect where some
> > errors don't actually change system state, and other errors that *do*
> > change system state (as in, the key gets removed, or the user's claim
> > on the key gets removed), but still returns than error.
> >
>
> Do you mean use a positive return value, or do you mean add an output field to
> the struct passed to the ioctl?
I meant adding an output field. I see EBUSY and EUSERS as status bits
which *some* use cases might find useful. Other use cases, such as in
the pam_keys session logout code, we won't care at *all* about those
status reporting (or error codes). So if EBUSY and EUSERS are
returned as errors, then it adds to complexity of those programs
whichd don't care. (And even for those that do, it's still a bit more
complex since they has to distinguish between EBUSY, EUSERS, or other
errors --- in fact, *all* programs which use
FS_IOC_REMOVE_ENCRYPTION_KEY will *have* to check for EBUSY and
ESUSERS whether they care or not.)
> Either way note that it doesn't really need to be a bitfield, since you can't
> have both statuses at the same time. I.e. if there are still other users, we
> couldn't have even gotten to checking for in-use files.
That's actually an implementation detail, though, right? In theory,
we could check to see if there are any in-use files, independently of
whether there are any users or not.
- Ted
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCHv5 01/37] ns: Introduce Time Namespace
From: Andy Lutomirski @ 2019-08-01 5:29 UTC (permalink / raw)
To: Dmitry Safonov
Cc: LKML, Dmitry Safonov, Andrei Vagin, Adrian Reber, Andy Lutomirski,
Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, Linux Containers, criu,
Linux API
In-Reply-To: <20190729215758.28405-2-dima@arista.com>
On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
>
> From: Andrei Vagin <avagin@openvz.org>
>
> Time Namespace isolates clock values.
> +static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
> +{
> + struct time_namespace *ns = to_time_ns(new);
> +
> + if (!thread_group_empty(current))
> + return -EINVAL;
You also need to check for other users of the mm.
> +
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> + !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + get_time_ns(ns);
> + get_time_ns(ns);
> + put_time_ns(nsproxy->time_ns);
> + put_time_ns(nsproxy->time_ns_for_children);
> + nsproxy->time_ns = ns;
> + nsproxy->time_ns_for_children = ns;
> + ns->initialized = true;
I really really wish that setns() took an explicit flag for "change
now" or "change for children", since the semantics are different. Oh
well.
^ permalink raw reply
* Re: [PATCHv5 21/37] x86/vdso: Restrict splitting VVAR VMA
From: Andy Lutomirski @ 2019-08-01 5:23 UTC (permalink / raw)
To: Dmitry Safonov
Cc: LKML, Dmitry Safonov, Adrian Reber, Andrei Vagin, Andy Lutomirski,
Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, Linux Containers, criu,
Linux API
In-Reply-To: <20190729215758.28405-22-dima@arista.com>
On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
>
> Although, time namespace can work with VVAR VMA split, it seems worth
> to forbid splitting VVAR resulting in stricter ABI and reducing amount
> of corner-cases to consider while working further on VDSO.
>
> I don't think there is any use-case for partial mremap() of vvar,
> but if there is any - this patch can be easily reverted.
Seems reasonable to me.
^ permalink raw reply
* Re: [PATCHv5 23/37] x86/vdso: Add offsets page in vvar
From: Andy Lutomirski @ 2019-08-01 5:22 UTC (permalink / raw)
To: Dmitry Safonov
Cc: LKML, Dmitry Safonov, Andrei Vagin, Adrian Reber, Andy Lutomirski,
Arnd Bergmann, Christian Brauner, Cyrill Gorcunov,
Eric W. Biederman, H. Peter Anvin, Ingo Molnar, Jann Horn,
Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, Linux Containers, criu,
Linux API
In-Reply-To: <20190729215758.28405-24-dima@arista.com>
On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
>
> From: Andrei Vagin <avagin@openvz.org>
>
> As modern applications fetch time from VDSO without entering the kernel,
> it's needed to provide offsets for userspace code inside time namespace.
>
> A page for timens offsets is allocated on time namespace construction.
> Put that page into VVAR for tasks inside timens and zero page for
> host processes.
>
> As VDSO code is already optimized as much as possible in terms of speed,
> any new if-condition in VDSO code is undesirable; the goal is to provide
> two .so(s), as was originally suggested by Andy and Thomas:
> - for host tasks with optimized-out clk_to_ns() without any penalty
> - for processes inside timens with clk_to_ns()
> For this purpose, define clk_to_ns() under CONFIG_TIME_NS.
>
> To eliminate any performance regression, clk_to_ns() will be called
> under static_branch with follow-up patches, that adds support for
> patching vdso.
>
> VDSO mappings are platform-specific, add Kconfig dependency for arch.
>
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> Co-developed-by: Dmitry Safonov <dima@arista.com>
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
> arch/Kconfig | 5 +++
> arch/x86/Kconfig | 1 +
> arch/x86/entry/vdso/vdso-layout.lds.S | 9 ++++-
> arch/x86/entry/vdso/vdso2c.c | 3 ++
> arch/x86/entry/vdso/vma.c | 12 +++++++
> arch/x86/include/asm/vdso.h | 1 +
> init/Kconfig | 1 +
> lib/vdso/gettimeofday.c | 47 +++++++++++++++++++++++++++
> 8 files changed, 78 insertions(+), 1 deletion(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index a7b57dd42c26..e43d27f510ec 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -729,6 +729,11 @@ config HAVE_ARCH_NVRAM_OPS
> config ISA_BUS_API
> def_bool ISA
>
> +config ARCH_HAS_VDSO_TIME_NS
> + bool
> + help
> + VDSO can add time-ns offsets without entering kernel.
> +
> #
> # ABI hall of shame
> #
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 222855cc0158..91615938b470 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -81,6 +81,7 @@ config X86
> select ARCH_HAS_STRICT_MODULE_RWX
> select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> select ARCH_HAS_UBSAN_SANITIZE_ALL
> + select ARCH_HAS_VDSO_TIME_NS
> select ARCH_HAVE_NMI_SAFE_CMPXCHG
> select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
> select ARCH_MIGHT_HAVE_PC_PARPORT
> diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
> index 93c6dc7812d0..ba216527e59f 100644
> --- a/arch/x86/entry/vdso/vdso-layout.lds.S
> +++ b/arch/x86/entry/vdso/vdso-layout.lds.S
> @@ -7,6 +7,12 @@
> * This script controls its layout.
> */
>
> +#ifdef CONFIG_TIME_NS
> +# define TIMENS_SZ PAGE_SIZE
> +#else
> +# define TIMENS_SZ 0
> +#endif
> +
> SECTIONS
> {
> /*
> @@ -16,7 +22,7 @@ SECTIONS
> * segment.
> */
>
> - vvar_start = . - 3 * PAGE_SIZE;
> + vvar_start = . - (3 * PAGE_SIZE + TIMENS_SZ);
> vvar_page = vvar_start;
>
> /* Place all vvars at the offsets in asm/vvar.h. */
> @@ -28,6 +34,7 @@ SECTIONS
>
> pvclock_page = vvar_start + PAGE_SIZE;
> hvclock_page = vvar_start + 2 * PAGE_SIZE;
> + timens_page = vvar_start + 3 * PAGE_SIZE;
>
> . = SIZEOF_HEADERS;
>
> diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
> index ce67370d14e5..7380908045c7 100644
> --- a/arch/x86/entry/vdso/vdso2c.c
> +++ b/arch/x86/entry/vdso/vdso2c.c
> @@ -75,12 +75,14 @@ enum {
> sym_vvar_page,
> sym_pvclock_page,
> sym_hvclock_page,
> + sym_timens_page,
> };
>
> const int special_pages[] = {
> sym_vvar_page,
> sym_pvclock_page,
> sym_hvclock_page,
> + sym_timens_page,
> };
>
> struct vdso_sym {
> @@ -93,6 +95,7 @@ struct vdso_sym required_syms[] = {
> [sym_vvar_page] = {"vvar_page", true},
> [sym_pvclock_page] = {"pvclock_page", true},
> [sym_hvclock_page] = {"hvclock_page", true},
> + [sym_timens_page] = {"timens_page", true},
> {"VDSO32_NOTE_MASK", true},
> {"__kernel_vsyscall", true},
> {"__kernel_sigreturn", true},
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 2dc4f0b5481c..9bd66f84db5e 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -14,6 +14,7 @@
> #include <linux/elf.h>
> #include <linux/cpu.h>
> #include <linux/ptrace.h>
> +#include <linux/time_namespace.h>
> #include <asm/pvclock.h>
> #include <asm/vgtod.h>
> #include <asm/proto.h>
> @@ -23,6 +24,7 @@
> #include <asm/desc.h>
> #include <asm/cpufeature.h>
> #include <clocksource/hyperv_timer.h>
> +#include <asm/page.h>
>
> #if defined(CONFIG_X86_64)
> unsigned int __read_mostly vdso64_enabled = 1;
> @@ -135,6 +137,16 @@ static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
> if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK))
> return vmf_insert_pfn(vma, vmf->address,
> vmalloc_to_pfn(tsc_pg));
> + } else if (sym_offset == image->sym_timens_page) {
> + struct time_namespace *ns = current->nsproxy->time_ns;
What, if anything, guarantees that all tasks in the mm share the same timens?
--Andy
^ permalink raw reply
* Re: [PATCHv5 28/37] x86/vdso: Enable static branches for the timens vdso
From: Andy Lutomirski @ 2019-08-01 5:21 UTC (permalink / raw)
To: Dmitry Safonov
Cc: LKML, Dmitry Safonov, Andrei Vagin, Adrian Reber, Andrei Vagin,
Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, Linux Containers, criu
In-Reply-To: <20190729215758.28405-29-dima@arista.com>
On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov <dima@arista.com> wrote:
>
> From: Andrei Vagin <avagin@gmail.com>
>
> As it has been discussed on timens RFC, adding a new conditional branch
> `if (inside_time_ns)` on VDSO for all processes is undesirable.
>
> Addressing those problems, there are two versions of VDSO's .so:
> for host tasks (without any penalty) and for processes inside of time
> namespace with clk_to_ns() that subtracts offsets from host's time.
>
> The timens code in vdso looks like this:
>
> if (timens_static_branch_unlikely()) {
> clk_to_ns(clk, ts);
> }
I'm confused. Now we effectively have *three* versions: the vDSO
without timens, and vDSO with timens but with it switched off, and the
vDSO with timens on. This seems like too much.
What you need is, IMO, a static-branch-ish thing that is per mm. This
has a fundamental problem that the vDSO can be modified using
FOLL_FORCE. Perhaps any CoW of the vDSO should implicitly switch the
static branch on, which at least gives some degree of correctness even
if it's a bit surprising.
--Andy
^ permalink raw reply
* Re: [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Jaegeuk Kim @ 2019-08-01 4:05 UTC (permalink / raw)
To: Chao Yu
Cc: Nathan Chancellor, Daniel Rosenberg, Jonathan Corbet,
linux-f2fs-devel, linux-kernel, linux-doc, linux-fsdevel,
linux-api, kernel-team
In-Reply-To: <5d6c5da8-ad1e-26e2-0a3d-84949cd4e9aa@huawei.com>
On 08/01, Chao Yu wrote:
> Hi Nathan,
>
> Thanks for the report! :)
>
> On 2019/8/1 1:57, Nathan Chancellor wrote:
> > Hi all,
> >
> > <snip>
> >
> >> diff --git a/fs/f2fs/hash.c b/fs/f2fs/hash.c
> >> index cc82f142f811f..99e79934f5088 100644
> >> --- a/fs/f2fs/hash.c
> >> +++ b/fs/f2fs/hash.c
> >> @@ -14,6 +14,7 @@
> >> #include <linux/f2fs_fs.h>
> >> #include <linux/cryptohash.h>
> >> #include <linux/pagemap.h>
> >> +#include <linux/unicode.h>
> >>
> >> #include "f2fs.h"
> >>
> >> @@ -67,7 +68,7 @@ static void str2hashbuf(const unsigned char *msg, size_t len,
> >> *buf++ = pad;
> >> }
> >>
> >> -f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
> >> +static f2fs_hash_t __f2fs_dentry_hash(const struct qstr *name_info,
> >> struct fscrypt_name *fname)
> >> {
> >> __u32 hash;
> >> @@ -103,3 +104,35 @@ f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
> >> f2fs_hash = cpu_to_le32(hash & ~F2FS_HASH_COL_BIT);
> >> return f2fs_hash;
> >> }
> >> +
> >> +f2fs_hash_t f2fs_dentry_hash(const struct inode *dir,
> >> + const struct qstr *name_info, struct fscrypt_name *fname)
> >> +{
> >> +#ifdef CONFIG_UNICODE
> >> + struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
> >> + const struct unicode_map *um = sbi->s_encoding;
> >> + int r, dlen;
> >> + unsigned char *buff;
> >> + struct qstr *folded;
> >> +
> >> + if (name_info->len && IS_CASEFOLDED(dir)) {
> >> + buff = f2fs_kzalloc(sbi, sizeof(char) * PATH_MAX, GFP_KERNEL);
> >> + if (!buff)
> >> + return -ENOMEM;
> >> +
> >> + dlen = utf8_casefold(um, name_info, buff, PATH_MAX);
> >> + if (dlen < 0) {
> >> + kvfree(buff);
> >> + goto opaque_seq;
> >> + }
> >> + folded->name = buff;
> >> + folded->len = dlen;
> >> + r = __f2fs_dentry_hash(folded, fname);
> >> +
> >> + kvfree(buff);
> >> + return r;
> >> + }
> >> +opaque_seq:
> >> +#endif
> >> + return __f2fs_dentry_hash(name_info, fname);
> >> +}
> >
> > Clang now warns:
> >
> > fs/f2fs/hash.c:128:3: warning: variable 'folded' is uninitialized when used here [-Wuninitialized]
> > folded->name = buff;
> > ^~~~~~
> > fs/f2fs/hash.c:116:21: note: initialize the variable 'folded' to silence this warning
> > struct qstr *folded;
> > ^
> > = NULL
> > 1 warning generated.
> >
> > I assume that it wants to be initialized with f2fs_kzalloc as well but
> > I am not familiar with this code and what it expects to do.
> >
> > Please look into this when you get a chance!
>
> That should be a bug, it needs to define a struct qstr type variable rather than
> a pointer there.
>
> Jaegeuk, could you fix this in you branch?
Yeah, let me apply this.
--- a/fs/f2fs/hash.c
+++ b/fs/f2fs/hash.c
@@ -113,25 +113,27 @@ f2fs_hash_t f2fs_dentry_hash(const struct inode *dir,
const struct unicode_map *um = sbi->s_encoding;
int r, dlen;
unsigned char *buff;
- struct qstr *folded;
+ struct qstr folded;
- if (name_info->len && IS_CASEFOLDED(dir)) {
- buff = f2fs_kzalloc(sbi, sizeof(char) * PATH_MAX, GFP_KERNEL);
- if (!buff)
- return -ENOMEM;
+ if (!name_info->len || !IS_CASEFOLDED(dir))
+ goto opaque_seq;
- dlen = utf8_casefold(um, name_info, buff, PATH_MAX);
- if (dlen < 0) {
- kvfree(buff);
- goto opaque_seq;
- }
- folded->name = buff;
- folded->len = dlen;
- r = __f2fs_dentry_hash(folded, fname);
+ buff = f2fs_kzalloc(sbi, sizeof(char) * PATH_MAX, GFP_KERNEL);
+ if (!buff)
+ return -ENOMEM;
+ dlen = utf8_casefold(um, name_info, buff, PATH_MAX);
+ if (dlen < 0) {
kvfree(buff);
- return r;
+ goto opaque_seq;
}
+ folded.name = buff;
+ folded.len = dlen;
+ r = __f2fs_dentry_hash(&folded, fname);
+
+ kvfree(buff);
+ return r;
+
opaque_seq:
#endif
return __f2fs_dentry_hash(name_info, fname);
^ permalink raw reply
* Re: [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Eric Biggers @ 2019-08-01 1:11 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190731233843.GA2769@mit.edu>
On Wed, Jul 31, 2019 at 07:38:43PM -0400, Theodore Y. Ts'o wrote:
> On Wed, Jul 31, 2019 at 11:38:02AM -0700, Eric Biggers wrote:
> >
> > This is perhaps different from what users expect from unlink(). It's well known
> > that unlink() just deletes the filename, not the file itself if it's still open
> > or has other links. And unlink() by itself isn't meant for use cases where the
> > file absolutely must be securely erased. But FS_IOC_REMOVE_ENCRYPTION_KEY
> > really is meant primarily for that sort of thing.
>
> Seems to me that part of the confusion is FS_IOC_REMOVE_ENCRYPTION_KEY
> does two things. One is "remove the user's handle on the key". The
> other is "purge all keys" (which requires root). So it does two
> different things with one ioctl.
>
Well, it's either
1a. Remove the user's handle.
OR
1b. Remove all users' handles. (FSCRYPT_REMOVE_KEY_FLAG_ALL_USERS)
Then
2. If no handles remain, try to evict all inodes that use the key.
By "purge all keys" do you mean step (2)? Note that it doesn't require root by
itself; root is only required to remove other users' handles (1b).
It could be argued that (2) should be a separate ioctl, so we'd have UNLINK_KEY
then LOCK_KEY. But is there a real use case for this split? I.e. would anyone
ever want to UNLINK_KEY without also LOCK_KEY? Is that really something we
want/need to support? I'd really like the API to be as straightforward as
possible for the normal use case of locking a directory, and not require some
series of multiple ioctl's, which would be more difficult to use correctly.
> > To give a concrete example: my patch for the userspace tool
> > https://github.com/google/fscrypt adds a command 'fscrypt lock' which locks an
> > encrypted directory. If, say, someone runs 'fscrypt unlock' as uid 0 and then
> > 'fscrypt lock' as uid 1000, then FS_IOC_REMOVE_ENCRYPTION_KEY can't actually
> > remove the key. I need to make the tool show a proper error message in this
> > case. To do so, it would help to get a unique error code (e.g. EUSERS) from
> > FS_IOC_REMOVE_ENCRYPTION_KEY, rather than get the ambiguous error code ENOKEY
> > and have to call FS_IOC_GET_ENCRYPTION_KEY_STATUS to get the real status.
>
> What about having "fscrypt lock" call FS_IOC_GET_ENCRYPTION_KEY_STATUS
> and print a warning message saying, "we can't lock it because N other
> users who have registered a key". I'd argue fscrypt should do this
> regardless of whether or not FS_IOC_REMOVE_ENCRYPTION_KEY returns
> EUSERS or not.
Shouldn't "fscrypt lock" still remove the user's handle, as opposed to refuse to
do anything, though? So it would still need to call
FS_IOC_REMOVE_ENCRYPTION_KEY, and could get the status from it rather than also
needing to call FS_IOC_GET_ENCRYPTION_KEY_STATUS.
Though, FS_IOC_GET_ENCRYPTION_KEY_STATUS would be needed if we wanted to show
the specific count of other users.
>
> > Also, we already have the EBUSY case. This means that the ioctl removed the
> > master key secret itself; however, some files were still in-use, so the key
> > remains in the "incompletely removed" state. If we were actually going for
> > unlink() semantics, then for consistency this case really ought to return 0 and
> > unlink the key object, and people who care about in-use files would need to use
> > FS_IOC_GET_ENCRYPTION_KEY_STATUS. But most people *will* care about this, and
> > may even want to retry the ioctl later, which isn't something youh can do with
> > pure unlink() semantics.
>
> It seems to me that the EBUSY and EUSERS errors should be status bits
> which gets returned to the user in a bitfield --- and if the key has
> been removed, or the user's claim on the key's existence has been
> removed, the ioctl returns success.
>
> That way we don't have to deal with the semantic disconnect where some
> errors don't actually change system state, and other errors that *do*
> change system state (as in, the key gets removed, or the user's claim
> on the key gets removed), but still returns than error.
>
> We could also add a flag which indicates where if there are files that
> are still busy, or there are other users keeping a key in use, the
> ioctl fails hard and returns an error. At least that way we keep
> consistency where an error means, "nothing has changed".
>
> - Ted
Do you mean use a positive return value, or do you mean add an output field to
the struct passed to the ioctl?
The latter might be more error-prone, since it invites bugs where a directory
silently fails to be locked, because the second field was not checked.
Either way note that it doesn't really need to be a bitfield, since you can't
have both statuses at the same time. I.e. if there are still other users, we
couldn't have even gotten to checking for in-use files.
>
> P.S. BTW, one of the comments which I didn't make was the
> documentation didn't adequately explain which error codes means,
> "success but with a caveat", and which errors means, "we failed and
> didn't do anything". But since I was arguing for changing the
> behavior, I decided not to complain about the documentation.
>
Yes, in any case the FS_IOC_REMOVE_ENCRYPTION_KEY documentation needs
improvement. I have some updates pending for it.
- Eric
^ permalink raw reply
* Re: [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Chao Yu @ 2019-08-01 1:11 UTC (permalink / raw)
To: Nathan Chancellor, Daniel Rosenberg
Cc: Jaegeuk Kim, Jonathan Corbet, linux-f2fs-devel, linux-kernel,
linux-doc, linux-fsdevel, linux-api, kernel-team
In-Reply-To: <20190731175748.GA48637@archlinux-threadripper>
Hi Nathan,
Thanks for the report! :)
On 2019/8/1 1:57, Nathan Chancellor wrote:
> Hi all,
>
> <snip>
>
>> diff --git a/fs/f2fs/hash.c b/fs/f2fs/hash.c
>> index cc82f142f811f..99e79934f5088 100644
>> --- a/fs/f2fs/hash.c
>> +++ b/fs/f2fs/hash.c
>> @@ -14,6 +14,7 @@
>> #include <linux/f2fs_fs.h>
>> #include <linux/cryptohash.h>
>> #include <linux/pagemap.h>
>> +#include <linux/unicode.h>
>>
>> #include "f2fs.h"
>>
>> @@ -67,7 +68,7 @@ static void str2hashbuf(const unsigned char *msg, size_t len,
>> *buf++ = pad;
>> }
>>
>> -f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
>> +static f2fs_hash_t __f2fs_dentry_hash(const struct qstr *name_info,
>> struct fscrypt_name *fname)
>> {
>> __u32 hash;
>> @@ -103,3 +104,35 @@ f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
>> f2fs_hash = cpu_to_le32(hash & ~F2FS_HASH_COL_BIT);
>> return f2fs_hash;
>> }
>> +
>> +f2fs_hash_t f2fs_dentry_hash(const struct inode *dir,
>> + const struct qstr *name_info, struct fscrypt_name *fname)
>> +{
>> +#ifdef CONFIG_UNICODE
>> + struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
>> + const struct unicode_map *um = sbi->s_encoding;
>> + int r, dlen;
>> + unsigned char *buff;
>> + struct qstr *folded;
>> +
>> + if (name_info->len && IS_CASEFOLDED(dir)) {
>> + buff = f2fs_kzalloc(sbi, sizeof(char) * PATH_MAX, GFP_KERNEL);
>> + if (!buff)
>> + return -ENOMEM;
>> +
>> + dlen = utf8_casefold(um, name_info, buff, PATH_MAX);
>> + if (dlen < 0) {
>> + kvfree(buff);
>> + goto opaque_seq;
>> + }
>> + folded->name = buff;
>> + folded->len = dlen;
>> + r = __f2fs_dentry_hash(folded, fname);
>> +
>> + kvfree(buff);
>> + return r;
>> + }
>> +opaque_seq:
>> +#endif
>> + return __f2fs_dentry_hash(name_info, fname);
>> +}
>
> Clang now warns:
>
> fs/f2fs/hash.c:128:3: warning: variable 'folded' is uninitialized when used here [-Wuninitialized]
> folded->name = buff;
> ^~~~~~
> fs/f2fs/hash.c:116:21: note: initialize the variable 'folded' to silence this warning
> struct qstr *folded;
> ^
> = NULL
> 1 warning generated.
>
> I assume that it wants to be initialized with f2fs_kzalloc as well but
> I am not familiar with this code and what it expects to do.
>
> Please look into this when you get a chance!
That should be a bug, it needs to define a struct qstr type variable rather than
a pointer there.
Jaegeuk, could you fix this in you branch?
Thanks,
> Nathan
> .
>
^ permalink raw reply
* Re: [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Theodore Y. Ts'o @ 2019-07-31 23:38 UTC (permalink / raw)
To: linux-fscrypt, linux-fsdevel, linux-ext4, linux-f2fs-devel,
linux-mtd, linux-api, linux-crypto, keyrings, Paul Crowley,
Satya Tangirala
In-Reply-To: <20190731183802.GA687@sol.localdomain>
On Wed, Jul 31, 2019 at 11:38:02AM -0700, Eric Biggers wrote:
>
> This is perhaps different from what users expect from unlink(). It's well known
> that unlink() just deletes the filename, not the file itself if it's still open
> or has other links. And unlink() by itself isn't meant for use cases where the
> file absolutely must be securely erased. But FS_IOC_REMOVE_ENCRYPTION_KEY
> really is meant primarily for that sort of thing.
Seems to me that part of the confusion is FS_IOC_REMOVE_ENCRYPTION_KEY
does two things. One is "remove the user's handle on the key". The
other is "purge all keys" (which requires root). So it does two
different things with one ioctl.
> To give a concrete example: my patch for the userspace tool
> https://github.com/google/fscrypt adds a command 'fscrypt lock' which locks an
> encrypted directory. If, say, someone runs 'fscrypt unlock' as uid 0 and then
> 'fscrypt lock' as uid 1000, then FS_IOC_REMOVE_ENCRYPTION_KEY can't actually
> remove the key. I need to make the tool show a proper error message in this
> case. To do so, it would help to get a unique error code (e.g. EUSERS) from
> FS_IOC_REMOVE_ENCRYPTION_KEY, rather than get the ambiguous error code ENOKEY
> and have to call FS_IOC_GET_ENCRYPTION_KEY_STATUS to get the real status.
What about having "fscrypt lock" call FS_IOC_GET_ENCRYPTION_KEY_STATUS
and print a warning message saying, "we can't lock it because N other
users who have registered a key". I'd argue fscrypt should do this
regardless of whether or not FS_IOC_REMOVE_ENCRYPTION_KEY returns
EUSERS or not.
> Also, we already have the EBUSY case. This means that the ioctl removed the
> master key secret itself; however, some files were still in-use, so the key
> remains in the "incompletely removed" state. If we were actually going for
> unlink() semantics, then for consistency this case really ought to return 0 and
> unlink the key object, and people who care about in-use files would need to use
> FS_IOC_GET_ENCRYPTION_KEY_STATUS. But most people *will* care about this, and
> may even want to retry the ioctl later, which isn't something youh can do with
> pure unlink() semantics.
It seems to me that the EBUSY and EUSERS errors should be status bits
which gets returned to the user in a bitfield --- and if the key has
been removed, or the user's claim on the key's existence has been
removed, the ioctl returns success.
That way we don't have to deal with the semantic disconnect where some
errors don't actually change system state, and other errors that *do*
change system state (as in, the key gets removed, or the user's claim
on the key gets removed), but still returns than error.
We could also add a flag which indicates where if there are files that
are still busy, or there are other users keeping a key in use, the
ioctl fails hard and returns an error. At least that way we keep
consistency where an error means, "nothing has changed".
- Ted
P.S. BTW, one of the comments which I didn't make was the
documentation didn't adequately explain which error codes means,
"success but with a caveat", and which errors means, "we failed and
didn't do anything". But since I was arguing for changing the
behavior, I decided not to complain about the documentation.
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox