* Re: [PATCH v2 0/5] Add support for O_MAYEXEC
From: Mickaël Salaün @ 2019-09-09 9:09 UTC (permalink / raw)
To: Aleksa Sarai, Andy Lutomirski
Cc: Steve Grubb, Florian Weimer, Mickaël Salaün,
linux-kernel, Alexei Starovoitov, Al Viro, Andy Lutomirski,
Christian Heimes, Daniel Borkmann, Eric Chiang, James Morris,
Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook, Matthew Garrett,
Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <20190906224410.lffd6l5lnm4z3hht@yavin.dot.cyphar.com>
On 07/09/2019 00:44, Aleksa Sarai wrote:
> On 2019-09-06, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Sep 6, 2019, at 12:07 PM, Steve Grubb <sgrubb@redhat.com> wrote:
>>>
>>>> On Friday, September 6, 2019 2:57:00 PM EDT Florian Weimer wrote:
>>>> * Steve Grubb:
>>>>> Now with LD_AUDIT
>>>>> $ LD_AUDIT=/home/sgrubb/test/openflags/strip-flags.so.0 strace ./test
>>>>> 2>&1 | grep passwd openat(3, "passwd", O_RDONLY) = 4
>>>>>
>>>>> No O_CLOEXEC flag.
>>>>
>>>> I think you need to explain in detail why you consider this a problem.
Right, LD_PRELOAD and similar mechanisms are deliberately not part of
the threat model for O_MAYEXEC, because they must be addressed with
other security mechanisms (e.g. correct file system access control, IMA
policy, SELinux or other LSM security policies). This is a requirement
for O_MAYEXEC to be useful.
An interpreter is just a flexible, generic program which has no purpose
other than behaving according to external rules (i.e. scripts). If you
don't trust your interpreter, it should not be executable in the first
place. O_MAYEXEC makes it possible to restrict the use of (some)
interpreters according to a *global* system security policy.
>>>
>>> Because you can strip the O_MAYEXEC flag from being passed into the kernel.
>>> Once you do that, you defeat the security mechanism because it never gets
>>> invoked. The issue is that the only thing that knows _why_ something is being
>>> opened is user space. With this mechanism, you can attempt to pass this
>>> reason to the kernel so that it may see if policy permits this. But you can
>>> just remove the flag.
>>
>> I’m with Florian here. Once you are executing code in a process, you
>> could just emulate some other unapproved code. This series is not
>> intended to provide the kind of absolute protection you’re imagining.
>
> I also agree, though I think that there is a separate argument to be
> made that there are two possible problems with O_MAYEXEC (which might
> not be really big concerns):
>
> * It's very footgun-prone if you didn't call O_MAYEXEC yourself and
> you pass the descriptor elsewhere. You need to check f_flags to see
> if it contains O_MAYEXEC. Maybe there is an argument to be made that
> passing O_MAYEXECs around isn't a valid use-case, but in that case
> there should be some warnings about that.
That could be an issue if you don't trust your system, especially if the
mount points (and the "noexec" option) can be changed by untrusted
users. As I said above, basic security properties are required, such as
meaningful file system access control, and obviously not letting any
user change mount points (which can lead to much more severe security
issues anyway).
If a process A passes an FD to an interpreter B, then the interpreter B
must trust the process A. Moreover, being able to tell whether the FD
was opened with O_MAYEXEC, and relying on that, may create a false sense
of security. As I said in a previous email, being able to probe for
O_MAYEXEC does not make sense because it would not be enough to know the
system policy (whether this flag is enforced or not, for which mount
points, based on xattr, time…). The main goal of O_MAYEXEC is to ask the
kernel, over a trusted link (hence without LD_PRELOAD-like
interference), for a file which is allowed to be interpreted/executed by
this interpreter.
To be able to correctly handle the case you pointed out (FD passing),
either an existing or a new LSM should handle this behavior according to
the origin of the FD and the chain of processes getting it.
Some advanced LSM rules could tie interpreters with scripts dedicated to
them, and have different behavior for the same scripts but with
different interpreters.
>
> * There's effectively a TOCTOU flaw (even if you are sure O_MAYEXEC is
> in f_flags) -- if the filesystem becomes re-mounted noexec (or the
> file has a-x permissions) after you've done the check you won't get
> hit with an error when you go to use the file descriptor later.
Again, the threat model needs to be appropriate to make O_MAYEXEC
useful. The security policies of the system need to be seen as a whole,
and updated as such.
As with most file system access control on Linux, a TOCTOU may be
possible, but the whole system should be designed to protect against
that. For example, changing file access control (e.g. mount point
options) without a reboot may lead to inconsistent security properties,
which is why such changes are discouraged by some access control systems
(e.g. SELinux).
>
> To fix both you'd need to do what you mention later:
>
>> What the kernel *could* do is prevent mmapping a non-FMODE_EXEC file
>> with PROT_EXEC, which would indeed have a real effect (in an iOS-like
>> world, for example) but would break many, many things.
>
> And I think this would be useful (with the two possible ways of
> executing .text split into FMODE_EXEC and FMODE_MAP_EXEC, as mentioned
> in a sister subthread), but would have to be opt-in for the obvious
> reason you outlined. However, we could make it the default for
> openat2(2) -- assuming we can agree on what the semantics of a
> theoretical FMODE_EXEC should be.
>
> And of course we'd need to do FMODE_UPGRADE_EXEC (which would need to
> also permit fexecve(2) though probably not PROT_EXEC -- I don't think
> you can mmap() an O_PATH descriptor).
The mmapping restriction may be interesting but it is a different use
case. This series addresses the interpreter/script problem. Whether the
script may be mapped executable is the interpreter's choice. In most
cases, no scripts are mapped as such, precisely because they are
interpreted by a process and not by the CPU.
--
Mickaël Salaün
Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-09 9:18 UTC (permalink / raw)
To: Andy Lutomirski, Jeff Layton
Cc: Florian Weimer, Mickaël Salaün, linux-kernel,
Aleksa Sarai, Alexei Starovoitov, Al Viro, Andy Lutomirski,
Christian Heimes, Daniel Borkmann, Eric Chiang, James Morris,
Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook, Matthew Garrett,
Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <D1212E06-773B-42B9-B7C3-C4C1C2A6111D@amacapital.net>
On 06/09/2019 20:41, Andy Lutomirski wrote:
>
>
>> On Sep 6, 2019, at 11:38 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>
>>> On Fri, 2019-09-06 at 19:14 +0200, Mickaël Salaün wrote:
>>>> On 06/09/2019 18:48, Jeff Layton wrote:
>>>>> On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
>>>>>> On 06/09/2019 17:56, Florian Weimer wrote:
>>>>>> Let's assume I want to add support for this to the glibc dynamic loader,
>>>>>> while still being able to run on older kernels.
>>>>>>
>>>>>> Is it safe to try the open call first, with O_MAYEXEC, and if that fails
>>>>>> with EINVAL, try again without O_MAYEXEC?
>>>>>
>>>>> The kernel ignore unknown open(2) flags, so yes, it is safe even for
>>>>> older kernel to use O_MAYEXEC.
>>>>>
>>>>
>>>> Well...maybe. What about existing programs that are sending down bogus
>>>> open flags? Once you turn this on, they may break...or provide a way to
>>>> circumvent the protections this gives.
>>>
>>> Well, I don't think we should nor could care about bogus programs that
>>> do not conform to the Linux ABI.
>>>
>>
>> But they do conform. The ABI is just undefined here. Unknown flags are
>> ignored so we never really know if $random_program may be setting them.
>>
>>>> Maybe this should be a new flag that is only usable in the new openat2()
>>>> syscall that's still under discussion? That syscall will enforce that
>>>> all flags are recognized. You presumably wouldn't need the sysctl if you
>>>> went that route too.
>>>
>>> Here is a thread about a new syscall:
>>> https://lore.kernel.org/lkml/1544699060.6703.11.camel@linux.ibm.com/
>>>
>>> I don't think it fit well with auditing nor integrity. Moreover using
>>> the current open(2) behavior of ignoring unknown flags fit well with the
>>> usage of O_MAYEXEC (because it is only a hint to the kernel about the
>>> use of the *opened* file).
>>>
>>
>> The fact that open and openat didn't vet unknown flags is really a bug.
>>
>> Too late to fix it now, of course, and as Aleksa points out, we've
>> worked around that in the past. Now though, we have a new openat2
>> syscall on the horizon. There's little need to continue these sorts of
>> hacks.
>>
>> New open flags really have no place in the old syscalls, IMO.
>>
>>>> Anyone that wants to use this will have to recompile anyway. If the
>>>> kernel doesn't support openat2 or if the flag is rejected then you know
>>>> that you have no O_MAYEXEC support and can decide what to do.
>>>
>>> If we want to enforce a security policy, we need to either be the system
>>> administrator or the distro developer. If a distro ship interpreters
>>> using this flag, we don't need to recompile anything, but we need to be
>>> able to control the enforcement according to the mount point
>>> configuration (or an advanced MAC, or an IMA config). I don't see why an
>>> userspace process should check if this flag is supported or not, it
>>> should simply use it, and the sysadmin will enable an enforcement if it
>>> makes sense for the whole system.
>>>
>>
>> A userland program may need to do other risk mitigation if it sets
>> O_MAYEXEC and the kernel doesn't recognize it.
>>
>> Personally, here's what I'd suggest:
>>
>> - Base this on top of the openat2 set
>> - Change it that so that openat2() files are non-executable by default. Anyone wanting to do that needs to set O_MAYEXEC or upgrade the fd somehow.
>> - Only have the openat2 syscall pay attention to O_MAYEXEC. Let open and openat continue ignoring the new flag.
>>
>> That works around a whole pile of potential ABI headaches. Note that
>> we'd need to make that decision before the openat2 patches are merged.
>>
>> Even better would be to declare the new flag in some openat2-only flag
>> space, so there's no confusion about it being supported by legacy open
>> calls.
>>
>> If glibc wants to implement an open -> openat2 wrapper in userland
>> later, it can set that flag in the wrapper implicitly to emulate the old
>> behavior.
>>
>> Given that you're going to have to recompile software to take advantage
>> of this anyway, what's the benefit to changing legacy syscalls?
>>
>>>>>> Or do I risk disabling this security feature if I do that?
>>>>>
>>>>> It is only a security feature if the kernel support it, otherwise it is
>>>>> a no-op.
>>>>>
>>>>
>>>> With a security feature, I think we really want userland to aware of
>>>> whether it works.
>>>
>>> If userland would like to enforce something, it can already do it
>>> without any kernel modification. The goal of the O_MAYEXEC flag is to
>>> enable the kernel, hence sysadmins or system designers, to enforce a
>>> global security policy that makes sense.
>>>
>>
>> I don't see how this helps anything if you can't tell whether the kernel
>> recognizes the damned thing. Also, our track record with global sysctl
>> switches like this is pretty poor. They're an administrative headache as
>> well as a potential attack vector.
>
> I tend to agree. The sysctl seems like it’s asking for trouble. I can see an ld.so.conf option to turn this thing off making sense.
The sysctl is required to enable the adoption of this flag without
breaking existing systems. Current systems may have "noexec" on mount
points containing scripts. Unless the sysadmin is given the ability to
control that behavior, updating to a newer version of an interpreter
that uses O_MAYEXEC may break such systems.
How would you do this with ld.so.conf?
--
Mickaël Salaün
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-09 9:25 UTC (permalink / raw)
To: James Morris, Jeff Layton
Cc: Florian Weimer, Mickaël Salaün, linux-kernel,
Aleksa Sarai, Alexei Starovoitov, Al Viro, Andy Lutomirski,
Christian Heimes, Daniel Borkmann, Eric Chiang, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Matthew Garrett,
Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell
In-Reply-To: <alpine.LRH.2.21.1909061202070.18660@namei.org>
On 06/09/2019 21:03, James Morris wrote:
> On Fri, 6 Sep 2019, Jeff Layton wrote:
>
>> The fact that open and openat didn't vet unknown flags is really a bug.
>>
>> Too late to fix it now, of course, and as Aleksa points out, we've
>> worked around that in the past. Now though, we have a new openat2
>> syscall on the horizon. There's little need to continue these sorts of
>> hacks.
>>
>> New open flags really have no place in the old syscalls, IMO.
>
> Agree here. It's unfortunate but a reality and Linus will reject any such
> changes which break existing userspace.
Do you mean that adding new flags to open(2) is not possible?
Does it mean that unspecified behaviors are definitely part of the
Linux specification and can't be fixed?
As I said, O_MAYEXEC should be ignored if it is not supported by the
kernel, which fits perfectly with the current open(2) flags behavior,
and it should behave the same way with openat2(2).
--
Mickaël Salaün
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-09 9:33 UTC (permalink / raw)
To: Andy Lutomirski, Jeff Layton
Cc: Aleksa Sarai, Florian Weimer, Mickaël Salaün,
linux-kernel, Alexei Starovoitov, Al Viro, Andy Lutomirski,
Christian Heimes, Daniel Borkmann, Eric Chiang, James Morris,
Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook, Matthew Garrett,
Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <D2A57C7B-B0FD-424E-9F81-B858FFF21FF0@amacapital.net>
On 06/09/2019 22:06, Andy Lutomirski wrote:
>
>
>> On Sep 6, 2019, at 12:43 PM, Jeff Layton <jlayton@kernel.org> wrote:
>>
>>> On Sat, 2019-09-07 at 03:13 +1000, Aleksa Sarai wrote:
>>>> On 2019-09-06, Jeff Layton <jlayton@kernel.org> wrote:
>>>>> On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
>>>>>> On 06/09/2019 17:56, Florian Weimer wrote:
>>>>>> Let's assume I want to add support for this to the glibc dynamic loader,
>>>>>> while still being able to run on older kernels.
>>>>>>
>>>>>> Is it safe to try the open call first, with O_MAYEXEC, and if that fails
>>>>>> with EINVAL, try again without O_MAYEXEC?
>>>>>
>>>>> The kernel ignore unknown open(2) flags, so yes, it is safe even for
>>>>> older kernel to use O_MAYEXEC.
>>>>>
>>>>
>>>> Well...maybe. What about existing programs that are sending down bogus
>>>> open flags? Once you turn this on, they may break...or provide a way to
>>>> circumvent the protections this gives.
>>>
>>> It should be noted that this has been a valid concern for every new O_*
>>> flag introduced (and yet we still introduced new flags, despite the
>>> concern) -- though to be fair, O_TMPFILE actually does have a
>>> work-around with the O_DIRECTORY mask setup.
>>>
>>> The openat2() set adds O_EMPTYPATH -- though in fairness it's also
>>> backwards compatible because empty path strings have always given ENOENT
>>> (or EINVAL?) while O_EMPTYPATH is a no-op non-empty strings.
>>>
>>>> Maybe this should be a new flag that is only usable in the new openat2()
>>>> syscall that's still under discussion? That syscall will enforce that
>>>> all flags are recognized. You presumably wouldn't need the sysctl if you
>>>> went that route too.
>>>
>>> I'm also interested in whether we could add an UPGRADE_NOEXEC flag to
>>> how->upgrade_mask for the openat2(2) patchset (I reserved a flag bit for
>>> it, since I'd heard about this work through the grape-vine).
>>>
>>
>> I rather like the idea of having openat2 fds be non-executable by
>> default, and having userland request it specifically via O_MAYEXEC (or
>> some similar openat2 flag) if it's needed. Then you could add an
>> UPGRADE_EXEC flag instead?
>>
>> That seems like something reasonable to do with a brand new API, and
>> might be very helpful for preventing certain classes of attacks.
>>
>>
>
> There are at least four concepts of executability here:
>
> - Just check the file mode and any other relevant permissions. Return a normal fd. Makes sense for script interpreters, perhaps.
This is the purpose of this patch series. It doesn't make sense to add
memory restrictions or to constrain fexecve and the like.
>
> - Make the fd fexecve-able.
>
> - Make the resulting fd mappable PROT_EXEC.
>
> - Make the resulting fd upgradable.
>
> I’m not at all convinced that the kernel needs to distinguish all these, but at least upgradability should be its own thing IMO.
>
--
Mickaël Salaün
* Re: [PATCH 0/7] Rework random blocking
From: Pavel Machek @ 2019-09-09 9:42 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Theodore Tso, LKML, Linux API, Kees Cook, Jason A. Donenfeld
In-Reply-To: <cover.1567126741.git.luto@kernel.org>
On Thu 2019-08-29 18:11:35, Andy Lutomirski wrote:
> This makes two major semantic changes to Linux's random APIs:
>
> It adds getentropy(..., GRND_INSECURE). This causes getentropy to
> always return *something*. There is no guarantee whatsoever that
> the result will be cryptographically random or even unique, but the
> kernel will give the best quality random output it can. The name is
> a big hint: the resulting output is INSECURE.
>
> The purpose of this is to allow programs that genuinely want
> best-effort entropy to get it without resorting to /dev/urandom.
> Plenty of programs do this because they need to do *something*
> during boot and they can't afford to wait. Calling it "INSECURE" is
> probably the best we can do to discourage using this API for things
> that need security.
>
> This series also removes the blocking pool and makes /dev/random
> work just like getentropy(..., 0) and makes GRND_RANDOM a no-op. I
> believe that Linux's blocking pool has outlived its usefulness.
> Linux's CRNG generates output that is good enough to use even for
> key generation. The blocking pool is not stronger in any material
> way, and keeping it around requires a lot of infrastructure of
> dubious value.
Could you give some more justification? If crng is good enough for
you, you can use /dev/urandom...
> This series should not break any existing programs. /dev/urandom is
> unchanged. /dev/random will still block just after booting, but it
> will block less than it used to. getentropy() with existing flags
> will return output that is, for practical purposes, just as strong
> as before.
So what are the exact semantics of /dev/random after your change?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: James Morris @ 2019-09-09 10:12 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Jeff Layton, Florian Weimer, Mickaël Salaün,
linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook, Matthew Garrett,
Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <49e98ece-e85f-3006-159b-2e04ba67019e@ssi.gouv.fr>
On Mon, 9 Sep 2019, Mickaël Salaün wrote:
>
> On 06/09/2019 21:03, James Morris wrote:
> > On Fri, 6 Sep 2019, Jeff Layton wrote:
> >
> >> The fact that open and openat didn't vet unknown flags is really a bug.
> >>
> >> Too late to fix it now, of course, and as Aleksa points out, we've
> >> worked around that in the past. Now though, we have a new openat2
> >> syscall on the horizon. There's little need to continue these sorts of
> >> hacks.
> >>
> >> New open flags really have no place in the old syscalls, IMO.
> >
> > Agree here. It's unfortunate but a reality and Linus will reject any such
> > changes which break existing userspace.
>
> Do you mean that adding new flags to open(2) is not possible?
>
> Does it means that unspecified behaviors are definitely part of the
> Linux specification and can't be fixed?
This is my understanding.
>
> As I said, O_MAYEXEC should be ignored if it is not supported by the
> kernel, which perfectly fit with the current open(2) flags behavior, and
> should also behave the same with openat2(2).
The problem here is programs which are already using the value of
O_MAYEXEC, which will break. Hence, openat2(2).
--
James Morris
<jmorris@namei.org>
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-09 10:54 UTC (permalink / raw)
To: James Morris
Cc: Jeff Layton, Florian Weimer, Mickaël Salaün,
linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook, Matthew Garrett,
Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <alpine.LRH.2.21.1909090309260.27895@namei.org>
On 09/09/2019 12:12, James Morris wrote:
> On Mon, 9 Sep 2019, Mickaël Salaün wrote:
>
>>
>> On 06/09/2019 21:03, James Morris wrote:
>>> On Fri, 6 Sep 2019, Jeff Layton wrote:
>>>
>>>> The fact that open and openat didn't vet unknown flags is really a bug.
>>>>
>>>> Too late to fix it now, of course, and as Aleksa points out, we've
>>>> worked around that in the past. Now though, we have a new openat2
>>>> syscall on the horizon. There's little need to continue these sorts of
>>>> hacks.
>>>>
>>>> New open flags really have no place in the old syscalls, IMO.
>>>
>>> Agree here. It's unfortunate but a reality and Linus will reject any such
>>> changes which break existing userspace.
>>
>> Do you mean that adding new flags to open(2) is not possible?
>>
>> Does it means that unspecified behaviors are definitely part of the
>> Linux specification and can't be fixed?
>
> This is my understanding.
>
>>
>> As I said, O_MAYEXEC should be ignored if it is not supported by the
>> kernel, which perfectly fit with the current open(2) flags behavior, and
>> should also behave the same with openat2(2).
>
> The problem here is programs which are already using the value of
> O_MAYEXEC, which will break. Hence, openat2(2).
Well, it still depends on the sysctl, which doesn't enforce anything by
default and hence doesn't break existing behavior, and these unused
flags could be fixed, removed or reported by sysadmins or distro
developers.
--
Mickaël Salaün
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Aleksa Sarai @ 2019-09-09 11:54 UTC (permalink / raw)
To: Mickaël Salaün
Cc: James Morris, Jeff Layton, Florian Weimer,
Mickaël Salaün, linux-kernel, Alexei Starovoitov,
Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
Eric Chiang, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <49e98ece-e85f-3006-159b-2e04ba67019e@ssi.gouv.fr>
On 2019-09-09, Mickaël Salaün <mickael.salaun@ssi.gouv.fr> wrote:
> On 06/09/2019 21:03, James Morris wrote:
> > On Fri, 6 Sep 2019, Jeff Layton wrote:
> >
> >> The fact that open and openat didn't vet unknown flags is really a bug.
> >>
> >> Too late to fix it now, of course, and as Aleksa points out, we've
> >> worked around that in the past. Now though, we have a new openat2
> >> syscall on the horizon. There's little need to continue these sorts of
> >> hacks.
> >>
> >> New open flags really have no place in the old syscalls, IMO.
> >
> > Agree here. It's unfortunate but a reality and Linus will reject any such
> > changes which break existing userspace.
>
> Do you mean that adding new flags to open(2) is not possible?
It is possible, as long as there is no case where a program that works
today (and passes garbage to the unused bits in flags) breaks with the
change.
O_TMPFILE was okay because it's actually two flags (one is O_DIRECTORY)
and no working program does file IO to a directory (there are also some
other tricky things done there, I'll admit I don't fully understand it).
O_EMPTYPATH works because it's a no-op with non-empty path strings, and
empty path strings have always given an error (so no working program
does it today).
However, with O_MAYEXEC, programs that pass garbage bits and worked
previously could now get -EACCES.
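The O_TMPFILE point above can be checked directly from the userspace headers. This is a minimal sketch, assuming Linux with glibc, where O_TMPFILE is only exposed when _GNU_SOURCE is defined before the first include; flag_contains() and report_tmpfile_bits() are illustrative helper names, not part of any existing API.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* assumption: glibc only exposes O_TMPFILE with this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Returns 1 if every bit of `part` is also set in `whole`, else 0. */
static int flag_contains(int whole, int part)
{
    return (whole & part) == part;
}

/*
 * Prints both flag values in octal and returns 1 when O_TMPFILE carries
 * the O_DIRECTORY bits, which is what lets an old kernel reject the
 * combination instead of silently ignoring the new bit.
 */
static int report_tmpfile_bits(void)
{
    int included = flag_contains(O_TMPFILE, O_DIRECTORY);

    printf("O_DIRECTORY = %#o\n", (unsigned int)O_DIRECTORY);
    printf("O_TMPFILE   = %#o\n", (unsigned int)O_TMPFILE);
    printf("O_TMPFILE includes O_DIRECTORY: %s\n", included ? "yes" : "no");
    return included;
}
```

Calling report_tmpfile_bits() from a small main() shows the two values side by side. O_MAYEXEC as proposed has no comparable companion bit, which is the asymmetry being pointed out here.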
> As I said, O_MAYEXEC should be ignored if it is not supported by the
> kernel, which perfectly fit with the current open(2) flags behavior, and
> should also behave the same with openat2(2).
NACK on having that behaviour with openat2(2). Returning -EINVAL on
unknown flags is how all other syscalls work (any new syscall proposed
today that didn't do that would be rightly rejected); ignoring them is a
quirk of open(2) which unfortunately cannot be fixed. The fact that
*every new O_ flag needs to work around this problem* should be an
indication that this interface mis-design should not be allowed to
infect any more syscalls.
Note that this point is regardless of the fact that O_MAYEXEC is a
*security* flag -- if userspace wants to have a secure fallback on
old kernels (which is "the right thing" to do) they would have to do
more work than necessary. And programs that don't care don't have to do
anything special.
However with -EINVAL, the programs doing "the right thing" get an easy
-EINVAL check. And programs that don't care can just un-set O_MAYEXEC
and retry. You should be forced to deal with the case where a flag is
not supported -- and this is doubly true of security flags!
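The two caller behaviours described above can be sketched as follows. This is only an illustration under stated assumptions: DEMO_O_MAYEXEC and DEMO_KNOWN_FLAGS are made-up values (O_MAYEXEC is not defined in any released header), the two strict_*_open() functions merely simulate a syscall that rejects unknown flags with EINVAL, and open_mayexec() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bit standing in for O_MAYEXEC; not a real kernel value. */
#define DEMO_O_MAYEXEC   0x40000000
/* Flags the simulated "old" kernel understands (made up for the demo). */
#define DEMO_KNOWN_FLAGS 0x3

/* An open-like call: returns an fd, or -1 with errno set. */
typedef int (*open_like_fn)(const char *path, int flags);

/* Simulated old kernel: any unknown flag bit yields EINVAL. */
static int strict_old_kernel_open(const char *path, int flags)
{
    (void)path;
    if (flags & ~DEMO_KNOWN_FLAGS) {
        errno = EINVAL;
        return -1;
    }
    return 3; /* pretend file descriptor */
}

/* Simulated new kernel: additionally accepts DEMO_O_MAYEXEC. */
static int strict_new_kernel_open(const char *path, int flags)
{
    (void)path;
    if (flags & ~(DEMO_KNOWN_FLAGS | DEMO_O_MAYEXEC)) {
        errno = EINVAL;
        return -1;
    }
    return 3; /* pretend file descriptor */
}

enum mayexec_policy { MAYEXEC_REQUIRED, MAYEXEC_BEST_EFFORT };

/*
 * Try the open with the flag set. On EINVAL, a REQUIRED caller fails
 * (errno stays EINVAL) while a BEST_EFFORT caller retries without the
 * flag. *checked is set to 1 only if the flag was accepted.
 */
static int open_mayexec(open_like_fn do_open, const char *path, int flags,
                        enum mayexec_policy policy, int *checked)
{
    int fd = do_open(path, flags | DEMO_O_MAYEXEC);

    if (fd >= 0) {
        *checked = 1;
        return fd;
    }
    *checked = 0;
    if (errno != EINVAL || policy == MAYEXEC_REQUIRED)
        return -1;
    return do_open(path, flags);
}
```

Either way the caller has to state its policy explicitly, and the single EINVAL check is all that separates the two paths.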
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Aleksa Sarai @ 2019-09-09 12:28 UTC (permalink / raw)
To: Mickaël Salaün
Cc: James Morris, Jeff Layton, Florian Weimer,
Mickaël Salaün, linux-kernel, Alexei Starovoitov,
Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
Eric Chiang, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <073cb831-7c6b-1882-9b7d-eb810a2ef955@ssi.gouv.fr>
On 2019-09-09, Mickaël Salaün <mickael.salaun@ssi.gouv.fr> wrote:
> On 09/09/2019 12:12, James Morris wrote:
> > On Mon, 9 Sep 2019, Mickaël Salaün wrote:
> >> As I said, O_MAYEXEC should be ignored if it is not supported by the
> >> kernel, which perfectly fit with the current open(2) flags behavior, and
> >> should also behave the same with openat2(2).
> >
> > The problem here is programs which are already using the value of
> > O_MAYEXEC, which will break. Hence, openat2(2).
>
> Well, it still depends on the sysctl, which doesn't enforce anything by
> default, hence doesn't break existing behavior, and this unused flags
> could be fixed/removed or reported by sysadmins or distro developers.
Okay, but then this means that new programs which really want to enforce
O_MAYEXEC (and know that they really do want this feature) won't be able
to unless an admin has set the relevant sysctl. Not to mention that the
old-kernel fallback will not cover the "it's disabled by the sysctl"
case -- so the fallback handling would need to be:
    int fd = open("foo", O_MAYEXEC|O_RDONLY);
    if (!(fcntl(fd, F_GETFL) & O_MAYEXEC))
        fallback();
    if (!sysctl_feature_is_enabled)
        fallback();
However, there is still a race here -- if an administrator enables
O_MAYEXEC after the program gets the fd, then you still won't hit the
fallback (and you can't tell that O_MAYEXEC checks weren't done).
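As a sketch of the first half of that check, the helper below asks the kernel which status flags it kept on an fd. O_MAYEXEC is not defined in any released header, so the helper takes the flag as a parameter and can be tried with an existing flag such as O_NONBLOCK; the helper name is made up for this example. It does nothing about the race described above.

```c
#include <fcntl.h>

/*
 * Report whether the kernel kept `flag` in the file status flags of `fd`.
 * Returns 1 if the flag is set, 0 if it was dropped or never set, and
 * -1 if fcntl() itself failed (errno is left as set by fcntl()).
 */
static int fd_kept_flag(int fd, int flag)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl < 0)
        return -1;
    return (fl & flag) == flag;
}
```

For example, after open("/dev/null", O_RDONLY | O_NONBLOCK), fd_kept_flag(fd, O_NONBLOCK) reports that the flag was kept, while a flag that was never passed reports 0.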
You could fix the issue with the sysctl by clearing O_MAYEXEC from
f_flags if the sysctl is disabled. You could also avoid some of the
problems with it being a global setting by making it a prctl(2) which
processes can opt-in to (though this has its own major problems).
Sorry, but I'm just really not a fan of this.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-09 12:28 UTC (permalink / raw)
To: Aleksa Sarai
Cc: James Morris, Jeff Layton, Florian Weimer,
Mickaël Salaün, linux-kernel, Alexei Starovoitov,
Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
Eric Chiang, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <20190909115437.jwpyslcdhhvzo7g5@yavin>
On 09/09/2019 13:54, Aleksa Sarai wrote:
> On 2019-09-09, Mickaël Salaün <mickael.salaun@ssi.gouv.fr> wrote:
>> On 06/09/2019 21:03, James Morris wrote:
>>> On Fri, 6 Sep 2019, Jeff Layton wrote:
>>>
>>>> The fact that open and openat didn't vet unknown flags is really a bug.
>>>>
>>>> Too late to fix it now, of course, and as Aleksa points out, we've
>>>> worked around that in the past. Now though, we have a new openat2
>>>> syscall on the horizon. There's little need to continue these sorts of
>>>> hacks.
>>>>
>>>> New open flags really have no place in the old syscalls, IMO.
>>>
>>> Agree here. It's unfortunate but a reality and Linus will reject any such
>>> changes which break existing userspace.
>>
>> Do you mean that adding new flags to open(2) is not possible?
>
> It is possible, as long as there is no case where a program that works
> today (and passes garbage to the unused bits in flags) breaks with the
> change.
>
> O_TMPFILE was okay because it's actually two flags (one is O_DIRECTORY)
> and no working program does file IO to a directory (there are also some
> other tricky things done there, I'll admit I don't fully understand it).
>
> O_EMPTYPATH works because it's a no-op with non-empty path strings, and
> empty path strings have always given an error (so no working program
> does it today).
>
> However, O_MAYEXEC will result in programs that pass garbage bits to
> potentially get -EACCES that worked previously.
>
>> As I said, O_MAYEXEC should be ignored if it is not supported by the
>> kernel, which perfectly fit with the current open(2) flags behavior, and
>> should also behave the same with openat2(2).
>
> NACK on having that behaviour with openat2(2). -EINVAL on unknown flags
> is how all other syscalls work (any new syscall proposed today that
> didn't do that would be rightly rejected), and is a quirk of open(2)
> which unfortunately cannot be fixed. The fact that *every new O_ flag
> needs to work around this problem* should be an indication that this
> interface mis-design should not be allowed to infect any more syscalls.
It's definitely OK (and a sane interface) to always return -EINVAL for
unknown flags with openat2(2) (and other new syscalls). With openat2(2),
userland needs to handle the case where some flags may be unknown to the
kernel (and the case where the syscall itself is unknown). So there is
no issue with openat2(2).
However, *userland* should not try to infer possible security
restrictions from the O_MAYEXEC flag (hence my use of "ignore" above):
an open with this flag may or may not return -EACCES, according to the
security policy of the running system.
Following this reasoning, the current behavior of open(2) is fine for
O_MAYEXEC. The strict flag handling of openat2(2) (i.e. -EINVAL) is fine
for O_MAYEXEC too.
>
> Note that this point is regardless of the fact that O_MAYEXEC is a
> *security* flag -- if userspace wants to have a secure fallback on
> old kernels (which is "the right thing" to do) they would have to do
> more work than necessary. And programs that don't care don't have to do
> anything special.
Most of the time, this reasoning holds for security features.
However, the O_MAYEXEC flag is not a security feature on its own; it is
an indication to the kernel of how the file will be used by userland.
The *kernel* security policy then tells userland whether the system
security policy allows it or not. Most of the time, the Policy Decision
Point (PDP) and the Policy Enforcement Point (PEP) are in the same
software component (e.g. the kernel). Here the kernel is the PDP and
userland interpreters are the PEP. Obviously, this means that these
interpreters must be a (sub)part of your TCB (thanks to other security
features).
>
> However with -EINVAL, the programs doing "the right thing" get an easy
> -EINVAL check. And programs that don't care can just un-set O_MAYEXEC
> and retry. You should be forced to deal with the case where a flag is
> not supported -- and this is doubly true of security flags!
I'm in favor of doing this for openat2(2) with O_MAYEXEC, but it is not
because of the "security purposes" of this flag, as I said above, it is
because it is a saner ABI that every syscall should follow. But again,
it doesn't change my point about open(2). :)
--
Mickaël Salaün
The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-09 12:33 UTC (permalink / raw)
To: Aleksa Sarai
Cc: James Morris, Jeff Layton, Florian Weimer,
Mickaël Salaün, linux-kernel, Alexei Starovoitov,
Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
Eric Chiang, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet
In-Reply-To: <20190909122802.imfx6wp4zeroktuz@yavin>
On 09/09/2019 14:28, Aleksa Sarai wrote:
> On 2019-09-09, Mickaël Salaün <mickael.salaun@ssi.gouv.fr> wrote:
>> On 09/09/2019 12:12, James Morris wrote:
>>> On Mon, 9 Sep 2019, Mickaël Salaün wrote:
>>>> As I said, O_MAYEXEC should be ignored if it is not supported by the
>>>> kernel, which perfectly fit with the current open(2) flags behavior, and
>>>> should also behave the same with openat2(2).
>>>
>>> The problem here is programs which are already using the value of
>>> O_MAYEXEC, which will break. Hence, openat2(2).
>>
>> Well, it still depends on the sysctl, which doesn't enforce anything by
>> default, hence doesn't break existing behavior, and this unused flags
>> could be fixed/removed or reported by sysadmins or distro developers.
>
> Okay, but then this means that new programs which really want to enforce
> O_MAYEXEC (and know that they really do want this feature) won't be able
> to unless an admin has set the relevant sysctl. Not to mention that the
> old-kernel fallback will not cover the "it's disabled by the sysctl"
> case -- so the fallback handling would need to be:
>
>     int fd = open("foo", O_MAYEXEC|O_RDONLY);
>     if (!(fcntl(fd, F_GETFL) & O_MAYEXEC))
>         fallback();
>     if (!sysctl_feature_is_enabled)
>         fallback();
>
> However, there is still a race here -- if an administrator enables
> O_MAYEXEC after the program gets the fd, then you still won't hit the
> fallback (and you can't tell that O_MAYEXEC checks weren't done).
I just replied to this concern here:
https://lore.kernel.org/lkml/70e4244e-4dfb-6e67-416b-445e383aa1b5@ssi.gouv.fr/
>
> You could fix the issue with the sysctl by clearing O_MAYEXEC from
> f_flags if the sysctl is disabled. You could also avoid some of the
> problems with it being a global setting by making it a prctl(2) which
> processes can opt-in to (though this has its own major problems).
Security definition and enforcement should be manageable by sysadmins
and distro developers.
>
> Sorry, but I'm just really not a fan of this.
I guess there is some misunderstanding. I just replied to another thread
and I think it should answer your concerns (especially about the PDP and
PEP):
https://lore.kernel.org/lkml/70e4244e-4dfb-6e67-416b-445e383aa1b5@ssi.gouv.fr/
--
Mickaël Salaün
* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Andy Lutomirski @ 2019-09-09 15:49 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Jeff Layton, Florian Weimer, Mickaël Salaün,
linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar
In-Reply-To: <9e43ca3f-04c0-adba-1ab4-bbc8ed487934@ssi.gouv.fr>
> On Sep 9, 2019, at 2:18 AM, Mickaël Salaün <mickael.salaun@ssi.gouv.fr> wrote:
>
>
>> On 06/09/2019 20:41, Andy Lutomirski wrote:
>>
>>
>>>> On Sep 6, 2019, at 11:38 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>>>
>>>>> On Fri, 2019-09-06 at 19:14 +0200, Mickaël Salaün wrote:
>>>>>> On 06/09/2019 18:48, Jeff Layton wrote:
>>>>>>> On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
>>>>>>> On 06/09/2019 17:56, Florian Weimer wrote:
>>>>>>> Let's assume I want to add support for this to the glibc dynamic loader,
>>>>>>> while still being able to run on older kernels.
>>>>>>>
>>>>>>> Is it safe to try the open call first, with O_MAYEXEC, and if that fails
>>>>>>> with EINVAL, try again without O_MAYEXEC?
>>>>>>
>>>>>> The kernel ignore unknown open(2) flags, so yes, it is safe even for
>>>>>> older kernel to use O_MAYEXEC.
>>>>>>
>>>>>
>>>>> Well...maybe. What about existing programs that are sending down bogus
>>>>> open flags? Once you turn this on, they may break...or provide a way to
>>>>> circumvent the protections this gives.
>>>>
>>>> Well, I don't think we should nor could care about bogus programs that
>>>> do not conform to the Linux ABI.
>>>>
>>>
>>> But they do conform. The ABI is just undefined here. Unknown flags are
>>> ignored so we never really know if $random_program may be setting them.
>>>
>>>>> Maybe this should be a new flag that is only usable in the new openat2()
>>>>> syscall that's still under discussion? That syscall will enforce that
>>>>> all flags are recognized. You presumably wouldn't need the sysctl if you
>>>>> went that route too.
>>>>
>>>> Here is a thread about a new syscall:
>>>> https://lore.kernel.org/lkml/1544699060.6703.11.camel@linux.ibm.com/
>>>>
>>>> I don't think it fit well with auditing nor integrity. Moreover using
>>>> the current open(2) behavior of ignoring unknown flags fit well with the
>>>> usage of O_MAYEXEC (because it is only a hint to the kernel about the
>>>> use of the *opened* file).
>>>>
>>>
>>> The fact that open and openat didn't vet unknown flags is really a bug.
>>>
>>> Too late to fix it now, of course, and as Aleksa points out, we've
>>> worked around that in the past. Now though, we have a new openat2
>>> syscall on the horizon. There's little need to continue these sorts of
>>> hacks.
>>>
>>> New open flags really have no place in the old syscalls, IMO.
>>>
>>>>> Anyone that wants to use this will have to recompile anyway. If the
>>>>> kernel doesn't support openat2 or if the flag is rejected then you know
>>>>> that you have no O_MAYEXEC support and can decide what to do.
>>>>
>>>> If we want to enforce a security policy, we need to either be the system
>>>> administrator or the distro developer. If a distro ship interpreters
>>>> using this flag, we don't need to recompile anything, but we need to be
>>>> able to control the enforcement according to the mount point
>>>> configuration (or an advanced MAC, or an IMA config). I don't see why an
>>>> userspace process should check if this flag is supported or not, it
>>>> should simply use it, and the sysadmin will enable an enforcement if it
>>>> makes sense for the whole system.
>>>>
>>>
>>> A userland program may need to do other risk mitigation if it sets
>>> O_MAYEXEC and the kernel doesn't recognize it.
>>>
>>> Personally, here's what I'd suggest:
>>>
>>> - Base this on top of the openat2 set
>>> - Change it that so that openat2() files are non-executable by default. Anyone wanting to do that needs to set O_MAYEXEC or upgrade the fd somehow.
>>> - Only have the openat2 syscall pay attention to O_MAYEXEC. Let open and openat continue ignoring the new flag.
>>>
>>> That works around a whole pile of potential ABI headaches. Note that
>>> we'd need to make that decision before the openat2 patches are merged.
>>>
>>> Even better would be to declare the new flag in some openat2-only flag
>>> space, so there's no confusion about it being supported by legacy open
>>> calls.
>>>
>>> If glibc wants to implement an open -> openat2 wrapper in userland
>>> later, it can set that flag in the wrapper implicitly to emulate the old
>>> behavior.
>>>
>>> Given that you're going to have to recompile software to take advantage
>>> of this anyway, what's the benefit to changing legacy syscalls?
>>>
>>>>>>> Or do I risk disabling this security feature if I do that?
>>>>>>
>>>>>> It is only a security feature if the kernel support it, otherwise it is
>>>>>> a no-op.
>>>>>>
>>>>>
>>>>> With a security feature, I think we really want userland to aware of
>>>>> whether it works.
>>>>
>>>> If userland would like to enforce something, it can already do it
>>>> without any kernel modification. The goal of the O_MAYEXEC flag is to
>>>> enable the kernel, hence sysadmins or system designers, to enforce a
>>>> global security policy that makes sense.
>>>>
>>>
>>> I don't see how this helps anything if you can't tell whether the kernel
>>> recognizes the damned thing. Also, our track record with global sysctl
>>> switches like this is pretty poor. They're an administrative headache as
>>> well as a potential attack vector.
>>
>> I tend to agree. The sysctl seems like it’s asking for trouble. I can see an ld.so.conf option to turn this thing off making sense.
>
> The sysctl is required to enable the adoption of this flag without
> breaking existing systems. Current systems may have "noexec" on mount
> points containing scripts. Without giving the ability to the sysadmin to
> control that behavior, updating to a newer version of an interpreter
> using O_MAYEXEC may break such systems.
>
> How would you do this with ld.so.conf ?
>
By telling user code not to use O_MAYEXEC?
Alternatively, you could allow O_MAYEXEC even on a noexec mount and have a strong_noexec option that blocks it.
* Re: [PATCH v4 bpf-next 1/4] capability: introduce CAP_BPF and CAP_TRACING
From: Andy Lutomirski @ 2019-09-09 22:52 UTC (permalink / raw)
To: Alexei Starovoitov, James Morris, LSM List, Kees Cook, Jann Horn,
Steven Rostedt
Cc: David S. Miller, Daniel Borkmann, Peter Zijlstra,
Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190906231053.1276792-2-ast@kernel.org>
On Fri, Sep 6, 2019 at 4:10 PM Alexei Starovoitov <ast@kernel.org> wrote:
>
> Split BPF and perf/tracing operations that are allowed under
> CAP_SYS_ADMIN into corresponding CAP_BPF and CAP_TRACING.
> For backward compatibility include them in CAP_SYS_ADMIN as well.
>
> The end result provides simple safety model for applications that use BPF:
> - for tracing program types
> BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
> use CAP_BPF and CAP_TRACING
> - for networking program types
> BPF_PROG_TYPE_{SCHED_CLS, XDP, CGROUP_SKB, SK_SKB, etc}
> use CAP_BPF and CAP_NET_ADMIN
>
> There are few exceptions from this simple rule:
> - bpf_trace_printk() is allowed in networking programs, but it's using
> ftrace mechanism, hence this helper needs additional CAP_TRACING.
> - cpumap is used by XDP programs. Currently it's kept under CAP_SYS_ADMIN,
> but could be relaxed to CAP_NET_ADMIN in the future.
> - BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
> to discourage production use.
> - BPF HW offload is allowed under CAP_SYS_ADMIN.
> - cg_sysctl, cg_device, lirc program types are neither networking nor tracing.
> They can be loaded under CAP_BPF, but attach is allowed under CAP_NET_ADMIN.
> This will be cleaned up in the future.
>
> userid=nobody + (CAP_TRACING | CAP_NET_ADMIN) + CAP_BPF is safer than
> typical setup with userid=root and sudo by existing bpf applications.
> It's not secure, since these capabilities:
> - allow bpf progs access arbitrary memory
> - let tasks access any bpf map
> - let tasks attach/detach any bpf prog
>
> bpftool, bpftrace, bcc tools binaries should not be installed with
> cap_bpf+cap_tracing, since unpriv users will be able to read kernel secrets.
>
> CAP_BPF, CAP_NET_ADMIN, CAP_TRACING are roughly equal in terms of
> damage they can make to the system.
> Example:
> CAP_NET_ADMIN can stop network traffic. CAP_BPF can write into map
> and if that map is used by firewall-like bpf prog the network traffic
> may stop.
> CAP_BPF allows many bpf prog_load commands in parallel. The verifier
> may consume large amount of memory and significantly slow down the system.
> CAP_TRACING allows many kprobes that can slow down the system.
Do we want to split CAP_TRACE_KERNEL and CAP_TRACE_USER? It's not
entirely clear to me that it's useful.
>
> In the future more fine-grained bpf permissions may be added.
>
> Existing unprivileged BPF operations are not affected.
> In particular unprivileged users are allowed to load socket_filter and cg_skb
> program types and to create array, hash, prog_array, map-in-map map types.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/capability.h | 18 +++++++++++
> include/uapi/linux/capability.h | 49 ++++++++++++++++++++++++++++-
> security/selinux/include/classmap.h | 4 +--
> 3 files changed, 68 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index ecce0f43c73a..13eb49c75797 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -247,6 +247,24 @@ static inline bool ns_capable_setid(struct user_namespace *ns, int cap)
> return true;
> }
> #endif /* CONFIG_MULTIUSER */
> +
> +static inline bool capable_bpf(void)
> +{
> + return capable(CAP_SYS_ADMIN) || capable(CAP_BPF);
> +}
> +static inline bool capable_tracing(void)
> +{
> + return capable(CAP_SYS_ADMIN) || capable(CAP_TRACING);
> +}
> +static inline bool capable_bpf_tracing(void)
> +{
> + return capable(CAP_SYS_ADMIN) || (capable(CAP_BPF) && capable(CAP_TRACING));
> +}
> +static inline bool capable_bpf_net_admin(void)
> +{
> + return (capable(CAP_SYS_ADMIN) || capable(CAP_BPF)) && capable(CAP_NET_ADMIN);
> +}
> +
These helpers are all wrong, unfortunately, since they will produce
inappropriate audit events. capable_bpf() should look more like this:
    if (capable_noaudit(CAP_BPF))
        return capable(CAP_BPF);
    if (capable_noaudit(CAP_SYS_ADMIN))
        return capable(CAP_SYS_ADMIN);
    return capable(CAP_BPF);
James, etc: should there instead be new helpers to do this more
generically rather than going through the noaudit contortions? My
code above is horrible.
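To make the audit concern concrete, here is a small userspace model, not kernel code. It assumes, as a simplification, that an audited check logs only denials; real audit behavior depends on the LSM policy in use. The capability numbers, the task_caps bitmask and the denial counters are invented for the model, and capable()/capable_noaudit() here are local stand-ins, not the kernel functions.

```c
#include <stdbool.h>

/* Capability numbers invented for this model only. */
enum { CAP_SYS_ADMIN_M = 0, CAP_BPF_M = 1, CAP_COUNT_M };

static unsigned int task_caps;           /* bitmask of held capabilities */
static int denials_logged[CAP_COUNT_M];  /* modelled audit denials per cap */

/* Model of a silent check: never logs anything. */
static bool capable_noaudit(int cap)
{
    return (task_caps & (1u << cap)) != 0;
}

/* Model of an audited check: logs a denial when the capability is missing. */
static bool capable(int cap)
{
    bool ok = capable_noaudit(cap);

    if (!ok)
        denials_logged[cap]++;
    return ok;
}

/* Shape of the helper as posted in the patch. */
static bool capable_bpf_naive(void)
{
    return capable(CAP_SYS_ADMIN_M) || capable(CAP_BPF_M);
}

/* Shape of the helper suggested in this reply. */
static bool capable_bpf_quiet(void)
{
    if (capable_noaudit(CAP_BPF_M))
        return capable(CAP_BPF_M);
    if (capable_noaudit(CAP_SYS_ADMIN_M))
        return capable(CAP_SYS_ADMIN_M);
    return capable(CAP_BPF_M);
}
```

In this model, a task holding only the BPF capability passes both helpers, but the posted shape logs a spurious CAP_SYS_ADMIN denial first. A task holding neither capability gets exactly one denial from the suggested shape, attributed to the BPF capability.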
* Re: [PATCH 0/7] Rework random blocking
From: Andy Lutomirski @ 2019-09-09 22:57 UTC (permalink / raw)
To: Pavel Machek
Cc: Andy Lutomirski, Theodore Tso, LKML, Linux API, Kees Cook,
Jason A. Donenfeld
In-Reply-To: <20190909094230.GB27626@amd>
On Mon, Sep 9, 2019 at 2:42 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> On Thu 2019-08-29 18:11:35, Andy Lutomirski wrote:
> > This makes two major semantic changes to Linux's random APIs:
> >
> > It adds getrandom(..., GRND_INSECURE). This causes getrandom to
> > always return *something*. There is no guarantee whatsoever that
> > the result will be cryptographically random or even unique, but the
> > kernel will give the best quality random output it can. The name is
> > a big hint: the resulting output is INSECURE.
> >
> > The purpose of this is to allow programs that genuinely want
> > best-effort entropy to get it without resorting to /dev/urandom.
> > Plenty of programs do this because they need to do *something*
> > during boot and they can't afford to wait. Calling it "INSECURE" is
> > probably the best we can do to discourage using this API for things
> > that need security.
> >
> > This series also removes the blocking pool and makes /dev/random
> > work just like getrandom(..., 0) and makes GRND_RANDOM a no-op. I
> > believe that Linux's blocking pool has outlived its usefulness.
> > Linux's CRNG generates output that is good enough to use even for
> > key generation. The blocking pool is not stronger in any material
> > way, and keeping it around requires a lot of infrastructure of
> > dubious value.
>
> Could you give some more justification? If crng is good enough for
> you, you can use /dev/urandom...
Take a look at the diffstat. The random code is extremely security
sensitive, and it's made considerably more complicated by the need to
support the blocking semantics for /dev/random. My primary argument
is that there is no real reason for the kernel to continue to support
it.
>
> > This series should not break any existing programs. /dev/urandom is
> > unchanged. /dev/random will still block just after booting, but it
> > will block less than it used to. getrandom() with existing flags
> > will return output that is, for practical purposes, just as strong
> > as before.
>
> So what is the exact semantic of /dev/random after your change?
Reads return immediately if the CRNG is initialized, i.e. reads return
immediately if and only if getrandom(..., 0) would not block.
Otherwise reads block.
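As a sketch of how userspace could probe that condition today, the helper below uses getrandom(2), the call that takes the GRND_* flags, with GRND_NONBLOCK. The helper name is made up for this example; the EAGAIN behavior is the one documented for getrandom(2).

```c
#include <errno.h>
#include <sys/random.h>
#include <sys/types.h>

/*
 * Returns 1 if the CRNG is initialized (a one-byte non-blocking read
 * succeeds), 0 if a blocking read would currently have to wait (EAGAIN),
 * and -1 on any other error, with errno left set by getrandom().
 */
static int crng_is_ready(void)
{
    unsigned char byte;
    ssize_t n = getrandom(&byte, sizeof(byte), GRND_NONBLOCK);

    if (n == (ssize_t)sizeof(byte))
        return 1;
    if (n < 0 && errno == EAGAIN)
        return 0;
    return -1;
}
```

Under the semantics described above, a read from /dev/random would block exactly when this helper returns 0.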
--Andy
* [PATCH v6 0/3] add thermal/power management features for FPGA DFL drivers
From: Wu Hao @ 2019-09-10 5:50 UTC (permalink / raw)
To: mdf, linux-fpga, linux-kernel
Cc: linux-api, linux-hwmon, linux, jdelvare, gregkh, Wu Hao
Hi Moritz and all,
This patchset adds thermal and power management features for FPGA DFL
drivers. Both patches are using hwmon as userspace interfaces.
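For readers unfamiliar with hwmon, a minimal userspace sketch of reading one such attribute follows. It assumes the usual hwmon convention of one integer per file, in millidegrees Celsius for temperatures. The helper name is made up, and the hwmonX index in any real path is assigned at runtime.

```c
#include <stdio.h>

/*
 * Read a hwmon temperature attribute (one integer, millidegrees Celsius).
 * Returns 0 and stores the value in *out on success, -1 on failure.
 */
static int read_millicelsius(const char *attr_path, long *out)
{
    FILE *f = fopen(attr_path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would pass something like /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input and divide the result by 1000 to get degrees Celsius.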
The dfl patches this series depends on have been merged, so I am
resubmitting it after a rebase and clean up.
(This patchset is generated against char-misc-next).
Please help review it and let me know if you have any comments. Thank you very much!
Main changes from v5:
- rebase and clean up (remove empty uinit function) per changes in recent
merged dfl patches.
- update date in sysfs doc.
Main changes from v4:
- rebase due to Documentation format change (dfl.txt -> rst).
- clamp threshold inputs for sysfs interfaces. (patch#3)
- update sysfs doc to add more description for ltr sysfs interfaces.
(patch#3)
Main changes from v3:
- use HWMON_CHANNEL_INFO.
Main changes from v2:
- switch to standard hwmon APIs for thermal hwmon:
temp1_alarm --> temp1_max
temp1_alarm_status --> temp1_max_alarm
temp1_crit_status --> temp1_crit_alarm
temp1_alarm_policy --> temp1_max_policy
- switch to standard hwmon APIs for power hwmon:
power1_cap --> power1_max
power1_cap_status --> power1_max_alarm
power1_crit_status --> power1_crit_alarm
Wu Hao (2):
fpga: dfl: fme: add thermal management support
fpga: dfl: fme: add power management support
Xu Yilun (1):
Documentation: fpga: dfl: add descriptions for thermal/power
management interfaces
Documentation/ABI/testing/sysfs-platform-dfl-fme | 134 ++++++++
Documentation/fpga/dfl.rst | 10 +
drivers/fpga/Kconfig | 2 +-
drivers/fpga/dfl-fme-main.c | 385 +++++++++++++++++++++++
4 files changed, 530 insertions(+), 1 deletion(-)
--
1.8.3.1
* [PATCH v6 1/3] Documentation: fpga: dfl: add descriptions for thermal/power management interfaces
From: Wu Hao @ 2019-09-10 5:50 UTC (permalink / raw)
To: mdf, linux-fpga, linux-kernel
Cc: linux-api, linux-hwmon, linux, jdelvare, gregkh, Xu Yilun, Wu Hao
In-Reply-To: <1568094640-4920-1-git-send-email-hao.wu@intel.com>
From: Xu Yilun <yilun.xu@intel.com>
This patch adds descriptions of the thermal/power interfaces. They are
implemented as hwmon sysfs interfaces by the thermal/power private
feature drivers.
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
Documentation/fpga/dfl.rst | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
index 6fa483f..094fc8a 100644
--- a/Documentation/fpga/dfl.rst
+++ b/Documentation/fpga/dfl.rst
@@ -108,6 +108,16 @@ More functions are exposed through sysfs
error reporting sysfs interfaces allow user to read errors detected by the
hardware, and clear the logged errors.
+ Power management (dfl_fme_power hwmon)
+ power management hwmon sysfs interfaces allow user to read power management
+ information (power consumption, thresholds, threshold status, limits, etc.)
+ and configure power thresholds for different throttling levels.
+
+ Thermal management (dfl_fme_thermal hwmon)
+ thermal management hwmon sysfs interfaces allow user to read thermal
+ management information (current temperature, thresholds, threshold status,
+ etc.).
+
FIU - PORT
==========
--
1.8.3.1
* [PATCH v6 2/3] fpga: dfl: fme: add thermal management support
From: Wu Hao @ 2019-09-10 5:50 UTC (permalink / raw)
To: mdf, linux-fpga, linux-kernel
Cc: linux-api, linux-hwmon, linux, jdelvare, gregkh, Wu Hao,
Luwei Kang, Russ Weight, Xu Yilun
In-Reply-To: <1568094640-4920-1-git-send-email-hao.wu@intel.com>
This patch adds support for the thermal management private feature of the DFL
FPGA Management Engine (FME). This private feature driver registers
a hwmon device for thermal/temperature monitoring (hwmon temp1_input).
If hardware automatic throttling is supported by this hardware, then
the driver also exposes sysfs interfaces under hwmon for thresholds
(temp1_max/ crit/ emergency), threshold alarms (temp1_max_alarm/
temp1_crit_alarm) and throttling policy (temp1_max_policy).
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Russ Weight <russell.h.weight@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Moritz Fischer <mdf@kernel.org>
---
v2: create a dfl_fme_thermal hwmon to expose thermal information.
move all sysfs interfaces under hwmon
temperature --> hwmon temp1_input
threshold1 --> hwmon temp1_alarm
threshold2 --> hwmon temp1_crit
trip_threshold --> hwmon temp1_emergency
threshold1_status --> hwmon temp1_alarm_status
threshold2_status --> hwmon temp1_crit_status
threshold1_policy --> hwmon temp1_alarm_policy
v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
temp1_alarm --> temp1_max
temp1_alarm_status --> temp1_max_alarm
temp1_crit_status --> temp1_crit_alarm
temp1_alarm_policy --> temp1_max_policy
update sysfs doc for above sysfs interface changes.
replace scnprintf with sprintf in sysfs interface.
v4: use HWMON_CHANNEL_INFO.
rebase, and update date in sysfs doc.
v5: no change.
v6: rebased, and clean up (remove empty uinit function).
update date in sysfs doc.
---
Documentation/ABI/testing/sysfs-platform-dfl-fme | 64 ++++++++
drivers/fpga/Kconfig | 2 +-
drivers/fpga/dfl-fme-main.c | 178 +++++++++++++++++++++++
3 files changed, 243 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
index 72634d3..c84b3c1 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
@@ -106,3 +106,67 @@ KernelVersion: 5.4
Contact: Wu Hao <hao.wu@intel.com>
Description: Read-only. Read this file to get the second error detected by
hardware.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/name
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. Read this file to get the name of hwmon device, it
+ supports values:
+ 'dfl_fme_thermal' - thermal hwmon device name
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns FPGA device temperature in millidegrees
+ Celsius.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns hardware threshold1 temperature in
+ millidegrees Celsius. If temperature rises at or above this
+ threshold, hardware starts 50% or 90% throttling (see
+ 'temp1_max_policy').
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns hardware threshold2 temperature in
+ millidegrees Celsius. If temperature rises at or above this
+ threshold, hardware starts 100% throttling.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_emergency
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns hardware trip threshold temperature in
+ millidegrees Celsius. If temperature rises at or above this
+ threshold, a fatal event will be triggered to board management
+ controller (BMC) to shutdown FPGA.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max_alarm
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if temperature is currently at or above
+ hardware threshold1 (see 'temp1_max'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit_alarm
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if temperature is currently at or above
+ hardware threshold2 (see 'temp1_crit'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max_policy
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. Read this file to get the policy of hardware threshold1
+ (see 'temp1_max'). It only supports two values (policies):
+ 0 - AP2 state (90% throttling)
+ 1 - AP1 state (50% throttling)
diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
index 73c779e..72380e1 100644
--- a/drivers/fpga/Kconfig
+++ b/drivers/fpga/Kconfig
@@ -156,7 +156,7 @@ config FPGA_DFL
config FPGA_DFL_FME
tristate "FPGA DFL FME Driver"
- depends on FPGA_DFL
+ depends on FPGA_DFL && HWMON
help
The FPGA Management Engine (FME) is a feature device implemented
under Device Feature List (DFL) framework. Select this option to
diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 4d78e18..752d71c 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -14,6 +14,8 @@
* Henry Mitchel <henry.mitchel@intel.com>
*/
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/uaccess.h>
@@ -181,6 +183,178 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
.ioctl = fme_hdr_ioctl,
};
+#define FME_THERM_THRESHOLD 0x8
+#define TEMP_THRESHOLD1 GENMASK_ULL(6, 0)
+#define TEMP_THRESHOLD1_EN BIT_ULL(7)
+#define TEMP_THRESHOLD2 GENMASK_ULL(14, 8)
+#define TEMP_THRESHOLD2_EN BIT_ULL(15)
+#define TRIP_THRESHOLD GENMASK_ULL(30, 24)
+#define TEMP_THRESHOLD1_STATUS BIT_ULL(32) /* threshold1 reached */
+#define TEMP_THRESHOLD2_STATUS BIT_ULL(33) /* threshold2 reached */
+/* threshold1 policy: 0 - AP2 (90% throttle) / 1 - AP1 (50% throttle) */
+#define TEMP_THRESHOLD1_POLICY BIT_ULL(44)
+
+#define FME_THERM_RDSENSOR_FMT1 0x10
+#define FPGA_TEMPERATURE GENMASK_ULL(6, 0)
+
+#define FME_THERM_CAP 0x20
+#define THERM_NO_THROTTLE BIT_ULL(0)
+
+#define MD_PRE_DEG
+
+static bool fme_thermal_throttle_support(void __iomem *base)
+{
+ u64 v = readq(base + FME_THERM_CAP);
+
+ return FIELD_GET(THERM_NO_THROTTLE, v) ? false : true;
+}
+
+static umode_t thermal_hwmon_attrs_visible(const void *drvdata,
+ enum hwmon_sensor_types type,
+ u32 attr, int channel)
+{
+ const struct dfl_feature *feature = drvdata;
+
+ /* temperature is always supported, and check hardware cap for others */
+ if (attr == hwmon_temp_input)
+ return 0444;
+
+ return fme_thermal_throttle_support(feature->ioaddr) ? 0444 : 0;
+}
+
+static int thermal_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int channel, long *val)
+{
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+ u64 v;
+
+ switch (attr) {
+ case hwmon_temp_input:
+ v = readq(feature->ioaddr + FME_THERM_RDSENSOR_FMT1);
+ *val = (long)(FIELD_GET(FPGA_TEMPERATURE, v) * 1000);
+ break;
+ case hwmon_temp_max:
+ v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+ *val = (long)(FIELD_GET(TEMP_THRESHOLD1, v) * 1000);
+ break;
+ case hwmon_temp_crit:
+ v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+ *val = (long)(FIELD_GET(TEMP_THRESHOLD2, v) * 1000);
+ break;
+ case hwmon_temp_emergency:
+ v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+ *val = (long)(FIELD_GET(TRIP_THRESHOLD, v) * 1000);
+ break;
+ case hwmon_temp_max_alarm:
+ v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+ *val = (long)FIELD_GET(TEMP_THRESHOLD1_STATUS, v);
+ break;
+ case hwmon_temp_crit_alarm:
+ v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+ *val = (long)FIELD_GET(TEMP_THRESHOLD2_STATUS, v);
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static const struct hwmon_ops thermal_hwmon_ops = {
+ .is_visible = thermal_hwmon_attrs_visible,
+ .read = thermal_hwmon_read,
+};
+
+static const struct hwmon_channel_info *thermal_hwmon_info[] = {
+ HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | HWMON_T_EMERGENCY |
+ HWMON_T_MAX | HWMON_T_MAX_ALARM |
+ HWMON_T_CRIT | HWMON_T_CRIT_ALARM),
+ NULL
+};
+
+static const struct hwmon_chip_info thermal_hwmon_chip_info = {
+ .ops = &thermal_hwmon_ops,
+ .info = thermal_hwmon_info,
+};
+
+static ssize_t temp1_max_policy_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+ u64 v;
+
+ v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+
+ return sprintf(buf, "%u\n",
+ (unsigned int)FIELD_GET(TEMP_THRESHOLD1_POLICY, v));
+}
+
+static DEVICE_ATTR_RO(temp1_max_policy);
+
+static struct attribute *thermal_extra_attrs[] = {
+ &dev_attr_temp1_max_policy.attr,
+ NULL,
+};
+
+static umode_t thermal_extra_attrs_visible(struct kobject *kobj,
+ struct attribute *attr, int index)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+
+ return fme_thermal_throttle_support(feature->ioaddr) ? attr->mode : 0;
+}
+
+static const struct attribute_group thermal_extra_group = {
+ .attrs = thermal_extra_attrs,
+ .is_visible = thermal_extra_attrs_visible,
+};
+__ATTRIBUTE_GROUPS(thermal_extra);
+
+static int fme_thermal_mgmt_init(struct platform_device *pdev,
+ struct dfl_feature *feature)
+{
+ struct device *hwmon;
+
+ /*
+ * create hwmon to allow userspace monitoring temperature and other
+ * threshold information.
+ *
+ * temp1_input -> FPGA device temperature
+ * temp1_max -> hardware threshold 1 -> 50% or 90% throttling
+ * temp1_crit -> hardware threshold 2 -> 100% throttling
+ * temp1_emergency -> hardware trip_threshold to shutdown FPGA
+ * temp1_max_alarm -> hardware threshold 1 alarm
+ * temp1_crit_alarm -> hardware threshold 2 alarm
+ *
+ * create device specific sysfs interfaces, e.g. read temp1_max_policy
+ * to understand the actual hardware throttling action (50% vs 90%).
+ *
+ * If hardware doesn't support automatic throttling per thresholds,
+ * then all above sysfs interfaces are not visible except temp1_input
+ * for temperature.
+ */
+ hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
+ "dfl_fme_thermal", feature,
+ &thermal_hwmon_chip_info,
+ thermal_extra_groups);
+ if (IS_ERR(hwmon)) {
+ dev_err(&pdev->dev, "Fail to register thermal hwmon\n");
+ return PTR_ERR(hwmon);
+ }
+
+ return 0;
+}
+
+static const struct dfl_feature_id fme_thermal_mgmt_id_table[] = {
+ {.id = FME_FEATURE_ID_THERMAL_MGMT,},
+ {0,}
+};
+
+static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
+ .init = fme_thermal_mgmt_init,
+};
+
static struct dfl_feature_driver fme_feature_drvs[] = {
{
.id_table = fme_hdr_id_table,
@@ -195,6 +369,10 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
.ops = &fme_global_err_ops,
},
{
+ .id_table = fme_thermal_mgmt_id_table,
+ .ops = &fme_thermal_mgmt_ops,
+ },
+ {
.ops = NULL,
},
};
--
1.8.3.1
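
For readers following the register layout in the patch above, the decoding
that thermal_hwmon_read() and temp1_max_policy_show() perform on
FME_THERM_THRESHOLD can be sketched in plain userspace C. This is only an
illustration: the SK_* macros are simplified stand-ins written for this note
(they are not the kernel's GENMASK_ULL()/FIELD_GET() implementations), the
struct and function names are hypothetical, and no real hardware is read.
Only the bit positions are taken from the patch.

```c
#include <stdint.h>

/*
 * Simplified userspace stand-ins for the kernel's GENMASK_ULL(), BIT_ULL()
 * and FIELD_GET(). Assumption: contiguous masks only, no compile-time checks.
 */
#define SK_GENMASK_ULL(h, l)	((~0ULL >> (63 - (h))) & (~0ULL << (l)))
#define SK_BIT_ULL(n)		(1ULL << (n))
/* Dividing by the lowest set bit of the mask shifts the field down to bit 0. */
#define SK_FIELD_GET(mask, reg)	(((reg) & (mask)) / ((mask) & ~((mask) << 1)))

/* Field layout of FME_THERM_THRESHOLD, bit positions copied from the patch. */
#define SK_TEMP_THRESHOLD1		SK_GENMASK_ULL(6, 0)
#define SK_TEMP_THRESHOLD2		SK_GENMASK_ULL(14, 8)
#define SK_TRIP_THRESHOLD		SK_GENMASK_ULL(30, 24)
#define SK_TEMP_THRESHOLD1_STATUS	SK_BIT_ULL(32)
#define SK_TEMP_THRESHOLD2_STATUS	SK_BIT_ULL(33)
#define SK_TEMP_THRESHOLD1_POLICY	SK_BIT_ULL(44)

/* Hypothetical container for the values the hwmon attributes would report. */
struct sk_thermal_thresholds {
	long max_mdeg;		/* temp1_max, millidegrees Celsius */
	long crit_mdeg;		/* temp1_crit */
	long emergency_mdeg;	/* temp1_emergency */
	int max_alarm;		/* temp1_max_alarm */
	int crit_alarm;		/* temp1_crit_alarm */
	int max_policy;		/* temp1_max_policy: 0 = AP2, 1 = AP1 */
};

/* Decode one raw 64-bit register value; hardware reports whole degrees. */
static struct sk_thermal_thresholds sk_decode_therm_threshold(uint64_t v)
{
	struct sk_thermal_thresholds t;

	t.max_mdeg = (long)(SK_FIELD_GET(SK_TEMP_THRESHOLD1, v) * 1000);
	t.crit_mdeg = (long)(SK_FIELD_GET(SK_TEMP_THRESHOLD2, v) * 1000);
	t.emergency_mdeg = (long)(SK_FIELD_GET(SK_TRIP_THRESHOLD, v) * 1000);
	t.max_alarm = (int)SK_FIELD_GET(SK_TEMP_THRESHOLD1_STATUS, v);
	t.crit_alarm = (int)SK_FIELD_GET(SK_TEMP_THRESHOLD2_STATUS, v);
	t.max_policy = (int)SK_FIELD_GET(SK_TEMP_THRESHOLD1_POLICY, v);

	return t;
}
```

A caller would pass in a value obtained from the register (in the driver,
readq(feature->ioaddr + FME_THERM_THRESHOLD)) and report the fields
individually, which is what the hwmon callbacks do one attribute at a time.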
* [PATCH v6 3/3] fpga: dfl: fme: add power management support
From: Wu Hao @ 2019-09-10 5:50 UTC (permalink / raw)
To: mdf, linux-fpga, linux-kernel
Cc: linux-api, linux-hwmon, linux, jdelvare, gregkh, Wu Hao,
Luwei Kang, Xu Yilun
In-Reply-To: <1568094640-4920-1-git-send-email-hao.wu@intel.com>
This patch adds support for the power management private feature under
the FPGA Management Engine (FME). The private feature driver registers
a hwmon device that exposes power consumption (power1_input) and threshold
information (power1_max / crit / max_alarm / crit_alarm), along with
read-only sysfs interfaces for other power management information. For
configuration, the user can write threshold values through the
power1_max / crit sysfs interfaces under hwmon.
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Moritz Fischer <mdf@kernel.org>
---
v2: create a dfl_fme_power hwmon to expose power sysfs interfaces.
move all sysfs interfaces under hwmon
consumed --> hwmon power1_input
threshold1 --> hwmon power1_cap
threshold2 --> hwmon power1_crit
threshold1_status --> hwmon power1_cap_status
threshold2_status --> hwmon power1_crit_status
xeon_limit --> hwmon power1_xeon_limit
fpga_limit --> hwmon power1_fpga_limit
ltr --> hwmon power1_ltr
v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
power1_cap --> power1_max
power1_cap_status --> power1_max_alarm
power1_crit_status --> power1_crit_alarm
update sysfs doc for above sysfs interface changes.
replace scnprintf with sprintf in sysfs interface.
v4: use HWMON_CHANNEL_INFO.
update date in sysfs doc.
v5: clamp threshold inputs in power_hwmon_write function.
update sysfs doc as threshold inputs are clamped now.
add more descriptions to ltr sysfs interface.
v6: rebase and clean up (remove empty uinit function).
update date in sysfs doc.
---
Documentation/ABI/testing/sysfs-platform-dfl-fme | 70 ++++++++
drivers/fpga/dfl-fme-main.c | 207 +++++++++++++++++++++++
2 files changed, 277 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
index c84b3c1..9ea4697 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
@@ -114,6 +114,7 @@ Contact: Wu Hao <hao.wu@intel.com>
Description: Read-Only. Read this file to get the name of hwmon device, it
supports values:
'dfl_fme_thermal' - thermal hwmon device name
+ 'dfl_fme_power' - power hwmon device name
What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
Date: September 2019
@@ -170,3 +171,72 @@ Description: Read-Only. Read this file to get the policy of hardware threshold1
(see 'temp1_max'). It only supports two values (policies):
0 - AP2 state (90% throttling)
1 - AP1 state (50% throttling)
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns current FPGA power consumption in uW.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max
+Date: September 2019
+KernelVersion: 5.4
+Date: June 2019
+KernelVersion: 5.3
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Write. Read this file to get current hardware power
+ threshold1 in uW. If power consumption rises at or above
+ this threshold, hardware starts 50% throttling.
+ Write this file to set current hardware power threshold1 in uW.
+ As hardware only accepts values in Watts, the input value is
+ rounded down to whole Watts (any sub-Watt part is discarded) and
+ clamped within the range from 0 to 127 Watts. Write fails with
+ -EINVAL if input parsing fails.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Write. Read this file to get current hardware power
+ threshold2 in uW. If power consumption rises at or above
+ this threshold, hardware starts 90% throttling.
+ Write this file to set current hardware power threshold2 in uW.
+ As hardware only accepts values in Watts, the input value is
+ rounded down to whole Watts (any sub-Watt part is discarded) and
+ clamped within the range from 0 to 127 Watts. Write fails with
+ -EINVAL if input parsing fails.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max_alarm
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if power consumption is currently at or
+ above hardware threshold1 (see 'power1_max'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_alarm
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. It returns 1 if power consumption is currently at or
+ above hardware threshold2 (see 'power1_crit'), otherwise 0.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns power limit for XEON in uW.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-Only. It returns power limit for FPGA in uW.
+
+What: /sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
+Date: September 2019
+KernelVersion: 5.4
+Contact: Wu Hao <hao.wu@intel.com>
+Description: Read-only. Read this file to get current Latency Tolerance
+ Reporting (ltr) value. It returns 1 if all Accelerated
+ Function Units (AFUs) can tolerate latency >= 40us for memory
+ access or 0 if any AFU is latency sensitive (< 40us).
diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 752d71c..7c930e6 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -355,6 +355,209 @@ static int fme_thermal_mgmt_init(struct platform_device *pdev,
.init = fme_thermal_mgmt_init,
};
+#define FME_PWR_STATUS 0x8
+#define FME_LATENCY_TOLERANCE BIT_ULL(18)
+#define PWR_CONSUMED GENMASK_ULL(17, 0)
+
+#define FME_PWR_THRESHOLD 0x10
+#define PWR_THRESHOLD1 GENMASK_ULL(6, 0) /* in Watts */
+#define PWR_THRESHOLD2 GENMASK_ULL(14, 8) /* in Watts */
+#define PWR_THRESHOLD_MAX 0x7f /* in Watts */
+#define PWR_THRESHOLD1_STATUS BIT_ULL(16)
+#define PWR_THRESHOLD2_STATUS BIT_ULL(17)
+
+#define FME_PWR_XEON_LIMIT 0x18
+#define XEON_PWR_LIMIT GENMASK_ULL(14, 0) /* in 0.1 Watts */
+#define XEON_PWR_EN BIT_ULL(15)
+#define FME_PWR_FPGA_LIMIT 0x20
+#define FPGA_PWR_LIMIT GENMASK_ULL(14, 0) /* in 0.1 Watts */
+#define FPGA_PWR_EN BIT_ULL(15)
+
+static int power_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int channel, long *val)
+{
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+ u64 v;
+
+ switch (attr) {
+ case hwmon_power_input:
+ v = readq(feature->ioaddr + FME_PWR_STATUS);
+ *val = (long)(FIELD_GET(PWR_CONSUMED, v) * 1000000);
+ break;
+ case hwmon_power_max:
+ v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+ *val = (long)(FIELD_GET(PWR_THRESHOLD1, v) * 1000000);
+ break;
+ case hwmon_power_crit:
+ v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+ *val = (long)(FIELD_GET(PWR_THRESHOLD2, v) * 1000000);
+ break;
+ case hwmon_power_max_alarm:
+ v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+ *val = (long)FIELD_GET(PWR_THRESHOLD1_STATUS, v);
+ break;
+ case hwmon_power_crit_alarm:
+ v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+ *val = (long)FIELD_GET(PWR_THRESHOLD2_STATUS, v);
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static int power_hwmon_write(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int channel, long val)
+{
+ struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+ int ret = 0;
+ u64 v;
+
+ val = clamp_val(val / 1000000, 0, PWR_THRESHOLD_MAX);
+
+ mutex_lock(&pdata->lock);
+
+ switch (attr) {
+ case hwmon_power_max:
+ v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+ v &= ~PWR_THRESHOLD1;
+ v |= FIELD_PREP(PWR_THRESHOLD1, val);
+ writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
+ break;
+ case hwmon_power_crit:
+ v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+ v &= ~PWR_THRESHOLD2;
+ v |= FIELD_PREP(PWR_THRESHOLD2, val);
+ writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
+ break;
+ default:
+ ret = -EOPNOTSUPP;
+ break;
+ }
+
+ mutex_unlock(&pdata->lock);
+
+ return ret;
+}
+
+static umode_t power_hwmon_attrs_visible(const void *drvdata,
+ enum hwmon_sensor_types type,
+ u32 attr, int channel)
+{
+ switch (attr) {
+ case hwmon_power_input:
+ case hwmon_power_max_alarm:
+ case hwmon_power_crit_alarm:
+ return 0444;
+ case hwmon_power_max:
+ case hwmon_power_crit:
+ return 0644;
+ }
+
+ return 0;
+}
+
+static const struct hwmon_ops power_hwmon_ops = {
+ .is_visible = power_hwmon_attrs_visible,
+ .read = power_hwmon_read,
+ .write = power_hwmon_write,
+};
+
+static const struct hwmon_channel_info *power_hwmon_info[] = {
+ HWMON_CHANNEL_INFO(power, HWMON_P_INPUT |
+ HWMON_P_MAX | HWMON_P_MAX_ALARM |
+ HWMON_P_CRIT | HWMON_P_CRIT_ALARM),
+ NULL
+};
+
+static const struct hwmon_chip_info power_hwmon_chip_info = {
+ .ops = &power_hwmon_ops,
+ .info = power_hwmon_info,
+};
+
+static ssize_t power1_xeon_limit_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+ u16 xeon_limit = 0;
+ u64 v;
+
+ v = readq(feature->ioaddr + FME_PWR_XEON_LIMIT);
+
+ if (FIELD_GET(XEON_PWR_EN, v))
+ xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
+
+ return sprintf(buf, "%u\n", xeon_limit * 100000);
+}
+
+static ssize_t power1_fpga_limit_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+ u16 fpga_limit = 0;
+ u64 v;
+
+ v = readq(feature->ioaddr + FME_PWR_FPGA_LIMIT);
+
+ if (FIELD_GET(FPGA_PWR_EN, v))
+ fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
+
+ return sprintf(buf, "%u\n", fpga_limit * 100000);
+}
+
+static ssize_t power1_ltr_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dfl_feature *feature = dev_get_drvdata(dev);
+ u64 v;
+
+ v = readq(feature->ioaddr + FME_PWR_STATUS);
+
+ return sprintf(buf, "%u\n",
+ (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
+}
+
+static DEVICE_ATTR_RO(power1_xeon_limit);
+static DEVICE_ATTR_RO(power1_fpga_limit);
+static DEVICE_ATTR_RO(power1_ltr);
+
+static struct attribute *power_extra_attrs[] = {
+ &dev_attr_power1_xeon_limit.attr,
+ &dev_attr_power1_fpga_limit.attr,
+ &dev_attr_power1_ltr.attr,
+ NULL
+};
+
+ATTRIBUTE_GROUPS(power_extra);
+
+static int fme_power_mgmt_init(struct platform_device *pdev,
+ struct dfl_feature *feature)
+{
+ struct device *hwmon;
+
+ hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
+ "dfl_fme_power", feature,
+ &power_hwmon_chip_info,
+ power_extra_groups);
+ if (IS_ERR(hwmon)) {
+ dev_err(&pdev->dev, "Fail to register power hwmon\n");
+ return PTR_ERR(hwmon);
+ }
+
+ return 0;
+}
+
+static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
+ {.id = FME_FEATURE_ID_POWER_MGMT,},
+ {0,}
+};
+
+static const struct dfl_feature_ops fme_power_mgmt_ops = {
+ .init = fme_power_mgmt_init,
+};
+
static struct dfl_feature_driver fme_feature_drvs[] = {
{
.id_table = fme_hdr_id_table,
@@ -373,6 +576,10 @@ static int fme_thermal_mgmt_init(struct platform_device *pdev,
.ops = &fme_thermal_mgmt_ops,
},
{
+ .id_table = fme_power_mgmt_id_table,
+ .ops = &fme_power_mgmt_ops,
+ },
+ {
.ops = NULL,
},
};
--
1.8.3.1
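
The write path in power_hwmon_write() above (convert uW to Watts, clamp,
then read-modify-write one 7-bit field) can be sketched in userspace C as
follows. This is a minimal illustration under stated assumptions: the sk_*
names are hypothetical, the register is modelled as a plain 64-bit value
rather than MMIO, and the locking the driver does around the update
(pdata->lock) is omitted. The field positions and the 0..127 Watt range are
taken from the patch.

```c
#include <stdint.h>

/* Field layout of FME_PWR_THRESHOLD, bit positions copied from the patch. */
#define SK_PWR_THRESHOLD1	0x7fULL		/* bits 6:0, in Watts */
#define SK_PWR_THRESHOLD2	(0x7fULL << 8)	/* bits 14:8, in Watts */
#define SK_PWR_THRESHOLD_MAX	0x7f		/* largest value the field holds */

/* Convert a microwatt input to whole Watts and clamp it to 0..127. */
static long sk_uw_to_watts(long uw)
{
	long w = uw / 1000000;	/* integer division drops the sub-Watt part */

	if (w < 0)
		return 0;
	if (w > SK_PWR_THRESHOLD_MAX)
		return SK_PWR_THRESHOLD_MAX;
	return w;
}

/* Replace threshold1 in a register image, leaving all other bits intact. */
static uint64_t sk_set_threshold1(uint64_t reg, long uw)
{
	reg &= ~SK_PWR_THRESHOLD1;
	reg |= (uint64_t)sk_uw_to_watts(uw) & SK_PWR_THRESHOLD1;
	return reg;
}

/* Same for threshold2, which sits eight bits higher. */
static uint64_t sk_set_threshold2(uint64_t reg, long uw)
{
	reg &= ~SK_PWR_THRESHOLD2;
	reg |= ((uint64_t)sk_uw_to_watts(uw) << 8) & SK_PWR_THRESHOLD2;
	return reg;
}
```

Because the value is clamped rather than rejected, a write of an
out-of-range number still succeeds and simply lands on 0 or 127 Watts,
which matches the behaviour described in the sysfs documentation hunk.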
* Re: [PATCH v12 11/12] open: openat2(2) syscall
From: Ingo Molnar @ 2019-09-10 6:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-ia64, Linux-sh list, Peter Zijlstra, Rasmus Villemoes,
Alexei Starovoitov, Linux List Kernel Mailing, David Howells,
open list:KERNEL SELFTEST FRAMEWORK, sparclinux, Shuah Khan,
linux-arch, linux-s390, Tycho Andersen, Aleksa Sarai, Jiri Olsa,
Alexander Shishkin, Ingo Molnar, Linux ARM, linux-mips,
linux-xtensa, Kees Cook, Arnd Bergmann, Jann Horn
In-Reply-To: <CAHk-=whe90Ec_RRrMRLE0=bJOHNS9YmVwcytVxmrfK3oCuZF6A@mail.gmail.com>
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Linus, you rejected resolveat() because you wanted a *nice* API
>
> No. I rejected resolveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
>
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
>
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.
I suspect there is an "add a clean new flags namespace" analogy to the
classic "add a clean new standard" XKCD:
https://xkcd.com/927/
Thanks,
Ingo
* [PATCH 0/2] Minor lockdown fixups
From: Matthew Garrett @ 2019-09-10 10:03 UTC (permalink / raw)
To: jmorris; +Cc: linux-security-module, linux-kernel, linux-api
Constify some arrays and fix an #ifdef that I typoed.
* [PATCH 1/2] security: constify some arrays in lockdown LSM
From: Matthew Garrett @ 2019-09-10 10:03 UTC (permalink / raw)
To: jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
Matthew Garrett, David Howells
In-Reply-To: <20190910100318.204420-1-matthewgarrett@google.com>
No reason for these not to be const.
Signed-off-by: Matthew Garrett <mjg59@google.com>
Suggested-by: David Howells <dhowells@redhat.com>
---
security/lockdown/lockdown.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index 0068cec77c05..8a10b43daf74 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -16,7 +16,7 @@
static enum lockdown_reason kernel_locked_down;
-static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
+static const char *const lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
[LOCKDOWN_NONE] = "none",
[LOCKDOWN_MODULE_SIGNATURE] = "unsigned module loading",
[LOCKDOWN_DEV_MEM] = "/dev/mem,kmem,port",
@@ -40,7 +40,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
};
-static enum lockdown_reason lockdown_levels[] = {LOCKDOWN_NONE,
+static const enum lockdown_reason lockdown_levels[] = {LOCKDOWN_NONE,
LOCKDOWN_INTEGRITY_MAX,
LOCKDOWN_CONFIDENTIALITY_MAX};
--
2.23.0.162.g0b9fbb3734-goog
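
As a side note on what the two const qualifiers in the patch above buy: in
"const char *const name[]" the first const makes the characters read-only
and the second makes the pointer slots themselves read-only, so the whole
table can be placed in read-only data. The sketch below uses a cut-down,
hypothetical enum (the real enum lockdown_reason has many more entries) and
a hypothetical lookup helper.

```c
#include <stddef.h>

/* Hypothetical, shortened stand-in for enum lockdown_reason. */
enum sk_reason {
	SK_NONE,
	SK_MODULE_SIGNATURE,
	SK_DEV_MEM,
	SK_INTEGRITY_MAX,
	SK_CONFIDENTIALITY_MAX,
};

/*
 * First const: the string contents cannot be modified through the table.
 * Second const: the table entries cannot be pointed at other strings.
 */
static const char *const sk_reasons[SK_CONFIDENTIALITY_MAX + 1] = {
	[SK_NONE] = "none",
	[SK_MODULE_SIGNATURE] = "unsigned module loading",
	[SK_DEV_MEM] = "/dev/mem,kmem,port",
	[SK_INTEGRITY_MAX] = "integrity",
	[SK_CONFIDENTIALITY_MAX] = "confidentiality",
};

/* Bounds-checked lookup; returns NULL for values outside the table. */
static const char *sk_reason_name(int reason)
{
	if (reason < 0 || reason > SK_CONFIDENTIALITY_MAX)
		return NULL;
	return sk_reasons[reason];
}

/*
 * Both of the following would now be rejected at compile time:
 *	sk_reasons[SK_NONE] = "other";
 *	sk_reasons[SK_NONE][0] = 'N';
 */
```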
* [PATCH 2/2] kexec: Fix file verification on S390
From: Matthew Garrett @ 2019-09-10 10:03 UTC (permalink / raw)
To: jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
Matthew Garrett, Philipp Rudo
In-Reply-To: <20190910100318.204420-1-matthewgarrett@google.com>
I accidentally typoed this #ifdef, so verification would always be
disabled.
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reported-by: Philipp Rudo <prudo@linux.ibm.com>
---
arch/s390/kernel/kexec_elf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/s390/kernel/kexec_elf.c b/arch/s390/kernel/kexec_elf.c
index 9b4f37a4edf1..9da6fa30c447 100644
--- a/arch/s390/kernel/kexec_elf.c
+++ b/arch/s390/kernel/kexec_elf.c
@@ -130,7 +130,7 @@ static int s390_elf_probe(const char *buf, unsigned long len)
const struct kexec_file_ops s390_kexec_elf_ops = {
.probe = s390_elf_probe,
.load = s390_elf_load,
-#ifdef CONFIG_KEXEC__SIG
+#ifdef CONFIG_KEXEC_SIG
.verify_sig = s390_verify_sig,
#endif /* CONFIG_KEXEC_SIG */
};
--
2.23.0.162.g0b9fbb3734-goog
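
The failure mode fixed above is easy to reproduce outside the kernel: an
#ifdef on a misspelled macro is not an error, it just evaluates to false,
so the guarded initializer silently disappears. In the sketch below the
struct, the stub functions and the "#define CONFIG_KEXEC_SIG 1" line are
all stand-ins invented for illustration (in a real build the symbol comes
from Kconfig).

```c
#include <stddef.h>

/* Assumption for this sketch: pretend the option is enabled. */
#define CONFIG_KEXEC_SIG 1

/* Hypothetical, reduced ops table. */
struct sk_file_ops {
	int (*probe)(void);
	int (*verify_sig)(void);
};

static int sk_probe(void) { return 0; }
static int sk_verify_sig(void) { return 0; }

/* Misspelled guard: CONFIG_KEXEC__SIG is never defined, so the member
 * is left out and defaults to NULL without any compiler diagnostic. */
static const struct sk_file_ops sk_typo_ops = {
	.probe = sk_probe,
#ifdef CONFIG_KEXEC__SIG
	.verify_sig = sk_verify_sig,
#endif
};

/* Correct guard: the callback is wired up when the option is set. */
static const struct sk_file_ops sk_fixed_ops = {
	.probe = sk_probe,
#ifdef CONFIG_KEXEC_SIG
	.verify_sig = sk_verify_sig,
#endif
};
```

With the typo in place, any caller that treats a NULL verify_sig as "no
verification available" skips the check, which is the "verification would
always be disabled" effect the commit message describes.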
* Re: [PATCH V40 03/29] security: Add a static lockdown policy LSM
From: Matthew Garrett @ 2019-09-10 10:06 UTC (permalink / raw)
To: David Howells
Cc: James Morris, LSM List, Linux Kernel Mailing List, Linux API,
Kees Cook
In-Reply-To: <CACdnJuvR7mqhpzEQZdgw9EE_PsM-QWQ_JmwFLcoeLbAuKCHnOA@mail.gmail.com>
On Wed, Sep 4, 2019 at 12:51 PM Matthew Garrett <mjg59@google.com> wrote:
> On Fri, Aug 30, 2019 at 9:28 AM David Howells <dhowells@redhat.com> wrote:
> > > +static int lock_kernel_down(const char *where, enum lockdown_reason level)
> >
> > Is the last parameter the reason or the level? You're mixing the terms.
>
> Fair.
Actually, on re-reading, I think this is correct - this is setting the
lockdown level, it's just that the lockdown level is an enum
lockdown_reason for the sake of convenience.
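
To illustrate how one ordered enum can act as both "reason" and "level", a
rough userspace sketch follows. It assumes, as my reading of the design
rather than anything quoted in this thread, that the current level is
stored as one of the *_MAX values and that an operation is refused when its
reason sorts at or below that level. All sk_* names and the shortened enum
are hypothetical.

```c
#include <errno.h>

/* Hypothetical, shortened enum: reasons are ordered, *_MAX mark levels. */
enum sk_reason {
	SK_NONE,
	SK_MODULE_SIGNATURE,	/* an integrity-class reason */
	SK_INTEGRITY_MAX,	/* level: integrity */
	SK_KCORE,		/* a confidentiality-class reason */
	SK_CONFIDENTIALITY_MAX,	/* level: confidentiality */
};

/* Current lockdown level, stored using the same enum as the reasons. */
static enum sk_reason sk_level = SK_NONE;

/* Raise the level; refuse to stay at or drop below the current one. */
static int sk_lock_down(enum sk_reason level)
{
	if (sk_level >= level)
		return -EPERM;
	sk_level = level;
	return 0;
}

/* An operation is blocked if its reason is covered by the current level. */
static int sk_is_locked_down(enum sk_reason what)
{
	return sk_level >= what ? -EPERM : 0;
}
```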
* Re: [PATCH 0/2] Minor lockdown fixups
From: James Morris @ 2019-09-10 12:29 UTC (permalink / raw)
To: Matthew Garrett; +Cc: linux-security-module, linux-kernel, linux-api
In-Reply-To: <20190910100318.204420-1-matthewgarrett@google.com>
On Tue, 10 Sep 2019, Matthew Garrett wrote:
> Constify some arrays and fix an #ifdef that I typoed.
>
Applied to
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security.git next-lockdown
and next-testing
--
James Morris
<jmorris@namei.org>
* Re: [RFC] Add critical process prctl
From: Andy Lutomirski @ 2019-09-10 16:56 UTC (permalink / raw)
To: Daniel Colascione; +Cc: Tim Murray, Suren Baghdasaryan, LKML, Linux API
In-Reply-To: <20190905005313.126823-1-dancol@google.com>
On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
>
> A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> meaning that if the task ever exits, the kernel panics. This facility
> is intended for use by low-level core system processes that cannot
> gracefully restart without a reboot. This prctl allows these processes
> to ensure that the system restarts when they die regardless of whether
> the rest of userspace is operational.
The kind of panic produced by init crashing is awful -- logs don't get
written, etc. I'm wondering if you would be better off with a new
watchdog-like device that, when closed, kills the system in a
configurable way (e.g. after a certain amount of time, while still
logging something and having a decent chance of getting the logs
written out.) This could plausibly even be an extension to the
existing /dev/watchdog API.
--Andy