Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
From: Michal Hocko @ 2019-08-14  8:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: khlebnikov, linux-kernel, Minchan Kim, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Robin Murphy, Roman Gushchin
In-Reply-To: <20190813153659.GD14622@google.com>

On Tue 13-08-19 11:36:59, Joel Fernandes wrote:
> On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > Idle page tracking currently does not work well in the following
> > > scenario:
> > >  1. mark page-A idle which was present at that time.
> > >  2. run workload
> > >  3. page-A is not touched by workload
> > >  4. *sudden* memory pressure happen so finally page A is finally swapped out
> > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > >     so idle bit not set in output) - but it's incorrect.
> > > 
> > > To fix this, we store the idle information into a new idle bit of the
> > > swap PTE during swapping of anonymous pages.
> > >
> > > Also in the future, madvise extensions will allow a system process
> > > manager (like Android's ActivityManager) to swap pages out of a process
> > > that it knows will be cold. To an external process like a heap profiler
> > > that is doing idle tracking on another process, this procedure will
> > > interfere with the idle page tracking similar to the above steps.
> > 
> > This could be solved by checking the !present/swapped out pages
> > right? Whoever decided to put the page out to the swap just made it
> > idle effectively.  So the monitor can make some educated guess for
> > tracking. If that is fundamentally not possible then please describe
> > why.
> 
> But the monitoring process (profiler) does not have control over the 'whoever
> made it effectively idle' process.

Why does that matter? Whether it is a global/memcg reclaim or somebody
calling MADV_PAGEOUT or whatever it is a decision to make the page not
hot. Sure you could argue that a missing idle bit on swap entries might
mean that the swap out decision was pre-mature/sub-optimal/wrong but is
this the aim of the interface?

> As you said it will be a guess, it will not be accurate.

Yes and the point I am trying to make is that having some space and not
giving a guarantee sounds like a safer option for this interface because
...
> 
> I am curious what is your concern with using a bit in the swap PTE?

... It is a promiss of the semantic I find limiting for future. The bit
in the pte might turn out insufficient (e.g. pte reclaim) so teaching
the userspace to consider this a hard guarantee is a ticket to problems
later on. Maybe I am overly paranoid because I have seen so many "nice
to have" features turning into a maintenance burden in the past.

If this is really considered mostly debugging purpouse interface then a
certain level of imprecision should be tolerateable. If there is a
really strong real world usecase that simply has no other way to go
then this might be added later. Adding an information is always safer
than take it away.

That being said, if I am a minority voice here then I will not really
stand in the way and won't nack the patch. I will not ack it neither
though.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v8 01/27] Documentation/x86: Add CET description
From: Florian Weimer @ 2019-08-14  8:07 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel
In-Reply-To: <20190813205225.12032-2-yu-cheng.yu@intel.com>

* Yu-cheng Yu:

> +ENDBR
> +    The compiler inserts an ENDBR at all valid branch targets.  Any
> +    CALL/JMP to a target without an ENDBR triggers a control
> +    protection fault.

Is this really correct?  I think ENDBR is needed only for indirect
branch targets where the jump/call does not have a NOTRACK prefix.  In
general, for security hardening, it seems best to minimize the number of
ENDBR instructions, and use NOTRACK for indirect jumps which derive the
branch target address from information that cannot be modified.

Thanks,
Florian

^ permalink raw reply

* Re: [RESEND PATCH 1/2 -mm] mm: account lazy free pages separately
From: Vlastimil Babka @ 2019-08-14 12:49 UTC (permalink / raw)
  To: Yang Shi, Michal Hocko
  Cc: kirill.shutemov, hannes, rientjes, akpm, linux-mm, linux-kernel,
	Linux API
In-Reply-To: <79c90f6b-fcac-02e1-015a-0eaa4eafdf7d@linux.alibaba.com>

On 8/9/19 8:26 PM, Yang Shi wrote:
> Here the new counter is introduced for patch 2/2 to account deferred 
> split THPs into available memory since NR_ANON_THPS may contain 
> non-deferred split THPs.
> 
> I could use an internal counter for deferred split THPs, but if it is 
> accounted by mod_node_page_state, why not just show it in /proc/meminfo? 

The answer to "Why not" is that it becomes part of userspace API (btw this
patchset should have CC'd linux-api@ - please do for further iterations) and
even if the implementation detail of deferred splitting might change in the
future, we'll basically have to keep the counter (even with 0 value) in
/proc/meminfo forever.

Also, quite recently we have added the following counter:

KReclaimable: Kernel allocations that the kernel will attempt to reclaim
              under memory pressure. Includes SReclaimable (below), and other
              direct allocations with a shrinker.

Although THP allocations are not exactly "kernel allocations", once they are
unmapped, they are in fact kernel-only, so IMHO it wouldn't be a big stretch to
add the lazy THP pages there?

> Or we fix NR_ANON_THPS and show deferred split THPs in /proc/meminfo?
> 
>>
> 

^ permalink raw reply

* Re: [RESEND PATCH 1/2 -mm] mm: account lazy free pages separately
From: Michal Hocko @ 2019-08-14 12:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Yang Shi, kirill.shutemov, hannes, rientjes, akpm, linux-mm,
	linux-kernel, Linux API
In-Reply-To: <564a0860-94f1-6301-5527-5c2272931d8b@suse.cz>

On Wed 14-08-19 14:49:18, Vlastimil Babka wrote:
> On 8/9/19 8:26 PM, Yang Shi wrote:
> > Here the new counter is introduced for patch 2/2 to account deferred 
> > split THPs into available memory since NR_ANON_THPS may contain 
> > non-deferred split THPs.
> > 
> > I could use an internal counter for deferred split THPs, but if it is 
> > accounted by mod_node_page_state, why not just show it in /proc/meminfo? 
> 
> The answer to "Why not" is that it becomes part of userspace API (btw this
> patchset should have CC'd linux-api@ - please do for further iterations) and
> even if the implementation detail of deferred splitting might change in the
> future, we'll basically have to keep the counter (even with 0 value) in
> /proc/meminfo forever.
> 
> Also, quite recently we have added the following counter:
> 
> KReclaimable: Kernel allocations that the kernel will attempt to reclaim
>               under memory pressure. Includes SReclaimable (below), and other
>               direct allocations with a shrinker.
> 
> Although THP allocations are not exactly "kernel allocations", once they are
> unmapped, they are in fact kernel-only, so IMHO it wouldn't be a big stretch to
> add the lazy THP pages there?

That would indeed fit in much better than a dedicated counter.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v8 01/27] Documentation/x86: Add CET description
From: Yu-cheng Yu @ 2019-08-14 15:57 UTC (permalink / raw)
  To: Florian Weimer
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel
In-Reply-To: <87tvakgofi.fsf@oldenburg2.str.redhat.com>

On Wed, 2019-08-14 at 10:07 +0200, Florian Weimer wrote:
> * Yu-cheng Yu:
> 
> > +ENDBR
> > +    The compiler inserts an ENDBR at all valid branch targets.  Any
> > +    CALL/JMP to a target without an ENDBR triggers a control
> > +    protection fault.
> 
> Is this really correct?  I think ENDBR is needed only for indirect
> branch targets where the jump/call does not have a NOTRACK prefix.

You are right.  I will fix the wording.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v8 09/27] mm/mmap: Prevent Shadow Stack VMA merges
From: Yu-cheng Yu @ 2019-08-14 16:20 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <5ba3d1b3-5587-e7dd-b9de-9a954172d31f@intel.com>

On Tue, 2019-08-13 at 15:34 -0700, Dave Hansen wrote:
> On 8/13/19 1:52 PM, Yu-cheng Yu wrote:
> > To prevent function call/return spills into the next shadow stack
> > area, do not merge shadow stack areas.
> 
> How does this prevent call/return spills?

It does not.  I will fix the description.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v8 15/27] mm: Handle shadow stack page fault
From: Yu-cheng Yu @ 2019-08-14 16:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz
In-Reply-To: <CALCETrVKbqzivPfUOiGi5efHUpEsfPkNzP0CrmAZzcwUgf7quA@mail.gmail.com>

On Tue, 2019-08-13 at 15:55 -0700, Andy Lutomirski wrote:
> On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > 
> > When a task does fork(), its shadow stack (SHSTK) must be duplicated
> > for the child.  This patch implements a flow similar to copy-on-write
> > of an anonymous page, but for SHSTK.
> > 
> > A SHSTK PTE must be RO and dirty.  This dirty bit requirement is used
> > to effect the copying.  In copy_one_pte(), clear the dirty bit from a
> > SHSTK PTE to cause a page fault upon the next SHSTK access.  At that
> > time, fix the PTE and copy/re-use the page.
> 
> Is using VM_SHSTK and special-casing all of this really better than
> using a special mapping or other pseudo-file-backed VMA and putting
> all the magic in the vm_operations?

A special mapping is cleaner.  However, we also need to exclude normal [RO +
dirty] pages from shadow stack.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
From: Joel Fernandes @ 2019-08-14 16:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: khlebnikov, linux-kernel, Minchan Kim, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Robin Murphy, Roman Gushchin
In-Reply-To: <20190814080531.GP17933@dhcp22.suse.cz>

On Wed, Aug 14, 2019 at 10:05:31AM +0200, Michal Hocko wrote:
> On Tue 13-08-19 11:36:59, Joel Fernandes wrote:
> > On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > > Idle page tracking currently does not work well in the following
> > > > scenario:
> > > >  1. mark page-A idle which was present at that time.
> > > >  2. run workload
> > > >  3. page-A is not touched by workload
> > > >  4. *sudden* memory pressure happen so finally page A is finally swapped out
> > > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > > >     so idle bit not set in output) - but it's incorrect.
> > > > 
> > > > To fix this, we store the idle information into a new idle bit of the
> > > > swap PTE during swapping of anonymous pages.
> > > >
> > > > Also in the future, madvise extensions will allow a system process
> > > > manager (like Android's ActivityManager) to swap pages out of a process
> > > > that it knows will be cold. To an external process like a heap profiler
> > > > that is doing idle tracking on another process, this procedure will
> > > > interfere with the idle page tracking similar to the above steps.
> > > 
> > > This could be solved by checking the !present/swapped out pages
> > > right? Whoever decided to put the page out to the swap just made it
> > > idle effectively.  So the monitor can make some educated guess for
> > > tracking. If that is fundamentally not possible then please describe
> > > why.
> > 
> > But the monitoring process (profiler) does not have control over the 'whoever
> > made it effectively idle' process.
> 
> Why does that matter? Whether it is a global/memcg reclaim or somebody
> calling MADV_PAGEOUT or whatever it is a decision to make the page not
> hot. Sure you could argue that a missing idle bit on swap entries might
> mean that the swap out decision was pre-mature/sub-optimal/wrong but is
> this the aim of the interface?
> 
> > As you said it will be a guess, it will not be accurate.
> 
> Yes and the point I am trying to make is that having some space and not
> giving a guarantee sounds like a safer option for this interface because

I do see your point of view, but jJust because a future (and possibly not
going to happen) usecase which you mentioned as pte reclaim, makes you feel
that userspace may be subject to inaccuracies anyway, doesn't mean we should
make everything inaccurate..  We already know idle page tracking is not
completely accurate. But that doesn't mean we miss out on the opportunity to
make the "non pte-reclaim" usecase inaccurate as well. 

IMO, we should do our best for today, and not hypothesize. How likely is pte
reclaim and is there a thread to describe that direction?

> > I am curious what is your concern with using a bit in the swap PTE?
> 
> ... It is a promiss of the semantic I find limiting for future. The bit
> in the pte might turn out insufficient (e.g. pte reclaim) so teaching
> the userspace to consider this a hard guarantee is a ticket to problems
> later on. Maybe I am overly paranoid because I have seen so many "nice
> to have" features turning into a maintenance burden in the past.
> 
> If this is really considered mostly debugging purpouse interface then a
> certain level of imprecision should be tolerateable. If there is a
> really strong real world usecase that simply has no other way to go
> then this might be added later. Adding an information is always safer
> than take it away.
> 
> That being said, if I am a minority voice here then I will not really
> stand in the way and won't nack the patch. I will not ack it neither
> though.

Ok.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v3 01/12] fpga: dfl: fme: support 512bit data width PR
From: Scott Wood @ 2019-08-14 16:34 UTC (permalink / raw)
  To: Wu Hao, Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Ananda Ravuri, Xu Yilun
In-Reply-To: <20190724142235.GE8463@hao-dev>

On Wed, 2019-07-24 at 22:22 +0800, Wu Hao wrote:
> On Wed, Jul 24, 2019 at 11:35:32AM +0200, Greg KH wrote:
> > On Tue, Jul 23, 2019 at 12:51:24PM +0800, Wu Hao wrote:
> > >  
> > > @@ -67,8 +69,43 @@
> > >  #define PR_WAIT_TIMEOUT   8000000
> > >  #define PR_HOST_STATUS_IDLE	0
> > >  
> > > +#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512)
> > > +
> > > +#include <linux/cpufeature.h>
> > > +#include <asm/fpu/api.h>
> > > +
> > > +static inline int is_cpu_avx512_enabled(void)
> > > +{
> > > +	return cpu_feature_enabled(X86_FEATURE_AVX512F);
> > > +}
> > 
> > That's a very arch specific function, why would a driver ever care about
> > this?
> 
> Yes, this is only applied to a specific FPGA solution, which FPGA
> has been integrated with XEON. Hardware indicates this using register
> to software. As it's cpu integrated solution, so CPU always has this
> AVX512 capability. The only check we do, is make sure this is not
> manually disabled by kernel.
> 
> With this hardware, software could use AVX512 to accelerate the FPGA
> partial reconfiguration as mentioned in the patch commit message.
> It brings performance benifits to people who uses it. This is only one
> optimization (512 vs 32bit data write to hw) for a specific hardware.

I thought earlier you said that 512 bit accesses were required for this
particular integrated-only version of the device, and not just an
optimization?

> > > +#else
> > > +static inline int is_cpu_avx512_enabled(void)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static inline void copy512(const void *src, void __iomem *dst)
> > > +{
> > > +	WARN_ON_ONCE(1);
> > 
> > Are you trying to get reports from syzbot?  :)
> 
> Oh.. no.. I will remove it. :)
> 
> Thank you very much!

What's wrong with this?  The driver should never call copy512() if
is_cpu_avx512_enabled() returns 0, and if syzbot can somehow make the driver
do so, then yes we do want a report.

-Scott

^ permalink raw reply

* Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW
From: Yu-cheng Yu @ 2019-08-14 16:42 UTC (permalink / raw)
  To: Dave Hansen, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <dac2d62b-9045-4767-87dd-eac12e7abafd@intel.com>

On Tue, 2019-08-13 at 16:02 -0700, Dave Hansen wrote:
[...]
> Please also reconcile the supervisor XSAVE portion of your patches with
> the ones that Fenghua has been sending around.  I've given quite a bit
> of feedback to improve those.  Please consolidate and agree on a common
> set of patches with him.

XSAVES supervisor is now a six-patch set.  Maybe we can make it a separate
series?  I will consolidate and send it out.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v8 15/27] mm: Handle shadow stack page fault
From: Dave Hansen @ 2019-08-14 16:48 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz
In-Reply-To: <eabd0c16bd2028ad9fef8d10ddf570b3a10d5680.camel@intel.com>

On 8/14/19 9:27 AM, Yu-cheng Yu wrote:
> On Tue, 2019-08-13 at 15:55 -0700, Andy Lutomirski wrote:
>> On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>> When a task does fork(), its shadow stack (SHSTK) must be duplicated
>>> for the child.  This patch implements a flow similar to copy-on-write
>>> of an anonymous page, but for SHSTK.
>>>
>>> A SHSTK PTE must be RO and dirty.  This dirty bit requirement is used
>>> to effect the copying.  In copy_one_pte(), clear the dirty bit from a
>>> SHSTK PTE to cause a page fault upon the next SHSTK access.  At that
>>> time, fix the PTE and copy/re-use the page.
>> Is using VM_SHSTK and special-casing all of this really better than
>> using a special mapping or other pseudo-file-backed VMA and putting
>> all the magic in the vm_operations?
> A special mapping is cleaner.  However, we also need to exclude normal [RO +
> dirty] pages from shadow stack.

I don't understand what you are saying.

Are you saying that we need this VM_SHSTK flag in order to exclude
RO+HW-Dirty pages from being created in non-shadow-stack VMAs?

^ permalink raw reply

* Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW
From: Dave Hansen @ 2019-08-14 16:58 UTC (permalink / raw)
  To: Yu-cheng Yu, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Andy Lutomirski, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <c7731c682b55ec882ad3d4ea11ad7a823dcaae8f.camel@intel.com>

On 8/14/19 9:42 AM, Yu-cheng Yu wrote:
> On Tue, 2019-08-13 at 16:02 -0700, Dave Hansen wrote:
> [...]
>> Please also reconcile the supervisor XSAVE portion of your patches with
>> the ones that Fenghua has been sending around.  I've given quite a bit
>> of feedback to improve those.  Please consolidate and agree on a common
>> set of patches with him.
> XSAVES supervisor is now a six-patch set.  Maybe we can make it a separate
> series?  I will consolidate and send it out.

A separate series would be great.

Please also make sure it's in a (temporary) git tree somewhere so that
it's easy to base other sets on top of it.

^ permalink raw reply

* Re: [PATCH v8 15/27] mm: Handle shadow stack page fault
From: Yu-cheng Yu @ 2019-08-14 17:00 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz
In-Reply-To: <bf8a6390-97a6-1ab6-90ef-6399437ed38c@intel.com>

On Wed, 2019-08-14 at 09:48 -0700, Dave Hansen wrote:
> On 8/14/19 9:27 AM, Yu-cheng Yu wrote:
> > On Tue, 2019-08-13 at 15:55 -0700, Andy Lutomirski wrote:
> > > On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > > When a task does fork(), its shadow stack (SHSTK) must be duplicated
> > > > for the child.  This patch implements a flow similar to copy-on-write
> > > > of an anonymous page, but for SHSTK.
> > > > 
> > > > A SHSTK PTE must be RO and dirty.  This dirty bit requirement is used
> > > > to effect the copying.  In copy_one_pte(), clear the dirty bit from a
> > > > SHSTK PTE to cause a page fault upon the next SHSTK access.  At that
> > > > time, fix the PTE and copy/re-use the page.
> > > 
> > > Is using VM_SHSTK and special-casing all of this really better than
> > > using a special mapping or other pseudo-file-backed VMA and putting
> > > all the magic in the vm_operations?
> > 
> > A special mapping is cleaner.  However, we also need to exclude normal [RO +
> > dirty] pages from shadow stack.
> 
> I don't understand what you are saying.
> 
> Are you saying that we need this VM_SHSTK flag in order to exclude
> RO+HW-Dirty pages from being created in non-shadow-stack VMAs?

We use VM_SHSTK for page fault handling (the special-casing).  If we have a
special mapping, all these become cleaner (but more code).  However, we still
need most of the PTE macros (e.g. ptep_set_wrprotect, PAGE_DIRTY_SW, etc.).

Yu-cheng

^ permalink raw reply

* Re: [PATCH V38 15/29] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
From: Matthew Garrett @ 2019-08-14 17:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: James Morris, LSM List, Linux Kernel Mailing List, Linux API,
	Josh Boyer, David Howells, Kees Cook, Dave Young, linux-acpi
In-Reply-To: <20190814072602.GA27836@zn.tnic>

On Wed, Aug 14, 2019 at 12:25 AM Borislav Petkov <bp@alien8.de> wrote:
> #if defined(CONFIG_RANDOMIZE_BASE) && defined(CONFIG_MEMORY_HOTREMOVE)
>
> false and thus not available to early code anymore.

We explicitly don't want to pay attention to the acpi_rsdp kernel
parameter in early boot except for the case of finding the SRAT table,
and we only need that if CONFIG_RANDOMIZE_BASE and
CONFIG_MEMORY_HOTREMOVE are set. However, we *do* want to tell the
actual kernel where the RSDP is if we found it via some other means,
so we can't just clear the boot parameters value. The kernel proper
will parse the command line again and will then (if lockdown isn't
enabled) override the actual value we passed up in boot params. So I
think this is ok?

(Sorry for not Cc:ing x86, clear oversight on my part)

^ permalink raw reply

* Re: [PATCH V38 15/29] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
From: Borislav Petkov @ 2019-08-14 17:47 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: James Morris, LSM List, Linux Kernel Mailing List, Linux API,
	Josh Boyer, David Howells, Kees Cook, Dave Young, linux-acpi
In-Reply-To: <CACdnJuuOhuw71GDwQOrPQxUexpvpZNJocr6m0dYzJw+MMaVKWQ@mail.gmail.com>

On Wed, Aug 14, 2019 at 10:14:54AM -0700, Matthew Garrett wrote:
> We explicitly don't want to pay attention to the acpi_rsdp kernel
> parameter in early boot except for the case of finding the SRAT table,
> and we only need that if CONFIG_RANDOMIZE_BASE and
> CONFIG_MEMORY_HOTREMOVE are set. However, we *do* want to tell the
> actual kernel where the RSDP is if we found it via some other means,
> so we can't just clear the boot parameters value.

Ok.

> The kernel proper will parse the command line again and will then (if
> lockdown isn't enabled) override the actual value we passed up in boot
> params.

Yeah, ok, I see what you're doing there. AFAICT, you do that in

setup_arch->acpi_boot_table_init-> ... -> acpi_os_get_root_pointer()

I hope nothing needs it earlier because then we'll have to restructure
again...

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-14 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Colascione
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814005737.4qg6wh4a53vmso2v@ast-mbp>

On Tue, Aug 13, 2019 at 5:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:

> hmm. No. Kernel developers should not make any assumptions.
> They should guide their design by real use cases instead. That includes studing
> what people do now and hacks they use to workaround lack of interfaces.
> Effecitvely bpf is root only. There are no unpriv users.
> This root applications go out of their way to reduce privileges
> while they still want to use bpf. That is the need that /dev/bpf is solving.
>
> >
> > > Containers are not providing the level of security that is enough
> > > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> > > Containers are used to make production systems safer.
> > > Some people call it more 'secure', but it's clearly not secure for
> > > arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> > > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > > It's been a constant source of pain. The constant blinding, randomization,
> > > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > > are simply not worth it. It's a lot of complex kernel code without users.
> >
> > Seccomp really will want eBPF some day, and it should work without
> > privilege.  Maybe it should be a restricted subset of eBPF, and
> > Spectre will always be an issue until dramatically better hardware
> > shows up, but I think people will want the ability for regular
> > programs to load eBPF seccomp programs.
>
> I'm absolutely against using eBPF in seccomp.
> Precisely due to discussions like the current one.

I still think I don't really agree with your overall premise.

If eBPF is genuinely not usable by programs that are not fully trusted
by the admin, then no kernel changes at all are needed.  Programs that
want to reduce their own privileges can easily fork() a privileged
subprocess or run a little helper to which they delegate BPF
operations.  This is far more flexible than anything that will ever be
in the kernel because it allows the helper to verify that the rest of
the program is doing exactly what it's supposed to and restrict eBPF
operations to exactly the subset that is needed.  So a container
manager or network manager that drops some provilege could have a
little bpf-helper that manages its BPF XDP, firewalling, etc
configuration.  The two processes would talk over a socketpair.

The interesting cases you're talking about really *do* involved
unprivileged or less privileged eBPF, though.  Let's see:

systemd --user: systemd --user *is not privileged at all*.  There's no
issue of reducing privilege, since systemd --user doesn't have any
privilege to begin with.  But systemd supports some eBPF features, and
presumably it would like to support them in the systemd --user case.
This is unprivileged eBPF.

Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
it doesn't involve the bpf() syscall.  There are some seccomp
proposals in the works that will want some stuff from eBPF.  In
particular, the ability to call seccomp-specific bpf functions from a
seccomp program could be very nice. Similarly, the ability to use the
enhanced instruction set and maybe even *read* maps would be nice.  I
do think that seccomp will continue to want its programs to be
stateless.

So it's a bit of a chicken-and-egg situation.  There aren't major
unprivileged eBPF users because the kernel support isn't there.

>
> >
> > > Hence I prefer this /dev/bpf mechanism to be as simple a possible.
> > > The applications that will use it are going to be just as trusted as systemd.
> >
> > I still don't understand your systemd example.  systemd --users is not
> > trusted systemwide in any respect.  The main PID 1 systemd is root.
> > No matter how you dice it, granting a user systemd instance extra bpf
> > access is tantamount to granting the user extra bpf access in general.
>
> People use systemd --user while their kernel have 'undef CONFIG_USER_NS'.

I don't know what you're getting at.  I'm typing this email in a
browser running under a systemd --user instance, and there are no user
namespaces involved.

$ ps -u luto |grep systemd
 1944 ?        00:00:02 systemd
$ stat /proc/1944
...
Access: (0555/dr-xr-xr-x)  Uid: ( 1000/    luto)   Gid: ( 1000/    luto)
Context: unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
$ gdb -p 1944
[snipped tons of output, but gdb works fine like this]

systemd --user is not privileged.  Giving it /dev/bpf as imagined by
the current set of patches would be a gaping security hole.

>
> I think there should be no unprivileged bpf at all,
> because over all these years we've seen zero use cases.
> Hence all new features are root only.

You're the maintainer.  If you feel this way, then I think you should
just drop the /dev/bpf idea entirely and have userspace manage all of
this by itself.  It will remain extremely awkward for containers and
especially nested containers to use eBPF.

--Andy

^ permalink raw reply

* Re: [PATCH V38 15/29] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
From: Matthew Garrett @ 2019-08-14 18:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: James Morris, LSM List, Linux Kernel Mailing List, Linux API,
	Josh Boyer, David Howells, Kees Cook, Dave Young, linux-acpi
In-Reply-To: <20190814174732.GD1841@zn.tnic>

On Wed, Aug 14, 2019 at 10:46 AM Borislav Petkov <bp@alien8.de> wrote:
> Yeah, ok, I see what you're doing there. AFAICT, you do that in
>
> setup_arch->acpi_boot_table_init-> ... -> acpi_os_get_root_pointer()

Right.

> I hope nothing needs it earlier because then we'll have to restructure
> again...

Passing the RSDP via command line is a pretty grotesque hack - we
should just set up boot params in kexec_file, which would leave this
as a problem for legacy kexec and nothing else.

^ permalink raw reply

* Re: [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
From: Michal Hocko @ 2019-08-14 18:36 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: khlebnikov, linux-kernel, Minchan Kim, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Mike Rapoport, namhyung, paulmck,
	Robin Murphy, Roman Gushchin
In-Reply-To: <20190814163203.GB59398@google.com>

On Wed 14-08-19 12:32:03, Joel Fernandes wrote:
> On Wed, Aug 14, 2019 at 10:05:31AM +0200, Michal Hocko wrote:
> > On Tue 13-08-19 11:36:59, Joel Fernandes wrote:
> > > On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > > > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > > > Idle page tracking currently does not work well in the following
> > > > > scenario:
> > > > >  1. mark page-A idle which was present at that time.
> > > > >  2. run workload
> > > > >  3. page-A is not touched by workload
> > > > >  4. *sudden* memory pressure happen so finally page A is finally swapped out
> > > > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > > > >     so idle bit not set in output) - but it's incorrect.
> > > > > 
> > > > > To fix this, we store the idle information into a new idle bit of the
> > > > > swap PTE during swapping of anonymous pages.
> > > > >
> > > > > Also in the future, madvise extensions will allow a system process
> > > > > manager (like Android's ActivityManager) to swap pages out of a process
> > > > > that it knows will be cold. To an external process like a heap profiler
> > > > > that is doing idle tracking on another process, this procedure will
> > > > > interfere with the idle page tracking similar to the above steps.
> > > > 
> > > > This could be solved by checking the !present/swapped out pages
> > > > right? Whoever decided to put the page out to the swap just made it
> > > > idle effectively.  So the monitor can make some educated guess for
> > > > tracking. If that is fundamentally not possible then please describe
> > > > why.
> > > 
> > > But the monitoring process (profiler) does not have control over the 'whoever
> > > made it effectively idle' process.
> > 
> > Why does that matter? Whether it is a global/memcg reclaim or somebody
> > calling MADV_PAGEOUT or whatever it is a decision to make the page not
> > hot. Sure you could argue that a missing idle bit on swap entries might
> > mean that the swap out decision was pre-mature/sub-optimal/wrong but is
> > this the aim of the interface?
> > 
> > > As you said it will be a guess, it will not be accurate.
> > 
> > Yes and the point I am trying to make is that having some space and not
> > giving a guarantee sounds like a safer option for this interface because
> 
> I do see your point of view, but jJust because a future (and possibly not
> going to happen) usecase which you mentioned as pte reclaim, makes you feel
> that userspace may be subject to inaccuracies anyway, doesn't mean we should
> make everything inaccurate..  We already know idle page tracking is not
> completely accurate. But that doesn't mean we miss out on the opportunity to
> make the "non pte-reclaim" usecase inaccurate as well. 

Just keep in mind that you will add more burden to future features
because they would have to somehow overcome this user visible behavior
and we will get to the usual question - Is this going to break
something that relies on the idle bit being stable?

> IMO, we should do our best for today, and not hypothesize. How likely is pte
> reclaim and is there a thread to describe that direction?

Not that I am aware of now but with large NVDIMM mapped files I can see
that this will get more and more interesting.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-14 22:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CALCETrUkqUprujww26VxHwkdXQ3DWJH8nnL2VBYpK2EU0oX_YA@mail.gmail.com>

On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> 
> If eBPF is genuinely not usable by programs that are not fully trusted
> by the admin, then no kernel changes at all are needed.  Programs that
> want to reduce their own privileges can easily fork() a privileged
> subprocess or run a little helper to which they delegate BPF
> operations.  This is far more flexible than anything that will ever be
> in the kernel because it allows the helper to verify that the rest of
> the program is doing exactly what it's supposed to and restrict eBPF
> operations to exactly the subset that is needed.  So a container
> manager or network manager that drops some provilege could have a
> little bpf-helper that manages its BPF XDP, firewalling, etc
> configuration.  The two processes would talk over a socketpair.

there were three projects that tried to delegate bpf operations.
All of them failed.
bpf operational workflow is much more complex than you're imagining.
fork() also doesn't work for all cases.
I gave this example before: consider multiple systemd-like deamons
that need to do bpf operations that want to pass this 'bpf capability'
to other deamons written by other teams. Some of them will start
non-root, but still need to do bpf. They will be rpm installed
and live upgraded while running.
We considered to make systemd such centralized bpf delegation
authority too. It didn't work. bpf in kernel grows quickly.
libbpf part grows independently. llvm keeps evolving.
All of them are being changed while system overall has to stay
operational. Centralized approach breaks apart.

> The interesting cases you're talking about really *do* involved
> unprivileged or less privileged eBPF, though.  Let's see:
> 
> systemd --user: systemd --user *is not privileged at all*.  There's no
> issue of reducing privilege, since systemd --user doesn't have any
> privilege to begin with.  But systemd supports some eBPF features, and
> presumably it would like to support them in the systemd --user case.
> This is unprivileged eBPF.

Let's disambiguate the terminology.
This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
I think that was a mistake.
Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
This is not unprivileged.
'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.

There is a huge difference between the two.
I'm against extending 'unprivileged bpf' even a bit more than what it is
today for many reasons mentioned earlier.
The /dev/bpf is about 'less privileged'.
Less privileged than root. We need to split part of full root capability
into bpf capability. So that most of the root can be dropped.
This is very similar to what cap_net_admin does.
cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
cap_net_admin is very much privileged. Just 'less privileged' than root.
Same thing for cap_bpf.

May be we should do both cap_bpf and /dev/bpf to make it clear that
this is the same thing. Two interfaces to achieve the same result.

> Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
> it doesn't involve the bpf() syscall.  There are some seccomp
> proposals in the works that will want some stuff from eBPF.  In

I'm afraid these proposals won't go anywhere.

> So it's a bit of a chicken-and-egg situation.  There aren't major
> unprivileged eBPF users because the kernel support isn't there.

As I said before there are zero known use cases of 'unprivileged bpf'.

If I understand you correctly you're refusing to accept that
'less privileged bpf' is a valid use case while pushing for extending
scope of 'unprivileged'.
Extending the scope is an orthogonal discussion. Currently I'm
opposed to that. Whereas 'less privileged' is what people require.

> It will remain extremely awkward for containers and
> especially nested containers to use eBPF.

I'm afraid we have to agree to disagree here and move on.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-14 22:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814220545.co5pucyo5jk3weiv@ast-mbp.dhcp.thefacebook.com>



> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>> 
>> If eBPF is genuinely not usable by programs that are not fully trusted
>> by the admin, then no kernel changes at all are needed.  Programs that
>> want to reduce their own privileges can easily fork() a privileged
>> subprocess or run a little helper to which they delegate BPF
>> operations.  This is far more flexible than anything that will ever be
>> in the kernel because it allows the helper to verify that the rest of
>> the program is doing exactly what it's supposed to and restrict eBPF
>> operations to exactly the subset that is needed.  So a container
>> manager or network manager that drops some provilege could have a
>> little bpf-helper that manages its BPF XDP, firewalling, etc
>> configuration.  The two processes would talk over a socketpair.
> 
> there were three projects that tried to delegate bpf operations.
> All of them failed.
> bpf operational workflow is much more complex than you're imagining.
> fork() also doesn't work for all cases.
> I gave this example before: consider multiple systemd-like deamons
> that need to do bpf operations that want to pass this 'bpf capability'
> to other deamons written by other teams. Some of them will start
> non-root, but still need to do bpf. They will be rpm installed
> and live upgraded while running.
> We considered to make systemd such centralized bpf delegation
> authority too. It didn't work. bpf in kernel grows quickly.
> libbpf part grows independently. llvm keeps evolving.
> All of them are being changed while system overall has to stay
> operational. Centralized approach breaks apart.
> 
>> The interesting cases you're talking about really *do* involved
>> unprivileged or less privileged eBPF, though.  Let's see:
>> 
>> systemd --user: systemd --user *is not privileged at all*.  There's no
>> issue of reducing privilege, since systemd --user doesn't have any
>> privilege to begin with.  But systemd supports some eBPF features, and
>> presumably it would like to support them in the systemd --user case.
>> This is unprivileged eBPF.
> 
> Let's disambiguate the terminology.
> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> I think that was a mistake.
> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> This is not unprivileged.
> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> 
> There is a huge difference between the two.
> I'm against extending 'unprivileged bpf' even a bit more than what it is
> today for many reasons mentioned earlier.
> The /dev/bpf is about 'less privileged'.
> Less privileged than root. We need to split part of full root capability
> into bpf capability. So that most of the root can be dropped.
> This is very similar to what cap_net_admin does.
> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> cap_net_admin is very much privileged. Just 'less privileged' than root.
> Same thing for cap_bpf.

The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?

> 
> May be we should do both cap_bpf and /dev/bpf to make it clear that
> this is the same thing. Two interfaces to achieve the same result.

What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.

> 
>> Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
>> it doesn't involve the bpf() syscall.  There are some seccomp
>> proposals in the works that will want some stuff from eBPF.  In
> 
> I'm afraid these proposals won't go anywhere.

Can you explain why?

> 
>> So it's a bit of a chicken-and-egg situation.  There aren't major
>> unprivileged eBPF users because the kernel support isn't there.
> 
> As I said before there are zero known use cases of 'unprivileged bpf'.
> 
> If I understand you correctly you're refusing to accept that
> 'less privileged bpf' is a valid use case while pushing for extending
> scope of 'unprivileged'.

No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.

^ permalink raw reply

* Re: [PATCH v8 10/20] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Eric Biggers @ 2019-08-14 22:35 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	linux-ext4, Paul Crowley
In-Reply-To: <20190813000644.GH28705@mit.edu>

On Mon, Aug 12, 2019 at 08:06:44PM -0400, Theodore Y. Ts'o wrote:
> > +		/* Some inodes still reference this key; try to evict them. */
> > +		if (try_to_lock_encrypted_files(sb, mk) != 0)
> > +			status_flags |=
> > +				FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY;
> > +	}
> 
> try_to_lock_encrypted_files() can return other errors besides -EBUSY;
> in particular sync_filesystem() can return other errors, such as -EIO
> or -EFSCORUPTED.  In that case, I think we're better off returning the
> relevant status code back to the user.  We will have already wiped the
> master key, but this situation will only happen in exceptional
> conditions (e.g., user has ejected the sdcard, etc.), so it's not
> worth it to try to undo the master key wipe to try to restore things
> to the pre-ioctl execution state.
> 
> So I think we should capture the return code from
> try_to_lock_encrypted_files, and if it is EBUSY, we can set FILES_BUSY
> flag and return success.  Otherwise, we should return the error.
> 
> If you agree, please fix that up and then feel free to add:
> 
> Reviewed-by: Theodore Ts'o <tytso@mit.edu>
> 
> 						- Ted

Yes, that makes sense.  I've made the following change to this patch:

diff --git a/fs/crypto/keyring.c b/fs/crypto/keyring.c
index 9901593051424b..c3423f0edc7014 100644
--- a/fs/crypto/keyring.c
+++ b/fs/crypto/keyring.c
@@ -479,6 +479,7 @@ int fscrypt_ioctl_remove_key(struct file *filp, void __user *_uarg)
 	struct key *key;
 	struct fscrypt_master_key *mk;
 	u32 status_flags = 0;
+	int err;
 	bool dead;
 
 	if (copy_from_user(&arg, uarg, sizeof(arg)))
@@ -514,11 +515,15 @@ int fscrypt_ioctl_remove_key(struct file *filp, void __user *_uarg)
 		 * key object is free to be removed from the keyring.
 		 */
 		key_invalidate(key);
+		err = 0;
 	} else {
 		/* Some inodes still reference this key; try to evict them. */
-		if (try_to_lock_encrypted_files(sb, mk) != 0)
+		err = try_to_lock_encrypted_files(sb, mk);
+		if (err == -EBUSY) {
 			status_flags |=
 				FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY;
+			err = 0;
+		}
 	}
 	/*
 	 * We return 0 if we successfully did something: wiped the secret, or
@@ -527,7 +532,9 @@ int fscrypt_ioctl_remove_key(struct file *filp, void __user *_uarg)
 	 * including all files locked.
 	 */
 	key_put(key);
-	return put_user(status_flags, &uarg->removal_status_flags);
+	if (err == 0)
+		err = put_user(status_flags, &uarg->removal_status_flags);
+	return err;
 }
 EXPORT_SYMBOL_GPL(fscrypt_ioctl_remove_key);
 

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply related

* Re: [PATCH v8 00/20] fscrypt: key management improvements
From: Eric Biggers @ 2019-08-14 22:37 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: linux-ext4, Theodore Ts'o, linux-api, linux-f2fs-devel,
	keyrings, linux-mtd, linux-crypto, linux-fsdevel, Jaegeuk Kim,
	Satya Tangirala, Paul Crowley
In-Reply-To: <20190805162521.90882-1-ebiggers@kernel.org>

On Mon, Aug 05, 2019 at 09:25:01AM -0700, Eric Biggers wrote:
> Hello,
> 
> [Note: I'd like to apply this for v5.4.  Additional review is greatly
>  appreciated, especially of the API before it's set in stone.  Thanks!]
> 
> This patchset makes major improvements to how keys are added, removed,
> and derived in fscrypt, aka ext4/f2fs/ubifs encryption.  It does this by
> adding new ioctls that add and remove encryption keys directly to/from
> the filesystem, and by adding a new encryption policy version ("v2")
> where the user-provided keys are only used as input to HKDF-SHA512 and
> are identified by their cryptographic hash.
> 
> All new APIs and all cryptosystem changes are documented in
> Documentation/filesystems/fscrypt.rst.  Userspace can use the new key
> management ioctls with existing encrypted directories, but migrating to
> v2 encryption policies is needed for the full benefits.
> 

I've applied this patchset to the fscrypt tree for 5.4.

- Eric

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-14 23:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <AD211133-EA60-4B91-AB1B-201713F50AB2@amacapital.net>

On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> >> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >> 
> >> If eBPF is genuinely not usable by programs that are not fully trusted
> >> by the admin, then no kernel changes at all are needed.  Programs that
> >> want to reduce their own privileges can easily fork() a privileged
> >> subprocess or run a little helper to which they delegate BPF
> >> operations.  This is far more flexible than anything that will ever be
> >> in the kernel because it allows the helper to verify that the rest of
> >> the program is doing exactly what it's supposed to and restrict eBPF
> >> operations to exactly the subset that is needed.  So a container
> >> manager or network manager that drops some provilege could have a
> >> little bpf-helper that manages its BPF XDP, firewalling, etc
> >> configuration.  The two processes would talk over a socketpair.
> > 
> > there were three projects that tried to delegate bpf operations.
> > All of them failed.
> > bpf operational workflow is much more complex than you're imagining.
> > fork() also doesn't work for all cases.
> > I gave this example before: consider multiple systemd-like deamons
> > that need to do bpf operations that want to pass this 'bpf capability'
> > to other deamons written by other teams. Some of them will start
> > non-root, but still need to do bpf. They will be rpm installed
> > and live upgraded while running.
> > We considered to make systemd such centralized bpf delegation
> > authority too. It didn't work. bpf in kernel grows quickly.
> > libbpf part grows independently. llvm keeps evolving.
> > All of them are being changed while system overall has to stay
> > operational. Centralized approach breaks apart.
> > 
> >> The interesting cases you're talking about really *do* involved
> >> unprivileged or less privileged eBPF, though.  Let's see:
> >> 
> >> systemd --user: systemd --user *is not privileged at all*.  There's no
> >> issue of reducing privilege, since systemd --user doesn't have any
> >> privilege to begin with.  But systemd supports some eBPF features, and
> >> presumably it would like to support them in the systemd --user case.
> >> This is unprivileged eBPF.
> > 
> > Let's disambiguate the terminology.
> > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > I think that was a mistake.
> > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > This is not unprivileged.
> > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> > 
> > There is a huge difference between the two.
> > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > today for many reasons mentioned earlier.
> > The /dev/bpf is about 'less privileged'.
> > Less privileged than root. We need to split part of full root capability
> > into bpf capability. So that most of the root can be dropped.
> > This is very similar to what cap_net_admin does.
> > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > Same thing for cap_bpf.
> 
> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?

Initially I agreed that it's probably too broad, but then realized
that they're perfect as-is. There is no need to partition further.

> > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > this is the same thing. Two interfaces to achieve the same result.
> 
> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.

Indeed, ambient capabilities should work for all cases.

> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.

There are not that many bits left. I prefer to consume single CAP_BPF bit.
All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
This is no-brainer.

The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
should be extended to CAP_BPF or not.
imo devmap and xskmap can stay CAP_NET_ADMIN,
but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
Initially cgroup-bpf hooks were limited to networking.
It's no longer the case. Requiring NET_ADMIN there make little sense now.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-14 23:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814233335.37t4zfsiswrpd4d6@ast-mbp.dhcp.thefacebook.com>



> On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
>> 
>> 
>>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> 
>>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>>>> 
>>>> If eBPF is genuinely not usable by programs that are not fully trusted
>>>> by the admin, then no kernel changes at all are needed.  Programs that
>>>> want to reduce their own privileges can easily fork() a privileged
>>>> subprocess or run a little helper to which they delegate BPF
>>>> operations.  This is far more flexible than anything that will ever be
>>>> in the kernel because it allows the helper to verify that the rest of
>>>> the program is doing exactly what it's supposed to and restrict eBPF
>>>> operations to exactly the subset that is needed.  So a container
>>>> manager or network manager that drops some provilege could have a
>>>> little bpf-helper that manages its BPF XDP, firewalling, etc
>>>> configuration.  The two processes would talk over a socketpair.
>>> 
>>> there were three projects that tried to delegate bpf operations.
>>> All of them failed.
>>> bpf operational workflow is much more complex than you're imagining.
>>> fork() also doesn't work for all cases.
>>> I gave this example before: consider multiple systemd-like deamons
>>> that need to do bpf operations that want to pass this 'bpf capability'
>>> to other deamons written by other teams. Some of them will start
>>> non-root, but still need to do bpf. They will be rpm installed
>>> and live upgraded while running.
>>> We considered to make systemd such centralized bpf delegation
>>> authority too. It didn't work. bpf in kernel grows quickly.
>>> libbpf part grows independently. llvm keeps evolving.
>>> All of them are being changed while system overall has to stay
>>> operational. Centralized approach breaks apart.
>>> 
>>>> The interesting cases you're talking about really *do* involved
>>>> unprivileged or less privileged eBPF, though.  Let's see:
>>>> 
>>>> systemd --user: systemd --user *is not privileged at all*.  There's no
>>>> issue of reducing privilege, since systemd --user doesn't have any
>>>> privilege to begin with.  But systemd supports some eBPF features, and
>>>> presumably it would like to support them in the systemd --user case.
>>>> This is unprivileged eBPF.
>>> 
>>> Let's disambiguate the terminology.
>>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
>>> I think that was a mistake.
>>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
>>> This is not unprivileged.
>>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>>> 
>>> There is a huge difference between the two.
>>> I'm against extending 'unprivileged bpf' even a bit more than what it is
>>> today for many reasons mentioned earlier.
>>> The /dev/bpf is about 'less privileged'.
>>> Less privileged than root. We need to split part of full root capability
>>> into bpf capability. So that most of the root can be dropped.
>>> This is very similar to what cap_net_admin does.
>>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
>>> cap_net_admin is very much privileged. Just 'less privileged' than root.
>>> Same thing for cap_bpf.
>> 
>> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
> 
> Initially I agreed that it's probably too broad, but then realized
> that they're perfect as-is. There is no need to partition further.
> 
>>> May be we should do both cap_bpf and /dev/bpf to make it clear that
>>> this is the same thing. Two interfaces to achieve the same result.
>> 
>> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
> 
> Indeed, ambient capabilities should work for all cases.
> 
>> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.
> 
> There are not that many bits left. I prefer to consume single CAP_BPF bit.
> All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> This is no-brainer.
> 
> The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> should be extended to CAP_BPF or not.
> imo devmap and xskmap can stay CAP_NET_ADMIN,
> but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> Initially cgroup-bpf hooks were limited to networking.
> It's no longer the case. Requiring NET_ADMIN there make little sense now.
> 

Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process.  Unsafe pointers are similar.  The rest could plausibly be hardened in the future, although the by_id stuff may be tricky too.

Do new programs really need the by_id calls?  It could make sense to leave those unchanged and to have new programs use persistent maps instead.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15  0:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <317422C3-ACE3-42A7-A287-7B8FEE12E33A@amacapital.net>

On Wed, Aug 14, 2019 at 04:59:18PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> >> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
> >> 
> >> 
> >>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>> 
> >>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >>>> 
> >>>> If eBPF is genuinely not usable by programs that are not fully trusted
> >>>> by the admin, then no kernel changes at all are needed.  Programs that
> >>>> want to reduce their own privileges can easily fork() a privileged
> >>>> subprocess or run a little helper to which they delegate BPF
> >>>> operations.  This is far more flexible than anything that will ever be
> >>>> in the kernel because it allows the helper to verify that the rest of
> >>>> the program is doing exactly what it's supposed to and restrict eBPF
> >>>> operations to exactly the subset that is needed.  So a container
> >>>> manager or network manager that drops some provilege could have a
> >>>> little bpf-helper that manages its BPF XDP, firewalling, etc
> >>>> configuration.  The two processes would talk over a socketpair.
> >>> 
> >>> there were three projects that tried to delegate bpf operations.
> >>> All of them failed.
> >>> bpf operational workflow is much more complex than you're imagining.
> >>> fork() also doesn't work for all cases.
> >>> I gave this example before: consider multiple systemd-like deamons
> >>> that need to do bpf operations that want to pass this 'bpf capability'
> >>> to other deamons written by other teams. Some of them will start
> >>> non-root, but still need to do bpf. They will be rpm installed
> >>> and live upgraded while running.
> >>> We considered to make systemd such centralized bpf delegation
> >>> authority too. It didn't work. bpf in kernel grows quickly.
> >>> libbpf part grows independently. llvm keeps evolving.
> >>> All of them are being changed while system overall has to stay
> >>> operational. Centralized approach breaks apart.
> >>> 
> >>>> The interesting cases you're talking about really *do* involved
> >>>> unprivileged or less privileged eBPF, though.  Let's see:
> >>>> 
> >>>> systemd --user: systemd --user *is not privileged at all*.  There's no
> >>>> issue of reducing privilege, since systemd --user doesn't have any
> >>>> privilege to begin with.  But systemd supports some eBPF features, and
> >>>> presumably it would like to support them in the systemd --user case.
> >>>> This is unprivileged eBPF.
> >>> 
> >>> Let's disambiguate the terminology.
> >>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> >>> I think that was a mistake.
> >>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> >>> This is not unprivileged.
> >>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >>> 
> >>> There is a huge difference between the two.
> >>> I'm against extending 'unprivileged bpf' even a bit more than what it is
> >>> today for many reasons mentioned earlier.
> >>> The /dev/bpf is about 'less privileged'.
> >>> Less privileged than root. We need to split part of full root capability
> >>> into bpf capability. So that most of the root can be dropped.
> >>> This is very similar to what cap_net_admin does.
> >>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> >>> cap_net_admin is very much privileged. Just 'less privileged' than root.
> >>> Same thing for cap_bpf.
> >> 
> >> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
> > 
> > Initially I agreed that it's probably too broad, but then realized
> > that they're perfect as-is. There is no need to partition further.
> > 
> >>> May be we should do both cap_bpf and /dev/bpf to make it clear that
> >>> this is the same thing. Two interfaces to achieve the same result.
> >> 
> >> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
> > 
> > Indeed, ambient capabilities should work for all cases.
> > 
> >> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.
> > 
> > There are not that many bits left. I prefer to consume single CAP_BPF bit.
> > All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> > This is no-brainer.
> > 
> > The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> > should be extended to CAP_BPF or not.
> > imo devmap and xskmap can stay CAP_NET_ADMIN,
> > but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> > Initially cgroup-bpf hooks were limited to networking.
> > It's no longer the case. Requiring NET_ADMIN there make little sense now.
> > 
> 
> Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process.  Unsafe pointers are similar. 

'never to less trusted process' ? why do you think so?
I don't see a problem adding /dev/bpf/foo in the future and make things
more granular. There is no such use case today. Hence I don't want to
spend time and design something without clear use case in mind.

> Do new programs really need the by_id calls? 

yes. Lorenz gave an example earlier. map-in-map returns map_id.
To operate on that map by_id is needed.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox