Linux userland API discussions


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-29  9:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <87h83s62mi.fsf@mid.deneb.enyo.de>

On Mon, Oct 28, 2019 at 09:23:17PM +0100, Florian Weimer wrote:
> * Mike Rapoport:
> 
> > On October 27, 2019 12:30:21 PM GMT+02:00, Florian Weimer
> > <fw@deneb.enyo.de> wrote:
> >>* Mike Rapoport:
> >>
> >>> The patch below aims to allow applications to create mappings that have
> >>> pages visible only to the owning process. Such mappings could be used to
> >>> store secrets so that these secrets are visible neither to other
> >>> processes nor to the kernel.
> >>
> >>How is this expected to interact with CRIU?
> >
> > CRIU dumps the memory contents using a parasite code from inside the
> > dumpee address space, so it would work the same way as for the other
> > mappings. Of course, at the restore time the exclusive mapping should
> > be recreated with the appropriate flags.
> 
> Hmm, so it would use a bounce buffer to perform the extraction?

At first I thought that CRIU would extract the memory contents from these
mappings just as it does now using vmsplice(). But it seems that such
mappings won't play well with pipes, so CRIU will need a bounce buffer
indeed.
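
As a rough illustration of the bounce-buffer approach: code running inside
the dumpee copies the exclusive mapping into ordinary memory first and only
then hands that copy to the pipe. The helper name dump_via_bounce() and the
chunk size are assumptions made for this sketch; it is not CRIU code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define BOUNCE_CHUNK 4096	/* assumed chunk size for this sketch */

/*
 * Copy 'len' bytes from 'src' (standing in for the exclusive mapping) into
 * an ordinary bounce buffer, one chunk at a time, and write() each chunk
 * to 'fd'. Returns 0 on success, -1 on a write error.
 */
static int dump_via_bounce(int fd, const void *src, size_t len)
{
	unsigned char bounce[BOUNCE_CHUNK];
	const unsigned char *p = src;

	assert(src != NULL);

	while (len > 0) {
		size_t chunk = len < sizeof(bounce) ? len : sizeof(bounce);
		size_t done = 0;

		/* User-space copy out of the mapping into normal memory. */
		memcpy(bounce, p, chunk);

		/* Only the bounce buffer is ever handed to the kernel. */
		while (done < chunk) {
			ssize_t n = write(fd, bounce + done, chunk - done);

			if (n < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			done += (size_t)n;
		}
		p += chunk;
		len -= chunk;
	}
	return 0;
}
```

The extra memcpy() is the cost of the approach: every page is copied once
more than with a direct vmsplice() of the source pages.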
 
> >>> I've only tested the basic functionality, the changes should be verified
> >>> against THP/migration/compaction. Yet, I'd appreciate early feedback.
> >>
> >>What are the expected semantics for VM migration?  Should it fail?
> >
> > I don't quite follow. If qemu would use such mappings it would be able
> > to transfer them during live migration.
> 
> I was wondering if the special state is supposed to bubble up to the
> host eventually.

Well, that was not intended.

-- 
Sincerely yours,
Mike.


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Peter Zijlstra @ 2019-10-29  8:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Mike Rapoport, Linux Kernel Mailing List,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API, linux-mm,
	the arch/x86 maintainers, Mike Rapoport
In-Reply-To: <20191029064318.s4n4gidlfjun3d47@box>

On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
> But some CPUs don't like to have two TLB entries for the same memory with
> different sizes at the same time. See for instance AMD erratum 383.
> 
> Getting it right would require making the range not present, flushing the
> TLB and only then installing the huge page. That's what we do for userspace.
> 
> It will not fly for the direct mapping. There is no reasonable way to
> exclude other CPUs from accessing the range while it's not present (call
> stop_machine()? :P). Moreover, the range may contain the code that is doing
> the collapse or the data required for it...
> 
> BTW, it looks like the current __split_large_page() in pageattr.c is
> susceptible to the erratum. Maybe we can get away with the easy way...

As you write above, there is just no way we can have a (temporary) hole
in the direct map.

We are careful about that other errata, and make sure both translations
are identical wrt everything else.


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-29  8:55 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Kirill A. Shutemov, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport
In-Reply-To: <alpine.DEB.2.21.1910290706360.3769@www.lameter.com>

On Tue, Oct 29, 2019 at 07:08:42AM +0000, Christopher Lameter wrote:
> On Mon, 28 Oct 2019, Kirill A. Shutemov wrote:
> 
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
> 
> Set aside a special physical memory range for this and migrate the
> page to that physical memory range when MAP_EXCLUSIVE is specified?

I talked with Thomas yesterday and he suggested something similar:

When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge
page for it and then use this page as a pool of 4K pages for subsequent
requests. Once this huge page is full we allocate a new one and append it
to the pool. When all the 4K pages that comprise the huge page are freed
the huge page is collapsed.
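
The pooling scheme described above can be sketched in user space as follows.
This is only a model under stated assumptions: malloc() stands in for
allocating a 2M huge page, releasing an empty chunk stands in for collapsing
it, and the names excl_pool, huge_chunk, excl_pool_alloc() and
excl_pool_free() are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SMALL_PAGE_SIZE 4096UL
#define SLOTS_PER_HUGE  512U	/* 2M / 4K */

struct huge_chunk {
	struct huge_chunk *next;
	unsigned char *base;		/* start of the simulated huge page */
	unsigned int used;		/* number of 4K slots handed out */
	bool busy[SLOTS_PER_HUGE];
};

struct excl_pool {
	struct huge_chunk *chunks;
	unsigned int nr_chunks;
};

/* Hand out one 4K slot, adding a new chunk when every chunk is full. */
static void *excl_pool_alloc(struct excl_pool *pool)
{
	struct huge_chunk *c;
	unsigned int i;

	assert(pool != NULL);

	for (c = pool->chunks; c; c = c->next)
		if (c->used < SLOTS_PER_HUGE)
			break;

	if (!c) {
		c = calloc(1, sizeof(*c));
		if (!c)
			return NULL;
		c->base = malloc(SLOTS_PER_HUGE * SMALL_PAGE_SIZE);
		if (!c->base) {
			free(c);
			return NULL;
		}
		c->next = pool->chunks;
		pool->chunks = c;
		pool->nr_chunks++;
	}

	for (i = 0; i < SLOTS_PER_HUGE; i++) {
		if (!c->busy[i]) {
			c->busy[i] = true;
			c->used++;
			return c->base + i * SMALL_PAGE_SIZE;
		}
	}
	return NULL;	/* not reached: the chunk had a free slot */
}

/* Return a slot; release ("collapse") the chunk once it is empty. */
static void excl_pool_free(struct excl_pool *pool, void *page)
{
	uintptr_t addr = (uintptr_t)page;
	struct huge_chunk **link;

	for (link = &pool->chunks; *link; link = &(*link)->next) {
		struct huge_chunk *c = *link;
		uintptr_t start = (uintptr_t)c->base;
		size_t idx;

		if (addr < start ||
		    addr >= start + SLOTS_PER_HUGE * SMALL_PAGE_SIZE)
			continue;

		idx = (addr - start) / SMALL_PAGE_SIZE;
		if (!c->busy[idx])
			return;	/* double free: ignored in this sketch */

		c->busy[idx] = false;
		if (--c->used == 0) {
			*link = c->next;
			free(c->base);
			free(c);
			pool->nr_chunks--;
		}
		return;
	}
}
```

The point of the design is that direct map splits stay confined to the
chunks owned by the pool instead of being scattered across all of memory.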

And then on top of this we can look into compaction of the direct map.

Of course, this would only work if the easy way of collapsing direct map
pages that Kirill mentioned in the other mail works.

> Maybe some processors also have hardware ranges that offer additional
> protection for stuff like that?
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Christopher Lameter @ 2019-10-29  7:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport
In-Reply-To: <20191028131623.zwuwguhm4v4s5imh@box>

On Mon, 28 Oct 2019, Kirill A. Shutemov wrote:

> Setting a single 4k page non-present in the direct mapping will require
> splitting 2M or 1G page we usually map direct mapping with. And it's one
> way road. We don't have any mechanism to map the memory with huge page
> again after the application has freed the page.
>
> It might be okay if all these pages cluster together, but I don't think we
> have a way to achieve it easily.

Set aside a special physical memory range for this and migrate the
page to that physical memory range when MAP_EXCLUSIVE is specified?

Maybe some processors also have hardware ranges that offer additional
protection for stuff like that?


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Kirill A. Shutemov @ 2019-10-29  6:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mike Rapoport, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API, linux-mm,
	the arch/x86 maintainers, Mike Rapoport
In-Reply-To: <CAA9_cmd7f2y2AAT6646S=tco3yfyLgCAC4Qp=1iTQaJqrQcOwQ@mail.gmail.com>

On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote:
> On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > >
> > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > > the owning process and can be used by applications to store secret
> > > > > information that will be invisible not only to other processes but to the
> > > > > kernel as well.
> > > > >
> > > > > The pages in these mappings are removed from the kernel direct map and
> > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > > the pages are mapped back into the direct map.
> > > >
> > > > I'm probably blind, but I don't see where you manipulate the direct map...
> > >
> > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > > set_direct_map_invalid_noflush() that makes the page not present.
> >
> > Ah. okay.
> >
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> >
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
> 
> Still, it would be worth exploring what that would look like if not
> for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison
> pages from the direct map. In the case of pmem, where those pages are
> able to be repaired, it would be nice to also repair the mapping
> granularity of the direct map.

The solution has to consist of two parts: finding a range to collapse and
actually collapsing the range into a huge page.

Finding the collapsible range will likely require background scanning of
the direct mapping as we do for THP with khugepaged. It should not be too
hard, but will likely require long and tedious tuning to be effective
without being too disturbing for the system.

Alternatively, after any change to the direct mapping, we can check whether
the surrounding range, up to 1G around the changed 4k page, is collapsible.
It might be more taxing than scanning if the direct mapping changes often.

Collapsing itself appears to be simple: re-check if the range is
collapsible under the lock, replace the page table with the huge page and
flush the TLB.
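
The "re-check if the range is collapsible" step can be modelled as below.
This is a simplified sketch, not kernel code: struct sim_pte and
range_is_collapsible() are hypothetical, locking is left out, and the PFN
contiguity test is included only for completeness, since in a linear direct
map it holds by construction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_PMD 512U	/* 4K entries covered by one 2M mapping */

/* Simplified stand-in for a page table entry. */
struct sim_pte {
	uint64_t pfn;		/* physical frame number */
	uint64_t flags;		/* protection and caching attributes */
	bool present;
};

/*
 * A run of 512 entries can be replaced by one huge mapping only if every
 * entry is present, the frames are contiguous and start on a 2M boundary,
 * and all entries carry identical attributes.
 */
static bool range_is_collapsible(const struct sim_pte ptes[PTES_PER_PMD])
{
	unsigned int i;

	assert(ptes != NULL);

	if (!ptes[0].present || ptes[0].pfn % PTES_PER_PMD != 0)
		return false;

	for (i = 1; i < PTES_PER_PMD; i++) {
		if (!ptes[i].present)
			return false;
		if (ptes[i].pfn != ptes[0].pfn + i)
			return false;
		if (ptes[i].flags != ptes[0].flags)
			return false;
	}
	return true;
}
```

A single non-present 4k entry, as left behind by an exclusive mapping, is
enough to make the whole 2M range fail this check.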

But some CPUs don't like to have two TLB entries for the same memory with
different sizes at the same time. See for instance AMD erratum 383.

Getting it right would require making the range not present, flushing the
TLB and only then installing the huge page. That's what we do for userspace.

It will not fly for the direct mapping. There is no reasonable way to
exclude other CPUs from accessing the range while it's not present (call
stop_machine()? :P). Moreover, the range may contain the code that is doing
the collapse or the data required for it...

BTW, it looks like the current __split_large_page() in pageattr.c is
susceptible to the erratum. Maybe we can get away with the easy way...

-- 
 Kirill A. Shutemov


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Dan Williams @ 2019-10-29  5:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API, linux-mm,
	the arch/x86 maintainers, Mike Rapoport
In-Reply-To: <20191028131623.zwuwguhm4v4s5imh@box>

On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > >
> > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > the owning process and can be used by applications to store secret
> > > > information that will be invisible not only to other processes but to the
> > > > kernel as well.
> > > >
> > > > The pages in these mappings are removed from the kernel direct map and
> > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > the pages are mapped back into the direct map.
> > >
> > > I'm probably blind, but I don't see where you manipulate the direct map...
> >
> > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > set_direct_map_invalid_noflush() that makes the page not present.
>
> Ah. okay.
>
> I think active use of this feature will lead to performance degradation of
> the system with time.
>
> Setting a single 4k page non-present in the direct mapping will require
> splitting 2M or 1G page we usually map direct mapping with. And it's one
> way road. We don't have any mechanism to map the memory with huge page
> again after the application has freed the page.
>
> It might be okay if all these pages cluster together, but I don't think we
> have a way to achieve it easily.

Still, it would be worth exploring what that would look like if not
for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison
pages from the direct map. In the case of pmem, where those pages are
able to be repaired, it would be nice to also repair the mapping
granularity of the direct map.


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Peter Zijlstra @ 2019-10-28 21:00 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kirill@shutemov.name, adobriyan@gmail.com,
	linux-kernel@vger.kernel.org, rppt@kernel.org,
	rostedt@goodmis.org, jejb@linux.ibm.com, tglx@linutronix.de,
	linux-mm@kvack.org, dave.hansen@linux.intel.com,
	linux-api@vger.kernel.org, x86@kernel.org,
	akpm@linux-foundation.org, hpa@zytor.com, mingo@redhat.com,
	luto@kernel.org, rppt@linux.ibm.com, bp@alien8.de, arnd
In-Reply-To: <0a35765f7412937c1775daa05177b20113760aee.camel@intel.com>

On Mon, Oct 28, 2019 at 07:59:25PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
> > 
> > > I think active use of this feature will lead to performance degradation of
> > > the system with time.
> > > 
> > > Setting a single 4k page non-present in the direct mapping will require
> > > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > > way road. We don't have any mechanism to map the memory with huge page
> > > again after the application has freed the page.
> > 
> > Right, we recently had a 'bug' where ftrace triggered something like
> > this and facebook ran into it as a performance regression. So yes, this
> > is a real concern.
> 
> Don't e/cBPF filters also break the direct map down to 4k pages when calling
> set_memory_ro() on the filter for 64 bit x86 and arm?
> 
> I've been wondering if the page allocator should make some effort to find a
> broken down page for anything known in advance to have its direct map
> permissions changed (or if it already groups them somehow). I also wonder why
> any potential slowdown from 4k pages in the direct map hasn't been noticed for
> apps that do a lot of insertions and removals of BPF filters, if this is
> indeed the case.

That should be limited to the module range. Random data maps could
shatter the world.


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Andy Lutomirski @ 2019-10-28 20:44 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-1-git-send-email-rppt@kernel.org>


> On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> 
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> The patch below aims to allow applications to create mappings that have
> pages visible only to the owning process. Such mappings could be used to
> store secrets so that these secrets are visible neither to other
> processes nor to the kernel.
> 
> I've only tested the basic functionality, the changes should be verified
> against THP/migration/compaction. Yet, I'd appreciate early feedback.

I've contemplated the concept a fair amount, and I think you should
consider a change to the API. In particular, rather than having it be a
MAP_ flag, make it a chardev. You can, at least at first, allow only
MAP_SHARED, and admins can decide who gets to use it. It might also play
better with the VM overall, and you won't need a VM_ flag for it: you can
just wire up .fault to do the right thing.
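
From user space, the character-device variant might be used roughly as
follows. No device node is named anywhere in this thread, so the path is
left to the caller, and map_secret_area() is a hypothetical helper. The
sketch only shows the open() plus mmap(MAP_SHARED) pattern, with access
control left to the permissions on the device node.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map 'len' bytes of the device (or file) at 'path' with MAP_SHARED.
 * Returns the mapping, or NULL if the open or the mmap fails.
 */
static void *map_secret_area(const char *path, size_t len)
{
	void *area;
	int fd;

	assert(path != NULL && len > 0);

	/* Who may open the node is the administrator's decision. */
	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping stays valid after the descriptor is closed. */
	close(fd);

	return area == MAP_FAILED ? NULL : area;
}
```

With this shape the special handling would live in the driver's mmap and
fault handlers rather than behind a new mmap() flag.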


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Florian Weimer @ 2019-10-28 20:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <385EB6D4-A1B0-4617-B256-181AA1C3BDE3@kernel.org>

* Mike Rapoport:

> On October 27, 2019 12:30:21 PM GMT+02:00, Florian Weimer
> <fw@deneb.enyo.de> wrote:
>>* Mike Rapoport:
>>
>>> The patch below aims to allow applications to create mappings that have
>>> pages visible only to the owning process. Such mappings could be used to
>>> store secrets so that these secrets are visible neither to other
>>> processes nor to the kernel.
>>
>>How is this expected to interact with CRIU?
>
> CRIU dumps the memory contents using a parasite code from inside the
> dumpee address space, so it would work the same way as for the other
> mappings. Of course, at the restore time the exclusive mapping should
> be recreated with the appropriate flags.

Hmm, so it would use a bounce buffer to perform the extraction?

>>> I've only tested the basic functionality, the changes should be verified
>>> against THP/migration/compaction. Yet, I'd appreciate early feedback.
>>
>>What are the expected semantics for VM migration?  Should it fail?
>
> I don't quite follow. If qemu would use such mappings it would be able
> to transfer them during live migration.

I was wondering if the special state is supposed to bubble up to the
host eventually.


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Edgecombe, Rick P @ 2019-10-28 19:59 UTC (permalink / raw)
  To: kirill@shutemov.name, peterz@infradead.org
  Cc: adobriyan@gmail.com, linux-kernel@vger.kernel.org,
	rppt@kernel.org, rostedt@goodmis.org, jejb@linux.ibm.com,
	tglx@linutronix.de, linux-mm@kvack.org,
	dave.hansen@linux.intel.com, linux-api@vger.kernel.org,
	x86@kernel.org, akpm@linux-foundation.org, hpa@zytor.com,
	mingo@redhat.com, luto@kernel.org, rppt@linux.ibm.com,
	bp@alien8.de, arnd@arndb.de
In-Reply-To: <20191028135521.GB4097@hirez.programming.kicks-ass.net>

On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
> 
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> > 
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> 
> Right, we recently had a 'bug' where ftrace triggered something like
> this and facebook ran into it as a performance regression. So yes, this
> is a real concern.

Don't e/cBPF filters also break the direct map down to 4k pages when calling
set_memory_ro() on the filter for 64 bit x86 and arm?

I've been wondering if the page allocator should make some effort to find a
broken down page for anything known in advance to have its direct map
permissions changed (or if it already groups them somehow). I also wonder why
any potential slowdown from 4k pages in the direct map hasn't been noticed for
apps that do a lot of insertions and removals of BPF filters, if this is
indeed the case.




* Re: For review: documentation of clone3() system call
From: Jann Horn @ 2019-10-28 19:09 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michael Kerrisk-manpages, lkml, linux-man, Kees Cook,
	Florian Weimer, Oleg Nesterov, Arnd Bergmann, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API
In-Reply-To: <20191028172143.4vnnjpdljfnexaq5@wittgenstein>

On Mon, Oct 28, 2019 at 6:21 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Mon, Oct 28, 2019 at 04:12:09PM +0100, Jann Horn wrote:
> > On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> > > I've made a first shot at adding documentation for clone3(). You can
> > > see the diff here:
> > > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=faa0e55ae9e490d71c826546bbdef954a1800969
[...]
> > You might want to note somewhere that its flags can't be
> > seccomp-filtered because they're stored in memory, making it
> > inappropriate to use in heavily sandboxed processes.
>
> Hm, I don't think that belongs on the clone manpage. Granted that
> process creation is an important syscall but so are a bunch of others
> that aren't filterable because of pointer arguments.
> We can probably mention on the seccomp manpage that seccomp can't filter
> on pointer arguments and then provide a list of examples. If you set up a
> seccomp filter and don't know that you can't filter syscalls with
> pointer args that seems pretty bad to begin with.

Fair enough.

[...]
> One thing I never liked about clone() was that userspace had to know
> about stack direction. And there is a lot of ugly code in userspace that
> has nasty clone() wrappers like:
[...]
> where stack + stack_size is addition on a void pointer which usually
> clang and gcc are not very happy about.
> I wanted to bring this up on the mailing list soon: If possible, I don't
> want userspace to need to know about stack direction and just have stack
> point to the beginning and then have the kernel do the + stack_size
> after the copy_clone_args_from_user() if the arch needs it. For example,
> by having a dumb helper similar to copy_thread_tls()/copy_thread() that
> either does the + stack_size or not. Right now, clone3() is supported on
> parisc and afaict, the stack grows upwards for it. I'm not sure if there
> are obvious reasons why that won't work or it would be a bad idea...

That would mean adding a new clone flag that redefines how those
parameters work and describing the current behavior in the manpage as
the behavior without the flag (which doesn't exist on 5.3), right?
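
To make the stack-direction issue concrete: a legacy clone() wrapper has to
pass the address the child's stack pointer starts from, so on a
downward-growing stack it passes the end of the area, while the clone3()
documentation under review describes "stack" as the lowest byte plus a
separate size. The helper below is a hypothetical sketch of what such
wrappers compute, not an existing API; the cast to char * avoids the
void-pointer arithmetic mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the initial stack pointer a legacy clone() wrapper would pass for
 * a stack area that starts at 'base' and spans 'size' bytes.
 */
static void *clone_initial_sp(void *base, size_t size, bool grows_down)
{
	assert(base != NULL && size > 0);

	/* Downward-growing stacks start at the end of the area. */
	return grows_down ? (void *)((char *)base + size) : base;
}
```

Moving this computation into the kernel, as proposed, would let callers pass
the base address and the size on every architecture.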


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Matthew Wilcox @ 2019-10-28 18:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport
In-Reply-To: <d6ac08fe-23f3-c2d5-24c4-88e68f3fd4d0@intel.com>

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> Some other random thoughts:
> 
>  * The page flag is probably not a good idea.  It would be probably
>    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>    into the slow path.
>  * This really stops being "normal" memory.  You can't do futexes on it,
>    can't splice it.  Probably need a more fleshed-out list of
>    incompatible features.
>  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>    radius" of demoted pages in the direct map.  Not cool.  This is
>    probably a non-starter as it stands.
>  * The global TLB flushes are going to eat you alive.  They probably
>    border on a DoS on larger systems.
>  * Do we really want this user interface to dictate the kernel
>    implementation?  In other words, do we really want MAP_EXCLUSIVE,
>    or do we want MAP_SECRET?  One tells the kernel what to *do*, the
>    other tells the kernel what the memory *IS*.
>  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>    Persistent Memory, where the kernel direct map is a liability in some
>    way.  We probably need some kind of overall, architected solution
>    rather than five or ten things all poking at the direct map.

Another random set of thoughts:

 - Should devices be permitted to DMA to/from MAP_SECRET pages?
 - How about GUP?  Can I ptrace my way into another process's secret pages?
 - What if I splice() the page into a pipe?


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Andy Lutomirski @ 2019-10-28 18:02 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: LKML, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Linux API, Linux-MM, X86 ML, Mike Rapoport
In-Reply-To: <1572171452-7958-2-git-send-email-rppt@kernel.org>

On Sun, Oct 27, 2019 at 3:17 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will be invisible not only to other processes but to the
> kernel as well.
>
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
>
> The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED.
>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/mm/fault.c                    | 14 ++++++++++
>  fs/proc/task_mmu.c                     |  1 +
>  include/linux/mm.h                     |  9 +++++++
>  include/linux/page-flags.h             |  7 +++++
>  include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
>  include/trace/events/mmflags.h         |  9 ++++++-
>  include/uapi/asm-generic/mman-common.h |  1 +
>  kernel/fork.c                          |  3 ++-
>  mm/Kconfig                             |  3 +++
>  mm/gup.c                               |  8 ++++++
>  mm/memory.c                            |  3 +++
>  mm/mmap.c                              | 16 +++++++++++
>  mm/page_alloc.c                        |  5 ++++
>  13 files changed, 126 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/page_excl.h
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ceacd1..8f73a75 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -17,6 +17,7 @@
>  #include <linux/context_tracking.h>    /* exception_enter(), ...       */
>  #include <linux/uaccess.h>             /* faulthandler_disabled()      */
>  #include <linux/efi.h>                 /* efi_recover_from_page_fault()*/
> +#include <linux/page_excl.h>           /* page_is_user_exclusive()     */
>  #include <linux/mm_types.h>
>
>  #include <asm/cpufeature.h>            /* boot_cpu_has, ...            */
> @@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
>         return address >= TASK_SIZE_MAX;
>  }
>
> +static bool fault_in_user_exclusive_page(unsigned long address)
> +{
> +       struct page *page = virt_to_page(address);
> +
> +       return page_is_user_exclusive(page);
> +}
> +
>  /*
>   * Called for all faults where 'address' is part of the kernel address
>   * space.  Might get called for faults that originate from *code* that
> @@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>         if (spurious_kernel_fault(hw_error_code, address))
>                 return;
>
> +       /* FIXME: warn and handle gracefully */
> +       if (unlikely(fault_in_user_exclusive_page(address))) {
> +               pr_err("page fault in user exclusive page at %lx", address);
> +               force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
> +       }

Sending a signal here is not a reasonable thing to do in response to
an unexpected kernel fault.  You need to OOPS.  Printing a nice
message would be nice.

--Andy


* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Sean Christopherson @ 2019-10-28 17:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport
In-Reply-To: <d6ac08fe-23f3-c2d5-24c4-88e68f3fd4d0@intel.com>

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
> 
> This looks fun.  It's certainly simple.
> 
> But, the description is not really calling out the pros and cons very
> well.  I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise().  That's why protection keys ended up the way it did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this.  If you do it with mprotect()-style, you can
> apply it to existing allocations.
> 
> Some other random thoughts:
> 
>  * The page flag is probably not a good idea.  It would be probably
>    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>    into the slow path.
>  * This really stops being "normal" memory.  You can't do futexes on it,
>    can't splice it.  Probably need a more fleshed-out list of
>    incompatible features.
>  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>    radius" of demoted pages in the direct map.  Not cool.  This is
>    probably a non-starter as it stands.
>  * The global TLB flushes are going to eat you alive.  They probably
>    border on a DoS on larger systems.
>  * Do we really want this user interface to dictate the kernel
>    implementation?  In other words, do we really want MAP_EXCLUSIVE,
>    or do we want MAP_SECRET?  One tells the kernel what to *do*, the
>    other tells the kernel what the memory *IS*.

If we go that route, maybe MAP_USER_SECRET so that there's wiggle room in
the event that there are different secret keepers that require different
implementations in the kernel?   E.g. MAP_GUEST_SECRET for a KVM guest to
take the userspace VMM (Qemu) out of the TCB, i.e. the mapping would be
accessible by the kernel (or just KVM?) and the KVM guest, but not
userspace.

>  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>    Persistent Memory, where the kernel direct map is a liability in some
>    way.  We probably need some kind of overall, architected solution
>    rather than five or ten things all poking at the direct map.
> 


* Re: For review: documentation of clone3() system call
From: Christian Brauner @ 2019-10-28 17:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk-manpages, lkml, linux-man, Kees Cook,
	Florian Weimer, Oleg Nesterov, Arnd Bergmann, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API
In-Reply-To: <CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com>

On Mon, Oct 28, 2019 at 04:12:09PM +0100, Jann Horn wrote:
> On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > I've made a first shot at adding documentation for clone3(). You can
> > see the diff here:
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=faa0e55ae9e490d71c826546bbdef954a1800969
> [...]
> >    clone3()
> >        The  clone3() system call provides a superset of the functionality
> >        of the older clone() interface.  It also provides a number of  API
> >        improvements,  including: space for additional flags bits; cleaner
> >        separation in the use of various arguments;  and  the  ability  to
> >        specify the size of the child's stack area.
> 
> You might want to note somewhere that its flags can't be
> seccomp-filtered because they're stored in memory, making it
> inappropriate to use in heavily sandboxed processes.

Hm, I don't think that belongs on the clone manpage. Granted that
process creation is an important syscall but so are a bunch of others
that aren't filterable because of pointer arguments.
We can probably mention on the seccomp manpage that seccomp can't filter
on pointer arguments and then provide a list of examples. If you set up a
seccomp filter and don't know that you can't filter syscalls with
pointer args that seems pretty bad to begin with.
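
To make the limitation concrete, here is a minimal sketch (not from this
thread) of a classic-BPF seccomp filter. It can match on the syscall number,
and it could load args[0], but for clone3() that value is only the user
pointer to struct clone_args; the filter has no way to dereference it, so the
flags stay invisible. The array name, the helper, and the choice to fail
clone3() with ENOSYS are illustrative assumptions, and the fallback syscall
number is only for headers that predate clone3().

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#ifndef __NR_clone3
#define __NR_clone3 435	/* assumption: fallback for old headers */
#endif

/*
 * Illustrative filter: fail clone3() with ENOSYS, allow everything else.
 * A real filter must also validate seccomp_data.arch; that check is
 * omitted here for brevity.
 */
static struct sock_filter clone3_enosys_filter[] = {
	/* A = syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* clone3? fall through to the ERRNO return; otherwise skip it */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone3, 0, 1),
	BPF_STMT(BPF_RET | BPF_K,
		 SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Number of instructions, as needed for struct sock_fprog.len */
static size_t clone3_enosys_filter_len(void)
{
	return sizeof(clone3_enosys_filter) / sizeof(clone3_enosys_filter[0]);
}
```

Installing it would go through prctl(PR_SET_NO_NEW_PRIVS, ...) followed by
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) with a struct sock_fprog
pointing at the array.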

> 
> >            struct clone_args {
> >                u64 flags;        /* Flags bit mask */
> >                u64 pidfd;        /* Where to store PID file descriptor
> >                                     (int *) */
> >                u64 child_tid;    /* Where to store child TID,
> >                                     in child's memory (int *) */
> >                u64 parent_tid;   /* Where to store child TID,
> >                                     in parent's memory (int *) */
> >                u64 exit_signal;  /* Signal to deliver to parent on
> >                                     child termination */
> >                u64 stack;        /* Pointer to lowest byte of stack */
> >                u64 stack_size;   /* Size of stack */
> >                u64 tls;          /* Location of new TLS */
> >            };
> >
> >        The size argument that is supplied to clone3() should be  initial‐
> >        ized  to  the  size of this structure.  (The existence of the size
> >        argument permits future extensions to the clone_args structure.)
> >
> >        The stack for the child process is  specified  via  cl_args.stack,
> >        which   points   to  the  lowest  byte  of  the  stack  area,  and
> 
> Here and in the comment in the struct above, you say that .stack
> "points to the lowest byte of the stack area", but isn't that
> architecture-dependent? For most architectures, I think it should
> instead be "is the initial stack pointer", with the exception of IA64
> (and maybe others, I'm not sure). For example, on X86, when launching
> a thread with an initially empty stack, it points directly *after* the
> end of the stack area.

re arch and stack_size: You mentioned ia64 below (I snipped this part.)
but it's not the only one. With legacy clone it's _passed_ for any
architecture that has CONFIG_CLONE_BACKWARDS3. That includes at least
microblaze and ia64 I think. But only ia64 makes _actual use_ of this in
copy_thread() by doing user_stack_base + user_stack_size - 16. I think ia64
only needs stack_size because of the split page-table layout where two
stacks grow in different directions; so the stack doesn't grow
dynamically. Afair, stack_size is mainly used when PF_KTHREAD is true
but that can't be set from userspace anyway, so _shrug_.

One thing I never liked about clone() was that userspace had to know
about stack direction. And there is a lot of ugly code in userspace that
has nasty clone() wrappers like:

pid_t wrap_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
{
	pid_t ret;
	void *stack;

	stack = malloc(__STACK_SIZE);
	if (!stack) {
		SYSERROR("Failed to allocate clone stack");
		return -ENOMEM;
	}

#ifdef __ia64__
	ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#else
	ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#endif
	return ret;
}

where stack + __STACK_SIZE is arithmetic on a void pointer, which clang and
gcc are usually not very happy about.
I wanted to bring this up on the mailing list soon: If possible, I don't
want userspace to need to know about stack direction and just have stack
point to the beginning and then have the kernel do the + stack_size
after the copy_clone_args_from_user() if the arch needs it. For example,
by having a dumb helper similar to copy_thread_tls()/copy_thread() that
either does the + stack_size or not. Right now, clone3() is supported on
parisc and afaict, the stack grows upwards for it. I'm not sure if there
are obvious reasons why that won't work or it would be a bad idea...
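
A rough sketch of the idea, purely for illustration: the function names and
the stack_grows_down parameter are hypothetical, and this is not kernel code.
The second helper shows the userspace side without arithmetic on a void
pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn "lowest byte of the stack area" plus its
 * size into the initial stack pointer for the child.
 */
static uint64_t initial_stack_pointer(uint64_t lowest, uint64_t size,
				      bool stack_grows_down)
{
	/* A downward-growing stack starts just past its highest byte. */
	return stack_grows_down ? lowest + size : lowest;
}

/* Userspace equivalent of "stack + size", done on char * instead. */
static void *stack_top(void *base, size_t size)
{
	return (char *)base + size;
}
```

With something like this, callers would always pass the lowest address and
the size, and only the helper would need to know the stack direction.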

Christian

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Dave Hansen @ 2019-10-28 17:12 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-2-git-send-email-rppt@kernel.org>

On 10/27/19 3:17 AM, Mike Rapoport wrote:
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

This looks fun.  It's certainly simple.

But, the description is not really calling out the pros and cons very
well.  I'm also not sure that folks will use an interface like this that
requires up-front, special code to do an allocation instead of something
like madvise().  That's why protection keys ended up the way they did: if
you do this as a mmap() replacement, you need to modify all *allocators*
to be enabled for this.  If you do it with mprotect()-style, you can
apply it to existing allocations.

Some other random thoughts:

 * The page flag is probably not a good idea.  It would be probably
   better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
   into the slow path.
 * This really stops being "normal" memory.  You can't do futexes on it,
   can't splice it.  Probably need a more fleshed-out list of
   incompatible features.
 * As Kirill noted, each 4k page ends up with a potential 1GB "blast
   radius" of demoted pages in the direct map.  Not cool.  This is
   probably a non-starter as it stands.
 * The global TLB flushes are going to eat you alive.  They probably
   border on a DoS on larger systems.
 * Do we really want this user interface to dictate the kernel
   implementation?  In other words, do we really want MAP_EXCLUSIVE,
   or do we want MAP_SECRET?  One tells the kernel what to *do*, the
   other tells the kernel what the memory *IS*.
 * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
   Persistent Memory, where the kernel direct map is a liability in some
   way.  We probably need some kind of overall, architected solution
   rather than five or ten things all poking at the direct map.

^ permalink raw reply

* Re: For review: documentation of clone3() system call
From: Jann Horn @ 2019-10-28 15:12 UTC (permalink / raw)
  To: Michael Kerrisk-manpages
  Cc: Christian Brauner, lkml, linux-man, Kees Cook, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <CAKgNAkjo2WHq+zESU1iuCHJJ0x-fTNrakS9-d1+BjzUuV2uf2Q@mail.gmail.com>

On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> I've made a first shot at adding documentation for clone3(). You can
> see the diff here:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=faa0e55ae9e490d71c826546bbdef954a1800969
[...]
>    clone3()
>        The  clone3() system call provides a superset of the functionality
>        of the older clone() interface.  It also provides a number of  API
>        improvements,  including: space for additional flags bits; cleaner
>        separation in the use of various arguments;  and  the  ability  to
>        specify the size of the child's stack area.

You might want to note somewhere that its flags can't be
seccomp-filtered because they're stored in memory, making it
inappropriate to use in heavily sandboxed processes.

>            struct clone_args {
>                u64 flags;        /* Flags bit mask */
>                u64 pidfd;        /* Where to store PID file descriptor
>                                     (int *) */
>                u64 child_tid;    /* Where to store child TID,
>                                     in child's memory (int *) */
>                u64 parent_tid;   /* Where to store child TID,
>                                     in parent's memory (int *) */
>                u64 exit_signal;  /* Signal to deliver to parent on
>                                     child termination */
>                u64 stack;        /* Pointer to lowest byte of stack */
>                u64 stack_size;   /* Size of stack */
>                u64 tls;          /* Location of new TLS */
>            };
>
>        The size argument that is supplied to clone3() should be  initial‐
>        ized  to  the  size of this structure.  (The existence of the size
>        argument permits future extensions to the clone_args structure.)
>
>        The stack for the child process is  specified  via  cl_args.stack,
>        which   points   to  the  lowest  byte  of  the  stack  area,  and

Here and in the comment in the struct above, you say that .stack
"points to the lowest byte of the stack area", but isn't that
architecture-dependent? For most architectures, I think it should
instead be "is the initial stack pointer", with the exception of IA64
(and maybe others, I'm not sure). For example, on X86, when launching
a thread with an initially empty stack, it points directly *after* the
end of the stack area.

>        cl_args.stack_size, which specifies  the  size  of  the  stack  in
>        bytes.   In the case where the CLONE_VM flag (see below) is speci‐

stack_size is ignored on most architectures.

>        fied, a stack must be explicitly allocated and specified.   Other‐
>        wise,  these  two  fields  can  be  specified as NULL and 0, which
>        causes the child to use the same stack area as the parent (in  the
>        child's own virtual address space).
[...]
>    Equivalence between clone() and clone3() arguments
>        Unlike  the  older  clone()  interface, where arguments are passed
>        individually, in the newer clone3() interface  the  arguments  are
>        packaged  into  the clone_args structure shown above.  This struc‐
>        ture allows for a superset  of  the  information  passed  via  the
>        clone() arguments.
>
>        The following table shows the equivalence between the arguments of
>        clone() and the fields in  the  clone_args  argument  supplied  to
>        clone3():
>
>               clone()         clone3()        Notes
>                               cl_args field
>               flags & ~0xff   flags
>               parent_tid      pidfd           See CLONE_PIDFD
>               child_tid       child_tid       See CLONE_CHILD_SETTID
>               parent_tid      parent_tid      See CLONE_PARENT_SETTID
>               flags & 0xff    exit_signal
>               stack           stack
>
>               ---             stack_size

(except that on ia64, stack_size also exists in clone2(), and if
you're not on ia64, stack_size doesn't do anything, at least on X86,
so showing them side by side like this doesn't really make sense)
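
For the two flags rows of the table, the mapping amounts to splitting the
low byte off the legacy flags word. A small sketch, assuming a local mirror
of the structure quoted above; struct clone_args_v0, the mask macro, and the
function name are made up for this example:

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the quoted structure (eight u64 fields). */
struct clone_args_v0 {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
};

/* Low byte of the legacy clone() flags carries the exit signal. */
#define LEGACY_EXIT_SIGNAL_MASK 0xffULL

static struct clone_args_v0 split_legacy_flags(uint64_t legacy_flags)
{
	struct clone_args_v0 args;

	memset(&args, 0, sizeof(args));
	args.flags = legacy_flags & ~LEGACY_EXIT_SIGNAL_MASK;
	args.exit_signal = legacy_flags & LEGACY_EXIT_SIGNAL_MASK;
	return args;
}
```

For example, legacy flags of CLONE_VM | SIGCHLD end up as flags == CLONE_VM
and exit_signal == SIGCHLD. The stack and stack_size fields are left at zero
here, since their meaning is exactly what is under discussion.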

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: David Hildenbrand @ 2019-10-28 14:55 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-2-git-send-email-rppt@kernel.org>

On 27.10.19 11:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will be visible neither to other processes nor to the
> kernel.
> 
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
> 
> The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>   arch/x86/mm/fault.c                    | 14 ++++++++++
>   fs/proc/task_mmu.c                     |  1 +
>   include/linux/mm.h                     |  9 +++++++
>   include/linux/page-flags.h             |  7 +++++
>   include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
>   include/trace/events/mmflags.h         |  9 ++++++-
>   include/uapi/asm-generic/mman-common.h |  1 +
>   kernel/fork.c                          |  3 ++-
>   mm/Kconfig                             |  3 +++
>   mm/gup.c                               |  8 ++++++
>   mm/memory.c                            |  3 +++
>   mm/mmap.c                              | 16 +++++++++++
>   mm/page_alloc.c                        |  5 ++++
>   13 files changed, 126 insertions(+), 2 deletions(-)
>   create mode 100644 include/linux/page_excl.h
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ceacd1..8f73a75 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -17,6 +17,7 @@
>   #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
>   #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
>   #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
> +#include <linux/page_excl.h>		/* page_is_user_exclusive()	*/
>   #include <linux/mm_types.h>
>   
>   #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
> @@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
>   	return address >= TASK_SIZE_MAX;
>   }
>   
> +static bool fault_in_user_exclusive_page(unsigned long address)
> +{
> +	struct page *page = virt_to_page(address);
> +
> +	return page_is_user_exclusive(page);
> +}
> +
>   /*
>    * Called for all faults where 'address' is part of the kernel address
>    * space.  Might get called for faults that originate from *code* that
> @@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>   	if (spurious_kernel_fault(hw_error_code, address))
>   		return;
>   
> +	/* FIXME: warn and handle gracefully */
> +	if (unlikely(fault_in_user_exclusive_page(address))) {
> +		pr_err("page fault in user exclusive page at %lx", address);
> +		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
> +	}
> +
>   	/* kprobes don't want to hook the spurious faults: */
>   	if (kprobe_page_fault(regs, X86_TRAP_PF))
>   		return;
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 9442631..99e14d1 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -655,6 +655,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>   #ifdef CONFIG_X86_INTEL_MPX
>   		[ilog2(VM_MPX)]		= "mp",
>   #endif
> +		[ilog2(VM_EXCLUSIVE)]	= "xl",
>   		[ilog2(VM_LOCKED)]	= "lo",
>   		[ilog2(VM_IO)]		= "io",
>   		[ilog2(VM_SEQ_READ)]	= "sr",
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc29227..9c43375 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
>   #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>   #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>   #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>   #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>   #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>   #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
>   #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>   
>   #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
>   # define VM_MPX		VM_NONE
>   #endif
>   
> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> +# define VM_EXCLUSIVE	VM_HIGH_ARCH_5
> +#else
> +# define VM_EXCLUSIVE	VM_NONE
> +#endif
> +
>   #ifndef VM_GROWSUP
>   # define VM_GROWSUP	VM_NONE
>   #endif
> @@ -2594,6 +2602,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>   #define FOLL_ANON	0x8000	/* don't do file mappings */
>   #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
>   #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
> +#define FOLL_EXCLUSIVE	0x40000	/* mapping is exclusive to owning mm */
>   
>   /*
>    * NOTE on FOLL_LONGTERM:
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f91cb88..32d0aee 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -131,6 +131,9 @@ enum pageflags {
>   	PG_young,
>   	PG_idle,
>   #endif
> +#if defined(CONFIG_EXCLUSIVE_USER_PAGES)
> +	PG_user_exclusive,
> +#endif

Last time I tried to introduce a new page flag I learned that this is 
very much frowned upon. Best you can usually do is reuse another flag - 
if valid in that context.

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Peter Zijlstra @ 2019-10-28 13:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <20191028131623.zwuwguhm4v4s5imh@box>

On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:

> I think active use of this feature will lead to performance degradation of
> the system over time.
> 
> Setting a single 4k page non-present in the direct mapping will require
> splitting the 2M or 1G page we usually map the direct mapping with. And it's
> a one-way road. We don't have any mechanism to map the memory with a huge
> page again after the application has freed the page.

Right, we recently had a 'bug' where ftrace triggered something like
this and facebook ran into it as a performance regression. So yes, this
is a real concern.

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Kirill A. Shutemov @ 2019-10-28 13:16 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <20191028130018.GA7192@rapoport-lnx>

On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > 
> > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > the owning process and can be used by applications to store secret
> > > information that will be visible neither to other processes nor to the
> > > kernel.
> > > 
> > > The pages in these mappings are removed from the kernel direct map and
> > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > the pages are mapped back into the direct map.
> > 
> > I'm probably blind, but I don't see where you manipulate direct map...
> 
> __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> set_direct_map_invalid_noflush() that makes the page not present.

Ah. okay.

I think active use of this feature will lead to performance degradation of
the system over time.

Setting a single 4k page non-present in the direct mapping will require
splitting the 2M or 1G page we usually map the direct mapping with. And it's
a one-way road. We don't have any mechanism to map the memory with a huge
page again after the application has freed the page.

It might be okay if all these pages cluster together, but I don't think we
have a way to achieve it easily.
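
To put rough numbers on the split, here is a back-of-the-envelope sketch.
It assumes x86-64 with 4k base pages, where each page-table level holds 512
entries; the struct and function names are made up for illustration.

```c
#define ENTRIES_PER_TABLE 512u	/* x86-64: 512 entries per table level */

struct split_cost {
	unsigned int entries_2m;	/* 2M mappings left after the split */
	unsigned int entries_4k;	/* present 4k mappings after the split */
	unsigned int not_present;	/* the page removed from the direct map */
};

/* One 4k hole punched into a range that was mapped by a single 1G entry. */
static struct split_cost split_1g_for_one_4k_hole(void)
{
	struct split_cost c;

	/* 1G splits into 512 x 2M; one of those 2M pages must split again. */
	c.entries_2m = ENTRIES_PER_TABLE - 1;
	/* That 2M page splits into 512 x 4k; one of them is the hole. */
	c.entries_4k = ENTRIES_PER_TABLE - 1;
	c.not_present = 1;
	return c;
}
```

So a single entry becomes 511 + 511 + 1 = 1023 entries, and they stay that
way after the page is mapped back.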

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Florian Weimer @ 2019-10-28 13:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin
In-Reply-To: <d7e76bee-80c3-d787-b854-91e631ab29cd@yandex-team.ru>

* Konstantin Khlebnikov:

> On 28/10/2019 14.46, Florian Weimer wrote:
>> * Konstantin Khlebnikov:
>> 
>>> This implements fcntl() for getting amount of resident memory in cache.
>>> Kernel already maintains counter for each inode, this patch just exposes
>>> it into userspace. Returned size is in kilobytes like values in procfs.
>> 
>> I think this needs a 32-bit compat implementation which clamps the
>> returned value to INT_MAX.
>> 
>
> A 32-bit machine couldn't hold more than 2TB of cache in one file.
> Even the radix tree wouldn't fit into the low memory area.

I meant a 32-bit process running on a 64-bit kernel.
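
Something along these lines, as a sketch only; the helper name is made up
and this is not taken from the patch. INT_MAX kilobytes is just under 2 TiB,
which is where the 2TB figure comes from, and a 64-bit kernel can exceed
that for a single file.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical compat helper: clamp a 64-bit kilobyte count to int. */
static int clamp_kb_to_int(uint64_t kb)
{
	return kb > (uint64_t)INT_MAX ? INT_MAX : (int)kb;
}
```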

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-28 13:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <20191028123124.ogkk5ogjlamvwc2s@box>

On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > the owning process and can be used by applications to store secret
> > information that will be visible neither to other processes nor to the
> > kernel.
> > 
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
> 
> I'm probably blind, but I don't see where you manipulate direct map...

__get_user_pages() calls __set_page_user_exclusive() which in turn calls
set_direct_map_invalid_noflush() that makes the page not present.
 
> -- 
>  Kirill A. Shutemov

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Konstantin Khlebnikov @ 2019-10-28 12:55 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin
In-Reply-To: <87k18p6qjk.fsf@mid.deneb.enyo.de>

On 28/10/2019 14.46, Florian Weimer wrote:
> * Konstantin Khlebnikov:
> 
>> This implements fcntl() for getting amount of resident memory in cache.
>> Kernel already maintains counter for each inode, this patch just exposes
>> it into userspace. Returned size is in kilobytes like values in procfs.
> 
> I think this needs a 32-bit compat implementation which clamps the
> returned value to INT_MAX.
> 

A 32-bit machine couldn't hold more than 2TB of cache in one file.
Even the radix tree wouldn't fit into the low memory area.

^ permalink raw reply

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
From: Konstantin Khlebnikov @ 2019-10-28 12:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Linux-MM, Linux Kernel Mailing List, Linux API,
	Michal Hocko, Alexander Viro, Johannes Weiner, Andrew Morton,
	Roman Gushchin
In-Reply-To: <CAHk-=wiCDPd1ivoU5BJBMSt5cmKnX0XFWiinfegyknfoipif0g@mail.gmail.com>

On 28/10/2019 15.27, Linus Torvalds wrote:
> On Mon, Oct 28, 2019 at 11:28 AM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>>
>> This implements fcntl() for getting amount of resident memory in cache.
>> Kernel already maintains counter for each inode, this patch just exposes
>> it into userspace. Returned size is in kilobytes like values in procfs.
> 
> This doesn't actually explain why anybody would want it, and what the
> usage scenario is.
> 

This really helps to plot memory usage distribution. Right now the file cache
has only total counters. Collecting statistics via mincore(), as implemented
in the page-types tool, is inefficient and very racy.
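
For comparison, the mincore() approach looks roughly like the sketch below.
The function name is made up and this is not the page-types code. It has to
map the whole file and walk one byte per page, and the answer can be stale
by the time the loop finishes.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/*
 * Count resident page-cache pages of an open file via mincore().
 * Returns the number of resident pages, or -1 on error.
 */
static long resident_pages(int fd)
{
	struct stat st;
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	size_t len, npages, i;
	long resident = 0;
	void *map;

	if (page <= 0 || fstat(fd, &st) < 0)
		return -1;
	if (st.st_size == 0)
		return 0;

	len = (size_t)st.st_size;
	npages = (len + (size_t)page - 1) / (size_t)page;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;

	vec = malloc(npages);
	if (!vec || mincore(map, len, vec) < 0) {
		free(vec);
		munmap(map, len);
		return -1;
	}

	/* Bit 0 of each vector byte says whether that page is resident. */
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;

	free(vec);
	munmap(map, len);
	return resident;
}
```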

The usage scenario is the same as finding the top memory users among
processes, but among files, which are not always mapped anywhere.

For example, if somebody writes/reads logs too intensively, this file cache
could bloat and push more important data out of memory.

Also, a little bit of introspection wouldn't hurt.
Using this I've found unneeded pages beyond i_size.

>               Linus
> 

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Kirill A. Shutemov @ 2019-10-28 12:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-2-git-send-email-rppt@kernel.org>

On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will be visible neither to other processes nor to the
> kernel.
> 
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

I'm probably blind, but I don't see where you manipulate direct map...

-- 
 Kirill A. Shutemov

^ permalink raw reply
