linux-mm.kvack.org archive mirror
* Can someone explain what free_pgd_range(), etc actually do?
@ 2017-11-03 12:11 Andy Lutomirski
From: Andy Lutomirski @ 2017-11-03 12:11 UTC
  To: Dave Hansen, Kirill A. Shutemov, Hugh Dickins, linux-mm@kvack.org; +Cc: X86 ML

I want to reserve a tiny bit of the address space just below 1<<47 on
x86_64 for kernel purposes but without stealing away management of the
page tables.  It seems like the way to do that is to set
USER_PGTABLES_CEILING to 0 and then make some adjustment to
exit_mmap() to free the tables on exit.

The problem is that free_pgd_range(), free_pgtables(), etc. are quite
opaque to me, and I'm having a hard time understanding the pagetable
freeing code.  Some questions I haven't figured out:

 - What is the intended purpose of addr, end, floor, and ceiling?
What are the pagetable freeing functions actually *supposed* to do?

 - Are there any invariants that, for example, there is never a
pagetable that doesn't have any vmas at all under it?  I can
understand how all the code would be correct if this invariant were to
exist, but I don't see what would preserve it.  But maybe
free_pgd_range(), etc really do preserve it.

 - What keeps mm->mmap pointing to the lowest-addressed vma?  I see
lots of code that seems to assume that you can start at mm->mmap,
follow the vm_next links, and find all vmas, but I can't figure out
why this would work.

 - What happens if a process exits while mm->mmap is NULL?

 - Is there any piece of code that makes it obvious that all the
pagetables are gone by the time exit_mmap() finishes?

Because I'm starting to wonder whether some weird combination of maps
and unmaps will just leak pagetables, and the code is rather
complicated, subtle, and completely lacking in documentation, and I've
learned to be quite suspicious of such things.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

* Re: Can someone explain what free_pgd_range(), etc actually do?
@ 2017-11-03 15:06 ` Dave Hansen
From: Dave Hansen @ 2017-11-03 15:06 UTC
  To: Andy Lutomirski, Kirill A. Shutemov, Hugh Dickins,
	linux-mm@kvack.org
  Cc: X86 ML

On 11/03/2017 05:11 AM, Andy Lutomirski wrote:
>  - What is the intended purpose of addr, end, floor, and ceiling?
> What are the pagetable freeing functions actually *supposed* to do?

I've always logically thought of it as: the VMA (and this addr/end) tell
us where we _must_ walk and free.  floor/ceiling tell us about
neighboring areas that are unused.  We do not have to walk the unused
areas, but we must free them if we clear out their last use.

Walking is presumably expensive.  We use the VMA information and plumb
it down through floor/ceiling to make sure that we're not having to look
at a full page of data at each level every time we free a VMA.

I think that might be what's tripping you up: floor/ceiling is just an
optimization.  It's not logically required for freeing page tables, but
it does speed things up.
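
A minimal user-space sketch of the boundary test described above, modeled
loosely on the checks at the top of the per-level free routines in
mm/memory.c. The helper name table_is_freeable() and its span parameter are
hypothetical, not kernel API, and the sketch assumes [addr, end) lies inside
a single span-aligned region:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not kernel API. Decides whether the page-table page
 * mapping the span-aligned region around [addr, end) may be freed.
 *   floor   - end of the nearest mapping below (0 if there is none)
 *   ceiling - start of the nearest mapping above (0 means "nothing above")
 *   span    - bytes covered by one page-table page at this level
 */
static bool table_is_freeable(unsigned long addr, unsigned long end,
                              unsigned long floor, unsigned long ceiling,
                              unsigned long span)
{
    /* span must be a non-zero power of two for the mask to make sense */
    assert(span != 0 && (span & (span - 1)) == 0);
    unsigned long mask = ~(span - 1);

    /* A neighbor below still reaches into this table: keep it. */
    if ((addr & mask) < floor)
        return false;

    if (ceiling) {
        /* Round the upper neighbor down to the start of its table. */
        ceiling &= mask;
        if (!ceiling)
            return false;
    }

    /* ceiling == 0 wraps to ULONG_MAX here, so "nothing above" passes. */
    return end - 1 <= ceiling - 1;
}
```

With a 2 MiB span, a VMA at 0x400000-0x401000 whose upper neighbor starts at
0x500000 shares its table with that neighbor, so the table is kept; if the
neighbor starts at 0x600000 instead, the table can go.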

>  - Are there any invariants that, for example, there is never a
> pagetable that doesn't have any vmas at all under it?  I can
> understand how all the code would be correct if this invariant were to
> exist, but I don't see what would preserve it.  But maybe
> free_pgd_range(), etc really do preserve it.

I think it's implemented more like: the last VMA using a page table will
free the page table when the VMA is torn down.  It does this by looking
at its neighbors (or lack thereof) at unmap_region() time and expanding
the range covered by floor/ceiling.
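
A small sketch of how floor and ceiling could be derived from a VMA's
neighbors at unmap time, as described above. The struct and function names
are hypothetical stand-ins; first_user_addr and pgtables_ceiling are passed
as parameters here and are assumed to play the role of the kernel's
FIRST_USER_ADDRESS and USER_PGTABLES_CEILING:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down model of a VMA: just the fields used here. */
struct vma_model {
    unsigned long vm_start, vm_end;
    struct vma_model *vm_prev, *vm_next;
};

struct range_model {
    unsigned long floor, ceiling;
};

/*
 * Compute the limits for freeing page tables when the VMAs between prev
 * and next are torn down. With no neighbor on a side, the limit falls back
 * to the edge of the user address space for that side.
 */
static struct range_model free_limits(const struct vma_model *prev,
                                      const struct vma_model *next,
                                      unsigned long first_user_addr,
                                      unsigned long pgtables_ceiling)
{
    struct range_model r;

    /* Neighbors are assumed sorted and non-overlapping. */
    assert(!prev || !next || prev->vm_end <= next->vm_start);

    r.floor = prev ? prev->vm_end : first_user_addr;
    r.ceiling = next ? next->vm_start : pgtables_ceiling;
    return r;
}
```

The fewer neighbors a VMA has, the wider the resulting range, which is what
lets the last VMA under a table take the table down with it.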

>  - What keeps mm->mmap pointing to the lowest-addressed vma?  I see
> lots of code that seems to assume that you can start at mm->mmap,
> follow the vm_next links, and find all vmas, but I can't figure out
> why this would work.

__vma_(un)link_list() is where the magic normally happens.  It
effectively uses the rbtree to determine where to put the VMA in the
list to maintain ordering.
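
A user-space sketch of the ordering idea: find the predecessor first, then
splice the new VMA in after it, so the list head always stays the
lowest-addressed entry. All names are hypothetical, and the linear
find_prev() below is only a stand-in for the rbtree lookup the kernel uses:

```c
#include <assert.h>
#include <stddef.h>

struct vma_model {
    unsigned long vm_start, vm_end;
    struct vma_model *vm_prev, *vm_next;
};

struct mm_model {
    struct vma_model *mmap;   /* head of the address-ordered list */
};

/* Stand-in for the rbtree lookup: last VMA ending at or below addr. */
static struct vma_model *find_prev(struct mm_model *mm, unsigned long addr)
{
    struct vma_model *prev = NULL;

    for (struct vma_model *v = mm->mmap; v && v->vm_end <= addr; v = v->vm_next)
        prev = v;
    return prev;
}

/* Link vma after its predecessor, or at the head if it has none. */
static void link_vma(struct mm_model *mm, struct vma_model *vma)
{
    struct vma_model *prev = find_prev(mm, vma->vm_start);
    struct vma_model *next = prev ? prev->vm_next : mm->mmap;

    /* The new VMA must not overlap its successor. */
    assert(!next || vma->vm_end <= next->vm_start);

    vma->vm_prev = prev;
    vma->vm_next = next;
    if (prev)
        prev->vm_next = vma;
    else
        mm->mmap = vma;       /* no predecessor: this is the new lowest VMA */
    if (next)
        next->vm_prev = vma;
}
```

Because every insertion goes through the predecessor lookup, walking from
mm->mmap along vm_next visits the VMAs in ascending address order no matter
what order they were mapped in.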

>  - What happens if a process exits while mm->mmap is NULL?

You mean how do we free the page tables for it?  We had to do a bunch of
unmap_region() calls before that to axe all the VMAs, and the page tables
_should_ have been zapped then.

Now, if someone goes and just sets mm->mmap to NULL, we're obviously screwed,
but we leaked a bunch of VMAs _anyway_, in addition to the page tables.

>  - Is there any piece of code that makes it obvious that all the
> pagetables are gone by the time exit_mmap() finishes?

mm->nr_ptes and mm->nr_pmds (and soon nr_puds) should tell us if we
forgot to free one.  I think that's our main defense.
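
A toy model of that counter-based defense, assuming the counters are bumped
on every page-table allocation, dropped on every free, and inspected when
the mm is finally torn down. The struct and function names are hypothetical;
only the idea of checking for a non-zero count at teardown comes from the
text above:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model of the per-mm page-table counters. */
struct mm_counters {
    long nr_ptes;
    long nr_pmds;
};

static void table_allocated(long *counter)
{
    (*counter)++;
}

static void table_freed(long *counter)
{
    assert(*counter > 0);     /* freeing more than was allocated is a bug */
    (*counter)--;
}

/* Report any counter that is still non-zero; returns how many were. */
static int check_counters(const struct mm_counters *mm)
{
    int problems = 0;

    if (mm->nr_ptes) {
        fprintf(stderr, "non-zero nr_ptes on freeing mm: %ld\n", mm->nr_ptes);
        problems++;
    }
    if (mm->nr_pmds) {
        fprintf(stderr, "non-zero nr_pmds on freeing mm: %ld\n", mm->nr_pmds);
        problems++;
    }
    return problems;
}
```

A check like this catches a leak only after the fact and only as a count; it
says nothing about which table was missed.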

I have some vague recollection that we also looked for zero'd page table
pages somewhere at free time, but I'm not finding it.

> Because I'm starting to wonder whether some weird combination of maps
> and unmaps will just leak pagetables, and the code is rather
> complicated, subtle, and completely lacking in documentation, and I've
> learned to be quite suspicious of such things.

There have surely been bugs.  FWIW, there's some code in the MPX
selftests that tries to map and free a bunch of random addresses to trip
up the MPX code.  I ran it a *lot* and this code never got tripped up on
it that I can remember.
