* [LSF/MM/BPF TOPIC] MM: Mapcount Madness
From: David Hildenbrand @ 2024-01-29 12:05 UTC
To: lsf-pc; +Cc: linux-mm@kvack.org, Michal Hocko, Dan Williams
As PTE-mapped large folios become more relevant (mTHP [1]) and there is
the desire to shrink the metadata allocated for such large folios as
well (memdesc [2]), how we track folio mappings gets more relevant. Over
the years, we used folio mapping information to answer various
questions: is this folio mapped by somebody else? do we have to COW on
write fault? how do we adjust memory statistics? ...
Let's talk about ongoing work in the mapcount area, get a common
understanding of what the users of the different mapcounts are and what
the implications of removing some would be: which questions could we
answer differently, which questions would we not be able to answer
precisely anymore, and what would be the implications of such changes?
For example, can we tolerate some imprecise memory statistics? How
expressive is the PSS when large folios are only partially mapped? Would
we need a transition period and glue changes to a new CONFIG_ option? Do
we really have to support THP and friends on 32bit?
[1] https://patchwork.kernel.org/project/linux-mm/cover/20231207161211.2374093-1-ryan.roberts@arm.com/#25628022
[2] https://kernelnewbies.org/MatthewWilcox/Memdescs
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] MM: Mapcount Madness
From: Matthew Wilcox @ 2024-01-29 13:49 UTC
To: David Hildenbrand; +Cc: lsf-pc, linux-mm@kvack.org, Michal Hocko, Dan Williams
On Mon, Jan 29, 2024 at 01:05:04PM +0100, David Hildenbrand wrote:
> As PTE-mapped large folios become more relevant (mTHP [1]) and there is the
> desire to shrink the metadata allocated for such large folios as well
> (memdesc [2]), how we track folio mappings gets more relevant. Over the
> years, we used folio mapping information to answer various questions: is
> this folio mapped by somebody else? do we have to COW on write fault? how do
> we adjust memory statistics? ...
>
> Let's talk about ongoing work in the mapcount area, get a common
> understanding of what the users of the different mapcounts are and what the
> implications of removing some would be: which questions could we answer
> differently, which questions would we not be able to answer precisely
> anymore, and what would be the implications of such changes?
>
> For example, can we tolerate some imprecise memory statistics? How
> expressive is the PSS when large folios are only partially mapped? Would we
> need a transition period and glue changes to a new CONFIG_ option? Do we
> really have to support THP and friends on 32bit?
Excellent topics to cover. I have some of my own questions ...
Are we in danger of overflowing page refcount too easily? Pincount
isn't an issue here; we're talking about large folios, so pincount gets
its own field. But with tracking one mapcount per PTE mapping of a
folio, we can easily increment a PMD-sized folio's refcount by 512
per VMA. Now we only need 2^22 VMAs to hit the 2^31 limit before the
page->refcount protections go into effect and operations start failing.
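Spelling that arithmetic out as a trivial userspace sketch (nothing here
is kernel code; the numbers are just the ones above):

#include <stdio.h>

int main(void)
{
	long long per_vma = 512;	/* PTE mappings of one PMD-sized folio per VMA */
	long long limit = 1LL << 31;	/* where the refcount overflow protection kicks in */

	printf("VMAs needed to overflow: %lld (= 2^22)\n", limit / per_vma);
	return 0;
}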
How / do we need to track mapcount for pages mapped to userspace which
are neither file-backed, nor anonymous mappings? eg drivers pass
vmalloc memory to vmf_insert_page() in their ->mmap handler.
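For concreteness, I mean the kind of pattern sketched below (made-up
driver names, error handling trimmed; vm_insert_page() is the
non-vm_fault_t flavour of the same operation):

static int my_drv_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct my_drv *drv = file->private_data;	/* hypothetical driver state */
	unsigned long i, npages = vma_pages(vma);

	for (i = 0; i < npages; i++) {
		/* drv->buf came from vmalloc(): neither file-backed nor anonymous */
		struct page *page = vmalloc_to_page(drv->buf + i * PAGE_SIZE);
		int err = vm_insert_page(vma, vma->vm_start + i * PAGE_SIZE, page);

		if (err)
			return err;
	}
	return 0;
}

(As it happens, vm_insert_page() also sets VM_MIXEDMAP on the VMA behind
the driver's back, which ties into the next question.)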
What do VM_PFNMAP and VM_MIXEDMAP really imply? The documentation here
is a little sparse. And that's sad, because I think we expect device
driver writers to use them, and without clear documentation of what
they actually do, they're going to be misused.
* Re: [LSF/MM/BPF TOPIC] MM: Mapcount Madness
From: David Hildenbrand @ 2024-01-29 14:09 UTC
To: Matthew Wilcox; +Cc: lsf-pc, linux-mm@kvack.org, Michal Hocko, Dan Williams
On 29.01.24 14:49, Matthew Wilcox wrote:
> On Mon, Jan 29, 2024 at 01:05:04PM +0100, David Hildenbrand wrote:
>> As PTE-mapped large folios become more relevant (mTHP [1]) and there is the
>> desire to shrink the metadata allocated for such large folios as well
>> (memdesc [2]), how we track folio mappings gets more relevant. Over the
>> years, we used folio mapping information to answer various questions: is
>> this folio mapped by somebody else? do we have to COW on write fault? how do
>> we adjust memory statistics? ...
>>
>> Let's talk about ongoing work in the mapcount area, get a common
>> understanding of what the users of the different mapcounts are and what the
>> implications of removing some would be: which questions could we answer
>> differently, which questions would we not be able to answer precisely
>> anymore, and what would be the implications of such changes?
>>
>> For example, can we tolerate some imprecise memory statistics? How
>> expressive is the PSS when large folios are only partially mapped? Would we
>> need a transition period and glue changes to a new CONFIG_ option? Do we
>> really have to support THP and friends on 32bit?
>
> Excellent topics to cover. I have some of my own questions ...
>
> Are we in danger of overflowing page refcount too easily? Pincount
> isn't an issue here; we're talking about large folios, so pincount gets
> its own field. But with tracking one mapcount per PTE mapping of a
> folio, we can easily increment a PMD-sized folio's refcount by 512
> per VMA. Now we only need 2^22 VMAs to hit the 2^31 limit before the
> page->refcount protections go into effect and operations start failing.
I think we'll definitely want to either detect such overflows early and
fail fork/page faults/etc., or, if there are sane use cases (2^22 sounds
excessive, but we might be getting larger folios ...), rather move to a
64bit refcount for large folios (or any folio, for simplicity? TBD) in
the future.
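As a rough sketch of the "detect early and fail" variant (the helper name
is made up, and the check/add race is ignored for illustration):

/*
 * Hypothetical: refuse to take more references instead of overflowing.
 * Not atomic (check and add can race); purely to illustrate the idea.
 */
static inline bool folio_ref_add_checked(struct folio *folio, int nr)
{
	if (folio_ref_count(folio) > INT_MAX - nr)
		return false;	/* caller fails the fault/fork instead */
	folio_ref_add(folio, nr);
	return true;
}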
And I think, then, once again the question will be: how much time are we
willing to invest to support THP and friends on 32bit, and is it really
worth it.
>
> How / do we need to track mapcount for pages mapped to userspace which
> are neither file-backed, nor anonymous mappings? eg drivers pass
> vmalloc memory to vmf_insert_page() in their ->mmap handler.
As of today, vm_insert_page() and friends end up calling
insert_page_into_pte_locked(), which does a
folio_get(folio);
inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
folio_add_file_rmap_pte(folio, page, vma);
That is, we get non-rmappable folios (not pagecache/shmem/anon) into rmap
code. That's nonsensical, because the rmap does not apply to such pages
(rmap walks won't work, there is no rmap). When I stumbled over that
recently, my guess was that the current handling only exists to keep
the munmap/zap path simple.
IMHO, we shouldn't call rmap code on that path (and similarly, when
unmapping). If we want to adjust some mapcounts for some reason, we
better do that explicitly.
And that is an excellent topic to discuss.
>
> What do VM_PFNMAP and VM_MIXEDMAP really imply? The documentation here
> is a little sparse. And that's sad, because I think we expect device
> driver writers to use them, and without clear documentation of what
> they actually do, they're going to be misused.
Agreed, it's under-documented. In general, VM_PFNMAP means "map whatever
you want, as long as you make sure it cannot get freed+reused while it
is still mapped". That is, if the memory was allocated, the driver has
to hold a reference, but the core won't be messing with any
refcount/mapcount/rmap/stats/..., treating the memory as if "struct page"
didn't exist.
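In driver terms, the typical VM_PFNMAP setup looks roughly like this
(a sketch; the device fields are made up):

/*
 * Sketch: map a hypothetical MMIO BAR to userspace. remap_pfn_range()
 * sets VM_PFNMAP (plus VM_IO | VM_DONTEXPAND | VM_DONTDUMP); the core MM
 * won't touch refcounts/mapcounts/rmap for these PTEs, the driver alone
 * guarantees the memory stays valid while it is mapped.
 */
static int my_bar_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct my_drv *drv = file->private_data;
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > drv->bar_len)
		return -EINVAL;

	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return remap_pfn_range(vma, vma->vm_start,
			       drv->bar_phys >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}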
VM_MIXEDMAP is the complicated brother that uses "struct page" if it exists.
Another good topic, agreed.
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] MM: Mapcount Madness
From: Jason Gunthorpe @ 2024-02-01 16:17 UTC
To: Matthew Wilcox
Cc: David Hildenbrand, lsf-pc, linux-mm@kvack.org, Michal Hocko,
Dan Williams
On Mon, Jan 29, 2024 at 01:49:30PM +0000, Matthew Wilcox wrote:
> What do VM_PFNMAP and VM_MIXEDMAP really imply? The documentation here
> is a little sparse. And that's sad, because I think we expect device
> driver writers to use them, and without clear documentation of what
> they actually do, they're going to be misused.
In many common driver cases the vm_* core code does set them when
installing PFNs into a VMA.
PFNMAP means every PTE in the VMA is a pure physical address with no
struct page refcounting attached. In many, but not all, cases there
is no struct page at all.
MIXEDMAP means some PTEs are struct-pageless and others are not, but
if a PTE does have a struct page, then the struct page is used.
I've had the feeling this is primarily there to support arches that lack
the "special" PTE bit (CONFIG_ARCH_HAS_PTE_SPECIAL; badly named, but it
marks a PTE whose struct page, if any, must not be touched by the core
MM), and these flags help trigger some guessing to fix that up.
The clearest documentation I've seen is in vm_normal_page().
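Heavily simplified, the decision it documents has roughly this shape (not
the real code, just the gist; the COW corner cases of PFNMAP and the zero
page are glossed over):

static struct page *normal_page_sketch(struct vm_area_struct *vma, pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);

	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
		/* "special" PTEs carry no struct page the core MM may touch */
		if (pte_special(pte))
			return NULL;
	} else if (vma->vm_flags & VM_MIXEDMAP) {
		/* no special bit: guess based on whether a struct page exists */
		if (!pfn_valid(pfn))
			return NULL;
	} else if (vma->vm_flags & VM_PFNMAP) {
		/* pure PFN mapping: never hand a struct page to the core MM */
		return NULL;
	}

	return pfn_to_page(pfn);
}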
In the majority of cases drivers want PFNMAP. MIXEDMAP is the weird
thing.
IIRC drivers get forced into MIXEDMAP if they want to prepopulate a VMA
instead of using the fault path, which is unfortunate if drivers are
actually working 100% with struct-page-backed memory (e.g. why does
binder set MIXEDMAP? Does it work with struct-page-less PFNs, and where
would it even get them from?).
I believe this is due to missing driver-facing APIs, not anything
fundamental. Although I seem to recall there was some race inside mmap
if you try to prepopulate during the VMA's ->mmap op. With zap, I think.
Jason