* [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Mike Rapoport @ 2023-02-01 18:06 UTC
To: lsf-pc, linux-mm
Hi all,
There are use-cases that need to remove pages from the direct map or at least
map them at PTE level. These use-cases include vfree, module loading, ftrace,
kprobe, BPF, secretmem and generally any caller of set_memory/set_direct_map
APIs.
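For illustration, a minimal sketch of how such a caller ends up splitting a
large mapping (an illustrative example, assuming x86, where CPA also updates
the direct map alias of a vmalloc range):

#include <linux/vmalloc.h>
#include <asm/set_memory.h>

static void *alloc_ro_page(void)
{
        /* backed by a single 4K page that lives in the direct map */
        void *p = vmalloc(PAGE_SIZE);

        if (!p)
                return NULL;
        /*
         * Changing the protection of the vmalloc alias also changes
         * the direct map alias, so the 2M (or 1G) direct map entry
         * covering that page is split down to PTE level.
         */
        set_memory_ro((unsigned long)p, 1);
        return p;
}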
Remapping pages at PTE level causes splits of the PUD- and PMD-sized mappings
in the direct map, which leads to performance degradation.
To reduce the performance hit caused by the fragmentation of the direct
map, it makes sense to group and/or cache the base pages removed from the
direct map so that most of the base pages created during a split of a large
page will be consumed by users requiring PTE level mappings.
Last year the proposal to use a new migrate type for such a cache received
strong pushback and the suggested alternative was to try to use slab
instead.
I've been thinking about it (yeah, it took me a while) and I believe slab
is not appropriate because the use cases require at least page size
allocations, some would really benefit from higher order allocations, and
in most cases the code that allocates memory excluded from the direct map
needs the struct page/folio.
For example, caching allocations of text in 2M pages would benefit from
reduced iTLB pressure, and doing kmalloc() from vmalloc() would be way more
intrusive than using some variant of __alloc_pages().
Secretmem and potentially PKS protected page tables also need struct
page/folio.
My current proposal is to have a cache of 2M pages close to the page
allocator and use a GFP flag to make allocation requests use that cache. On
the free() path, the pages that are mapped at PTE level will be put into
that cache.
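From the caller's side it would look roughly like this (a sketch only; the
flag name __GFP_UNMAPPED comes from my earlier RFC and is not in mainline,
so the final interface may differ):

struct page *page;

/*
 * Served from the cache of PTE-mapped 2M pages instead of the
 * regular free lists; the page is also removed from the direct map.
 */
page = alloc_pages(GFP_KERNEL | __GFP_UNMAPPED, 0);
if (!page)
        return -ENOMEM;

/* ... use the page via a PTE level mapping, e.g. vmap() ... */

/* on free the page returns to the cache rather than to the buddy */
__free_pages(page, 0);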
The cache is internally implemented as a buddy allocator so it can satisfy
high order allocations, and there will be a shrinker to release free pages
from that cache to the page allocator.
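Internally, something along these lines (a rough sketch with hypothetical
names; the actual prototype may look quite different):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/shrinker.h>
#include <linux/spinlock.h>
#include <asm/set_memory.h>

#define UC_ORDER 9                      /* 2M on x86-64 */

/* hypothetical cache; in practice likely one instance per node */
static struct unmapped_cache {
        spinlock_t       lock;
        /* buddy-style free lists for orders 0..UC_ORDER */
        struct list_head free_list[UC_ORDER + 1];
        unsigned long    nr_free;       /* in base pages */
} uc;

static unsigned long uc_count(struct shrinker *sh, struct shrink_control *sc)
{
        return READ_ONCE(uc.nr_free);
}

static unsigned long uc_scan(struct shrinker *sh, struct shrink_control *sc)
{
        unsigned long freed = 0;

        spin_lock(&uc.lock);
        /* only fully merged 2M blocks are handed back */
        while (freed < sc->nr_to_scan &&
               !list_empty(&uc.free_list[UC_ORDER])) {
                struct page *page;

                page = list_first_entry(&uc.free_list[UC_ORDER],
                                        struct page, lru);
                list_del(&page->lru);
                uc.nr_free -= 1UL << UC_ORDER;
                spin_unlock(&uc.lock);

                /*
                 * Restore the direct map (simplified here; really a
                 * loop over all 512 base pages plus a TLB flush) and
                 * give the block back to the page allocator.
                 */
                set_direct_map_default_noflush(page);
                __free_pages(page, UC_ORDER);
                freed += 1UL << UC_ORDER;

                spin_lock(&uc.lock);
        }
        spin_unlock(&uc.lock);
        return freed;
}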
I hope to have a first prototype posted Really Soon.
--
Sincerely yours,
Mike.
* Re: [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Hyeonggon Yoo @ 2023-02-19 8:07 UTC
To: Mike Rapoport; +Cc: lsf-pc, linux-mm, Aaron Lu, Kirill A. Shutemov
On Wed, Feb 01, 2023 at 08:06:37PM +0200, Mike Rapoport wrote:
> Hi all,
Hi Mike, I'm interested in this topic and hope to discuss this with you
at LSF/MM/BPF.
> There are use-cases that need to remove pages from the direct map or at least
> map them at PTE level. These use-cases include vfree, module loading, ftrace,
> kprobe, BPF, secretmem and generally any caller of set_memory/set_direct_map
> APIs.
>
> Remapping pages at PTE level causes splits of the PUD- and PMD-sized mappings
> in the direct map, which leads to performance degradation.
>
> To reduce the performance hit caused by the fragmentation of the direct
> map, it makes sense to group and/or cache the base pages removed from the
> direct map so that most of the base pages created during a split of a large
> page will be consumed by users requiring PTE level mappings.
How much performance difference did you see in your tests when the direct
map was fragmented, or is there a way to measure this difference?
> Last year the proposal to use a new migrate type for such a cache received
> strong pushback and the suggested alternative was to try to use slab
> instead.
>
> I've been thinking about it (yeah, it took me a while) and I believe slab
> is not appropriate because the use cases require at least page size
> allocations, some would really benefit from higher order allocations, and
> in most cases the code that allocates memory excluded from the direct map
> needs the struct page/folio.
>
> For example, caching allocations of text in 2M pages would benefit from
> reduced iTLB pressure, and doing kmalloc() from vmalloc() would be way more
> intrusive than using some variant of __alloc_pages().
>
> Secretmem and potentially PKS protected page tables also need struct
> page/folio.
>
> My current proposal is to have a cache of 2M pages close to the page
> allocator and use a GFP flag to make allocation requests use that cache. On
> the free() path, the pages that are mapped at PTE level will be put into
> that cache.
I would like to discuss not only having a cache layer of pages but also how
the direct map could be merged correctly and efficiently.
I vaguely recall that Aaron Lu sent an RFC series about this and Kirill A.
Shutemov's feedback was to batch merge operations. [1]
Also, a CPA API called by the cache layer that could merge fragmented
mappings would work for merging 4K pages into 2M mappings [2], but won't
work for merging 2M mappings into 1G mappings.
At that time I didn't follow the later discussions (e.g. execmem_alloc()),
so maybe I'm missing some points.
[1] https://lore.kernel.org/linux-mm/20220809100408.rm6ofiewtty6rvcl@box
[2] https://lore.kernel.org/linux-mm/YvfLxuflw2ctHFWF@kernel.org
> The cache is internally implemented as a buddy allocator so it can satisfy
> high order allocations, and there will be a shrinker to release free pages
> from that cache to the page allocator.
>
> I hope to have a first prototype posted Really Soon.
Looking forward to that!
I wonder what shape it will take.
>
> --
> Sincerely yours,
> Mike.
* Re: [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Mike Rapoport @ 2023-02-19 18:09 UTC
To: Hyeonggon Yoo; +Cc: lsf-pc, linux-mm, Aaron Lu, Kirill A. Shutemov
On Sun, Feb 19, 2023 at 08:07:59AM +0000, Hyeonggon Yoo wrote:
> On Wed, Feb 01, 2023 at 08:06:37PM +0200, Mike Rapoport wrote:
> > Hi all,
>
> Hi Mike, I'm interested in this topic and hope to discuss this with you
> at LSF/MM/BPF.
>
> > To reduce the performance hit caused by the fragmentation of the direct
> > map, it makes sense to group and/or cache the base pages removed from the
> > direct map so that most of the base pages created during a split of a large
> > page will be consumed by users requiring PTE level mappings.
>
> How much performance difference did you see in your tests when the direct
> map was fragmented, or is there a way to measure this difference?
I did some benchmarks a while ago with the entire direct map forced to 2M
or 4k pages. The results I had are here:
https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing
Intel folks did more comprehensive testing and their results are here:
https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
> > My current proposal is to have a cache of 2M pages close to the page
> > allocator and use a GFP flag to make allocation requests use that cache. On
> > the free() path, the pages that are mapped at PTE level will be put into
> > that cache.
>
> I would like to discuss not only having a cache layer of pages but also how
> the direct map could be merged correctly and efficiently.
>
> I vaguely recall that Aaron Lu sent an RFC series about this and Kirill A.
> Shutemov's feedback was to batch merge operations. [1]
>
> Also, a CPA API called by the cache layer that could merge fragmented
> mappings would work for merging 4K pages into 2M mappings [2], but won't
> work for merging 2M mappings into 1G mappings.
One possible way is to make CPA scan all PMDs in a 1G page after merging a 2M
page. Not sure how efficient it would be, though.
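Something like this, perhaps (a sketch with x86-64 in mind; the helper
itself is hypothetical):

/*
 * After collapsing a 2M range, check whether the 1G region around it
 * became collapsible too: every PMD must be a leaf with identical
 * protections.  Physical contiguity holds by construction in the
 * direct map.
 */
static bool pud_region_collapsible(pud_t *pud)
{
        pmd_t *pmd = pmd_offset(pud, 0);
        pgprot_t ref = pmd_pgprot(*pmd);
        int i;

        for (i = 0; i < PTRS_PER_PMD; i++) {
                if (!pmd_leaf(pmd[i]))
                        return false;
                if (pgprot_val(pmd_pgprot(pmd[i])) != pgprot_val(ref))
                        return false;
        }
        return true;
}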
> At that time I didn't follow the later discussions (e.g. execmem_alloc()),
> so maybe I'm missing some points.
>
> [1] https://lore.kernel.org/linux-mm/20220809100408.rm6ofiewtty6rvcl@box
>
> [2] https://lore.kernel.org/linux-mm/YvfLxuflw2ctHFWF@kernel.org
--
Sincerely yours,
Mike.
* Re: [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Hyeonggon Yoo @ 2023-02-20 14:43 UTC
To: Mike Rapoport; +Cc: lsf-pc, linux-mm, Aaron Lu, Kirill A. Shutemov
On Sun, Feb 19, 2023 at 08:09:07PM +0200, Mike Rapoport wrote:
> On Sun, Feb 19, 2023 at 08:07:59AM +0000, Hyeonggon Yoo wrote:
> > On Wed, Feb 01, 2023 at 08:06:37PM +0200, Mike Rapoport wrote:
> > > Hi all,
> >
> > Hi Mike, I'm interested in this topic and hope to discuss this with you
> > at LSF/MM/BPF.
> >
> > > To reduce the performance hit caused by the fragmentation of the direct
> > > map, it makes sense to group and/or cache the base pages removed from the
> > > direct map so that most of the base pages created during a split of a large
> > > page will be consumed by users requiring PTE level mappings.
> >
> > How much performance difference did you see in your tests when the direct
> > map was fragmented, or is there a way to measure this difference?
>
> I did some benchmarks a while ago with the entire direct map forced to 2M
> or 4k pages. The results I had are here:
>
> https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing
>
> Intel folks did more comprehensive testing and their results are here:
>
> https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
Thanks!
Hmm, it might not be the best choice to unconditionally merge 2M mappings
into a 1G mapping. (Maybe this should be controlled via a boot parameter
or something.)
> > > My current proposal is to have a cache of 2M pages close to the page
> > > allocator and use a GFP flag to make allocation requests use that cache. On
> > > the free() path, the pages that are mapped at PTE level will be put into
> > > that cache.
> >
> > I would like to discuss not only having a cache layer of pages but also how
> > the direct map could be merged correctly and efficiently.
> >
> > I vaguely recall that Aaron Lu sent an RFC series about this and Kirill A.
> > Shutemov's feedback was to batch merge operations. [1]
> >
> > Also, a CPA API called by the cache layer that could merge fragmented
> > mappings would work for merging 4K pages into 2M mappings [2], but won't
> > work for merging 2M mappings into 1G mappings.
>
> One possible way is to make CPA scan all PMDs in a 1G page after merging a 2M
> page. Not sure how efficient it would be, though.
That seems to be similar to what Kirill A. Shutemov tried.
He may have opinions about that?
[3] https://lore.kernel.org/lkml/20200416213229.19174-1-kirill.shutemov@linux.intel.com
> At that time I didn't follow the later discussions (e.g. execmem_alloc()),
> so maybe I'm missing some points.
> >
> > [1] https://lore.kernel.org/linux-mm/20220809100408.rm6ofiewtty6rvcl@box
> >
> > [2] https://lore.kernel.org/linux-mm/YvfLxuflw2ctHFWF@kernel.org
>
> --
> Sincerely yours,
> Mike.
* Re: [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Mike Rapoport @ 2023-02-24 14:45 UTC
To: Hyeonggon Yoo; +Cc: lsf-pc, linux-mm, Aaron Lu, Kirill A. Shutemov
On Mon, Feb 20, 2023 at 02:43:03PM +0000, Hyeonggon Yoo wrote:
> On Sun, Feb 19, 2023 at 08:09:07PM +0200, Mike Rapoport wrote:
> > On Sun, Feb 19, 2023 at 08:07:59AM +0000, Hyeonggon Yoo wrote:
>
> > > > My current proposal is to have a cache of 2M pages close to the page
> > > > allocator and use a GFP flag to make allocation requests use that cache. On
> > > > the free() path, the pages that are mapped at PTE level will be put into
> > > > that cache.
> > >
> > > I would like to discuss not only having a cache layer of pages but also how
> > > the direct map could be merged correctly and efficiently.
> > >
> > > I vaguely recall that Aaron Lu sent an RFC series about this and Kirill A.
> > > Shutemov's feedback was to batch merge operations. [1]
> > >
> > > Also, a CPA API called by the cache layer that could merge fragmented
> > > mappings would work for merging 4K pages into 2M mappings [2], but won't
> > > work for merging 2M mappings into 1G mappings.
> >
> > One possible way is to make CPA scan all PMDs in a 1G page after merging a 2M
> > page. Not sure how efficient it would be, though.
>
> That seems to be similar to what Kirill A. Shutemov tried.
> He may have opinions about that?
>
> [3] https://lore.kernel.org/lkml/20200416213229.19174-1-kirill.shutemov@linux.intel.com
Kirill's patch attempted to restore a 1G page on each cpa_flush(), so it
scanned a lot of page tables without a guarantee that collapsing small
mappings into a large page was possible.
If we call a function that collapses a 2M page only when we know for sure
that the collapse is possible, it will be more efficient.
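With the cache tracking the state of every 2M block, the trigger could be
as simple as this (a sketch; cpa_collapse_pmd() is a hypothetical CPA entry
point that does not exist today):

/*
 * Called when buddy merging inside the cache makes an entire 2M block
 * free again, i.e. when the collapse is known to be possible.
 */
static void uc_collapse_block(struct page *page)
{
        unsigned long addr = (unsigned long)page_address(page);

        /* hypothetical: fold the 512 PTEs back into a single PMD */
        cpa_collapse_pmd(addr);
}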
--
Sincerely yours,
Mike.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Michal Hocko @ 2023-04-21 9:05 UTC
To: Mike Rapoport; +Cc: lsf-pc, linux-mm, Dan Williams
Hi,
On Wed 01-02-23 20:06:37, Mike Rapoport wrote:
[...]
> My current proposal is to have a cache of 2M pages close to the page
> allocator and use a GFP flag to make allocation requests use that cache. On
> the free() path, the pages that are mapped at PTE level will be put into
> that cache.
Are there still open questions which would benefit from a discussion at
LSFMM this year?
--
Michal Hocko
SUSE Labs
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Mike Rapoport @ 2023-04-21 9:47 UTC
To: Michal Hocko; +Cc: lsf-pc, linux-mm, Dan Williams
On Fri, Apr 21, 2023 at 11:05:20AM +0200, Michal Hocko wrote:
> Hi,
>
> On Wed 01-02-23 20:06:37, Mike Rapoport wrote:
> [...]
> > My current proposal is to have a cache of 2M pages close to the page
> > allocator and use a GFP flag to make allocation requests use that cache. On
> > the free() path, the pages that are mapped at PTE level will be put into
> > that cache.
>
> Are there still open questions which would benefit from a discussion at
> LSFMM this year?
Yes, I believe so.
I was trying to get some numbers to see what the benefit of __GFP_UNMAPPED
would be, and I couldn't find a benchmark that would produce results with a
good signal-to-noise ratio.
So while it seems that there's general agreement on how to implement
caching of 2M pages, there is still no evidence that it will be universally
useful.
It would be interesting to discuss the reasons for the inconclusive results
and, more importantly, what the general direction for dealing with direct
map fragmentation should be.
As it seems now, packing code allocations into 2M pages would be an
improvement, while data allocations that fragment the direct map do not
have much impact on overall system performance.
I'll bring the mmtest results I have to begin the discussion.
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] reducing direct map fragmentation
From: Michal Hocko @ 2023-04-21 12:41 UTC
To: Mike Rapoport; +Cc: lsf-pc, linux-mm, Dan Williams
On Fri 21-04-23 12:47:06, Mike Rapoport wrote:
> On Fri, Apr 21, 2023 at 11:05:20AM +0200, Michal Hocko wrote:
> > Hi,
> >
> > On Wed 01-02-23 20:06:37, Mike Rapoport wrote:
> > [...]
> > > My current proposal is to have a cache of 2M pages close to the page
> > > allocator and use a GFP flag to make allocation requests use that cache. On
> > > the free() path, the pages that are mapped at PTE level will be put into
> > > that cache.
> >
> > Are there still open questions which would benefit from a discussion at
> > LSFMM this year?
>
> Yes, I believe so.
>
> I was trying to get some numbers to see what the benefit of __GFP_UNMAPPED
> would be, and I couldn't find a benchmark that would produce results with a
> good signal-to-noise ratio.
>
> So while it seems that there's general agreement on how to implement
> caching of 2M pages, there is still no evidence that it will be universally
> useful.
>
> It would be interesting to discuss the reasons for the inconclusive results
> and, more importantly, what the general direction for dealing with direct
> map fragmentation should be.
>
> As it seems now, packing code allocations into 2M pages would be an
> improvement, while data allocations that fragment the direct map do not
> have much impact on overall system performance.
>
> I'll bring the mmtest results I have to begin the discussion.
Makes sense. Thanks!
--
Michal Hocko
SUSE Labs