[LSF/MM/BPF TOPIC] Userspace managed memory tiering

* [LSF/MM/BPF TOPIC] Userspace managed memory tiering
@ 2021-06-18 17:50 Wei Xu
  2021-06-18 19:13 ` Zi Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Wei Xu @ 2021-06-18 17:50 UTC (permalink / raw)
  To: lsf-pc, Linux MM
  Cc: Dan Williams, Dave Hansen, Tim Chen, David Rientjes, Greg Thelen,
	Paul Turner, Shakeel Butt

In this proposal, I'd like to discuss userspace-managed memory tiering
and the kernel support that it needs.

New memory technologies and interconnect standards make it possible to
have memory with different performance and cost on the same machine
(e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
We can expect heterogeneous memory systems that have performance
implications far beyond classical NUMA to become increasingly common
in the future.  One important use case of such tiered memory systems
is to improve data center and cloud efficiency through better
performance/TCO.

Because different classes of applications (e.g. latency sensitive vs
latency tolerant, high priority vs low priority) have different
requirements, richer and more flexible memory tiering policies will
be needed to achieve the desired performance target on a tiered
memory system.  Such policies would be more effectively managed by a
userspace agent than by the kernel.  Moreover, we (Google) are
explicitly trying to avoid adding a ton of heuristics to enlighten the
kernel about the policy that we want on multi-tenant machines when
userspace offers more flexibility.

To manage memory tiering in userspace, we need kernel support in
three key areas:

- resource abstraction and control of tiered memory;
- API to monitor page accesses for making memory tiering decisions;
- API to migrate pages (demotion/promotion).

Userspace memory tiering can work on just NUMA memory nodes, provided
that memory resources from different tiers are abstracted into
separate NUMA nodes.  The userspace agent can create a tiering
topology among these nodes based on their distances.
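
As a rough illustration of the distance-based topology described above,
here is a minimal sketch.  It assumes the distance rows come from the
per-node sysfs files (/sys/devices/system/node/nodeN/distance), that node
ids are contiguous, and that nodes with CPUs form the top tier.  The
function names and the grouping rule are hypothetical, not an existing
interface.

```python
def parse_distance_rows(rows):
    """Parse the text of each node's sysfs `distance` file into a matrix."""
    return [[int(x) for x in row.split()] for row in rows]


def build_tiers(distance, cpu_nodes):
    """Group nodes into tiers by distance from the nearest CPU node.

    Tier 0 holds the nodes that have CPUs (assumed here to be DRAM).
    CPU-less nodes are grouped by ascending minimum distance to any
    CPU node; equal distances land in the same tier.
    """
    cpu_nodes = set(cpu_nodes)
    by_distance = {}
    for node in range(len(distance)):
        if node in cpu_nodes:
            continue
        nearest = min(distance[cpu][node] for cpu in cpu_nodes)
        by_distance.setdefault(nearest, []).append(node)
    tiers = [sorted(cpu_nodes)]
    tiers += [by_distance[d] for d in sorted(by_distance)]
    return tiers
```

A real agent would probably need more than distances (e.g. measured
latency or bandwidth), since firmware-reported distances can be coarse.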

An explicit memory tiering abstraction in the kernel is preferred,
though, because it would not only allow the kernel to react in cases
that are challenging for userspace (e.g. reclaim-based demotion
when the system is under DRAM pressure due to a usage surge), but also
enable tiering controls such as per-cgroup memory tier limits.
This requirement is mostly aligned with the existing proposals [1]
and [2].

The userspace agent manages all migratable user memory on the system,
and this can be transparent from the point of view of applications.
To demote cold pages and promote hot pages, the userspace agent needs
page access information.  Because the tiering is system-wide for user
memory, access information for both mapped and unmapped user pages
is needed, and so are the physical page addresses.  A combination of
page table accessed-bit scanning and struct page scanning will likely
be needed.  Such page access monitoring should also be efficient
because the scans can be frequent.  To return the page-level access
information to userspace, one proposal is to use tracepoint events.
The userspace agent can then use BPF programs to collect such data
and also apply customized filters when necessary.
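
To make the data flow concrete, here is a small sketch of what the agent
might do with the collected access information.  It assumes each access
event boils down to a PFN and that the agent already knows which tier
each tracked page sits in.  The event format, the "fast"/"slow" labels
and the threshold are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def plan_migrations(accesses, page_tier, hot_threshold=2):
    """Pick promotion and demotion candidates for one scan window.

    accesses:  iterable of PFNs, one entry per observed access.
    page_tier: dict mapping each tracked PFN to "fast" or "slow".
    Returns (promote, demote) as sorted PFN lists.
    """
    counts = Counter(accesses)
    # Pages in the slow tier that were accessed often enough are hot.
    promote = sorted(pfn for pfn, tier in page_tier.items()
                     if tier == "slow" and counts[pfn] >= hot_threshold)
    # Pages in the fast tier with no access in the window are cold.
    demote = sorted(pfn for pfn, tier in page_tier.items()
                    if tier == "fast" and counts[pfn] == 0)
    return promote, demote
```

The policy itself is exactly the part that is expected to differ between
deployments, which is the argument for keeping it in userspace.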

The userspace agent can also make use of hardware PMU events, for
which the existing kernel support should be sufficient.

The third area is the API support for migrating pages.  The existing
move_pages() syscall can be a candidate, though it is virtual-address
based and cannot migrate unmapped pages.  Is a physical-address based
variant (e.g. move_pfns()) an acceptable proposal?
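
For reference, this is roughly how the existing interface looks from
userspace.  The sketch calls move_pages(2) through the raw syscall via
ctypes.  The syscall numbers are architecture-specific and only x86_64
and aarch64 are covered, which is an assumption of this sketch.  The
helper names are made up.  Note that every page has to be named by a
(pid, virtual address) pair, which is why unmapped pages are out of
reach.

```python
import ctypes
import os
import platform

# Syscall numbers for move_pages(2) are architecture-specific; only the
# two listed here are covered by this sketch.
SYS_MOVE_PAGES = {"x86_64": 279, "aarch64": 239}
MPOL_MF_MOVE = 1 << 1  # move only pages used exclusively by the process


def page_addresses(base, length, page_size):
    """Return the start address of every page overlapping the range."""
    start = base - (base % page_size)
    return list(range(start, base + length, page_size))


def move_pages(pid, addrs, nodes=None, flags=MPOL_MF_MOVE):
    """Call move_pages(2) on virtual addresses in process `pid`.

    With nodes=None nothing is moved and the per-page status reports the
    node each page currently resides on (or a negative errno).
    """
    nr = SYS_MOVE_PAGES[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    count = len(addrs)
    pages = (ctypes.c_void_p * count)(*addrs)
    status = (ctypes.c_int * count)()
    targets = (ctypes.c_int * count)(*nodes) if nodes is not None else None
    rc = libc.syscall(ctypes.c_long(nr), ctypes.c_int(pid),
                      ctypes.c_ulong(count), pages, targets, status,
                      ctypes.c_int(flags))
    if rc < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return list(status)
```

A pid of 0 refers to the calling process.  Moving pages of another
process needs additional privileges.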

[1] https://lore.kernel.org/lkml/9cd0dcde-f257-1b94-17d0-f2e24a3ce979@intel.com/
[2] https://lore.kernel.org/patchwork/cover/1408180/

Thanks,
Wei


* Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
  2021-06-18 17:50 [LSF/MM/BPF TOPIC] Userspace managed memory tiering Wei Xu
@ 2021-06-18 19:13 ` Zi Yan
  2021-06-18 19:23   ` Wei Xu
  2021-06-18 21:07 ` David Rientjes
  2021-06-21 18:58 ` Yang Shi
  2 siblings, 1 reply; 7+ messages in thread
From: Zi Yan @ 2021-06-18 19:13 UTC (permalink / raw)
  To: Wei Xu
  Cc: lsf-pc, Linux MM, Dan Williams, Dave Hansen, Tim Chen,
	David Rientjes, Greg Thelen, Paul Turner, Shakeel Butt

On 18 Jun 2021, at 13:50, Wei Xu wrote:

> In this proposal, I'd like to discuss userspace-managed memory tiering
> and the kernel support that it needs.
>
> New memory technologies and interconnect standard make it possible to
> have memory with different performance and cost on the same machine
> (e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
> We can expect heterogeneous memory systems that have performance
> implications far beyond classical NUMA to become increasingly common
> in the future.  One of important use cases of such tiered memory
> systems is to improve the data center and cloud efficiency with
> better performance/TCO.
>
> Because different classes of applications (e.g. latency sensitive vs
> latency tolerant, high priority vs low priority) have different
> requirements, richer and more flexible memory tiering policies will
> be needed to achieve the desired performance target on a tiered
> memory system, which would be more effectively managed by a userspace
> agent, not by the kernel.  Moreover, we (Google) are explicitly trying
> to avoid adding a ton of heuristics to enlighten the kernel about the
> policy that we want on multi-tenant machines when the userspace offers
> more flexibility.
>
> To manage memory tiering in userspace, we need the kernel support in
> the three key areas:
>
> - resource abstraction and control of tiered memory;
> - API to monitor page accesses for making memory tiering decisions;
> - API to migrate pages (demotion/promotion).
>
> Userspace memory tiering can work on just NUMA memory nodes, provided
> that memory resources from different tiers are abstracted into
> separate NUMA nodes.  The userspace agent can create a tiering
> topology among these nodes based on their distances.
>
> An explicit memory tiering abstraction in the kernel is preferred,
> though, because it can not only allow the kernel to react in cases
> where it is challenging for userspace (e.g. reclaim-based demotion
> when the system is under DRAM pressure due to usage surge), but also
> enable tiering controls such as per-cgroup memory tier limits.
> This requirement is mostly aligned with the existing proposals [1]
> and [2].
>
> The userspace agent manages all migratable user memory on the system
> and this can be transparent from the point of view of applications.
> To demote cold pages and promote hot pages, the userspace agent needs
> page access information.  Because it is a system-wide tiering for user
> memory, the access information for both mapped and unmapped user pages
> is needed, and so are the physical page addresses.  A combination of
> page table accessed-bit scanning and struct page scanning should be
> needed.  Such page access monitoring should be efficient as well
> because the scans can be frequent. To return the page-level access
> information to the userspace, one proposal is to use tracepoint
> events. The userspace agent can then use BPF programs to collect such
> data and also apply customized filters when necessary.
>
> The userspace agent can also make use of hardware PMU events, for
> which the existing kernel support should be sufficient.

I agree that userspace agents would be more flexible in terms of implementing
different page migration policies if the OS provides interfaces for that
like IRIX did before[1].

> The third area is the API support for migrating pages. The existing
> move_pages() syscall can be a candidate, though it is virtual-address
> based and cannot migrate unmapped pages.  Is a physical-address based
> variant (e.g. move_pfns()), an acceptable proposal?

A PFN cannot be moved, right? I guess you mean moving the data from one
page to another based on the given PFN. What are the potential use
cases of moving unmapped pages? Moving unmapped page cache pages?

Besides all of the above, I am also interested in using a DMA engine or
another HW-provided data copy engine for page migration instead of
CPUs [2], and in migrating pages asynchronously, since both could save
CPU resources when page migration between nodes becomes more frequent.


[1] https://studies.ac.upc.edu/dso/papers/nikolopoulos00case.pdf
[2] https://lwn.net/Articles/784925/


—
Best Regards,
Yan, Zi


* Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
  2021-06-18 19:13 ` Zi Yan
@ 2021-06-18 19:23   ` Wei Xu
  0 siblings, 0 replies; 7+ messages in thread
From: Wei Xu @ 2021-06-18 19:23 UTC (permalink / raw)
  To: Zi Yan
  Cc: lsf-pc, Linux MM, Dan Williams, Dave Hansen, Tim Chen,
	David Rientjes, Greg Thelen, Paul Turner, Shakeel Butt

On Fri, Jun 18, 2021 at 12:13 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 18 Jun 2021, at 13:50, Wei Xu wrote:
>
> > In this proposal, I'd like to discuss userspace-managed memory tiering
> > and the kernel support that it needs.
> >
> > New memory technologies and interconnect standard make it possible to
> > have memory with different performance and cost on the same machine
> > (e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
> > We can expect heterogeneous memory systems that have performance
> > implications far beyond classical NUMA to become increasingly common
> > in the future.  One of important use cases of such tiered memory
> > systems is to improve the data center and cloud efficiency with
> > better performance/TCO.
> >
> > Because different classes of applications (e.g. latency sensitive vs
> > latency tolerant, high priority vs low priority) have different
> > requirements, richer and more flexible memory tiering policies will
> > be needed to achieve the desired performance target on a tiered
> > memory system, which would be more effectively managed by a userspace
> > agent, not by the kernel.  Moreover, we (Google) are explicitly trying
> > to avoid adding a ton of heuristics to enlighten the kernel about the
> > policy that we want on multi-tenant machines when the userspace offers
> > more flexibility.
> >
> > To manage memory tiering in userspace, we need the kernel support in
> > the three key areas:
> >
> > - resource abstraction and control of tiered memory;
> > - API to monitor page accesses for making memory tiering decisions;
> > - API to migrate pages (demotion/promotion).
> >
> > Userspace memory tiering can work on just NUMA memory nodes, provided
> > that memory resources from different tiers are abstracted into
> > separate NUMA nodes.  The userspace agent can create a tiering
> > topology among these nodes based on their distances.
> >
> > An explicit memory tiering abstraction in the kernel is preferred,
> > though, because it can not only allow the kernel to react in cases
> > where it is challenging for userspace (e.g. reclaim-based demotion
> > when the system is under DRAM pressure due to usage surge), but also
> > enable tiering controls such as per-cgroup memory tier limits.
> > This requirement is mostly aligned with the existing proposals [1]
> > and [2].
> >
> > The userspace agent manages all migratable user memory on the system
> > and this can be transparent from the point of view of applications.
> > To demote cold pages and promote hot pages, the userspace agent needs
> > page access information.  Because it is a system-wide tiering for user
> > memory, the access information for both mapped and unmapped user pages
> > is needed, and so are the physical page addresses.  A combination of
> > page table accessed-bit scanning and struct page scanning should be
> > needed.  Such page access monitoring should be efficient as well
> > because the scans can be frequent. To return the page-level access
> > information to the userspace, one proposal is to use tracepoint
> > events. The userspace agent can then use BPF programs to collect such
> > data and also apply customized filters when necessary.
> >
> > The userspace agent can also make use of hardware PMU events, for
> > which the existing kernel support should be sufficient.
>
> I agree that userspace agents would be more flexible in terms of implementing
> different page migration policies if the OS provides interfaces for that
> like IRIX did before[1].
>
> > The third area is the API support for migrating pages. The existing
> > move_pages() syscall can be a candidate, though it is virtual-address
> > based and cannot migrate unmapped pages.  Is a physical-address based
> > variant (e.g. move_pfns()), an acceptable proposal?
>
> PFN cannot be moved, right? I guess you mean moving the data from one
> page to another based on the given PFN. What are the potential use
> cases of moving unmapped pages? Moving unmapped page cache pages?

Right, move_pfns() is not the best name.  The idea is exactly to move data from
one page to another based on the given PFN.  Other than page cache pages,
another example is tmpfs pages that are not mmap-ed.

> Besides all above, using DMA engine or other HW-provided data copy engine
> for page migration instead of CPUs[2] and migrating pages in an async way
> are something I am interested in, since it could save CPU resources
> when page migration between nodes becomes more frequent.

This is a great point, which is also what we are interested in.  The
idea is that the API to migrate pages can be optimized with such HW
acceleration when available.  Even with CPUs, we have found that
non-temporal stores are useful for demotions because they bypass caches
and work better for hardware such as PMEM.

>
> [1] https://studies.ac.upc.edu/dso/papers/nikolopoulos00case.pdf
> [2] https://lwn.net/Articles/784925/
>
>
> —
> Best Regards,
> Yan, Zi

Wei



* Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
  2021-06-18 17:50 [LSF/MM/BPF TOPIC] Userspace managed memory tiering Wei Xu
  2021-06-18 19:13 ` Zi Yan
@ 2021-06-18 21:07 ` David Rientjes
  2021-06-19 23:43   ` Jason Gunthorpe
  2021-06-21 18:58 ` Yang Shi
  2 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2021-06-18 21:07 UTC (permalink / raw)
  To: Wei Xu
  Cc: lsf-pc, Linux MM, Dan Williams, Dave Hansen, Tim Chen,
	Greg Thelen, Paul Turner, Shakeel Butt

On Fri, 18 Jun 2021, Wei Xu wrote:

> In this proposal, I'd like to discuss userspace-managed memory tiering
> and the kernel support that it needs.
> 

Thanks Wei.  Yes, this would be very useful to discuss at LSFMMBPF.

It would also be very helpful to hear from other interested parties here 
on the mailing list ahead of time.  It would be great to know the 
motivations and priorities of others interested in memory tiering for the 
use cases that Wei enumerated so that we can do some early brainstorming.

Thanks!



* Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
  2021-06-18 21:07 ` David Rientjes
@ 2021-06-19 23:43   ` Jason Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2021-06-19 23:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Wei Xu, lsf-pc, Linux MM, Dan Williams, Dave Hansen, Tim Chen,
	Greg Thelen, Paul Turner, Shakeel Butt

On Fri, Jun 18, 2021 at 02:07:08PM -0700, David Rientjes wrote:
> On Fri, 18 Jun 2021, Wei Xu wrote:
> 
> > In this proposal, I'd like to discuss userspace-managed memory tiering
> > and the kernel support that it needs.
> > 
> 
> Thanks Wei.  Yes, this would be very useful to discuss at LSFMMBPF.
> 
> It would also be very helpful to hear from other interested parties here 
> on the mailing list ahead of time.  It would be great to know the 
> motivations and priorities of others interested in memory tiering for the 
> use cases that Wei enumerated so that we can do some early brainstorming.

This reminds me quite a lot of the pitch that was given for the HMM
migration user space policy stuff aimed at GPUs, but perhaps
differently generalized?

Jason



* Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
  2021-06-18 17:50 [LSF/MM/BPF TOPIC] Userspace managed memory tiering Wei Xu
  2021-06-18 19:13 ` Zi Yan
  2021-06-18 21:07 ` David Rientjes
@ 2021-06-21 18:58 ` Yang Shi
  2021-06-22  3:00   ` Huang, Ying
  2 siblings, 1 reply; 7+ messages in thread
From: Yang Shi @ 2021-06-21 18:58 UTC (permalink / raw)
  To: Wei Xu
  Cc: lsf-pc, Linux MM, Dan Williams, Dave Hansen, Tim Chen,
	David Rientjes, Greg Thelen, Paul Turner, Shakeel Butt,
	ying.huang

On Fri, Jun 18, 2021 at 10:50 AM Wei Xu <weixugc@google.com> wrote:
>
> In this proposal, I'd like to discuss userspace-managed memory tiering
> and the kernel support that it needs.
>
> New memory technologies and interconnect standard make it possible to
> have memory with different performance and cost on the same machine
> (e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
> We can expect heterogeneous memory systems that have performance
> implications far beyond classical NUMA to become increasingly common
> in the future.  One of important use cases of such tiered memory
> systems is to improve the data center and cloud efficiency with
> better performance/TCO.
>
> Because different classes of applications (e.g. latency sensitive vs
> latency tolerant, high priority vs low priority) have different
> requirements, richer and more flexible memory tiering policies will
> be needed to achieve the desired performance target on a tiered
> memory system, which would be more effectively managed by a userspace
> agent, not by the kernel.  Moreover, we (Google) are explicitly trying
> to avoid adding a ton of heuristics to enlighten the kernel about the
> policy that we want on multi-tenant machines when the userspace offers
> more flexibility.
>
> To manage memory tiering in userspace, we need the kernel support in
> the three key areas:
>
> - resource abstraction and control of tiered memory;
> - API to monitor page accesses for making memory tiering decisions;
> - API to migrate pages (demotion/promotion).
>
> Userspace memory tiering can work on just NUMA memory nodes, provided
> that memory resources from different tiers are abstracted into
> separate NUMA nodes.  The userspace agent can create a tiering
> topology among these nodes based on their distances.
>
> An explicit memory tiering abstraction in the kernel is preferred,
> though, because it can not only allow the kernel to react in cases
> where it is challenging for userspace (e.g. reclaim-based demotion
> when the system is under DRAM pressure due to usage surge), but also
> enable tiering controls such as per-cgroup memory tier limits.
> This requirement is mostly aligned with the existing proposals [1]
> and [2].
>
> The userspace agent manages all migratable user memory on the system
> and this can be transparent from the point of view of applications.
> To demote cold pages and promote hot pages, the userspace agent needs
> page access information.  Because it is a system-wide tiering for user
> memory, the access information for both mapped and unmapped user pages
> is needed, and so are the physical page addresses.  A combination of
> page table accessed-bit scanning and struct page scanning should be
> needed.  Such page access monitoring should be efficient as well
> because the scans can be frequent. To return the page-level access
> information to the userspace, one proposal is to use tracepoint
> events. The userspace agent can then use BPF programs to collect such
> data and also apply customized filters when necessary.

Just FYI, there has been a project for a userspace daemon.  Please refer
to https://github.com/fengguang/memory-optimizer

We (Alibaba, when I was there) did some preliminary tests and
benchmarks with it. The accuracy was pretty good, but the cost was
relatively high. I agree with you that efficiency is the key. BPF may
be a good approach to reduce the cost.

I'm not sure what the current status of this project is. You may reach
Huang Ying to get more information.

>
> The userspace agent can also make use of hardware PMU events, for
> which the existing kernel support should be sufficient.
>
> The third area is the API support for migrating pages. The existing
> move_pages() syscall can be a candidate, though it is virtual-address
> based and cannot migrate unmapped pages.  Is a physical-address based
> variant (e.g. move_pfns()), an acceptable proposal?
>
> [1] https://lore.kernel.org/lkml/9cd0dcde-f257-1b94-17d0-f2e24a3ce979@intel.com/
> [2] https://lore.kernel.org/patchwork/cover/1408180/
>
> Thanks,
> Wei
>



* Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
  2021-06-21 18:58 ` Yang Shi
@ 2021-06-22  3:00   ` Huang, Ying
  0 siblings, 0 replies; 7+ messages in thread
From: Huang, Ying @ 2021-06-22  3:00 UTC (permalink / raw)
  To: Yang Shi
  Cc: Wei Xu, lsf-pc, Linux MM, Dan Williams, Dave Hansen, Tim Chen,
	David Rientjes, Greg Thelen, Paul Turner, Shakeel Butt,
	wufengguang

Yang Shi <shy828301@gmail.com> writes:

> On Fri, Jun 18, 2021 at 10:50 AM Wei Xu <weixugc@google.com> wrote:
>>
>> In this proposal, I'd like to discuss userspace-managed memory tiering
>> and the kernel support that it needs.
>>
>> New memory technologies and interconnect standard make it possible to
>> have memory with different performance and cost on the same machine
>> (e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
>> We can expect heterogeneous memory systems that have performance
>> implications far beyond classical NUMA to become increasingly common
>> in the future.  One of important use cases of such tiered memory
>> systems is to improve the data center and cloud efficiency with
>> better performance/TCO.
>>
>> Because different classes of applications (e.g. latency sensitive vs
>> latency tolerant, high priority vs low priority) have different
>> requirements, richer and more flexible memory tiering policies will
>> be needed to achieve the desired performance target on a tiered
>> memory system, which would be more effectively managed by a userspace
>> agent, not by the kernel.  Moreover, we (Google) are explicitly trying
>> to avoid adding a ton of heuristics to enlighten the kernel about the
>> policy that we want on multi-tenant machines when the userspace offers
>> more flexibility.

Because more knowledge about the applications may be available in
user space, it's possible for an advanced user space solution to work
better than a basic kernel space solution for some workloads.  But
this doesn't make the basic in-kernel optimization solution useless :-)

>> To manage memory tiering in userspace, we need the kernel support in
>> the three key areas:
>>
>> - resource abstraction and control of tiered memory;
>> - API to monitor page accesses for making memory tiering decisions;
>> - API to migrate pages (demotion/promotion).
>>
>> Userspace memory tiering can work on just NUMA memory nodes, provided
>> that memory resources from different tiers are abstracted into
>> separate NUMA nodes.  The userspace agent can create a tiering
>> topology among these nodes based on their distances.
>>
>> An explicit memory tiering abstraction in the kernel is preferred,
>> though, because it can not only allow the kernel to react in cases
>> where it is challenging for userspace (e.g. reclaim-based demotion
>> when the system is under DRAM pressure due to usage surge), but also
>> enable tiering controls such as per-cgroup memory tier limits.
>> This requirement is mostly aligned with the existing proposals [1]
>> and [2].
>>
>> The userspace agent manages all migratable user memory on the system
>> and this can be transparent from the point of view of applications.
>> To demote cold pages and promote hot pages, the userspace agent needs
>> page access information.  Because it is a system-wide tiering for user
>> memory, the access information for both mapped and unmapped user pages
>> is needed, and so are the physical page addresses.  A combination of
>> page table accessed-bit scanning and struct page scanning should be
>> needed.  Such page access monitoring should be efficient as well
>> because the scans can be frequent. To return the page-level access
>> information to the userspace, one proposal is to use tracepoint
>> events. The userspace agent can then use BPF programs to collect such
>> data and also apply customized filters when necessary.
>
> Just FYI. There has been a project for userspace daemon. Please refer
> to https://github.com/fengguang/memory-optimizer
>
> We (Alibaba, when I was there) did some preliminary tests and
> benchmarks with it. The accuracy was pretty good, but the cost was
> relatively high. I agree with you that efficiency is the key. BPF may
> be a good approach to improve the cost.
>
> I'm not sure what the current status of this project is. You may reach
> Huang Ying to get more information.

We have stopped working on that project because we are focusing on the
basic kernel space solution for now.  We would be glad if the code is
helpful in any way for anyone.

>>
>> The userspace agent can also make use of hardware PMU events, for
>> which the existing kernel support should be sufficient.

There's a PMU-based implementation in the above GitHub project too.

>> The third area is the API support for migrating pages. The existing
>> move_pages() syscall can be a candidate, though it is virtual-address
>> based and cannot migrate unmapped pages.

Dave told me before that, for file cache pages, we can map the file
ourselves and call move_pages() on the pages to migrate them between
NUMA nodes.
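
A minimal sketch of the first half of that approach, assuming a regular,
non-empty file that the agent can open read-write: map it shared, touch
each page so it is present in the agent's page table, and compute the
per-page virtual addresses.  Those addresses are what would then be
handed to move_pages() for the calling process.  The helper name is made
up for illustration.

```python
import ctypes
import mmap
import os


def map_file_pages(path):
    """Map `path` shared and return (mapping, list of page addresses)."""
    fd = os.open(path, os.O_RDWR)
    try:
        mm = mmap.mmap(fd, 0)  # length 0 maps the whole file, shared
    finally:
        os.close(fd)  # the mapping stays valid after the fd is closed
    # Read one byte per page so each page is faulted into this process.
    for off in range(0, len(mm), mmap.PAGESIZE):
        mm[off]
    anchor = ctypes.c_char.from_buffer(mm)
    base = ctypes.addressof(anchor)
    del anchor  # drop the buffer export so mm.close() works later
    addrs = [base + off for off in range(0, len(mm), mmap.PAGESIZE)]
    return mm, addrs
```

The cost is that the agent has to open and map every file it wants to
manage, which may not scale well for a system-wide agent.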

Best Regards,
Huang, Ying

>> Is a physical-address based
>> variant (e.g. move_pfns()), an acceptable proposal?
>>
>> [1] https://lore.kernel.org/lkml/9cd0dcde-f257-1b94-17d0-f2e24a3ce979@intel.com/
>> [2] https://lore.kernel.org/patchwork/cover/1408180/
>>
>> Thanks,
>> Wei
>>


