Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

public inbox for linux-mm@kvack.org

From: Gregory Price <gregory.price@memverge.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Srinivasulu Thanneeru <sthanneeru@micron.com>,
	Srinivasulu Opensrc <sthanneeru.opensrc@micron.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"aneesh.kumar@linux.ibm.com" <aneesh.kumar@linux.ibm.com>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"mhocko@suse.com" <mhocko@suse.com>,
	"tj@kernel.org" <tj@kernel.org>,
	"john@jagalactic.com" <john@jagalactic.com>,
	Eishan Mirakhur <emirakhur@micron.com>,
	Vinicius Tavares Petrucci <vtavarespetr@micron.com>,
	Ravis OpenSrc <Ravis.OpenSrc@micron.com>,
	"Jonathan.Cameron@huawei.com" <Jonathan.Cameron@huawei.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Wei Xu <weixugc@google.com>,
	Hao Xiang <hao.xiang@bytedance.com>,
	"Ho-Ren (Jack) Chuang" <horenchuang@bytedance.com>
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
Date: Mon, 8 Jan 2024 12:04:34 -0500
Message-ID: <ZZwrIoP9+ey7rp3C@memverge.com>
In-Reply-To: <87wmspbpma.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >
> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> > abstract_distance_offset: override by users to deal with firmware issue.
> >
> > Say firmware can configure a CXL node into the wrong tier; similarly, it
> > may configure all CXL nodes as a single memtype, so all of these nodes
> > fall into a single wrong tier.
> > In that case, wouldn't a per-node adistance_offset be good to have?
> 
> I think that it's better to fix the erroneous firmware if possible.  And
> these are only theoretical, not practical issues.  Do you have any
> practical issues?
>
> I understand that users may want to move nodes between memory tiers for
> different policy choices.  For that, a memory_type-based adistance_offset
> should be good.
> 

There's actually an affirmative case for changing memory tiering to allow
either movement of nodes between tiers, or at least placement based on
HMAT information. Preferably, membership would be changeable to allow
hotplug/DCD to be managed (there's no guarantee that the memory passed
through will always match what HMAT reports at initial boot).

https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/

This group wants to enable passing CXL memory through to KVM/QEMU
(i.e. host CXL expander memory passed through to the guest), and
allow the guest to apply memory tiering.

There are multiple issues with this, presently:

1. The QEMU CXL virtual device is not, and probably never will be,
   performant enough for commodity-class virtualization.  The
   reason is that the virtual CXL device is built on the I/O
   virtualization stack, which treats memory accesses as I/O accesses.

   KVM also seems incompatible with the design of the CXL memory device
   in general, but this problem may or may not be a blocker.

   As a result, accesses to a virtual CXL memory device slow QEMU
   to a crawl, and this is unlikely to change.

   There is presently no good way forward to create a performant virtual
   CXL device in QEMU.  This means the memory tiering component in the
   kernel is functionally useless for virtual CXL memory, because...

2. When memory is passed through as an explicit NUMA node rather than
   as part of a CXL memory device, the nodes are lumped together in the
   DRAM tier.

None of this has to do with firmware.
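To make point 2 concrete, here is a minimal sketch, assuming the chunked abstract-distance scheme of mm/memory-tiers.c: a node's tier is its abstract distance divided into fixed-size chunks, with the default DRAM distance landing in tier 4. The constant values, the struct and helper names, and the per-type and per-node offset fields are hypothetical; the offsets model the ones debated in the quoted exchange and are not an existing kernel interface.

```c
#include <assert.h>

/*
 * Illustrative model only: the constants follow the chunking scheme of
 * mm/memory-tiers.c, but treat the exact values as assumptions.
 */
#define MEMTIER_CHUNK_BITS	10
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/* Hypothetical memory type: base abstract distance plus a per-type offset. */
struct memtype {
	int adistance;
	int offset;
};

/*
 * Hypothetical NUMA node: a passthrough node with no driver-supplied type
 * points at the default DRAM memtype; offset is a per-node override.
 */
struct mem_node {
	const struct memtype *type;
	int offset;
};

/* Nodes whose abstract distance falls in the same chunk share a tier. */
static int adistance_to_tier(int adistance)
{
	assert(adistance >= 0);
	return adistance >> MEMTIER_CHUNK_BITS;
}

/* Effective tier = tier of (type distance + type offset + node offset). */
static int node_tier(const struct mem_node *n)
{
	return adistance_to_tier(n->type->adistance + n->type->offset +
				 n->offset);
}
```

Two passthrough nodes that share the default DRAM type both land in tier 4. A per-type offset of 2 * MEMTIER_CHUNK_SIZE moves every node of that type together, while a per-node offset of the same size moves only the node it is applied to.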

Memory-type is an awful way of denoting membership of a tier, but we
have HMAT information that can be passed through via QEMU:

-machine hmat=on \
-object memory-backend-ram,size=4G,id=ram-node0 \
-object memory-backend-ram,size=4G,id=ram-node1 \
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
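A rough sketch of what HMAT-based placement could compute for the two nodes above. It assumes the proportional scheme used in mm/memory-tiers.c, where abstract distance grows with latency and shrinks with bandwidth relative to the default DRAM node. The struct, helper names, and constant values are illustrative assumptions, and per point 2 a plain passthrough NUMA node currently lands in the DRAM tier regardless of these numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants, mirroring the chunking in mm/memory-tiers.c. */
#define MEMTIER_CHUNK_BITS	10
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/*
 * Hypothetical per-node performance as read from HMAT.  QEMU's "access-*"
 * entries describe both reads and writes, so both fields get the same value.
 * Units only need to be consistent across nodes, since only ratios are used.
 */
struct node_perf {
	uint64_t read_latency;
	uint64_t write_latency;
	uint64_t read_bandwidth;
	uint64_t write_bandwidth;
};

/*
 * Scale the DRAM baseline: abstract distance is proportional to latency
 * and inversely proportional to bandwidth, relative to the DRAM node.
 */
static uint64_t perf_to_adistance(const struct node_perf *perf,
				  const struct node_perf *dram)
{
	uint64_t lat = perf->read_latency + perf->write_latency;
	uint64_t dram_lat = dram->read_latency + dram->write_latency;
	uint64_t bw = perf->read_bandwidth + perf->write_bandwidth;
	uint64_t dram_bw = dram->read_bandwidth + dram->write_bandwidth;

	assert(dram_lat > 0 && bw > 0);
	/* Multiply before each divide to limit truncation; 64-bit
	 * intermediates comfortably hold bandwidths in bytes/s. */
	return (uint64_t)MEMTIER_ADISTANCE_DRAM * lat / dram_lat * dram_bw / bw;
}

/* Tier index is the chunk the abstract distance falls into. */
static unsigned int adistance_to_tier(uint64_t adistance)
{
	return (unsigned int)(adistance >> MEMTIER_CHUNK_BITS);
}
```

With the values from the command line (latency 10 vs 20, bandwidth 10485760 vs 5242880), node 1 comes out at four times the DRAM abstract distance, which in this model is tier 18 rather than DRAM's tier 4.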

Not only would it be nice if we could change tier membership based on
this data, it's realistically the only way to allow guests to accomplish
memory tiering with KVM/QEMU and CXL memory passed through to the guest.

~Gregory
