Linux CXL
From: Gregory Price <gregory.price@memverge.com>
To: "Ho-Ren (Jack) Chuang" <horenchuang@bytedance.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	"Hao Xiang" <hao.xiang@bytedance.com>,
	"Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
	"Ben Widawsky" <ben.widawsky@intel.com>,
	"Gregory Price" <gourry.memverge@gmail.com>,
	"Fan Ni" <fan.ni@samsung.com>, "Ira Weiny" <ira.weiny@intel.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"David Hildenbrand" <david@redhat.com>,
	"Igor Mammedov" <imammedo@redhat.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	"Eduardo Habkost" <eduardo@habkost.net>,
	qemu-devel@nongnu.org, "Ho-Ren (Jack) Chuang" <horenc@vt.edu>,
	linux-cxl@vger.kernel.org
Subject: Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
Date: Wed, 3 Jan 2024 16:56:07 -0500	[thread overview]
Message-ID: <ZZXX95yvk/WTIBT/@memverge.com> (raw)
In-Reply-To: <20240101075315.43167-2-horenchuang@bytedance.com>

On Sun, Dec 31, 2023 at 11:53:15PM -0800, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
> 
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \

Stupid questions:

Why not just use `host-nodes` and pass in the numa node you want to
allocate from? Why should QEMU be made "CXL-aware" in the sense that
QEMU is responsible for figuring out what host node has CXL memory?

This feels like an "upper level software" operation (orchestration), rather
than something qemu should internally understand.
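Concretely, that already works today with existing options. A sketch (the host node number 2 below is a hypothetical example; check which node the CXL memory actually appears as on the host):

```shell
# Bind the guest backend's pages to the host NUMA node backed by CXL
# memory (node 2 is assumed here; verify with `numactl -H` on the host):
-object memory-backend-ram,id=vmem0,size=19G,host-nodes=2,policy=bind
```

The orchestrator just has to look up which host node is CXL-backed and pass it in; QEMU itself stays CXL-agnostic.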

> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \

For a variety of performance reasons, this will not work the way you
want it to.  You are essentially telling QEMU to map the vmem0 into a
virtual cxl device, and now any memory accesses to that memory region
will end up going through the cxl-type3 device logic - which is an IO
path from the perspective of QEMU.

You may want to double check that your tests aren't using swap space in
the guest, because I found in my testing that linux would prefer to use
swap rather than attempt to use virtual cxl-type3 device memory because
of how god-awful slow it is (even if it is backed by DRAM).
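A quick way to check that from inside the guest (standard util-linux/procps tooling; what it reports obviously depends on the guest config):

```shell
swapon --show                               # any active swap devices?
grep -E 'SwapTotal|SwapFree' /proc/meminfo  # how much swap is configured/used
vmstat 1 5                                  # nonzero si/so columns = swap traffic
```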


Additionally, this configuration will not (should not) presently work
with VMX enabled.  Unless I missed some other update, QEMU w/ CXL memory
presently crashes when VMX is enabled for not-yet-debugged reasons.

Another possibility: You mapped this memory-backend into another numa
node explicitly and never onlined the memory via cxl-cli.  I've done
this, and it works, but it's a "hidden feature" that probably should
not exist / be supported.
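For reference, the explicit onlining flow looks roughly like this with the cxl-cli/daxctl tools from ndctl (flags vary by version, and the decoder/memdev names depend entirely on the topology; see cxl-create-region(1)):

```shell
cxl list -M                                     # enumerate CXL memdevs
cxl create-region -d decoder0.0 -t ram -m mem0  # carve a volatile region
# if the resulting dax device is not auto-onlined as system RAM:
daxctl reconfigure-device --mode=system-ram dax0.0
```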



If I understand the goal here, it's to pass CXL-hosted DRAM through to
the guest in a way that the system can manage it according to its
performance attributes.

You're better off just using the `host-nodes` field of the host-memory
backend and passing bandwidth/latency attributes through via `-numa hmat-lb`.

In that scenario, the guest software doesn't even need to know CXL
exists at all, it can just read the attributes of the numa node
that QEMU created for it.
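To make that concrete, here is a hypothetical guest-side sketch (not from the patch; the function names are mine) that picks a NUMA node purely from the HMAT-derived sysfs attributes the kernel already exposes, with no CXL awareness at all:

```python
# Sketch: choose a NUMA node from /sys/devices/system/node HMAT attributes.
# Assumes the access0/initiators layout shown in the transcript below.
from pathlib import Path

def read_access_attrs(node: int, base: str = "/sys/devices/system/node") -> dict:
    """Return the access0 initiator attributes published for a NUMA node."""
    attrs_dir = Path(base) / f"node{node}" / "access0" / "initiators"
    attrs = {}
    for name in ("read_bandwidth", "read_latency",
                 "write_bandwidth", "write_latency"):
        f = attrs_dir / name
        if f.exists():
            attrs[name] = int(f.read_text())
    return attrs

def fastest_node(nodes, base: str = "/sys/devices/system/node") -> int:
    """Pick the node with the lowest read latency; break ties on bandwidth."""
    def key(n):
        a = read_access_attrs(n, base)
        return (a.get("read_latency", float("inf")),
                -a.get("read_bandwidth", 0))
    return min(nodes, key=key)
```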

In the future to deal with proper dynamic capacity, we may need to
consider a different backend object altogether that allows sparse
allocations, and a virtual cxl device which pre-allocates the CFMW
can at least be registered to manage it.  I'm not quite sure how that
looks just yet.

For example: 1-socket, 4 CPU QEMU instance w/ 4GB on a cpu-node and 4GB
on a cpuless node.

qemu-system-x86_64 \
-nographic \
-accel kvm \
-machine type=q35,hmat=on \
-drive file=./myvm.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 8G,slots=4,maxmem=16G \
-smp cpus=4 \
-object memory-backend-ram,size=4G,id=ram-node0,numa=X \ <-- extend here
-object memory-backend-ram,size=4G,id=ram-node1,numa=Y \ <-- extend here
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \          <-- cpu node
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \       <-- cpuless node
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:77 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880

[root@fedora ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 3965 MB
node 0 free: 3611 MB
node 1 cpus:
node 1 size: 3986 MB
node 1 free: 3960 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

[root@fedora ~]# cd /sys/devices/system/node/node1/access0/initiators
[root@fedora initiators]# ls
node0  read_bandwidth  read_latency  write_bandwidth  write_latency
[root@fedora initiators]# cat read_bandwidth
5
[root@fedora initiators]# cat read_latency
20


~Gregory

Thread overview: 18+ messages
2024-01-01  7:53 [QEMU-devel][RFC PATCH 0/1] Introduce HostMemType for 'memory-backend-*' Ho-Ren (Jack) Chuang
2024-01-01  7:53 ` [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram' Ho-Ren (Jack) Chuang
2024-01-02 10:29   ` Philippe Mathieu-Daudé
2024-01-02 13:03   ` David Hildenbrand
2024-01-06  0:45     ` [External] " Hao Xiang
2024-01-03 21:56   ` Gregory Price [this message]
2024-01-06  5:59     ` Hao Xiang
2024-01-08 17:15       ` Gregory Price
2024-01-08 22:47         ` Hao Xiang
2024-01-09  1:05           ` Hao Xiang
2024-01-09  1:13             ` Gregory Price
2024-01-09 19:33               ` Hao Xiang
2024-01-09 19:57                 ` Gregory Price
2024-01-09 21:27                   ` Hao Xiang
2024-01-09 22:13                     ` Gregory Price
2024-01-09 23:55                       ` Hao Xiang
2024-01-10 14:31                         ` Jonathan Cameron
2024-01-12 15:32   ` Markus Armbruster
