From: David Hildenbrand <david@redhat.com>
To: "david.dai" <david.dai@montage-tech.com>
Cc: peter.maydell@linaro.org, vsementsov@virtuozzo.com,
eajames@linux.ibm.com, qemu-devel@nongnu.org,
changguo.du@montage-tech.com,
Stefan Hajnoczi <stefanha@redhat.com>,
Igor Mammedov <imammedo@redhat.com>,
kuhn.chenqun@huawei.com
Subject: Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU
Date: Wed, 13 Oct 2021 10:33:39 +0200
Message-ID: <0c244f16-ca16-9f70-dab8-f543accc063b@redhat.com>
In-Reply-To: <20211013081337.GA96268@tianmu-host-sw-01>
On 13.10.21 10:13, david.dai wrote:
> On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
>>
>>
>>
>>>> virtio-mem currently relies on having a single sparse memory region (anon
>>>> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
>>>> share memory with other processes, sharing with other VMs is not intended.
>>>> Instead of actually mmaping parts dynamically (which can be quite
>>>> expensive), virtio-mem relies on punching holes into the backend and
>>>> dynamically allocating memory/file blocks/... on access.
>>>>
>>>> So the easy way to make it work is:
>>>>
>>>> a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
>>>> memory getting managed by the buddy on a separate NUMA node.
>>>>
>>>
>>> The Linux kernel buddy system? How can we guarantee that other applications
>>> don't allocate memory from it?
>>
>> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
>> such that even if some other allocation ended up there, it could
>> get migrated somewhere else.
>>
>> For example, "daxctl reconfigure-device" tries doing that by default:
>>
>> https://pmem.io/ndctl/daxctl-reconfigure-device.html
>>
>> However, I agree that we might actually want to tell the system to not
>> use this CPU-less node as fallback for other allocations, and that we
>> might not want to swap out such memory etc.
>>
>>
>> But, in the end all that virtio-mem needs to work in the hypervisor is
>>
>> a) A sparse memmap (anonymous RAM, memfd, file)
>> b) A way to populate memory within that sparse memmap (e.g., on fault,
>> using madvise(MADV_POPULATE_WRITE), fallocate())
>> c) A way to discard memory (madvise(MADV_DONTNEED),
>> fallocate(FALLOC_FL_PUNCH_HOLE))
>>
>> So instead of using anonymous memory+mbind, you can also mmap a sparse file
>> and rely on populate-on-demand. One alternative for your use case would be
>> to create a DAX filesystem on that CXL memory (IIRC that should work) and
>> simply provide virtio-mem with a sparse file located on that filesystem.
>>
>> Of course, you can also use some other mechanism as you might have in
>> your approach, as long as it supports a,b,c.
>>
>>>
>>>>
>>>> b) (optional) allocate huge pages on that separate NUMA node.
>>>> c) Use ordinary memory-backend-ram or memory-backend-memfd (for huge pages),
>>>> *binding* the memory backend to that special NUMA node.
>>>>
>>> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
>>> How to bind backend memory to NUMA node
>>>
>>
>> I think the syntax is "policy=bind,host-nodes=X"
>>
>> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1" for "5"
>> "host-nodes=0x20" etc.
>>
>>>>
>>>> This will dynamically allocate memory from that special NUMA node, resulting
>>>> in the virtio-mem device completely being backed by that device memory,
>>>> being able to dynamically resize the memory allocation.
>>>>
>>>>
>>>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
>>>> isn't really what we want and won't work without major design changes. Also,
>>>> I'm not so sure it's a very clean design: exposing memory belonging to other
>>>> VMs to unrelated QEMU processes. This sounds like a serious security hole:
>>>> if you managed to escalate to the QEMU process from inside the VM, you can
>>>> access unrelated VM memory quite happily. You want an abstraction
>>>> in-between, that makes sure each VM/QEMU process only sees private memory:
>>>> for example, the buddy via dax/kmem.
>>>>
>>> Hi David
>>> Thanks for your suggestion, also sorry for my delayed reply due to my long vacation.
>>> How does current virtio-mem dynamically attach memory to guest, via page fault?
>>
>> Essentially you have a large sparse mmap. Within that mmap, memory is
>> populated on demand. Instead of mmap/munmap, you perform a single large
>> mmap and then dynamically populate memory/discard memory.
>>
>> Right now, memory is populated via page faults on access. This is
>> sub-optimal when dealing with limited resources (e.g., hugetlbfs,
>> file blocks), because you might run out of backend memory.
>>
>> I'm working on a "prealloc" mode, which will preallocate/populate memory
>> necessary for exposing the next block of memory to the VM, and which
>> fails gracefully if preallocation/population fails in the case of such
>> limited resources.
>>
>> The patch resides on:
>> https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
>>
>> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
>> Author: David Hildenbrand <david@redhat.com>
>> Date: Mon Aug 2 19:51:36 2021 +0200
>>
>> virtio-mem: support "prealloc=on" option
>> Especially for hugetlb, but also for file-based memory backends, we'd
>> like to be able to prealloc memory, especially to make user errors less
>> severe: crashing the VM when there are not sufficient huge pages around.
>> A common option for hugetlb will be using "reserve=off,prealloc=off" for
>> the memory backend and "prealloc=on" for the virtio-mem device. This
>> way, no huge pages will be reserved for the process, but we can recover
>> if there are no actual huge pages when plugging memory.
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>>
>> --
>> Thanks,
>>
>> David / dhildenb
>>
>
> Hi David,
>
> After reading the virtio-mem code, I understand what you have described. Please allow me
> to describe my understanding of virtio-mem, so that we have an aligned view.
>
> Virtio-mem:
> The virtio-mem device initializes and reserves a memory area (GPA). Later, dynamic memory
> growing/shrinking will not exceed this area. memory-backend-ram has mapped anonymous memory
> to the whole area, but no RAM is attached because Linux has a policy of delayed allocation.
Right, but it can also be any sparse file (memory-backend-memfd,
memory-backend-file).
> When the virtio-mem driver asks to dynamically add memory to the guest, it first requests a
> region from the reserved memory area, then notifies the virtio-mem device to record the
> information (the virtio-mem device doesn't perform a real memory allocation). After receiving the response from
In the upcoming prealloc=on mode I referenced, the allocation will
happen before the guest is notified about success and starts using the
memory.
With vfio/mdev support, the allocation already happens today, as soon
as vfio/mdev is notified about the populated memory ranges (see
RamDiscardManager). That's essentially what makes virtio-mem work with
device passthrough.
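To illustrate the ordering difference, here is a toy Python model. It is
not QEMU code: the class, the function names, and the "ok"/"refused"
return strings are all made up for illustration. It only shows why
populating before acknowledging lets a plug request fail cleanly, whereas
populate-on-access defers the failure to the guest's first touch.

```python
class ToyBackend:
    """Hypothetical backend with a hard cap on populated blocks
    (think: a fixed pool of huge pages)."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.populated = set()

    def populate(self, block):
        """Try to back `block` with memory; return False if the pool is empty."""
        if block in self.populated:
            return True
        if len(self.populated) >= self.capacity_blocks:
            return False
        self.populated.add(block)
        return True


def plug(backend, plugged, block, prealloc):
    """Handle a guest plug request for `block`.

    With prealloc, memory is populated *before* the request is acknowledged,
    so exhaustion turns into a refused request instead of a later crash.
    """
    if prealloc and not backend.populate(block):
        return "refused"
    plugged.add(block)
    return "ok"


def guest_touch(backend, plugged, block):
    """Simulate the guest's first access to a plugged block (fault path)."""
    if block not in plugged:
        raise ValueError("block was never plugged")
    if not backend.populate(block):
        # Stand-in for the VM getting killed when the backend runs dry.
        raise MemoryError("backend exhausted on first access")
```

With a capacity of one block, plug(..., prealloc=True) accepts the first
block and refuses the second; with prealloc=False both requests are
accepted and the problem only shows up in guest_touch().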
> the virtio-mem device, the virtio-mem driver will online the requested region and add it to
> the Linux page allocator. Real RAM allocation will happen via page fault when a guest CPU
> accesses it. Memory shrinking will be achieved by madvise().
Right, but you could write a custom virtio-mem driver that pools this
memory differently.
Memory shrinking in the hypervisor is either done using
madvise(MADV_DONTNEED) or fallocate(FALLOC_FL_PUNCH_HOLE).
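As a rough userspace illustration of the "one large sparse mmap, then
populate/discard" pattern (not what QEMU literally does), the following
Python sketch uses a private anonymous mapping on Linux. The block and
region sizes and the helper names are arbitrary choices for the demo; it
needs Python 3.8+ for mmap.madvise().

```python
import mmap

PAGE = mmap.PAGESIZE
BLOCK = 4 * PAGE       # arbitrary "memory block" size for this demo
REGION = 16 * BLOCK    # one large sparse reservation


def make_region(size=REGION):
    # Private anonymous mapping: address space is reserved up front, but the
    # kernel allocates pages only once they are touched.
    return mmap.mmap(-1, size,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                     prot=mmap.PROT_READ | mmap.PROT_WRITE)


def populate(region, block):
    # Write to every page of the block so that it gets backed by memory.
    start = block * BLOCK
    for off in range(start, start + BLOCK, PAGE):
        region[off] = 0xAB


def discard(region, block):
    # Give the pages back; for private anonymous memory, the next access
    # sees freshly zeroed pages.
    region.madvise(mmap.MADV_DONTNEED, block * BLOCK, BLOCK)
```

After populate(region, 2) the first byte of block 2 reads back as 0xAB;
after discard(region, 2) it reads back as 0 again, while the mapping
itself stays in place. For file-backed or shared memory, the discard side
would use fallocate(FALLOC_FL_PUNCH_HOLE) instead.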
>
> Questions:
> 1. In heterogeneous computing, memory may be accessed by CPUs on the host side and on the
> device side. Delayed memory allocation is not suitable. Host software (for instance, OpenCL)
> may allocate a buffer for the computing device to place the computing result in.
That works already with virtio-mem with vfio/mdev via the
RamDiscardManager infrastructure introduced recently. With
"prealloc=on", the delayed memory allocation can also be avoided without
vfio/mdev.
> 2. We hope to build our own page allocator in the host kernel, so it can offer a customized
> mmap() method to build the VA->PA mapping in the MMU and IOMMU.
Theoretically, you can wire up pretty much any driver in QEMU like
vfio/mdev via the RamDiscardManager. From there, you can issue whatever
syscall you need to populate memory when plugging new memory blocks. All
you need to support is a sparse mmap and a way to populate/discard
memory. Populate/discard could be wired up in QEMU virtio-mem code as
you need it.
> 3. Some potential requirements also need our driver to manage memory, so that the page size
> granularity can be controlled to fit a small device IOTLB cache.
> CXL has a bias mode for HDM (host-managed device memory); it needs the physical address to
> switch the bias mode between host access and device access. These tell us that having the
> driver manage memory is mandatory.
I think if you write your driver in a certain way and wire it up in QEMU
virtio-mem accordingly (e.g., using a new memory-backend-whatever), that
shouldn't be an issue.
--
Thanks,
David / dhildenb