From: David Hildenbrand <david@redhat.com>
To: "david.dai" <david.dai@montage-tech.com>
Cc: peter.maydell@linaro.org, vsementsov@virtuozzo.com,
eajames@linux.ibm.com, qemu-devel@nongnu.org,
changguo.du@montage-tech.com,
Stefan Hajnoczi <stefanha@redhat.com>,
Igor Mammedov <imammedo@redhat.com>,
kuhn.chenqun@huawei.com
Subject: Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU
Date: Mon, 11 Oct 2021 09:43:53 +0200
Message-ID: <ea36815e-0b79-b5b2-9735-367404c9b8f6@redhat.com>
In-Reply-To: <20211009094233.GA13867@tianmu-host-sw-01>
>> virtio-mem currently relies on having a single sparse memory region (anon
>> mmap, mmapped file, mmapped huge pages, mmapped shmem) per VM. Although we
>> can share memory with other processes, sharing with other VMs is not
>> intended. Instead of actually mmapping parts dynamically (which can be
>> quite expensive), virtio-mem relies on punching holes into the backend and
>> dynamically allocating memory/file blocks/... on access.
>>
>> So the easy way to make it work is:
>>
>> a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
>> memory getting managed by the buddy on a separate NUMA node.
>>
>
> The Linux kernel buddy system? How do we guarantee that other applications
> don't allocate memory from it?
Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
such that even if some other allocation ended up there, it could get
migrated somewhere else.
For example, "daxctl reconfigure-device" tries to do that by default:
https://pmem.io/ndctl/daxctl-reconfigure-device.html
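For example (sketch; "dax0.0" is a placeholder for whatever device name
the CXL memory actually shows up as):

  # online the devdax memory as system RAM (movable by default)
  daxctl reconfigure-device dax0.0 --mode=system-ram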
However, I agree that we might actually want to tell the system to not
use this CPU-less node as fallback for other allocations, and that we
might not want to swap out such memory etc.
But, in the end all that virtio-mem needs to work in the hypervisor is
a) A sparse memmap (anonymous RAM, memfd, file)
b) A way to populate memory within that sparse memmap (e.g., on fault,
using madvise(MADV_POPULATE_WRITE), fallocate())
c) A way to discard memory (madvise(MADV_DONTNEED),
fallocate(FALLOC_FL_PUNCH_HOLE))
So instead of using anonymous memory+mbind, you can also mmap a sparse file
and rely on populate-on-demand. One alternative for your use case would be
to create a DAX filesystem on that CXL memory (IIRC that should work) and
simply provide virtio-mem with a sparse file located on that filesystem.
Of course, you can also use some other mechanism as you might have in
your approach, as long as it supports a), b), and c).
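To illustrate a), b), and c), a minimal sketch (not actual QEMU code; the
file path and sizes are made up, error handling is omitted, and
MADV_POPULATE_WRITE requires Linux 5.14+):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/falloc.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const off_t size = 1ULL << 30;  /* 1 GiB sparse region */
      const size_t blk = 2UL << 20;   /* 2 MiB block */

      /* a) sparse memmap backed by a file (could be memfd/anon instead) */
      int fd = open("/mnt/cxl-dax/vm0.img", O_RDWR | O_CREAT, 0600);
      ftruncate(fd, size);
      char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

      /* b) populate the first block up front (alternative: touch it and
       *    let it populate on fault, or fallocate() the file range) */
      madvise(base, blk, MADV_POPULATE_WRITE);

      /* c) discard it again, punching a hole into the backend */
      fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, blk);

      munmap(base, size);
      close(fd);
      return 0;
  }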
>
>>
>> b) (optional) allocate huge pages on that separate NUMA node.
>> c) Use ordinary memory-backend-ram or memory-backend-memfd (for huge
>> pages), *binding* the memory backend to that special NUMA node.
>>
>
> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
> How to bind backend memory to NUMA node
>
I think the syntax is "policy=bind,host-nodes=X",
whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for
node "5" "host-nodes=0x20", etc.
>>
>> This will dynamically allocate memory from that special NUMA node,
>> resulting in the virtio-mem device being completely backed by that device
>> memory and able to dynamically resize the memory allocation.
>>
>>
>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
>> isn't really what we want and won't work without major design changes. Also,
>> I'm not so sure it's a very clean design: exposing memory belonging to other
>> VMs to unrelated QEMU processes. This sounds like a serious security hole:
>> if you managed to escalate to the QEMU process from inside the VM, you can
>> access unrelated VM memory quite happily. You want an abstraction
>> in-between, that makes sure each VM/QEMU process only sees private memory:
>> for example, the buddy via dax/kmem.
>>
> Hi David
> Thanks for your suggestion, and sorry for my delayed reply due to a long
> vacation.
> How does virtio-mem currently attach memory to the guest dynamically? Via
> page faults?
Essentially you have a large sparse mmap. Within that mmap, memory is
populated on demand. Instead of mmap/munmap, you perform a single large
mmap and then dynamically populate/discard memory.
Right now, memory is populated via page faults on access. This is
sub-optimal when dealing with limited resources (e.g., hugetlbfs,
file blocks), as you might run out of backend memory.
I'm working on a "prealloc" mode, which will preallocate/populate memory
necessary for exposing the next block of memory to the VM, and which
fails gracefully if preallocation/population fails in the case of such
limited resources.
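The idea, roughly (a sketch only; "prealloc_block" is a made-up name, the
real implementation lives in the patch referenced below):

  #include <errno.h>
  #include <sys/mman.h>

  /* Preallocate the next block before plugging it into the VM. Returns 0
   * on success, -errno if the backend is out of resources (e.g., no free
   * huge pages), so the caller can fail gracefully instead of the VM
   * crashing on access later. */
  static int prealloc_block(char *base, size_t off, size_t blk)
  {
      if (madvise(base + off, blk, MADV_POPULATE_WRITE))
          return -errno;
      return 0;
  }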
The patch resides on:
https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
Author: David Hildenbrand <david@redhat.com>
Date: Mon Aug 2 19:51:36 2021 +0200
virtio-mem: support "prealloc=on" option
Especially for hugetlb, but also for file-based memory backends, we'd
like to be able to prealloc memory, especially to make user errors less
severe: crashing the VM when there are not sufficient huge pages around.
A common option for hugetlb will be using "reserve=off,prealloc=off" for
the memory backend and "prealloc=on" for the virtio-mem device. This
way, no huge pages will be reserved for the process, but we can recover
if there are no actual huge pages when plugging memory.
Signed-off-by: David Hildenbrand <david@redhat.com>
--
Thanks,
David / dhildenb