Re: [PATCH v1 1/2] vhost: Defer filtering memory sections until building the vhost memory structure

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Igor Mammedov <imammedo@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Tiwei Bie <tiwei.bie@intel.com>
Subject: Re: [PATCH v1 1/2] vhost: Defer filtering memory sections until building the vhost memory structure
Date: Wed, 8 Mar 2023 13:30:20 +0100	[thread overview]
Message-ID: <20230308133020.28aabe98@imammedo.users.ipa.redhat.com> (raw)
In-Reply-To: <fad9136f-08d3-3fd9-71a1-502069c000cf@redhat.com>

On Tue, 7 Mar 2023 13:46:36 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 07.03.23 11:51, Igor Mammedov wrote:
> > On Thu, 16 Feb 2023 12:47:51 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> Having multiple devices, some filtering memslots and some not filtering
> >> memslots, messes up the "used_memslot" accounting. If we'd have a device
> >> the filters out less memory sections after a device that filters out more,
> >> we'd be in trouble,

it should say why/when it happens (in example you've provided
it's caused by mix of in kernel vhost and vhost-user devices)

> >> because our memslot checks stop working reliably.
> >> For example, hotplugging a device that filters out less memslots might end
> >> up passing the checks based on max vs. used memslots, but can run out of
> >> memslots when getting notified about all memory sections.  
> > 
> > an hypothetical example of such case would be appreciated
> > (I don't really get how above can happen, perhaps more detailed explanation
> > would help)  
> 
> Thanks for asking! AFAIKT, it's mostly about hot-adding first a vhost devices
> that filters (and messes up used_memslots), and then messing with memslots that
> get filtered out,
> 
> $ sudo rmmod vhost
> $ sudo modprobe vhost max_mem_regions=4
> 
> // startup guest with virtio-net device
> ...
> 
> // hotplug a NVDIMM, resulting in used_memslots=4
> echo "object_add memory-backend-ram,id=mem0,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
> echo "device_add nvdimm,id=nvdimm0,memdev=mem0" | sudo nc -U /var/tmp/mon_src
> 
> // hotplug vhost-user device that overwrites "used_memslots=3"
> echo "device_add vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,bus=root" | sudo nc -U /var/tmp/mon_src
> 
> // hotplug another NVDIMM
> echo "object_add memory-backend-ram,id=mem1,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
> echo "device_add pc-dimm,id=nvdimm1,memdev=mem1" | sudo nc -U /var/tmp/mon_src
> 
> // vvhost will fail to update the memslots
> vhost_set_mem_table failed: Argument list too long (7)
> 
> 
> So we tricked used_memslots to be smaller than it actually has to be, because
> we're ignoring the memslots filtered out by the vhost-user device.
> 
> 
> Now, this is all far from relevant in practice as of now I think, and
> usually would indicate user errors already (memory that's not shared with
> vhost-user?).

well vhost-user device_add should fail if it can't get hands on all RAM
(if it doesn't we have a bug somewhere else)

> 
> It might gets more relevant when virtio-mem dynamically adds/removes memslots and
> relies on precise tracking of used vs. free memslots.
> 
> 
> But maybe I should just ignore that case and live a happy life instead, it's
> certainly hard to even trigger right now :)
> >     
> >> Further, it will be helpful in memory device context in the near future
> >> to know that a RAM memory region section will consume a memslot, and be
> >> accounted for in the used vs. free memslots, such that we can implement
> >> reservation of memslots for memory devices properly. Whether a device
> >> filters this out and would theoretically still have a free memslot is
> >> then hidden internally, making overall vhost memslot accounting easier.
> >>

> >> Let's filter the memslots when creating the vhost memory array,
> >> accounting all RAM && !ROM memory regions as "used_memslots" even if
> >> vhost_user isn't interested in anonymous RAM regions, because it needs
> >> an fd.

that would regress existing setups where it was possible
to start with N DIMMs and after this patch the same VM could fail to
start if N was close to vhost's limit in otherwise perfectly working
configuration. So this approach doesn't seem right. 

Perhaps redoing vhost's used_memslots accounting would be
a better approach, right down to introducing reservations you'd
like to have eventually.

Something like:
  1: s/vhost_has_free_slot/vhost_memory_region_limit/
     and maybe the same for kvm_has_free_slot
  then rewrite memory_device_check_addable() moving all
  used_memslots accounting into memory_device core.

> >> When a device actually filters out regions (which should happen rarely
> >> in practice), we might detect a layout change although only filtered
> >> regions changed. We won't bother about optimizing that for now.
> >>
> >> Note: we cannot simply filter out the region and count them as
> >> "filtered" to add them to used, because filtered regions could get
> >> merged and result in a smaller effective number of memslots. Further,
> >> we won't touch the hmp/qmp virtio introspection output.  
> > What output exactly you are talking about?  
> 
> hw/virtio/virtio-qmp.c:qmp_x_query_virtio_status
> 
> Prints hdev->n_mem_sections and hdev->n_tmp_sections. I won't be
> touching that (debug) output.
> 
> > 
> > PS:
> > If we drop vhost_dev::memm the bulk of this patch would go away  
> 
> Yes, unfortunately we can't I think.
> 
> > 
> > side questions:
> > do we have MemorySection merging on qemu's kvm side?  
> 
> Yes, we properly merge in flatview_simplify(). It's all about handling holes
> in huge pages IIUC.
> 
> > also does KVM merge merge memslots?  
> 
> No, for good reasons not. Mapping more than we're instructed to map via a notifier
> sounds is kind-of hacky already. But I guess there is no easy way around it (e.g., if
> mapping that part of memory doesn't work, we'd have to bounce the reads/writes
> through QEMU instead).
>

next prev parent reply	other threads:[~2023-03-08 12:31 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-16 11:47 [PATCH v1 0/2] vhost: memslot handling improvements David Hildenbrand
2023-02-16 11:47 ` [PATCH v1 1/2] vhost: Defer filtering memory sections until building the vhost memory structure David Hildenbrand
2023-02-16 12:04   ` Michael S. Tsirkin
2023-02-16 12:10     ` David Hildenbrand
2023-02-16 12:13       ` David Hildenbrand
2023-02-16 12:22         ` Michael S. Tsirkin
2023-02-16 12:21       ` Michael S. Tsirkin
2023-02-16 12:39         ` David Hildenbrand
2023-03-07 10:51   ` Igor Mammedov
2023-03-07 12:46     ` David Hildenbrand
2023-03-08 12:30       ` Igor Mammedov [this message]
2023-03-08 15:21         ` David Hildenbrand
2023-02-16 11:47 ` [PATCH v1 2/2] vhost: Remove vhost_backend_can_merge() callback David Hildenbrand
2023-03-07 10:25   ` Igor Mammedov
2023-03-07 11:16     ` Igor Mammedov
2023-03-07 11:24       ` David Hildenbrand
2023-02-16 16:04 ` [PATCH v1 0/2] vhost: memslot handling improvements Stefan Hajnoczi
2023-02-17 13:48   ` David Hildenbrand
2023-02-17 14:20 ` Michael S. Tsirkin
2023-02-17 14:27   ` David Hildenbrand
2023-03-07 11:14   ` Igor Mammedov
2023-03-08 10:08     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230308133020.28aabe98@imammedo.users.ipa.redhat.com \
    --to=imammedo@redhat.com \
    --cc=david@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=tiwei.bie@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.