From: Jonah Palmer <jonah.palmer@oracle.com>
To: Eugenio Perez Martin <eperezma@redhat.com>,
Si-Wei Liu <si-wei.liu@oracle.com>
Cc: qemu-devel@nongnu.org, mst@redhat.com, leiyang@redhat.com,
peterx@redhat.com, dtatulea@nvidia.com, jasowang@redhat.com,
boris.ostrovsky@oracle.com
Subject: Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Date: Tue, 8 Oct 2024 11:40:29 -0400
Message-ID: <5ebfd766-c8b5-4fb3-86ad-17a74212ef5e@oracle.com>
In-Reply-To: <CAJaqyWfYvD0nEYU9UgKzYgUo5JzuFu3PBKNEkDrM0BE0Ek5LfA@mail.gmail.com>
On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>>
>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>>>>>> translations for guest memory regions.
>>>>>>
>>>>>> When the guest has overlapping memory regions, an HVA to IOVA translation
>>>>>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
>>>>>> due to one HVA range being contained (overlapping) in another HVA range
>>>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
>>>>>> translate and find the correct IOVA for guest memory regions.
>>>>>>
>>>>> Yes, this first patch is super close to what I meant, just one issue
>>>>> and a pair of nits here and there.
>>>>>
>>>>> I'd leave the second patch as an optimization on top, if the numbers
>>>>> prove that adding the code is worth it.
>>>>>
>>>> Ah okay, gotcha. I also wasn't sure whether you wanted what you
>>>> mentioned below on the previous series implemented as well, or if
>>>> those would also be optimizations on top.
>>>>
>>>> [Adding code to the vhost_iova_tree layer for handling multiple buffers
>>>> returned from translation for the memory area where each iovec covers]:
>>>> -----------------------------------------------------------------------
>>>> "Let's say that SVQ wants to translate the HVA range
>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>>>> with addr=(0x20000c1000 len=0x10000).
>>>>
>>>> The VirtIO device should be able to translate these two buffers in
>>>> isolation and chain them. Not optimal but it helps to keep QEMU source
>>>> clean, as the device already must support it."
>>>>
>>> This is 100% in the device and QEMU is already able to split the
>>> buffers that way, so we don't need any change in QEMU.
>> Note that when working with the full HVA tree directly, the internal
>> iova tree linear iterator iova_tree_find_address_iterator() today
>> doesn't guarantee that the returned IOVA range covers the entire
>> length of the iov. An aliased range smaller than the requested one
>> may be hit first in the linear search and returned, and that
>> fragmentation can't be guarded against by the VirtIO device or the
>> DMA API mentioned above.
>>
>> The second patch in this series mitigates this side effect by looking
>> up the backing ram_block via qemu_ram_block_from_host() for
>> guest-memory-backed iovs, so it doesn't really rely on
>> vhost_iova_gpa_tree_find_iova() to find the matching IOVA; instead it
>> implicitly ensures the fragmentation side effect mentioned above can
>> never happen. I'm not saying I like how it's implemented, but I
>> wanted to point out the implication if the second patch has to be
>> removed: either add special handling code to the iova-tree iterator
>> to size the range (the same way range selection based on permission
>> will be done), or add special code in the SVQ layer to deal with
>> fragmented IOVA ranges due to aliasing.
>>
>
> This special code in SVQ is already there, and it will be needed even
> if we look for the buffers by GPA instead of by vaddr, the same way
> virtqueue_map_desc needs to handle it even when it works with GPAs.
> Contiguous GPA does not imply contiguous vaddr.
>
My apologies if I misunderstood something here regarding the size &
permission matching, but I attempted to add both the size and
permission checks to iova_tree_find_address_iterator(). However,
there's a sizing mismatch in the vhost_svq_translate_addr() code path
that causes vhost-net to fail to start:

qemu-system-x86_64: unable to start vhost net: 22: falling back on
userspace virtio

More specifically, in vhost_svq_add_split(), the first call to
vhost_svq_vring_write_descs() returns false:

    ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num,
                                     in_num > 0, false);
    if (unlikely(!ok)) {
        return false;
    }

Maybe I misunderstood the proposal, but in
iova_tree_find_address_iterator() I'm checking for an exact match on
both sizes and permissions:

    if (map->size != needle->size || map->perm != needle->perm) {
        return false;
    }

During the device setup phase, vhost_svq_add_split() ->
vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() ->
vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but
map->size and needle->size don't match in
iova_tree_find_address_iterator(). I inserted some printf's to give
more information:
...
[ 8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
[ 8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
[ 8.022403] scsi host0: ahci
[ 8.023511] scsi host1: ahci
[ 8.024446] scsi host2: ahci
[ 8.025308
vhost_svq_translate_addr: iovec[i].iov_len = 0x8
iova_tree_find_address_iterator: mismatched sizes
map->size: 0xfff, needle->size: 0x8
map->perm: 1, needle->perm: 1
vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg:
0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1,
more_descs: true, write: false
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost
net: 22: falling back on userspace virtio
] scsi host3: ahci
[ 10.921733] scsi host4: ahci
[ 10.922946] scsi host5: ahci
[ 10.923486] virtio_net virtio1 enp0s2: renamed from eth0
...

This is with QEMU args similar to Si-Wei's from way back when:

-m size=128G,slots=8,maxmem=256G
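
For context on where the needle's size comes from:
vhost_svq_translate_addr() builds one DMAMap needle per iovec entry,
roughly like this (a paraphrased sketch from my reading of
hw/virtio/vhost-shadow-virtqueue.c, not verbatim):

    /* Paraphrased sketch: the needle's size is the length of a single
     * SG element (the 0x8 above), while the maps stored in the tree
     * cover whole mapped regions (0xfff above, i.e. a 4KiB page with
     * DMAMap's inclusive size convention). */
    DMAMap needle = {
        .translated_addr = (hwaddr)(uintptr_t)iovec[i].iov_base,
        .size = iovec[i].iov_len,
    };
    const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree,
                                                  &needle);

So, if I'm reading it right, an exact size comparison between the two
can essentially never succeed.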

There are also some failed matches caught with just the permission
check in place, but the vhost-net device still appears to be able to
start up when sizes aren't also required to match.
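
In case it helps, what I'd try next in the iterator is a
containment-style match instead of an exact one, i.e. accept a map
whose range fully covers the needle and whose permissions include the
needle's. This is purely an untested sketch on my end, and it assumes
the needle uses the same inclusive size convention as the maps in the
tree:

    /* Untested sketch: match only if the map's HVA range fully covers
     * the needle's range and the map grants at least the needle's
     * permissions. DMAMap::size is inclusive, so a map spans
     * [translated_addr, translated_addr + size]. */
    if (needle->translated_addr < map->translated_addr ||
        needle->translated_addr + needle->size >
            map->translated_addr + map->size ||
        (needle->perm & map->perm) != needle->perm) {
        return false;
    }

    args->result = map;
    return true;

Though I haven't checked how this interacts with the aliasing cases
Si-Wei described.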