From: Jonah Palmer <jonah.palmer@oracle.com>
To: Eugenio Perez Martin <eperezma@redhat.com>,
Si-Wei Liu <si-wei.liu@oracle.com>
Cc: qemu-devel@nongnu.org, mst@redhat.com, leiyang@redhat.com,
peterx@redhat.com, dtatulea@nvidia.com, jasowang@redhat.com,
boris.ostrovsky@oracle.com
Subject: Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Date: Tue, 8 Oct 2024 11:40:29 -0400
Message-ID: <5ebfd766-c8b5-4fb3-86ad-17a74212ef5e@oracle.com>
In-Reply-To: <CAJaqyWfYvD0nEYU9UgKzYgUo5JzuFu3PBKNEkDrM0BE0Ek5LfA@mail.gmail.com>
On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>>
>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>>>>>> translations for guest memory regions.
>>>>>>
>>>>>> When the guest has overlapping memory regions, an HVA to IOVA translation
>>>>>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
>>>>>> due to one HVA range being contained (overlapping) in another HVA range
>>>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
>>>>>> translate and find the correct IOVA for guest memory regions.
>>>>>>
>>>>> Yes, this first patch is super close to what I meant, just one issue
>>>>> and a pair of nits here and there.
>>>>>
>>>>> I'd leave the second patch as an optimization on top, if the numbers
>>>>> prove that adding the code is worth it.
>>>>>
>>>> Ah okay, gotcha. I also wasn't sure whether you wanted what you
>>>> mentioned below on the previous series implemented as well, or if
>>>> those would also be optimizations on top.
>>>>
>>>> [Adding code to the vhost_iova_tree layer for handling multiple buffers
>>>> returned from translation for the memory area where each iovec covers]:
>>>> -----------------------------------------------------------------------
>>>> "Let's say that SVQ wants to translate the HVA range
>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>>>> with addr=(0x20000c1000 len=0x10000).
>>>>
>>>> The VirtIO device should be able to translate these two buffers in
>>>> isolation and chain them. Not optimal but it helps to keep QEMU source
>>>> clean, as the device already must support it."
>>>>
>>> This is 100% in the device and QEMU is already able to split the
>>> buffers that way, so we don't need any change in QEMU.
>> Note that when working with the full HVA tree directly, the internal
>> iova tree linear iterator iova_tree_find_address_iterator() today
>> doesn't guarantee that the returned IOVA range covers the entire
>> length of the iov. An aliased range smaller than the requested one
>> may be hit first in the linear search and returned, and that
>> fragmentation can't be guarded against by the VirtIO device or the
>> DMA API mentioned above.
>>
>> The second patch in this series mitigates this side effect by looking
>> up the backing ram_block via qemu_ram_block_from_host() for
>> guest-memory-backed iovs, so it doesn't really rely on
>> vhost_iova_gpa_tree_find_iova() to find the matching IOVA; instead it
>> implicitly ensures the fragmentation side effect mentioned above can
>> never happen. I'm not saying I like how it's implemented, but I
>> wanted to point out the implication if the second patch has to be
>> removed: either add special handling code to the iova-tree iterator
>> to size the range (the same way range selection based on permission
>> will be done), or add special code in the SVQ layer to deal with
>> fragmented IOVA ranges due to aliasing.
>>
>
> This special code in SVQ is already there, and it will be needed even
> if we look for the buffers by GPA instead of by vaddr, the same way
> virtqueue_map_desc needs to handle it even when it works with GPAs.
> Contiguous GPA does not imply contiguous vaddr.
>
My apologies if I misunderstood something here regarding the size &
permission matching, but I attempted to add both the size and
permission checks to iova_tree_find_address_iterator(). However,
there's a sizing mismatch in the vhost_svq_translate_addr() code path
that causes vhost-net to fail to start:

qemu-system-x86_64: unable to start vhost net: 22: falling back on
userspace virtio

More specifically, in vhost_svq_add_split(), the first call to
vhost_svq_vring_write_descs() returns false:

    ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num,
                                     in_num > 0, false);
    if (unlikely(!ok)) {
        return false;
    }

Maybe I misunderstood the proposal, but in
iova_tree_find_address_iterator() I'm checking for an exact match on
both sizes and permissions:

    if (map->size != needle->size || map->perm != needle->perm) {
        return false;
    }

During the device setup phase, vhost_svq_add_split() ->
vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() ->
vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but
map->size and needle->size don't match in
iova_tree_find_address_iterator(). I inserted some printf's to give
more information:
...
[ 8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
[ 8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
[ 8.022403] scsi host0: ahci
[ 8.023511] scsi host1: ahci
[ 8.024446] scsi host2: ahci
[ 8.025308
vhost_svq_translate_addr: iovec[i].iov_len = 0x8
iova_tree_find_address_iterator: mismatched sizes
map->size: 0xfff, needle->size: 0x8
map->perm: 1, needle->perm: 1
vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg:
0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1,
more_descs: true, write: false
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost
net: 22: falling back on userspace virtio
] scsi host3: ahci
[ 10.921733] scsi host4: ahci
[ 10.922946] scsi host5: ahci
[ 10.923486] virtio_net virtio1 enp0s2: renamed from eth0
...

This is with QEMU args similar to Si-Wei's from way back when:

-m size=128G,slots=8,maxmem=256G
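
For context on where the needle's size comes from:
vhost_svq_translate_addr() builds one DMAMap needle per iovec entry,
roughly like this (a paraphrased sketch from my reading of
hw/virtio/vhost-shadow-virtqueue.c, not verbatim):

    /* Paraphrased sketch: the needle's size is the length of a single
     * SG element (the 0x8 above), while the maps stored in the tree
     * cover whole mapped regions (0xfff above, i.e. a 4KiB page with
     * DMAMap's inclusive size convention). */
    DMAMap needle = {
        .translated_addr = (hwaddr)(uintptr_t)iovec[i].iov_base,
        .size = iovec[i].iov_len,
    };
    const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree,
                                                  &needle);

So, if I'm reading it right, an exact size comparison between the two
can essentially never succeed.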

There are also some failed matches caught with just the permission
check in place, but the vhost-net device still appears to be able to
start up when sizes aren't also required to match.
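
In case it helps, what I'd try next in the iterator is a
containment-style match instead of an exact one, i.e. accept a map
whose range fully covers the needle and whose permissions include the
needle's. This is purely an untested sketch on my end, and it assumes
the needle uses the same inclusive size convention as the maps in the
tree:

    /* Untested sketch: match only if the map's HVA range fully covers
     * the needle's range and the map grants at least the needle's
     * permissions. DMAMap::size is inclusive, so a map spans
     * [translated_addr, translated_addr + size]. */
    if (needle->translated_addr < map->translated_addr ||
        needle->translated_addr + needle->size >
            map->translated_addr + map->size ||
        (needle->perm & map->perm) != needle->perm) {
        return false;
    }

    args->result = map;
    return true;

Though I haven't checked how this interacts with the aliasing cases
Si-Wei described.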