qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Jonah Palmer <jonah.palmer@oracle.com>
To: Eugenio Perez Martin <eperezma@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>,
	qemu-devel@nongnu.org, Si-Wei Liu <si-wei.liu@oracle.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Lei Yang <leiyang@redhat.com>, Peter Xu <peterx@redhat.com>,
	Dragos Tatulea <dtatulea@nvidia.com>
Subject: Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
Date: Mon, 29 Jul 2024 13:50:08 -0400	[thread overview]
Message-ID: <a9dd189f-ab4a-4dad-acec-31ee733fb1d7@oracle.com> (raw)
In-Reply-To: <CAJaqyWfekhhG+5Zp4eddpp-2N=SjUL+7zs+YqS5xx6eWxa-PqQ@mail.gmail.com>



On 7/29/24 6:04 AM, Eugenio Perez Martin wrote:
> On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 5/13/24 11:56 PM, Jason Wang wrote:
>>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
>>> <eperezma@redhat.com> wrote:
>>>>
>>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
>>>>> <eperezma@redhat.com> wrote:
>>>>>>
>>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>
>>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
>>>>>>> <eperezma@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
>>>>>>>>>>>>>>> <eperezma@redhat.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads
>>>>>>>>>>>>>>>>>> to the same HVA.  This causes a problem when overlapped regions
>>>>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, as looking
>>>>>>>>>>>>>>>>>> them by HVA will return them twice.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps /
>>>>>>>>>>>>>>>> unmaps happens [1]:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HVA                    GPA                IOVA
>>>>>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>>>>>> Map
>>>>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
>>>>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
>>>>>>>>>>>>>>>> [0x80001000, 0x2000001000)
>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
>>>>>>>>>>>>>>>> [0x2000001000, 0x2000021000)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Unmap
>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
>>>>>>>>>>>>>>>> 0x20000) ???
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a
>>>>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does
>>>>>>>>>>>>>>>> not overlap, only HVA.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk,
>>>>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at
>>>>>>>>>>>>>>>> unmap time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
>>>>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be
>>>>>>>>>>>>>>> aliasing issues.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ
>>>>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree,
>>>>>>>>>>>>>> and they do not have GPA.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this
>>>>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store
>>>>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
>>>>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers).
>>>>>>>>>>>>>
>>>>>>>>>>>>> This seems to be tricky.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As discussed, it could be another iova tree.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap.
>>>>>>>>>>>>
>>>>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
>>>>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that
>>>>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is easy at
>>>>>>>>>>>> adding or removing, like in the memory listener, but how to know at
>>>>>>>>>>>> vhost_svq_translate_addr?
>>>>>>>>>>>
>>>>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of
>>>>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The problem is not virtqueue_pop, that's out of the
>>>>>>>>>> vhost_svq_translate_addr. The problem is the need of adding
>>>>>>>>>> conditionals / complexity in all the callers of
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When
>>>>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is
>>>>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it
>>>>>>>>>>>> just worked in my tests so far.
>>>>>>>>>>>>
>>>>>>>>>>>> Now we have the second problem: The GPA values of the regions of the
>>>>>>>>>>>> two IOVA tree must be unique. We need to be able to find unallocated
>>>>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so
>>>>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very
>>>>>>>>>>>> complicated with two trees.
>>>>>>>>>>>
>>>>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we
>>>>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is
>>>>>>>>>>> shared by both
>>>>>>>>>>>
>>>>>>>>>>> 1) Guest memory (GPA)
>>>>>>>>>>> 2) SVQ virtqueue and buffers
>>>>>>>>>>>
>>>>>>>>>>> And another gtree to track the GPA to IOVA.
>>>>>>>>>>>
>>>>>>>>>>> The SVQ code could use either
>>>>>>>>>>>
>>>>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and buffers
>>>>>>>>>>>
>>>>>>>>>>> or
>>>>>>>>>>>
>>>>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers
>>>>>>>>>>>
>>>>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That's possible, but that scatters the IOVA handling code instead of
>>>>>>>>>> keeping it self-contained in VhostIOVATree.
>>>>>>>>>
>>>>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used.
>>>>>>>>>
>>>>>>>>> An example is the iova allocator in the kernel.
>>>>>>>>>
>>>>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough
>>>>>>>>> code, not sure it is useful here though (haven't had a deep look at
>>>>>>>>> that).
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't know enough about them to have an opinion. I keep seeing the
>>>>>>>> drawback of needing to synchronize both allocation & adding in all the
>>>>>>>> places we want to modify the IOVATree. At this moment, these are the
>>>>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and
>>>>>>>> net CVQ buffers. But it may be more in the future.
>>>>>>>>
>>>>>>>> What are the advantages of keeping these separated that justifies
>>>>>>>> needing to synchronize in all these places, compared with keeping them
>>>>>>>> synchronized in VhostIOVATree?
>>>>>>>
>>>>>>> It doesn't need to be synchronized.
>>>>>>>
>>>>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track
>>>>>>> which part of the range has been used.
>>>>>>>
>>>>>>
>>>>>> Not sure if I follow, that is what I mean with "synchronized".
>>>>>
>>>>> Oh right.
>>>>>
>>>>>>
>>>>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ ->
>>>>>>> IOVA mappings separately.
>>>>>>>
>>>>>>
>>>>>> Sorry, I still cannot see the whole picture :).
>>>>>>
>>>>>> Let's assume we have all the GPA mapped to specific IOVA regions, so
>>>>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
>>>>>> because of the migration. How can we know where we can place SVQ
>>>>>> vrings without having them synchronized?
>>>>>
>>>>> Just allocating a new IOVA range for SVQ?
>>>>>
>>>>>>
>>>>>> At this moment we're using a tree. The tree nature of the current SVQ
>>>>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look
>>>>>> for holes.
>>>>>
>>>>> Yes, iova allocate could still be implemented via a tree.
>>>>>
>>>>>>
>>>>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate
>>>>>> over all of them, order them, of the two trees, and then look for
>>>>>> holes there?
>>>>>
>>>>> Let me clarify, correct me if I was wrong:
>>>>>
>>>>> 1) IOVA allocator is still implemented via a tree, we just don't need
>>>>> to store how the IOVA is used
>>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in
>>>>> the datapath SVQ translation
>>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
>>>>>
>>>>
>>>> Ok, so the part I was missing is that now we have 3 whole trees, with
>>>> somehow redundant information :).
>>>
>>> Somehow, it decouples the IOVA usage out of the IOVA allocator. This
>>> might be simple as guests and SVQ may try to share a single IOVA
>>> address space.
>>>
>>
>> I'm working on implementing your three suggestions above but I'm a bit
>> confused with some of the wording and I was hoping you could clarify
>> some of it for me when you get the chance.
>>
>> ---
>> For your first suggestion (1) you mention decoupling the IOVA allocator
>> and "don't need to store how the IOVA is used."
>>
>> By this, do you mean to not save the IOVA->HVA mapping and instead only
>> save the allocated IOVA ranges? In other words, are you suggesting to
>> create a dedicated "IOVA->IOVA" tree like:
>>
>> struct VhostIOVATree {
>>       uint64_t iova_first;
>>       uint64_t iova_last;
>>       IOVATree *iova_map;
>> };
>>
>> Where the mapping might look something like (where translated_addr is
>> given some kind of 0 value):
>>
>> iova_region = (DMAMap) {
>>       .iova = iova_first,
>>       .translated_addr = 0,
>>       .size = region_size - 1,
>>       .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>> };
>>
>> Also, if this is what you mean by decoupling the IOVA allocator, what
>> happens to the IOVA->HVA tree? Are we no longer saving these mappings in
>> a tree?
>>
>> ---
>> In your second suggestion (2) with a dedicated GPA->IOVA tree, were you
>> thinking something like this? Just adding on to VhostIOVATree here:
>>
>> struct VhostIOVATree {
>>       uint64_t iova_first;
>>       uint64_t iova_last;
>>       IOVATree *iova_map;
>>       IOVATree *gpa_map;
>> };
>>
>> Where the mapping might look something like:
>>
>> gpa_region = (DMAMap) {
>>       .iova = iova_first,
>>       .translated_addr = gpa_first,
>>       .size = region_size - 1,
>>       .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>> };
>>
>> Also, when you say "used in the datapath SVQ translation", we still need
>> to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is
>> called, right?
>>
>> ---
>> Lastly, in your third suggestion (3) you mention implementing a
>> SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for
>> just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g.
>>
>> struct VhostIOVATree {
>>       uint64_t iova_first;
>>       uint64_t iova_last;
>>       IOVATree *iova_map;
>>       IOVATree *gpa_map;
>>       IOVATree *svq_map;
>> };
>>
>> ---
>>
>> Let me know if I'm understanding this correctly. If I am, this would
>> require a pretty good amount of refactoring on the IOVA allocation,
>> searching, destroying, etc. code to account for these new trees.
>>
> 
> Ok I think I understand your previous question better, sorry for the delay :).
> 
> If I'm not wrong, Jason did not enumerate these as alternatives but as
> part of the same solution. Jason please correct me if I'm wrong here.
> 
> His solution is composed of three trees:
> 1) One for the IOVA allocations, so we know where to allocate new ranges
> 2) One of the GPA -> SVQ IOVA translations.
> 3) Another one for SVQ vrings translations.
> 
> In my opinion to use GPA this is problematic as we force all of the
> callers to know if the address we want to translate comes from the
> guest or not. Current code does now know it, as
> vhost_svq_translate_addr is called to translate both buffer dataplane
> addresses and control virtqueue buffers, which are also shadowed.  To
> transmit that information to the caller code would require either
> duplicate functions, to add a boolean, etc, as it is in a different
> layer (net specific net/vhost-vdpa.c vs generic
> hw/virtio/vhost-shadow-virtqueue.c).
> 
> In my opinion is easier to keep just two trees (or similar structs):
> 1) One for the IOVA allocations, so we know where to allocate new
> ranges. We don't actually need to store the translations here.
> 2) One for HVA -> SVQ IOVA translations.
> 
> This way we can accommodate both SVQ vrings, CVQ buffers, and guest
> memory buffers, all on the second tree, and take care of the HVA
> duplications.
> 
> Thanks!

I assume that this dedicated IOVA tree will be created and added to in 
vhost_iova_tree_map_alloc --> iova_tree_alloc_map function calls, but 
what about the IOVA->HVA tree that gets created during 
vhost_vdpa_listener_region_add? Will this tree just be replaced with the 
dedicated IOVA tree?

Also, an HVA->SVQ IOVA tree means that the tree is balanced using the 
HVA value and not the IOVA value, right? In other words, a tree where 
it's more efficient to search using the HVA values vs. IOVA values?

> 
>> Thanks Jason!
>>
>>>>
>>>> In some sense this is simpler than trying to get all the information
>>>> from only two trees. On the bad side, all SVQ calls that allocate some
>>>> region need to remember to add to one of the two other threes. That is
>>>> what I mean by synchronized. But sure, we can go that way.
>>>
>>> Just a suggestion, if it turns out to complicate the issue, I'm fine
>>> to go the other way.
>>>
>>> Thanks
>>>
>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
> 


  reply	other threads:[~2024-07-29 17:51 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-10 10:03 [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Eugenio Pérez
2024-04-10 10:03 ` [RFC 1/2] iova_tree: add an id member to DMAMap Eugenio Pérez
2024-04-18 20:46   ` Si-Wei Liu
2024-04-19  8:29     ` Eugenio Perez Martin
2024-04-19 23:49       ` Si-Wei Liu
2024-04-22  8:49         ` Eugenio Perez Martin
2024-04-23 22:20           ` Si-Wei Liu
2024-04-24  7:33             ` Eugenio Perez Martin
2024-04-25 17:43               ` Si-Wei Liu
2024-04-29  8:14                 ` Eugenio Perez Martin
2024-04-29 11:19                   ` Jonah Palmer
2024-04-30 18:11                     ` Eugenio Perez Martin
2024-05-01 22:08                       ` Si-Wei Liu
2024-05-02  6:18                         ` Eugenio Perez Martin
2024-05-07  9:12                           ` Si-Wei Liu
2024-04-30  5:54                   ` Si-Wei Liu
2024-04-30 17:19                     ` Eugenio Perez Martin
2024-05-01 23:13                       ` Si-Wei Liu
2024-05-02  6:44                         ` Eugenio Perez Martin
2024-05-08  0:52                           ` Si-Wei Liu
2024-05-08 15:25                             ` Eugenio Perez Martin
2024-04-10 10:03 ` [RFC 2/2] vdpa: identify aliased maps in iova_tree Eugenio Pérez
2024-04-12  6:46 ` [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Jason Wang
2024-04-12  7:56   ` Eugenio Perez Martin
2024-05-07  7:29     ` Jason Wang
2024-05-07 10:56       ` Eugenio Perez Martin
2024-05-08  2:29         ` Jason Wang
2024-05-08 17:15           ` Eugenio Perez Martin
2024-05-09  6:27             ` Jason Wang
2024-05-09  7:10               ` Eugenio Perez Martin
2024-05-10  4:28                 ` Jason Wang
2024-05-10  7:16                   ` Eugenio Perez Martin
2024-05-11  4:00                     ` Jason Wang
2024-05-13  6:27                       ` Eugenio Perez Martin
2024-05-13  8:28                         ` Jason Wang
2024-05-13  9:56                           ` Eugenio Perez Martin
2024-05-14  3:56                             ` Jason Wang
2024-07-24 16:59                               ` Jonah Palmer
2024-07-29 10:04                                 ` Eugenio Perez Martin
2024-07-29 17:50                                   ` Jonah Palmer [this message]
2024-07-29 18:20                                     ` Eugenio Perez Martin
2024-07-29 19:33                                       ` Jonah Palmer
2024-07-30  8:47                                   ` Jason Wang
2024-07-30 11:00                                     ` Eugenio Perez Martin
2024-07-30 12:31                                       ` Jonah Palmer
2024-07-31  9:56                                         ` Eugenio Perez Martin
2024-07-31 14:09                                           ` Jonah Palmer
2024-08-01  0:41                                             ` Si-Wei Liu
2024-08-01  8:22                                               ` Eugenio Perez Martin
2024-08-02  5:54                                                 ` Si-Wei Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a9dd189f-ab4a-4dad-acec-31ee733fb1d7@oracle.com \
    --to=jonah.palmer@oracle.com \
    --cc=dtatulea@nvidia.com \
    --cc=eperezma@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=leiyang@redhat.com \
    --cc=mst@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=si-wei.liu@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).