From: Eduardo Habkost <ehabkost@redhat.com>
To: Marcel Apfelbaum <marcel@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel@nongnu.org, cohuck@redhat.com, f4bug@amsat.org,
	yuval.shaia@oracle.com, borntraeger@de.ibm.com,
	pbonzini@redhat.com, imammedo@redhat.com
Subject: Re: [Qemu-devel] [PATCH V8 1/4] mem: add share parameter to memory-backend-ram
Date: Thu, 1 Feb 2018 16:51:29 -0200
Message-ID: <20180201185129.GI21702@localhost.localdomain>
In-Reply-To: <f87b5810-6d87-bd92-be70-dd620501a29d@redhat.com>

On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 20:21, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 15:53, Eduardo Habkost wrote:
> >>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> >>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
> >>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> >>>>> [...]
> >>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>>>>>
> >>>>>>> It's a side effect of the kernel/userspace API which always wants
> >>>>>>> a single HVA/len pair to map memory for the application.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Hi Eduardo and Michael,
> >>>>>>
> >>>>>>>>  Can
> >>>>>>>> this be fixed?
> >>>>>>>
> >>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> >>>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
> >>>>>>> be used to refer to the region, without creating the two mappings.
> >>>>>>>
> >>>>>>> Something like splitting the register mr into
> >>>>>>>
> >>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>>>>>
> >>>>>>> addmemory(mr, offset, hva, len) - pin memory
> >>>>>>>
> >>>>>>> register mr - pass it to HW
> >>>>>>>
> >>>>>>> As a nice side effect we won't burn so much virtual address space.
> >>>>>>>
> >>>>>>
> >>>>>> We would still need a contiguous virtual address space range (for post-send)
> >>>>>> which we don't have since guest contiguous virtual address space
> >>>>>> will always end up as non-contiguous host virtual address space.
> >>>>>>
> >>>>>> I am not sure the RDMA HW can handle a large VA with holes.
> >>>>>
> >>>>> I'm confused.  Why would the hardware see and care about virtual
> >>>>> addresses? 
> >>>>
> >>>> The post-send operation bypasses the kernel, and the process
> >>>> puts GVAs in the work request.
> >>>>
> >>>>> How exactly does the hardware translate VAs to
> >>>>> PAs?
> >>>>
> >>>> The HW maintains a page-directory-like structure, separate from
> >>>> the MMU, that maps VA -> phys pages.
> >>>>
> >>>>> What if the process page tables change?
> >>>>>
> >>>>
> >>>> Since the HW uses its own page tables, we just need the phys
> >>>> pages to be pinned.
> >>>
> >>> So there's no hardware-imposed requirement that the hardware VAs
> >>> (mapped by the HW page directory) match the VAs in QEMU
> >>> address-space, right? 
> >>
> >> Actually there is. Today it works exactly as you described.
> > 
> > Are you sure there's such a hardware-imposed requirement?
> > 
> 
> Yes.
> 
> > Why would the hardware require VAs to match the ones in the
> > userspace address-space, if it doesn't use the CPU MMU at all?
> > 
> 
> It works like this:
> 
> 1. We register a buffer from the process address space
>    giving its base address and length.
>    This call goes to the kernel, which in turn pins the phys pages
>    and registers them with the device *together* with the base
>    address (virtual address!)
> 2. The device builds its own page tables to be able to translate
>    the virtual addresses to actual phys pages.

How would the device be able to do that?  It would require the
device to look at the process page tables, wouldn't it?  Isn't
the HW IOVA->PA translation table built by the OS?


> 3. The process posts send requests directly to the HW, bypassing
>    the kernel and giving process virtual addresses in work requests.
> 4. The device uses its own page tables to translate the virtual
>    addresses to phys pages and sends them.
> 
> Theoretically it is possible to use any contiguous IOVA instead of
> the process's one, but that is not how it works today.
> 
> Makes sense?
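
To make steps 1-4 above concrete, here is a toy model of the flow. It
is only a sketch under assumptions: ToyMemoryRegion, add_pages() and
translate() are made-up names for illustration, not part of the verbs
API, the kernel, or QEMU, and the page frame numbers are invented. The
model records a base address and a length, lets "kernel side" code fill
a device-private table with pinned physical pages, and then resolves
work-request addresses from that table alone, without consulting any
process page table.

```python
PAGE_SIZE = 4096


class ToyMemoryRegion:
    """Hypothetical model of a registered memory region (not a real API)."""

    def __init__(self, base_va, length):
        # Step 1: the base address and length given at registration time.
        self.base_va = base_va
        self.length = length
        # Step 2: device-private table, page index in region -> phys page.
        self.pages = {}

    def add_pages(self, offset, phys_pages):
        """Kernel side: record already-pinned physical pages at an offset."""
        if offset % PAGE_SIZE:
            raise ValueError("offset must be page aligned")
        if offset + len(phys_pages) * PAGE_SIZE > self.length:
            raise ValueError("pages exceed region length")
        first = offset // PAGE_SIZE
        for i, page in enumerate(phys_pages):
            self.pages[first + i] = page

    def translate(self, va):
        """Device side (steps 3-4): resolve a work-request address."""
        if not self.base_va <= va < self.base_va + self.length:
            raise ValueError("address outside registered region")
        index, page_offset = divmod(va - self.base_va, PAGE_SIZE)
        return self.pages[index] * PAGE_SIZE + page_offset


# The same pinned pages registered under two different base addresses.
phys = [0x80, 0x13, 0x57]  # invented physical page frame numbers
mr_a = ToyMemoryRegion(base_va=0x7F0000000000, length=3 * PAGE_SIZE)
mr_a.add_pages(0, phys)
mr_b = ToyMemoryRegion(base_va=0x10000000, length=3 * PAGE_SIZE)
mr_b.add_pages(0, phys)

# Both regions resolve the same offset to the same physical address.
pa_a = mr_a.translate(0x7F0000000000 + PAGE_SIZE + 8)
pa_b = mr_b.translate(0x10000000 + PAGE_SIZE + 8)
```

In this model the base address is just a number stored next to the
table, so pa_a and pa_b are equal even though the two bases differ.
Whether real hardware or the current kernel/userspace API allows the
base to differ from the process VA is the question being asked here;
the sketch does not answer it.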

-- 
Eduardo
