From: Neo Jia <cjia@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Tian, Kevin" <kevin.tian@intel.com>,
"Song, Jike" <jike.song@intel.com>,
Kirti Wankhede <kwankhede@nvidia.com>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"kraxel@redhat.com" <kraxel@redhat.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"Ruan, Shuai" <shuai.ruan@intel.com>,
"Lv, Zhiyuan" <zhiyuan.lv@intel.com>
Subject: Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
Date: Thu, 12 May 2016 13:12:58 -0700
Message-ID: <20160512201258.GB24334@nvidia.com>
In-Reply-To: <20160512130552.08974076@t450s.home>

On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> On Thu, 12 May 2016 08:00:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, May 12, 2016 6:06 AM
> > >
> > > On Wed, 11 May 2016 17:15:15 +0800
> > > Jike Song <jike.song@intel.com> wrote:
> > >
> > > > On 05/11/2016 12:02 AM, Neo Jia wrote:
> > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> > > > >>>> From: Song, Jike
> > > > >>>>
> > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > >>>> translations for use by later requests".
> > > > >>>>
> > > > >>>> That imposes a constraint on the gfx driver: the hardware IOMMU must be
> > > > >>>> disabled. Otherwise, if an IOMMU is present, the gfx driver eventually
> > > > >>>> programs the hardware IOMMU with the IOVA returned by pci_map_page or
> > > > >>>> dma_map_page, while the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > >>>> translations without any knowledge of the hardware IOMMU. How, then, is
> > > > >>>> the device model supposed to get an IOVA for a given GPA (and thereby the
> > > > >>>> HPA from the IOMMU backend here)?
> > > > >>>>
> > > > >>>> If things go as guessed above, where vfio_pin_pages() pins and
> > > > >>>> translates a vaddr to a PFN, then it will be very difficult for the
> > > > >>>> device model to figure out:
> > > > >>>>
> > > > >>>> 1. for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > >>>> 2. for which page to call dma_unmap_page?
> > > > >>>>
> > > > >>>> --
> > > > >>>
> > > > >>> We have to support both the w/ iommu and w/o iommu cases, since
> > > > >>> that fact is out of the GPU driver's control. A simple way is to use
> > > > >>> dma_map_page, which internally copes with the w/ and w/o iommu
> > > > >>> cases gracefully, i.e. it returns an HPA w/o iommu and an IOVA w/ iommu.
> > > > >>> Then in this file we only need to cache GPA to whatever dma_addr_t is
> > > > >>> returned by dma_map_page.
> > > > >>>
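
A minimal sketch of the caching Kevin describes above -- struct and function
names made up purely for illustration, not actual driver code -- could look
like this, with dma_map_page() hiding the w/ vs. w/o IOMMU difference and the
caller simply remembering whatever dma_addr_t comes back for each GPA:

#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/slab.h>

/* Hypothetical cache entry: one guest page -> whatever dma_map_page() gave. */
struct gpa_dma_ent {
        unsigned long   gpa;    /* guest physical address, page aligned      */
        dma_addr_t      dma;    /* HPA when no IOMMU, IOVA when IOMMU is on  */
};

/*
 * Map one already-pinned page for 'dev'.  dma_map_page() copes with both the
 * w/ and w/o IOMMU cases, so the caller only caches gpa -> ent->dma (e.g. in
 * a gpa-keyed rb-tree) and never touches the hardware IOMMU itself.
 */
static struct gpa_dma_ent *gpa_dma_map_one(struct device *dev,
                                           unsigned long gpa,
                                           struct page *page)
{
        struct gpa_dma_ent *ent;

        ent = kzalloc(sizeof(*ent), GFP_KERNEL);
        if (!ent)
                return ERR_PTR(-ENOMEM);

        ent->gpa = gpa;
        ent->dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, ent->dma)) {
                kfree(ent);
                return ERR_PTR(-EFAULT);
        }
        return ent;
}
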
> > > > >>
> > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > With mediated passthrough you can still use a hardware iommu, but more
> > > > > importantly, that part is actually orthogonal to what we are discussing
> > > > > here, as we will only cache the mapping between <gfn (iova if the guest
> > > > > has an iommu), (qemu) va>. Once we have pinned the pages later with the
> > > > > help of the above info, you can map them into the proper iommu domain if
> > > > > the system is configured so.
> > > > >
> > > >
> > > > Hi Neo,
> > > >
> > > > Technically, yes, you can map a pfn into the proper IOMMU domain elsewhere,
> > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > track it with another rbtree-like data structure (the IOMMU driver simply
> > > > doesn't bother with tracking), which seems somewhat duplicative of the vGPU
> > > > IOMMU backend we are discussing here.
> > > >
> > > > And is it also semantically correct for an IOMMU backend to handle both
> > > > the w/ and w/o IOMMU hardware cases? :)
> > >
> > > A problem with the iommu doing the dma_map_page(), though, is: for what
> > > device does it do this? In the mediated case the vfio infrastructure
> > > is dealing with a software representation of a device. For all we
> > > know that software model could transparently migrate from one physical
> > > GPU to another. There may not even be a physical device backing
> > > the mediated device. Those are details left to the vgpu driver itself.
> >
> > This is a fair argument. The VFIO iommu driver simply serves user space
> > requests, where only vaddr<->iova (essentially gpa in the kvm case)
> > matters. How the iova is mapped into the real IOMMU is of no interest
> > to VFIO.
> >
> > >
> > > Perhaps one possibility would be to allow the vgpu driver to register
> > > map and unmap callbacks. The unmap callback might provide the
> > > invalidation interface that we're so far missing. The combination of
> > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > entire VM memory space, ie. for each map callback do a translation
> > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > the translation. There's still the problem of where that dma_addr_t
> > > from the dma_map_page is stored though. Someone would need to keep
> > > track of iova to dma_addr_t. The vfio iommu might be a place to do
> > > that since we're already tracking information based on iova, possibly
> > > in an opaque data element provided by the vgpu driver. However, we're
> > > going to need to take a serious look at whether an rb-tree is the right
> > > data structure for the job. It works well for the current type1
> > > functionality where we typically have tens of entries. I think the
> > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > thousands. If Intel intends to pin the entire guest, that's
> > > potentially tens of millions of tracked entries and I don't know that
> > > an rb-tree is the right tool for that job. Thanks,
> > >
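
Purely as a strawman for the callback idea above -- none of these hooks exist
today and the names are invented -- the registration could look roughly like
this, with the vfio iommu calling back on every map/unmap and the vgpu driver
doing the pin + dma_map_page() and handing back the dma_addr_t for the iommu
backend to store alongside the iova:

#include <linux/device.h>
#include <linux/types.h>

/*
 * Hypothetical callbacks a vgpu/mediated driver could register with the vfio
 * iommu backend.  On map: translate (pin) the vaddr, dma_map_page() it, and
 * return the dma_addr_t for the backend to track per iova.  On unmap:
 * dma_unmap_page() and release the translation (unpin).
 */
struct vgpu_dma_ops {
        int  (*map)(void *private, unsigned long iova, unsigned long vaddr,
                    size_t size, dma_addr_t *dma);
        void (*unmap)(void *private, unsigned long iova, dma_addr_t dma,
                      size_t size);
};

/* Invented registration hook; real code would live in the vfio iommu driver. */
int vfio_register_vgpu_dma_ops(struct device *dev,
                               const struct vgpu_dma_ops *ops, void *private);

Where to store the returned dma_addr_t per iova is then exactly the tracking
question discussed below.
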
> >
> > Based on the above thoughts, I'm wondering whether the below would work
> > (let's use gpa in place of the existing iova in the type1 driver, and use
> > iova for the address actually used in the vGPU driver; assume the
> > 'pin-all' scenario first, which matches the existing vfio logic):
> >
> > - No change to the existing vfio_dma structure. VFIO still maintains the
> > gpa<->vaddr mapping, in coarse-grained regions;
> >
> > - Leverage the same page accounting/pinning logic in the type1 driver, which
> > should be enough for the 'pin-all' usage;
> >
> > - Then the main divergence point for vGPU would be in vfio_unmap_unpin
> > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > iommu_domain for vGPU so that the same iommu_map/unmap can be reused.
>
> This seems troublesome. Kirti's version used numerous api-only tests
> to avoid these, which made the code difficult to trace. Clearly one
> option is to split out the common code so that a new mediated-type1
> backend skips this, but they thought they could clean it up without
> this, so we'll see what happens in the next version.
>
> > If not, we may introduce two new map/unmap callbacks provided
> > specifically by the vGPU core driver, as you suggested:
> >
> > * The vGPU core driver uses dma_map_page to map the specified pfns:
> >
> > o When the IOMMU is enabled, the returned iova differs from the
> > pfn;
> > o When the IOMMU is disabled, the returned iova is the same as
> > the pfn;
>
> Either way each iova needs to be stored and we have a worst case of one
> iova per page of guest memory.
>
> > * Then the vGPU core driver just maintains its own gpa<->iova lookup
> > table (e.g. called vgpu_dma);
> >
> > * Because each vfio_iommu_map invocation is about a contiguous
> > region, we can expect the same number of vgpu_dma entries as are
> > maintained in the vfio_dma list;
> >
> > Then it's the vGPU core driver's responsibility to provide the gpa<->iova
> > lookup for the vendor-specific GPU driver. And we don't need to worry
> > about tens of thousands of entries. Once we get this simple 'pin-all'
> > model ready, it can be further extended to support the 'pin-sparse'
> > scenario: we still maintain a top-level vgpu_dma list, with each entry
> > further linking its own sparse mapping structure. In reality I don't
> > expect we will really need to maintain per-page translations even with
> > sparse pinning.
>
> If you're trying to equate the scale of what we need to track vs what
> type1 currently tracks, they're significantly different. Possible
> things we need to track include the pfn, the iova, and possibly a
> reference count or some sort of pinned page map. In the pin-all model
> we can assume that every page is pinned on map and unpinned on unmap,
> so a reference count or map is unnecessary. We can also assume that we
> can always regenerate the pfn with get_user_pages() from the vaddr, so
> we don't need to track that.

Hi Alex,

Thanks for pointing this out. We will not track those in our next rev, and
get_user_pages will be used on the vaddr, as you suggested, to handle the
case of a single VM with both passthrough and mediated devices.
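
Roughly what we have in mind, just as a sketch (the helper name is made up,
and the exact get_user_pages*() signature of course depends on the kernel
version):

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Regenerate the pfn from the vaddr instead of tracking it: pin the user
 * page backing 'vaddr' and return its pfn.  The caller drops the reference
 * again with put_page() on unmap.
 */
static long vaddr_to_pinned_pfn(unsigned long vaddr, unsigned long *pfn)
{
        struct page *page;
        long ret;

        ret = get_user_pages_fast(vaddr & PAGE_MASK, 1, 1 /* write */, &page);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        *pfn = page_to_pfn(page);
        return 0;
}
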
Thanks,
Neo

> I don't see any way around tracking the
> iova. The iommu can't tell us this like it can with the normal type1
> model because the pfn is the result of the translation, not the key for
> the translation. So we're always going to have between 1 and
> (size/PAGE_SIZE) iova entries per vgpu_dma entry. You might be able to
> manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> data structure tracking every iova.
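
To make the scale concrete, a purely hypothetical layout matching the
description above (none of this is real code) would be an rb-tree of vgpu_dma
regions, each carrying its own per-page dma_addr_t storage, i.e. up to
size/PAGE_SIZE tracked entries per region:

#include <linux/mm.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

struct vgpu_dma {
        struct rb_node  node;           /* in a gpa-keyed rb-tree            */
        unsigned long   gpa;            /* guest physical start of region    */
        unsigned long   vaddr;          /* matching qemu vaddr               */
        size_t          size;
        dma_addr_t      *dma;           /* one slot per page in the region   */
};

static struct vgpu_dma *vgpu_dma_alloc(unsigned long gpa, unsigned long vaddr,
                                       size_t size)
{
        struct vgpu_dma *d = kzalloc(sizeof(*d), GFP_KERNEL);

        if (!d)
                return NULL;
        d->gpa   = gpa;
        d->vaddr = vaddr;
        d->size  = size;
        d->dma   = vzalloc((size >> PAGE_SHIFT) * sizeof(dma_addr_t));
        if (!d->dma) {
                kfree(d);
                return NULL;
        }
        return d;
}

/* gpa -> dma_addr_t, once 'd' has been found in the rb-tree by gpa. */
static inline dma_addr_t vgpu_dma_addr(struct vgpu_dma *d, unsigned long gpa)
{
        return d->dma[(gpa - d->gpa) >> PAGE_SHIFT];
}
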
>
> Sparse mapping has the same issue but of course the tree of iovas is
> potentially incomplete and we need a way to determine where it's
> incomplete. A page table rooted in the vgpu_dma and indexed by the
> offset from the start vaddr seems like the way to go here. It's also
> possible that some mediated device models might store the iova in the
> command sent to the device and therefore be able to parse those entries
> back out to unmap them without storing them separately. This might be
> how the s390 channel-io model would prefer to work. That seems like
> further validation that such tracking is going to be dependent on the
> mediated driver itself and probably not something to centralize in a
> mediated iommu driver. Thanks,
>
> Alex
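
As a rough illustration of the page-table idea for sparse pinning mentioned
above -- entirely invented, two levels, indexed by the page offset from the
region's start vaddr -- an absent second level or a zero slot doubles as the
"not pinned/mapped here" marker:

#include <linux/mm.h>
#include <linux/slab.h>

#define VGPU_PT_SHIFT   9                       /* 512 entries per level */
#define VGPU_PT_ENTRIES (1UL << VGPU_PT_SHIFT)
#define VGPU_PT_MASK    (VGPU_PT_ENTRIES - 1)

struct vgpu_pt {
        dma_addr_t *l2[VGPU_PT_ENTRIES];        /* second levels, on demand */
};

static dma_addr_t *vgpu_pt_slot(struct vgpu_pt *pt, unsigned long vaddr,
                                unsigned long start_vaddr, bool alloc)
{
        unsigned long idx = (vaddr - start_vaddr) >> PAGE_SHIFT;
        unsigned long hi  = idx >> VGPU_PT_SHIFT;
        unsigned long lo  = idx & VGPU_PT_MASK;

        if (hi >= VGPU_PT_ENTRIES)
                return NULL;    /* region larger than this toy table covers */
        if (!pt->l2[hi]) {
                if (!alloc)
                        return NULL;
                pt->l2[hi] = kcalloc(VGPU_PT_ENTRIES, sizeof(dma_addr_t),
                                     GFP_KERNEL);
                if (!pt->l2[hi])
                        return NULL;
        }
        return &pt->l2[hi][lo];
}
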