From: Alex Williamson <alex.williamson@redhat.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: "Song, Jike" <jike.song@intel.com>, Neo Jia <cjia@nvidia.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"kraxel@redhat.com" <kraxel@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"Ruan, Shuai" <shuai.ruan@intel.com>,
	"Lv, Zhiyuan" <zhiyuan.lv@intel.com>
Subject: Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
Date: Fri, 13 May 2016 10:16:18 -0600	[thread overview]
Message-ID: <20160513101618.0176a00b@t450s.home> (raw)
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F854092@SHSMSX101.ccr.corp.intel.com>

On Fri, 13 May 2016 03:55:09 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, May 13, 2016 3:06 AM
> >   
> > > >  
> > >
> > > Based on the above, I'm wondering whether the following would work
> > > (let's use 'gpa' for what the type1 driver currently calls iova, and use
> > > 'iova' for the address actually used by the vGPU driver; assume the
> > > 'pin-all' scenario first, which matches existing vfio logic):
> > >
> > > - No change to the existing vfio_dma structure. VFIO still maintains the
> > > gpa<->vaddr mapping, in coarse-grained regions;
> > >
> > > - Leverage the same page accounting/pinning logic in the type1 driver,
> > > which should be enough for 'pin-all' usage;
> > >
> > > - Then the main divergence point for vGPU would be in vfio_unmap_unpin
> > > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > > iommu_domain for vGPU so that the same iommu_map/unmap can be reused.
> > 
> > This seems troublesome.  Kirti's version used numerous api-only tests
> > to avoid these which made the code difficult to trace.  Clearly one
> > option is to split out the common code so that a new mediated-type1
> > backend skips this, but they thought they could clean it up without
> > this, so we'll see what happens in the next version.
> >   
> > > If not, we may introduce two new map/unmap callbacks provided
> > > specifically by the vGPU core driver, as you suggested:
> > >
> > > 	* The vGPU core driver uses dma_map_page to map the specified pfns:
> > >
> > > 		o When the IOMMU is enabled, the iova we get back differs
> > > from the pfn;
> > > 		o When the IOMMU is disabled, the returned iova is the same as the pfn;
> > 
> > Either way each iova needs to be stored and we have a worst case of one
> > iova per page of guest memory.
> >   
> > > 	* Then the vGPU core driver just maintains its own gpa<->iova lookup
> > > table (e.g. called vgpu_dma);
> > >
> > > 	* Because each vfio_iommu_map invocation is for a contiguous
> > > region, we can expect the same number of vgpu_dma entries as are
> > > maintained in the vfio_dma list;
> > >
> > > Then it's the vGPU core driver's responsibility to provide the gpa<->iova
> > > lookup for the vendor-specific GPU driver, and we don't need to worry
> > > about tens of thousands of entries. Once we get this simple 'pin-all'
> > > model ready, it can be further extended to support the 'pin-sparse'
> > > scenario: we still maintain a top-level vgpu_dma list, with each entry
> > > linking to its own sparse mapping structure. In reality I don't expect we
> > > really need to maintain a per-page translation even with sparse pinning.
> > 
> > If you're trying to equate the scale of what we need to track vs. what
> > type1 currently tracks, they're significantly different.  The things we
> > may need to track include the pfn, the iova, and possibly a reference
> > count or some sort of pinned page map.  In the pin-all model
> > we can assume that every page is pinned on map and unpinned on unmap,
> > so a reference count or map is unnecessary.  We can also assume that we
> > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > we don't need to track that.  I don't see any way around tracking the
> > iova.  The iommu can't tell us this like it can with the normal type1
> > model because the pfn is the result of the translation, not the key for
> > the translation. So we're always going to have between 1 and
> > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > data structure tracking every iova.  
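
For illustration only, here is a minimal sketch of the kind of tracking
described above for the pin-all case; the struct and field names are
hypothetical, not taken from any posted patch:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Hypothetical pin-all tracking: one entry per mapped region. */
struct vgpu_dma {
	struct rb_node	node;	/* vgpu_dma entries managed as an rb-tree */
	dma_addr_t	gpa;	/* guest physical start of the region */
	unsigned long	vaddr;	/* user virtual start; pfns regenerable
				 * via get_user_pages() */
	size_t		size;
	dma_addr_t	*iova;	/* worst case: one iova per page */
};
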
> 
> There is one option: we may use alloc_iova to reserve a contiguous iova
> range for each vgpu_dma range and then use iommu_map/unmap to
> write the iommu ptes later upon a map request (then there could be the
> same number of entries as for vfio_dma, compared to unbounded entries
> when using dma_map_page). Of course this needs to be done in the vGPU
> core driver, since vfio type1 only sees a faked iommu domain.

I'm not sure this is really how iova domains work.  There's only one
iova domain per iommu domain using the dma-iommu API, and
iommu_map/unmap are part of a different API.  An iova domain may be an
interesting solution, though.
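
As a rough sketch of the alloc_iova plus iommu_map combination suggested
above (the function and its parameters are hypothetical, error unwinding
is omitted, and whether these two APIs can sanely be mixed this way is
exactly the open question):

#include <linux/iommu.h>
#include <linux/iova.h>

/* Hypothetical: reserve one contiguous iova range per region, then
 * fill the IOMMU page tables page by page from the pinned pfns. */
static int vgpu_map_range(struct iommu_domain *domain,
			  struct iova_domain *iovad,
			  unsigned long *pfns, unsigned long npages)
{
	struct iova *iova;
	dma_addr_t base;
	unsigned long i;
	int ret = 0;

	/* Reserve npages of contiguous iova space (granule == PAGE_SIZE). */
	iova = alloc_iova(iovad, npages, ULONG_MAX >> PAGE_SHIFT, true);
	if (!iova)
		return -ENOMEM;

	base = (dma_addr_t)iova->pfn_lo << PAGE_SHIFT;

	for (i = 0; i < npages; i++) {
		ret = iommu_map(domain, base + (i << PAGE_SHIFT),
				(phys_addr_t)pfns[i] << PAGE_SHIFT,
				PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
		if (ret)
			break;	/* unwind of partial mappings omitted */
	}

	return ret;
}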
 
> > Sparse mapping has the same issue but of course the tree of iovas is
> > potentially incomplete and we need a way to determine where it's
> > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > offset from the start vaddr seems like the way to go here.  It's also
> > possible that some mediated device models might store the iova in the
> > command sent to the device and therefore be able to parse those entries
> > back out to unmap them without storing them separately.  This might be
> > how the s390 channel-io model would prefer to work.  That seems like
> > further validation that such tracking is going to be dependent on the
> > mediated driver itself and probably not something to centralize in a
> > mediated iommu driver.  Thanks,
> >   
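
For illustration, a rough sketch of the "page table rooted in the
vgpu_dma and indexed by the offset from the start vaddr" idea, with the
flat per-page iova array from the earlier sketch replaced by a per-page
{iova, pinned} table; a single flat level is shown for brevity and all
names remain hypothetical:

#include <linux/types.h>

/* Hypothetical sparse-pinning variant of the vgpu_dma sketch above. */
struct vgpu_pte {
	dma_addr_t	iova;
	bool		pinned;
};

struct vgpu_dma {
	unsigned long	vaddr;		/* user virtual start of the region */
	size_t		size;
	struct vgpu_pte	*pgtbl;		/* one entry per page of the region */
};

static bool vgpu_dma_lookup(struct vgpu_dma *dma, unsigned long vaddr,
			    dma_addr_t *iova)
{
	unsigned long idx = (vaddr - dma->vaddr) >> PAGE_SHIFT;

	if (idx >= (dma->size >> PAGE_SHIFT) || !dma->pgtbl[idx].pinned)
		return false;	/* outside the region or not pinned */

	*iova = dma->pgtbl[idx].iova;
	return true;
}

A real implementation would presumably use multiple levels, or allocate
table chunks only for pinned ranges, so that unpinned spans cost nothing.
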
> 
> Another, simpler way might be to allocate an array for each memory
> region registered from user space. For a 512MB region, that means a
> 512K*4=2MB array to track the pfn or iova mapping corresponding to
> each gfn. It may consume more resources than an rb tree when not many
> pages need to be pinned, but could consume less when the rb tree grows
> large.

An array is only the most space-efficient structure for a fully pinned
area where we have no contiguous iova.  If we're either mapping a
larger hugepage, or we have a larger contiguous iova space due to
scatter-gather mapping, or we're sparsely pinning the region, an array
can waste a lot of space.  A 512MB region is also a pretty anemic
example: 2MB is a reasonable overhead, but 2MB per 512MB looks pretty
bad when we have a 512GB VM.  Thanks,

Alex
