From: Peter Xu <peterx@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>,
tianyu.lan@intel.com, kevin.tian@intel.com, mst@redhat.com,
jan.kiszka@siemens.com, bd.aviv@gmail.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
Date: Wed, 25 Jan 2017 12:04:27 +0800
Message-ID: <20170125040427.GB5151@pxdev.xzpeter.org>
In-Reply-To: <20170124092429.241a4eaf@t450s.home>
On Tue, Jan 24, 2017 at 09:24:29AM -0700, Alex Williamson wrote:
[...]
> > I see. Then this will be a strict requirement that we cannot do
> > coalescing during page walk, at least for mappings.
> >
> > I didn't notice this before, but luckily the current series follows
> > the rule above: we basically do the mapping in units of pages.
> > Normally we always map with 4K pages; only if the guest provides huge
> > pages in the VT-d page table do we notify a map larger than 4K, and
> > then it can only be 2M or 1G, never any other value.
> >
> > The point is, the guest should be aware of the existence of the huge
> > pages above, so it won't unmap (for example) a single 4K region within
> > a 2M huge page range. It'll either keep the huge page, or unmap the
> > whole huge page. In that sense, we are quite safe.
> >
> > (for my own curiosity, and off topic: could I ask why we can't do
> > that? e.g., we map 4K*2 pages, then we unmap the first 4K page?)
>
> You understand why we can't do this in the hugepage case, right? A
> hugepage means that at least one entire level of the page table is
> missing and that in order to unmap a subsection of it, we actually need
> to replace it with a new page table level, which cannot be done
> atomically relative to the rest of the PTEs in that entry. Now what if
> we don't assume that hugepages are only the Intel defined 2MB & 1GB?
> AMD-Vi supports effectively arbitrary power of two page table entries.
> So what if we've passed a 2x 4K mapping where the physical pages were
> contiguous, vfio passed it as a direct 8K mapping to the IOMMU, and
> the IOMMU has native support for 8K mappings? We're in a similar
> scenario to the 2MB page, different page table layout though.
Thanks for the explanation. The AMD example is clear.
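
A toy model of that constraint, as a sketch only. This is not vfio or
IOMMU driver code; the class ToyIommu and its map()/unmap() methods are
made up for illustration. It assumes each installed entry covers one
range and can only be removed whole, which is the property that makes
unmapping the first 4K of an 8K (or 2MB) entry a problem:

```python
PAGE_4K = 4 * 1024


class ToyIommu:
    """Hypothetical model: one entry per mapped range, removed only whole."""

    def __init__(self):
        self.entries = {}  # start iova -> size in bytes

    def map(self, iova, size):
        # Refuse a mapping that overlaps an entry already installed.
        for start, length in self.entries.items():
            if iova < start + length and start < iova + size:
                raise ValueError("overlaps an existing entry")
        self.entries[iova] = size

    def unmap(self, iova, size):
        end = iova + size
        # Collect every entry that the requested range touches.
        hit = [(s, l) for s, l in self.entries.items()
               if s < end and iova < s + l]
        # An entry sticking out of the range would have to be split,
        # which this model (like a hugepage entry) cannot do.
        for s, l in hit:
            if s < iova or s + l > end:
                return False
        for s, _ in hit:
            del self.entries[s]
        return True


# Two separate 4K entries: the first page can go away on its own.
two_pages = ToyIommu()
two_pages.map(0, PAGE_4K)
two_pages.map(PAGE_4K, PAGE_4K)
print(two_pages.unmap(0, PAGE_4K))

# One 8K entry: unmapping only the first 4K is refused.
one_entry = ToyIommu()
one_entry.map(0, 2 * PAGE_4K)
print(one_entry.unmap(0, PAGE_4K))
```

The same reasoning applies to any entry size the hardware supports, not
just 2MB and 1GB.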
>
> > > I would think (but please confirm), that when we're only tracking
> > > mappings generated by the guest OS that this works. If the guest OS
> > > maps with 4k pages, we get map notifies for each of those 4k pages. If
> > > they use 2MB pages, we get 2MB ranges and invalidations will come in
> > > the same granularity.
> >
> > I would agree (I haven't thought of a case that this might be a
> > problem).
> >
> > >
> > > An area of concern though is the replay mechanism in QEMU, I'll need to
> > > look for it in the code, but replaying an IOMMU domain into a new
> > > container *cannot* coalesce mappings or else it limits the granularity
> > > with which we can later accept unmaps. Take for instance a guest that
> > > has mapped a contiguous 2MB range with 4K pages. They can unmap any 4K
> > > page within that range. However if vfio gets a single 2MB mapping
> > > rather than 512 4K mappings, then the host IOMMU may use a hugepage
> > > mapping where our granularity is now 2MB. Thanks,
> >
> > Is this the answer to my question above (which is for my own
> > curiosity)? If so, that'll kind of explain.
> >
> > If it's just because vfio is smart enough to automatically use huge
> > pages when applicable (I believe it's for performance's sake), I'm
> > not sure whether we could introduce an ioctl() to set up the
> > iova_pgsizes bitmap, as long as it is a subset of the supported
> > iova_pgsizes (from VFIO_IOMMU_GET_INFO). Then when people want to get
> > rid of the above limitation, they can explicitly set iova_pgsizes to
> > only allow 4K pages.
> >
> > But, of course, this series can live well without it at least for now.
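
For reference, the bitmap arithmetic behind that idea, as a sketch. The
restricting ioctl() does not exist; it is only the proposal above, and
the helper names and the example bitmap value are assumptions for
illustration. Each set bit N in iova_pgsizes means a page size of 2^N
bytes is supported:

```python
SZ_4K = 1 << 12
SZ_2M = 1 << 21
SZ_1G = 1 << 30


def min_page_size(pgsizes):
    """Smallest supported page size: the lowest set bit of the bitmap."""
    return pgsizes & -pgsizes


def is_valid_restriction(requested, supported):
    """A requested bitmap is acceptable if it is a non-empty subset."""
    return requested != 0 and (requested & ~supported) == 0


# Example bitmap, assumed here; a real one comes from VFIO_IOMMU_GET_INFO.
supported = SZ_4K | SZ_2M | SZ_1G

print(min_page_size(supported))
print(is_valid_restriction(SZ_4K, supported))    # 4K only: a subset
print(is_valid_restriction(1 << 13, supported))  # 8K: not supported here
```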
>
> Yes, this is part of how vfio transparently makes use of hugepages in
> the IOMMU, we effectively disregard the supported page sizes bitmap
> (it's useless for anything other than determining the minimum page size
> anyway), and instead pass through the largest range of iovas which are
> physically contiguous. The IOMMU driver can then make use of hugepages
> where available. The VFIO_IOMMU_MAP_DMA ioctl does include a flags
> field where we could appropriate a bit to indicate map with minimum
> granularity, but that would not be as simple as triggering the
> disable_hugepages mapping path because the type1 driver would also need
> to flag the internal vfio_dma as being bisectable, if not simply
> converted to multiple vfio_dma structs internally. Thanks,
I see, thanks!
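
To summarize the two behaviours discussed above, a small sketch. The
function names are made up and this is a simplification, not the type1
driver logic. coalesce() merges pages that are contiguous in both IOVA
and physical address into the largest possible ranges, which is what
allows hugepages on the host; per_page() keeps one mapping per 4K page,
which is what a replay has to do if the guest mapped with 4K pages and
may later unmap any single one of them:

```python
PAGE = 4096


def coalesce(pages):
    """Merge 4K pages into runs contiguous in both IOVA and physical space.

    pages: list of (iova, phys) tuples, sorted by iova.
    Returns a list of (iova, phys, size) tuples.
    """
    runs = []
    for iova, phys in pages:
        if runs:
            r_iova, r_phys, r_size = runs[-1]
            if iova == r_iova + r_size and phys == r_phys + r_size:
                # Extends the previous run on both sides: grow it.
                runs[-1] = (r_iova, r_phys, r_size + PAGE)
                continue
        runs.append((iova, phys, PAGE))
    return runs


def per_page(pages):
    """Keep the guest's granularity: one mapping per 4K page."""
    return [(iova, phys, PAGE) for iova, phys in pages]


# A guest maps a contiguous 2MB range using 512 4K pages.
guest_pages = [(i * PAGE, 0x10000000 + i * PAGE) for i in range(512)]

print(len(coalesce(guest_pages)))   # one 2MB range, unmappable only whole
print(len(per_page(guest_pages)))   # 512 ranges, each unmappable alone
```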
-- peterx