From: Peter Xu <peterx@redhat.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>,
"Raj, Ashok" <ashok.raj@intel.com>,
Jacob Pan <jacob.jun.pan@linux.intel.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"Kumar, Sanjay K" <sanjay.k.kumar@intel.com>,
"Sun, Yi Y" <yi.y.sun@intel.com>,
"iommu@lists.linux-foundation.org"
<iommu@lists.linux-foundation.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Alex Williamson <alex.williamson@redhat.com>,
David Woodhouse <dwmw2@infradead.org>
Subject: Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
Date: Wed, 25 Sep 2019 16:52:04 +0800
Message-ID: <20190925085204.GR28074@xz-x1>
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D19D58F5F5@SHSMSX104.ccr.corp.intel.com>
On Wed, Sep 25, 2019 at 08:02:23AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, September 25, 2019 3:45 PM
> >
> > On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Wednesday, September 25, 2019 2:57 PM
> > > >
> > > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > > > Hi Kevin,
> > > > >
> > > > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > > > > '-----------'
> > > > > > > > > '-----------'
> > > > > > > > >
> > > > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > > > first goal? then what are other goals? I didn't spot such information.
> > > > > >
> > > > >
> > > > > > The overall goal is to use IOMMU nested mode to avoid shadow page
> > > > > > tables and VMEXITs when mapping a gIOVA. This includes the 4 steps
> > > > > > below (maybe not accurate, but you could get the point.)
> > > > >
> > > > > > 1) gIOVA mappings over 1st-level page table;
> > > > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > > > 3) using pIOMMU second level for GPA->HPA translation;
> > > > > 4) enable nested (a.k.a. dual stage) translation in host.
> > > > >
> > > > > This patch set aims to achieve 1).
> > > >
> > > > Would it make sense to use 1st level even for bare-metal to replace
> > > > the 2nd level?
> > > >
> > > > What I'm thinking is the DPDK apps - they have MMU page table already
> > > > there for the huge pages, then if they can use 1st level as the
> > > > default device page table then it even does not need to map, because
> > > > it can simply bind the process root page table pointer to the 1st
> > > > level page root pointer of the device contexts that it uses.
> > > >
> > >
> > > > Then you need to bear with possible page faults from using CPU page
> > > > table, while most devices don't support it today.
> >
> > Right, I was just thinking aloud. After all neither do we have IOMMU
> > hardware to support 1st level (or am I wrong?)... It's just that when
>
> You are right. Current VT-d supports only 2nd level.
>
> > the 1st level is ready it should sound doable because IIUC PRI should
> > be always with the 1st level support no matter on IOMMU side or the
> > device side?
>
> No. PRI is not tied to 1st or 2nd level. Actually from device p.o.v, it's
> just a protocol to trigger page fault, but the device doesn't care whether
> the page fault is on 1st or 2nd level in the IOMMU side. The only
> relevant part is that a PRI request can have PASID tagged or cleared.
> When it's tagged with PASID, the IOMMU will locate the translation
> table under the given PASID (either 1st or 2nd level is fine, according
> to PASID entry setting). When no PASID is included, the IOMMU locates
> the translation from default entry (e.g. PASID#0 or any PASID contained
> in RID2PASID in VT-d).
>
> Your knowledge happened to be correct in the deprecated ECS mode. At
> that time, there was only one 2nd level per context entry, which didn't
> support page faults, and only one 1st level per PASID entry, which did
> support page faults. So PRI could be indirectly connected to the 1st
> level, but this changed with the new scalable mode.
>
> Another note is that the PRI capability only indicates that a device is
> capable of handling page faults, but not that a device can tolerate
> page fault for any of its DMA access. If the latter is false, using CPU
> page table for DPDK usage is still risky (and specific to device behavior).
>
> >
> > I'm actually not sure about whether my understanding here is
> > correct... I thought the pasid binding previously was only for some
> > vendor kernel drivers but not a general thing to userspace. I feel
> > like that should be doable in the future once we've got some new
> > syscall interface ready to deliver 1st level page table (e.g., via
> > vfio?) then applications like DPDK seems to be able to use that too
> > even directly via bare metal.
> >
>
> using 1st level for userspace is different from supporting DMA page
> fault in userspace. The former is purely about which structure to
> keep the mapping. I think we may do the same thing for both bare
> metal and guest (using 2nd level only for GPA when nested is enabled
> on the IOMMU). But reusing CPU page table for userspace is more
> tricky. :-)
Yes, I must have mixed up the 1st level page table and PRI a bit, and
after all my initial question is irrelevant to this series as well, so
it's already a bit off topic (sorry for that).
And, thanks for explaining these. :)
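
To check my reading of the lookup rule quoted above, here is a rough
sketch of how a page request would pick its translation table. It is
only a toy model under my own assumptions: the names PasidEntry,
DeviceContext and locate_translation are made up for illustration and
are not taken from the VT-d spec or the Linux driver, and the "table"
field is just a string standing in for a page-table root pointer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


# Hypothetical model only: none of these names exist in the VT-d spec
# or in the kernel. They just mirror the rule described in the thread.
@dataclass
class PasidEntry:
    level: str  # "first" or "second", as selected by the PASID entry setting
    table: str  # stand-in for the page-table root pointer


@dataclass
class DeviceContext:
    pasid_table: Dict[int, PasidEntry]
    rid2pasid: int = 0  # default PASID used when a request carries none


def locate_translation(ctx: DeviceContext, pasid: Optional[int]) -> PasidEntry:
    """Return the PASID entry that a page request would be resolved against.

    A request tagged with a PASID uses that PASID's entry, whichever
    level it selects. An untagged request falls back to the default
    entry, modelled here as the PASID stored in rid2pasid.
    """
    effective = ctx.rid2pasid if pasid is None else pasid
    return ctx.pasid_table[effective]


if __name__ == "__main__":
    ctx = DeviceContext(
        pasid_table={
            0: PasidEntry(level="second", table="default_table"),
            5: PasidEntry(level="first", table="process_table"),
        }
    )
    print(locate_translation(ctx, 5))     # tagged request
    print(locate_translation(ctx, None))  # untagged request
```

The point of the model is that the device side never sees "level" at
all: it only supplies a PASID or leaves it out, and the level is a
property of whichever PASID entry the IOMMU ends up at.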
--
Peter Xu