From: Robin Murphy <robin.murphy@arm.com>
To: "Tian, Kevin" <kevin.tian@intel.com>, Nicolin Chen <nicolinc@nvidia.com>
Cc: "jgg@nvidia.com" <jgg@nvidia.com>,
    "joro@8bytes.org" <joro@8bytes.org>,
    "will@kernel.org" <will@kernel.org>,
    "shuah@kernel.org" <shuah@kernel.org>,
    "iommu@lists.linux.dev" <iommu@lists.linux.dev>,
    "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
    "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Subject: Re: [PATCH v2 2/3] iommu/dma: Support MSIs through nested domains
Date: Fri, 9 Aug 2024 18:43:47 +0100
Message-ID: <f4c4a142-d0bb-44c5-8bb9-56136c8f7cf2@arm.com>
In-Reply-To: <BN9PR11MB5276D9387CB50D58E4A7585F8CBA2@BN9PR11MB5276.namprd11.prod.outlook.com>

On 2024-08-09 9:00 am, Tian, Kevin wrote:
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: Friday, August 9, 2024 7:00 AM
>>
>> On Thu, Aug 08, 2024 at 01:38:44PM +0100, Robin Murphy wrote:
>>> On 06/08/2024 9:25 am, Tian, Kevin wrote:
>>>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>>> Sent: Saturday, August 3, 2024 8:32 AM
>>>>>
>>>>> From: Robin Murphy <robin.murphy@arm.com>
>>>>>
>>>>> Currently, iommu-dma is the only place outside of IOMMUFD and drivers
>>>>> which might need to be aware of the stage 2 domain encapsulated within
>>>>> a nested domain. This would be in the legacy-VFIO-style case where we're
>>>>
>>>> why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
>>>
>>> Because with proper nesting we ideally shouldn't need the host-managed
>>> MSI mess at all, which all stems from the old VFIO paradigm of
>>> completely abstracting interrupts from userspace. I'm still hoping
>>> IOMMUFD can grow its own interface for efficient MSI passthrough, where
>>> the VMM can simply map the physical MSI doorbell into whatever IPA (GPA)
>>> it wants it to appear at in the S2 domain, then whatever the guest does
>>> with S1 it can program the MSI address into the endpoint accordingly
>>> without us having to fiddle with it.
>>
>> Hmm, until now I wasn't convinced myself that it could work, as I
>> was worried about the data. But on second thought, since the
>> host configures the MSI, it can still set the correct data. All we
>> need is to change the MSI address from an RMRed IPA/gIOVA to a
>> real gIOVA of the vITS page.
>>
>> I did a quick hack to test that loop. MSI in the guest still works
>> fine without having the RMR node in its IORT. Sweet!
>>
>> To go further on this path, we will need the following changes:
>> - MSI configuration in the host (via a VFIO_IRQ_SET_ACTION_TRIGGER
>> hypercall) should set gIOVA instead of fetching from msi_cookie.
>> That hypercall doesn't forward an address currently, since host
>> kernel pre-sets the msi_cookie. So, we need a way to forward the
>> gIOVA to kernel and pack it into the msi_msg structure. I haven't
>> read the VFIO PCI code thoroughly, yet wonder if we could just
>> let the guest program the gIOVA to the PCI register and fall it
>> through to the hardware, so host kernel handling that hypercall
>> can just read it back from the register?
>> - IOMMUFD should provide VMM a way to tell the gPA (or directly +
>> GITS_TRANSLATER?). Then kernel should do the stage-2 mapping. I
>> have talked to Jason about this a while ago, and we have a few
>> thoughts how to implement it. But eventually, I think we still
>> can't avoid a middle man like msi_cookie to associate the gPA in
>> IOMMUFD to PA in irqchip?
>
> Probably a new IOMMU_DMA_MSI_COOKIE_USER type which uses
> GPA (passed in in ALLOC_HWPT for a nested_parent type) as IOVA
> in iommu_dma_get_msi_page()?

No, the whole point is to get away from cookies and having to keep track
of things in the kernel that can and should just be simple regular
user-owned S2 mappings.

>> One more concern is the MSI window size. VMM sets up an MSI region
>> that must fit the hardware window size. Most ITS versions have
>> only one page size but one of them can have multiple pages? What
>> if vITS is one-page size while the underlying pITS has multiple?
>>
>> My understanding of the current kernel-defined 1MB size is also a
>> hard-coded window to potentially fit all cases, since IOMMU code in
>> the kernel can just eyeball what's going on in the irqchip subsystem
>> and adjust accordingly if someday it needs to. But VMM can't?

The existing design is based around the kernel potentially having to
stuff multiple different mappings for different devices into the MSI
hole in a single domain, since VFIO userspace is allowed to do wacky
things like emulate INTx using an underlying physical MSI, so there may
not be any actual vITS region in the VM IPA space at all. I think that
was also why it ended up being a fake reserved region exposed by the
SMMU drivers rather than relying on userspace to say where to put it -
making things look superficially a bit more x86-like meant fewer changes
to userspace, which I think by now we can consider a tasty slice of
technical debt.
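
Purely to illustrate the "multiple mappings stuffed into one MSI hole"
behaviour described above, here is a toy userspace model in Python. It is
a sketch under stated assumptions, not the kernel implementation: the
class name MsiCookie, the method get_msi_page(), the 4 KiB granule and the
window base are all hypothetical stand-ins; only the 1MB window size comes
from the discussion above.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 0x1000            # assumed 4 KiB granule
MSI_WINDOW_BASE = 0x8000000   # illustrative base of the reserved MSI hole
MSI_WINDOW_SIZE = 0x100000    # the 1MB window size mentioned above


@dataclass
class MsiCookie:
    """Toy stand-in for per-domain MSI bookkeeping (hypothetical)."""
    next_iova: int = MSI_WINDOW_BASE
    pages: dict = field(default_factory=dict)  # doorbell PA page -> IOVA page

    def get_msi_page(self, doorbell_pa: int) -> int:
        """Return the IOVA for doorbell_pa, mapping its page on first use."""
        pa_page = doorbell_pa & ~(PAGE_SIZE - 1)
        offset = doorbell_pa & (PAGE_SIZE - 1)
        if pa_page not in self.pages:
            # A new physical doorbell takes the next free page in the window.
            if self.next_iova + PAGE_SIZE > MSI_WINDOW_BASE + MSI_WINDOW_SIZE:
                raise MemoryError("MSI window exhausted")
            self.pages[pa_page] = self.next_iova
            self.next_iova += PAGE_SIZE
        return self.pages[pa_page] + offset
```

In this model, two devices behind different physical doorbells end up with
two distinct pages inside the same window, and neither address needs to
correspond to any vITS in the guest's IPA space.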

For a dedicated "MSI passthrough" model where, in parallel to IOMMU
nesting, the abstraction is thinner and userspace is in on the game of
knowingly emulating a GIC ITS backed by a GIC ITS, I'd imagine it could
be pretty straightforward, at least conceptually. Userspace has
something like an IOAS_MAP_MSI(device, IOVA) to indicate where it's
placing a vITS to which it wants that device's MSIs to be able to go,
the kernel resolves the PA from the IRQ layer and maps it, job done. If
userspace wants to associate two devices with the same vITS when they
have different physical ITSes, either we split the IOAS into two HWPTs
to hold the different mappings, or we punt it back to userspace to
resolve at the IOAS level.
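
As a rough illustration of that flow, here is a toy Python model. Nothing
in it is an existing interface: IOAS_MAP_MSI is only the idea floated
above, and the Ioas class, map_msi(), the allow_split flag and the
doorbell_of table (standing in for "the kernel resolves the PA from the
IRQ layer") are hypothetical names made up for the sketch. Each HWPT is
reduced to a bare IOVA -> PA dictionary.

```python
class Ioas:
    """Toy IOAS holding one or more HWPTs (each just an IOVA -> PA dict)."""

    def __init__(self, doorbell_of):
        # device -> physical ITS doorbell PA; stand-in for the IRQ layer.
        self.doorbell_of = doorbell_of
        self.hwpts = []
        self.attached = {}  # device -> index of the HWPT it uses

    def map_msi(self, device, iova, allow_split=True):
        """Map device's physical doorbell at iova; return the HWPT index."""
        pa = self.doorbell_of[device]
        for idx, hwpt in enumerate(self.hwpts):
            # Reuse a HWPT if iova is free there or already maps the same PA.
            if hwpt.get(iova, pa) == pa:
                hwpt[iova] = pa
                self.attached[device] = idx
                return idx
        if self.hwpts and not allow_split:
            # Punt the conflict back to userspace to resolve at the IOAS level.
            raise ValueError("vITS IOVA already backed by a different ITS")
        # Otherwise split: a new HWPT holds the differing mapping.
        self.hwpts.append({iova: pa})
        self.attached[device] = len(self.hwpts) - 1
        return self.attached[device]
```

Two devices behind the same physical ITS share one HWPT here, while a
third device behind a different ITS mapped at the same vITS address either
lands in a second HWPT or gets an error, matching the two options above.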

Or I guess the really cheeky version is the IRQ layer exposes its own
thing for userspace to mmap the ITS, then it can call a literal IOAS_MAP
on that mapping... :D

Thanks,
Robin.