* RMRR device on non-Intel platform
@ 2023-04-20  6:52 Tian, Kevin
  2023-04-20 14:15 ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Tian, Kevin @ 2023-04-20  6:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm@vger.kernel.org, iommu@lists.linux.dev, robin.murphy@arm.com

Hi, Alex,

I happened to notice that we may have an inconsistent policy about RMRR
devices across different vendors.

Previously only Intel supported RMRR. Now both AMD and ARM have similar
mechanisms: AMD IVMD and ARM RMR.

RMRR identity mapping was considered unsafe (except for USB/GPU) for
device assignment:

/*
 * There are a couple cases where we need to restrict the functionality of
 * devices associated with RMRRs.  The first is when evaluating a device for
 * identity mapping because problems exist when devices are moved in and out
 * of domains and their respective RMRR information is lost.  This means that
 * a device with associated RMRRs will never be in a "passthrough" domain.
 * The second is use of the device through the IOMMU API.  This interface
 * expects to have full control of the IOVA space for the device.  We cannot
 * satisfy both the requirement that RMRR access is maintained and have an
 * unencumbered IOVA space.  We also have no ability to quiesce the device's
 * use of the RMRR space or even inform the IOMMU API user of the restriction.
 * We therefore prevent devices associated with an RMRR from participating in
 * the IOMMU API, which eliminates them from device assignment.
 *
 * In both cases, devices which have relaxable RMRRs are not concerned by this
 * restriction. See device_rmrr_is_relaxable comment.
 */
static bool device_is_rmrr_locked(struct device *dev)
{
	if (!device_has_rmrr(dev))
		return false;

	if (device_rmrr_is_relaxable(dev))
		return false;

	return true;
}
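
(For reference, device_rmrr_is_relaxable() essentially only exempts USB and
GPU devices - roughly the following, quoting from memory rather than
verbatim:)

static bool device_rmrr_is_relaxable(struct device *dev)
{
	struct pci_dev *pdev;

	if (!dev_is_pci(dev))
		return false;

	/* USB and GPU RMRRs are assumed unused once the OS driver takes over */
	pdev = to_pci_dev(dev);
	if (IS_USB_DEVICE(pdev) || IS_GPU_DEVICE(pdev))
		return true;

	return false;
}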

Then a non-relaxable RMRR device is rejected at attach time:

static int intel_iommu_attach_device(struct iommu_domain *domain,
                                     struct device *dev)
{
	struct device_domain_info *info = dev_iommu_priv_get(dev);
	int ret;

	if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
	    device_is_rmrr_locked(dev)) {
		dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
		return -EPERM;
	}
	...
}

But at a glance I didn't find the same check in the AMD/ARM drivers.

Did I overlook some architectural difference which makes RMRR devices safe
on those platforms, or is this a gap to be fixed?

Thanks
Kevin

* Re: RMRR device on non-Intel platform
  2023-04-20  6:52 RMRR device on non-Intel platform Tian, Kevin
@ 2023-04-20 14:15 ` Alex Williamson
  2023-04-20 14:19   ` Robin Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2023-04-20 14:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm@vger.kernel.org, iommu@lists.linux.dev, robin.murphy@arm.com

On Thu, 20 Apr 2023 06:52:01 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> Hi, Alex,
> 
> Happen to see that we may have inconsistent policy about RMRR devices cross
> different vendors.
> 
> Previously only Intel supports RMRR. Now both AMD/ARM have similar thing,
> AMD IVMD and ARM RMR.

Any similar requirement imposed by system firmware, i.e. that the operating
system must perpetually maintain a specific IOVA mapping for the device,
should impose restrictions similar to those we've implemented for VT-d
RMRR[1].  Thanks,

Alex

[1]https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf

> RMRR identity mapping was considered unsafe (except for USB/GPU) for
> device assignment:
> 
> /*
>  * There are a couple cases where we need to restrict the functionality of
>  * devices associated with RMRRs.  The first is when evaluating a device for
>  * identity mapping because problems exist when devices are moved in and out
>  * of domains and their respective RMRR information is lost.  This means that
>  * a device with associated RMRRs will never be in a "passthrough" domain.
>  * The second is use of the device through the IOMMU API.  This interface
>  * expects to have full control of the IOVA space for the device.  We cannot
>  * satisfy both the requirement that RMRR access is maintained and have an
>  * unencumbered IOVA space.  We also have no ability to quiesce the device's
>  * use of the RMRR space or even inform the IOMMU API user of the restriction.
>  * We therefore prevent devices associated with an RMRR from participating in
>  * the IOMMU API, which eliminates them from device assignment.
>  *
>  * In both cases, devices which have relaxable RMRRs are not concerned by this
>  * restriction. See device_rmrr_is_relaxable comment.
>  */
> static bool device_is_rmrr_locked(struct device *dev)
> {
> 	if (!device_has_rmrr(dev))
> 		return false;
> 
> 	if (device_rmrr_is_relaxable(dev))
> 		return false;
> 
> 	return true;
> }
> 
> Then non-relaxable RMRR device is rejected when doing attach:
> 
> static int intel_iommu_attach_device(struct iommu_domain *domain,
>                                      struct device *dev)
> {
> 	struct device_domain_info *info = dev_iommu_priv_get(dev);
> 	int ret;
> 
> 	if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
> 	    device_is_rmrr_locked(dev)) {
> 		dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
> 		return -EPERM;
> 	}
> 	...
> }
> 
> But I didn't find the same check in AMD/ARM driver at a glance.
> 
> Did I overlook some arch difference which makes RMRR device safe in
> those platforms or is it a gap to be fixed?
> 
> Thanks
> Kevin
> 



* Re: RMRR device on non-Intel platform
  2023-04-20 14:15 ` Alex Williamson
@ 2023-04-20 14:19   ` Robin Murphy
  2023-04-20 14:49     ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-20 14:19 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin; +Cc: kvm@vger.kernel.org, iommu@lists.linux.dev

On 2023-04-20 15:15, Alex Williamson wrote:
> On Thu, 20 Apr 2023 06:52:01 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
>> Hi, Alex,
>>
>> Happen to see that we may have inconsistent policy about RMRR devices cross
>> different vendors.
>>
>> Previously only Intel supports RMRR. Now both AMD/ARM have similar thing,
>> AMD IVMD and ARM RMR.
> 
> Any similar requirement imposed by system firmware that the operating
> system must perpetually maintain a specific IOVA mapping for the device
> should impose similar restrictions as we've implemented for VT-d
> RMMR[1].  Thanks,

Hmm, does that mean that vfio_iommu_resv_exclude() going to the trouble 
of punching out all the reserved region holes isn't really needed?

Robin.

> 
> Alex
> 
> [1]https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf
> 
>> RMRR identity mapping was considered unsafe (except for USB/GPU) for
>> device assignment:
>>
>> /*
>>   * There are a couple cases where we need to restrict the functionality of
>>   * devices associated with RMRRs.  The first is when evaluating a device for
>>   * identity mapping because problems exist when devices are moved in and out
>>   * of domains and their respective RMRR information is lost.  This means that
>>   * a device with associated RMRRs will never be in a "passthrough" domain.
>>   * The second is use of the device through the IOMMU API.  This interface
>>   * expects to have full control of the IOVA space for the device.  We cannot
>>   * satisfy both the requirement that RMRR access is maintained and have an
>>   * unencumbered IOVA space.  We also have no ability to quiesce the device's
>>   * use of the RMRR space or even inform the IOMMU API user of the restriction.
>>   * We therefore prevent devices associated with an RMRR from participating in
>>   * the IOMMU API, which eliminates them from device assignment.
>>   *
>>   * In both cases, devices which have relaxable RMRRs are not concerned by this
>>   * restriction. See device_rmrr_is_relaxable comment.
>>   */
>> static bool device_is_rmrr_locked(struct device *dev)
>> {
>> 	if (!device_has_rmrr(dev))
>> 		return false;
>>
>> 	if (device_rmrr_is_relaxable(dev))
>> 		return false;
>>
>> 	return true;
>> }
>>
>> Then non-relaxable RMRR device is rejected when doing attach:
>>
>> static int intel_iommu_attach_device(struct iommu_domain *domain,
>>                                       struct device *dev)
>> {
>> 	struct device_domain_info *info = dev_iommu_priv_get(dev);
>> 	int ret;
>>
>> 	if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
>> 	    device_is_rmrr_locked(dev)) {
>> 		dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
>> 		return -EPERM;
>> 	}
>> 	...
>> }
>>
>> But I didn't find the same check in AMD/ARM driver at a glance.
>>
>> Did I overlook some arch difference which makes RMRR device safe in
>> those platforms or is it a gap to be fixed?
>>
>> Thanks
>> Kevin
>>
> 


* Re: RMRR device on non-Intel platform
  2023-04-20 14:19   ` Robin Murphy
@ 2023-04-20 14:49     ` Alex Williamson
  2023-04-20 16:55       ` Robin Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2023-04-20 14:49 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Tian, Kevin, kvm@vger.kernel.org, iommu@lists.linux.dev

On Thu, 20 Apr 2023 15:19:55 +0100
Robin Murphy <robin.murphy@arm.com> wrote:

> On 2023-04-20 15:15, Alex Williamson wrote:
> > On Thu, 20 Apr 2023 06:52:01 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> >> Hi, Alex,
> >>
> >> Happen to see that we may have inconsistent policy about RMRR devices cross
> >> different vendors.
> >>
> >> Previously only Intel supports RMRR. Now both AMD/ARM have similar thing,
> >> AMD IVMD and ARM RMR.  
> > 
> > Any similar requirement imposed by system firmware that the operating
> > system must perpetually maintain a specific IOVA mapping for the device
> > should impose similar restrictions as we've implemented for VT-d
> > RMMR[1].  Thanks,  
> 
> Hmm, does that mean that vfio_iommu_resv_exclude() going to the trouble 
> of punching out all the reserved region holes isn't really needed?

While "Reserved Memory Region Reporting", might suggest that the ranges
are simply excluded, RMRR actually require that specific mappings are
maintained for ongoing, side-band activity, which is not compatible
with the ideas that userspace owns the IOVA address space for the
device or separation of host vs userspace control of the device.  Such
mappings suggest things like system health monitoring where the
influence of a user-owned device can easily extend to a system-wide
scope if the user it able to manipulate the device to deny that
interaction or report bad data.

If these ARM and AMD tables impose similar requirements, we should
really be restricting devices encumbered by such requirements from
userspace access as well.  Thanks,

Alex

> > [1]https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf
> >   
> >> RMRR identity mapping was considered unsafe (except for USB/GPU) for
> >> device assignment:
> >>
> >> /*
> >>   * There are a couple cases where we need to restrict the functionality of
> >>   * devices associated with RMRRs.  The first is when evaluating a device for
> >>   * identity mapping because problems exist when devices are moved in and out
> >>   * of domains and their respective RMRR information is lost.  This means that
> >>   * a device with associated RMRRs will never be in a "passthrough" domain.
> >>   * The second is use of the device through the IOMMU API.  This interface
> >>   * expects to have full control of the IOVA space for the device.  We cannot
> >>   * satisfy both the requirement that RMRR access is maintained and have an
> >>   * unencumbered IOVA space.  We also have no ability to quiesce the device's
> >>   * use of the RMRR space or even inform the IOMMU API user of the restriction.
> >>   * We therefore prevent devices associated with an RMRR from participating in
> >>   * the IOMMU API, which eliminates them from device assignment.
> >>   *
> >>   * In both cases, devices which have relaxable RMRRs are not concerned by this
> >>   * restriction. See device_rmrr_is_relaxable comment.
> >>   */
> >> static bool device_is_rmrr_locked(struct device *dev)
> >> {
> >> 	if (!device_has_rmrr(dev))
> >> 		return false;
> >>
> >> 	if (device_rmrr_is_relaxable(dev))
> >> 		return false;
> >>
> >> 	return true;
> >> }
> >>
> >> Then non-relaxable RMRR device is rejected when doing attach:
> >>
> >> static int intel_iommu_attach_device(struct iommu_domain *domain,
> >>                                       struct device *dev)
> >> {
> >> 	struct device_domain_info *info = dev_iommu_priv_get(dev);
> >> 	int ret;
> >>
> >> 	if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
> >> 	    device_is_rmrr_locked(dev)) {
> >> 		dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
> >> 		return -EPERM;
> >> 	}
> >> 	...
> >> }
> >>
> >> But I didn't find the same check in AMD/ARM driver at a glance.
> >>
> >> Did I overlook some arch difference which makes RMRR device safe in
> >> those platforms or is it a gap to be fixed?
> >>
> >> Thanks
> >> Kevin
> >>  
> >   
> 



* Re: RMRR device on non-Intel platform
  2023-04-20 14:49     ` Alex Williamson
@ 2023-04-20 16:55       ` Robin Murphy
  2023-04-20 21:49         ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-20 16:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Tian, Kevin, kvm@vger.kernel.org, iommu@lists.linux.dev

On 20/04/2023 3:49 pm, Alex Williamson wrote:
> On Thu, 20 Apr 2023 15:19:55 +0100
> Robin Murphy <robin.murphy@arm.com> wrote:
> 
>> On 2023-04-20 15:15, Alex Williamson wrote:
>>> On Thu, 20 Apr 2023 06:52:01 +0000
>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>    
>>>> Hi, Alex,
>>>>
>>>> Happen to see that we may have inconsistent policy about RMRR devices cross
>>>> different vendors.
>>>>
>>>> Previously only Intel supports RMRR. Now both AMD/ARM have similar thing,
>>>> AMD IVMD and ARM RMR.
>>>
>>> Any similar requirement imposed by system firmware that the operating
>>> system must perpetually maintain a specific IOVA mapping for the device
>>> should impose similar restrictions as we've implemented for VT-d
>>> RMMR[1].  Thanks,
>>
>> Hmm, does that mean that vfio_iommu_resv_exclude() going to the trouble
>> of punching out all the reserved region holes isn't really needed?
> 
> While "Reserved Memory Region Reporting", might suggest that the ranges
> are simply excluded, RMRR actually require that specific mappings are
> maintained for ongoing, side-band activity, which is not compatible
> with the ideas that userspace owns the IOVA address space for the
> device or separation of host vs userspace control of the device.  Such
> mappings suggest things like system health monitoring where the
> influence of a user-owned device can easily extend to a system-wide
> scope if the user it able to manipulate the device to deny that
> interaction or report bad data.
> 
> If these ARM and AMD tables impose similar requirements, we should
> really be restricting devices encumbered by such requirements from
> userspace access as well.  Thanks,

Indeed the primary use-case behind Arm's RMRs was certain devices like 
big complex RAID controllers which have already been started by UEFI 
firmware at boot and have live in-memory data which needs to be preserved.

However, my point was more that if it's a VFIO policy that any device 
with an IOMMU_RESV_DIRECT reservation is not suitable for userspace 
assignment, then vfio_iommu_type1_attach_group() already has everything 
it would need to robustly enforce that policy itself. It seems silly to 
me for it to expect the IOMMU driver to fail the attach, then go ahead 
and dutifully punch out direct regions if it happened not to. A couple 
of obvious trivial tweaks and there could be no dependency on driver 
behaviour at all, other than correctly reporting resv_regions to begin with.
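
For illustration, the kind of trivial tweak I have in mind would be
something along these lines in vfio_iommu_type1_attach_group(), after the
group reserved regions have been gathered (entirely untested sketch, and
the group_resv_regions/out_detach names are just how I remember the
existing function):

	struct iommu_resv_region *resv;

	/*
	 * Sketch only: rather than relying on the IOMMU driver to fail
	 * the attach and then dutifully punching out holes anyway,
	 * reject the group up front if any device in it needs a
	 * non-relaxable 1:1 mapping.
	 */
	list_for_each_entry(resv, &group_resv_regions, list) {
		if (resv->type == IOMMU_RESV_DIRECT) {
			ret = -EPERM;
			goto out_detach;
		}
	}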

If we think this policy deserves to go beyond VFIO and userspace, and 
it's reasonable that such devices should never be allowed to attach to 
any other kind of kernel-owned unmanaged domain either, then we can 
still trivially enforce that in core IOMMU code. I really see no need 
for it to be in drivers at all.

Thanks,
Robin.

> 
> Alex
> 
>>> [1]https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf
>>>    
>>>> RMRR identity mapping was considered unsafe (except for USB/GPU) for
>>>> device assignment:
>>>>
>>>> /*
>>>>    * There are a couple cases where we need to restrict the functionality of
>>>>    * devices associated with RMRRs.  The first is when evaluating a device for
>>>>    * identity mapping because problems exist when devices are moved in and out
>>>>    * of domains and their respective RMRR information is lost.  This means that
>>>>    * a device with associated RMRRs will never be in a "passthrough" domain.
>>>>    * The second is use of the device through the IOMMU API.  This interface
>>>>    * expects to have full control of the IOVA space for the device.  We cannot
>>>>    * satisfy both the requirement that RMRR access is maintained and have an
>>>>    * unencumbered IOVA space.  We also have no ability to quiesce the device's
>>>>    * use of the RMRR space or even inform the IOMMU API user of the restriction.
>>>>    * We therefore prevent devices associated with an RMRR from participating in
>>>>    * the IOMMU API, which eliminates them from device assignment.
>>>>    *
>>>>    * In both cases, devices which have relaxable RMRRs are not concerned by this
>>>>    * restriction. See device_rmrr_is_relaxable comment.
>>>>    */
>>>> static bool device_is_rmrr_locked(struct device *dev)
>>>> {
>>>> 	if (!device_has_rmrr(dev))
>>>> 		return false;
>>>>
>>>> 	if (device_rmrr_is_relaxable(dev))
>>>> 		return false;
>>>>
>>>> 	return true;
>>>> }
>>>>
>>>> Then non-relaxable RMRR device is rejected when doing attach:
>>>>
>>>> static int intel_iommu_attach_device(struct iommu_domain *domain,
>>>>                                        struct device *dev)
>>>> {
>>>> 	struct device_domain_info *info = dev_iommu_priv_get(dev);
>>>> 	int ret;
>>>>
>>>> 	if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
>>>> 	    device_is_rmrr_locked(dev)) {
>>>> 		dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
>>>> 		return -EPERM;
>>>> 	}
>>>> 	...
>>>> }
>>>>
>>>> But I didn't find the same check in AMD/ARM driver at a glance.
>>>>
>>>> Did I overlook some arch difference which makes RMRR device safe in
>>>> those platforms or is it a gap to be fixed?
>>>>
>>>> Thanks
>>>> Kevin
>>>>   
>>>    
>>
> 


* Re: RMRR device on non-Intel platform
  2023-04-20 16:55       ` Robin Murphy
@ 2023-04-20 21:49         ` Alex Williamson
  2023-04-21  4:10           ` Tian, Kevin
  2023-04-21 12:04           ` Jason Gunthorpe
  0 siblings, 2 replies; 28+ messages in thread
From: Alex Williamson @ 2023-04-20 21:49 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Tian, Kevin, kvm@vger.kernel.org, iommu@lists.linux.dev,
	Jason Gunthorpe

On Thu, 20 Apr 2023 17:55:22 +0100
Robin Murphy <robin.murphy@arm.com> wrote:

> On 20/04/2023 3:49 pm, Alex Williamson wrote:
> > On Thu, 20 Apr 2023 15:19:55 +0100
> > Robin Murphy <robin.murphy@arm.com> wrote:
> >   
> >> On 2023-04-20 15:15, Alex Williamson wrote:  
> >>> On Thu, 20 Apr 2023 06:52:01 +0000
> >>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >>>      
> >>>> Hi, Alex,
> >>>>
> >>>> Happen to see that we may have inconsistent policy about RMRR devices cross
> >>>> different vendors.
> >>>>
> >>>> Previously only Intel supports RMRR. Now both AMD/ARM have similar thing,
> >>>> AMD IVMD and ARM RMR.  
> >>>
> >>> Any similar requirement imposed by system firmware that the operating
> >>> system must perpetually maintain a specific IOVA mapping for the device
> >>> should impose similar restrictions as we've implemented for VT-d
> >>> RMMR[1].  Thanks,  
> >>
> >> Hmm, does that mean that vfio_iommu_resv_exclude() going to the trouble
> >> of punching out all the reserved region holes isn't really needed?  
> > 
> > While "Reserved Memory Region Reporting", might suggest that the ranges
> > are simply excluded, RMRR actually require that specific mappings are
> > maintained for ongoing, side-band activity, which is not compatible
> > with the ideas that userspace owns the IOVA address space for the
> > device or separation of host vs userspace control of the device.  Such
> > mappings suggest things like system health monitoring where the
> > influence of a user-owned device can easily extend to a system-wide
> > scope if the user it able to manipulate the device to deny that
> > interaction or report bad data.
> > 
> > If these ARM and AMD tables impose similar requirements, we should
> > really be restricting devices encumbered by such requirements from
> > userspace access as well.  Thanks,  
> 
> Indeed the primary use-case behind Arm's RMRs was certain devices like 
> big complex RAID controllers which have already been started by UEFI 
> firmware at boot and have live in-memory data which needs to be preserved.
> 
> However, my point was more that if it's a VFIO policy that any device 
> with an IOMMU_RESV_DIRECT reservation is not suitable for userspace 
> assignment, then vfio_iommu_type1_attach_group() already has everything 
> it would need to robustly enforce that policy itself. It seems silly to 
> me for it to expect the IOMMU driver to fail the attach, then go ahead 
> and dutifully punch out direct regions if it happened not to. A couple 
> of obvious trivial tweaks and there could be no dependency on driver 
> behaviour at all, other than correctly reporting resv_regions to begin with.
> 
> If we think this policy deserves to go beyond VFIO and userspace, and 
> it's reasonable that such devices should never be allowed to attach to 
> any other kind of kernel-owned unmanaged domain either, then we can 
> still trivially enforce that in core IOMMU code. I really see no need 
> for it to be in drivers at all.

It seems like a reasonable choice to me that any mixing of unmanaged
domains with IOMMU_RESV_DIRECT could be restricted globally.  Do we
even have infrastructure for a driver to honor the necessary mapping
requirements?

It looks pretty easy to do as well, something like this (untested):

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 10db680acaed..521f9a731ce9 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2012,11 +2012,29 @@ static void __iommu_group_set_core_domain(struct iommu_group *group)
 static int __iommu_attach_device(struct iommu_domain *domain,
 				 struct device *dev)
 {
-	int ret;
+	int ret = 0;
 
 	if (unlikely(domain->ops->attach_dev == NULL))
 		return -ENODEV;
 
+	if (domain->type == IOMMU_DOMAIN_UNMANAGED) {
+		struct iommu_resv_region *region;
+		LIST_HEAD(resv_regions);
+
+		iommu_get_resv_regions(dev, &resv_regions);
+		list_for_each_entry(region, &resv_regions, list) {
+			if (region->type == IOMMU_RESV_DIRECT) {
+				ret = -EPERM;
+				break;
+			}
+		}
+		iommu_put_resv_regions(dev, &resv_regions);
+		if (ret) {
+			dev_warn(dev, "Device may not be used with an unmanaged IOMMU domain due to reserved direct mapping requirement.\n");
+			return ret;
+		}
+	}
+
 	ret = domain->ops->attach_dev(domain, dev);
 	if (ret)
 		return ret;
 
Restrictions in either type1 or iommufd would be pretty trivial as well,
but centralizing it in core IOMMU code would do a better job of covering
all use cases.

This effectively makes the VT-d code further down the same path
redundant, so no new restrictions there.

What sort of fall-out should we expect on ARM or AMD?  This was a pretty
painful restriction to add on Intel.  Thanks,

Alex

> >>> [1]https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf
> >>>      
> >>>> RMRR identity mapping was considered unsafe (except for USB/GPU) for
> >>>> device assignment:
> >>>>
> >>>> /*
> >>>>    * There are a couple cases where we need to restrict the functionality of
> >>>>    * devices associated with RMRRs.  The first is when evaluating a device for
> >>>>    * identity mapping because problems exist when devices are moved in and out
> >>>>    * of domains and their respective RMRR information is lost.  This means that
> >>>>    * a device with associated RMRRs will never be in a "passthrough" domain.
> >>>>    * The second is use of the device through the IOMMU API.  This interface
> >>>>    * expects to have full control of the IOVA space for the device.  We cannot
> >>>>    * satisfy both the requirement that RMRR access is maintained and have an
> >>>>    * unencumbered IOVA space.  We also have no ability to quiesce the device's
> >>>>    * use of the RMRR space or even inform the IOMMU API user of the restriction.
> >>>>    * We therefore prevent devices associated with an RMRR from participating in
> >>>>    * the IOMMU API, which eliminates them from device assignment.
> >>>>    *
> >>>>    * In both cases, devices which have relaxable RMRRs are not concerned by this
> >>>>    * restriction. See device_rmrr_is_relaxable comment.
> >>>>    */
> >>>> static bool device_is_rmrr_locked(struct device *dev)
> >>>> {
> >>>> 	if (!device_has_rmrr(dev))
> >>>> 		return false;
> >>>>
> >>>> 	if (device_rmrr_is_relaxable(dev))
> >>>> 		return false;
> >>>>
> >>>> 	return true;
> >>>> }
> >>>>
> >>>> Then non-relaxable RMRR device is rejected when doing attach:
> >>>>
> >>>> static int intel_iommu_attach_device(struct iommu_domain *domain,
> >>>>                                        struct device *dev)
> >>>> {
> >>>> 	struct device_domain_info *info = dev_iommu_priv_get(dev);
> >>>> 	int ret;
> >>>>
> >>>> 	if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
> >>>> 	    device_is_rmrr_locked(dev)) {
> >>>> 		dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
> >>>> 		return -EPERM;
> >>>> 	}
> >>>> 	...
> >>>> }
> >>>>
> >>>> But I didn't find the same check in AMD/ARM driver at a glance.
> >>>>
> >>>> Did I overlook some arch difference which makes RMRR device safe in
> >>>> those platforms or is it a gap to be fixed?
> >>>>
> >>>> Thanks
> >>>> Kevin
> >>>>     
> >>>      
> >>  
> >   
> 



* RE: RMRR device on non-Intel platform
  2023-04-20 21:49         ` Alex Williamson
@ 2023-04-21  4:10           ` Tian, Kevin
  2023-04-21 11:33             ` Jason Gunthorpe
  2023-04-21 11:34             ` Robin Murphy
  2023-04-21 12:04           ` Jason Gunthorpe
  1 sibling, 2 replies; 28+ messages in thread
From: Tian, Kevin @ 2023-04-21  4:10 UTC (permalink / raw)
  To: Alex Williamson, Robin Murphy
  Cc: kvm@vger.kernel.org, iommu@lists.linux.dev, Jason Gunthorpe,
	Lu, Baolu

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 21, 2023 5:50 AM
> 
> On Thu, 20 Apr 2023 17:55:22 +0100
> Robin Murphy <robin.murphy@arm.com> wrote:
> 
> > On 20/04/2023 3:49 pm, Alex Williamson wrote:
> > > On Thu, 20 Apr 2023 15:19:55 +0100
> > > Robin Murphy <robin.murphy@arm.com> wrote:
> > >
> > >> On 2023-04-20 15:15, Alex Williamson wrote:
> > >>> On Thu, 20 Apr 2023 06:52:01 +0000
> > >>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >>>
> > >>>> Hi, Alex,
> > >>>>
> > >>>> Happen to see that we may have inconsistent policy about RMRR
> devices cross
> > >>>> different vendors.
> > >>>>
> > >>>> Previously only Intel supports RMRR. Now both AMD/ARM have
> similar thing,
> > >>>> AMD IVMD and ARM RMR.
> > >>>
> > >>> Any similar requirement imposed by system firmware that the
> operating
> > >>> system must perpetually maintain a specific IOVA mapping for the
> device
> > >>> should impose similar restrictions as we've implemented for VT-d
> > >>> RMMR[1].  Thanks,
> > >>
> > >> Hmm, does that mean that vfio_iommu_resv_exclude() going to the
> trouble
> > >> of punching out all the reserved region holes isn't really needed?
> > >
> > > While "Reserved Memory Region Reporting", might suggest that the
> ranges
> > > are simply excluded, RMRR actually require that specific mappings are
> > > maintained for ongoing, side-band activity, which is not compatible
> > > with the ideas that userspace owns the IOVA address space for the
> > > device or separation of host vs userspace control of the device.  Such
> > > mappings suggest things like system health monitoring where the
> > > influence of a user-owned device can easily extend to a system-wide
> > > scope if the user it able to manipulate the device to deny that
> > > interaction or report bad data.
> > >
> > > If these ARM and AMD tables impose similar requirements, we should
> > > really be restricting devices encumbered by such requirements from
> > > userspace access as well.  Thanks,
> >
> > Indeed the primary use-case behind Arm's RMRs was certain devices like
> > big complex RAID controllers which have already been started by UEFI
> > firmware at boot and have live in-memory data which needs to be
> preserved.
> >
> > However, my point was more that if it's a VFIO policy that any device
> > with an IOMMU_RESV_DIRECT reservation is not suitable for userspace
> > assignment, then vfio_iommu_type1_attach_group() already has
> everything
> > it would need to robustly enforce that policy itself. It seems silly to
> > me for it to expect the IOMMU driver to fail the attach, then go ahead
> > and dutifully punch out direct regions if it happened not to. A couple
> > of obvious trivial tweaks and there could be no dependency on driver
> > behaviour at all, other than correctly reporting resv_regions to begin with.
> >
> > If we think this policy deserves to go beyond VFIO and userspace, and
> > it's reasonable that such devices should never be allowed to attach to
> > any other kind of kernel-owned unmanaged domain either, then we can
> > still trivially enforce that in core IOMMU code. I really see no need
> > for it to be in drivers at all.
> 
> It seems like a reasonable choice to me that any mixing of unmanaged
> domains with IOMMU_RESV_DIRECT could be restricted globally.  Do we
> even have infrastructure for a driver to honor the necessary mapping
> requirements?
> 
> It looks pretty easy to do as well, something like this (untested):
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 10db680acaed..521f9a731ce9 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2012,11 +2012,29 @@ static void
> __iommu_group_set_core_domain(struct iommu_group *group)
>  static int __iommu_attach_device(struct iommu_domain *domain,
>  				 struct device *dev)
>  {
> -	int ret;
> +	int ret = 0;
> 
>  	if (unlikely(domain->ops->attach_dev == NULL))
>  		return -ENODEV;
> 
> +	if (domain->type == IOMMU_DOMAIN_UNMANAGED) {
> +		struct iommu_resv_region *region;
> +		LIST_HEAD(resv_regions);
> +
> +		iommu_get_resv_regions(dev, &resv_regions);
> +		list_for_each_entry(region, &resv_regions, list) {
> +			if (region->type == IOMMU_RESV_DIRECT) {
> +				ret = -EPERM;
> +				break;
> +			}
> +		}
> +		iommu_put_resv_regions(dev, &resv_regions);
> +		if (ret) {
> +			dev_warn(dev, "Device may not be used with an
> unmanaged IOMMU domain due to reserved direct mapping
> requirement.\n");
> +			return ret;
> +		}
> +	}
> +
>  	ret = domain->ops->attach_dev(domain, dev);
>  	if (ret)
>  		return ret;
> 
> Restrictions in either type1 or iommufd would be pretty trivial as well,
> but centralizing it in core IOMMU code would do a better job of covering
> all use cases.
> 
> This effectively makes the VT-d code further down the same path
> redundant, so no new restrictions there.
> 
> What sort of fall-out should we expect on ARM or AMD?  This was a pretty
> painful restriction to add on Intel.  Thanks,
> 

What about device_rmrr_is_relaxable()? Leave it in the specific driver or
consolidate it into something generic?

intel-iommu sets RELAXABLE for USB and GPU devices, assuming their RMRR
regions are not used post-boot.

Presumably the same policy can be applied to AMD too.

ARM RMR reports an explicit flag (ACPI_IORT_RMR_REMAP_PERMITTED) to
mark whether an RMR region is relaxable. I'm not sure whether USB/GPU
is already covered by it.
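
Looking at drivers/acpi/arm64/iort.c, it seems that flag is already
translated into the generic region types when the RMRs are parsed,
something like below if I read it correctly (quoting from memory, please
double-check):

	/* paraphrased from the IORT RMR parsing code */
	if (rmr->flags & ACPI_IORT_RMR_REMAP_PERMITTED)
		type = IOMMU_RESV_DIRECT_RELAXABLE;
	else
		type = IOMMU_RESV_DIRECT;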


* Re: RMRR device on non-Intel platform
  2023-04-21  4:10           ` Tian, Kevin
@ 2023-04-21 11:33             ` Jason Gunthorpe
  2023-04-21 11:34             ` Robin Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-21 11:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Robin Murphy, kvm@vger.kernel.org,
	iommu@lists.linux.dev, Lu, Baolu

On Fri, Apr 21, 2023 at 04:10:41AM +0000, Tian, Kevin wrote:

> What about device_rmrr_is_relaxable()? Leave it in specific driver or
> consolidate to be generic?

Why can't this convert to IOMMU_RESV_DIRECT_RELAXABLE and let the
core code handle it?

Jason


* Re: RMRR device on non-Intel platform
  2023-04-21  4:10           ` Tian, Kevin
  2023-04-21 11:33             ` Jason Gunthorpe
@ 2023-04-21 11:34             ` Robin Murphy
  2023-04-23  8:23               ` Tian, Kevin
  1 sibling, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-21 11:34 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: kvm@vger.kernel.org, iommu@lists.linux.dev, Jason Gunthorpe,
	Lu, Baolu

On 2023-04-21 05:10, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Friday, April 21, 2023 5:50 AM

[ 2-in-1 reply since I'm lazy... ]

>> On Thu, 20 Apr 2023 17:55:22 +0100
>> Robin Murphy <robin.murphy@arm.com> wrote:
>>
>>> On 20/04/2023 3:49 pm, Alex Williamson wrote:
>>>> On Thu, 20 Apr 2023 15:19:55 +0100
>>>> Robin Murphy <robin.murphy@arm.com> wrote:
>>>>
>>>>> On 2023-04-20 15:15, Alex Williamson wrote:
>>>>>> On Thu, 20 Apr 2023 06:52:01 +0000
>>>>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>>>>
>>>>>>> Hi, Alex,
>>>>>>>
>>>>>>> Happen to see that we may have inconsistent policy about RMRR
>> devices cross
>>>>>>> different vendors.
>>>>>>>
>>>>>>> Previously only Intel supports RMRR. Now both AMD/ARM have
>> similar thing,
>>>>>>> AMD IVMD and ARM RMR.
>>>>>>
>>>>>> Any similar requirement imposed by system firmware that the
>> operating
>>>>>> system must perpetually maintain a specific IOVA mapping for the
>> device
>>>>>> should impose similar restrictions as we've implemented for VT-d
>>>>>> RMMR[1].  Thanks,
>>>>>
>>>>> Hmm, does that mean that vfio_iommu_resv_exclude() going to the
>> trouble
>>>>> of punching out all the reserved region holes isn't really needed?
>>>>
>>>> While "Reserved Memory Region Reporting", might suggest that the
>> ranges
>>>> are simply excluded, RMRR actually require that specific mappings are
>>>> maintained for ongoing, side-band activity, which is not compatible
>>>> with the ideas that userspace owns the IOVA address space for the
>>>> device or separation of host vs userspace control of the device.  Such
>>>> mappings suggest things like system health monitoring where the
>>>> influence of a user-owned device can easily extend to a system-wide
>>>> scope if the user it able to manipulate the device to deny that
>>>> interaction or report bad data.
>>>>
>>>> If these ARM and AMD tables impose similar requirements, we should
>>>> really be restricting devices encumbered by such requirements from
>>>> userspace access as well.  Thanks,
>>>
>>> Indeed the primary use-case behind Arm's RMRs was certain devices like
>>> big complex RAID controllers which have already been started by UEFI
>>> firmware at boot and have live in-memory data which needs to be
>> preserved.
>>>
>>> However, my point was more that if it's a VFIO policy that any device
>>> with an IOMMU_RESV_DIRECT reservation is not suitable for userspace
>>> assignment, then vfio_iommu_type1_attach_group() already has
>> everything
>>> it would need to robustly enforce that policy itself. It seems silly to
>>> me for it to expect the IOMMU driver to fail the attach, then go ahead
>>> and dutifully punch out direct regions if it happened not to. A couple
>>> of obvious trivial tweaks and there could be no dependency on driver
>>> behaviour at all, other than correctly reporting resv_regions to begin with.
>>>
>>> If we think this policy deserves to go beyond VFIO and userspace, and
>>> it's reasonable that such devices should never be allowed to attach to
>>> any other kind of kernel-owned unmanaged domain either, then we can
>>> still trivially enforce that in core IOMMU code. I really see no need
>>> for it to be in drivers at all.
>>
>> It seems like a reasonable choice to me that any mixing of unmanaged
>> domains with IOMMU_RESV_DIRECT could be restricted globally.  Do we
>> even have infrastructure for a driver to honor the necessary mapping
>> requirements?

In principle a driver that knew what it was doing could call 
iommu_group_get_resv_regions() and map them into its own domain before 
attaching. Since nothing upstream *is* actually doing that, though, I'd 
have no real objection to restricting this at the API level as the 
safest approach, then coming back to consider exceptional cases if and 
when anybody does really have one.
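
i.e. something vaguely like this (entirely untested sketch; error paths and
freeing of the region list are elided, and I'm assuming the current
iommu_map() signature with the gfp argument):

static int attach_preserving_direct_regions(struct iommu_domain *domain,
					    struct iommu_group *group)
{
	struct iommu_resv_region *region;
	LIST_HEAD(resv_regions);
	int ret;

	ret = iommu_get_group_resv_regions(group, &resv_regions);
	if (ret)
		return ret;

	list_for_each_entry(region, &resv_regions, list) {
		if (region->type != IOMMU_RESV_DIRECT)
			continue;

		/* Replicate the firmware-mandated 1:1 mapping */
		ret = iommu_map(domain, region->start, region->start,
				region->length, region->prot, GFP_KERNEL);
		if (ret)
			return ret;
	}

	return iommu_attach_group(domain, group);
}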

>> It looks pretty easy to do as well, something like this (untested):
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 10db680acaed..521f9a731ce9 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2012,11 +2012,29 @@ static void
>> __iommu_group_set_core_domain(struct iommu_group *group)
>>   static int __iommu_attach_device(struct iommu_domain *domain,
>>   				 struct device *dev)
>>   {
>> -	int ret;
>> +	int ret = 0;
>>
>>   	if (unlikely(domain->ops->attach_dev == NULL))
>>   		return -ENODEV;
>>
>> +	if (domain->type == IOMMU_DOMAIN_UNMANAGED) {
>> +		struct iommu_resv_region *region;
>> +		LIST_HEAD(resv_regions);
>> +
>> +		iommu_get_resv_regions(dev, &resv_regions);
>> +		list_for_each_entry(region, &resv_regions, list) {
>> +			if (region->type == IOMMU_RESV_DIRECT) {
>> +				ret = -EPERM;
>> +				break;
>> +			}
>> +		}
>> +		iommu_put_resv_regions(dev, &resv_regions);
>> +		if (ret) {
>> +			dev_warn(dev, "Device may not be used with an
>> unmanaged IOMMU domain due to reserved direct mapping
>> requirement.\n");
>> +			return ret;
>> +		}
>> +	}
>> +
>>   	ret = domain->ops->attach_dev(domain, dev);
>>   	if (ret)
>>   		return ret;
>>
>> Restrictions in either type1 or iommufd would be pretty trivial as well,
>> but centralizing it in core IOMMU code would do a better job of covering
>> all use cases.

Indeed, I was thinking that either way, since we usually process the 
reserved regions at probe time anyway, it only needs a slight shuffle to 
cache this as a flag that both core code and/or users can easily refer 
to (so VFIO and IOMMUFD could still identify unsuitable devices long 
before having to go as far as the attach failing). Something like:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 265d34e9cbcb..53d3daef7a7b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -935,19 +935,20 @@ static int 
iommu_create_device_direct_mappings(struct iommu_group *group,
  {
  	struct iommu_domain *domain = group->default_domain;
  	struct iommu_resv_region *entry;
-	struct list_head mappings;
+	LIST_HEAD(mappings);
  	unsigned long pg_size;
  	int ret = 0;

+	iommu_get_resv_regions(dev, &mappings);
+	list_for_each_entry(entry, &mappings, list)
+		if (entry->type == IOMMU_RESV_DIRECT)
+			dev->iommu->has_resv_direct = true;
+
  	if (!domain || !iommu_is_dma_domain(domain))
-		return 0;
+		goto out;

  	BUG_ON(!domain->pgsize_bitmap);
-
  	pg_size = 1UL << __ffs(domain->pgsize_bitmap);
-	INIT_LIST_HEAD(&mappings);
-
-	iommu_get_resv_regions(dev, &mappings);

  	/* We need to consider overlapping regions for different devices */
  	list_for_each_entry(entry, &mappings, list) {

>>
>> This effectively makes the VT-d code further down the same path
>> redundant, so no new restrictions there.
>>
>> What sort of fall-out should we expect on ARM or AMD?  This was a pretty
>> painful restriction to add on Intel.  Thanks,

I can't speak for AMD, but I wouldn't imagine any noticeable fallout for 
Arm. RMRs are still relatively new, and the devices that I've heard 
about seem highly unlikely to ever want to be assigned to userspace 
directly (they may well provide virtual functions or sub-devices that 
absolutely *would* be used with VFIO, but the "host" device owning the 
RMRs would not be).

>>
> 
> What about device_rmrr_is_relaxable()? Leave it in specific driver or
> consolidate to be generic?

That's where the generic information already comes from, since it 
provides the region type in intel_iommu_get_resv_regions() - the 
existing resv_region abstraction *is* the consolidation. AFAICS it would 
be needlessly painful to start trying to directly process bits of DMAR 
or IVRS outside their respective drivers. It's only 
device_is_rmrr_locked() which would become redundant here.
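
i.e. roughly this bit of intel_iommu_get_resv_regions(), paraphrased from
memory:

		length = rmrr->end_address - rmrr->base_address + 1;

		/* relaxable vs. non-relaxable is encoded in the region type */
		type = device_rmrr_is_relaxable(device) ?
			IOMMU_RESV_DIRECT_RELAXABLE : IOMMU_RESV_DIRECT;

		resv = iommu_alloc_resv_region(rmrr->base_address, length,
					       prot, type, GFP_ATOMIC);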

> intel-iommu sets RELAXABLE for USB and GPU assuming their RMRR region
> is not used post boot.
> 
> Presumably same policy can be applied to AMD too.
> 
> ARM RMR reports an explicit flag (ACPI_IORT_RMR_REMAP_PERMITTED) to
> mark out whether a RMR region is relaxable. I'm not sure whether USB/GPU
> is already covered.

Arm systems have no legacy of needing to offer compatibility with a PC 
BIOS from the 1990s, so PS/2 or VGA emulation is a non-issue for us ;)

Thanks,
Robin.


* Re: RMRR device on non-Intel platform
  2023-04-20 21:49         ` Alex Williamson
  2023-04-21  4:10           ` Tian, Kevin
@ 2023-04-21 12:04           ` Jason Gunthorpe
  2023-04-21 12:29             ` Robin Murphy
                               ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-21 12:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Robin Murphy, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Thu, Apr 20, 2023 at 03:49:33PM -0600, Alex Williamson wrote:
> > If we think this policy deserves to go beyond VFIO and userspace, and 
> > it's reasonable that such devices should never be allowed to attach to 
> > any other kind of kernel-owned unmanaged domain either, then we can 
> > still trivially enforce that in core IOMMU code. I really see no need 
> > for it to be in drivers at all.
> 
> It seems like a reasonable choice to me that any mixing of unmanaged
> domains with IOMMU_RESV_DIRECT could be restricted globally.  Do we
> even have infrastructure for a driver to honor the necessary mapping
> requirements?

What we discussed about the definition of IOMMU_RESV_DIRECT was that
an identity map needs to be present at all times. This is what is
documented at least:

	/* Memory regions which must be mapped 1:1 at all times */
	IOMMU_RESV_DIRECT,

Notably, this also means the device can never be attached to a
blocking domain. I would also think that drivers asking for this
should ideally implement the "atomic replace" we discussed already to
change between identity and unmanaged without disturbing the FW doing
DMA to these addresses.

I was looking at this when we talked about it earlier and we don't
follow that guideline today for vfio/iommufd.

Since taking ownership immediately switches to a blocking domain,
restricting the use of blocking domains also restricts ownership, and
thus vfio and iommufd will be prevented.

Other places using unmanaged domains must follow iommu_get_resv_regions()
and set up the IOMMU_RESV_DIRECT mappings themselves - we should not
restrict them in the core code.

It also slightly changes my prior remarks to Robin about error domain
attach handling: since blocking domains are not available for these
devices, the "error state" for such a device should be the identity
domain to preserve FW access.

Also, we have a functional gap: ARM would really like an
IOMMU_RESV_DIRECT_RELAXABLE_SAFE, which would have iommufd/vfio install
the 1:1 map and allow the device to be used. This is necessary for the
GIC ITS page hack to support MSI, since we should enable VFIO inside a
VM. It is always safe for hostile VFIO userspace to DMA to the ITS
page.

So, after my domain error handling series, the core fix is pretty
simple and universal. We should also remove all the redundant code in
drivers - drivers should report the regions each device needs
properly and leave enforcement to the core code. Lu/Kevin, do you want
to take this?

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 19f8d28ff1323c..c15eb5e0ba761d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1059,6 +1059,9 @@ static int iommu_create_device_direct_mappings(struct iommu_domain *domain,
 		    entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
 			continue;
 
+		if (entry->type == IOMMU_RESV_DIRECT)
+			dev->iommu->requires_direct = 1;
+
 		for (addr = start; addr <= end; addr += pg_size) {
 			phys_addr_t phys_addr;
 
@@ -2210,6 +2213,22 @@ static int __iommu_device_set_domain(struct iommu_group *group,
 {
 	int ret;
 
+	/*
+	 * If the driver has requested IOMMU_RESV_DIRECT then we cannot allow
+	 * the blocking domain to be attached as it does not contain the
+	 * required 1:1 mapping. This test effectively excludes the device from
+	 * being used with iommu_group_claim_dma_owner() which will block vfio
+	 * and iommufd as well.
+	 */
+	if (dev->iommu->requires_direct &&
+	    (new_domain->type == IOMMU_DOMAIN_BLOCKED ||
+	     new_domain == group->blocking_domain)) {
+		dev_warn(
+			dev,
+			"Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.");
+		return -EINVAL;
+	}
+
 	if (dev->iommu->attach_deferred) {
 		if (new_domain == group->default_domain)
 			return 0;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3ad14437487638..7729a07923faa6 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -407,6 +407,7 @@ struct iommu_fault_param {
  * @priv:	 IOMMU Driver private data
  * @max_pasids:  number of PASIDs this device can consume
  * @attach_deferred: the dma domain attachment is deferred
+ * @requires_direct: The driver requested IOMMU_RESV_DIRECT
  *
  * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
  *	struct iommu_group	*iommu_group;
@@ -420,6 +421,7 @@ struct dev_iommu {
 	void				*priv;
 	u32				max_pasids;
 	u32				attach_deferred:1;
+	u32				requires_direct:1;
 };
 
 int iommu_device_register(struct iommu_device *iommu,


* Re: RMRR device on non-Intel platform
  2023-04-21 12:04           ` Jason Gunthorpe
@ 2023-04-21 12:29             ` Robin Murphy
  2023-04-21 12:45               ` Jason Gunthorpe
  2023-04-21 13:21             ` Baolu Lu
  2023-04-23  8:24             ` Tian, Kevin
  2 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-21 12:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Tian, Kevin, kvm@vger.kernel.org, iommu@lists.linux.dev

On 2023-04-21 13:04, Jason Gunthorpe wrote:
> On Thu, Apr 20, 2023 at 03:49:33PM -0600, Alex Williamson wrote:
>>> If we think this policy deserves to go beyond VFIO and userspace, and
>>> it's reasonable that such devices should never be allowed to attach to
>>> any other kind of kernel-owned unmanaged domain either, then we can
>>> still trivially enforce that in core IOMMU code. I really see no need
>>> for it to be in drivers at all.
>>
>> It seems like a reasonable choice to me that any mixing of unmanaged
>> domains with IOMMU_RESV_DIRECT could be restricted globally.  Do we
>> even have infrastructure for a driver to honor the necessary mapping
>> requirements?
> 
> What we discussed about the definition of IOMMU_RESV_DIRECT was that
> an identity map needs to be present at all times. This is what is
> documented at least:
> 
> 	/* Memory regions which must be mapped 1:1 at all times */
> 	IOMMU_RESV_DIRECT,
> 
> Notably, this also means the device can never be attached to a
> blocking domain. I would also think that drivers asking for this
> should ideally implement the "atomic replace" we discussed already to
> change between identity and unmanaged without disturbing the FW doing
> DMA to these addresses..
> 
> I was looking at this when we talked about it earlier and we don't
> follow that guideline today for vfio/iommufd.
> 
> Since taking ownership immediately switches to a blocking domain
> restricting the use of blocking also restricts ownership thus vfio and
> iommufd will be prevented.
> 
> Other places using unmanaged domains must follow the
> iommu_get_resv_regions() and setup IOMMU_RESV_DIRECT - we should not
> restrict them in the core code.
> 
> It also slightly changes my prior remarks to Robin about error domain
> attach handling, since blocking domains are not available for these
> devices the "error state" for such a device should be the identity
> domain to preserve FW access.
> 
> Also, we have a functional gap, ARM would really like a
> IOMMU_RESV_DIRECT_RELAXABLE_SAFE which would have iommufd/vfio install
> the 1:1 map and allow the device to be used. This is necessary for the
> GIC ITS page hack to support MSI since we should enable VFIO inside a
> VM. It is always safe for hostile VFIO userspace to DMA to the ITS
> page.

Can you clarify why something other than IOMMU_RESV_SW_MSI would be 
needed? MSI regions already represent "safe" direct mappings, either as 
an inherent property of the hardware, or with an actual mapping 
maintained by software. Also RELAXABLE is meant to imply that it is only 
needed until a driver takes over the device, which at face value doesn't 
make much sense for interrupts.

> So, after my domain error handling series, the core fix is pretty
> simple and universal. We should also remove all the redundant code in
> drivers - drivers should report the regions each devices needs
> properly and leave enforcement to the core code.. Lu/Kevin do you want
> to take this?
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 19f8d28ff1323c..c15eb5e0ba761d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1059,6 +1059,9 @@ static int iommu_create_device_direct_mappings(struct iommu_domain *domain,
>   		    entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
>   			continue;
>   
> +		if (entry->type == IOMMU_RESV_DIRECT)
> +			dev->iommu->requires_direct = 1;

We'll still need to set this when the default domain type is identity 
too - see the diff I posted (the other parts below I merely implied).

Thanks,
Robin.

> +
>   		for (addr = start; addr <= end; addr += pg_size) {
>   			phys_addr_t phys_addr;
>   
> @@ -2210,6 +2213,22 @@ static int __iommu_device_set_domain(struct iommu_group *group,
>   {
>   	int ret;
>   
> +	/*
> +	 * If the driver has requested IOMMU_RESV_DIRECT then we cannot allow
> +	 * the blocking domain to be attached as it does not contain the
> +	 * required 1:1 mapping. This test effectively exclusive the device from
> +	 * being used with iommu_group_claim_dma_owner() which will block vfio
> +	 * and iommufd as well.
> +	 */
> +	if (dev->iommu->requires_direct &&
> +	    (new_domain->type == IOMMU_DOMAIN_BLOCKED ||
> +	     new_domain == group->blocking_domain)) {
> +		dev_warn(
> +			dev,
> +			"Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.");
> +		return -EINVAL;
> +	}
> +
>   	if (dev->iommu->attach_deferred) {
>   		if (new_domain == group->default_domain)
>   			return 0;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 3ad14437487638..7729a07923faa6 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -407,6 +407,7 @@ struct iommu_fault_param {
>    * @priv:	 IOMMU Driver private data
>    * @max_pasids:  number of PASIDs this device can consume
>    * @attach_deferred: the dma domain attachment is deferred
> + * @requires_direct: The driver requested IOMMU_RESV_DIRECT
>    *
>    * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
>    *	struct iommu_group	*iommu_group;
> @@ -420,6 +421,7 @@ struct dev_iommu {
>   	void				*priv;
>   	u32				max_pasids;
>   	u32				attach_deferred:1;
> +	u32				requires_direct:1;
>   };
>   
>   int iommu_device_register(struct iommu_device *iommu,


* Re: RMRR device on non-Intel platform
  2023-04-21 12:29             ` Robin Murphy
@ 2023-04-21 12:45               ` Jason Gunthorpe
  2023-04-21 17:22                 ` Robin Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-21 12:45 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Fri, Apr 21, 2023 at 01:29:46PM +0100, Robin Murphy wrote:

> Can you clarify why something other than IOMMU_RESV_SW_MSI would be
> needed?

We need iommufd to set up a 1:1 map for the reserved space.

So, of the reserved spaces we have these:

	/* Memory regions which must be mapped 1:1 at all times */
	IOMMU_RESV_DIRECT,

           Block iommufd

	/*
	 * Memory regions which are advertised to be 1:1 but are
	 * commonly considered relaxable in some conditions,
	 * for instance in device assignment use case (USB, Graphics)
	 */
	IOMMU_RESV_DIRECT_RELAXABLE,

           iommufd ignores this one

	/* Arbitrary "never map this or give it to a device" address ranges */
	IOMMU_RESV_RESERVED,

	   iommufd prevents using this IOVA range

	/* Hardware MSI region (untranslated) */
	IOMMU_RESV_MSI,

	   iommufd treats this the same as IOMMU_RESV_RESERVED

	/* Software-managed MSI translation window */
	IOMMU_RESV_SW_MSI,

	   iommufd treats this the same as IOMMU_RESV_RESERVED, also
	   it passes the start to iommu_get_msi_cookie() which
	   eventually maps something, but not 1:1.

I don't think it is a compatible change for IOMMU_RESV_SW_MSI to also
mean 1:1 map?

On baremetal we have no idea what the platform put under that
hardcoded address?

On VM we don't use the iommu_get_msi_cookie() flow because the GIC in
the VM pretends it doesn't have an ITS page?  (did I get that right?)

> MSI regions already represent "safe" direct mappings, either as an inherent
> property of the hardware, or with an actual mapping maintained by software.
> Also RELAXABLE is meant to imply that it is only needed until a driver takes
> over the device, which at face value doesn't make much sense for interrupts.

I used "relxable" to suggest it is safe for userspace.

> We'll still need to set this when the default domain type is identity too -
> see the diff I posted (the other parts below I merely implied).

Right, I missed that!

I suggest something like this to avoid the double loop:

@@ -1037,9 +1037,6 @@ static int iommu_create_device_direct_mappings(struct iom>
        unsigned long pg_size;
        int ret = 0;
 
-       if (!iommu_is_dma_domain(domain))
-               return 0;
-
        BUG_ON(!domain->pgsize_bitmap);
 
        pg_size = 1UL << __ffs(domain->pgsize_bitmap);
@@ -1052,13 +1049,18 @@ static int iommu_create_device_direct_mappings(struct i>
                dma_addr_t start, end, addr;
                size_t map_size = 0;
 
-               start = ALIGN(entry->start, pg_size);
-               end   = ALIGN(entry->start + entry->length, pg_size);
-
                if (entry->type != IOMMU_RESV_DIRECT &&
                    entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
                        continue;
 
+               if (entry->type == IOMMU_RESV_DIRECT)
+                       dev->iommu->requires_direct = 1;
+
+               if (!iommu_is_dma_domain(domain))
+                       continue;
+
+               start = ALIGN(entry->start, pg_size);
+               end   = ALIGN(entry->start + entry->length, pg_size);
                for (addr = start; addr <= end; addr += pg_size) {
                        phys_addr_t phys_addr;

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-21 12:04           ` Jason Gunthorpe
  2023-04-21 12:29             ` Robin Murphy
@ 2023-04-21 13:21             ` Baolu Lu
  2023-04-21 13:33               ` Jason Gunthorpe
  2023-04-23  8:24             ` Tian, Kevin
  2 siblings, 1 reply; 28+ messages in thread
From: Baolu Lu @ 2023-04-21 13:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: baolu.lu, Robin Murphy, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On 2023/4/21 20:04, Jason Gunthorpe wrote:
> @@ -2210,6 +2213,22 @@ static int __iommu_device_set_domain(struct iommu_group *group,
>   {
>   	int ret;
>   
> +	/*
> +	 * If the driver has requested IOMMU_RESV_DIRECT then we cannot allow
> +	 * the blocking domain to be attached as it does not contain the
> +	 * required 1:1 mapping. This test effectively excludes the device from
> +	 * being used with iommu_group_claim_dma_owner() which will block vfio
> +	 * and iommufd as well.
> +	 */
> +	if (dev->iommu->requires_direct &&
> +	    (new_domain->type == IOMMU_DOMAIN_BLOCKED ||
> +	     new_domain == group->blocking_domain)) {
> +		dev_warn(
> +			dev,
> +			"Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.");
> +		return -EINVAL;
> +	}
> +
>   	if (dev->iommu->attach_deferred) {
>   		if (new_domain == group->default_domain)
>   			return 0;

How about enforcing this in iommu_group_claim_dma_owner() and changing the
iommu drivers to use "atomic replacement" instead of a blocking
translation transition when switching to a new domain? Assuming that the
kernel drivers should always use the default domain, or handle the
IOMMU_RESV_DIRECT by themselves if they decide to use their own unmanaged
domain for kernel DMA.
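
As a rough sketch of the first half of that (hypothetical; requires_direct
is the field from the diff above, and the walk over the group's devices
assumes the caller already holds group->mutex):

static bool iommu_group_requires_direct(struct iommu_group *group)
{
	struct group_device *gdev;

	list_for_each_entry(gdev, &group->devices, list)
		if (gdev->dev->iommu->requires_direct)
			return true;

	return false;
}

iommu_group_claim_dma_owner() could then fail with -EPERM when this
returns true, instead of relying on the later blocking-domain attach.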

Best regards,
baolu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-21 13:21             ` Baolu Lu
@ 2023-04-21 13:33               ` Jason Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-21 13:33 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Alex Williamson, Robin Murphy, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Fri, Apr 21, 2023 at 09:21:12PM +0800, Baolu Lu wrote:
> On 2023/4/21 20:04, Jason Gunthorpe wrote:
> > @@ -2210,6 +2213,22 @@ static int __iommu_device_set_domain(struct iommu_group *group,
> >   {
> >   	int ret;
> > +	/*
> > +	 * If the driver has requested IOMMU_RESV_DIRECT then we cannot allow
> > +	 * the blocking domain to be attached as it does not contain the
> > +	 * required 1:1 mapping. This test effectively excludes the device from
> > +	 * being used with iommu_group_claim_dma_owner() which will block vfio
> > +	 * and iommufd as well.
> > +	 */
> > +	if (dev->iommu->requires_direct &&
> > +	    (new_domain->type == IOMMU_DOMAIN_BLOCKED ||
> > +	     new_domain == group->blocking_domain)) {
> > +		dev_warn(
> > +			dev,
> > +			"Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.");
> > +		return -EINVAL;
> > +	}
> > +
> >   	if (dev->iommu->attach_deferred) {
> >   		if (new_domain == group->default_domain)
> >   			return 0;
> 
> How about enforcing this in iommu_group_claim_dma_owner() 

It is more general here, since this applies to any attempt to attach a
blocking domain, eg if we miscode something else in the future it will
still be protected. It is subtle enough, and we all missed this for a
long time already, so I prefer we be robust.

> and changing the iommu drivers to use "atomic replacement" instead of
> a blocking translation transition when switching to a new domain?

That seems unlikely to happen on a broad scale..

> Assuming that the kernel drivers should always use the default
> domain, or handle the IOMMU_RESV_DIRECT by themselves if they decide
> to use their own unmanaged domain for kernel DMA.

Long term we want to get to the point where all kernel drivers call
a claim_dma_owner before they mess with the domains. This is our
locking protocol to say that it is actually safe to do it.

If we reach a point where a kernel driver wants to make its own domain
and needs to work with FW that pushes IOMMU_RESV_DIRECT for its device
then we should add a claim_dma_owner variant for trusted users that
avoids using the blocking domain.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-21 12:45               ` Jason Gunthorpe
@ 2023-04-21 17:22                 ` Robin Murphy
  2023-04-21 17:58                   ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-21 17:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On 2023-04-21 13:45, Jason Gunthorpe wrote:
> On Fri, Apr 21, 2023 at 01:29:46PM +0100, Robin Murphy wrote:
> 
>> Can you clarify why something other than IOMMU_RESV_SW_MSI would be
>> needed?
> 
> We need iommufd to setup a 1:1 map for the reserved space.
> 
> So, of the reserved spaces we have these:
> 
> 	/* Memory regions which must be mapped 1:1 at all times */
> 	IOMMU_RESV_DIRECT,
> 
>             Block iommufd
> 
> 	/*
> 	 * Memory regions which are advertised to be 1:1 but are
> 	 * commonly considered relaxable in some conditions,
> 	 * for instance in device assignment use case (USB, Graphics)
> 	 */
> 	IOMMU_RESV_DIRECT_RELAXABLE,
> 
>             iommufd ignores this one
> 
> 	/* Arbitrary "never map this or give it to a device" address ranges */
> 	IOMMU_RESV_RESERVED,
> 
> 	   iommufd prevents using this IOVA range
> 
> 	/* Hardware MSI region (untranslated) */
> 	IOMMU_RESV_MSI,
> 
> 	   iommufd treats this the same as IOMMU_RESV_RESERVED
> 
> 	/* Software-managed MSI translation window */
> 	IOMMU_RESV_SW_MSI,
> 
> 	   iommufd treats this the same as IOMMU_RESV_RESERVED, also
> 	   it passes the start to iommu_get_msi_cookie() which
> 	   eventually maps something, but not 1:1.
> 
> I don't think it is a compatible change for IOMMU_RESV_SW_MSI to also
> mean 1:1 map?

Bleh, my bad, the nested MSI stuff is right on the limit of the amount 
of horribleness I can hold in my head at once even at the best of times, 
let alone when my head is still half-full of PMUs.

I think a slightly more considered and slightly less wrong version of 
that idea is to mark it as IOMMU_RESV_MSI, and special-case 
direct-mapping those on Arm (I believe it would technically be benign to 
do on x86 too, but might annoy people with its pointlessness). However...

> On baremetal we have no idea what the platform put under that
> hardcoded address?
> 
> On VM we don't use the iommu_get_msi_cookie() flow because the GIC in
> the VM pretends it doesn't have an ITS page?  (did I get that right?)

No, that can't be right - PCIe devices have to support MSI or MSI-X, and 
many of them won't support INTx at all, so if the guest wants to use 
interrupts in general it must surely need to believe it has some kind of 
MSI controller, which for practical purposes in this context means an 
ITS. That was the next thing I started wondering after the above - if 
the aim is to direct-map the host's SW_MSI region to effectively pass 
through the S2 MSI cookie, but you have the same Linux SMMU driver in 
the guest, isn't that guest driver still going to add a conflicting 
SW_MSI region for the same IOVA and confuse things?

Ideally for nesting, the VMM would just tell us the IPA of where it's 
going to claim the given device's associated MSI doorbell is, we map 
that to the real underlying address at S2, then the guest can use its S1 
cookie as normal if it wants to, and the host doesn't have to rewrite 
addresses either way. The set_dev_data thing starts to look tempting for 
this too... Given that the nesting usage model inherently constrains the 
VMM's options for emulating the IOMMU, would it be unreasonable to make 
our lives a lot easier with some similar constraints around interrupts, 
and just not attempt to support the full gamut of "emulate any kind of 
IRQ with any other kind of IRQ" irqfd hilarity?

>> MSI regions already represent "safe" direct mappings, either as an inherent
>> property of the hardware, or with an actual mapping maintained by software.
>> Also RELAXABLE is meant to imply that it is only needed until a driver takes
>> over the device, which at face value doesn't make much sense for interrupts.
> 
> I used "relxable" to suggest it is safe for userspace.

I know, but the subtlety is the reason *why* it's safe for userspace. 
Namely that a VFIO driver has bound and reset (or at least taken control 
of) the device, and thus it is assumed to no longer be doing whatever 
the boot firmware left it doing, therefore the reserved region is 
assumed to no longer be relevant, and from then on the requirement to 
preserve it can be relaxed.

>> We'll still need to set this when the default domain type is identity too -
>> see the diff I posted (the other parts below I merely implied).
> 
> Right, I missed that!
> 
> I suggest like this to avoid the double loop:
> 
> @@ -1037,9 +1037,6 @@ static int iommu_create_device_direct_mappings(struct iom>
>          unsigned long pg_size;
>          int ret = 0;
>   
> -       if (!iommu_is_dma_domain(domain))
> -               return 0;
> -
>          BUG_ON(!domain->pgsize_bitmap);
>   
>          pg_size = 1UL << __ffs(domain->pgsize_bitmap);

But then you realise that you also need to juggle this around since 
identity domains aren't required to have a valid pgsize_bitmap either, 
give up on the idea and go straight to writing a dedicated loop as the 
clearer and tidier option because hey this is hardly a fast path anyway. 
At least, you do if you're me :)
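
Concretely, such a dedicated pre-pass could be as small as this sketch
(illustrative only; requires_direct is the field from the diff above):

static void iommu_note_direct_requirements(struct device *dev,
					   struct list_head *mappings)
{
	struct iommu_resv_region *entry;

	list_for_each_entry(entry, mappings, list)
		if (entry->type == IOMMU_RESV_DIRECT)
			dev->iommu->requires_direct = 1;
}

leaving the early return for non-DMA domains in
iommu_create_device_direct_mappings() untouched.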

Cheers,
Robin.

> @@ -1052,13 +1049,18 @@ static int iommu_create_device_direct_mappings(struct i>
>                  dma_addr_t start, end, addr;
>                  size_t map_size = 0;
>   
> -               start = ALIGN(entry->start, pg_size);
> -               end   = ALIGN(entry->start + entry->length, pg_size);
> -
>                  if (entry->type != IOMMU_RESV_DIRECT &&
>                      entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
>                          continue;
>   
> +               if (entry->type == IOMMU_RESV_DIRECT)
> +                       dev->iommu->requires_direct = 1;
> +
> +               if (!iommu_is_dma_domain(domain))
> +                       continue;
> +
> +               start = ALIGN(entry->start, pg_size);
> +               end   = ALIGN(entry->start + entry->length, pg_size);
>                  for (addr = start; addr <= end; addr += pg_size) {
>                          phys_addr_t phys_addr;
> 
> Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-21 17:22                 ` Robin Murphy
@ 2023-04-21 17:58                   ` Jason Gunthorpe
  2023-04-25 14:48                     ` Robin Murphy
                                       ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-21 17:58 UTC (permalink / raw)
  To: Robin Murphy, Nicolin Chen
  Cc: Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Fri, Apr 21, 2023 at 06:22:37PM +0100, Robin Murphy wrote:

> I think a slightly more considered and slightly less wrong version of that
> idea is to mark it as IOMMU_RESV_MSI, and special-case direct-mapping those
> on Arm (I believe it would technically be benign to do on x86 too, but might
> annoy people with its pointlessness). However...

I'd rather have a IOMMU_RESV_MSI_DIRECT and put the ARM special case
in ARM code..

> > On baremetal we have no idea what the platform put under that
> > hardcoded address?
> > 
> > On VM we don't use the iommu_get_msi_cookie() flow because the GIC in
> > the VM pretends it doesn't have an ITS page?  (did I get that right?)
> 
> No, that can't be right - PCIe devices have to support MSI or MSI-X, and
> many of them won't support INTx at all, so if the guest wants to use
> interrupts in general it must surely need to believe it has some kind of MSI
> controller, 

Yes..

> which for practical purposes in this context means an ITS. 

I haven't delved into it super detail, but.. my impression was..

The ITS page only becomes relevant to the IOMMU layer if the actual
IRQ driver calls iommu_dma_prepare_msi()

And we have only these drivers that do so:

drivers/irqchip/irq-gic-v2m.c:  err = iommu_dma_prepare_msi(info->desc,
drivers/irqchip/irq-gic-v3-its.c:       err = iommu_dma_prepare_msi(info->desc, its->get_msi_base(its_dev));
drivers/irqchip/irq-gic-v3-mbi.c:       err = iommu_dma_prepare_msi(info->desc,
drivers/irqchip/irq-ls-scfg-msi.c:      err = iommu_dma_prepare_msi(info->desc, msi_data->msiir_addr);

While, I *thought* that the vGIC in ARM uses

drivers/irqchip/irq-gic-v4.c

Which doesn't obviously call iommu_dma_prepare_msi() ?

So while the SMMU driver will stick in a IOMMU_RESV_SW_MSI, and
iommufd will call iommu_get_msi_cookie(), there is no matching call
of iommu_dma_prepare_msi() - so it all effectively does nothing.
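
(For context, the complete flow when it does do something is roughly the
sketch below. The three calls are the real APIs; the wrapper, its
arguments and its_phys_base are purely illustrative.)

static int sw_msi_flow_sketch(struct iommu_domain *domain,
			      struct msi_desc *desc, struct msi_msg *msg,
			      phys_addr_t its_phys_base)
{
	int ret;

	/* 1) The owner of the unmanaged domain reserves an MSI IOVA
	 *    window, typically at the SW_MSI region the IOMMU driver
	 *    advertised. */
	ret = iommu_get_msi_cookie(domain, 0x8000000);
	if (ret)
		return ret;

	/* 2) The irqchip driver asks for the doorbell page to be mapped
	 *    somewhere inside that window. */
	ret = iommu_dma_prepare_msi(desc, its_phys_base);
	if (ret)
		return ret;

	/* 3) When the message is composed, the physical doorbell address
	 *    is replaced with the IOVA mapped in step 2. */
	iommu_dma_compose_msi_msg(desc, msg);
	return 0;
}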

Instead, again from what I understood, the IOMMU layer is expected to
install the ITS page, not knowing it is an ITS page, because the ACPI
creates an IOMMU_RESV_DIRECT.

When the VM writes its totally-a-lie MSI address to the PCI MSI-X
registers, the hypervisor traps it and substitutes, what it valiantly
hopes, is the right address for the ITS in the VM's S1 IOMMU table
based on the ACPI where it nicely asked the guest to keep this
specific IOVA mapped.

I'm not sure how the data bit works on ARM..

> was the next thing I started wondering after the above - if the aim is to
> direct-map the host's SW_MSI region to effectively pass through the S2 MSI
> cookie, but you have the same Linux SMMU driver in the guest, isn't that
> guest driver still going to add a conflicting SW_MSI region for the same
> IOVA and confuse things?

Oh probably yes. At least from iommufd perspective, it can resolve
overlapping regions just fine though.

> Ideally for nesting, the VMM would just tell us the IPA of where it's going
> to claim the given device's associated MSI doorbell is, we map that to the
> real underlying address at S2, then the guest can use its S1 cookie as
> normal if it wants to, and the host doesn't have to rewrite addresses either
> way. 

Goodness yes, I'd love that.

> that the nesting usage model inherently constrains the VMM's options for
> emulating the IOMMU, would it be unreasonable to make our lives a lot easier
> with some similar constraints around interrupts, and just not attempt to
> support the full gamut of "emulate any kind of IRQ with any other kind of
> IRQ" irqfd hilarity?

Isn't that what GICv4 is?

Frankly, I think something went wrong with the GICv4 design. A purely
virtualization focused GIC should not have continued to rely on the
hypervisor trapping of the MSI-X writes. The guest should have had a
real data value and a real physical ITS page.

I can understand why we got here, because fixing *all* of that would
be a big task and this is a small hack, but still... Yuk.

But that is a whole other journey. There is work afoot to standardize
some things that would make MSI-X trapping impossible and more solidly
force this issue, so I'm just hoping to keep the current mess going
as-is right now..

> > > MSI regions already represent "safe" direct mappings, either as an inherent
> > > property of the hardware, or with an actual mapping maintained by software.
> > > Also RELAXABLE is meant to imply that it is only needed until a driver takes
> > > over the device, which at face value doesn't make much sense for interrupts.
> > 
> > I used "relxable" to suggest it is safe for userspace.
> 
> I know, but the subtlety is the reason *why* it's safe for userspace. Namely
> that a VFIO driver has bound and reset (or at least taken control of) the
> device, and thus it is assumed to no longer be doing whatever the boot
> firmware left it doing, therefore the reserved region is assumed to no
> longer be relevant, and from then on the requirement to preserve it can be
> relaxed.

IOMMU_RESV_MSI_DIRECT is probably the better name

> >          unsigned long pg_size;
> >          int ret = 0;
> > -       if (!iommu_is_dma_domain(domain))
> > -               return 0;
> > -
> >          BUG_ON(!domain->pgsize_bitmap);
> >          pg_size = 1UL << __ffs(domain->pgsize_bitmap);
> 
> But then you realise that you also need to juggle this around since identity
> domains aren't required to have a valid pgsize_bitmap either, give up on the
> idea and go straight to writing a dedicated loop as the clearer and tidier
> option because hey this is hardly a fast path anyway. At least, you do if
> you're me :)

domain->pgsize_bitmap is always valid memory, and __ffs() always
returns [0:31], so this calculation will be fine but garbage.

> > @@ -1052,13 +1049,18 @@ static int iommu_create_device_direct_mappings(struct i>
> >                  dma_addr_t start, end, addr;
> >                  size_t map_size = 0;
> > -               start = ALIGN(entry->start, pg_size);
> > -               end   = ALIGN(entry->start + entry->length, pg_size);
> > -
> >                  if (entry->type != IOMMU_RESV_DIRECT &&
> >                      entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
> >                          continue;
> > +               if (entry->type == IOMMU_RESV_DIRECT)
> > +                       dev->iommu->requires_direct = 1;
> > +
> > +               if (!iommu_is_dma_domain(domain))
> > +                       continue;
> > +
> > +               start = ALIGN(entry->start, pg_size);
> > +               end   = ALIGN(entry->start + entry->length, pg_size);

Which is why I moved the only reader of pg_size after the check if it
is valid..

Thanks,
Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: RMRR device on non-Intel platform
  2023-04-21 11:34             ` Robin Murphy
@ 2023-04-23  8:23               ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2023-04-23  8:23 UTC (permalink / raw)
  To: Robin Murphy, Alex Williamson
  Cc: kvm@vger.kernel.org, iommu@lists.linux.dev, Jason Gunthorpe,
	Lu, Baolu

> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Friday, April 21, 2023 7:35 PM
> 
> > intel-iommu sets RELAXABLE for USB and GPU assuming their RMRR region
> > is not used post boot.
> >
> > Presumably same policy can be applied to AMD too.
> >
> > ARM RMR reports an explicit flag (ACPI_IORT_RMR_REMAP_PERMITTED) to
> > mark out whether a RMR region is relaxable. I'm not sure whether
> USB/GPU
> > is already covered.
> 
> Arm systems have no legacy of needing to offer compatibility with a PC
> BIOS from the 1990s, so PS/2 or VGA emulation is a non-issue for us ;)
> 

Makes sense. On the other hand, if AMD wants a similar exemption for
USB/GPU, they can do it as intel-iommu does.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: RMRR device on non-Intel platform
  2023-04-21 12:04           ` Jason Gunthorpe
  2023-04-21 12:29             ` Robin Murphy
  2023-04-21 13:21             ` Baolu Lu
@ 2023-04-23  8:24             ` Tian, Kevin
  2023-04-24  2:50               ` Baolu Lu
  2 siblings, 1 reply; 28+ messages in thread
From: Tian, Kevin @ 2023-04-23  8:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Robin Murphy, kvm@vger.kernel.org, iommu@lists.linux.dev

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 21, 2023 8:05 PM
> 
> So, after my domain error handling series, the core fix is pretty
> simple and universal. We should also remove all the redundant code in
> drivers - drivers should report the regions each device needs
> properly and leave enforcement to the core code.. Lu/Kevin do you want
> to take this?
> 

this direction looks good to me.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-23  8:24             ` Tian, Kevin
@ 2023-04-24  2:50               ` Baolu Lu
  0 siblings, 0 replies; 28+ messages in thread
From: Baolu Lu @ 2023-04-24  2:50 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Alex Williamson
  Cc: baolu.lu, Robin Murphy, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On 4/23/23 4:24 PM, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, April 21, 2023 8:05 PM
>>
>> So, after my domain error handling series, the core fix is pretty
>> simple and universal. We should also remove all the redundant code in
>> drivers - drivers should report the regions each device needs
>> properly and leave enforcement to the core code.. Lu/Kevin do you want
>> to take this?
>>
> 
> this direction looks good to me.

Looks good to me too. I will take this work after Jason's series lands.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-21 17:58                   ` Jason Gunthorpe
@ 2023-04-25 14:48                     ` Robin Murphy
  2023-04-25 15:58                       ` Jason Gunthorpe
  2023-04-25 16:37                     ` Nicolin Chen
  2023-04-26 11:57                     ` Jason Gunthorpe
  2 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-25 14:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On 2023-04-21 18:58, Jason Gunthorpe wrote:
> On Fri, Apr 21, 2023 at 06:22:37PM +0100, Robin Murphy wrote:
> 
>> I think a slightly more considered and slightly less wrong version of that
>> idea is to mark it as IOMMU_RESV_MSI, and special-case direct-mapping those
>> on Arm (I believe it would technically be benign to do on x86 too, but might
>> annoy people with its pointlessness). However...
> 
> I'd rather have a IOMMU_RESV_MSI_DIRECT and put the ARM special case
> in ARM code..

Maybe, but it's still actually broken either way, because how do you get 
that type into the VM? Firmware can't encode that a particular RMR 
represents the special magic hack for IOMMUFD, so now the SMMU driver 
needs to somehow be aware when it's running in a VM offering nested 
translation and do some more magic to inject the appropriate region, and 
it's all just... no.

>>> On baremetal we have no idea what the platform put under that
>>> hardcoded address?
>>>
>>> On VM we don't use the iommu_get_msi_cookie() flow because the GIC in
>>> the VM pretends it doesn't have an ITS page?  (did I get that right?)
>>
>> No, that can't be right - PCIe devices have to support MSI or MSI-X, and
>> many of them won't support INTx at all, so if the guest wants to use
>> interrupts in general it must surely need to believe it has some kind of MSI
>> controller,
> 
> Yes..
> 
>> which for practical purposes in this context means an ITS.
> 
> I haven't delved into it super detail, but.. my impression was..
> 
> The ITS page only becomes relevant to the IOMMU layer if the actual
> IRQ driver calls iommu_dma_prepare_msi()
> 
> And we have only these drivers that do so:
> 
> drivers/irqchip/irq-gic-v2m.c:  err = iommu_dma_prepare_msi(info->desc,
> drivers/irqchip/irq-gic-v3-its.c:       err = iommu_dma_prepare_msi(info->desc, its->get_msi_base(its_dev));
> drivers/irqchip/irq-gic-v3-mbi.c:       err = iommu_dma_prepare_msi(info->desc,
> drivers/irqchip/irq-ls-scfg-msi.c:      err = iommu_dma_prepare_msi(info->desc, msi_data->msiir_addr);
> 
> While, I *thought* that the vGIC in ARM uses
> 
> drivers/irqchip/irq-gic-v4.c
> 
> Which doesn't obviously call iommu_dma_prepare_msi() ?
> 
> So while the SMMU driver will stick in a IOMMU_RESV_SW_MSI, and
> iommufd will call iommu_get_msi_cookie(), there is no matching call
> of iommu_dma_prepare_msi() - so it all effectively does nothing.
> 
> Instead, again from what I understood, the IOMMU layer is expected to
> install the ITS page, not knowing it is an ITS page, because the ACPI
> creates an IOMMU_RESV_DIRECT.
> 
> When the VM writes its totally-a-lie MSI address to the PCI MSI-X
> registers, the hypervisor traps it and substitutes, what it valiantly
> hopes, is the right address for the ITS in the VM's S1 IOMMU table
> based on the ACPI where it nicely asked the guest to keep this
> specific IOVA mapped.
> 
> I'm not sure how the data bit works on ARM..
> 
>> was the next thing I started wondering after the above - if the aim is to
>> direct-map the host's SW_MSI region to effectively pass through the S2 MSI
>> cookie, but you have the same Linux SMMU driver in the guest, isn't that
>> guest driver still going to add a conflicting SW_MSI region for the same
>> IOVA and confuse things?
> 
> Oh probably yes. At least from iommufd perspective, it can resolve
> overlapping regions just fine though.
> 
>> Ideally for nesting, the VMM would just tell us the IPA of where it's going
>> to claim the given device's associated MSI doorbell is, we map that to the
>> real underlying address at S2, then the guest can use its S1 cookie as
>> normal if it wants to, and the host doesn't have to rewrite addresses either
>> way.
> 
> Goodness yes, I'd love that.
> 
>> that the nesting usage model inherently constrains the VMM's options for
>> emulating the IOMMU, would it be unreasonable to make our lives a lot easier
>> with some similar constraints around interrupts, and just not attempt to
>> support the full gamut of "emulate any kind of IRQ with any other kind of
>> IRQ" irqfd hilarity?
> 
> Isn't that what GICv4 is?

That would fit *part* of the GICv4 usage model...

> Frankly, I think something went wrong with the GICv4 design. A purely
> virtualization focused GIC should not have continued to rely on the
> hypervisor trapping of the MSI-X writes. The guest should have had a
> real data value and a real physical ITS page.

...I believe the remaining missing part is a UAPI for the VMM to ask the 
host kernel to configure a "physical" vLPI for a given device and 
EventID, at the point when its vITS emulation is handling the guest's 
configuration command. With that we would no longer have to rewrite the 
MSI payload either, so can avoid trapping the device's MSI-X capability 
at all, and the VM could actually have non-terrible interrupt performance.

> I can understand why we got here, because fixing *all* of that would
> be a big task and this is a small hack, but still... Yuk.
> 
> But that is a whole other journey. There is work afoot to standardize
> some things that would make MSI-X trapping impossible and more solidly
> force this issue, so I'm just hoping to keep the current mess going
> as-is right now..

The thing is, though, this small hack is in fact just the tip of a large 
pile of small hacks across Linux and QEMU that probably add up to a 
similar amount of work overall as just implementing the interface that 
we'd ultimately want to have anyway.

>>>> MSI regions already represent "safe" direct mappings, either as an inherent
>>>> property of the hardware, or with an actual mapping maintained by software.
>>>> Also RELAXABLE is meant to imply that it is only needed until a driver takes
>>>> over the device, which at face value doesn't make much sense for interrupts.
>>>
>>> I used "relxable" to suggest it is safe for userspace.
>>
>> I know, but the subtlety is the reason *why* it's safe for userspace. Namely
>> that a VFIO driver has bound and reset (or at least taken control of) the
>> device, and thus it is assumed to no longer be doing whatever the boot
>> firmware left it doing, therefore the reserved region is assumed to no
>> longer be relevant, and from then on the requirement to preserve it can be
>> relaxed.
> 
> IOMMU_RESV_MSI_DIRECT is probably the better name
> 
>>>           unsigned long pg_size;
>>>           int ret = 0;
>>> -       if (!iommu_is_dma_domain(domain))
>>> -               return 0;
>>> -
>>>           BUG_ON(!domain->pgsize_bitmap);
>>>           pg_size = 1UL << __ffs(domain->pgsize_bitmap);
>>
>> But then you realise that you also need to juggle this around since identity
>> domains aren't required to have a valid pgsize_bitmap either, give up on the
>> idea and go straight to writing a dedicated loop as the clearer and tidier
>> option because hey this is hardly a fast path anyway. At least, you do if
>> you're me :)
> 
> domain->pgsize_bitmap is always valid memory, and __ffs() always
> returns [0:31], so this calculation will be fine but garbage.
> 
>>> @@ -1052,13 +1049,18 @@ static int iommu_create_device_direct_mappings(struct i>
>>>                   dma_addr_t start, end, addr;
>>>                   size_t map_size = 0;
>>> -               start = ALIGN(entry->start, pg_size);
>>> -               end   = ALIGN(entry->start + entry->length, pg_size);
>>> -
>>>                   if (entry->type != IOMMU_RESV_DIRECT &&
>>>                       entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
>>>                           continue;
>>> +               if (entry->type == IOMMU_RESV_DIRECT)
>>> +                       dev->iommu->requires_direct = 1;
>>> +
>>> +               if (!iommu_is_dma_domain(domain))
>>> +                       continue;
>>> +
>>> +               start = ALIGN(entry->start, pg_size);
>>> +               end   = ALIGN(entry->start + entry->length, pg_size);
> 
> Which is why I moved the only reader of pg_size after the check if it
> is valid..

Except GCC says __builtin_ctzl(0) is undefined, so although I'd concur 
that the chances of nasal demons at the point of invoking __ffs() are 
realistically quite low, I don't fancy arguing that with the static 
checker brigade. So by the time we've appeased them with additional 
checks, initialisations, etc., we'd have basically the same overhead as 
running 0 iterations of another for loop (the overwhelmingly common case 
anyway), but in more lines of code, with a more convoluted flow. All of 
which leads me to conclude that "number of times we walk a usually-empty 
list in a one-off slow path" is not in fact the most worthwhile thing to 
optimise for ;)

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-25 14:48                     ` Robin Murphy
@ 2023-04-25 15:58                       ` Jason Gunthorpe
  2023-04-26  8:39                         ` Tian, Kevin
  2023-04-26 12:24                         ` Robin Murphy
  0 siblings, 2 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-25 15:58 UTC (permalink / raw)
  To: Robin Murphy, Eric Auger
  Cc: Nicolin Chen, Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Tue, Apr 25, 2023 at 03:48:11PM +0100, Robin Murphy wrote:
> On 2023-04-21 18:58, Jason Gunthorpe wrote:
> > On Fri, Apr 21, 2023 at 06:22:37PM +0100, Robin Murphy wrote:
> > 
> > > I think a slightly more considered and slightly less wrong version of that
> > > idea is to mark it as IOMMU_RESV_MSI, and special-case direct-mapping those
> > > on Arm (I believe it would technically be benign to do on x86 too, but might
> > > annoy people with its pointlessness). However...
> > 
> > I'd rather have a IOMMU_RESV_MSI_DIRECT and put the ARM special case
> > in ARM code..
> 
> Maybe, but it's still actually broken either way, because how do you get
> that type into the VM? Firmware can't encode that a particular RMR
> represents the special magic hack for IOMMUFD, so now the SMMU driver needs
> to somehow be aware when it's running in a VM offering nested translation
> and do some more magic to inject the appropriate region, and it's all
> just... no.

Er, I figured ARM had sorted this out somehow :(

Eric, do you know anything about this? Where did you setup the 1:1 map
in the VM in your series?

So you are saying that the basic problem statement is that the ACPI table
that describes the ITS direct mapping in the VM is supposed to be
interpreted by the SMMU driver as "memory must be iommu mapped 1:1 at
all times and is possibly dangerous enough to block userspace access
to the device, like Intel does" ?

This isn't end of the world bad, it just means that VFIO will not work
in ARM guests under this interrupt model. Sad, and something to fix,
but we can still cover a lot of ground..

Maybe a GICv5 can correct it..

> > Frankly, I think something went wrong with the GICv4 design. A purely
> > virtualization focused GIC should not have continued to rely on the
> > hypervisor trapping of the MSI-X writes. The guest should have had a
> > real data value and a real physical ITS page.
> 
> ...I believe the remaining missing part is a UAPI for the VMM to ask the
> host kernel to configure a "physical" vLPI for a given device and EventID,
> at the point when its vITS emulation is handling the guest's configuration
> command. With that we would no longer have to rewrite the MSI payload
> either, so can avoid trapping the device's MSI-X capability at all, and the
> VM could actually have non-terrible interrupt performance.

Yes.. More broadly I think we'd need to allow the vGIC code to
understand that it has complete control over a SID, and like we are
talking about for SMMU a vSID mapping as well.

This would have to replace the eventfd based hookup we have now.

I really want to avoid opening this can of worms because it is
basically iommufd all over again just irq focused :(

> Except GCC says __builtin_ctzl(0) is undefined, so although I'd concur that
> the chances of nasal demons at the point of invoking __ffs() are
> realistically quite low, I don't fancy arguing that with the static checker
> brigade. So by the time we've appeased them with additional checks,
> initialisations, etc., we'd have basically the same overhead as running 0
> iterations of another for loop (the overwhelmingly common case anyway), but
> in more lines of code, with a more convoluted flow. All of which leads me to
> conclude that "number of times we walk a usually-empty list in a one-off
> slow path" is not in fact the most worthwhile thing to optimise for ;)

Heh, well fair enough, we do have a UBSAN that might trip on this. Lu
can correct it

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-21 17:58                   ` Jason Gunthorpe
  2023-04-25 14:48                     ` Robin Murphy
@ 2023-04-25 16:37                     ` Nicolin Chen
  2023-04-26 11:57                     ` Jason Gunthorpe
  2 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2023-04-25 16:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Fri, Apr 21, 2023 at 02:58:01PM -0300, Jason Gunthorpe wrote:
 
> When the VM writes its totally-a-lie MSI address to the PCI MSI-X
> registers, the hypervisor traps it and substitutes, what it valiantly
> hopes, is the right address for the ITS in the VM's S1 IOMMU table
> based on the ACPI where it nicely asked the guest to keep this
> specific IOVA mapped.
> 
> I'm not sure how the data bit works on ARM..

Not sure if I follow everything here correctly, but the data
seems to be the hwirq idx in its_alloc_device_irq()? So not
only is the MSI in the guest fake, but the data also?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: RMRR device on non-Intel platform
  2023-04-25 15:58                       ` Jason Gunthorpe
@ 2023-04-26  8:39                         ` Tian, Kevin
  2023-04-26 12:24                         ` Robin Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2023-04-26  8:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Robin Murphy, Eric Auger
  Cc: Nicolin Chen, Alex Williamson, kvm@vger.kernel.org,
	iommu@lists.linux.dev

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 25, 2023 11:58 PM
> 
> On Tue, Apr 25, 2023 at 03:48:11PM +0100, Robin Murphy wrote:
> > On 2023-04-21 18:58, Jason Gunthorpe wrote:
> > > On Fri, Apr 21, 2023 at 06:22:37PM +0100, Robin Murphy wrote:
> > >
> > > > I think a slightly more considered and slightly less wrong version of that
> > > > idea is to mark it as IOMMU_RESV_MSI, and special-case direct-
> mapping those
> > > > on Arm (I believe it would technically be benign to do on x86 too, but
> might
> > > > annoy people with its pointlessness). However...
> > >
> > > I'd rather have a IOMMU_RESV_MSI_DIRECT and put the ARM special
> case
> > > in ARM code..
> >
> > Maybe, but it's still actually broken either way, because how do you get
> > that type into the VM? Firmware can't encode that a particular RMR
> > represents the special magic hack for IOMMUFD, so now the SMMU driver
> needs
> > to somehow be aware when it's running in a VM offering nested
> translation
> > and do some more magic to inject the appropriate region, and it's all
> > just... no.
> 
> Er, I figured ARM had sorted this out somehow :(
> 
> Eric, do you know anything about this? Where did you setup the 1:1 map
> in the VM in your series?
> 

Out of curiosity, does this flag have a different meaning on s1 vs. s2?

For s1 it means 1:1 mapping.

For s2 does it have the same meaning as IOMMU_RESV_SW_MSI?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-21 17:58                   ` Jason Gunthorpe
  2023-04-25 14:48                     ` Robin Murphy
  2023-04-25 16:37                     ` Nicolin Chen
@ 2023-04-26 11:57                     ` Jason Gunthorpe
  2023-04-26 13:53                       ` Robin Murphy
  2 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-26 11:57 UTC (permalink / raw)
  To: Robin Murphy, Nicolin Chen
  Cc: Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Fri, Apr 21, 2023 at 02:58:01PM -0300, Jason Gunthorpe wrote:

> > which for practical purposes in this context means an ITS. 
> 
> I haven't delved into it super detail, but.. my impression was..
> 
> The ITS page only becomes relevant to the IOMMU layer if the actual
> IRQ driver calls iommu_dma_prepare_msi()

Nicolin and I sat down and traced this through, this explanation is
almost right...

irq-gic-v4.c is some sub module of irq-gic-v3-its.c so it does end up
calling iommu_dma_prepare_msi() however..

qemu will setup the ACPI so that VM thinks the ITS page is at
0x08080000. I think it maps some dummy CPU memory to this address.

iommufd will map the real ITS page at MSI_IOVA_BASE = 0x8000000 (!!)
and only into the IOMMU

qemu will setup some RMRR thing to make 0x8000000 1:1 at the VM's
IOMMU

When DMA API is used iommu_dma_prepare_msi() is called which will
select a MSI page address that avoids the reserved region, so it is
some random value != 0x8000000 and maps the dummy CPU page to it.
The VM will then do a MSI-X programming cycle with the S1 IOVA of the
CPU page and the data. qemu traps this and throws away the address
from the VM. The kernel sets up the interrupt and assumes 0x8000000
is the right IOVA.

When VFIO is used iommufd in the VM will force the MSI window to
0x8000000 and instead of putting a 1:1 mapping we map the dummy CPU
page and then everything is broken. Adding the reserved check is an
improvement.
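
(For reference, the 0x8000000 window comes from the SW_MSI region the
arm-smmu drivers advertise, roughly as in the sketch below. The constants
match the in-tree drivers, but the exact iommu_alloc_resv_region()
signature has changed across kernel versions, so treat it as
illustrative.)

#define MSI_IOVA_BASE		0x8000000
#define MSI_IOVA_LENGTH		0x100000

static void smmu_get_resv_regions_sketch(struct device *dev,
					 struct list_head *head)
{
	struct iommu_resv_region *region;
	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;

	/* advertise a fixed software-managed MSI window to the core */
	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
					 prot, IOMMU_RESV_SW_MSI, GFP_KERNEL);
	if (!region)
		return;

	list_add_tail(&region->list, head);
}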

The only way to properly fix this is to have qemu stop throwing away
the address during the MSI-X programming. This needs to be programmed
into the device instead.

I have no idea how best to get there with the ARM GIC setup.. It feels
really hard.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-25 15:58                       ` Jason Gunthorpe
  2023-04-26  8:39                         ` Tian, Kevin
@ 2023-04-26 12:24                         ` Robin Murphy
  2023-04-26 12:58                           ` Jason Gunthorpe
  1 sibling, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-26 12:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Eric Auger
  Cc: Nicolin Chen, Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev, Marc Zyngier

On 2023-04-25 16:58, Jason Gunthorpe wrote:
> On Tue, Apr 25, 2023 at 03:48:11PM +0100, Robin Murphy wrote:
>> On 2023-04-21 18:58, Jason Gunthorpe wrote:
>>> On Fri, Apr 21, 2023 at 06:22:37PM +0100, Robin Murphy wrote:
>>>
>>>> I think a slightly more considered and slightly less wrong version of that
>>>> idea is to mark it as IOMMU_RESV_MSI, and special-case direct-mapping those
>>>> on Arm (I believe it would technically be benign to do on x86 too, but might
>>>> annoy people with its pointlessness). However...
>>>
>>> I'd rather have a IOMMU_RESV_MSI_DIRECT and put the ARM special case
>>> in ARM code..
>>
>> Maybe, but it's still actually broken either way, because how do you get
>> that type into the VM? Firmware can't encode that a particular RMR
>> represents the special magic hack for IOMMUFD, so now the SMMU driver needs
>> to somehow be aware when it's running in a VM offering nested translation
>> and do some more magic to inject the appropriate region, and it's all
>> just... no.
> 
> Er, I figured ARM had sorted this out somehow :(
> 
> Eric, do you know anything about this? Where did you setup the 1:1 map
> in the VM in your series?
> 
> So you are saying that the basic problem statement is that the ACPI table
> that describes the ITS direct mapping in the VM is supposed to be
> interpreted by the SMMU driver as "memory must be iommu mapped 1:1 at
> all times and is possibly dangerous enough to block userspace access
> to the device, like Intel does" ?

In the case of IORT RMRs, all firmware can describe is regions of memory 
which must be direct-mapped, along with some attributes for *how* they 
are to be mapped. It cannot (and frankly should not) convey any 
additional OS-specific special meaning.

> This isn't end of the world bad, it just means that VFIO will not work
> in ARM guests under this interrupt model. Sad, and something to fix,
> but we can still cover a lot of ground..
> 
> Maybe a GICv5 can correct it..

Correct what? If Linux chooses to use the existing GICv3 architecture in 
an arse-backwards manner to opaquely emulate interrupts with other 
interrupts instead of actually virtualising them, I'd say that's hardly 
the architecture's fault. Given how much many of our partners do care 
about virtualisation performance, I'm pretty confident the GIC/SMMU 
architects aren't going to be spending time on future extensions for 
making horribly inefficient interrupt emulation any easier to implement.

The primary purpose of the Interrupt Translation Service is to 
*translate* interrupts. In combination with an SMMU, it allows a VM to 
program a device with a virtualised address and EventID such that when 
the device signals an interrupt by writing that EventID to that address, 
the SMMU translates the address to direct the write to the appropriate 
physical ITS, then the ITS translates the EventID into a physical LPI, 
which may then be handled directly or stuffed back into the VM as the 
appropriate virtual interrupt. The nature of these translations is such 
that the host has no need to mediate access to the device itself.

>>> Frankly, I think something went wrong with the GICv4 design. A purely
>>> virtualization focused GIC should not have continued to rely on the
>>> hypervisor trapping of the MSI-X writes. The guest should have had a
>>> real data value and a real physical ITS page.

A real *virtual* ITS page (IPA/GPA, not PA) as above, because the Arm 
system architecture does not define a fixed address map thus that is 
free to be virtualised as well, but otherwise, yes, indeed it should, 
and it could, on both GICv3 and AMD/Intel IRQ remapping. Why isn't Linux 
letting VMMs do that?

Fair enough for VFIO, since that existed long before any architectural 
MSI support on Arm, and has crusty PCI/X legacy on x86 to deal with too, 
but IOMMUFD is a new thing for modern use-cases on modern hardware. In 
today's PCIe world why would we choose to focus on banging the 
legacy-shaped peg into the modern-interrupt-remapping-shaped hole with 
our foreheads, then complain that it hurts, rather than design the new 
thing to use the modern functionality as intended, and have something 
that can work really well for what users are actually going to want to do?

>> ...I believe the remaining missing part is a UAPI for the VMM to ask the
>> host kernel to configure a "physical" vLPI for a given device and EventID,
>> at the point when its vITS emulation is handling the guest's configuration
>> command. With that we would no longer have to rewrite the MSI payload
>> either, so can avoid trapping the device's MSI-X capability at all, and the
>> VM could actually have non-terrible interrupt performance.
> 
> Yes.. More broadly I think we'd need to allow the vGIC code to
> understand that it has complete control over a SID, and like we are
> talking about for SMMU a vSID mapping as well.

Marc, any thoughts on how much of this is actually missing from the MSI 
domain infrastructure today? I'm guessing the main thing is needing some 
sort of msi_domain_alloc_virq() API so that the caller can provide the 
predetermined message data to replace the normal compose/write flow - 
beyond that I think I can see a tentative idea of how it could then be 
plumbed through the generic msi_alloc_info and the ITS domain. The x86 
irq_remapping domains scare me, though, so I'll say right now I'm not 
going there :)

Thanks,
Robin.

> This would have to replace the eventfd based hookup we have now.
> 
> I really want to avoid opening this can of worms because it is
> basically iommufd all over again just irq focused :(
> 
>> Except GCC says __builtin_ctzl(0) is undefined, so although I'd concur that
>> the chances of nasal demons at the point of invoking __ffs() are
>> realistically quite low, I don't fancy arguing that with the static checker
>> brigade. So by the time we've appeased them with additional checks,
>> initialisations, etc., we'd have basically the same overhead as running 0
>> iterations of another for loop (the overwhelmingly common case anyway), but
>> in more lines of code, with a more convoluted flow. All of which leads me to
>> conclude that "number of times we walk a usually-empty list in a one-off
>> slow path" is not in fact the most worthwhile thing to optimise for ;)
> 
> Heh, well fair enough, we do have a UBSAN that might trip on this. Lu
> can correct it
> 
> Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-26 12:24                         ` Robin Murphy
@ 2023-04-26 12:58                           ` Jason Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-26 12:58 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Eric Auger, Nicolin Chen, Alex Williamson, Tian, Kevin,
	kvm@vger.kernel.org, iommu@lists.linux.dev, Marc Zyngier

On Wed, Apr 26, 2023 at 01:24:21PM +0100, Robin Murphy wrote:

> A real *virtual* ITS page (IPA/GPA, not PA) as above, because the Arm system
> architecture does not define a fixed address map thus that is free to be
> virtualised as well, but otherwise, yes, indeed it should, and it could, on
> both GICv3 and AMD/Intel IRQ remapping. Why isn't Linux letting VMMs do
> that?

At the root, we don't have an interface out of VFIO for setting up
interrupts in such a sophisticated way.

> Fair enough for VFIO, since that existed long before any architectural MSI
> support on Arm, and has crusty PCI/X legacy on x86 to deal with too, but
> IOMMUFD is a new thing for modern use-cases on modern hardware. 

Ah but iommufd isn't touching interrupts :)

I'm scared we need an irqfd kind of idea to expose all this
accelerated IRQ hardware to the VMM as well.

> > > ...I believe the remaining missing part is a UAPI for the VMM to ask the
> > > host kernel to configure a "physical" vLPI for a given device and EventID,
> > > at the point when its vITS emulation is handling the guest's configuration
> > > command. With that we would no longer have to rewrite the MSI payload
> > > either, so can avoid trapping the device's MSI-X capability at all, and the
> > > VM could actually have non-terrible interrupt performance.
> > 
> > Yes.. More broadly I think we'd need to allow the vGIC code to
> > understand that it has complete control over a SID, and like we are
> > talking about for SMMU a vSID mapping as well.
> 
> Marc, any thoughts on how much of this is actually missing from the MSI
> domain infrastructure today? I'm guessing the main thing is needing some
> sort of msi_domain_alloc_virq() API so that the caller can provide the
> predetermined message data to replace the normal compose/write flow

We don't want the VMM to write the MSI-X data. This isn't going to get
us far enough. Talk to ARM's rep on the OCP SIOVr2 workgroup to get
some sense of what is being proposed for next-generation interrupt
handling.

There will likely be no standard MSI-X table or equivalent and no
place to put a hypertrap.

If we are fixing things we need to fix it fully - the VM programs the
MSI-X registers directly. The VM's irq_chip does any hypertraps using
the GIC programming model, NOT PCI, to get the VMM to setup the IRQ
remapping HW.

This suits ARM better anyhow since the VM is in control of the IOVA
for the ITS page.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-26 11:57                     ` Jason Gunthorpe
@ 2023-04-26 13:53                       ` Robin Murphy
  2023-04-26 14:17                         ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2023-04-26 13:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On 2023-04-26 12:57, Jason Gunthorpe wrote:
> On Fri, Apr 21, 2023 at 02:58:01PM -0300, Jason Gunthorpe wrote:
> 
>>> which for practical purposes in this context means an ITS.
>>
>> I haven't delved into it super detail, but.. my impression was..
>>
>> The ITS page only becomes relevant to the IOMMU layer if the actual
>> IRQ driver calls iommu_dma_prepare_msi()
> 
> Nicolin and I sat down and traced this through, this explanation is
> almost right...
> 
> irq-gic-v4.c is some sub module of irq-gic-v3-its.c so it does end up
> calling iommu_dma_prepare_msi() however..

Ignore GICv4; that basically only makes a difference to what happens 
after the CPU receives an interrupt.

> qemu will setup the ACPI so that VM thinks the ITS page is at
> 0x08080000. I think it maps some dummy CPU memory to this address.
> 
> iommufd will map the real ITS page at MSI_IOVA_BASE = 0x8000000 (!!)
> and only into the IOMMU
> 
> qemu will setup some RMRR thing to make 0x8000000 1:1 at the VM's
> IOMMU
> 
> When DMA API is used iommu_dma_prepare_msi() is called which will
> select a MSI page address that avoids the reserved region, so it is
> some random value != 0x8000000 and maps the dummy CPU page to it.
> The VM will then do a MSI-X programming cycle with the S1 IOVA of the
> CPU page and the data. qemu traps this and throws away the address
> from the VM. The kernel sets up the interrupt and assumes 0x8000000
> is the right IOVA.
> 
> When VFIO is used iommufd in the VM will force the MSI window to
> 0x8000000 and instead of putting a 1:1 mapping we map the dummy CPU
> page and then everything is broken. Adding the reserved check is an
> improvement.
> 
> The only way to properly fix this is to have qemu stop throwing away
> the address during the MSI-X programming. This needs to be programmed
> into the device instead.
> 
> I have no idea how best to get there with the ARM GIC setup.. It feels
> really hard.

Give QEMU a way to tell IOMMUFD to associate that 0x08080000 address 
with a given device as an MSI target. IOMMUFD then ensures that the S2 
mapping exists from that IPA to the device's real ITS (I vaguely 
remember Eric had a patch to pre-populate an MSI cookie with specific 
pages, which may have been heading along those lines). In the worst case 
this might mean having to subdivide the per-SMMU copies of the S2 domain 
into per-ITS copies as well, so we'd probably want to detect and compare 
devices' ITS parents up-front.

QEMU will presumably also need a way to pass the VA down to IOMMUFD when 
it sees the guest programming the MSI (possibly it could pass the IPA at 
the same time so we don't need a distinct step to set up S2 beforehand?) 
- once the underlying physical MSI configuration comes back from the PCI 
layer, that VA just needs to be dropped in to replace the original 
msi_msg address.

TBH at that point it may be easier to just not have a cookie in the S2 
domain at all when nesting is enabled, and just let IOMMUFD make the ITS 
mappings directly for itself.
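
(At which point the S2 side would be little more than this sketch, where
the IPA and the mapping size are assumptions supplied by the VMM:

	ret = iommu_map(s2_domain, its_ipa, its_phys, SZ_64K,
			IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);

done once IOMMUFD knows which physical ITS the device is behind.)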

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RMRR device on non-Intel platform
  2023-04-26 13:53                       ` Robin Murphy
@ 2023-04-26 14:17                         ` Jason Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2023-04-26 14:17 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, Alex Williamson, Tian, Kevin, kvm@vger.kernel.org,
	iommu@lists.linux.dev

On Wed, Apr 26, 2023 at 02:53:53PM +0100, Robin Murphy wrote:

> Give QEMU a way to tell IOMMUFD to associate that 0x08080000 address with a
> given device as an MSI target. IOMMUFD then ensures that the S2 mapping
> exists from that IPA to the device's real ITS (I vaguely remember Eric had a
> patch to pre-populate an MSI cookie with specific pages, which may have been
> heading along those lines). 

This isn't the main problem - already for most cases iommufd makes it
so the ITS page is at 0x8000000. We can fix qemu to also use
0x8000000 in the ACPI - it already hardwires this for the RMRR part.

We can even make the kernel return the value so it isn't hardwired,
easy stuff..

> QEMU will presumably also need a way to pass the VA down to IOMMUFD when it
> sees the guest programming the MSI (possibly it could pass the IPA at the
> same time so we don't need a distinct step to set up S2 beforehand?) - once
> the underlying physical MSI configuration comes back from the PCI layer,
> that VA just needs to be dropped in to replace the original msi_msg address.

This is the main problem. What ioctl passes the VA, and how does it plumb
down into the irq_chip..

This is where everyone gets scared, I think. There is a thick mass of
IRQ plumbing and locking between those two points

And then it only solves MSI, not the bigger picture..

> TBH at that point it may be easier to just not have a cookie in the S2
> domain at all when nesting is enabled, and just let IOMMUFD make the ITS
> mappings directly for itself.

Yes, I'd like to get there so replace can work.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-04-26 14:17 UTC | newest]

Thread overview: 28+ messages
2023-04-20  6:52 RMRR device on non-Intel platform Tian, Kevin
2023-04-20 14:15 ` Alex Williamson
2023-04-20 14:19   ` Robin Murphy
2023-04-20 14:49     ` Alex Williamson
2023-04-20 16:55       ` Robin Murphy
2023-04-20 21:49         ` Alex Williamson
2023-04-21  4:10           ` Tian, Kevin
2023-04-21 11:33             ` Jason Gunthorpe
2023-04-21 11:34             ` Robin Murphy
2023-04-23  8:23               ` Tian, Kevin
2023-04-21 12:04           ` Jason Gunthorpe
2023-04-21 12:29             ` Robin Murphy
2023-04-21 12:45               ` Jason Gunthorpe
2023-04-21 17:22                 ` Robin Murphy
2023-04-21 17:58                   ` Jason Gunthorpe
2023-04-25 14:48                     ` Robin Murphy
2023-04-25 15:58                       ` Jason Gunthorpe
2023-04-26  8:39                         ` Tian, Kevin
2023-04-26 12:24                         ` Robin Murphy
2023-04-26 12:58                           ` Jason Gunthorpe
2023-04-25 16:37                     ` Nicolin Chen
2023-04-26 11:57                     ` Jason Gunthorpe
2023-04-26 13:53                       ` Robin Murphy
2023-04-26 14:17                         ` Jason Gunthorpe
2023-04-21 13:21             ` Baolu Lu
2023-04-21 13:33               ` Jason Gunthorpe
2023-04-23  8:24             ` Tian, Kevin
2023-04-24  2:50               ` Baolu Lu
