[PATCH 0/3] x86/iommu: improve setup time of hwdom IOMMU

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] x86/iommu: improve setup time of hwdom IOMMU
@ 2023-11-17  9:47 Roger Pau Monne
  2023-11-17  9:47 ` [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM Roger Pau Monne
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Roger Pau Monne @ 2023-11-17  9:47 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Wei Liu,
	Paul Durrant

Hello,

The follow series aims to improve the time consumed to setup the IOMMU
for the hardware domain.

Patch 1 and 2 are prereqs, while patch 3 is the actual change that
speeds up IOMMU setup.  See patch description for figures.

Thanks, Roger.

Roger Pau Monne (3):
  amd-vi: use the same IOMMU page table levels for PV and HVM
  x86/iommu: move xen_in_range() so it can be made static
  x86/iommu: use a rangeset for hwdom setup

 xen/arch/x86/hvm/io.c                       |  15 +-
 xen/arch/x86/include/asm/hvm/io.h           |   4 +-
 xen/arch/x86/include/asm/setup.h            |   2 -
 xen/arch/x86/setup.c                        |  49 ---
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  20 +-
 xen/drivers/passthrough/x86/iommu.c         | 321 ++++++++++++++------
 6 files changed, 247 insertions(+), 164 deletions(-)

-- 
2.42.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-17  9:47 [PATCH 0/3] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
@ 2023-11-17  9:47 ` Roger Pau Monne
  2023-11-17 11:55   ` Andrew Cooper
  2023-11-17  9:47 ` [PATCH 2/3] x86/iommu: move xen_in_range() so it can be made static Roger Pau Monne
  2023-11-17  9:47 ` [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup Roger Pau Monne
  2 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monne @ 2023-11-17  9:47 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

Using different page table levels for HVM or PV guest is not helpful, and is
not inline with the IOMMU implementation used by the other architecture vendor
(VT-d).

Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
page table levels.

Note using the max RAM address for PV was bogus anyway, as there's no guarantee
there can't be device MMIO or reserved regions past the maximum RAM region.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 6bc73dc21052..f9e749d74da2 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
 static int cf_check amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    int pgmode = amd_iommu_get_paging_mode(
+        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));
+
+    if ( pgmode < 0 )
+        return pgmode;
 
     /*
-     * Choose the number of levels for the IOMMU page tables.
-     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
-     *   RAM) above the 512G boundary.
-     * - HVM could in principle use 3 or 4 depending on how much guest
-     *   physical address space we give it, but this isn't known yet so use 4
-     *   unilaterally.
-     * - Unity maps may require an even higher number.
+     * Choose the number of levels for the IOMMU page tables, taking into
+     * account unity maps.
      */
-    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
-            is_hvm_domain(d)
-            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-            : get_upper_mfn_bound() + 1),
-        amd_iommu_min_paging_mode);
+    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
 
     return 0;
 }
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-17  9:47 ` [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM Roger Pau Monne
@ 2023-11-17 11:55   ` Andrew Cooper
  2023-11-20  9:45     ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2023-11-17 11:55 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich

On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> Using different page table levels for HVM or PV guest is not helpful, and is
> not inline with the IOMMU implementation used by the other architecture vendor
> (VT-d).
>
> Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
> page table levels.
>
> Note using the max RAM address for PV was bogus anyway, as there's no guarantee
> there can't be device MMIO or reserved regions past the maximum RAM region.

Indeed - and the MMIO regions do matter for P2P DMA.

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Variable-height IOMMU pagetables are not worth the security
vulnerabilities they're made of.  I regret not fighting hard enough to
kill them entirely several years ago...

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, although...

> ---
>  xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)
>
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 6bc73dc21052..f9e749d74da2 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
>  static int cf_check amd_iommu_domain_init(struct domain *d)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> +    int pgmode = amd_iommu_get_paging_mode(
> +        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));

"paging mode" comes from the spec, but it's a very backwards way of
spelling height.

Can we at least start to improve the comprehensibility by renaming this
variable.

> +
> +    if ( pgmode < 0 )
> +        return pgmode;
>  
>      /*
> -     * Choose the number of levels for the IOMMU page tables.
> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> -     *   RAM) above the 512G boundary.
> -     * - HVM could in principle use 3 or 4 depending on how much guest
> -     *   physical address space we give it, but this isn't known yet so use 4
> -     *   unilaterally.
> -     * - Unity maps may require an even higher number.
> +     * Choose the number of levels for the IOMMU page tables, taking into
> +     * account unity maps.
>       */
> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> -            is_hvm_domain(d)
> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> -            : get_upper_mfn_bound() + 1),
> -        amd_iommu_min_paging_mode);
> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);

I think these min/max variables can be dropped now we're not doing
variable height IOMMU pagetables, which further simplifies this expression.

Dunno if it's something better folded into this patch, or done at some
point in the future.

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-17 11:55   ` Andrew Cooper
@ 2023-11-20  9:45     ` Jan Beulich
  2023-11-20 10:27       ` Roger Pau Monné
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2023-11-20  9:45 UTC (permalink / raw)
  To: Andrew Cooper, Roger Pau Monne; +Cc: xen-devel

On 17.11.2023 12:55, Andrew Cooper wrote:
> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
>> Using different page table levels for HVM or PV guest is not helpful, and is
>> not inline with the IOMMU implementation used by the other architecture vendor
>> (VT-d).
>>
>> Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
>> page table levels.
>>
>> Note using the max RAM address for PV was bogus anyway, as there's no guarantee
>> there can't be device MMIO or reserved regions past the maximum RAM region.
> 
> Indeed - and the MMIO regions do matter for P2P DMA.

So what about any such living above the 48-bit boundary (i.e. not covered
by DEFAULT_DOMAIN_ADDRESS_WIDTH)?

>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Variable-height IOMMU pagetables are not worth the security
> vulnerabilities they're made of.  I regret not fighting hard enough to
> kill them entirely several years ago...
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, although...
> 
>> ---
>>  xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
>>  1 file changed, 8 insertions(+), 12 deletions(-)
>>
>> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> index 6bc73dc21052..f9e749d74da2 100644
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
>>  static int cf_check amd_iommu_domain_init(struct domain *d)
>>  {
>>      struct domain_iommu *hd = dom_iommu(d);
>> +    int pgmode = amd_iommu_get_paging_mode(
>> +        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));
> 
> "paging mode" comes from the spec, but it's a very backwards way of
> spelling height.
> 
> Can we at least start to improve the comprehensibility by renaming this
> variable.
> 
>> +
>> +    if ( pgmode < 0 )
>> +        return pgmode;
>>  
>>      /*
>> -     * Choose the number of levels for the IOMMU page tables.
>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
>> -     *   RAM) above the 512G boundary.
>> -     * - HVM could in principle use 3 or 4 depending on how much guest
>> -     *   physical address space we give it, but this isn't known yet so use 4
>> -     *   unilaterally.
>> -     * - Unity maps may require an even higher number.
>> +     * Choose the number of levels for the IOMMU page tables, taking into
>> +     * account unity maps.
>>       */
>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
>> -            is_hvm_domain(d)
>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
>> -            : get_upper_mfn_bound() + 1),
>> -        amd_iommu_min_paging_mode);
>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> 
> I think these min/max variables can be dropped now we're not doing
> variable height IOMMU pagetables, which further simplifies this expression.

Did you take unity maps into account? At least $subject and comment looks
to not be consistent in this regard: Either unity maps need considering
specially (and then we don't uniformly use the same depth), or they don't
need mentioning in the comment (anymore).

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-20  9:45     ` Jan Beulich
@ 2023-11-20 10:27       ` Roger Pau Monné
  2023-11-20 10:37         ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2023-11-20 10:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
> On 17.11.2023 12:55, Andrew Cooper wrote:
> > On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> >> Using different page table levels for HVM or PV guest is not helpful, and is
> >> not inline with the IOMMU implementation used by the other architecture vendor
> >> (VT-d).
> >>
> >> Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
> >> page table levels.
> >>
> >> Note using the max RAM address for PV was bogus anyway, as there's no guarantee
> >> there can't be device MMIO or reserved regions past the maximum RAM region.
> > 
> > Indeed - and the MMIO regions do matter for P2P DMA.
> 
> So what about any such living above the 48-bit boundary (i.e. not covered
> by DEFAULT_DOMAIN_ADDRESS_WIDTH)?

That would only work for PV guests AFAICT, as HVM guests will already
refuse to create such mappings even before getting into the IOMMU
code: p2m_pt_set_entry() will return an error as the p2m code only
deals with 4 level page tables.

> 
> >> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > 
> > Variable-height IOMMU pagetables are not worth the security
> > vulnerabilities they're made of.  I regret not fighting hard enough to
> > kill them entirely several years ago...
> > 
> > Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, although...
> > 
> >> ---
> >>  xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
> >>  1 file changed, 8 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> index 6bc73dc21052..f9e749d74da2 100644
> >> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
> >>  static int cf_check amd_iommu_domain_init(struct domain *d)
> >>  {
> >>      struct domain_iommu *hd = dom_iommu(d);
> >> +    int pgmode = amd_iommu_get_paging_mode(
> >> +        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));
> > 
> > "paging mode" comes from the spec, but it's a very backwards way of
> > spelling height.
> > 
> > Can we at least start to improve the comprehensibility by renaming this
> > variable.
> > 
> >> +
> >> +    if ( pgmode < 0 )
> >> +        return pgmode;
> >>  
> >>      /*
> >> -     * Choose the number of levels for the IOMMU page tables.
> >> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> >> -     *   RAM) above the 512G boundary.
> >> -     * - HVM could in principle use 3 or 4 depending on how much guest
> >> -     *   physical address space we give it, but this isn't known yet so use 4
> >> -     *   unilaterally.
> >> -     * - Unity maps may require an even higher number.
> >> +     * Choose the number of levels for the IOMMU page tables, taking into
> >> +     * account unity maps.
> >>       */
> >> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> >> -            is_hvm_domain(d)
> >> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> >> -            : get_upper_mfn_bound() + 1),
> >> -        amd_iommu_min_paging_mode);
> >> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> > 
> > I think these min/max variables can be dropped now we're not doing
> > variable height IOMMU pagetables, which further simplifies this expression.
> 
> Did you take unity maps into account? At least $subject and comment looks
> to not be consistent in this regard: Either unity maps need considering
> specially (and then we don't uniformly use the same depth), or they don't
> need mentioning in the comment (anymore).

Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
will currently only work on PV at best, as HVM p2m code is limited to
4 level page tables, so even if the IOMMU page tables support a
greater address width the call to map such regions will trigger an
error in the p2m code way before attempting to create any IOMMU
mappings.

We could do:

hd->arch.amd.paging_mode =
    is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);

Putting IVMD/RMRR regions that require the usage of 5 level page
tables would be a very short sighted move by vendors IMO.

And will put us back in a situation where PV vs HVM can get different
IOMMU page table levels, which is undesirable.  It might be better to
just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
devices that have IVMD/RMRR regions above that limit.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-20 10:27       ` Roger Pau Monné
@ 2023-11-20 10:37         ` Jan Beulich
  2023-11-20 10:50           ` Roger Pau Monné
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2023-11-20 10:37 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 20.11.2023 11:27, Roger Pau Monné wrote:
> On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
>> On 17.11.2023 12:55, Andrew Cooper wrote:
>>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
>>>>      /*
>>>> -     * Choose the number of levels for the IOMMU page tables.
>>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
>>>> -     *   RAM) above the 512G boundary.
>>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
>>>> -     *   physical address space we give it, but this isn't known yet so use 4
>>>> -     *   unilaterally.
>>>> -     * - Unity maps may require an even higher number.
>>>> +     * Choose the number of levels for the IOMMU page tables, taking into
>>>> +     * account unity maps.
>>>>       */
>>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
>>>> -            is_hvm_domain(d)
>>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
>>>> -            : get_upper_mfn_bound() + 1),
>>>> -        amd_iommu_min_paging_mode);
>>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
>>>
>>> I think these min/max variables can be dropped now we're not doing
>>> variable height IOMMU pagetables, which further simplifies this expression.
>>
>> Did you take unity maps into account? At least $subject and comment looks
>> to not be consistent in this regard: Either unity maps need considering
>> specially (and then we don't uniformly use the same depth), or they don't
>> need mentioning in the comment (anymore).
> 
> Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
> will currently only work on PV at best, as HVM p2m code is limited to
> 4 level page tables, so even if the IOMMU page tables support a
> greater address width the call to map such regions will trigger an
> error in the p2m code way before attempting to create any IOMMU
> mappings.
> 
> We could do:
> 
> hd->arch.amd.paging_mode =
>     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
> 
> Putting IVMD/RMRR regions that require the usage of 5 level page
> tables would be a very short sighted move by vendors IMO.
> 
> And will put us back in a situation where PV vs HVM can get different
> IOMMU page table levels, which is undesirable.  It might be better to
> just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
> devices that have IVMD/RMRR regions above that limit.

That's a possible approach, yes. To be honest, I was actually hoping we'd
move in a different direction: Do away with the entirely arbitrary
DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.

Whether having PV and HVM have uniform depth is indeed desirable is also
not entirely obvious to me. Having looked over patch 3 now, it also
hasn't become clear to me why the change here is actually a (necessary)
prereq.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-20 10:37         ` Jan Beulich
@ 2023-11-20 10:50           ` Roger Pau Monné
  2023-11-20 11:34             ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2023-11-20 10:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Mon, Nov 20, 2023 at 11:37:43AM +0100, Jan Beulich wrote:
> On 20.11.2023 11:27, Roger Pau Monné wrote:
> > On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
> >> On 17.11.2023 12:55, Andrew Cooper wrote:
> >>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> >>>>      /*
> >>>> -     * Choose the number of levels for the IOMMU page tables.
> >>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> >>>> -     *   RAM) above the 512G boundary.
> >>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
> >>>> -     *   physical address space we give it, but this isn't known yet so use 4
> >>>> -     *   unilaterally.
> >>>> -     * - Unity maps may require an even higher number.
> >>>> +     * Choose the number of levels for the IOMMU page tables, taking into
> >>>> +     * account unity maps.
> >>>>       */
> >>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> >>>> -            is_hvm_domain(d)
> >>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> >>>> -            : get_upper_mfn_bound() + 1),
> >>>> -        amd_iommu_min_paging_mode);
> >>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> >>>
> >>> I think these min/max variables can be dropped now we're not doing
> >>> variable height IOMMU pagetables, which further simplifies this expression.
> >>
> >> Did you take unity maps into account? At least $subject and comment looks
> >> to not be consistent in this regard: Either unity maps need considering
> >> specially (and then we don't uniformly use the same depth), or they don't
> >> need mentioning in the comment (anymore).
> > 
> > Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
> > will currently only work on PV at best, as HVM p2m code is limited to
> > 4 level page tables, so even if the IOMMU page tables support a
> > greater address width the call to map such regions will trigger an
> > error in the p2m code way before attempting to create any IOMMU
> > mappings.
> > 
> > We could do:
> > 
> > hd->arch.amd.paging_mode =
> >     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
> > 
> > Putting IVMD/RMRR regions that require the usage of 5 level page
> > tables would be a very short sighted move by vendors IMO.
> > 
> > And will put us back in a situation where PV vs HVM can get different
> > IOMMU page table levels, which is undesirable.  It might be better to
> > just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
> > devices that have IVMD/RMRR regions above that limit.
> 
> That's a possible approach, yes. To be honest, I was actually hoping we'd
> move in a different direction: Do away with the entirely arbitrary
> DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.

Hm, yes, that might be a sensible approach, but right now I don't want
to block this series on such (likely big) piece of work.  I think we
should aim for HVM and PV to have the same IOMMU page table levels,
and that's currently limited by the p2m code only supporting 4 levels.

> Whether having PV and HVM have uniform depth is indeed desirable is also
> not entirely obvious to me. Having looked over patch 3 now, it also
> hasn't become clear to me why the change here is actually a (necessary)
> prereq.

Oh, it's a prereq because I've found AMD systems that have reserved
regions > 512GB, but no RAM past that region.  arch_iommu_hwdom_init()
would fail on those systems when patch 3/3 was applied, as then
reserved regions past the last RAM address are also mapped in
arch_iommu_hwdom_init().

Thanks, Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-20 10:50           ` Roger Pau Monné
@ 2023-11-20 11:34             ` Jan Beulich
  2023-11-20 12:01               ` Roger Pau Monné
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2023-11-20 11:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 20.11.2023 11:50, Roger Pau Monné wrote:
> On Mon, Nov 20, 2023 at 11:37:43AM +0100, Jan Beulich wrote:
>> On 20.11.2023 11:27, Roger Pau Monné wrote:
>>> On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
>>>> On 17.11.2023 12:55, Andrew Cooper wrote:
>>>>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
>>>>>>      /*
>>>>>> -     * Choose the number of levels for the IOMMU page tables.
>>>>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
>>>>>> -     *   RAM) above the 512G boundary.
>>>>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
>>>>>> -     *   physical address space we give it, but this isn't known yet so use 4
>>>>>> -     *   unilaterally.
>>>>>> -     * - Unity maps may require an even higher number.
>>>>>> +     * Choose the number of levels for the IOMMU page tables, taking into
>>>>>> +     * account unity maps.
>>>>>>       */
>>>>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
>>>>>> -            is_hvm_domain(d)
>>>>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
>>>>>> -            : get_upper_mfn_bound() + 1),
>>>>>> -        amd_iommu_min_paging_mode);
>>>>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
>>>>>
>>>>> I think these min/max variables can be dropped now we're not doing
>>>>> variable height IOMMU pagetables, which further simplifies this expression.
>>>>
>>>> Did you take unity maps into account? At least $subject and comment looks
>>>> to not be consistent in this regard: Either unity maps need considering
>>>> specially (and then we don't uniformly use the same depth), or they don't
>>>> need mentioning in the comment (anymore).
>>>
>>> Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
>>> will currently only work on PV at best, as HVM p2m code is limited to
>>> 4 level page tables, so even if the IOMMU page tables support a
>>> greater address width the call to map such regions will trigger an
>>> error in the p2m code way before attempting to create any IOMMU
>>> mappings.
>>>
>>> We could do:
>>>
>>> hd->arch.amd.paging_mode =
>>>     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
>>>
>>> Putting IVMD/RMRR regions that require the usage of 5 level page
>>> tables would be a very short sighted move by vendors IMO.
>>>
>>> And will put us back in a situation where PV vs HVM can get different
>>> IOMMU page table levels, which is undesirable.  It might be better to
>>> just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
>>> devices that have IVMD/RMRR regions above that limit.
>>
>> That's a possible approach, yes. To be honest, I was actually hoping we'd
>> move in a different direction: Do away with the entirely arbitrary
>> DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.
> 
> Hm, yes, that might be a sensible approach, but right now I don't want
> to block this series on such (likely big) piece of work.  I think we
> should aim for HVM and PV to have the same IOMMU page table levels,
> and that's currently limited by the p2m code only supporting 4 levels.

No, I certainly don't mean to introduce a dependency there. Yet what
you do here goes actively against that possible movement in the other
direction: What "actual system properties" are differs between PV and
HVM (host properties vs guest properties), and hence there would
continue to be a (possible) difference in depth between the two.

>> Whether having PV and HVM have uniform depth is indeed desirable is also
>> not entirely obvious to me. Having looked over patch 3 now, it also
>> hasn't become clear to me why the change here is actually a (necessary)
>> prereq.
> 
> Oh, it's a prereq because I've found AMD systems that have reserved
> regions > 512GB, but no RAM past that region.  arch_iommu_hwdom_init()
> would fail on those systems when patch 3/3 was applied, as then
> reserved regions past the last RAM address are also mapped in
> arch_iommu_hwdom_init().

Hmm, interesting. I can't bring together "would fail" and "are also
mapped" though, unless the latter was meant to say "are attempted to
also be mapped", in which case I could at least see room for failure.
Yet still this would then feel like an issue with the last patch alone,
which the change here is merely avoiding (without this being a strict
prereq). Instead I'd expect us to use 4 levels whenever there are any
kind of regions (reserved or not) above 512G. Without disallowing use
of 3 levels on other (smaller) systems.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-20 11:34             ` Jan Beulich
@ 2023-11-20 12:01               ` Roger Pau Monné
  2023-11-20 16:27                 ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2023-11-20 12:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Mon, Nov 20, 2023 at 12:34:45PM +0100, Jan Beulich wrote:
> On 20.11.2023 11:50, Roger Pau Monné wrote:
> > On Mon, Nov 20, 2023 at 11:37:43AM +0100, Jan Beulich wrote:
> >> On 20.11.2023 11:27, Roger Pau Monné wrote:
> >>> On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
> >>>> On 17.11.2023 12:55, Andrew Cooper wrote:
> >>>>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> >>>>>>      /*
> >>>>>> -     * Choose the number of levels for the IOMMU page tables.
> >>>>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> >>>>>> -     *   RAM) above the 512G boundary.
> >>>>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
> >>>>>> -     *   physical address space we give it, but this isn't known yet so use 4
> >>>>>> -     *   unilaterally.
> >>>>>> -     * - Unity maps may require an even higher number.
> >>>>>> +     * Choose the number of levels for the IOMMU page tables, taking into
> >>>>>> +     * account unity maps.
> >>>>>>       */
> >>>>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> >>>>>> -            is_hvm_domain(d)
> >>>>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> >>>>>> -            : get_upper_mfn_bound() + 1),
> >>>>>> -        amd_iommu_min_paging_mode);
> >>>>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> >>>>>
> >>>>> I think these min/max variables can be dropped now we're not doing
> >>>>> variable height IOMMU pagetables, which further simplifies this expression.
> >>>>
> >>>> Did you take unity maps into account? At least $subject and comment looks
> >>>> to not be consistent in this regard: Either unity maps need considering
> >>>> specially (and then we don't uniformly use the same depth), or they don't
> >>>> need mentioning in the comment (anymore).
> >>>
> >>> Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
> >>> will currently only work on PV at best, as HVM p2m code is limited to
> >>> 4 level page tables, so even if the IOMMU page tables support a
> >>> greater address width the call to map such regions will trigger an
> >>> error in the p2m code way before attempting to create any IOMMU
> >>> mappings.
> >>>
> >>> We could do:
> >>>
> >>> hd->arch.amd.paging_mode =
> >>>     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
> >>>
> >>> Putting IVMD/RMRR regions that require the usage of 5 level page
> >>> tables would be a very short sighted move by vendors IMO.
> >>>
> >>> And will put us back in a situation where PV vs HVM can get different
> >>> IOMMU page table levels, which is undesirable.  It might be better to
> >>> just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
> >>> devices that have IVMD/RMRR regions above that limit.
> >>
> >> That's a possible approach, yes. To be honest, I was actually hoping we'd
> >> move in a different direction: Do away with the entirely arbitrary
> >> DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.
> > 
> > Hm, yes, that might be a sensible approach, but right now I don't want
> > to block this series on such (likely big) piece of work.  I think we
> > should aim for HVM and PV to have the same IOMMU page table levels,
> > and that's currently limited by the p2m code only supporting 4 levels.
> 
> No, I certainly don't mean to introduce a dependency there. Yet what
> you do here goes actively against that possible movement in the other
> direction: What "actual system properties" are differs between PV and
> HVM (host properties vs guest properties), and hence there would
> continue to be a (possible) difference in depth between the two.

Might be.  Overall seems like more complexity for a little win.

The simplest option would be to unconditionally use the maximum page
table levels supported by both the CPU and the IOMMU.

> >> Whether having PV and HVM have uniform depth is indeed desirable is also
> >> not entirely obvious to me. Having looked over patch 3 now, it also
> >> hasn't become clear to me why the change here is actually a (necessary)
> >> prereq.
> > 
> > Oh, it's a prereq because I've found AMD systems that have reserved
> > regions > 512GB, but no RAM past that region.  arch_iommu_hwdom_init()
> > would fail on those systems when patch 3/3 was applied, as then
> > reserved regions past the last RAM address are also mapped in
> > arch_iommu_hwdom_init().
> 
> Hmm, interesting. I can't bring together "would fail" and "are also
> mapped" though, unless the latter was meant to say "are attempted to
> also be mapped", in which case I could at least see room for failure.

Yes, "are attempted to also be mapped", and that attempt fails.  I
would assume that "would fail" was already connected to "also mapped",
but maybe it's not clear enough.

> Yet still this would then feel like an issue with the last patch alone,
> which the change here is merely avoiding (without this being a strict
> prereq). Instead I'd expect us to use 4 levels whenever there are any
> kind of regions (reserved or not) above 512G. Without disallowing use
> of 3 levels on other (smaller) systems.

While reserved regions are the ones that made me realize about this
IOMMU page table difference, what about device MMIO regions?

There's no limitation that avoids MMIO regions from living past the
last RAM address, and possibly above the 512GB mark.

If anything for PV we should limit page table levels based on the
supported paddr bits reported by the CPU, but limiting it based on the
memory map seems plain bogus.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM
  2023-11-20 12:01               ` Roger Pau Monné
@ 2023-11-20 16:27                 ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2023-11-20 16:27 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 20.11.2023 13:01, Roger Pau Monné wrote:
> On Mon, Nov 20, 2023 at 12:34:45PM +0100, Jan Beulich wrote:
>> Yet still this would then feel like an issue with the last patch alone,
>> which the change here is merely avoiding (without this being a strict
>> prereq). Instead I'd expect us to use 4 levels whenever there are any
>> kind of regions (reserved or not) above 512G. Without disallowing use
>> of 3 levels on other (smaller) systems.
> 
> While reserved regions are the ones that made me realize about this
> IOMMU page table difference, what about device MMIO regions?
> 
> There's no limitation that avoids MMIO regions from living past the
> last RAM address, and possibly above the 512GB mark.
> 
> If anything for PV we should limit page table levels based on the
> supported paddr bits reported by the CPU, but limiting it based on the
> memory map seems plain bogus.

Right, matches what we were discussing (really it's the paddr_bits reported
to the domain, but I guess we have little reason to alter the host value
especially for Dom0).

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/3] x86/iommu: move xen_in_range() so it can be made static
  2023-11-17  9:47 [PATCH 0/3] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
  2023-11-17  9:47 ` [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM Roger Pau Monne
@ 2023-11-17  9:47 ` Roger Pau Monne
  2023-11-17 12:03   ` Andrew Cooper
  2023-11-17  9:47 ` [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup Roger Pau Monne
  2 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monne @ 2023-11-17  9:47 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Wei Liu,
	Paul Durrant

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/setup.h    |  2 --
 xen/arch/x86/setup.c                | 49 ----------------------------
 xen/drivers/passthrough/x86/iommu.c | 50 +++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index 9a460e4db8f4..4a1600decf6a 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -36,8 +36,6 @@ unsigned long initial_images_nrpages(nodeid_t node);
 void discard_initial_images(void);
 void *bootstrap_map(const module_t *mod);
 
-int xen_in_range(unsigned long mfn);
-
 extern uint8_t kbd_shift_flags;
 
 #ifdef NDEBUG
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index a3d3f797bb1e..54daff3d4942 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -2080,55 +2080,6 @@ void arch_get_xen_caps(xen_capabilities_info_t *info)
     }
 }
 
-int __hwdom_init xen_in_range(unsigned long mfn)
-{
-    paddr_t start, end;
-    int i;
-
-    enum { region_s3, region_ro, region_rw, region_bss, nr_regions };
-    static struct {
-        paddr_t s, e;
-    } xen_regions[nr_regions] __hwdom_initdata;
-
-    /* initialize first time */
-    if ( !xen_regions[0].s )
-    {
-        /* S3 resume code (and other real mode trampoline code) */
-        xen_regions[region_s3].s = bootsym_phys(trampoline_start);
-        xen_regions[region_s3].e = bootsym_phys(trampoline_end);
-
-        /*
-         * This needs to remain in sync with the uses of the same symbols in
-         * - __start_xen() (above)
-         * - is_xen_fixed_mfn()
-         * - tboot_shutdown()
-         */
-
-        /* hypervisor .text + .rodata */
-        xen_regions[region_ro].s = __pa(&_stext);
-        xen_regions[region_ro].e = __pa(&__2M_rodata_end);
-        /* hypervisor .data + .bss */
-        xen_regions[region_rw].s = __pa(&__2M_rwdata_start);
-        xen_regions[region_rw].e = __pa(&__2M_rwdata_end);
-        if ( efi_boot_mem_unused(&start, &end) )
-        {
-            ASSERT(__pa(start) >= xen_regions[region_rw].s);
-            ASSERT(__pa(end) <= xen_regions[region_rw].e);
-            xen_regions[region_rw].e = __pa(start);
-            xen_regions[region_bss].s = __pa(end);
-            xen_regions[region_bss].e = __pa(&__2M_rwdata_end);
-        }
-    }
-
-    start = (paddr_t)mfn << PAGE_SHIFT;
-    end = start + PAGE_SIZE;
-    for ( i = 0; i < nr_regions; i++ )
-        if ( (start < xen_regions[i].e) && (end > xen_regions[i].s) )
-            return 1;
-
-    return 0;
-}
-
 static int __hwdom_init cf_check io_bitmap_cb(
     unsigned long s, unsigned long e, void *ctx)
 {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 857dccb6a465..d70cee9fea77 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -13,6 +13,7 @@
  */
 
 #include <xen/cpu.h>
+#include <xen/efi.h>
 #include <xen/sched.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
@@ -300,6 +301,55 @@ void iommu_identity_map_teardown(struct domain *d)
     }
 }
 
+static int __hwdom_init xen_in_range(unsigned long mfn)
+{
+    paddr_t start, end;
+    int i;
+
+    enum { region_s3, region_ro, region_rw, region_bss, nr_regions };
+    static struct {
+        paddr_t s, e;
+    } xen_regions[nr_regions] __hwdom_initdata;
+
+    /* initialize first time */
+    if ( !xen_regions[0].s )
+    {
+        /* S3 resume code (and other real mode trampoline code) */
+        xen_regions[region_s3].s = bootsym_phys(trampoline_start);
+        xen_regions[region_s3].e = bootsym_phys(trampoline_end);
+
+        /*
+         * This needs to remain in sync with the uses of the same symbols in
+         * - __start_xen()
+         * - is_xen_fixed_mfn()
+         * - tboot_shutdown()
+         */
+
+        /* hypervisor .text + .rodata */
+        xen_regions[region_ro].s = __pa(&_stext);
+        xen_regions[region_ro].e = __pa(&__2M_rodata_end);
+        /* hypervisor .data + .bss */
+        xen_regions[region_rw].s = __pa(&__2M_rwdata_start);
+        xen_regions[region_rw].e = __pa(&__2M_rwdata_end);
+        if ( efi_boot_mem_unused(&start, &end) )
+        {
+            ASSERT(__pa(start) >= xen_regions[region_rw].s);
+            ASSERT(__pa(end) <= xen_regions[region_rw].e);
+            xen_regions[region_rw].e = __pa(start);
+            xen_regions[region_bss].s = __pa(end);
+            xen_regions[region_bss].e = __pa(&__2M_rwdata_end);
+        }
+    }
+
+    start = (paddr_t)mfn << PAGE_SHIFT;
+    end = start + PAGE_SIZE;
+    for ( i = 0; i < nr_regions; i++ )
+        if ( (start < xen_regions[i].e) && (end > xen_regions[i].s) )
+            return 1;
+
+    return 0;
+}
+
 static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
                                                  unsigned long pfn,
                                                  unsigned long max_pfn)
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] x86/iommu: move xen_in_range() so it can be made static
  2023-11-17  9:47 ` [PATCH 2/3] x86/iommu: move xen_in_range() so it can be made static Roger Pau Monne
@ 2023-11-17 12:03   ` Andrew Cooper
  2023-11-17 12:50     ` Roger Pau Monné
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2023-11-17 12:03 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Wei Liu, Paul Durrant

On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> No functional change intended.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

There may only be one caller (after dropping some bogus tboot logic
recently IIRC), but this isn't an IOMMU-specific helper.

See the comment in the middle which shows the other opencoded things
this needs to be kept up to date.  (And I'd hoped to make this common
because every architecture seems to have different bugs opencoding this
calculation.)

Switching to rangesets is fine, but the result still wants to be
something generic, rather than IOMMU-specific.

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] x86/iommu: move xen_in_range() so it can be made static
  2023-11-17 12:03   ` Andrew Cooper
@ 2023-11-17 12:50     ` Roger Pau Monné
  0 siblings, 0 replies; 18+ messages in thread
From: Roger Pau Monné @ 2023-11-17 12:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Jan Beulich, Wei Liu, Paul Durrant

On Fri, Nov 17, 2023 at 12:03:19PM +0000, Andrew Cooper wrote:
> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> > No functional change intended.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> There may only be one caller (after dropping some bogus tboot logic
> recently IIRC), but this isn't an IOMMU-specific helper.
> 
> See the comment in the middle which shows the other opencoded things
> this needs to be kept up to date.  (And I'd hoped to make this common
> because every architecture seems to have different bugs opencoding this
> calculation.)

But those symbols ultimately come from the linker script, and hence
didn't see them that tied to the setup code.  The setup code uses them,
just as the IOMMU will do now, and likely other parts of the code like
livepatch.

> Switching to rangesets is fine, but the result still wants to be
> something generic, rather than IOMMU-specific.

I'm fine with leaving the function in setup.c, this was mostly a
cleanup attempt.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup
  2023-11-17  9:47 [PATCH 0/3] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
  2023-11-17  9:47 ` [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM Roger Pau Monne
  2023-11-17  9:47 ` [PATCH 2/3] x86/iommu: move xen_in_range() so it can be made static Roger Pau Monne
@ 2023-11-17  9:47 ` Roger Pau Monne
  2023-11-17 12:04   ` Roger Pau Monné
                     ` (2 more replies)
  2 siblings, 3 replies; 18+ messages in thread
From: Roger Pau Monne @ 2023-11-17  9:47 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Paul Durrant, Jan Beulich, Andrew Cooper,
	Wei Liu

The current loop that iterates from 0 to the maximum RAM address in order to
setup the IOMMU mappings is highly inefficient, and it will get worse as the
amount of RAM increases.  It's also not accounting for any reserved regions
past the last RAM address.

Instead of iterating over memory addresses, iterate over the memory map regions
and use a rangeset in order to keep track of which ranges need to be identity
mapped in the hardware domain physical address space.

On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
arch_iommu_hwdom_init() in nanoseconds is:

x old
+ new
    N           Min           Max        Median           Avg        Stddev
x   5 2.2364154e+10  2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
+   5       1025012       1033036       1026188     1028276.2     3623.1194
Difference at 95.0% confidence
	-2.26214e+10 +/- 4.42931e+08
	-99.9955% +/- 9.05152e-05%
	(Student's t, pooled s = 3.03701e+08)

Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.

Note there's a change for HVM domains (ie: PVH dom0) that get switched to
create the p2m mappings using map_mmio_regions() instead of
p2m_add_identity_entry(), so that ranges can be mapped with a single function
call if possible.  Note that the interface of map_mmio_regions() doesn't
allow creating read-only mappings, but so far there are no such mappings
created for PVH dom0 in arch_iommu_hwdom_init().

No change intended in the resulting mappings that a hardware domain gets.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/hvm/io.c               |  15 +-
 xen/arch/x86/include/asm/hvm/io.h   |   4 +-
 xen/drivers/passthrough/x86/iommu.c | 355 +++++++++++++++++-----------
 3 files changed, 231 insertions(+), 143 deletions(-)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index d75af83ad01f..7c4b7317b13a 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -364,9 +364,20 @@ static const struct hvm_mmcfg *vpci_mmcfg_find(const struct domain *d,
     return NULL;
 }
 
-bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr)
+int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r)
 {
-    return vpci_mmcfg_find(d, addr);
+    const struct hvm_mmcfg *mmcfg;
+
+    list_for_each_entry ( mmcfg, &d->arch.hvm.mmcfg_regions, next )
+    {
+        int rc = rangeset_remove_range(r, PFN_DOWN(mmcfg->addr),
+                                       PFN_DOWN(mmcfg->addr + mmcfg->size - 1));
+
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
 }
 
 static unsigned int vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
index e5225e75ef26..c9d058fd5695 100644
--- a/xen/arch/x86/include/asm/hvm/io.h
+++ b/xen/arch/x86/include/asm/hvm/io.h
@@ -153,8 +153,8 @@ int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
 /* Destroy tracked MMCFG areas. */
 void destroy_vpci_mmcfg(struct domain *d);
 
-/* Check if an address is between a MMCFG region for a domain. */
-bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr);
+/* Remove MMCFG regions from a given rangeset. */
+int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r);
 
 #endif /* __ASM_X86_HVM_IO_H__ */
 
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index d70cee9fea77..be2c909f61d8 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -301,129 +301,133 @@ void iommu_identity_map_teardown(struct domain *d)
     }
 }
 
-static int __hwdom_init xen_in_range(unsigned long mfn)
+static int __hwdom_init remove_xen_ranges(struct rangeset *r)
 {
     paddr_t start, end;
-    int i;
-
-    enum { region_s3, region_ro, region_rw, region_bss, nr_regions };
-    static struct {
-        paddr_t s, e;
-    } xen_regions[nr_regions] __hwdom_initdata;
+    int rc;
 
-    /* initialize first time */
-    if ( !xen_regions[0].s )
-    {
-        /* S3 resume code (and other real mode trampoline code) */
-        xen_regions[region_s3].s = bootsym_phys(trampoline_start);
-        xen_regions[region_s3].e = bootsym_phys(trampoline_end);
+    /* S3 resume code (and other real mode trampoline code) */
+    rc = rangeset_remove_range(r, PFN_DOWN(bootsym_phys(trampoline_start)),
+                               PFN_DOWN(bootsym_phys(trampoline_end)));
+    if ( rc )
+        return rc;
 
-        /*
-         * This needs to remain in sync with the uses of the same symbols in
-         * - __start_xen()
-         * - is_xen_fixed_mfn()
-         * - tboot_shutdown()
-         */
+    /*
+     * This needs to remain in sync with the uses of the same symbols in
+     * - __start_xen()
+     * - is_xen_fixed_mfn()
+     * - tboot_shutdown()
+     */
+    /* hypervisor .text + .rodata */
+    rc = rangeset_remove_range(r, PFN_DOWN(__pa(&_stext)),
+                               PFN_DOWN(__pa(&__2M_rodata_end)));
+    if ( rc )
+        return rc;
 
-        /* hypervisor .text + .rodata */
-        xen_regions[region_ro].s = __pa(&_stext);
-        xen_regions[region_ro].e = __pa(&__2M_rodata_end);
-        /* hypervisor .data + .bss */
-        xen_regions[region_rw].s = __pa(&__2M_rwdata_start);
-        xen_regions[region_rw].e = __pa(&__2M_rwdata_end);
-        if ( efi_boot_mem_unused(&start, &end) )
-        {
-            ASSERT(__pa(start) >= xen_regions[region_rw].s);
-            ASSERT(__pa(end) <= xen_regions[region_rw].e);
-            xen_regions[region_rw].e = __pa(start);
-            xen_regions[region_bss].s = __pa(end);
-            xen_regions[region_bss].e = __pa(&__2M_rwdata_end);
-        }
+    /* hypervisor .data + .bss */
+    if ( efi_boot_mem_unused(&start, &end) )
+    {
+        ASSERT(__pa(start) >= __pa(&__2M_rwdata_start));
+        rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
+                                   PFN_DOWN(__pa(start)));
+        if ( rc )
+            return rc;
+        ASSERT(__pa(end) <= __pa(&__2M_rwdata_end));
+        rc = rangeset_remove_range(r, PFN_DOWN(__pa(end)),
+                                   PFN_DOWN(__pa(&__2M_rwdata_end)));
+        if ( rc )
+            return rc;
+    }
+    else
+    {
+        rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
+                                   PFN_DOWN(__pa(&__2M_rwdata_end)));
+        if ( rc )
+            return rc;
     }
-
-    start = (paddr_t)mfn << PAGE_SHIFT;
-    end = start + PAGE_SIZE;
-    for ( i = 0; i < nr_regions; i++ )
-        if ( (start < xen_regions[i].e) && (end > xen_regions[i].s) )
-            return 1;
 
     return 0;
 }
 
-static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                                 unsigned long pfn,
-                                                 unsigned long max_pfn)
+static int __hwdom_init map_subtract(unsigned long s, unsigned long e,
+                                     void *data)
 {
-    mfn_t mfn = _mfn(pfn);
-    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
+    struct rangeset *map = data;
 
-    /*
-     * Set up 1:1 mapping for dom0. Default to include only conventional RAM
-     * areas and let RMRRs include needed reserved regions. When set, the
-     * inclusive mapping additionally maps in every pfn up to 4GB except those
-     * that fall in unusable ranges for PV Dom0.
-     */
-    if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return 0;
+    return rangeset_remove_range(map, s, e);
+}
 
-    switch ( type = page_get_ram_type(mfn) )
-    {
-    case RAM_TYPE_UNUSABLE:
-        return 0;
+struct map_data {
+    struct domain *d;
+    unsigned int flush_flags;
+    bool ro;
+};
 
-    case RAM_TYPE_CONVENTIONAL:
-        if ( iommu_hwdom_strict )
-            return 0;
-        break;
+static int __hwdom_init identity_map(unsigned long s, unsigned long e,
+                                     void *data)
+{
+    struct map_data *info = data;
+    struct domain *d = info->d;
+    long rc;
 
-    default:
-        if ( type & RAM_TYPE_RESERVED )
-        {
-            if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                perms = 0;
-        }
-        else if ( is_hvm_domain(d) )
-            return 0;
-        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
-            perms = 0;
-    }
+    if ( iommu_verbose )
+        printk(XENLOG_INFO " [%010lx, %010lx] R%c\n",
+               s, e, info->ro ? 'O' : 'W');
 
-    /* Check that it doesn't overlap with the Interrupt Address Range. */
-    if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return 0;
-    /* ... or the IO-APIC */
-    if ( has_vioapic(d) )
+    if ( paging_mode_translate(d) )
     {
-        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
-            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-                return 0;
+        if ( info->ro )
+        {
+            ASSERT_UNREACHABLE();
+            return 0;
+        }
+        while ( (rc = map_mmio_regions(d, _gfn(s), e - s + 1, _mfn(s))) > 0 )
+        {
+            s += rc;
+            process_pending_softirqs();
+        }
     }
-    else if ( is_pv_domain(d) )
+    else
     {
-        /*
-         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
-         * ones there (also for e.g. HPET in certain cases), so it should also
-         * have such established for IOMMUs.
-         */
-        if ( iomem_access_permitted(d, pfn, pfn) &&
-             rangeset_contains_singleton(mmio_ro_ranges, pfn) )
-            perms = IOMMUF_readable;
+        const unsigned int perms = IOMMUF_readable | IOMMUF_preempt |
+                                   (info->ro ? 0 : IOMMUF_writable);
+
+        if ( info->ro && !iomem_access_permitted(d, s, e) )
+        {
+            /*
+             * Should be more fine grained in order to not map the forbidden
+             * frame instead of rejecting the region as a whole, but it's only
+             * for read-only MMIO regions, which are very limited.
+             */
+            printk(XENLOG_DEBUG
+                   "IOMMU read-only mapping of region [%lx, %lx] forbidden\n",
+                   s, e);
+            return 0;
+        }
+        while ( (rc = iommu_map(d, _dfn(s), _mfn(s), e - s + 1,
+                                perms, &info->flush_flags)) > 0 )
+        {
+            s += rc;
+            process_pending_softirqs();
+        }
     }
-    /*
-     * ... or the PCIe MCFG regions.
-     * TODO: runtime added MMCFG regions are not checked to make sure they
-     * don't overlap with already mapped regions, thus preventing trapping.
-     */
-    if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return 0;
+    ASSERT(rc <= 0);
+    if ( rc )
+        printk(XENLOG_WARNING
+               "IOMMU identity mapping of [%lx, %lx] failed: %ld\n",
+               s, e, rc);
 
-    return perms;
+    /* Ignore errors and attempt to map the remaining regions. */
+    return 0;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn, start, count;
-    unsigned int flush_flags = 0, start_perms = 0;
+    const unsigned long max_pfn = PFN_DOWN(GB(4)) - 1;
+    unsigned int i;
+    struct rangeset *map;
+    struct map_data map_data = { .d = d };
+    int rc;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -447,62 +451,135 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     if ( iommu_hwdom_passthrough )
         return;
 
-    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
-    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
+    map = rangeset_new(NULL, NULL, 0);
+    if ( !map )
+        panic("IOMMU initialization failed unable to allocate rangeset\n");
+
+    if ( iommu_hwdom_inclusive )
+    {
+        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
+        rc = rangeset_add_range(map, 0, max_pfn);
+        if ( rc )
+            panic("IOMMU inclusive mappings can't be added: %d\n",
+                  rc);
+    }
 
-    for ( i = 0, start = 0, count = 0; i < top; )
+    for ( i = 0; i < e820.nr_map; i++ )
     {
-        unsigned long pfn = pdx_to_pfn(i);
-        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
+        struct e820entry entry = e820.map[i];
 
-        if ( !perms )
-            /* nothing */;
-        else if ( paging_mode_translate(d) )
+        switch ( entry.type )
         {
-            int rc;
+        case E820_UNUSABLE:
+            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
+                continue;
 
-            rc = p2m_add_identity_entry(d, pfn,
-                                        perms & IOMMUF_writable ? p2m_access_rw
-                                                                : p2m_access_r,
-                                        0);
+            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
+                                       PFN_DOWN(entry.addr + entry.size - 1));
             if ( rc )
-                printk(XENLOG_WARNING
-                       "%pd: identity mapping of %lx failed: %d\n",
-                       d, pfn, rc);
+                panic("IOMMU failed to remove unusable memory: %d\n",
+                      rc);
+            continue;
+
+        case E820_RESERVED:
+            if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
+                continue;
+            break;
+
+        case E820_RAM:
+            if ( iommu_hwdom_strict )
+                continue;
+            break;
+
+        default:
+            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) >= max_pfn )
+                continue;
+            entry.size = pfn_to_paddr(max_pfn) - 1 - entry.addr;
+            break;
         }
-        else if ( pfn != start + count || perms != start_perms )
-        {
-            long rc;
 
-        commit:
-            while ( (rc = iommu_map(d, _dfn(start), _mfn(start), count,
-                                    start_perms | IOMMUF_preempt,
-                                    &flush_flags)) > 0 )
-            {
-                start += rc;
-                count -= rc;
-                process_pending_softirqs();
-            }
+        if ( iommu_hwdom_inclusive &&
+             PFN_DOWN(entry.addr + entry.size - 1) <= max_pfn )
+            /*
+             * Any range below 4GB is already in the rangeset if using inclusive
+             * mode.
+             */
+            continue;
+
+        rc = rangeset_add_range(map, PFN_DOWN(entry.addr),
+                                PFN_DOWN(entry.addr + entry.size - 1));
+        if ( rc )
+            panic("IOMMU failed to add identity range: %d\n", rc);
+    }
+
+    /* Remove any areas in-use by Xen. */
+    rc = remove_xen_ranges(map);
+    if ( rc )
+        panic("IOMMU failed to remove Xen ranges: %d\n", rc);
+
+    /* Remove any overlap with the Interrupt Address Range. */
+    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
+    if ( rc )
+        panic("IOMMU failed to remove Interrupt Address Range: %d\n",
+              rc);
+
+    /* If emulating IO-APIC(s) make sure the base address is unmapped. */
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+        {
+            rc = rangeset_remove_singleton(map,
+                PFN_DOWN(domain_vioapic(d, i)->base_address));
             if ( rc )
-                printk(XENLOG_WARNING
-                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %ld\n",
-                       d, start, start + count, rc);
-            start = pfn;
-            count = 1;
-            start_perms = perms;
+                panic("IOMMU failed to remove IO-APIC: %d\n",
+                      rc);
         }
-        else
-            ++count;
+    }
 
-        if ( !(++i & 0xfffff) )
-            process_pending_softirqs();
+    if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there (also for e.g. HPET in certain cases), so it should also
+         * have such established for IOMMUs.
+         */
+        rc = rangeset_report_ranges(mmio_ro_ranges, 0, ~0UL, map_subtract, map);
+        if ( rc )
+            panic("IOMMU failed to remove read-only regions: %d\n",
+                  rc);
+    }
+
+    if ( has_vpci(d) )
+    {
+        /*
+         * TODO: runtime added MMCFG regions are not checked to make sure they
+         * don't overlap with already mapped regions, thus preventing trapping.
+         */
+        rc = vpci_subtract_mmcfg(d, map);
+        if ( rc )
+            panic("IOMMU unable to remove MMCFG areas: %d\n", rc);
+    }
 
-        if ( i == top && count )
-            goto commit;
+    if ( iommu_verbose )
+        printk(XENLOG_INFO "d%u: identity mappings for IOMMU:\n",
+               d->domain_id);
+
+    rc = rangeset_report_ranges(map, 0, ~0UL, identity_map, &map_data);
+    if ( rc )
+        panic("IOMMU unable to create mappings: %d\n", rc);
+    if ( is_pv_domain(d) )
+    {
+        map_data.ro = true;
+        rc = rangeset_report_ranges(mmio_ro_ranges, 0, ~0UL, identity_map,
+                                    &map_data);
+        if ( rc )
+            panic("IOMMU unable to create read-only mappings: %d\n", rc);
     }
 
+    rangeset_destroy(map);
+
     /* Use if to avoid compiler warning */
-    if ( iommu_iotlb_flush_all(d, flush_flags) )
+    if ( iommu_iotlb_flush_all(d, map_data.flush_flags) )
         return;
 }
 
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup
  2023-11-17  9:47 ` [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup Roger Pau Monne
@ 2023-11-17 12:04   ` Roger Pau Monné
  2023-11-17 14:34   ` Andrew Cooper
  2023-11-20 10:30   ` Jan Beulich
  2 siblings, 0 replies; 18+ messages in thread
From: Roger Pau Monné @ 2023-11-17 12:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Jan Beulich, Andrew Cooper, Wei Liu

On Fri, Nov 17, 2023 at 10:47:49AM +0100, Roger Pau Monne wrote:
> The current loop that iterates from 0 to the maximum RAM address in order to
> setup the IOMMU mappings is highly inefficient, and it will get worse as the
> amount of RAM increases.  It's also not accounting for any reserved regions
> past the last RAM address.
> 
> Instead of iterating over memory addresses, iterate over the memory map regions
> and use a rangeset in order to keep track of which ranges need to be identity
> mapped in the hardware domain physical address space.
> 
> On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
> arch_iommu_hwdom_init() in nanoseconds is:
> 
> x old
> + new
>     N           Min           Max        Median           Avg        Stddev
> x   5 2.2364154e+10  2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
> +   5       1025012       1033036       1026188     1028276.2     3623.1194
> Difference at 95.0% confidence
> 	-2.26214e+10 +/- 4.42931e+08
> 	-99.9955% +/- 9.05152e-05%
> 	(Student's t, pooled s = 3.03701e+08)
> 
> Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.
> 
> Note there's a change for HVM domains (ie: PVH dom0) that get switched to
> create the p2m mappings using map_mmio_regions() instead of
> p2m_add_identity_entry(), so that ranges can be mapped with a single function
> call if possible.  Note that the interface of map_mmio_regions() doesn't
> allow creating read-only mappings, but so far there are no such mappings
> created for PVH dom0 in arch_iommu_hwdom_init().
> 
> No change intended in the resulting mappings that a hardware domain gets.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/hvm/io.c               |  15 +-
>  xen/arch/x86/include/asm/hvm/io.h   |   4 +-
>  xen/drivers/passthrough/x86/iommu.c | 355 +++++++++++++++++-----------
>  3 files changed, 231 insertions(+), 143 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index d75af83ad01f..7c4b7317b13a 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -364,9 +364,20 @@ static const struct hvm_mmcfg *vpci_mmcfg_find(const struct domain *d,
>      return NULL;
>  }
>  
> -bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr)
> +int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r)
>  {
> -    return vpci_mmcfg_find(d, addr);
> +    const struct hvm_mmcfg *mmcfg;
> +
> +    list_for_each_entry ( mmcfg, &d->arch.hvm.mmcfg_regions, next )
> +    {
> +        int rc = rangeset_remove_range(r, PFN_DOWN(mmcfg->addr),
> +                                       PFN_DOWN(mmcfg->addr + mmcfg->size - 1));
> +
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    return 0;
>  }
>  
>  static unsigned int vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
> diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
> index e5225e75ef26..c9d058fd5695 100644
> --- a/xen/arch/x86/include/asm/hvm/io.h
> +++ b/xen/arch/x86/include/asm/hvm/io.h
> @@ -153,8 +153,8 @@ int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
>  /* Destroy tracked MMCFG areas. */
>  void destroy_vpci_mmcfg(struct domain *d);
>  
> -/* Check if an address is between a MMCFG region for a domain. */
> -bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr);
> +/* Remove MMCFG regions from a given rangeset. */
> +int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r);
>  
>  #endif /* __ASM_X86_HVM_IO_H__ */
>  
> diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
> index d70cee9fea77..be2c909f61d8 100644
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -301,129 +301,133 @@ void iommu_identity_map_teardown(struct domain *d)
>      }
>  }
>  
> -static int __hwdom_init xen_in_range(unsigned long mfn)
> +static int __hwdom_init remove_xen_ranges(struct rangeset *r)
>  {
>      paddr_t start, end;
> -    int i;
> -
> -    enum { region_s3, region_ro, region_rw, region_bss, nr_regions };
> -    static struct {
> -        paddr_t s, e;
> -    } xen_regions[nr_regions] __hwdom_initdata;
> +    int rc;
>  
> -    /* initialize first time */
> -    if ( !xen_regions[0].s )
> -    {
> -        /* S3 resume code (and other real mode trampoline code) */
> -        xen_regions[region_s3].s = bootsym_phys(trampoline_start);
> -        xen_regions[region_s3].e = bootsym_phys(trampoline_end);
> +    /* S3 resume code (and other real mode trampoline code) */
> +    rc = rangeset_remove_range(r, PFN_DOWN(bootsym_phys(trampoline_start)),
> +                               PFN_DOWN(bootsym_phys(trampoline_end)));
> +    if ( rc )
> +        return rc;
>  
> -        /*
> -         * This needs to remain in sync with the uses of the same symbols in
> -         * - __start_xen()
> -         * - is_xen_fixed_mfn()
> -         * - tboot_shutdown()
> -         */
> +    /*
> +     * This needs to remain in sync with the uses of the same symbols in
> +     * - __start_xen()
> +     * - is_xen_fixed_mfn()
> +     * - tboot_shutdown()
> +     */
> +    /* hypervisor .text + .rodata */
> +    rc = rangeset_remove_range(r, PFN_DOWN(__pa(&_stext)),
> +                               PFN_DOWN(__pa(&__2M_rodata_end)));
> +    if ( rc )
> +        return rc;
>  
> -        /* hypervisor .text + .rodata */
> -        xen_regions[region_ro].s = __pa(&_stext);
> -        xen_regions[region_ro].e = __pa(&__2M_rodata_end);
> -        /* hypervisor .data + .bss */
> -        xen_regions[region_rw].s = __pa(&__2M_rwdata_start);
> -        xen_regions[region_rw].e = __pa(&__2M_rwdata_end);
> -        if ( efi_boot_mem_unused(&start, &end) )
> -        {
> -            ASSERT(__pa(start) >= xen_regions[region_rw].s);
> -            ASSERT(__pa(end) <= xen_regions[region_rw].e);
> -            xen_regions[region_rw].e = __pa(start);
> -            xen_regions[region_bss].s = __pa(end);
> -            xen_regions[region_bss].e = __pa(&__2M_rwdata_end);
> -        }
> +    /* hypervisor .data + .bss */
> +    if ( efi_boot_mem_unused(&start, &end) )
> +    {
> +        ASSERT(__pa(start) >= __pa(&__2M_rwdata_start));
> +        rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
> +                                   PFN_DOWN(__pa(start)));
> +        if ( rc )
> +            return rc;
> +        ASSERT(__pa(end) <= __pa(&__2M_rwdata_end));
> +        rc = rangeset_remove_range(r, PFN_DOWN(__pa(end)),
> +                                   PFN_DOWN(__pa(&__2M_rwdata_end)));
> +        if ( rc )
> +            return rc;
> +    }
> +    else
> +    {
> +        rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
> +                                   PFN_DOWN(__pa(&__2M_rwdata_end)));
> +        if ( rc )
> +            return rc;
>      }
> -
> -    start = (paddr_t)mfn << PAGE_SHIFT;
> -    end = start + PAGE_SIZE;
> -    for ( i = 0; i < nr_regions; i++ )
> -        if ( (start < xen_regions[i].e) && (end > xen_regions[i].s) )
> -            return 1;
>  
>      return 0;
>  }
>  
> -static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
> -                                                 unsigned long pfn,
> -                                                 unsigned long max_pfn)
> +static int __hwdom_init map_subtract(unsigned long s, unsigned long e,

Bah, this (and others below) are missing cf_check attribute.

Will fix in v2.

Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup
  2023-11-17  9:47 ` [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup Roger Pau Monne
  2023-11-17 12:04   ` Roger Pau Monné
@ 2023-11-17 14:34   ` Andrew Cooper
  2023-11-20 10:30   ` Jan Beulich
  2 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2023-11-17 14:34 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Paul Durrant, Jan Beulich, Wei Liu

On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> The current loop that iterates from 0 to the maximum RAM address in order to
> setup the IOMMU mappings is highly inefficient, and it will get worse as the
> amount of RAM increases.  It's also not accounting for any reserved regions
> past the last RAM address.
>
> Instead of iterating over memory addresses, iterate over the memory map regions
> and use a rangeset in order to keep track of which ranges need to be identity
> mapped in the hardware domain physical address space.
>
> On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
> arch_iommu_hwdom_init() in nanoseconds is:
>
> x old
> + new
>     N           Min           Max        Median           Avg        Stddev
> x   5 2.2364154e+10  2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
> +   5       1025012       1033036       1026188     1028276.2     3623.1194
> Difference at 95.0% confidence
> 	-2.26214e+10 +/- 4.42931e+08
> 	-99.9955% +/- 9.05152e-05%
> 	(Student's t, pooled s = 3.03701e+08)
>
> Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.
>
> Note there's a change for HVM domains (ie: PVH dom0) that get switched to
> create the p2m mappings using map_mmio_regions() instead of
> p2m_add_identity_entry(), so that ranges can be mapped with a single function
> call if possible.  Note that the interface of map_mmio_regions() doesn't
> allow creating read-only mappings, but so far there are no such mappings
> created for PVH dom0 in arch_iommu_hwdom_init().
>
> No change intended in the resulting mappings that a hardware domain gets.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Very nice numbers.  And yes - straight line performance like this (good
or bad) is all about the innermost loop.

Sadly, the patch diff is horrible to read.  Patch 2 remaining in common
code will improve this a little, but probably not very much.

If there are no better ideas, it's probably best to split into 3
patches, being:

1) Introduce new rangeset forms of existing operations
2) Rewrite arch_iommu_hwdom_init() to use rangesets
3) Delete old mfn forms

That at least means that the new and the old forms aren't expressed as a
delta against each-other.

~Andrew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup
  2023-11-17  9:47 ` [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup Roger Pau Monne
  2023-11-17 12:04   ` Roger Pau Monné
  2023-11-17 14:34   ` Andrew Cooper
@ 2023-11-20 10:30   ` Jan Beulich
  2023-11-20 10:43     ` Roger Pau Monné
  2 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2023-11-20 10:30 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Paul Durrant, Andrew Cooper, Wei Liu, xen-devel

On 17.11.2023 10:47, Roger Pau Monne wrote:
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -364,9 +364,20 @@ static const struct hvm_mmcfg *vpci_mmcfg_find(const struct domain *d,
>      return NULL;
>  }
>  
> -bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr)
> +int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r)

While there, also add __hwdom_init?

> @@ -447,62 +451,135 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>      if ( iommu_hwdom_passthrough )
>          return;
>  
> -    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
> -    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
> +    map = rangeset_new(NULL, NULL, 0);
> +    if ( !map )
> +        panic("IOMMU initialization failed unable to allocate rangeset\n");

This reads a little odd, and could probably do with shortening anyway
(e.g. "IOMMU init: unable to allocate rangeset\n").

> +    if ( iommu_hwdom_inclusive )
> +    {
> +        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
> +        rc = rangeset_add_range(map, 0, max_pfn);
> +        if ( rc )
> +            panic("IOMMU inclusive mappings can't be added: %d\n",
> +                  rc);
> +    }
>  
> -    for ( i = 0, start = 0, count = 0; i < top; )
> +    for ( i = 0; i < e820.nr_map; i++ )
>      {
> -        unsigned long pfn = pdx_to_pfn(i);
> -        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> +        struct e820entry entry = e820.map[i];
>  
> -        if ( !perms )
> -            /* nothing */;
> -        else if ( paging_mode_translate(d) )
> +        switch ( entry.type )
>          {
> -            int rc;
> +        case E820_UNUSABLE:
> +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
> +                continue;

It's > here, but ...

> -            rc = p2m_add_identity_entry(d, pfn,
> -                                        perms & IOMMUF_writable ? p2m_access_rw
> -                                                                : p2m_access_r,
> -                                        0);
> +            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
> +                                       PFN_DOWN(entry.addr + entry.size - 1));
>              if ( rc )
> -                printk(XENLOG_WARNING
> -                       "%pd: identity mapping of %lx failed: %d\n",
> -                       d, pfn, rc);
> +                panic("IOMMU failed to remove unusable memory: %d\n",
> +                      rc);
> +            continue;
> +
> +        case E820_RESERVED:
> +            if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> +                continue;
> +            break;
> +
> +        case E820_RAM:
> +            if ( iommu_hwdom_strict )
> +                continue;
> +            break;
> +
> +        default:
> +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) >= max_pfn )
> +                continue;

... >= here?

> +            entry.size = pfn_to_paddr(max_pfn) - 1 - entry.addr;

Why the subtraction of 1 when you're calculating size? Don't you actually
need to add 1 to max_pfn before converting to paddr?

While overall things look plausible elsewhere, I'm hoping that - as asked
for by Andrew - it'll be possible to split this some, to make it a little
more reasonable to actually look at the changes done.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup
  2023-11-20 10:30   ` Jan Beulich
@ 2023-11-20 10:43     ` Roger Pau Monné
  0 siblings, 0 replies; 18+ messages in thread
From: Roger Pau Monné @ 2023-11-20 10:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, Andrew Cooper, Wei Liu, xen-devel

On Mon, Nov 20, 2023 at 11:30:16AM +0100, Jan Beulich wrote:
> On 17.11.2023 10:47, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -364,9 +364,20 @@ static const struct hvm_mmcfg *vpci_mmcfg_find(const struct domain *d,
> >      return NULL;
> >  }
> >  
> > -bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr)
> > +int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r)
> 
> While there, also add __hwdom_init?
> 
> > @@ -447,62 +451,135 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >      if ( iommu_hwdom_passthrough )
> >          return;
> >  
> > -    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
> > -    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
> > +    map = rangeset_new(NULL, NULL, 0);
> > +    if ( !map )
> > +        panic("IOMMU initialization failed unable to allocate rangeset\n");
> 
> This reads a little odd, and could probably do with shortening anyway
> (e.g. "IOMMU init: unable to allocate rangeset\n").
> 
> > +    if ( iommu_hwdom_inclusive )
> > +    {
> > +        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
> > +        rc = rangeset_add_range(map, 0, max_pfn);
> > +        if ( rc )
> > +            panic("IOMMU inclusive mappings can't be added: %d\n",
> > +                  rc);
> > +    }
> >  
> > -    for ( i = 0, start = 0, count = 0; i < top; )
> > +    for ( i = 0; i < e820.nr_map; i++ )
> >      {
> > -        unsigned long pfn = pdx_to_pfn(i);
> > -        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> > +        struct e820entry entry = e820.map[i];
> >  
> > -        if ( !perms )
> > -            /* nothing */;
> > -        else if ( paging_mode_translate(d) )
> > +        switch ( entry.type )
> >          {
> > -            int rc;
> > +        case E820_UNUSABLE:
> > +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
> > +                continue;
> 
> It's > here, but ...
> 
> > -            rc = p2m_add_identity_entry(d, pfn,
> > -                                        perms & IOMMUF_writable ? p2m_access_rw
> > -                                                                : p2m_access_r,
> > -                                        0);
> > +            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
> > +                                       PFN_DOWN(entry.addr + entry.size - 1));
> >              if ( rc )
> > -                printk(XENLOG_WARNING
> > -                       "%pd: identity mapping of %lx failed: %d\n",
> > -                       d, pfn, rc);
> > +                panic("IOMMU failed to remove unusable memory: %d\n",
> > +                      rc);
> > +            continue;
> > +
> > +        case E820_RESERVED:
> > +            if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> > +                continue;
> > +            break;
> > +
> > +        case E820_RAM:
> > +            if ( iommu_hwdom_strict )
> > +                continue;
> > +            break;
> > +
> > +        default:
> > +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) >= max_pfn )
> > +                continue;
> 
> ... >= here?

Oh, this was a leftover from a previous slightly different logic,
there's no need for such default case anymore, as the whole range
below 4GB is already added to the rangeset in the inclusive case, and
holes are poked for unusable regions.

> > +            entry.size = pfn_to_paddr(max_pfn) - 1 - entry.addr;
> 
> Why the subtraction of 1 when you're calculating size? Don't you actually
> need to add 1 to max_pfn before converting to paddr?

Previous code didn't had the - 1 in max_pfn definition, but again the
default case is no longer needed.

> 
> While overall things look plausible elsewhere, I'm hoping that - as asked
> for by Andrew - it'll be possible to split this some, to make it a little
> more reasonable to actually look at the changes done.

Will try to split it somehow.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-11-20 16:27 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-17  9:47 [PATCH 0/3] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
2023-11-17  9:47 ` [PATCH 1/3] amd-vi: use the same IOMMU page table levels for PV and HVM Roger Pau Monne
2023-11-17 11:55   ` Andrew Cooper
2023-11-20  9:45     ` Jan Beulich
2023-11-20 10:27       ` Roger Pau Monné
2023-11-20 10:37         ` Jan Beulich
2023-11-20 10:50           ` Roger Pau Monné
2023-11-20 11:34             ` Jan Beulich
2023-11-20 12:01               ` Roger Pau Monné
2023-11-20 16:27                 ` Jan Beulich
2023-11-17  9:47 ` [PATCH 2/3] x86/iommu: move xen_in_range() so it can be made static Roger Pau Monne
2023-11-17 12:03   ` Andrew Cooper
2023-11-17 12:50     ` Roger Pau Monné
2023-11-17  9:47 ` [PATCH 3/3] x86/iommu: use a rangeset for hwdom setup Roger Pau Monne
2023-11-17 12:04   ` Roger Pau Monné
2023-11-17 14:34   ` Andrew Cooper
2023-11-20 10:30   ` Jan Beulich
2023-11-20 10:43     ` Roger Pau Monné

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.