All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
       [not found] <1440776507-30218-1-git-send-email-anshul.makkar@citrix.com>
@ 2015-08-28 16:24 ` Andrew Cooper
  2015-08-31  8:09 ` Jan Beulich
  1 sibling, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2015-08-28 16:24 UTC (permalink / raw)
  To: "Anshul Makkar anshul.makkar", xen-devel
  Cc: yang.z.zhang, kevin.tian, anshulma, jbeulich

On 28/08/15 16:41, "Anshul Makkar anshul.makkar"@citrix.com wrote:
> From: anshulma <anshul.makkar@citrix.com>
>
> Sandybridge or earlier processors don't have huge page support for
> IOTLB which leads to fallback on 4k pages and causes performance issues.
>
> Shared EPT will be disabled only if the user has not provided explicit
> choice on the command line.
>
> Signed-off-by: Anshul Makkar <anshul.makkar@citrix.com>

As a note concerning the performance issues, for some IO workloads, this
nets a 40% throughput improvement.

We did not observe any IO workloads which had a worse performance as a
result of disabling shared ept.

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
       [not found] <1440776507-30218-1-git-send-email-anshul.makkar@citrix.com>
  2015-08-28 16:24 ` [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors Andrew Cooper
@ 2015-08-31  8:09 ` Jan Beulich
  2015-09-01 14:18   ` Andrew Cooper
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-08-31  8:09 UTC (permalink / raw)
  To: Anshul ; +Cc: yang.z.zhang, andrew.cooper3, anshulma, kevin.tian, xen-devel

>>> On 28.08.15 at 17:41, <Anshul Makkar anshul.makkar@citrix.com> wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -896,7 +896,7 @@ debug hypervisor only).
>  
>  > `sharept`
>  
> -> Default: `true`
> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or earlier.

This neglects the AMD side.

> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -54,6 +54,7 @@ bool_t __read_mostly iommu_intremap = 1;
>  bool_t __read_mostly iommu_hap_pt_share = 1;
>  bool_t __read_mostly iommu_debug;
>  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
> +bool_t __read_mostly iommu_sharept_set = 0;

_If_ you really want to introduce a new variable, it ought to be
__initdata (since only __init functions reference it). My preference
though would be to make the existing variable a tristate, starting
out with a value of -1 (and having type s8).

> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -320,6 +320,19 @@ void __init platform_quirks_init(void)
>      /* Tylersburg interrupt remap quirk */
>      if ( iommu_intremap )
>          tylersburg_intremap_quirk();
> +
> +    /*
> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
> +     * by default.
> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
> +     * on 4k pages and leads to performance degradation.
> +     *
> +     * Shared EPT ("sharept") will be disabled only if user has not
> +     * provided explicit choice on the command line.
> +     */
> +    if ( (boot_cpu_data.x86 == 6) &&
> +         (boot_cpu_data.x86_model <= 0x2a) && !iommu_sharept_set )
> +        iommu_hap_pt_share = 0;

Model 0x2d certainly is also Sandybridge. Models 0x2c, 0x2e, and
0x2f are even older architectures. And then there are various
Atoms at higher numbers which I'm not sure can be considered
"newer than Sandybridge" architecture wise. I.e. I don't think you
can get away with a simple, single relation here.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-08-31  8:09 ` Jan Beulich
@ 2015-09-01 14:18   ` Andrew Cooper
  2015-09-01 14:55     ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2015-09-01 14:18 UTC (permalink / raw)
  To: Jan Beulich, "Anshul Makkar anshul.makkar"
  Cc: yang.z.zhang, anshulma, kevin.tian, xen-devel

On 31/08/15 09:09, Jan Beulich wrote:
>>>> On 28.08.15 at 17:41, <Anshul Makkar anshul.makkar@citrix.com> wrote:
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -896,7 +896,7 @@ debug hypervisor only).
>>  
>>  > `sharept`
>>  
>> -> Default: `true`
>> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or earlier.
> This neglects the AMD side.

The AMD side has iommu_hap_pt_share unconditionally disabled, for
reasons pertaining to grant mapped frames.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-09-01 14:18   ` Andrew Cooper
@ 2015-09-01 14:55     ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2015-09-01 14:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: yang.z.zhang, anshulma, kevin.tian, xen-devel

>>> On 01.09.15 at 16:18, <andrew.cooper3@citrix.com> wrote:
> On 31/08/15 09:09, Jan Beulich wrote:
>>>>> On 28.08.15 at 17:41, <Anshul Makkar anshul.makkar@citrix.com> wrote:
>>> --- a/docs/misc/xen-command-line.markdown
>>> +++ b/docs/misc/xen-command-line.markdown
>>> @@ -896,7 +896,7 @@ debug hypervisor only).
>>>  
>>>  > `sharept`
>>>  
>>> -> Default: `true`
>>> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or earlier.
>> This neglects the AMD side.
> 
> The AMD side has iommu_hap_pt_share unconditionally disabled, for
> reasons pertaining to grant mapped frames.

Yet that doesn't eliminate the desire to have the correct default
spelled out here.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
@ 2015-11-24 17:17 Anshul
  2015-11-24 17:41 ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Anshul @ 2015-11-24 17:17 UTC (permalink / raw)
  To: xen-devel
  Cc: yang.z.zhang, andrew.cooper3, kevin.tian, Anshul Makkar, jbeulich

From: Anshul Makkar <anshul.makkar@citrix.com>

Sandybridge or earlier processors don't have huge page support for
IOTLB which leads to fallback on 4k pages and causes performance issues.

Shared EPT will be disabled only if the user has not provided explicit
choice on the command line.

Signed-off-by: Anshul Makkar <anshul.makkar@citrix.com>
---
v2:
   * Removed the use of extra variable to control the shared EPT and made
     the existent variable as tristate.
   * Narrowed down the check for processors to Sandybridge and older including
     Atom processors.

 docs/misc/xen-command-line.markdown  |  2 +-
 xen/drivers/passthrough/iommu.c      |  2 +-
 xen/drivers/passthrough/vtd/quirks.c | 14 ++++++++++++++
 xen/include/xen/iommu.h              |  2 +-
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index a2e427c..6b69ba2 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -896,7 +896,7 @@ debug hypervisor only).
 
 > `sharept`
 
-> Default: `true`
+> Default: `true` if newer than SandyBridge or `false` if Sandybridge or earlier.
 
 >> Control whether CPU and IOMMU page tables should be shared.
 
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index d5137733..9367987 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -51,7 +51,7 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
-bool_t __read_mostly iommu_hap_pt_share = 1;
+s8     __read_mostly iommu_hap_pt_share = -1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
 
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index 1888843..7d63c8d 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
     /* Tylersburg interrupt remap quirk */
     if ( iommu_intremap )
         tylersburg_intremap_quirk();
+
+    /*
+     * Disable shared EPT ("sharept") on Sandybridge and older processors
+     * by default.
+     * SandyBridge has no huge page support for IOTLB which leads to fallback
+     * on 4k pages and leads to performance degradation.
+     *
+     * Shared EPT ("sharept") will be disabled only if user has not
+     * provided explicit choice on the command line thus iommu_hap_pt_share is
+     * at its initialized value of -1.
+     */
+    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
+          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
+        iommu_hap_pt_share = 0;
 }
 
 /*
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..d52d06f 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -31,7 +31,7 @@ extern bool_t iommu_enable, iommu_enabled;
 extern bool_t force_iommu, iommu_verbose;
 extern bool_t iommu_workaround_bios_bug, iommu_igfx, iommu_passthrough;
 extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
-extern bool_t iommu_hap_pt_share;
+extern s8 iommu_hap_pt_share;
 extern bool_t iommu_debug;
 extern bool_t amd_iommu_perdev_intremap;
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-24 17:17 Anshul
@ 2015-11-24 17:41 ` Jan Beulich
  2015-11-25 10:28   ` Andrew Cooper
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-11-24 17:41 UTC (permalink / raw)
  To: Anshul ; +Cc: yang.z.zhang, andrew.cooper3, Anshul Makkar, kevin.tian,
	xen-devel

>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>      /* Tylersburg interrupt remap quirk */
>      if ( iommu_intremap )
>          tylersburg_intremap_quirk();
> +
> +    /*
> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
> +     * by default.
> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
> +     * on 4k pages and leads to performance degradation.
> +     *
> +     * Shared EPT ("sharept") will be disabled only if user has not
> +     * provided explicit choice on the command line thus iommu_hap_pt_share is
> +     * at its initialized value of -1.
> +     */
> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
> +          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
> +        iommu_hap_pt_share = 0;

If we really want to do this, then I think we should key this on
EPT but not VT-d having 2M support, instead of on CPU models.

Also - with the above only marginally relevant - the line split and/or
indentation is wrong.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-24 17:41 ` Jan Beulich
@ 2015-11-25 10:28   ` Andrew Cooper
  2015-11-25 10:49     ` Jan Beulich
  2015-11-26 13:46     ` Jan Beulich
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Cooper @ 2015-11-25 10:28 UTC (permalink / raw)
  To: Jan Beulich, "Anshul Makkar anshul.makkar"
  Cc: yang.z.zhang, Anshul Makkar, kevin.tian, xen-devel

On 24/11/15 17:41, Jan Beulich wrote:
>>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
>> --- a/xen/drivers/passthrough/vtd/quirks.c
>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>      /* Tylersburg interrupt remap quirk */
>>      if ( iommu_intremap )
>>          tylersburg_intremap_quirk();
>> +
>> +    /*
>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
>> +     * by default.
>> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
>> +     * on 4k pages and leads to performance degradation.
>> +     *
>> +     * Shared EPT ("sharept") will be disabled only if user has not
>> +     * provided explicit choice on the command line thus iommu_hap_pt_share is
>> +     * at its initialized value of -1.
>> +     */
>> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
>> +          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
>> +        iommu_hap_pt_share = 0;
> If we really want to do this, then I think we should key this on
> EPT but not VT-d having 2M support, instead of on CPU models.

This check is already performed by vtd_ept_page_compatible()

The problem is that SandyBridge IOMMUs advertise 2M support and do
function with it, but cannot cache 2MB translations in the IOTLBs.

As a result, attempting to use 2M translations causes substantially
worse performance than 4K translations.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-25 10:28   ` Andrew Cooper
@ 2015-11-25 10:49     ` Jan Beulich
  2015-11-25 15:13       ` Andrew Cooper
  2015-11-26 13:46     ` Jan Beulich
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-11-25 10:49 UTC (permalink / raw)
  To: Andrew Cooper, Anshul Makkar; +Cc: yang.z.zhang, kevin.tian, xen-devel

>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> On 24/11/15 17:41, Jan Beulich wrote:
>>>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
>>> --- a/xen/drivers/passthrough/vtd/quirks.c
>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>>      /* Tylersburg interrupt remap quirk */
>>>      if ( iommu_intremap )
>>>          tylersburg_intremap_quirk();
>>> +
>>> +    /*
>>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
>>> +     * by default.
>>> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
>>> +     * on 4k pages and leads to performance degradation.
>>> +     *
>>> +     * Shared EPT ("sharept") will be disabled only if user has not
>>> +     * provided explicit choice on the command line thus iommu_hap_pt_share is
>>> +     * at its initialized value of -1.
>>> +     */
>>> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
>>> +          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
>>> +        iommu_hap_pt_share = 0;
>> If we really want to do this, then I think we should key this on
>> EPT but not VT-d having 2M support, instead of on CPU models.
> 
> This check is already performed by vtd_ept_page_compatible()

Yeah, I realized there would be such a check on the way home.

> The problem is that SandyBridge IOMMUs advertise 2M support and do
> function with it, but cannot cache 2MB translations in the IOTLBs.
> 
> As a result, attempting to use 2M translations causes substantially
> worse performance than 4K translations.

So commit message and comment should make this more explicit,
to avoid the impression "IOTLB" isn't just the relatively common
mis-naming of "IOMMU".

Plus I guess the sharing won't need suppressing if !opt_hap_2mb?

Further the model based check is relatively broad, and includes
Atoms (0x36 actually is one), which can't be considered "Sandybridge
or older" imo.

And finally I'm not fully convinced using CPU model info to deduce
chipset behavior is entirely correct (albeit perhaps in practice it'll
be fine except maybe when running Xen itself virtualized).

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-25 10:49     ` Jan Beulich
@ 2015-11-25 15:13       ` Andrew Cooper
  2015-11-25 15:38         ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2015-11-25 15:13 UTC (permalink / raw)
  To: Jan Beulich, Anshul Makkar; +Cc: yang.z.zhang, kevin.tian, xen-devel

On 25/11/15 10:49, Jan Beulich wrote:
>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>> On 24/11/15 17:41, Jan Beulich wrote:
>>>>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>>>      /* Tylersburg interrupt remap quirk */
>>>>      if ( iommu_intremap )
>>>>          tylersburg_intremap_quirk();
>>>> +
>>>> +    /*
>>>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
>>>> +     * by default.
>>>> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
>>>> +     * on 4k pages and leads to performance degradation.
>>>> +     *
>>>> +     * Shared EPT ("sharept") will be disabled only if user has not
>>>> +     * provided explicit choice on the command line thus iommu_hap_pt_share is
>>>> +     * at its initialized value of -1.
>>>> +     */
>>>> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
>>>> +          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
>>>> +        iommu_hap_pt_share = 0;
>>> If we really want to do this, then I think we should key this on
>>> EPT but not VT-d having 2M support, instead of on CPU models.
>> This check is already performed by vtd_ept_page_compatible()
> Yeah, I realized there would be such a check on the way home.
>
>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>
>> As a result, attempting to use 2M translations causes substantially
>> worse performance than 4K translations.
> So commit message and comment should make this more explicit,
> to avoid the impression "IOTLB" isn't just the relatively common
> mis-naming of "IOMMU".
>
> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
>
> Further the model based check is relatively broad, and includes
> Atoms (0x36 actually is one), which can't be considered "Sandybridge
> or older" imo.
>
> And finally I'm not fully convinced using CPU model info to deduce
> chipset behavior is entirely correct (albeit perhaps in practice it'll
> be fine except maybe when running Xen itself virtualized).

What else would you suggest? I can't think of any better identifying
information.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-25 15:13       ` Andrew Cooper
@ 2015-11-25 15:38         ` Jan Beulich
  2015-11-25 15:58           ` Malcolm Crossley
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-11-25 15:38 UTC (permalink / raw)
  To: Andrew Cooper, Anshul Makkar; +Cc: yang.z.zhang, kevin.tian, xen-devel

>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
> On 25/11/15 10:49, Jan Beulich wrote:
>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>>> On 24/11/15 17:41, Jan Beulich wrote:
>>>>>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>>>>      /* Tylersburg interrupt remap quirk */
>>>>>      if ( iommu_intremap )
>>>>>          tylersburg_intremap_quirk();
>>>>> +
>>>>> +    /*
>>>>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
>>>>> +     * by default.
>>>>> +     * SandyBridge has no huge page support for IOTLB which leads to 
> fallback
>>>>> +     * on 4k pages and leads to performance degradation.
>>>>> +     *
>>>>> +     * Shared EPT ("sharept") will be disabled only if user has not
>>>>> +     * provided explicit choice on the command line thus iommu_hap_pt_share 
> is
>>>>> +     * at its initialized value of -1.
>>>>> +     */
>>>>> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
>>>>> +          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
>>>>> +        iommu_hap_pt_share = 0;
>>>> If we really want to do this, then I think we should key this on
>>>> EPT but not VT-d having 2M support, instead of on CPU models.
>>> This check is already performed by vtd_ept_page_compatible()
>> Yeah, I realized there would be such a check on the way home.
>>
>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>
>>> As a result, attempting to use 2M translations causes substantially
>>> worse performance than 4K translations.
>> So commit message and comment should make this more explicit,
>> to avoid the impression "IOTLB" isn't just the relatively common
>> mis-naming of "IOMMU".
>>
>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
>>
>> Further the model based check is relatively broad, and includes
>> Atoms (0x36 actually is one), which can't be considered "Sandybridge
>> or older" imo.
>>
>> And finally I'm not fully convinced using CPU model info to deduce
>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>> be fine except maybe when running Xen itself virtualized).
> 
> What else would you suggest? I can't think of any better identifying
> information.

Chipset IDs / revisions?

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-25 15:38         ` Jan Beulich
@ 2015-11-25 15:58           ` Malcolm Crossley
  2015-11-26  7:17             ` Tian, Kevin
  2015-11-26  8:45             ` Jan Beulich
  0 siblings, 2 replies; 35+ messages in thread
From: Malcolm Crossley @ 2015-11-25 15:58 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Anshul Makkar
  Cc: yang.z.zhang, kevin.tian, xen-devel

On 25/11/15 15:38, Jan Beulich wrote:
>>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
>> On 25/11/15 10:49, Jan Beulich wrote:
>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>>>> On 24/11/15 17:41, Jan Beulich wrote:
>>>>>>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
>>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
>>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>>>>>      /* Tylersburg interrupt remap quirk */
>>>>>>      if ( iommu_intremap )
>>>>>>          tylersburg_intremap_quirk();
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
>>>>>> +     * by default.
>>>>>> +     * SandyBridge has no huge page support for IOTLB which leads to 
>> fallback
>>>>>> +     * on 4k pages and leads to performance degradation.
>>>>>> +     *
>>>>>> +     * Shared EPT ("sharept") will be disabled only if user has not
>>>>>> +     * provided explicit choice on the command line thus iommu_hap_pt_share 
>> is
>>>>>> +     * at its initialized value of -1.
>>>>>> +     */
>>>>>> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
>>>>>> +          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
>>>>>> +        iommu_hap_pt_share = 0;
>>>>> If we really want to do this, then I think we should key this on
>>>>> EPT but not VT-d having 2M support, instead of on CPU models.
>>>> This check is already performed by vtd_ept_page_compatible()
>>> Yeah, I realized there would be such a check on the way home.
>>>
>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>
>>>> As a result, attempting to use 2M translations causes substantially
>>>> worse performance than 4K translations.
>>> So commit message and comment should make this more explicit,
>>> to avoid the impression "IOTLB" isn't just the relatively common
>>> mis-naming of "IOMMU".
>>>
>>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
>>>
>>> Further the model based check is relatively broad, and includes
>>> Atoms (0x36 actually is one), which can't be considered "Sandybridge
>>> or older" imo.
>>>
>>> And finally I'm not fully convinced using CPU model info to deduce
>>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>>> be fine except maybe when running Xen itself virtualized).
>>
>> What else would you suggest? I can't think of any better identifying
>> information.
> 
> Chipset IDs / revisions?

In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
Unfortunately there's no register to query the IOTLB configuration of the IOMMU
and so we're stuck identifying the via the processor model number itself.

Malcolm

> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-25 15:58           ` Malcolm Crossley
@ 2015-11-26  7:17             ` Tian, Kevin
  2015-12-01 16:45               ` Anshul Makkar
  2015-11-26  8:45             ` Jan Beulich
  1 sibling, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2015-11-26  7:17 UTC (permalink / raw)
  To: Malcolm Crossley, Jan Beulich, Andrew Cooper, Anshul Makkar
  Cc: Zhang, Yang Z, xen-devel@lists.xen.org

> From: Malcolm Crossley [mailto:malcolm.crossley@citrix.com]
> Sent: Wednesday, November 25, 2015 11:59 PM
> 
> On 25/11/15 15:38, Jan Beulich wrote:
> >>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
> >> On 25/11/15 10:49, Jan Beulich wrote:
> >>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> >>>> On 24/11/15 17:41, Jan Beulich wrote:
> >>>>>>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
> >>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
> >>>>>>      /* Tylersburg interrupt remap quirk */
> >>>>>>      if ( iommu_intremap )
> >>>>>>          tylersburg_intremap_quirk();
> >>>>>> +
> >>>>>> +    /*
> >>>>>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
> >>>>>> +     * by default.
> >>>>>> +     * SandyBridge has no huge page support for IOTLB which leads to
> >> fallback
> >>>>>> +     * on 4k pages and leads to performance degradation.
> >>>>>> +     *
> >>>>>> +     * Shared EPT ("sharept") will be disabled only if user has not
> >>>>>> +     * provided explicit choice on the command line thus iommu_hap_pt_share
> >> is
> >>>>>> +     * at its initialized value of -1.
> >>>>>> +     */
> >>>>>> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F
> ||
> >>>>>> +          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share ==
> -1) )
> >>>>>> +        iommu_hap_pt_share = 0;
> >>>>> If we really want to do this, then I think we should key this on
> >>>>> EPT but not VT-d having 2M support, instead of on CPU models.
> >>>> This check is already performed by vtd_ept_page_compatible()
> >>> Yeah, I realized there would be such a check on the way home.
> >>>
> >>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>>
> >>>> As a result, attempting to use 2M translations causes substantially
> >>>> worse performance than 4K translations.
> >>> So commit message and comment should make this more explicit,
> >>> to avoid the impression "IOTLB" isn't just the relatively common
> >>> mis-naming of "IOMMU".
> >>>
> >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
> >>>
> >>> Further the model based check is relatively broad, and includes
> >>> Atoms (0x36 actually is one), which can't be considered "Sandybridge
> >>> or older" imo.
> >>>
> >>> And finally I'm not fully convinced using CPU model info to deduce
> >>> chipset behavior is entirely correct (albeit perhaps in practice it'll
> >>> be fine except maybe when running Xen itself virtualized).
> >>
> >> What else would you suggest? I can't think of any better identifying
> >> information.
> >
> > Chipset IDs / revisions?
> 
> In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
> Unfortunately there's no register to query the IOTLB configuration of the IOMMU
> and so we're stuck identifying the via the processor model number itself.
> 
> Malcolm
> 

I'm OK to use processor model here, though ideally Jan is right. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-25 15:58           ` Malcolm Crossley
  2015-11-26  7:17             ` Tian, Kevin
@ 2015-11-26  8:45             ` Jan Beulich
  2015-11-26 10:27               ` Andrew Cooper
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-11-26  8:45 UTC (permalink / raw)
  To: Andrew Cooper, Anshul Makkar, Malcolm Crossley
  Cc: yang.z.zhang, kevin.tian, xen-devel

>>> On 25.11.15 at 16:58, <malcolm.crossley@citrix.com> wrote:
> On 25/11/15 15:38, Jan Beulich wrote:
>>>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
>>> On 25/11/15 10:49, Jan Beulich wrote:
>>>> And finally I'm not fully convinced using CPU model info to deduce
>>>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>>>> be fine except maybe when running Xen itself virtualized).
>>>
>>> What else would you suggest? I can't think of any better identifying
>>> information.
>> 
>> Chipset IDs / revisions?
> 
> In this case the IOMMU is integrated into the Sandybridge-EP processor 
> itself.

Which doesn't preclude it to be identified via PCI device ID - after all
there are dozens of processor integrated PCI devices. Looking at
one of my systems,

00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, VTd_Misc, System Management [8086:3c28] (rev 07)
80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, VTd_Misc, System Management [8086:3c28] (rev 07)

could be a candidate (we already key a quirk on this device in
pci_vtd_quirk()).

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26  8:45             ` Jan Beulich
@ 2015-11-26 10:27               ` Andrew Cooper
  2015-11-26 10:39                 ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2015-11-26 10:27 UTC (permalink / raw)
  To: Jan Beulich, Anshul Makkar, Malcolm Crossley
  Cc: yang.z.zhang, kevin.tian, xen-devel

On 26/11/15 08:45, Jan Beulich wrote:
>>>> On 25.11.15 at 16:58, <malcolm.crossley@citrix.com> wrote:
>> On 25/11/15 15:38, Jan Beulich wrote:
>>>>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
>>>> On 25/11/15 10:49, Jan Beulich wrote:
>>>>> And finally I'm not fully convinced using CPU model info to deduce
>>>>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>>>>> be fine except maybe when running Xen itself virtualized).
>>>> What else would you suggest? I can't think of any better identifying
>>>> information.
>>> Chipset IDs / revisions?
>> In this case the IOMMU is integrated into the Sandybridge-EP processor 
>> itself.
> Which doesn't preclude it to be identified via PCI device ID - after all
> there are dozens of processor integrated PCI devices. Looking at
> one of my systems,
>
> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, VTd_Misc, System Management [8086:3c28] (rev 07)
> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>
> could be a candidate (we already key a quirk on this device in
> pci_vtd_quirk()).

These are fine for server variants, but not for desktop variants, both
of which we have seen in use.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 10:27               ` Andrew Cooper
@ 2015-11-26 10:39                 ` Jan Beulich
  2015-11-26 11:42                   ` Andrew Cooper
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-11-26 10:39 UTC (permalink / raw)
  To: Andrew Cooper, Anshul Makkar, Malcolm Crossley
  Cc: yang.z.zhang, kevin.tian, xen-devel

>>> On 26.11.15 at 11:27, <andrew.cooper3@citrix.com> wrote:
> On 26/11/15 08:45, Jan Beulich wrote:
>>>>> On 25.11.15 at 16:58, <malcolm.crossley@citrix.com> wrote:
>>> On 25/11/15 15:38, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
>>>>> On 25/11/15 10:49, Jan Beulich wrote:
>>>>>> And finally I'm not fully convinced using CPU model info to deduce
>>>>>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>>>>>> be fine except maybe when running Xen itself virtualized).
>>>>> What else would you suggest? I can't think of any better identifying
>>>>> information.
>>>> Chipset IDs / revisions?
>>> In this case the IOMMU is integrated into the Sandybridge-EP processor 
>>> itself.
>> Which doesn't preclude it to be identified via PCI device ID - after all
>> there are dozens of processor integrated PCI devices. Looking at
>> one of my systems,
>>
>> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>>
>> could be a candidate (we already key a quirk on this device in
>> pci_vtd_quirk()).
> 
> These are fine for server variants, but not for desktop variants, both
> of which we have seen in use.

And I gave them only as an example that keying off of PCI IDs
would be possible. A complete list would of course need to be
compiled (but I think we could simply derive it from the list of IDs
we already deal with in quirks.c).

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 10:39                 ` Jan Beulich
@ 2015-11-26 11:42                   ` Andrew Cooper
  2015-11-26 11:53                     ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2015-11-26 11:42 UTC (permalink / raw)
  To: Jan Beulich, Anshul Makkar, Malcolm Crossley
  Cc: yang.z.zhang, kevin.tian, xen-devel

On 26/11/15 10:39, Jan Beulich wrote:
>>>> On 26.11.15 at 11:27, <andrew.cooper3@citrix.com> wrote:
>> On 26/11/15 08:45, Jan Beulich wrote:
>>>>>> On 25.11.15 at 16:58, <malcolm.crossley@citrix.com> wrote:
>>>> On 25/11/15 15:38, Jan Beulich wrote:
>>>>>>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
>>>>>> On 25/11/15 10:49, Jan Beulich wrote:
>>>>>>> And finally I'm not fully convinced using CPU model info to deduce
>>>>>>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>>>>>>> be fine except maybe when running Xen itself virtualized).
>>>>>> What else would you suggest? I can't think of any better identifying
>>>>>> information.
>>>>> Chipset IDs / revisions?
>>>> In this case the IOMMU is integrated into the Sandybridge-EP processor 
>>>> itself.
>>> Which doesn't preclude it to be identified via PCI device ID - after all
>>> there are dozens of processor integrated PCI devices. Looking at
>>> one of my systems,
>>>
>>> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
>> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>>> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
>> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>>> could be a candidate (we already key a quirk on this device in
>>> pci_vtd_quirk()).
>> These are fine for server variants, but not for desktop variants, both
>> of which we have seen in use.
> And I gave them only as an example that keying off of PCI IDs
> would be possible. A complete list would of course need to be
> compiled (but I think we could simply derive it from the list of IDs
> we already deal with in quirks.c).

That is not my point.  The Desktop variants do not expose their
internals as PCI devices.

Keying on the host bridge might be an option.


Also, on further consideration, the better fix (however we identify the
affected systems) would be to quirk the IOMMUs themselves into not
claiming 2M/1G superpage support.

Otherwise, when we do eventually get superpage IOMMU mapping support in
the API, the performance regression will creep back in.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 11:42                   ` Andrew Cooper
@ 2015-11-26 11:53                     ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2015-11-26 11:53 UTC (permalink / raw)
  To: Andrew Cooper, Anshul Makkar, Malcolm Crossley
  Cc: yang.z.zhang, kevin.tian, xen-devel

>>> On 26.11.15 at 12:42, <andrew.cooper3@citrix.com> wrote:
> Also, on further consideration, the better fix (however we identify the
> affected systems) would be to quirk the IOMMUs themselves into not
> claiming 2M/1G superpage support.
> 
> Otherwise, when we do eventually get superpage IOMMU mapping support in
> the API, the performance regression will creep back in.

Indeed, good point.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-25 10:28   ` Andrew Cooper
  2015-11-25 10:49     ` Jan Beulich
@ 2015-11-26 13:46     ` Jan Beulich
  2015-11-26 13:48       ` Malcolm Crossley
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-11-26 13:46 UTC (permalink / raw)
  To: Andrew Cooper, Malcolm Crossley
  Cc: yang.z.zhang, Anshul Makkar, kevin.tian, xen-devel

>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> The problem is that SandyBridge IOMMUs advertise 2M support and do
> function with it, but cannot cache 2MB translations in the IOTLBs.
> 
> As a result, attempting to use 2M translations causes substantially
> worse performance than 4K translations.

Btw - how does this get explained? At a first glance, even if 2Mb
translations don't get entered into the TLB, it should still be one
less page table level to walk for the IOMMU, and should hence
nevertheless be a benefit. Yet you even say _substantially_
worse performance results.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 13:46     ` Jan Beulich
@ 2015-11-26 13:48       ` Malcolm Crossley
  2015-11-26 13:55         ` Andrew Cooper
  0 siblings, 1 reply; 35+ messages in thread
From: Malcolm Crossley @ 2015-11-26 13:48 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: yang.z.zhang, Anshul Makkar, kevin.tian, xen-devel

On 26/11/15 13:46, Jan Beulich wrote:
>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>
>> As a result, attempting to use 2M translations causes substantially
>> worse performance than 4K translations.
> 
> Btw - how does this get explained? At a first glance, even if 2Mb
> translations don't get entered into the TLB, it should still be one
> less page table level to walk for the IOMMU, and should hence
> nevertheless be a benefit. Yet you even say _substantially_
> worse performance results.

There is a IOTLB for the 4K translation so if you only use 4K
translations then you get to take advantage of the IOTLB.

If you use the 2Mb translation then a page table walk has to be
performed every time there's a DMA access to that region of the BFN
address space.

Malcolm

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 13:48       ` Malcolm Crossley
@ 2015-11-26 13:55         ` Andrew Cooper
  2015-11-30 21:22           ` Konrad Rzeszutek Wilk
                             ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Andrew Cooper @ 2015-11-26 13:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, yang.z.zhang, Malcolm Crossley, Anshul Makkar,
	xen-devel

On 26/11/15 13:48, Malcolm Crossley wrote:
> On 26/11/15 13:46, Jan Beulich wrote:
>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>
>>> As a result, attempting to use 2M translations causes substantially
>>> worse performance than 4K translations.
>> Btw - how does this get explained? At a first glance, even if 2Mb
>> translations don't get entered into the TLB, it should still be one
>> less page table level to walk for the IOMMU, and should hence
>> nevertheless be a benefit. Yet you even say _substantially_
>> worse performance results.
> There is a IOTLB for the 4K translation so if you only use 4K
> translations then you get to take advantage of the IOTLB.
>
> If you use the 2Mb translation then a page table walk has to be
> performed every time there's a DMA access to that region of the BFN
> address space.

Also remember that a high level dma access (from the point of view of a
driver) will be fragmented at the PCIe max packet size, which is
typically 256 bytes.

So by not caching the 2Mb translation, a dma access of 4k may undergo 16
pagetable walks, one for each PCIe packet.

We observed that using 2Mb mappings results in a 40% overhead, compared
to using 4k mappings, from the point of view of a sample network workload.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 13:55         ` Andrew Cooper
@ 2015-11-30 21:22           ` Konrad Rzeszutek Wilk
  2015-12-01 10:34             ` Andrew Cooper
  2015-12-03  1:19           ` Tian, Kevin
  2015-12-03  2:40           ` Tian, Kevin
  2 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-11-30 21:22 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: kevin.tian, Anshul Makkar, xen-devel, Jan Beulich, yang.z.zhang,
	Malcolm Crossley

On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
> On 26/11/15 13:48, Malcolm Crossley wrote:
> > On 26/11/15 13:46, Jan Beulich wrote:
> >>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>
> >>> As a result, attempting to use 2M translations causes substantially
> >>> worse performance than 4K translations.
> >> Btw - how does this get explained? At a first glance, even if 2Mb
> >> translations don't get entered into the TLB, it should still be one
> >> less page table level to walk for the IOMMU, and should hence
> >> nevertheless be a benefit. Yet you even say _substantially_
> >> worse performance results.
> > There is a IOTLB for the 4K translation so if you only use 4K
> > translations then you get to take advantage of the IOTLB.
> >
> > If you use the 2Mb translation then a page table walk has to be
> > performed every time there's a DMA access to that region of the BFN
> > address space.
> 
> Also remember that a high level dma access (from the point of view of a
> driver) will be fragmented at the PCIe max packet size, which is
> typically 256 bytes.
> 
> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> pagetable walks, one for each PCIe packet.
> 
> We observed that using 2Mb mappings results in a 40% overhead, compared
> to using 4k mappings, from the point of view of a sample network workload.

How did you observe this? I am mighty curious what kind of performance tools
you used to find this  as I would love to figure out if some of the issues
we have seen are related to this?

> 
> ~Andrew
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-30 21:22           ` Konrad Rzeszutek Wilk
@ 2015-12-01 10:34             ` Andrew Cooper
  2015-12-01 10:44               ` Anshul Makkar
  2015-12-01 15:24               ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Cooper @ 2015-12-01 10:34 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: kevin.tian, Anshul Makkar, xen-devel, Jan Beulich, yang.z.zhang,
	Malcolm Crossley

On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes substantially
>>>>> worse performance than 4K translations.
>>>> Btw - how does this get explained? At a first glance, even if 2Mb
>>>> translations don't get entered into the TLB, it should still be one
>>>> less page table level to walk for the IOMMU, and should hence
>>>> nevertheless be a benefit. Yet you even say _substantially_
>>>> worse performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be
>>> performed every time there's a DMA access to that region of the BFN
>>> address space.
>> Also remember that a high level dma access (from the point of view of a
>> driver) will be fragmented at the PCIe max packet size, which is
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
>> pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, compared
>> to using 4k mappings, from the point of view of a sample network workload.
> How did you observe this? I am mighty curious what kind of performance tools
> you used to find this  as I would love to figure out if some of the issues
> we have seen are related to this?

The 40% difference is just in terms of network throughput of a VF, given
a workload which can normally saturate line rate on the card.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-01 10:34             ` Andrew Cooper
@ 2015-12-01 10:44               ` Anshul Makkar
  2015-12-01 15:24               ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 35+ messages in thread
From: Anshul Makkar @ 2015-12-01 10:44 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: yang.z.zhang@intel.com, Kevin Tian, Malcolm Crossley, Jan Beulich,
	xen-devel@lists.xen.org

Snabbswitch (virtualized switch) also encountered similar problem : https://groups.google.com/forum/#!topic/snabb-devel/xX0yFzeXylI

Thanks
Anshul Makkar

-----Original Message-----
From: Andrew Cooper [mailto:andrew.cooper3@citrix.com] 
Sent: 01 December 2015 10:34
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jan Beulich <JBeulich@suse.com>; Kevin Tian <kevin.tian@intel.com>; yang.z.zhang@intel.com; Malcolm Crossley <malcolm.crossley@citrix.com>; Anshul Makkar <anshul.makkar@citrix.com>; xen-devel@lists.xen.org
Subject: Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do 
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes 
>>>>> substantially worse performance than 4K translations.
>>>> Btw - how does this get explained? At a first glance, even if 2Mb 
>>>> translations don't get entered into the TLB, it should still be one 
>>>> less page table level to walk for the IOMMU, and should hence 
>>>> nevertheless be a benefit. Yet you even say _substantially_ worse 
>>>> performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K 
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be 
>>> performed every time there's a DMA access to that region of the BFN 
>>> address space.
>> Also remember that a high level dma access (from the point of view of 
>> a
>> driver) will be fragmented at the PCIe max packet size, which is 
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 
>> 16 pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, 
>> compared to using 4k mappings, from the point of view of a sample network workload.
> How did you observe this? I am mighty curious what kind of performance 
> tools you used to find this  as I would love to figure out if some of 
> the issues we have seen are related to this?

The 40% difference is just in terms of network throughput of a VF, given a workload which can normally saturate line rate on the card.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-01 10:34             ` Andrew Cooper
  2015-12-01 10:44               ` Anshul Makkar
@ 2015-12-01 15:24               ` Konrad Rzeszutek Wilk
  2015-12-01 16:19                 ` Andrew Cooper
  1 sibling, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-12-01 15:24 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: kevin.tian, Anshul Makkar, xen-devel, Jan Beulich, yang.z.zhang,
	Malcolm Crossley

On Tue, Dec 01, 2015 at 10:34:17AM +0000, Andrew Cooper wrote:
> On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
> >> On 26/11/15 13:48, Malcolm Crossley wrote:
> >>> On 26/11/15 13:46, Jan Beulich wrote:
> >>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> >>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>>>
> >>>>> As a result, attempting to use 2M translations causes substantially
> >>>>> worse performance than 4K translations.
> >>>> Btw - how does this get explained? At a first glance, even if 2Mb
> >>>> translations don't get entered into the TLB, it should still be one
> >>>> less page table level to walk for the IOMMU, and should hence
> >>>> nevertheless be a benefit. Yet you even say _substantially_
> >>>> worse performance results.
> >>> There is a IOTLB for the 4K translation so if you only use 4K
> >>> translations then you get to take advantage of the IOTLB.
> >>>
> >>> If you use the 2Mb translation then a page table walk has to be
> >>> performed every time there's a DMA access to that region of the BFN
> >>> address space.
> >> Also remember that a high level dma access (from the point of view of a
> >> driver) will be fragmented at the PCIe max packet size, which is
> >> typically 256 bytes.
> >>
> >> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> >> pagetable walks, one for each PCIe packet.
> >>
> >> We observed that using 2Mb mappings results in a 40% overhead, compared
> >> to using 4k mappings, from the point of view of a sample network workload.
> > How did you observe this? I am mighty curious what kind of performance tools
> > you used to find this  as I would love to figure out if some of the issues
> > we have seen are related to this?
> 
> The 40% difference is just in terms of network throughput of a VF, given
> a workload which can normally saturate line rate on the card.

I understand that.

But I am curious on how you found out the page walks by the IOMMU were
so excessive? Were there any perf counters on the IOMMU that showed
a crazy amount of pagetable walks?

It just that if I had looked at this I would have first looked at interrupts, then
kernels, then hypervisor - and eventually (after lots of head banging) it would
have occurred to me to look at the IOMMU pagetables.


> 
> ~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-01 15:24               ` Konrad Rzeszutek Wilk
@ 2015-12-01 16:19                 ` Andrew Cooper
  0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2015-12-01 16:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: kevin.tian, Anshul Makkar, xen-devel, Jan Beulich, yang.z.zhang,
	Malcolm Crossley

On 01/12/15 15:24, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 01, 2015 at 10:34:17AM +0000, Andrew Cooper wrote:
>> On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
>>>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>>>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>>>
>>>>>>> As a result, attempting to use 2M translations causes substantially
>>>>>>> worse performance than 4K translations.
>>>>>> Btw - how does this get explained? At a first glance, even if 2Mb
>>>>>> translations don't get entered into the TLB, it should still be one
>>>>>> less page table level to walk for the IOMMU, and should hence
>>>>>> nevertheless be a benefit. Yet you even say _substantially_
>>>>>> worse performance results.
>>>>> There is a IOTLB for the 4K translation so if you only use 4K
>>>>> translations then you get to take advantage of the IOTLB.
>>>>>
>>>>> If you use the 2Mb translation then a page table walk has to be
>>>>> performed every time there's a DMA access to that region of the BFN
>>>>> address space.
>>>> Also remember that a high level dma access (from the point of view of a
>>>> driver) will be fragmented at the PCIe max packet size, which is
>>>> typically 256 bytes.
>>>>
>>>> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
>>>> pagetable walks, one for each PCIe packet.
>>>>
>>>> We observed that using 2Mb mappings results in a 40% overhead, compared
>>>> to using 4k mappings, from the point of view of a sample network workload.
>>> How did you observe this? I am mighty curious what kind of performance tools
>>> you used to find this  as I would love to figure out if some of the issues
>>> we have seen are related to this?
>> The 40% difference is just in terms of network throughput of a VF, given
>> a workload which can normally saturate line rate on the card.
> I understand that.
>
> But I am curious on how you found out the page walks by the IOMMU were
> so excessive?

I didn't.  It is all speculation drawn from other information.

The manual states that there is not a superpage IOTLB.

This leaves two options
1) 2M mappings are entirely uncached
2) 2M mappings are shattered to 4K mappings and cached

The fact there is a 40% performance reduction suggests 1 rather than 2.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26  7:17             ` Tian, Kevin
@ 2015-12-01 16:45               ` Anshul Makkar
  2015-12-01 17:20                 ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Anshul Makkar @ 2015-12-01 16:45 UTC (permalink / raw)
  To: Kevin Tian, Malcolm Crossley, Jan Beulich, Andrew Cooper
  Cc: Zhang, Yang Z, xen-devel@lists.xen.org

Based on the discussion below, can I assume there is an agreement for using processor model for filtering or chipset ID will be the preferred candidate.

Thanks
Anshul Makkar

-----Original Message-----
From: Tian, Kevin [mailto:kevin.tian@intel.com] 
Sent: 26 November 2015 07:17
To: Malcolm Crossley <malcolm.crossley@citrix.com>; Jan Beulich <JBeulich@suse.com>; Andrew Cooper <Andrew.Cooper3@citrix.com>; Anshul Makkar <anshul.makkar@citrix.com>
Cc: Zhang, Yang Z <yang.z.zhang@intel.com>; xen-devel@lists.xen.org
Subject: RE: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

> From: Malcolm Crossley [mailto:malcolm.crossley@citrix.com]
> Sent: Wednesday, November 25, 2015 11:59 PM
> 
> On 25/11/15 15:38, Jan Beulich wrote:
> >>>> On 25.11.15 at 16:13, <andrew.cooper3@citrix.com> wrote:
> >> On 25/11/15 10:49, Jan Beulich wrote:
> >>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> >>>> On 24/11/15 17:41, Jan Beulich wrote:
> >>>>>>>> On 24.11.15 at 18:17, <Anshul Makkar anshul.makkar@citrix.com> wrote:
> >>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
> >>>>>>      /* Tylersburg interrupt remap quirk */
> >>>>>>      if ( iommu_intremap )
> >>>>>>          tylersburg_intremap_quirk();
> >>>>>> +
> >>>>>> +    /*
> >>>>>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
> >>>>>> +     * by default.
> >>>>>> +     * SandyBridge has no huge page support for IOTLB which 
> >>>>>> + leads to
> >> fallback
> >>>>>> +     * on 4k pages and leads to performance degradation.
> >>>>>> +     *
> >>>>>> +     * Shared EPT ("sharept") will be disabled only if user has not
> >>>>>> +     * provided explicit choice on the command line thus 
> >>>>>> + iommu_hap_pt_share
> >> is
> >>>>>> +     * at its initialized value of -1.
> >>>>>> +     */
> >>>>>> +    if ( (boot_cpu_data.x86 == 0x06 && 
> >>>>>> + (boot_cpu_data.x86_model <= 0x2F
> ||
> >>>>>> +          boot_cpu_data.x86_model == 0x36)) && 
> >>>>>> + (iommu_hap_pt_share ==
> -1) )
> >>>>>> +        iommu_hap_pt_share = 0;
> >>>>> If we really want to do this, then I think we should key this on 
> >>>>> EPT but not VT-d having 2M support, instead of on CPU models.
> >>>> This check is already performed by vtd_ept_page_compatible()
> >>> Yeah, I realized there would be such a check on the way home.
> >>>
> >>>> The problem is that SandyBridge IOMMUs advertise 2M support and 
> >>>> do function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>>
> >>>> As a result, attempting to use 2M translations causes 
> >>>> substantially worse performance than 4K translations.
> >>> So commit message and comment should make this more explicit, to 
> >>> avoid the impression "IOTLB" isn't just the relatively common 
> >>> mis-naming of "IOMMU".
> >>>
> >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
> >>>
> >>> Further the model based check is relatively broad, and includes 
> >>> Atoms (0x36 actually is one), which can't be considered 
> >>> "Sandybridge or older" imo.
> >>>
> >>> And finally I'm not fully convinced using CPU model info to deduce 
> >>> chipset behavior is entirely correct (albeit perhaps in practice 
> >>> it'll be fine except maybe when running Xen itself virtualized).
> >>
> >> What else would you suggest? I can't think of any better 
> >> identifying information.
> >
> > Chipset IDs / revisions?
> 
> In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
> Unfortunately there's no register to query the IOTLB configuration of 
> the IOMMU and so we're stuck identifying the via the processor model number itself.
> 
> Malcolm
> 

I'm OK to use processor model here, though ideally Jan is right. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-01 16:45               ` Anshul Makkar
@ 2015-12-01 17:20                 ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2015-12-01 17:20 UTC (permalink / raw)
  To: Andrew Cooper, Anshul Makkar, Malcolm Crossley, Kevin Tian
  Cc: Yang Z Zhang, xen-devel@lists.xen.org

>>> On 01.12.15 at 17:45, <anshul.makkar@citrix.com> wrote:
> Based on the discussion below, can I assume there is an agreement for using 
> processor model for filtering or chipset ID will be the preferred candidate.

I think the subsequent suggestion by Andrew makes it even more
desirable to remain independent of CPU model here. As said before,
we should simply leverage the PCI IDs we already have quirks for.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 13:55         ` Andrew Cooper
  2015-11-30 21:22           ` Konrad Rzeszutek Wilk
@ 2015-12-03  1:19           ` Tian, Kevin
  2015-12-03 11:24             ` Andrew Cooper
  2015-12-03  2:40           ` Tian, Kevin
  2 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2015-12-03  1:19 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: Zhang, Yang Z, Malcolm Crossley, Anshul Makkar,
	xen-devel@lists.xen.org

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, November 26, 2015 9:56 PM
> 
> On 26/11/15 13:48, Malcolm Crossley wrote:
> > On 26/11/15 13:46, Jan Beulich wrote:
> >>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>
> >>> As a result, attempting to use 2M translations causes substantially
> >>> worse performance than 4K translations.
> >> Btw - how does this get explained? At a first glance, even if 2Mb
> >> translations don't get entered into the TLB, it should still be one
> >> less page table level to walk for the IOMMU, and should hence
> >> nevertheless be a benefit. Yet you even say _substantially_
> >> worse performance results.
> > There is a IOTLB for the 4K translation so if you only use 4K
> > translations then you get to take advantage of the IOTLB.
> >
> > If you use the 2Mb translation then a page table walk has to be
> > performed every time there's a DMA access to that region of the BFN
> > address space.
> 
> Also remember that a high level dma access (from the point of view of a
> driver) will be fragmented at the PCIe max packet size, which is
> typically 256 bytes.
> 
> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> pagetable walks, one for each PCIe packet.
> 
> We observed that using 2Mb mappings results in a 40% overhead, compared
> to using 4k mappings, from the point of view of a sample network workload.
> 
> ~Andrew

One confusion here. The original patch just disables shared_ept, w/o
changing IOMMU to not use 2MB mapping. Is there something missing
or other tricks behind?

When you say using 4k mapping saves 40% overhead back, is it w/
ept shared or not?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-11-26 13:55         ` Andrew Cooper
  2015-11-30 21:22           ` Konrad Rzeszutek Wilk
  2015-12-03  1:19           ` Tian, Kevin
@ 2015-12-03  2:40           ` Tian, Kevin
  2015-12-03  8:18             ` Jan Beulich
  2 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2015-12-03  2:40 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: Zhang, Yang Z, Malcolm Crossley, Anshul Makkar,
	xen-devel@lists.xen.org

> From: Tian, Kevin
> Sent: Thursday, December 03, 2015 9:20 AM
> 
> > From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> > Sent: Thursday, November 26, 2015 9:56 PM
> >
> > On 26/11/15 13:48, Malcolm Crossley wrote:
> > > On 26/11/15 13:46, Jan Beulich wrote:
> > >>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> > >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> > >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> > >>>
> > >>> As a result, attempting to use 2M translations causes substantially
> > >>> worse performance than 4K translations.
> > >> Btw - how does this get explained? At a first glance, even if 2Mb
> > >> translations don't get entered into the TLB, it should still be one
> > >> less page table level to walk for the IOMMU, and should hence
> > >> nevertheless be a benefit. Yet you even say _substantially_
> > >> worse performance results.
> > > There is a IOTLB for the 4K translation so if you only use 4K
> > > translations then you get to take advantage of the IOTLB.
> > >
> > > If you use the 2Mb translation then a page table walk has to be
> > > performed every time there's a DMA access to that region of the BFN
> > > address space.
> >
> > Also remember that a high level dma access (from the point of view of a
> > driver) will be fragmented at the PCIe max packet size, which is
> > typically 256 bytes.
> >
> > So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> > pagetable walks, one for each PCIe packet.
> >
> > We observed that using 2Mb mappings results in a 40% overhead, compared
> > to using 4k mappings, from the point of view of a sample network workload.
> >
> > ~Andrew
> 
> One confusion here. The original patch just disables shared_ept, w/o
> changing IOMMU to not use 2MB mapping. Is there something missing
> or other tricks behind?
> 
> When you say using 4k mapping saves 40% overhead back, is it w/
> ept shared or not?
> 

Just confirmed internally with HW team. On SNB 4KB cache is always
used regardless of 4KB/2MB/1GB mapping. There'd be another reason
for this 40% drop observation...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-03  2:40           ` Tian, Kevin
@ 2015-12-03  8:18             ` Jan Beulich
  2015-12-03  8:50               ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2015-12-03  8:18 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Yang Z Zhang, Andrew Cooper, Anshul Makkar, Malcolm Crossley,
	xen-devel@lists.xen.org

>>> On 03.12.15 at 03:40, <kevin.tian@intel.com> wrote:
> Just confirmed internally with HW team. On SNB 4KB cache is always
> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
> for this 40% drop observation...

So when they stated that the 4k TLB gets always used, did they at
least provide some thoughts on what else might be causing this
severe a performance impact? Without them helping we're left
guessing...

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-03  8:18             ` Jan Beulich
@ 2015-12-03  8:50               ` Tian, Kevin
  2015-12-03 11:19                 ` Andrew Cooper
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2015-12-03  8:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Zhang, Yang Z, Andrew Cooper, Anshul Makkar, Malcolm Crossley,
	xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, December 03, 2015 4:18 PM
> 
> >>> On 03.12.15 at 03:40, <kevin.tian@intel.com> wrote:
> > Just confirmed internally with HW team. On SNB 4KB cache is always
> > used regardless of 4KB/2MB/1GB mapping. There'd be another reason
> > for this 40% drop observation...
> 
> So when they stated that the 4k TLB gets always used, did they at
> least provide some thoughts on what else might be causing this
> severe a performance impact? Without them helping we're left
> guessing...
> 

Unfortunately no clear answer...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-03  8:50               ` Tian, Kevin
@ 2015-12-03 11:19                 ` Andrew Cooper
  2015-12-04  2:35                   ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2015-12-03 11:19 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich
  Cc: Zhang, Yang Z, Anshul Makkar, Malcolm Crossley,
	xen-devel@lists.xen.org

On 03/12/15 08:50, Tian, Kevin wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, December 03, 2015 4:18 PM
>>
>>>>> On 03.12.15 at 03:40, <kevin.tian@intel.com> wrote:
>>> Just confirmed internally with HW team. On SNB 4KB cache is always
>>> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
>>> for this 40% drop observation...
>> So when they stated that the 4k TLB gets always used, did they at
>> least provide some thoughts on what else might be causing this
>> severe a performance impact? Without them helping we're left
>> guessing...
>>
> Unfortunately no clear answer...

http://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf

Page 42: "The IOTLB on the previous generation Intel Xeon Processor
E5-2690 does not natively support huge pages (it emulates them using 4K
pages)."

And Figure 51 on Page 43

The "emulates them using 4K pages" probably means that the IOTLB is
flushed and filled with 512 adjacent 4k mappings.

Citrix's measurements back up the findings in that paper, and also show
that performance is better when using plain 4k mappings as opposed to
emulated 2M mappings.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-03  1:19           ` Tian, Kevin
@ 2015-12-03 11:24             ` Andrew Cooper
  2015-12-04  1:55               ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2015-12-03 11:24 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich
  Cc: Zhang, Yang Z, Malcolm Crossley, Anshul Makkar,
	xen-devel@lists.xen.org

On 03/12/15 01:19, Tian, Kevin wrote:
>> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> Sent: Thursday, November 26, 2015 9:56 PM
>>
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes substantially
>>>>> worse performance than 4K translations.
>>>> Btw - how does this get explained? At a first glance, even if 2Mb
>>>> translations don't get entered into the TLB, it should still be one
>>>> less page table level to walk for the IOMMU, and should hence
>>>> nevertheless be a benefit. Yet you even say _substantially_
>>>> worse performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be
>>> performed every time there's a DMA access to that region of the BFN
>>> address space.
>> Also remember that a high level dma access (from the point of view of a
>> driver) will be fragmented at the PCIe max packet size, which is
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
>> pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, compared
>> to using 4k mappings, from the point of view of a sample network workload.
>>
>> ~Andrew
> One confusion here. The original patch just disables shared_ept, w/o
> changing IOMMU to not use 2MB mapping. Is there something missing
> or other tricks behind?

Disabling shared_ept works because the Xen IOMMU interface doesn't
support superpages.

However, I subsequently changed my mind in this thread about the
approach taken, and quirking the IOMMUs not to report superpage
capabilities is the correct solution to the problem.

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-03 11:24             ` Andrew Cooper
@ 2015-12-04  1:55               ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2015-12-04  1:55 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: Zhang, Yang Z, Malcolm Crossley, Anshul Makkar,
	xen-devel@lists.xen.org

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, December 03, 2015 7:24 PM
> 
> On 03/12/15 01:19, Tian, Kevin wrote:
> >> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> >> Sent: Thursday, November 26, 2015 9:56 PM
> >>
> >> On 26/11/15 13:48, Malcolm Crossley wrote:
> >>> On 26/11/15 13:46, Jan Beulich wrote:
> >>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@citrix.com> wrote:
> >>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>>>
> >>>>> As a result, attempting to use 2M translations causes substantially
> >>>>> worse performance than 4K translations.
> >>>> Btw - how does this get explained? At a first glance, even if 2Mb
> >>>> translations don't get entered into the TLB, it should still be one
> >>>> less page table level to walk for the IOMMU, and should hence
> >>>> nevertheless be a benefit. Yet you even say _substantially_
> >>>> worse performance results.
> >>> There is a IOTLB for the 4K translation so if you only use 4K
> >>> translations then you get to take advantage of the IOTLB.
> >>>
> >>> If you use the 2Mb translation then a page table walk has to be
> >>> performed every time there's a DMA access to that region of the BFN
> >>> address space.
> >> Also remember that a high level dma access (from the point of view of a
> >> driver) will be fragmented at the PCIe max packet size, which is
> >> typically 256 bytes.
> >>
> >> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> >> pagetable walks, one for each PCIe packet.
> >>
> >> We observed that using 2Mb mappings results in a 40% overhead, compared
> >> to using 4k mappings, from the point of view of a sample network workload.
> >>
> >> ~Andrew
> > One confusion here. The original patch just disables shared_ept, w/o
> > changing IOMMU to not use 2MB mapping. Is there something missing
> > or other tricks behind?
> 
> Disabling shared_ept works because the Xen IOMMU interface doesn't
> support superpages.
> 
> However, I subsequently changed my mind in this thread about the
> approach taken, and quirking the IOMMUs not to report superpage
> capabilities is the correct solution to the problem.
> 

Yes, it makes more sense.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
  2015-12-03 11:19                 ` Andrew Cooper
@ 2015-12-04  2:35                   ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2015-12-04  2:35 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: Zhang, Yang Z, Anshul Makkar, Malcolm Crossley,
	xen-devel@lists.xen.org

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, December 03, 2015 7:19 PM
> 
> On 03/12/15 08:50, Tian, Kevin wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, December 03, 2015 4:18 PM
> >>
> >>>>> On 03.12.15 at 03:40, <kevin.tian@intel.com> wrote:
> >>> Just confirmed internally with HW team. On SNB 4KB cache is always
> >>> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
> >>> for this 40% drop observation...
> >> So when they stated that the 4k TLB gets always used, did they at
> >> least provide some thoughts on what else might be causing this
> >> severe a performance impact? Without them helping we're left
> >> guessing...
> >>
> > Unfortunately no clear answer...
> 
> http://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf
> 
> Page 42: "The IOTLB on the previous generation Intel Xeon Processor
> E5-2690 does not natively support huge pages (it emulates them using 4K
> pages)."
> 
> And Figure 51 on Page 43
> 
> The "emulates them using 4K pages" probably means that the IOTLB is
> flushed and filled with 512 adjacent 4k mappings.
> 
> Citrix's measurements back up the findings in that paper, and also show
> that performance is better when using plain 4k mappings as opposed to
> emulated 2M mappings.
> 

Thanks for the information. I'll forward it to HW team.

If above interpretation is correct (which also matches my thought), then
for two options you listed earlier:

---
> This leaves two options
> 1) 2M mappings are entirely uncached
> 2) 2M mappings are shattered to 4K mappings and cached

> The fact there is a 40% performance reduction suggests 1 rather than 2.
---

looks 2) is suggested rather than 1). There are two further options:

2.1) 2M mappings are shattered to 512 adjacent 4k mappings which are all
cached;
2.2) Only the 4k mapping out of 2M mapping is cached for the page being 
accessed;

for 2.1), as IOTLB entries are limited, it may cause unnecessary IOTLB
entry flushes and thus incurs more page walking overhead to fill-in.

for 2.2), I can't think out a reason to cause performance drop.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2015-12-04  2:35 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1440776507-30218-1-git-send-email-anshul.makkar@citrix.com>
2015-08-28 16:24 ` [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors Andrew Cooper
2015-08-31  8:09 ` Jan Beulich
2015-09-01 14:18   ` Andrew Cooper
2015-09-01 14:55     ` Jan Beulich
2015-11-24 17:17 Anshul
2015-11-24 17:41 ` Jan Beulich
2015-11-25 10:28   ` Andrew Cooper
2015-11-25 10:49     ` Jan Beulich
2015-11-25 15:13       ` Andrew Cooper
2015-11-25 15:38         ` Jan Beulich
2015-11-25 15:58           ` Malcolm Crossley
2015-11-26  7:17             ` Tian, Kevin
2015-12-01 16:45               ` Anshul Makkar
2015-12-01 17:20                 ` Jan Beulich
2015-11-26  8:45             ` Jan Beulich
2015-11-26 10:27               ` Andrew Cooper
2015-11-26 10:39                 ` Jan Beulich
2015-11-26 11:42                   ` Andrew Cooper
2015-11-26 11:53                     ` Jan Beulich
2015-11-26 13:46     ` Jan Beulich
2015-11-26 13:48       ` Malcolm Crossley
2015-11-26 13:55         ` Andrew Cooper
2015-11-30 21:22           ` Konrad Rzeszutek Wilk
2015-12-01 10:34             ` Andrew Cooper
2015-12-01 10:44               ` Anshul Makkar
2015-12-01 15:24               ` Konrad Rzeszutek Wilk
2015-12-01 16:19                 ` Andrew Cooper
2015-12-03  1:19           ` Tian, Kevin
2015-12-03 11:24             ` Andrew Cooper
2015-12-04  1:55               ` Tian, Kevin
2015-12-03  2:40           ` Tian, Kevin
2015-12-03  8:18             ` Jan Beulich
2015-12-03  8:50               ` Tian, Kevin
2015-12-03 11:19                 ` Andrew Cooper
2015-12-04  2:35                   ` Tian, Kevin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.