All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
@ 2015-09-09  6:50 Tiejun Chen
  2015-09-09 14:20 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Tiejun Chen @ 2015-09-09  6:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Ian Jackson, Ian Campbell, Jan Beulich, Tim Deegan

We should lower loglevel to XENLOG_G_DEBUG while mapping or
unmapping memory via XEN_DOMCTL_memory_mapping since its
fair enough to check this info just while debugging.

CC: Ian Campbell <ian.campbell@citrix.com>
CC: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Keir Fraser <keir@xen.org>
CC: Tim Deegan <tim@xen.org>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/common/domctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 7f959f3..3bf39f1 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -1049,7 +1049,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 
         if ( add )
         {
-            printk(XENLOG_G_INFO
+            printk(XENLOG_G_DEBUG
                    "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
                    d->domain_id, gfn, mfn, nr_mfns);
 
@@ -1061,7 +1061,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         }
         else
         {
-            printk(XENLOG_G_INFO
+            printk(XENLOG_G_DEBUG
                    "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
                    d->domain_id, gfn, mfn, nr_mfns);
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09  6:50 [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping Tiejun Chen
@ 2015-09-09 14:20 ` Konrad Rzeszutek Wilk
  2015-09-09 14:33   ` Jan Beulich
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-09 14:20 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, xen-devel,
	Jan Beulich

On Wed, Sep 09, 2015 at 02:50:25PM +0800, Tiejun Chen wrote:
> We should lower loglevel to XENLOG_G_DEBUG while mapping or
> unmapping memory via XEN_DOMCTL_memory_mapping since its
> fair enough to check this info just while debugging.

The issue you folks are hitting where it takes eons to boot
with a large BAR _may_ be related to tasklets and triggering
the hypercall_preempt_check().

I think you may be using 'console_to_ring' or such and collecting
the hypervisor logs. That will schedule an tasklet whenever printk
is used - and the tasklet will cause hypercall_preempt_check() to
return true.

So what we get is that on the entrace (even before calling map_mmio_region)
for this function we have an outstanding softirq we must process - which
means that this function exits as soon as possible to service it.

Perhaps the solution is remove the first printk(s) and just have them
once the operation has completed? That may fix the outstanding tasklet
problem?

Could you try that (or perhaps you have already tried it?) please?

> 
> CC: Ian Campbell <ian.campbell@citrix.com>
> CC: Ian Jackson <ian.jackson@eu.citrix.com>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Keir Fraser <keir@xen.org>
> CC: Tim Deegan <tim@xen.org>
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> ---
>  xen/common/domctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 7f959f3..3bf39f1 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -1049,7 +1049,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  
>          if ( add )
>          {
> -            printk(XENLOG_G_INFO
> +            printk(XENLOG_G_DEBUG
>                     "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
>                     d->domain_id, gfn, mfn, nr_mfns);
>  
> @@ -1061,7 +1061,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>          }
>          else
>          {
> -            printk(XENLOG_G_INFO
> +            printk(XENLOG_G_DEBUG
>                     "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
>                     d->domain_id, gfn, mfn, nr_mfns);
>  
> -- 
> 1.9.1
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09 14:20 ` Konrad Rzeszutek Wilk
@ 2015-09-09 14:33   ` Jan Beulich
  2015-09-09 14:50     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-09-09 14:33 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, xen-devel,
	Tiejun Chen

>>> On 09.09.15 at 16:20, <konrad.wilk@oracle.com> wrote:
> Perhaps the solution is remove the first printk(s) and just have them
> once the operation has completed? That may fix the outstanding tasklet
> problem?

Considering that this is a tool stack based retry, how would the
hypervisor know when the _whole_ operation is done?

Jan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09 14:33   ` Jan Beulich
@ 2015-09-09 14:50     ` Konrad Rzeszutek Wilk
  2015-09-09 14:55       ` Jan Beulich
  2015-09-09 15:19       ` Malcolm Crossley
  0 siblings, 2 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-09 14:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, xen-devel,
	Tiejun Chen

On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
> >>> On 09.09.15 at 16:20, <konrad.wilk@oracle.com> wrote:
> > Perhaps the solution is remove the first printk(s) and just have them
> > once the operation has completed? That may fix the outstanding tasklet
> > problem?
> 
> Considering that this is a tool stack based retry, how would the
> hypervisor know when the _whole_ operation is done?

I was merely thinking of moving the printk _after_ the map_mmio_regions
so there wouldn't be any outstanding preemption points in map_mmio_regions
(so it can at least do the 64 PFNs).

But going forward a couple of ideas:

 - The 64 limit was arbitrary. It could have been 42 or PFNs / num_online_cpus(),
   or actually finding out the size of the BAR and figuring the optimal
   case so that it will be done under 1ms. Or perhaps just provide an
   boot time parameter for those that really are struggling with this.

 - Perhaps add a new API to the P2M code so it can do the operations
   in batches. Our map_mmio_region iterates over every PFN which is
   surely not efficient. Some batching could help? Maybe? What other
   code could benefit from this? Would the boot-time creation of IOMMU
   page-tables also help with this (which takes 10 minutes on a 1TB box
   BTW)

 - This printk can be altogether removed in the hypervisor and moved
   in the toolstack. That is the libxl xc_domain_add_to_physmap
   could itself use the logging facility (xc_report?) the action.
   
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09 14:50     ` Konrad Rzeszutek Wilk
@ 2015-09-09 14:55       ` Jan Beulich
  2015-09-09 15:05         ` Konrad Rzeszutek Wilk
  2015-09-09 15:19       ` Malcolm Crossley
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-09-09 14:55 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, xen-devel,
	Tiejun Chen

>>> On 09.09.15 at 16:50, <konrad.wilk@oracle.com> wrote:
> On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
>> >>> On 09.09.15 at 16:20, <konrad.wilk@oracle.com> wrote:
>> > Perhaps the solution is remove the first printk(s) and just have them
>> > once the operation has completed? That may fix the outstanding tasklet
>> > problem?
>> 
>> Considering that this is a tool stack based retry, how would the
>> hypervisor know when the _whole_ operation is done?
> 
> I was merely thinking of moving the printk _after_ the map_mmio_regions
> so there wouldn't be any outstanding preemption points in map_mmio_regions
> (so it can at least do the 64 PFNs).

But there are no preemption points. That's why you needed to use
tool stack based retries in the first place.

Jan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09 14:55       ` Jan Beulich
@ 2015-09-09 15:05         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-09 15:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, xen-devel,
	Tiejun Chen

On Wed, Sep 09, 2015 at 08:55:38AM -0600, Jan Beulich wrote:
> >>> On 09.09.15 at 16:50, <konrad.wilk@oracle.com> wrote:
> > On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
> >> >>> On 09.09.15 at 16:20, <konrad.wilk@oracle.com> wrote:
> >> > Perhaps the solution is remove the first printk(s) and just have them
> >> > once the operation has completed? That may fix the outstanding tasklet
> >> > problem?
> >> 
> >> Considering that this is a tool stack based retry, how would the
> >> hypervisor know when the _whole_ operation is done?
> > 
> > I was merely thinking of moving the printk _after_ the map_mmio_regions
> > so there wouldn't be any outstanding preemption points in map_mmio_regions
> > (so it can at least do the 64 PFNs).
> 
> But there are no preemption points. That's why you needed to use
> tool stack based retries in the first place.

You are totally right. I recall seeing it but I realize it was an
RFC patch I wrote that would have made map_mmio_regions call
the hypercall_preempt_check()!

<sigh> Sorry about the noise!
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09 14:50     ` Konrad Rzeszutek Wilk
  2015-09-09 14:55       ` Jan Beulich
@ 2015-09-09 15:19       ` Malcolm Crossley
  2015-09-09 15:44         ` Jan Beulich
  1 sibling, 1 reply; 20+ messages in thread
From: Malcolm Crossley @ 2015-09-09 15:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, xen-devel,
	Tiejun Chen

On 09/09/15 15:50, Konrad Rzeszutek Wilk wrote:
> On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
>>>>> On 09.09.15 at 16:20, <konrad.wilk@oracle.com> wrote:
>>> Perhaps the solution is remove the first printk(s) and just have them
>>> once the operation has completed? That may fix the outstanding tasklet
>>> problem?
>>
>> Considering that this is a tool stack based retry, how would the
>> hypervisor know when the _whole_ operation is done?
> 
> I was merely thinking of moving the printk _after_ the map_mmio_regions
> so there wouldn't be any outstanding preemption points in map_mmio_regions
> (so it can at least do the 64 PFNs).
> 
> But going forward a couple of ideas:
> 
>  - The 64 limit was arbitrary. It could have been 42 or PFNs / num_online_cpus(),
>    or actually finding out the size of the BAR and figuring the optimal
>    case so that it will be done under 1ms. Or perhaps just provide an
>    boot time parameter for those that really are struggling with this.

The issue of it taking a long time to map a large BAR is caused by the unconditional
memory_type_changed call at the bottom of XEN_DOMCTL_memory_mapping function.

The memory_type_changed call results in a full data cache flush on VM with a PCI device
assigned.

We have seen this overhead cause a 1GB BAR to take 20 seconds to map it's MMIO.

If the 64 limit was arbitrary then I would suggest increasing it to at least 1024 so that
at least 4M of BAR can be mapped in one go and it reduces the overhead by a factor of 16.

Currently the toolstack attempts lower and lower powers of 2 size's of the MMIO region to be mapped
until the hypercall succeeds.

Is it not clear to me why we have an unconditional call to memory_type_changed in the domctl.
Can somebody explain why it can't be made condition on errors?

Malcolm


> 
>  - Perhaps add a new API to the P2M code so it can do the operations
>    in batches. Our map_mmio_region iterates over every PFN which is
>    surely not efficient. Some batching could help? Maybe? What other
>    code could benefit from this? Would the boot-time creation of IOMMU
>    page-tables also help with this (which takes 10 minutes on a 1TB box
>    BTW)
> 
>  - This printk can be altogether removed in the hypervisor and moved
>    in the toolstack. That is the libxl xc_domain_add_to_physmap
>    could itself use the logging facility (xc_report?) the action.
>    
>>
>> Jan
>>
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09 15:19       ` Malcolm Crossley
@ 2015-09-09 15:44         ` Jan Beulich
  2015-09-10  5:28           ` Chen, Tiejun
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-09-09 15:44 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Tiejun Chen

>>> On 09.09.15 at 17:19, <malcolm.crossley@citrix.com> wrote:
> On 09/09/15 15:50, Konrad Rzeszutek Wilk wrote:
>> On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
>>>>>> On 09.09.15 at 16:20, <konrad.wilk@oracle.com> wrote:
>>>> Perhaps the solution is remove the first printk(s) and just have them
>>>> once the operation has completed? That may fix the outstanding tasklet
>>>> problem?
>>>
>>> Considering that this is a tool stack based retry, how would the
>>> hypervisor know when the _whole_ operation is done?
>> 
>> I was merely thinking of moving the printk _after_ the map_mmio_regions
>> so there wouldn't be any outstanding preemption points in map_mmio_regions
>> (so it can at least do the 64 PFNs).
>> 
>> But going forward a couple of ideas:
>> 
>>  - The 64 limit was arbitrary. It could have been 42 or PFNs / num_online_cpus(),
>>    or actually finding out the size of the BAR and figuring the optimal
>>    case so that it will be done under 1ms. Or perhaps just provide an
>>    boot time parameter for those that really are struggling with this.
> 
> The issue of it taking a long time to map a large BAR is caused by the 
> unconditional
> memory_type_changed call at the bottom of XEN_DOMCTL_memory_mapping 
> function.
> 
> The memory_type_changed call results in a full data cache flush on VM with a 
> PCI device
> assigned.
> 
> We have seen this overhead cause a 1GB BAR to take 20 seconds to map it's 
> MMIO.
> 
> If the 64 limit was arbitrary then I would suggest increasing it to at least 
> 1024 so that
> at least 4M of BAR can be mapped in one go and it reduces the overhead by a 
> factor of 16.

1024 may be a little much, but 256 is certainly a possibility, plus
Konrad's suggestion to allow this limit to be controlled via command
line option.

> Currently the toolstack attempts lower and lower powers of 2 size's of the 
> MMIO region to be mapped
> until the hypercall succeeds.
> 
> Is it not clear to me why we have an unconditional call to 
> memory_type_changed in the domctl.
> Can somebody explain why it can't be made condition on errors?

You mean only on success? That may be possible (albeit as you
can see from the commit introducing it I wasn't sure, and I'm still
not sure), but wouldn't buy you anything.

Jan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-09 15:44         ` Jan Beulich
@ 2015-09-10  5:28           ` Chen, Tiejun
  2015-09-10  8:13             ` Jan Beulich
  0 siblings, 1 reply; 20+ messages in thread
From: Chen, Tiejun @ 2015-09-10  5:28 UTC (permalink / raw)
  To: Jan Beulich, Malcolm Crossley
  Cc: Keir Fraser, Ian Campbell, IanJackson, Tim Deegan, xen-devel

>> If the 64 limit was arbitrary then I would suggest increasing it to at least
>> 1024 so that
>> at least 4M of BAR can be mapped in one go and it reduces the overhead by a
>> factor of 16.
>
> 1024 may be a little much, but 256 is certainly a possibility, plus
> Konrad's suggestion to allow this limit to be controlled via command
> line option.

Are you guys talking this way?

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 3946e4c..a9671bb 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -88,6 +88,10 @@ boolean_param("noapic", skip_ioapic_setup);
  s8 __read_mostly xen_cpuidle = -1;
  boolean_param("cpuidle", xen_cpuidle);

+/* once_mapping_mfns: memory mapping mfn bumbers once. */
+unsigned int xen_once_mapping_mfns;
+integer_param("once_mapping_mfns", xen_once_mapping_mfns);
+
  #ifndef NDEBUG
  unsigned long __initdata highmem_start;
  size_param("highmem-start", highmem_start);
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 3bf39f1..82c85e3 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -33,6 +33,8 @@
  #include <public/domctl.h>
  #include <xsm/xsm.h>

+extern unsigned int xen_once_mapping_mfns;
+
  static DEFINE_SPINLOCK(domctl_lock);
  DEFINE_SPINLOCK(vcpu_alloc_lock);

@@ -1035,7 +1037,7 @@ long 
do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)

          ret = -E2BIG;
          /* Must break hypercall up as this could take a while. */
-        if ( nr_mfns > 64 )
+        if ( nr_mfns > xen_once_mapping_mfns )
              break;

          ret = -EPERM;

Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-10  5:28           ` Chen, Tiejun
@ 2015-09-10  8:13             ` Jan Beulich
  2015-09-10  8:55               ` Chen, Tiejun
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-09-10  8:13 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Malcolm Crossley

>>> On 10.09.15 at 07:28, <tiejun.chen@intel.com> wrote:
>> > If the 64 limit was arbitrary then I would suggest increasing it to at least
>>> 1024 so that
>>> at least 4M of BAR can be mapped in one go and it reduces the overhead by a
>>> factor of 16.
>>
>> 1024 may be a little much, but 256 is certainly a possibility, plus
>> Konrad's suggestion to allow this limit to be controlled via command
>> line option.
> 
> Are you guys talking this way?

Sort of (the patch has the intended effect, but for its size very
many rough edges).

Jan

> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 3946e4c..a9671bb 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -88,6 +88,10 @@ boolean_param("noapic", skip_ioapic_setup);
>   s8 __read_mostly xen_cpuidle = -1;
>   boolean_param("cpuidle", xen_cpuidle);
> 
> +/* once_mapping_mfns: memory mapping mfn bumbers once. */
> +unsigned int xen_once_mapping_mfns;
> +integer_param("once_mapping_mfns", xen_once_mapping_mfns);
> +
>   #ifndef NDEBUG
>   unsigned long __initdata highmem_start;
>   size_param("highmem-start", highmem_start);
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 3bf39f1..82c85e3 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -33,6 +33,8 @@
>   #include <public/domctl.h>
>   #include <xsm/xsm.h>
> 
> +extern unsigned int xen_once_mapping_mfns;
> +
>   static DEFINE_SPINLOCK(domctl_lock);
>   DEFINE_SPINLOCK(vcpu_alloc_lock);
> 
> @@ -1035,7 +1037,7 @@ long 
> do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
> 
>           ret = -E2BIG;
>           /* Must break hypercall up as this could take a while. */
> -        if ( nr_mfns > 64 )
> +        if ( nr_mfns > xen_once_mapping_mfns )
>               break;
> 
>           ret = -EPERM;
> 
> Thanks
> Tiejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-10  8:13             ` Jan Beulich
@ 2015-09-10  8:55               ` Chen, Tiejun
  2015-09-10  8:59                 ` Jan Beulich
  0 siblings, 1 reply; 20+ messages in thread
From: Chen, Tiejun @ 2015-09-10  8:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Malcolm Crossley

> Sort of (the patch has the intended effect, but for its size very
> many rough edges).
>

I guess we need to amend the original parameter, once_mapping_mfns, like 
this,

/* xen_once_mapping_mfns: memory mapping mfn bumbers once. */
unsigned int xen_once_mapping_mfns;
size_param("once_mapping_mfns", xen_once_mapping_mfns);

static void xen_once_mapping_mfns_setup(void)
{
     if ( once_mapping_mfns < 64 )
         xen_once_mapping_mfns = 64;
     else if ( once_mapping_mfns > 1024 )
         xen_once_mapping_mfns = 1024;
     else
         xen_once_mapping_mfns = once_mapping_mfns;
}

Or what is your expected rule?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-10  8:55               ` Chen, Tiejun
@ 2015-09-10  8:59                 ` Jan Beulich
  2015-09-10 17:55                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-09-10  8:59 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Malcolm Crossley

>>> On 10.09.15 at 10:55, <tiejun.chen@intel.com> wrote:
>>  Sort of (the patch has the intended effect, but for its size very
>> many rough edges).
>>
> 
> I guess we need to amend the original parameter, once_mapping_mfns, like 
> this,
> 
> /* xen_once_mapping_mfns: memory mapping mfn bumbers once. */
> unsigned int xen_once_mapping_mfns;
> size_param("once_mapping_mfns", xen_once_mapping_mfns);
> 
> static void xen_once_mapping_mfns_setup(void)
> {
>      if ( once_mapping_mfns < 64 )
>          xen_once_mapping_mfns = 64;
>      else if ( once_mapping_mfns > 1024 )
>          xen_once_mapping_mfns = 1024;
>      else
>          xen_once_mapping_mfns = once_mapping_mfns;
> }

Right, that's one of the things that would need taking care of.
(Whether enforcing an upper limit is actually needed I'm not
sure - we generally allow the admin to shoot himself in the foot
if he wants to. And whether the lower limit should be 64 instead
of just ensuring the limit is not zero is another question.)

Jan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-10  8:59                 ` Jan Beulich
@ 2015-09-10 17:55                   ` Konrad Rzeszutek Wilk
  2015-09-11  0:44                     ` Chen, Tiejun
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-10 17:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Tiejun Chen, Malcolm Crossley

On Thu, Sep 10, 2015 at 02:59:17AM -0600, Jan Beulich wrote:
> >>> On 10.09.15 at 10:55, <tiejun.chen@intel.com> wrote:
> >>  Sort of (the patch has the intended effect, but for its size very
> >> many rough edges).
> >>
> > 
> > I guess we need to amend the original parameter, once_mapping_mfns, like 
> > this,
> > 
> > /* xen_once_mapping_mfns: memory mapping mfn bumbers once. */
> > unsigned int xen_once_mapping_mfns;
> > size_param("once_mapping_mfns", xen_once_mapping_mfns);
> > 
> > static void xen_once_mapping_mfns_setup(void)
> > {
> >      if ( once_mapping_mfns < 64 )
> >          xen_once_mapping_mfns = 64;
> >      else if ( once_mapping_mfns > 1024 )
> >          xen_once_mapping_mfns = 1024;
> >      else
> >          xen_once_mapping_mfns = once_mapping_mfns;
> > }
> 
> Right, that's one of the things that would need taking care of.
> (Whether enforcing an upper limit is actually needed I'm not
> sure - we generally allow the admin to shoot himself in the foot
> if he wants to. And whether the lower limit should be 64 instead
> of just ensuring the limit is not zero is another question.)

64 was semi-arbitrary  - it ended up giving good latency on
highly scalar machines (8 socket). Higher numbers ended up
affecting the latency.

But higher numbers on small socket machines were OK.
(As they do not have 8 IOMMU VT-d chipsets all potentially
flodding the QPI with serialized cache flushes).
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-10 17:55                   ` Konrad Rzeszutek Wilk
@ 2015-09-11  0:44                     ` Chen, Tiejun
  2015-09-11  0:59                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Chen, Tiejun @ 2015-09-11  0:44 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Jan Beulich
  Cc: Keir Fraser, Ian Campbell, IanJackson, Tim Deegan, xen-devel,
	Malcolm Crossley

>> Right, that's one of the things that would need taking care of.
>> (Whether enforcing an upper limit is actually needed I'm not
>> sure - we generally allow the admin to shoot himself in the foot
>> if he wants to. And whether the lower limit should be 64 instead
>> of just ensuring the limit is not zero is another question.)
>
> 64 was semi-arbitrary  - it ended up giving good latency on
> highly scalar machines (8 socket). Higher numbers ended up
> affecting the latency.
>
> But higher numbers on small socket machines were OK.
> (As they do not have 8 IOMMU VT-d chipsets all potentially
> flodding the QPI with serialized cache flushes).
>>

So we should make this range [8, <Its up to the user>] here, but 64 by 
default. Right?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-11  0:44                     ` Chen, Tiejun
@ 2015-09-11  0:59                       ` Konrad Rzeszutek Wilk
  2015-09-11  9:17                         ` Jan Beulich
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-11  0:59 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Jan Beulich, Malcolm Crossley

On Fri, Sep 11, 2015 at 08:44:50AM +0800, Chen, Tiejun wrote:
> >>Right, that's one of the things that would need taking care of.
> >>(Whether enforcing an upper limit is actually needed I'm not
> >>sure - we generally allow the admin to shoot himself in the foot
> >>if he wants to. And whether the lower limit should be 64 instead
> >>of just ensuring the limit is not zero is another question.)
> >
> >64 was semi-arbitrary  - it ended up giving good latency on
> >highly scalar machines (8 socket). Higher numbers ended up
> >affecting the latency.
> >
> >But higher numbers on small socket machines were OK.
> >(As they do not have 8 IOMMU VT-d chipsets all potentially
> >flodding the QPI with serialized cache flushes).
> >>
> 
> So we should make this range [8, <Its up to the user>] here, but 64 by
> default. Right?

Not sure I follow you. The 8 was based on 8 socket machines having 8
IOMMUs..

If you want a formula I would do:

#define MAX_SOCKETS 8

 max_pfns = pow(2,(MAX_SOCKETS - (max(nr_iommus(), MAX_SOCKETS)))) * 64;

Where nr_iommus would have to be somehow implemented, ditto for pow.

This should give you:
 8	-> 64
 7	-> 128
 6	-> 256
 5	-> 512
 4	-> 1024
 3	-> 2048
 2	-> 4096
 1	-> 16384

> 
> Thanks
> Tiejun
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-11  0:59                       ` Konrad Rzeszutek Wilk
@ 2015-09-11  9:17                         ` Jan Beulich
  2015-09-11 10:28                           ` Malcolm Crossley
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-09-11  9:17 UTC (permalink / raw)
  To: Tiejun Chen, Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, IanJackson, Tim Deegan, xen-devel,
	Malcolm Crossley

>>> On 11.09.15 at 02:59, <konrad.wilk@oracle.com> wrote:
> If you want a formula I would do:
> 
> #define MAX_SOCKETS 8
> 
>  max_pfns = pow(2,(MAX_SOCKETS - (max(nr_iommus(), MAX_SOCKETS)))) * 64;
> 
> Where nr_iommus would have to be somehow implemented, ditto for pow.
> 
> This should give you:
>  8	-> 64
>  7	-> 128
>  6	-> 256
>  5	-> 512
>  4	-> 1024
>  3	-> 2048
>  2	-> 4096
>  1	-> 16384

16k seems excessive as a default. Also - why would this be related
to the number of sockets? I don't think there's a one-IOMMU-per-
socket rule; fixed-number-of-IOMMUs-per-node might come closer,
but there we'd have the problem of what "fixed number" is. Wouldn't
something as simple as 1024 / nr_iommus() do?

I also don't follow what cache flushes you talked about earlier: I
don't think the IOMMUs drive any global cache flushes, and I
would have thought the size limited IOTLB and (CPU side) cache
ones should be pretty limited in terms of bus load (unless the TLB
ones would get converted to global ones due to lacking IOMMU
capabilities). Is that not the case?

Jan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-11  9:17                         ` Jan Beulich
@ 2015-09-11 10:28                           ` Malcolm Crossley
  2015-09-11 11:11                             ` Jan Beulich
  0 siblings, 1 reply; 20+ messages in thread
From: Malcolm Crossley @ 2015-09-11 10:28 UTC (permalink / raw)
  To: Jan Beulich, Tiejun Chen, Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Tim Deegan, IanJackson, Ian Campbell, xen-devel

On 11/09/15 10:17, Jan Beulich wrote:
>>>> On 11.09.15 at 02:59, <konrad.wilk@oracle.com> wrote:
>> If you want a formula I would do:
>>
>> #define MAX_SOCKETS 8
>>
>>  max_pfns = pow(2,(MAX_SOCKETS - (max(nr_iommus(), MAX_SOCKETS)))) * 64;
>>
>> Where nr_iommus would have to be somehow implemented, ditto for pow.
>>
>> This should give you:
>>  8	-> 64
>>  7	-> 128
>>  6	-> 256
>>  5	-> 512
>>  4	-> 1024
>>  3	-> 2048
>>  2	-> 4096
>>  1	-> 16384
> 
> 16k seems excessive as a default. Also - why would this be related
> to the number of sockets? I don't think there's a one-IOMMU-per-
> socket rule; fixed-number-of-IOMMUs-per-node might come closer,
> but there we'd have the problem of what "fixed number" is. Wouldn't
> something as simple as 1024 / nr_iommus() do?
> 
> I also don't follow what cache flushes you talked about earlier: I
> don't think the IOMMUs drive any global cache flushes, and I
> would have thought the size limited IOTLB and (CPU side) cache
> ones should be pretty limited in terms of bus load (unless the TLB
> ones would get converted to global ones due to lacking IOMMU
> capabilities). Is that not the case?

The data cache flushes are caused by the memory_type_changed() call at the
bottom of the XEN_DOMCTL_memory_mapping hypercall, not by the IOMMU code itself.

Malcolm

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-11 10:28                           ` Malcolm Crossley
@ 2015-09-11 11:11                             ` Jan Beulich
  2015-09-11 12:05                               ` Malcolm Crossley
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-09-11 11:11 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Tiejun Chen

>>> On 11.09.15 at 12:28, <malcolm.crossley@citrix.com> wrote:
> On 11/09/15 10:17, Jan Beulich wrote:
>>>>> On 11.09.15 at 02:59, <konrad.wilk@oracle.com> wrote:
>>> If you want a formula I would do:
>>>
>>> #define MAX_SOCKETS 8
>>>
>>>  max_pfns = pow(2,(MAX_SOCKETS - (max(nr_iommus(), MAX_SOCKETS)))) * 64;
>>>
>>> Where nr_iommus would have to be somehow implemented, ditto for pow.
>>>
>>> This should give you:
>>>  8	-> 64
>>>  7	-> 128
>>>  6	-> 256
>>>  5	-> 512
>>>  4	-> 1024
>>>  3	-> 2048
>>>  2	-> 4096
>>>  1	-> 16384
>> 
>> 16k seems excessive as a default. Also - why would this be related
>> to the number of sockets? I don't think there's a one-IOMMU-per-
>> socket rule; fixed-number-of-IOMMUs-per-node might come closer,
>> but there we'd have the problem of what "fixed number" is. Wouldn't
>> something as simple as 1024 / nr_iommus() do?
>> 
>> I also don't follow what cache flushes you talked about earlier: I
>> don't think the IOMMUs drive any global cache flushes, and I
>> would have thought the size limited IOTLB and (CPU side) cache
>> ones should be pretty limited in terms of bus load (unless the TLB
>> ones would get converted to global ones due to lacking IOMMU
>> capabilities). Is that not the case?
> 
> The data cache flushes are caused by the memory_type_changed()
> call at the bottom of the XEN_DOMCTL_memory_mapping hypercall,
> not by the IOMMU code itself.

In which case - contrary to what Konrad said he measured - their
impact on overall throughput shouldn't scale with socket (or node)
count (unless the hardware implementation of the flushes is bad).

Jan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-11 11:11                             ` Jan Beulich
@ 2015-09-11 12:05                               ` Malcolm Crossley
  2015-09-11 14:11                                 ` Jan Beulich
  0 siblings, 1 reply; 20+ messages in thread
From: Malcolm Crossley @ 2015-09-11 12:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Tiejun Chen

On 11/09/15 12:11, Jan Beulich wrote:
>>>> On 11.09.15 at 12:28, <malcolm.crossley@citrix.com> wrote:
>> On 11/09/15 10:17, Jan Beulich wrote:
>>>>>> On 11.09.15 at 02:59, <konrad.wilk@oracle.com> wrote:
>>>> If you want a formula I would do:
>>>>
>>>> #define MAX_SOCKETS 8
>>>>
>>>>  max_pfns = pow(2,(MAX_SOCKETS - (max(nr_iommus(), MAX_SOCKETS)))) * 64;
>>>>
>>>> Where nr_iommus would have to be somehow implemented, ditto for pow.
>>>>
>>>> This should give you:
>>>>  8	-> 64
>>>>  7	-> 128
>>>>  6	-> 256
>>>>  5	-> 512
>>>>  4	-> 1024
>>>>  3	-> 2048
>>>>  2	-> 4096
>>>>  1	-> 16384
>>>
>>> 16k seems excessive as a default. Also - why would this be related
>>> to the number of sockets? I don't think there's a one-IOMMU-per-
>>> socket rule; fixed-number-of-IOMMUs-per-node might come closer,
>>> but there we'd have the problem of what "fixed number" is. Wouldn't
>>> something as simple as 1024 / nr_iommus() do?
>>>
>>> I also don't follow what cache flushes you talked about earlier: I
>>> don't think the IOMMUs drive any global cache flushes, and I
>>> would have thought the size limited IOTLB and (CPU side) cache
>>> ones should be pretty limited in terms of bus load (unless the TLB
>>> ones would get converted to global ones due to lacking IOMMU
>>> capabilities). Is that not the case?
>>
>> The data cache flushes are caused by the memory_type_changed()
>> call at the bottom of the XEN_DOMCTL_memory_mapping hypercall,
>> not by the IOMMU code itself.
> 
> In which case - contrary to what Konrad said he measured - their
> impact on overall throughput shouldn't scale with socket (or node)
> count (unless the hardware implementation of the flushes is bad).
> 

The flush_all(FLUSH_CACHE) in mtrr.c will result in a flush_area_mask for all CPU's in the host.
It will more time to issue a IPI to all logical cores the more core's there are. I admit that
x2apic_cluster mode may speed this up but not all hosts will have that enabled.

The data flush will force all data out to memory controllers and it's possible that CPU's in
difference package have cached data all corresponding to a particular memory controller which will
become a bottleneck.

In worst case, with large delay between XEN_DOMCTL_memory_mapping hypercalls and on a 8 socket
system you may end up writing out 45MB (L3 cache) * 8 = 360MB to a single memory controller every 64
pages (256KiB) of domU p2m updated.

Malcolm


> Jan
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping
  2015-09-11 12:05                               ` Malcolm Crossley
@ 2015-09-11 14:11                                 ` Jan Beulich
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Beulich @ 2015-09-11 14:11 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, IanJackson, xen-devel,
	Tiejun Chen

>>> On 11.09.15 at 14:05, <malcolm.crossley@citrix.com> wrote:
> The flush_all(FLUSH_CACHE) in mtrr.c will result in a flush_area_mask for 
> all CPU's in the host.
> It will more time to issue a IPI to all logical cores the more core's there 
> are. I admit that
> x2apic_cluster mode may speed this up but not all hosts will have that 
> enabled.
> 
> The data flush will force all data out to memory controllers and it's 
> possible that CPU's in
> difference package have cached data all corresponding to a particular memory 
> controller which will
> become a bottleneck.
> 
> In worst case, with large delay between XEN_DOMCTL_memory_mapping hypercalls 
> and on a 8 socket
> system you may end up writing out 45MB (L3 cache) * 8 = 360MB to a single 
> memory controller every 64
> pages (256KiB) of domU p2m updated.

True.

Considering that BARs need to be properly aligned in both guest
and host address spaces, I wonder why we aren't using large
pages to map such huge BARs then. As it looks this would require
redefining the semantics of the domctl once again, but that's not
a big problem since - it's a domctl. I'll see if I can cook up something
(assuming that hosts used for passing through devices with such
huge BARs will have support for at least 2Mb pages in both EPT
[NPT always has] and IOMMU).

Jan

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-09-11 14:11 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-09  6:50 [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping Tiejun Chen
2015-09-09 14:20 ` Konrad Rzeszutek Wilk
2015-09-09 14:33   ` Jan Beulich
2015-09-09 14:50     ` Konrad Rzeszutek Wilk
2015-09-09 14:55       ` Jan Beulich
2015-09-09 15:05         ` Konrad Rzeszutek Wilk
2015-09-09 15:19       ` Malcolm Crossley
2015-09-09 15:44         ` Jan Beulich
2015-09-10  5:28           ` Chen, Tiejun
2015-09-10  8:13             ` Jan Beulich
2015-09-10  8:55               ` Chen, Tiejun
2015-09-10  8:59                 ` Jan Beulich
2015-09-10 17:55                   ` Konrad Rzeszutek Wilk
2015-09-11  0:44                     ` Chen, Tiejun
2015-09-11  0:59                       ` Konrad Rzeszutek Wilk
2015-09-11  9:17                         ` Jan Beulich
2015-09-11 10:28                           ` Malcolm Crossley
2015-09-11 11:11                             ` Jan Beulich
2015-09-11 12:05                               ` Malcolm Crossley
2015-09-11 14:11                                 ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.