RFC: large system support

* RFC: large system support - 128 CPUs
@ 2008-08-12 18:41 Bill Burns
  2008-08-13  8:21 ` Jan Beulich
  2008-08-13  8:21 ` Tim Deegan
  0 siblings, 2 replies; 12+ messages in thread
From: Bill Burns @ 2008-08-12 18:41 UTC (permalink / raw)
  To: xen-devel

There are a couple of issues with building the Hypervisor
with max_phys_cpus=128 for x86_64. (Note that this was
on a 3.1 base, but unstable appears to have
the same issue, at least with the first part).

First is a build assertion failure: the sizes of
the page_info structure and the shadow_page_info
structure get out of sync due to the presence
of cpumask_t in the page_info structure.

A possible fix is to tack on the following to
the end of shadow_page_info structure:

--- xen/arch/x86/mm/shadow/private.h.orig	2007-12-06 12:48:38.000000000 -0500
+++ xen/arch/x86/mm/shadow/private.h	2008-08-12 12:52:49.000000000 -0400
@@ -243,6 +243,12 @@ struct shadow_page_info
         /* For non-pinnable shadows, a higher entry that points at us */
         paddr_t up;
     };
+#if NR_CPUS > 64
+    /* Need to add some padding to match struct page_info size,
+    * if cpumask_t is larger than a long
+    */
+    u8 padding[sizeof(cpumask_t)-sizeof(long)];
+#endif
 };

 /* The structure above *must* be the same size as a struct page_info

The other issue is a runtime fault when
trying to bring up CPU 126. It seems the reserved
GDT space is not quite enough to hold the
per-CPU entries. Crude fix (awaiting test results,
so not sure that this is sufficient):

--- xen/include/asm-x86/desc.h.orig	2007-12-06 12:48:39.000000000 -0500
+++ xen/include/asm-x86/desc.h	2008-07-31 13:19:52.000000000 -0400
@@ -5,7 +5,11 @@
  * Xen reserves a memory page of GDT entries.
  * No guest GDT entries exist beyond the Xen reserved area.
  */
+#if MAX_PHYS_CPUS > 64
+#define NR_RESERVED_GDT_PAGES   2
+#else
 #define NR_RESERVED_GDT_PAGES   1
+#endif
 #define NR_RESERVED_GDT_BYTES   (NR_RESERVED_GDT_PAGES * PAGE_SIZE)
 #define NR_RESERVED_GDT_ENTRIES (NR_RESERVED_GDT_BYTES / 8)

Bill
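
The GDT capacity described above can be worked through as a minimal
sketch. The 8-byte descriptor size and the pages-to-entries arithmetic
come from the patch; the number of fixed entries (8) and of entries per
CPU (4) are assumptions that do not appear in this thread, chosen only
because they are consistent with a failure at CPU 126. All names are
hypothetical.

```c
#include <assert.h>

/* Values taken from the patch context: 4 KiB pages, 8-byte descriptors. */
#define SKETCH_PAGE_SIZE        4096u
#define SKETCH_GDT_ENTRY_BYTES  8u

/* ASSUMPTIONS, not stated in the thread: a few fixed descriptors come
 * first, and each CPU then needs a fixed number of entries. */
#define SKETCH_FIXED_ENTRIES    8u
#define SKETCH_ENTRIES_PER_CPU  4u

/* Number of descriptors that fit in the reserved GDT pages. */
static unsigned int sketch_reserved_gdt_entries(unsigned int pages)
{
    return pages * SKETCH_PAGE_SIZE / SKETCH_GDT_ENTRY_BYTES;
}

/* Number of CPUs whose per-CPU entries fit in the reserved area. */
static unsigned int sketch_max_cpus(unsigned int pages)
{
    unsigned int entries = sketch_reserved_gdt_entries(pages);

    assert(entries >= SKETCH_FIXED_ENTRIES);
    return (entries - SKETCH_FIXED_ENTRIES) / SKETCH_ENTRIES_PER_CPU;
}
```

Under these assumptions one page holds 512 descriptors, which is room
for CPUs 0 to 125, so CPU 126 would be the first to run off the end;
two pages would leave room for 254 CPUs.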


* Re: RFC: large system support - 128 CPUs
  2008-08-12 18:41 RFC: large system support - 128 CPUs Bill Burns
@ 2008-08-13  8:21 ` Jan Beulich
  2008-08-13  8:22   ` Tim Deegan
  2008-08-13  8:21 ` Tim Deegan
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2008-08-13  8:21 UTC (permalink / raw)
  To: Bill Burns; +Cc: xen-devel

Both seem to be hacks to get to 128 CPUs, without consideration of how
to go beyond that, or perhaps even drop the fixed (compile-time) limit
altogether. Since we have to expect to be run on larger systems not too
far into the future, I think it rather needs to be explored how to address
these issues (and any potential others) in a fully scalable way.

Jan


* Re: RFC: large system support - 128 CPUs
  2008-08-12 18:41 RFC: large system support - 128 CPUs Bill Burns
  2008-08-13  8:21 ` Jan Beulich
@ 2008-08-13  8:21 ` Tim Deegan
  1 sibling, 0 replies; 12+ messages in thread
From: Tim Deegan @ 2008-08-13  8:21 UTC (permalink / raw)
  To: Bill Burns; +Cc: xen-devel

At 14:41 -0400 on 12 Aug (1218552070), Bill Burns wrote:
> First is a build assertion due to the size of
> the page_info structure and the shadow_page_info
> structures get out of sync due to the presence
> of cpumask_t in the page info structure.
> 
> A possible fix is to tack on the following to
> the end of shadow_page_info structure:

Yep, that'll sort it out fine.  I don't think the #if is even needed
because a cpumask is always at least the size of a long.  Or you could
add a "cpumask_t _unused;" to the union with mbz in it since that's
where the sizes get out of sync.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]
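
The dummy-cpumask suggestion can be illustrated with a minimal sketch.
The two structures below are simplified stand-ins, not the real Xen
definitions: the field names other than mbz and _unused, the CPU count,
and the position of the union are assumptions for illustration only.
The point is that a dummy cpumask in the shadow structure's union makes
both unions grow together, so the size and offset checks hold for any
CPU count without an #if.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_NR_CPUS        128  /* assumed build-time CPU limit */
#define SKETCH_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_MASK_LONGS \
    ((SKETCH_NR_CPUS + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* One bit per CPU, rounded up to whole longs, so never smaller than a long. */
typedef struct {
    unsigned long bits[SKETCH_MASK_LONGS];
} sketch_cpumask_t;

/* Simplified stand-in for struct page_info (hypothetical layout). */
struct sketch_page_info {
    unsigned long count_info;
    union {
        unsigned long type_info;
        sketch_cpumask_t cpumask;   /* grows with the CPU limit */
    } u;
    unsigned long tail;             /* a field that must line up */
};

/* Simplified stand-in for struct shadow_page_info (hypothetical layout). */
struct sketch_shadow_page_info {
    unsigned long flags;
    union {
        unsigned long mbz;
        sketch_cpumask_t _unused;   /* dummy member: keeps the sizes in sync */
    } u;
    unsigned long tail;
};

/* Compile-time checks in the spirit of the ones in private.h. */
static_assert(sizeof(struct sketch_page_info) ==
              sizeof(struct sketch_shadow_page_info),
              "structures must be the same size");
static_assert(offsetof(struct sketch_page_info, tail) ==
              offsetof(struct sketch_shadow_page_info, tail),
              "fields after the union must line up");
```

Changing SKETCH_NR_CPUS to 64 or 512 leaves both assertions satisfied,
which is the property the padding array only achieves by computing the
size difference by hand.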


* Re: RFC: large system support - 128 CPUs
  2008-08-13  8:21 ` Jan Beulich
@ 2008-08-13  8:22   ` Tim Deegan
  2008-08-13  8:26     ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Tim Deegan @ 2008-08-13  8:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Bill Burns

At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
> Both seem to be hacks to get to 128 CPUs, without consideration of how
> to go beyond that

I think the shadow_page_info one is a general fix for my implicit
assumption that sizeof(cpumask_t) == sizeof (long).

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]


* Re: RFC: large system support - 128 CPUs
  2008-08-13  8:22   ` Tim Deegan
@ 2008-08-13  8:26     ` Keir Fraser
  2008-08-13  8:45       ` Jan Beulich
  2008-08-13 10:23       ` Bill Burns
  0 siblings, 2 replies; 12+ messages in thread
From: Keir Fraser @ 2008-08-13  8:26 UTC (permalink / raw)
  To: Tim Deegan, Jan Beulich; +Cc: xen-devel, Bill Burns

On 13/8/08 09:22, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:

> At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
>> Both seem to be hacks to get to 128 CPUs, without consideration of how
>> to go beyond that
> 
> I think the shadow_page_info one is a general fix for my implicit
> assumption that sizeof(cpumask_t) == sizeof (long).

Do some fields after the cpumask need to line up in both structures? Placing
a dummy cpumask in the shadow_page structure might make most sense.

For the other one I'll have to think a bit. The need for GDT entries per CPU
currently obviously means scaling much past a few hundred CPUs is going to
be difficult.

 -- Keir


* Re: RFC: large system support - 128 CPUs
  2008-08-13  8:26     ` Keir Fraser
@ 2008-08-13  8:45       ` Jan Beulich
  2008-08-13  8:47         ` Keir Fraser
  2008-08-13 10:23       ` Bill Burns
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2008-08-13  8:45 UTC (permalink / raw)
  To: Tim Deegan, Keir Fraser; +Cc: xen-devel, Bill Burns

>>> Keir Fraser <keir.fraser@eu.citrix.com> 13.08.08 10:26 >>>
>On 13/8/08 09:22, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:
>
>> At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
>>> Both seem to be hacks to get to 128 CPUs, without consideration of how
>>> to go beyond that
>> 
>> I think the shadow_page_info one is a general fix for my implicit
>> assumption that sizeof(cpumask_t) == sizeof (long).
>
>Do some fields after the cpumask need to line up in both structures? Placing
>a dummy cpumask in the shadow_page structure might make most sense.
>
>For the other one I'll have to think a bit. The need for GDT entries per CPU
>currently obviously means scaling much past a few hundred CPUs is going to
>be difficult.

But the cpumask-in-page_info is a scalability concern, too - systems with
many CPUs will tend to have a lot of memory, and the growing overhead
of the page_info array may become an issue then, too. Page clustering
may be an option to reduce/eliminate the growth, though I didn't spend
much thought on this or possible alternatives.

Jan


* Re: RFC: large system support - 128 CPUs
  2008-08-13  8:45       ` Jan Beulich
@ 2008-08-13  8:47         ` Keir Fraser
  2008-08-13  8:52           ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-08-13  8:47 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan; +Cc: xen-devel, Bill Burns

On 13/8/08 09:45, "Jan Beulich" <jbeulich@novell.com> wrote:

> But the cpumask-in-page_info is a scalability concern, too - systems with
> many CPUs will tend to have a lot of memory, and the growing overhead
> of the page_info array may become an issue then, too. Page clustering
> may be an option to reduce/eliminate the growth, though I didn't spend
> much thought on this or possible alternatives.

An extra 8 bytes per page per 64 CPUs is hardly a concern I think. We're
talking an overhead of 32 bytes per megabyte per CPU. The concern over
growing page_info array with growing memory is fallacious -- the overhead is
a constant fraction of total memory, if #CPUs is held constant.

 -- Keir


* Re: RFC: large system support - 128 CPUs
  2008-08-13  8:47         ` Keir Fraser
@ 2008-08-13  8:52           ` Keir Fraser
  0 siblings, 0 replies; 12+ messages in thread
From: Keir Fraser @ 2008-08-13  8:52 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich, Tim Deegan; +Cc: xen-devel, Bill Burns

On 13/8/08 09:47, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> On 13/8/08 09:45, "Jan Beulich" <jbeulich@novell.com> wrote:
> 
>> But the cpumask-in-page_info is a scalability concern, too - systems with
>> many CPUs will tend to have a lot of memory, and the growing overhead
>> of the page_info array may become an issue then, too. Page clustering
>> may be an option to reduce/eliminate the growth, though I didn't spend
>> much thought on this or possible alternatives.
> 
> An extra 8 bytes per page per 64 CPUs is hardly a concern I think. We're
> talking an overhead of 32 bytes per megabyte per CPU.

Put another way, at 512 CPUs the cpumasks would incur an overhead of <2% of
total memory. It's only really beyond that threshold that I'd be concerned.
The fact is it'll be a good while before 512 CPUs is concerning us, and
we'll have plenty of other scalability concerns, no doubt, by that point.

 -- Keir
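
The overhead figures quoted above can be checked with a small sketch,
assuming 4 KiB pages and a cpumask stored as whole 64-bit words, as on
x86_64. The function names are hypothetical.

```c
#include <assert.h>

#define OVH_PAGE_BYTES     4096ul               /* assumed page size */
#define OVH_PAGES_PER_MIB  (1048576ul / OVH_PAGE_BYTES)

/* Bytes of cpumask per page: one 8-byte word per 64 CPUs, rounded up. */
static unsigned long ovh_cpumask_bytes(unsigned int nr_cpus)
{
    assert(nr_cpus > 0);
    return ((nr_cpus + 63ul) / 64ul) * 8ul;
}

/* Bytes of cpumask needed to describe one MiB of memory. */
static unsigned long ovh_cpumask_bytes_per_mib(unsigned int nr_cpus)
{
    return ovh_cpumask_bytes(nr_cpus) * OVH_PAGES_PER_MIB;
}

/* cpumask storage as a percentage of the memory it describes. */
static double ovh_cpumask_percent(unsigned int nr_cpus)
{
    return 100.0 * (double)ovh_cpumask_bytes(nr_cpus) / (double)OVH_PAGE_BYTES;
}
```

For 64 CPUs this gives 8 bytes per page, or 2048 bytes per MiB, which
is 32 bytes per MiB per CPU. For 512 CPUs it gives 64 bytes per 4096-byte
page, about 1.56%, in line with the "<2%" estimate.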


* Re: RFC: large system support - 128 CPUs
  2008-08-13  8:26     ` Keir Fraser
  2008-08-13  8:45       ` Jan Beulich
@ 2008-08-13 10:23       ` Bill Burns
  2008-08-13 10:25         ` Keir Fraser
  1 sibling, 1 reply; 12+ messages in thread
From: Bill Burns @ 2008-08-13 10:23 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Tim Deegan

Keir Fraser wrote:
> On 13/8/08 09:22, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:
> 
>> At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
>>> Both seem to be hacks to get to 128 CPUs, without consideration of how
>>> to go beyond that
>> I think the shadow_page_info one is a general fix for my implicit
>> assumption that sizeof(cpumask_t) == sizeof (long).
> 
> Do some fields after the cpumask need to line up in both structures? Placing
> a dummy cpumask in the shadow_page structure might make most sense.

Yes, there is a check that a field of page_info and a
field of the shadow_page_info are at the same offset.
Both compile time checks are in private.h

> 
> For the other one I'll have to think a bit. The need for GDT entries per CPU
> currently obviously means scaling much past a few hundred CPUs is going to
> be difficult.

Yes, would like something better here. And as I said, we
don't know yet that just adding the additional page solves
anything.

 Bill


> 
>  -- Keir
> 
> 


* Re: RFC: large system support - 128 CPUs
  2008-08-13 10:23       ` Bill Burns
@ 2008-08-13 10:25         ` Keir Fraser
  2008-08-13 10:53           ` Bill Burns
  0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-08-13 10:25 UTC (permalink / raw)
  To: Bill Burns; +Cc: xen-devel, Tim Deegan

On 13/8/08 11:23, "Bill Burns" <bburns@redhat.com> wrote:

>> For the other one I'll have to think a bit. The need for GDT entries per CPU
>> currently obviously means scaling much past a few hundred CPUs is going to
>> be difficult.
> 
> Yes, would like something better here. And as I said, we
> don't know yet that just adding the additional page solves
> anything.

How many CPUs do you currently need/want to support?

 -- Keir


* Re: RFC: large system support - 128 CPUs
  2008-08-13 10:25         ` Keir Fraser
@ 2008-08-13 10:53           ` Bill Burns
  2008-08-13 11:15             ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Bill Burns @ 2008-08-13 10:53 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Tim Deegan

Keir Fraser wrote:
> On 13/8/08 11:23, "Bill Burns" <bburns@redhat.com> wrote:
> 
>>> For the other one I'll have to think a bit. The need for GDT entries per CPU
>>> currently obviously means scaling much past a few hundred CPUs is going to
>>> be difficult.
>> Yes, would like something better here. And as I said, we
>> don't know yet that just adding the additional page solves
>> anything.
> 
> How many CPUs do you currently need/want to support?
> 

Currently just looking to get 128 working.
But would be nice to have some proper sizing,
or even detection of running out. There is a
'last' GDT entry or some such #define, that is
never used (at least in the 3.1 code base).

 Bill


>  -- Keir
> 
> 
> 

* Re: RFC: large system support - 128 CPUs
  2008-08-13 10:53           ` Bill Burns
@ 2008-08-13 11:15             ` Keir Fraser
  0 siblings, 0 replies; 12+ messages in thread
From: Keir Fraser @ 2008-08-13 11:15 UTC (permalink / raw)
  To: Bill Burns; +Cc: xen-devel

On 13/8/08 11:53, "Bill Burns" <bburns@redhat.com> wrote:

>> How many CPUs do you currently need/want to support?
>> 
> 
> Currently just looking to get 128 working.
> But would be nice to have some proper sizing,
> or even detection of running out. There is a
> 'last' GDT entry or some such #define, that is
> never used (at least in the 3.1 code base).

I think your 'two pages' change will probably work. Then we just need a
run-time check when bringing a CPU online that there is space in the GDT for
its entries.

 -- Keir
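
A run-time check of the kind described above might look like the
following minimal sketch. It is not the Xen implementation: the
function names, the error value, and the layout constants (8 fixed
entries, 4 entries per CPU) are assumptions for illustration, and the
actual bring-up work is left as a comment.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    CHK_PAGE_BYTES      = 4096,  /* page size from the patch context */
    CHK_DESC_BYTES      = 8,     /* bytes per GDT descriptor */
    CHK_FIXED_ENTRIES   = 8,     /* ASSUMPTION: fixed entries before per-CPU ones */
    CHK_ENTRIES_PER_CPU = 4      /* ASSUMPTION: descriptors needed per CPU */
};

/* True if every GDT entry for this CPU lies inside the reserved area. */
static bool chk_gdt_has_room(unsigned int cpu, unsigned int reserved_pages)
{
    unsigned long total, last;

    assert(reserved_pages > 0);
    total = (unsigned long)reserved_pages * CHK_PAGE_BYTES / CHK_DESC_BYTES;
    last  = CHK_FIXED_ENTRIES
          + (unsigned long)cpu * CHK_ENTRIES_PER_CPU
          + CHK_ENTRIES_PER_CPU - 1;
    return last < total;
}

/* Hypothetical bring-up path: refuse the CPU instead of faulting later. */
static int chk_cpu_up(unsigned int cpu, unsigned int reserved_pages)
{
    if (!chk_gdt_has_room(cpu, reserved_pages))
        return -1;               /* no space for this CPU's descriptors */

    /* ... the real per-CPU descriptor setup would go here ... */
    return 0;
}
```

With one reserved page and these assumed constants, chk_cpu_up(125, 1)
succeeds and chk_cpu_up(126, 1) is rejected; with two pages, CPU 127
is accepted.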

