Xen x86 host memory limit issues

* Xen x86 host memory limit issues
@ 2015-08-24 10:36 Andrew Cooper
  2015-08-24 11:47 ` Jan Beulich
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Cooper @ 2015-08-24 10:36 UTC
  To: Xen-devel List
  Cc: Elena Ufimtseva, Juergen Gross, Tim Deegan, George Dunlap,
	Jan Beulich

(Following up from a discussion at the Seattle Summit).

While the theoretical Xen x86 host memory limit is 16TB (or 123TB with
CONFIG_BIGMEM), Xen doesn't actually function correctly if host ram
exceeds the addressable range in the directmap region, which is at the
5TB boundary (or 3.5TB with CONFIG_BIGMEM).

The ultimate bug is that alloc_xenheap_pages() returns virtual addresses
which exceed HYPERVISOR_VIRT_END. 

Because of the way the idle pagetables and monitor pagetables extend the
directmap region, these pointers are safe to use.  However, in the
context of a 64bit PV guest, these virtual addresses belong to the guest
kernel.
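
As a rough illustration of the failure mode, the sketch below maps a machine address through a linear direct map and checks whether the resulting pointer still lies below the hypervisor's end of address space. The two address constants are assumptions chosen to match the 5TB (non-BIGMEM) figure above, and the `sketch_` helpers are hypothetical names, not Xen functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TB(x) ((uint64_t)(x) << 40)

/* Assumed non-BIGMEM layout: a 5TB direct map ending at HYPERVISOR_VIRT_END. */
#define SKETCH_DIRECTMAP_VIRT_START UINT64_C(0xffff830000000000)
#define SKETCH_HYPERVISOR_VIRT_END  UINT64_C(0xffff880000000000)

_Static_assert(SKETCH_HYPERVISOR_VIRT_END - SKETCH_DIRECTMAP_VIRT_START ==
               SKETCH_TB(5), "assumed direct map window is 5TB");

/* Linear direct map: virtual address = window start + machine address. */
static inline uint64_t sketch_maddr_to_virt(uint64_t maddr)
{
    assert(maddr <= UINT64_MAX - SKETCH_DIRECTMAP_VIRT_START);
    return SKETCH_DIRECTMAP_VIRT_START + maddr;
}

/* True if the address is below the end of Xen's range; at or above that
 * boundary the address belongs to a 64bit PV guest kernel. */
static inline bool sketch_va_is_xen_owned(uint64_t va)
{
    return va < SKETCH_HYPERVISOR_VIRT_END;
}
```

Under these assumptions a page at machine address 4TB maps to a Xen-owned pointer, while any page at or above 5TB maps into the guest kernel's range.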

In my repro case (6TB box, 8 numa nodes), it was particularly easy to
trigger the issue from a 64bit dom0 with `xenpm get-cpuidle-states all`
or `echo c > /proc/sysrq-trigger`, both of which went and accessed
per-cpu data allocated higher than HYPERVISOR_VIRT_END and unmapped in
the dom0 kernel pagetables.  (On Broadwell hardware, I would expect SMAP
violations as the guest kernel pages are user pages).

For XenServer, I used the following gross hack to work around the problem:

diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index 3c64f19..715765a 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -15,7 +15,7 @@
  * opt_mem: Limit maximum address of physical RAM.
  *          Any RAM beyond this address limit is ignored.
  */
-static unsigned long long __initdata opt_mem;
+static unsigned long long __initdata opt_mem = GB(5 * 1024);
 size_param("mem", opt_mem);

 /*

This causes Xen to ignore any RAM above the 5TB boundary.  (We used a
similar trick with the 1TB limit for 32bit toolstack domains and migration).
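
The effect of such a cap can be sketched on a toy memory map: RAM regions entirely above the limit are dropped and a region straddling it is truncated. This is a simplified model, not Xen's e820 code; the struct, the type constant and `sketch_clip_ram()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TYPE_RAM 1u  /* hypothetical type tag for usable RAM */

struct sketch_region {
    uint64_t addr;  /* start address in bytes */
    uint64_t size;  /* length in bytes */
    uint32_t type;  /* SKETCH_TYPE_RAM or anything else */
};

/* Clip RAM regions to [0, limit) in place; non-RAM regions are kept
 * untouched.  Returns the new number of entries in the map. */
static inline size_t sketch_clip_ram(struct sketch_region *map, size_t nr,
                                     uint64_t limit)
{
    size_t out = 0;

    assert(map != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        struct sketch_region r = map[i];

        if (r.type == SKETCH_TYPE_RAM) {
            if (r.addr >= limit)
                continue;                  /* entirely above the limit */
            if (r.size > limit - r.addr)
                r.size = limit - r.addr;   /* straddles the limit */
        }
        map[out++] = r;
    }
    return out;
}
```

With a 5TB limit, a RAM region covering 4TB to 6TB would shrink to 4TB to 5TB, and a RAM region starting at 7TB would disappear from the map.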

The infrastructure around xenheap_max_mfn() is supposed to cause all
xenheap page allocations to fall within the Xen direct mapped region,
but experimentally doesn't work correctly.

In all cases I have seen, the bad xenheap allocations have been from
calls which contain numa information in the memflags, which leads me to
suspect it is an interaction issue of numa hinting information and
xenheap_bits.  At a guess I suspect alloc_heap_pages() doesn't correctly
override the numa hint when both a numa hint and zone limit are
provided, but I have not investigated this yet.
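
The intended behaviour can be modelled as follows: the NUMA hint only decides which node is tried first, while the address limit is a hard constraint applied on every node. This is a toy model under that assumption and says nothing about what alloc_heap_pages() actually does; all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_MFN UINT64_MAX

/* Toy node: free frames form one range [first_free_mfn, end_mfn). */
struct sketch_node {
    uint64_t first_free_mfn;
    uint64_t end_mfn;
};

/* Allocate one frame.  Start at the hinted node and fall back to the
 * others in order; a frame above max_mfn is never returned, even if the
 * hinted node has plenty of free memory above the limit. */
static inline uint64_t sketch_alloc_page(struct sketch_node *nodes,
                                         unsigned int nr_nodes,
                                         unsigned int hint,
                                         uint64_t max_mfn)
{
    assert(nodes != NULL || nr_nodes == 0);
    for (unsigned int i = 0; i < nr_nodes; i++) {
        struct sketch_node *n = &nodes[(hint + i) % nr_nodes];

        if (n->first_free_mfn < n->end_mfn && n->first_free_mfn <= max_mfn)
            return n->first_free_mfn++;
    }
    return SKETCH_INVALID_MFN;
}
```

In this model a request hinting at a node that lies wholly above the limit is satisfied from a lower node instead, which is the override the xenheap path needs.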

Fixing that bug will be a useful step, as it will allow Xen to function
with host ram above the direct map limit, but is still not an optimal
solution as it prevents getting numa-local xenheap memory.

Longterm it would be optimal to segment the direct map region by numa
node so there are equal quantities of xenheap memory available from each
numa node.  This also has an added security benefit as it makes ret2dir
exploits harder, as the direct map target address is no longer a static
calculation from the point of view of the attacker.
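
For a sense of scale, an equal split of the direct map can be sketched as below. The 4KiB page size and the helper names are assumptions for illustration, and the start address passed in is up to the caller; a real design would also have to decide which part of each node's RAM sits behind its slice.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumes 4KiB pages */

/* Bytes of direct map per node under an equal split, rounded down to a
 * page boundary. */
static inline uint64_t sketch_node_share(uint64_t directmap_bytes,
                                         unsigned int nr_nodes)
{
    uint64_t page_mask = (UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1;

    assert(nr_nodes > 0);
    return (directmap_bytes / nr_nodes) & ~page_mask;
}

/* Virtual base address of the slice belonging to a given node. */
static inline uint64_t sketch_node_slice_base(uint64_t directmap_start,
                                              uint64_t directmap_bytes,
                                              unsigned int nr_nodes,
                                              unsigned int node)
{
    assert(node < nr_nodes);
    return directmap_start +
           (uint64_t)node * sketch_node_share(directmap_bytes, nr_nodes);
}
```

On the 8-node box above, a 5TB direct map would give each node a 640GB slice.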

~Andrew

* Re: Xen x86 host memory limit issues
  2015-08-24 10:36 Xen x86 host memory limit issues Andrew Cooper
@ 2015-08-24 11:47 ` Jan Beulich
  2015-08-24 12:31   ` Andrew Cooper
  0 siblings, 1 reply; 3+ messages in thread
From: Jan Beulich @ 2015-08-24 11:47 UTC
  To: Andrew Cooper
  Cc: George Dunlap, Elena Ufimtseva, Tim Deegan, Juergen Gross,
	Xen-devel List

>>> On 24.08.15 at 12:36, <andrew.cooper3@citrix.com> wrote:
> The infrastructure around xenheap_max_mfn() is supposed to cause all
> xenheap page allocations to fall within the Xen direct mapped region,
> but experimentally doesn't work correctly.
> 
> In all cases I have seen, the bad xenheap allocations have been from
> calls which contain numa information in the memflags, which leads me to
> suspect it is an interaction issue of numa hinting information and
> xenheap_bits.  At a guess I suspect alloc_heap_pages() doesn't correctly
> override the numa hint when both a numa hint and zone limit are
> provided, but I have not investigated this yet.

But you're in the ideal position to do so. As said previously on the same
topic, looking just at the code I can't see what's wrong, even when
taking into account the experimentally observed behavior.

> Fixing that bug will be a useful step, as it will allow Xen to function
> with host ram above the direct map limit, but is still not an optimal
> solution as it prevents getting numa-local xenheap memory.
> 
> Longterm it would be optimal to segment the direct map region by numa
> node so there are equal quantities of xenheap memory available from each
> numa node.

Yes, albeit I'm suspecting there to arise (at least theoretical) issues
on systems with many nodes - the per-node ranges directly mapped
may become unreasonably small (and we may risk exhausting node
0's memory due to not NUMA-tagged allocation requests).

Jan

* Re: Xen x86 host memory limit issues
  2015-08-24 11:47 ` Jan Beulich
@ 2015-08-24 12:31   ` Andrew Cooper
  0 siblings, 0 replies; 3+ messages in thread
From: Andrew Cooper @ 2015-08-24 12:31 UTC
  To: Jan Beulich
  Cc: Elena Ufimtseva, Juergen Gross, George Dunlap, Tim Deegan,
	Xen-devel List

On 24/08/15 12:47, Jan Beulich wrote:
>>>> On 24.08.15 at 12:36, <andrew.cooper3@citrix.com> wrote:
>> The infrastructure around xenheap_max_mfn() is supposed to cause all
>> xenheap page allocations to fall within the Xen direct mapped region,
>> but experimentally doesn't work correctly.
>>
>> In all cases I have seen, the bad xenheap allocations have been from
>> calls which contain numa information in the memflags, which leads me to
>> suspect it is an interaction issue of numa hinting information and
>> xenheap_bits.  At a guess I suspect alloc_heap_pages() doesn't correctly
>> override the numa hint when both a numa hint and zone limit are
>> provided, but I have not investigated this yet.
> But you're in the ideal position to do so. As said previously on the same
> topic, looking just at the code I can't see what's wrong, even when
> taking into account the experimentally observed behavior.

It is high on (but not top of) my todo list, as we currently have the
workaround in place.

From discussions at the Summit, I know that Oracle, SUSE and Citrix all
have machines large enough to reproduce the issue.  This information is
provided at the request of Elena and Konrad (who it turns out I forgot
to CC on the original message.  Sorry!)

>
>> Fixing that bug will be a useful step, as it will allow Xen to function
>> with host ram above the direct map limit, but is still not an optimal
>> solution as it prevents getting numa-local xenheap memory.
>>
>> Longterm it would be optimal to segment the direct map region by numa
>> node so there are equal quantities of xenheap memory available from each
>> numa node.
> Yes, albeit I'm suspecting there to arise (at least theoretical) issues
> on systems with many nodes - the per-node ranges directly mapped
> may become unreasonably small (and we may risk exhausting node
> 0's memory due to not NUMA-tagged allocation requests).

There are a number of allocation constraints.  Off the top of my head:

* DMA pools for dom0 (mitigated in certain circumstances by PVIOMMU)
* <128GB for 32bit PV domheap pages
* <4GB for some 32bit PV L3 pages

Some of this can be avoided by allocating directmap from the upper ram
in the numa nodes.  Exhaustion of node 0 can be mitigated by striping
allocations without a numa hint, or allocating from the node with most
free space remaining.
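
The second mitigation, picking the node with the most free space when no hint is given, could look roughly like this. The per-node free-page array and the function name are hypothetical, and ties are resolved in favour of the lowest-numbered node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the node with the most free pages.  Intended for
 * allocations that carry no NUMA hint, so that node 0 is not drained
 * first. */
static inline unsigned int sketch_pick_fallback_node(const uint64_t *free_pages,
                                                     unsigned int nr_nodes)
{
    unsigned int best = 0;

    assert(free_pages != NULL && nr_nodes > 0);
    for (unsigned int i = 1; i < nr_nodes; i++)
        if (free_pages[i] > free_pages[best])
            best = i;
    return best;
}
```

Striping would instead rotate a counter over the nodes; that spreads load evenly but ignores how full each node already is.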

There should actually be very few allocations which can't have a numa
hint provided.  All allocations for anything hardware related should be
on the local node, and everything else should be allocations on behalf of
a domain, which itself has numa information.

As an orthogonal task, we should see whether it is possible to nab any
virtual address space back from 64bit PV guests, or whether it is
irreparably fixed at its current value.

~Andrew
