* Re: PAE issue (32-on-64 work)
From: Joe Bonasera @ 2006-10-19 16:19 UTC
To: xen-devel
> Date: Thu, 19 Oct 2006 13:56:51 +0100
> From: Keir Fraser <Keir.Fraser@cl.cam.ac.uk>
> Subject: Re: [Xen-devel] PAE issue (32-on-64 work)
> To: Jan Beulich <jbeulich@novell.com>, <xen-devel@lists.xensource.com>
> Message-ID: <C15D34A3.2CB1%Keir.Fraser@cl.cam.ac.uk>
> Content-Type: text/plain; charset="US-ASCII"
>
> On 19/10/06 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>
>>Just now I found that there is a resulting issue for the 32on64 work I'm
>>doing: Since none of the entries 4...511 of the PMD get initialized in Linux,
>>and since Xen nevertheless has to validate all 512 entries (in order to
>>avoid making available translations that could be used during speculative
>>execution), the validation has the potential to fail (and does in reality),
>>resulting in the guest dying. The only option I presently see is to special
>>case the compatibility guest in the l3 handling and (I really hate to do
>>that) clear out the 508 supposedly unused entries (or at least clear
>>their present bits), meaning that no guest may ever make clever
>>assumptions and try to store some other data in the unused portion of
>>the pgd page.
>
>
> Either copy the PGDs out into a shadow L3, as we do for PAE Xen today. Or,
> as you say, zap the 508 unused entries. No guest uses them -- I'm pretty
> sure Linux is the only PAE-capable guest (others are non-pae or 64-bit).
> Storing other stuff in the page would be inconvenient anyway since it has to
> be read-only.
>
> -- Keir
>
I happen to be changing the Solaris 32-bit domains to support PAE on Xen
right now, specifically so that we can use the 32-on-64 capabilities as soon
as they are available.
The code path in Solaris currently supports two possibilities for PAE top-level
tables. The normal code we use on bare metal allocates only one page,
which all CPUs share for the top-level page table. For
example, cpu0 uses the first four quadwords (8-byte entries) for its current
process's L3, cpu1 uses the next four, etc. On context switch or CR3 reload
we (re)copy the process's 4 entries into that CPU's section
of the page.
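A minimal C sketch of the shared-page scheme described above, under stated assumptions: 4 KiB pages, 8-byte PAE entries, and CPU n owning entries 4n..4n+3. All names (`shared_l3_page`, `cpu_l3_slot`, `load_process_l3`) are hypothetical and not taken from the Solaris source. Because each slot is 32 bytes into a page-aligned page, every slot is itself 32-byte aligned, which is what a PAE CR3 value requires.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PTES_PER_PAGE   512u  /* 4096-byte page / 8-byte entries */
#define PAE_L3_ENTRIES  4u    /* a PAE top level has four 8-byte entries */
#define MAX_SLOTS       (PTES_PER_PAGE / PAE_L3_ENTRIES)  /* 128 CPUs per page */

typedef uint64_t pae_pte_t;

/* Hypothetical: one page shared by all CPUs; CPU n owns entries 4n..4n+3. */
static _Alignas(4096) pae_pte_t shared_l3_page[PTES_PER_PAGE];

/* Return the 32-byte slot owned by 'cpu'. */
static pae_pte_t *cpu_l3_slot(unsigned cpu)
{
    assert(cpu < MAX_SLOTS);
    return &shared_l3_page[cpu * PAE_L3_ENTRIES];
}

/* On context switch (or CR3 reload), recopy the incoming process's four
 * top-level entries into this CPU's slot; CR3 would then point at the slot. */
static pae_pte_t *load_process_l3(unsigned cpu,
                                  const pae_pte_t proc_l3[PAE_L3_ENTRIES])
{
    pae_pte_t *slot = cpu_l3_slot(cpu);
    memcpy(slot, proc_l3, PAE_L3_ENTRIES * sizeof(pae_pte_t));
    return slot;
}
```

With 512 entries per page, one page serves up to 128 CPUs, which is why the bare-metal version is so cheap in memory; the cost under Xen is that every recopy is an update to a live L3.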
That code path is, like so much of the 32-bit PAE support, a special
case. So it was easy to turn it off and instead use
an entire page for each unique top-level L3 on Xen. I did that just for
my initial bring-up on PAE Xen, but was hoping to go back to some
form of the optimized version next.
Joe
* Re: PAE issue (32-on-64 work)
From: Keir Fraser @ 2006-10-19 16:24 UTC
To: Joe Bonasera, xen-devel
On 19/10/06 17:19, "Joe Bonasera" <joe.bonasera@sun.com> wrote:
> The code path in Solaris currently supports 2 possibilities for PAE top level
> tables. The normal code we use on bare metal allocates only 1 page
> that all CPUs share for the top level page table. For
> example, cpu0 uses the 1st four quads for its current process'
> L3, cpu1 uses the next four, etc. On context switch or cr3 reload
> we (re)copy in the 4 entries of the process for that CPU's section
> of the page.
>
> That code path is, as so much of the 32 bit PAE support, a special
> case. So it was easily turned off and made to just use
> an entire page for each unique top level L3 on Xen. I did that just for
> my initial bring up on PAE Xen, but was hoping to go back to some
> form of the optimized version next.
You should just allocate a page-sized L3 per process and be done with it. A
page overhead per process is nothing to be concerned about (clearly the
overhead can be even bigger if, for example, you run 4-level tables on
x86_64). Recopying the L3 entries every TLB flush will *not* perform well on
current Xen.
-- Keir
* Re: PAE issue (32-on-64 work)
From: Joe Bonasera @ 2006-10-19 17:22 UTC
To: Keir Fraser; +Cc: xen-devel
Keir Fraser wrote:
> On 19/10/06 17:19, "Joe Bonasera" <joe.bonasera@sun.com> wrote:
>
>
>>The code path in Solaris currently supports 2 possibilities for PAE top level
>>tables. The normal code we use on bare metal allocates only 1 page
>>that all CPUs share for the top level page table. For
>>example, cpu0 uses the 1st four quads for its current process'
>>L3, cpu1 uses the next four, etc. On context switch or cr3 reload
>>we (re)copy in the 4 entries of the process for that CPU's section
>>of the page.
>>
>>That code path is, as so much of the 32 bit PAE support, a special
>>case. So it was easily turned off and made to just use
>>an entire page for each unique top level L3 on Xen. I did that just for
>>my initial bring up on PAE Xen, but was hoping to go back to some
>>form of the optimized version next.
>
>
> You should just allocate a page-sized L3 per process and be done with it. A
> page overhead per process is nothing to be concerned about (clearly the
> overhead can be even bigger if, for example, you run 4-level tables on
> x86_64). Recopying the L3 entries every TLB flush will *not* perform well on
> current Xen.
>
Well, we actually don't do complete TLB flushes very often at all: essentially
only the first time a running process creates a new L3 entry, which for most
processes means never, since processes larger than 1 GB are rare (each PAE L3
entry maps 1 GB).
So it shouldn't matter if they hit one or two slowish flushes.
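The 1 GB granularity follows from how PAE splits a 32-bit virtual address: 2 bits of L3 index, 9 bits of L2 index, 9 bits of L1 index and a 12-bit page offset. The helper names below are hypothetical; the sketch only shows why a process whose mappings stay inside one 1 GB region needs a single L3 entry.

```c
#include <assert.h>
#include <stdint.h>

/* PAE 2/9/9/12 split of a 32-bit virtual address. */
static unsigned pae_l3_index(uint32_t va) { return va >> 30; }            /* 4 entries, 1 GB each */
static unsigned pae_l2_index(uint32_t va) { return (va >> 21) & 0x1ffu; } /* 512 entries, 2 MB each */
static unsigned pae_l1_index(uint32_t va) { return (va >> 12) & 0x1ffu; } /* 512 entries, 4 KB each */

/* Hypothetical helper: how many L3 entries (1 GB slots) does the
 * range [start, start + len) touch? Requires len > 0. */
static unsigned l3_entries_spanned(uint32_t start, uint32_t len)
{
    uint64_t last = (uint64_t)start + len - 1;  /* 64-bit to avoid wraparound */
    assert(len > 0 && last <= UINT32_MAX);
    return (unsigned)(last >> 30) - pae_l3_index(start) + 1;
}
```

A 512 MB process starting at a typical low user address stays in L3 slot 0, so it never causes a new L3 entry to be created after start-up.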
Are there any other performance implications to watch out for?
Joe
* Re: PAE issue (32-on-64 work)
From: Keir Fraser @ 2006-10-19 18:48 UTC
To: Joe Bonasera; +Cc: xen-devel
On 19/10/06 6:22 pm, "Joe Bonasera" <joe.bonasera@sun.com> wrote:
>> You should just allocate a page-sized L3 per process and be done with it. A
>> page overhead per process is nothing to be concerned about (clearly the
>> overhead can be even bigger if, for example, you run 4-level tables on
>> x86_64). Recopying the L3 entries every TLB flush will *not* perform well on
>> current Xen.
>>
>
> Well we actually don't do complete TLB flushes very often at all, essentially
> only the first time a new L3 entry is created by a running process which
> for most processes means never - as >1Gig processes are rare.
> So it shouldn't matter if they hit one or two slowish flushes.
>
> Are there any other performance implications to watch out for?
I don't think so. Just remember that PAE L3 entry updates are not fast. We
really expect them to happen only on process creation/destruction (or at a
similar frequency).
-- Keir
* PAE issue (32-on-64 work)
From: Jan Beulich @ 2006-10-19 10:39 UTC
To: xen-devel
As I have said before, I think the current way of handling the
top level of PAE paging is inappropriate, even after the above-4G adjustments
that cured part of the problem. Specifically:
- the handling isn't consistent with how hardware behaves in the same
situation (though the Xen behavior is probably within the range allowed by
the generic architecture specification): the processor reads the 4 top-level
entries when CR3 is (re)loaded (and hence doesn't access them again later in
any way), while Xen treats them (including any updates to them) just like
entries at any other level of the hierarchy
- the guest still needs to allocate a full page, even though only the first 32
bytes of it (4 entries of 8 bytes) are actually used
- the shadowing done in Xen could be avoided altogether by following
hardware behavior.
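The hardware behaviour in the first point can be modelled in a few lines. This is a toy sketch, not Xen code: `mem` stands in for guest memory and `struct pdpte_regs` for the processor's internal copies of the four entries. It assumes a 32-bit PAE CR3, whose bits 31:5 hold the 32-byte-aligned table address (bits 3 and 4 are the PWT/PCD cache-control flags).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAE_L3_ENTRIES 4u
typedef uint64_t pae_pte_t;

/* Toy model of the processor's internal copies of the four entries. */
struct pdpte_regs { pae_pte_t e[PAE_L3_ENTRIES]; };

/* PAE CR3 bits 31:5 hold the 32-byte-aligned address of the 4-entry table. */
static uint32_t pdpt_base_from_cr3(uint32_t cr3) { return cr3 & ~0x1fu; }

/* The four entries are read once, at CR3 load time; later writes to 'mem'
 * are not observed until the next load. */
static void load_cr3(struct pdpte_regs *regs, const uint8_t *mem, uint32_t cr3)
{
    memcpy(regs->e, mem + pdpt_base_from_cr3(cr3), sizeof regs->e);
}
```

Only 32 bytes are read, and nothing beyond them is ever consulted, which is the behaviour the message contrasts with Xen validating the whole page.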
Just now I found that there is a resulting issue for the 32on64 work I'm
doing: since none of the entries 4...511 of the PMD get initialized in Linux,
and since Xen nevertheless has to validate all 512 entries (to avoid making
available translations that could be used during speculative execution), the
validation can fail (and does in practice), resulting in the guest dying. The
only option I presently see is to special-case the compatibility guest in the
L3 handling and (I really hate to do that) clear out the 508 supposedly unused
entries (or at least clear their present bits), meaning that no guest may ever
make clever assumptions and try to store other data in the unused portion of
the pgd page.
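The "clear the present bits" option might look roughly like the following. `zap_unused_l3_entries` is a hypothetical helper, not an existing Xen function; it assumes the caller may write the 4096-byte page (the guest's copy is read-only to the guest, so presumably the hypervisor would do this before validation).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_PAGE  512u
#define PAE_L3_ENTRIES 4u
#define PTE_PRESENT    0x1ull   /* bit 0 of an x86 page-table entry */

typedef uint64_t pae_pte_t;

/* Entries 4..511: the 508 entries nobody initializes. */
_Static_assert(PTES_PER_PAGE - PAE_L3_ENTRIES == 508, "unused entry count");

/* Make entries 4..511 non-present so whatever they hold can never be used
 * for (speculative) translation. With clear_whole_entry set, zero them
 * entirely; otherwise clear only the present bit. Returns how many entries
 * were present beforehand. */
static size_t zap_unused_l3_entries(pae_pte_t page[PTES_PER_PAGE],
                                    int clear_whole_entry)
{
    size_t cleared = 0;
    for (size_t i = PAE_L3_ENTRIES; i < PTES_PER_PAGE; i++) {
        if (page[i] & PTE_PRESENT)
            cleared++;
        if (clear_whole_entry)
            page[i] = 0;
        else
            page[i] &= ~PTE_PRESENT;
    }
    return cleared;
}
```

Clearing only the present bit preserves any other bits a guest might have stored there, which is the gentler of the two variants mentioned in the message.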
Thanks for sharing any other ideas on how to overcome this problem,
Jan
* RE: PAE issue (32-on-64 work)
From: Ian Pratt @ 2006-10-19 11:03 UTC
To: Jan Beulich, xen-devel
> As I had expressed before, I'm thinking that the current way of handling the
> top level of PAE paging is inappropriate, even after the above-4G adjustments
> that cured part of the problem. This is specifically because
> - the handling here isn't consistent with how hardware behaves in the same
> situation (though the Xen behavior is probably within range of the generic
> architecture specification), in that the processor reads the 4 top level
> entries when CR3 gets re-loaded (and hence doesn't try to access them later
> in any way), while Xen treats them (including potential updates to them) just
> like on any level in the hierarchy
> - the guest still needs to allocate a full page, even though only the first
> 32 bytes of it are actually used
> - the shadowing done in Xen could be avoided altogether by following
> hardware behavior.
>
> Just now I found that there is a resulting issue for the 32on64 work I'm
> doing: Since none of the entries 4...511 of the PMD get initialized in
> Linux, and since Xen nevertheless has to validate all 512 entries (in order
> to avoid making available translations that could be used during
> speculative execution), the validation has the potential to fail (and does
> in reality), resulting in the guest dying. The only option I presently see
> is to special case the compatibility guest in the l3 handling and (I really
> hate to do that) clear out the 508 supposedly unused entries (or at least
> clear their present bits), meaning that no guest may ever make clever
> assumptions and try to store some other data in the unused portion of
> the pgd page.
Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3 entries
get copied on every CR3 load?
It's most analogous to what happens today.
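One way to read this suggestion, as a sketch with hypothetical names: each vcpu owns a fixed L4 page and a fixed L3 page; L4 slot 0 (which covers the first 512 GB of the 64-bit address space) points at the per-vcpu L3, and on every guest CR3 load the guest's four PAE entries are copied into slots 0..3. The rest of the per-vcpu L3 stays zero, so whatever the guest leaves in entries 4..511 of its own page is never looked at. The flag handling is an assumption: PAE top-level entries treat the RW/US bits as reserved, whereas long-mode L3 entries give them meaning, so the sketch sets them and leaves access control to the lower levels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PTES_PER_PAGE  512u
#define PAE_L3_ENTRIES 4u
#define PTE_PRESENT    0x1ull
#define PTE_RW         0x2ull
#define PTE_USER       0x4ull

typedef uint64_t pte_t;

/* Hypothetical per-vcpu tables: one fixed L4 page and one fixed L3 page. */
struct vcpu_compat_tables {
    _Alignas(4096) pte_t l4[PTES_PER_PAGE];
    _Alignas(4096) pte_t l3[PTES_PER_PAGE];  /* only slots 0..3 are filled */
};

/* One-time setup. 'l3_phys' is the (assumed) machine address of t->l3. */
static void compat_tables_init(struct vcpu_compat_tables *t, uint64_t l3_phys)
{
    memset(t, 0, sizeof *t);
    t->l4[0] = l3_phys | PTE_PRESENT | PTE_RW | PTE_USER;
}

/* On every guest CR3 load: copy the four PAE entries into slots 0..3.
 * Non-present entries become zero; present ones get RW/US set (assumption). */
static void compat_cr3_load(struct vcpu_compat_tables *t,
                            const pte_t guest_l3[PAE_L3_ENTRIES])
{
    for (unsigned i = 0; i < PAE_L3_ENTRIES; i++) {
        pte_t e = guest_l3[i];
        t->l3[i] = (e & PTE_PRESENT) ? (e | PTE_RW | PTE_USER) : 0;
    }
}
```

The design trade-off is the one argued below: it keeps the guest's page contents out of validation entirely, at the price of a copy on each CR3 load and a separate code path for compat guests.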
We've thought of removing the page-size restriction on PAE L3's in the
past, but it's pretty low down the priority list as it typically doesn't
cost a great deal of memory.
Ian
* RE: PAE issue (32-on-64 work)
From: Jan Beulich @ 2006-10-19 11:18 UTC
To: Ian Pratt; +Cc: xen-devel
>Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3's
>get copied on every cr3 load?
>It's most analogous to what happens today.
Only in the shadowing (PAE, 32-bit) case (a code path that, as I said, I'd
rather see ripped out). In the general 64-bit case, this would add another
(needless) distinct code path. I still prefer the idea of clearing out the
final 508 entries.
>We've thought of removing the page-size restriction on PAE L3's in the
>past, but it's pretty low down the priority list as it typically doesn't
>cost a great deal of memory.
Ah. I would have felt differently.
Jan
* RE: PAE issue (32-on-64 work)
From: Ian Pratt @ 2006-10-19 11:34 UTC
To: Jan Beulich; +Cc: xen-devel
> >Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3's
> >get copied on every cr3 load?
> >It's most analogous to what happens today.
>
> In the shadowing (PAE, 32bit) case (a code path that, as I said, I'd rather
> see ripped out).
Why? It's essential to allow PAE PGDs to live above 4GB, which is a PITA
otherwise.
> In the general 64-bit case, this would add another
> (needless) distinct code path. I think I still like better the idea of
> clearing out the final 508 entries.
>
> >We've thought of removing the page-size restriction on PAE L3's in the
> >past, but it's pretty low down the priority list as it typically doesn't
> >cost a great deal of memory.
>
> Ah. I would have felt differently.
Most machines probably have only around a hundred processes (kernel threads,
and threads in general, share their process's page tables and can be
excluded), hence maybe a few hundred KB wasted, tops.
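The estimate can be checked with a back-of-the-envelope calculation, assuming 4 KiB pages and 8-byte PAE entries (standard on x86) and taking the figure of about a hundred processes as given:

```latex
\begin{aligned}
\text{bytes used per PAE top level} &= 4 \times 8 = 32 \\
\text{bytes wasted per process} &= 4096 - 32 = 4064 \\
\text{total for } N = 100 &: \; 100 \times 4064 = 406\,400\ \text{bytes} \approx 397\ \text{KiB}
\end{aligned}
```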
If we did remove the size restriction, we'd still want to put them in
their own slab cache rather than the general 32-byte cache, as you don't
want them sharing a page with other non-PGD data. This is a PITA that
mandates how we handle shadowing of PAE PGDs in the HVM case, where we
can't control what they're allocated alongside.
Ian
* Re: PAE issue (32-on-64 work)
From: Keir Fraser @ 2006-10-19 12:58 UTC
To: Jan Beulich, Ian Pratt; +Cc: xen-devel
On 19/10/06 12:18, "Jan Beulich" <jbeulich@novell.com> wrote:
>> Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3's
>> get copied on every cr3 load?
>> It's most analogous to what happens today.
>
> In the shadowing (PAE, 32bit) case (a code path that, as I said, I'd rather
> see ripped out). In the general 64-bit case, this would add another
> (needless) distinct code path. I think I still like better the idea of
> clearing out the final 508 entries.
If we allowed non-pae-aligned L3s then you'd have no choice but to shadow
anyway, as that would be the only way to make the guest mappings appear at
the correct place in the 64-bit address space.
-- Keir
>> We've thought of removing the page-size restriction on PAE L3's in the
>> past, but it's pretty low down the priority list as it typically doesn't
>> cost a great deal of memory.
>
> Ah. I would have felt different.
* Re: PAE issue (32-on-64 work)
From: Jan Beulich @ 2006-10-19 14:24 UTC
To: Keir Fraser; +Cc: Ian Pratt, xen-devel
>If we allowed non-pae-aligned L3s then you'd have no choice but to shadow
>anyway, as that would be the only way to make the guest mappings appear at
>the correct place in the 64-bit address space.
It's not really shadowing, since there is no need to monitor changes. I'd
therefore rather call it snapshotting.
Jan
* Re: PAE issue (32-on-64 work)
From: Keir Fraser @ 2006-10-19 14:26 UTC
To: Jan Beulich; +Cc: Ian Pratt, xen-devel
On 19/10/06 15:24, "Jan Beulich" <jbeulich@novell.com> wrote:
>> If we allowed non-pae-aligned L3s then you'd have no choice but to shadow
>> anyway, as that would be the only way to make the guest mappings appear at
>> the correct place in the 64-bit address space.
>
> It's not really shadowing, since there is no need to monitor changes. I'd
> therefore rather call it snapshotting.
But you do need to track updates, right? What if the guest zaps an L3 entry
in an in-use PGD table and then flushes TLBs?
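The question can be made concrete with a small sketch of the snapshot approach (hypothetical names, single vcpu): the copy is taken on CR3 load, and it must also be refreshed on a TLB flush, otherwise an entry the guest has zapped in its in-use table stays live. With several vcpus sharing the same PGD, each of them would need a refresh, which this sketch does not model.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAE_L3_ENTRIES 4u
typedef uint64_t pae_pte_t;

/* Hypothetical per-vcpu state for the snapshot approach. */
struct vcpu_l3_snapshot {
    const pae_pte_t *guest_l3;        /* guest's current top-level table */
    pae_pte_t snap[PAE_L3_ENTRIES];   /* copy actually used for translation */
};

static void refresh(struct vcpu_l3_snapshot *v)
{
    memcpy(v->snap, v->guest_l3, sizeof v->snap);
}

/* Guest loads CR3: switch to the new table and take a snapshot. */
static void on_cr3_load(struct vcpu_l3_snapshot *v, const pae_pte_t *guest_l3)
{
    v->guest_l3 = guest_l3;
    refresh(v);
}

/* Guest flushes TLBs without changing CR3. Without this refresh, an entry
 * zapped in the in-use table would stay live in 'snap'. */
static void on_tlb_flush(struct vcpu_l3_snapshot *v)
{
    refresh(v);
}
```

So "snapshotting" still has to hook the flush path; the difference from full shadowing is that individual writes need not be trapped, only the points where the guest asks for its changes to take effect.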
-- Keir
* Re: PAE issue (32-on-64 work)
From: Keir Fraser @ 2006-10-19 12:56 UTC
To: Jan Beulich, xen-devel
On 19/10/06 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:
> Just now I found that there is a resulting issue for the 32on64 work I'm
> doing: Since none of the entries 4...511 of the PMD get initialized in Linux,
> and since Xen nevertheless has to validate all 512 entries (in order to
> avoid making available translations that could be used during speculative
> execution), the validation has the potential to fail (and does in reality),
> resulting in the guest dying. The only option I presently see is to special
> case the compatibility guest in the l3 handling and (I really hate to do
> that) clear out the 508 supposedly unused entries (or at least clear
> their present bits), meaning that no guest may ever make clever
> assumptions and try to store some other data in the unused portion of
> the pgd page.
Either copy the PGDs out into a shadow L3, as we do for PAE Xen today. Or,
as you say, zap the 508 unused entries. No guest uses them -- I'm pretty
sure Linux is the only PAE-capable guest (others are non-pae or 64-bit).
Storing other stuff in the page would be inconvenient anyway since it has to
be read-only.
-- Keir
* Re: PAE issue (32-on-64 work)
From: Bruce Rogers @ 2006-10-19 13:18 UTC
To: Keir Fraser, xen-devel, Jan Beulich
NetWare is also a PAE guest, but doesn't put anything in the rest of the page, so
zapping would be fine for NetWare.
- Bruce
>>> On 10/19/2006 at 6:56 AM, in message <C15D34A3.2CB1%Keir.Fraser@cl.cam.ac.uk>,
Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> On 19/10/06 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> Just now I found that there is a resulting issue for the 32on64 work I'm
>> doing: Since none of the entries 4...511 of the PMD get initialized in Linux,
>> and since Xen nevertheless has to validate all 512 entries (in order to
>> avoid making available translations that could be used during speculative
>> execution), the validation has the potential to fail (and does in reality),
>> resulting in the guest dying. The only option I presently see is to special
>> case the compatibility guest in the l3 handling and (I really hate to do
>> that) clear out the 508 supposedly unused entries (or at least clear
>> their present bits), meaning that no guest may ever make clever
>> assumptions and try to store some other data in the unused portion of
>> the pgd page.
>
> Either copy the PGDs out into a shadow L3, as we do for PAE Xen today. Or,
> as you say, zap the 508 unused entries. No guest uses them -- I'm pretty
> sure Linux is the only PAE-capable guest (others are non-pae or 64-bit).
> Storing other stuff in the page would be inconvenient anyway since it has to
> be read-only.
>
> -- Keir
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel