* regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
@ 2012-02-17 15:50 Jan Beulich
2012-02-17 16:23 ` Tim Deegan
0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2012-02-17 15:50 UTC (permalink / raw)
To: Tim Deegan; +Cc: xen-devel
Tim,
this c/s or-s PFEC_reserved_bit into the error code passed to the guest
in two places, and I'm afraid it is responsible for problems we're seeing
when a Linux HVM PAE guest on a non-HAP platform runs into
invocations of pgtable_bad(). Linux check the reserved bit before
looking at the present bit, and expects the reserved bit to always be
clear when the present bit also is. The respective lines from the
guest's kernel log however are
Bad pagetable: 000c [#1] SMP
Bad pagetable: 000e [#2] SMP
Bad pagetable: 000e [#3] SMP
Bad pagetable: 000c [#4] SMP
Bad pagetable: 000c [#5] SMP
Bad pagetable: 000c [#6] SMP
Bad pagetable: 000a [#7] SMP
Bad pagetable: 000a [#8] SMP
Bad pagetable: 000c [#10] SMP
Bad pagetable: 000c [#11] SMP
(i.e. reserved flag always set, present flag always clear) with the
corresponding page table dumps being
*pdpt = 0000000035682001 *pde = 0000000013014067 *pte = 002dea0000000000
*pdpt = 0000000035d65001 *pde = 0000000028a8a067 *pte = 002dd16000000000
*pdpt = 0000000013177001 *pde = 0000000028bf9067 *pte = 002dd16000000000
*pdpt = 0000000035480001 *pde = 000000005e6b1067
*pdpt = 00000000133b6001 *pde = 000000005e6c1067
*pdpt = 00000000131b7001 *pde = 000000005de15067
*pdpt = 00000000354bd001 *pde = 00000000355ad067 *pte = 002dfca000000000
*pdpt = 000000002bd5c001 *pde = 000000002bf2c067 *pte = 002dfa6000000000
*pdpt = 000000002bd79001 *pde = 0000000000000000
*pdpt = 000000002bdc6001 *pde = 000000005e63c067
*pdpt = 00000000130ef001 *pde = 000000005e690067
(i.e. no invalid bits at all, but paged out entries with page indexes
into the swap file indicating a swap size of over 10G - with anything
up to 4G, the reserved bits would never be run into).
The problem appears to be that the or-ing in of PFEC_reserved_bit
happens without consideration of PFEC_present. If you can confirm
I'm not mis-interpreting things, fixing this should be relatively
strait forward (though the question would be whether it should be
guest_walk_tables() or its callers that should be fixed).
Jan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
2012-02-17 15:50 regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)? Jan Beulich
@ 2012-02-17 16:23 ` Tim Deegan
2012-02-17 16:58 ` Jan Beulich
0 siblings, 1 reply; 6+ messages in thread
From: Tim Deegan @ 2012-02-17 16:23 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
[-- Attachment #1: Type: text/plain, Size: 2645 bytes --]
Hi,
At 15:50 +0000 on 17 Feb (1329493838), Jan Beulich wrote:
> this c/s or-s PFEC_reserved_bit into the error code passed to the guest
> in two places, and I'm afraid it is responsible for problems we're seeing
> when a Linux HVM PAE guest on a non-HAP platform runs into
> invocations of pgtable_bad(). Linux check the reserved bit before
> looking at the present bit, and expects the reserved bit to always be
> clear when the present bit also is. The respective lines from the
> guest's kernel log however are
>
> Bad pagetable: 000c [#1] SMP
> Bad pagetable: 000e [#2] SMP
> Bad pagetable: 000e [#3] SMP
> Bad pagetable: 000c [#4] SMP
> Bad pagetable: 000c [#5] SMP
> Bad pagetable: 000c [#6] SMP
> Bad pagetable: 000a [#7] SMP
> Bad pagetable: 000a [#8] SMP
> Bad pagetable: 000c [#10] SMP
> Bad pagetable: 000c [#11] SMP
>
> (i.e. reserved flag always set, present flag always clear) with the
> corresponding page table dumps being
>
> *pdpt = 0000000035682001 *pde = 0000000013014067 *pte = 002dea0000000000
> *pdpt = 0000000035d65001 *pde = 0000000028a8a067 *pte = 002dd16000000000
> *pdpt = 0000000013177001 *pde = 0000000028bf9067 *pte = 002dd16000000000
> *pdpt = 0000000035480001 *pde = 000000005e6b1067
> *pdpt = 00000000133b6001 *pde = 000000005e6c1067
> *pdpt = 00000000131b7001 *pde = 000000005de15067
> *pdpt = 00000000354bd001 *pde = 00000000355ad067 *pte = 002dfca000000000
> *pdpt = 000000002bd5c001 *pde = 000000002bf2c067 *pte = 002dfa6000000000
> *pdpt = 000000002bd79001 *pde = 0000000000000000
> *pdpt = 000000002bdc6001 *pde = 000000005e63c067
> *pdpt = 00000000130ef001 *pde = 000000005e690067
>
> (i.e. no invalid bits at all, but paged out entries with page indexes
> into the swap file indicating a swap size of over 10G - with anything
> up to 4G, the reserved bits would never be run into).
Right - so the problem is that the walker is choking on the reserved
bits even though the present bit it clear in that particular PTE?
Yeah, that's a bug.
> The problem appears to be that the or-ing in of PFEC_reserved_bit
> happens without consideration of PFEC_present. If you can confirm
> I'm not mis-interpreting things, fixing this should be relatively
> strait forward (though the question would be whether it should be
> guest_walk_tables() or its callers that should be fixed).
We should fix it in guest_walk_tables, since AFAICS it's possible for
PFEC_reserved_bit to be set based on a bad higher-level entry even if a
lower-level one has _PAGE_PRESENT clear.
Something like the attached (compile-tested only) patch?
(Sigh; this PT walker was once relatively readable and efficient.)
Cheers,
Tim.
[-- Attachment #2: pfec --]
[-- Type: text/plain, Size: 2266 bytes --]
# HG changeset patch
# Parent 87218bd367befca7d3488ba1cf4feb2b10d5f14e
x86/mm: Don't check for invalid bits in non-present PTEs.
If _PAGE_PRESENT is clean in a pagetable entry, any pattern of bits
is valid in the rest of the entry. OSes that special-case
PFEC_invalid_bits (since it should never happen) will be confused
by our setting it in this way.
Signed-off-by: Tim Deegan <tim@xen.org>
diff -r 87218bd367be -r 05a3b346f1c3 xen/arch/x86/mm/guest_walk.c
--- a/xen/arch/x86/mm/guest_walk.c Fri Feb 17 12:24:38 2012 +0000
+++ b/xen/arch/x86/mm/guest_walk.c Fri Feb 17 16:14:19 2012 +0000
@@ -179,8 +179,11 @@ guest_walk_tables(struct vcpu *v, struct
l4p = (guest_l4e_t *) top_map;
gw->l4e = l4p[guest_l4_table_offset(va)];
gflags = guest_l4e_get_flags(gw->l4e) ^ iflags;
+ if ( !(gflags & _PAGE_PRESENT) ) {
+ rc |= _PAGE_PRESENT;
+ goto out;
+ }
rc |= ((gflags & mflags) ^ mflags);
- if ( rc & _PAGE_PRESENT ) goto out;
/* Map the l3 table */
l3p = map_domain_gfn(p2m,
@@ -193,9 +196,11 @@ guest_walk_tables(struct vcpu *v, struct
/* Get the l3e and check its flags*/
gw->l3e = l3p[guest_l3_table_offset(va)];
gflags = guest_l3e_get_flags(gw->l3e) ^ iflags;
+ if ( !(gflags & _PAGE_PRESENT) ) {
+ rc |= _PAGE_PRESENT;
+ goto out;
+ }
rc |= ((gflags & mflags) ^ mflags);
- if ( rc & _PAGE_PRESENT )
- goto out;
pse1G = (gflags & _PAGE_PSE) && guest_supports_1G_superpages(v);
@@ -261,9 +266,11 @@ guest_walk_tables(struct vcpu *v, struct
#endif /* All levels... */
gflags = guest_l2e_get_flags(gw->l2e) ^ iflags;
+ if ( !(gflags & _PAGE_PRESENT) ) {
+ rc |= _PAGE_PRESENT;
+ goto out;
+ }
rc |= ((gflags & mflags) ^ mflags);
- if ( rc & _PAGE_PRESENT )
- goto out;
pse2M = (gflags & _PAGE_PSE) && guest_supports_superpages(v);
@@ -321,6 +328,10 @@ guest_walk_tables(struct vcpu *v, struct
goto out;
gw->l1e = l1p[guest_l1_table_offset(va)];
gflags = guest_l1e_get_flags(gw->l1e) ^ iflags;
+ if ( !(gflags & _PAGE_PRESENT) ) {
+ rc |= _PAGE_PRESENT;
+ goto out;
+ }
rc |= ((gflags & mflags) ^ mflags);
}
[-- Attachment #3: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
2012-02-17 16:23 ` Tim Deegan
@ 2012-02-17 16:58 ` Jan Beulich
2012-02-23 10:33 ` Tim Deegan
0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2012-02-17 16:58 UTC (permalink / raw)
To: Tim Deegan; +Cc: xen-devel
>>> On 17.02.12 at 17:23, Tim Deegan <tim@xen.org> wrote:
> At 15:50 +0000 on 17 Feb (1329493838), Jan Beulich wrote:
>> The problem appears to be that the or-ing in of PFEC_reserved_bit
>> happens without consideration of PFEC_present. If you can confirm
>> I'm not mis-interpreting things, fixing this should be relatively
>> strait forward (though the question would be whether it should be
>> guest_walk_tables() or its callers that should be fixed).
>
> We should fix it in guest_walk_tables, since AFAICS it's possible for
> PFEC_reserved_bit to be set based on a bad higher-level entry even if a
> lower-level one has _PAGE_PRESENT clear.
>
> Something like the attached (compile-tested only) patch?
That was really fast, thanks!
Yes, that looks like it should do it. We'll want to give it a try early
next week.
> (Sigh; this PT walker was once relatively readable and efficient.)
Sorry for that ;-) (I wonder though whether this whole evaluation
of the flags couldn't be put in a macro, as it's - apart from the level
number - the same in all four places afaics.
Jan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
2012-02-17 16:58 ` Jan Beulich
@ 2012-02-23 10:33 ` Tim Deegan
2012-02-23 14:34 ` Jan Beulich
0 siblings, 1 reply; 6+ messages in thread
From: Tim Deegan @ 2012-02-23 10:33 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
At 16:58 +0000 on 17 Feb (1329497937), Jan Beulich wrote:
> >>> On 17.02.12 at 17:23, Tim Deegan <tim@xen.org> wrote:
> > At 15:50 +0000 on 17 Feb (1329493838), Jan Beulich wrote:
> >> The problem appears to be that the or-ing in of PFEC_reserved_bit
> >> happens without consideration of PFEC_present. If you can confirm
> >> I'm not mis-interpreting things, fixing this should be relatively
> >> strait forward (though the question would be whether it should be
> >> guest_walk_tables() or its callers that should be fixed).
> >
> > We should fix it in guest_walk_tables, since AFAICS it's possible for
> > PFEC_reserved_bit to be set based on a bad higher-level entry even if a
> > lower-level one has _PAGE_PRESENT clear.
> >
> > Something like the attached (compile-tested only) patch?
>
> That was really fast, thanks!
>
> Yes, that looks like it should do it. We'll want to give it a try early
> next week.
I've applied it, since it seemed not to break anything and I'll be away
from my test boxes for a while. Let me know of you're still seeing any
problems.
Cheers,
Tim.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
2012-02-23 10:33 ` Tim Deegan
@ 2012-02-23 14:34 ` Jan Beulich
2012-02-23 14:59 ` Tim Deegan
0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2012-02-23 14:34 UTC (permalink / raw)
To: tim; +Cc: xen-devel
>>> Tim Deegan <tim@xen.org> 02/23/12 11:34 AM >>>
>I've applied it, since it seemed not to break anything and I'll be away
>from my test boxes for a while. Let me know of you're still seeing any
>problems.
Thanks - I had hoped we would have reported testing results earlier, but
this unfortunately is going pretty slowly.
One thing though - shouldn't further walking be suppressed if at a
given level any violation is detected (i.e. if rc became non-zero)? For
sure the hardware aborts a walk at least if reserved bits are found
set, and I'd suspect it also doesn't bother continuing the walk if access
rights don't permit the result to be used.
Jan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
2012-02-23 14:34 ` Jan Beulich
@ 2012-02-23 14:59 ` Tim Deegan
0 siblings, 0 replies; 6+ messages in thread
From: Tim Deegan @ 2012-02-23 14:59 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
At 14:34 +0000 on 23 Feb (1330007656), Jan Beulich wrote:
> >>> Tim Deegan <tim@xen.org> 02/23/12 11:34 AM >>>
> >I've applied it, since it seemed not to break anything and I'll be away
> >from my test boxes for a while. Let me know of you're still seeing any
> >problems.
>
> Thanks - I had hoped we would have reported testing results earlier, but
> this unfortunately is going pretty slowly.
>
> One thing though - shouldn't further walking be suppressed if at a
> given level any violation is detected (i.e. if rc became non-zero)? For
> sure the hardware aborts a walk at least if reserved bits are found
> set, and I'd suspect it also doesn't bother continuing the walk if access
> rights don't permit the result to be used.
That sounds plausible, but I recall there being some subtleties about
that when we first looked at it and I haven't time to dig up what they
were right now. I may be misremembering but I don't want to tinker with
it without being sure. I'll look into it again when I get a minute.
Tim.
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2012-02-23 14:59 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-02-17 15:50 regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)? Jan Beulich
2012-02-17 16:23 ` Tim Deegan
2012-02-17 16:58 ` Jan Beulich
2012-02-23 10:33 ` Tim Deegan
2012-02-23 14:34 ` Jan Beulich
2012-02-23 14:59 ` Tim Deegan
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).