xen-devel.lists.xenproject.org archive mirror
* PoD code killing domain before it really gets started
@ 2012-07-26 14:41 Jan Beulich
  2012-07-26 15:37 ` Jan Beulich
  2012-07-26 16:14 ` George Dunlap
  0 siblings, 2 replies; 22+ messages in thread
From: Jan Beulich @ 2012-07-26 14:41 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel

George,

in the hope that you might have some insight, or might remember
that something like this was reported before (and ideally fixed),
I'll try to describe a problem a customer of ours reported.
Unfortunately this is with Xen 4.0.x (plus numerous backports),
and it is not known whether the same issue exists on 4.1.x or
-unstable.

For a domain with maxmem=16000M and memory=3200M, what
gets logged is

(XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
(XEN) domain_crash called from p2m.c:1150
(XEN) Domain 3 reported crashed by domain 0 on cpu#6:
(XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
(XEN) domain_crash called from p2m.c:1150

Translated to hex, the numbers are 1e0 and 36000. The latter
varies across the (rather infrequent) cases where this happens
(but was always a multiple of 1000 - see below), and immediate
retries to create the affected domain have always succeeded so
far (i.e. the failure is definitely not due to a lack of free
memory).

Given that the memory= target hadn't been reached yet, I would
conclude that this happens in the middle of (4.0.x file name used
here) tools/libxc/xc_hvm_build.c:setup_guest()'s main physmap
population code. However, the way I read the code there, I
would think that the sequence of population should be (using
hex GFNs) 0...9f, c0...7ff, 800-fff, 1000-17ff, etc. That,
however, appears to be inconsistent with the logged numbers
above - tot_pages should always be at least 7e0 (low 2Mb less
the VGA hole), especially when pod_entries is divisible by 800
(the increment by which large page population happens).
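
For reference, here is roughly how I read that population loop (a
toy paraphrase only, not the actual libxc code; populate() below
merely stands in for the populate-physmap hypercall wrappers):

#include <stdio.h>

/* Stand-in for the real populate-physmap call. */
static void populate(unsigned long gfn, unsigned long count, unsigned int order)
{
    printf("populate gfn %#lx count %#lx order %u\n", gfn, count, order);
}

static void physmap_order(unsigned long nr_pages)
{
    unsigned long cur_pages;

    populate(0x00, 0xa0, 0);         /* GFNs 0...9f, 4k pages */
    populate(0xc0, 0x800 - 0xc0, 0); /* GFNs c0...7ff, 4k pages, VGA hole skipped */
    cur_pages = 0x800;               /* tot_pages should now already be 7e0 */

    while ( cur_pages < nr_pages )
    {
        unsigned long count = nr_pages - cur_pages;

        if ( count > 0x800 )            /* clipped to 8Mb per iteration */
            count = 0x800;
        populate(cur_pages, count, 9);  /* 2Mb (order 9) extents */
        cur_pages += count;
    }
}

int main(void)
{
    physmap_order(0xc8000);             /* memory=3200M => 0xc8000 4k pages */
    return 0;
}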

As a result of this apparent inconsistency I can't really
conclude anything from the logged numbers.

The main question, irrespective of any numbers, of course is:
How would p2m_pod_demand_populate() be invoked at all
during this early phase of domain construction? Nothing
should be touching any of the memory... If this nevertheless
is possible (even if just for a single page), then perhaps the
tools ought to make sure the pages put into the low 2Mb get
actually zeroed, so the PoD code has a chance to find victim
pages.

Thanks for any thoughts or pointers, Jan


* Re: PoD code killing domain before it really gets started
  2012-07-26 14:41 PoD code killing domain before it really gets started Jan Beulich
@ 2012-07-26 15:37 ` Jan Beulich
  2012-07-26 16:14 ` George Dunlap
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2012-07-26 15:37 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel

>>> On 26.07.12 at 16:41, "Jan Beulich" <JBeulich@suse.com> wrote:
> The main question, irrespective of any numbers, of course is:
> How would p2m_pod_demand_populate() be invoked at all
> during this early phase of domain construction? Nothing
> should be touching any of the memory... If this nevertheless
> is possible (even if just for a single page), then perhaps the
> tools ought to make sure the pages put into the low 2Mb get
> actually zeroed, so the PoD code has a chance to find victim
> pages.

One more point of inconsistency: According to xend.log, the
device model had already been launched at the point of the guest's
death, yet the two operations (call trees ending in
pyxc_hvm_build() and ImageHandler.createDomain(), the log
messages of which are present in the logs) are both rooted in
XendDomainInfo._initDomain(), and are hence sequential aiui
(i.e. the physmap population should already have finished).

Of course that's unless xend happily continues acting on a
crashed guest (which would explain why there are two
instances of the PoD related messages in the hypervisor log).

So I'm getting all the more confused the deeper I look into this.

Jan


* Re: PoD code killing domain before it really gets started
  2012-07-26 14:41 PoD code killing domain before it really gets started Jan Beulich
  2012-07-26 15:37 ` Jan Beulich
@ 2012-07-26 16:14 ` George Dunlap
  2012-07-27  6:45   ` Jan Beulich
  2012-08-06 13:57   ` Jan Beulich
  1 sibling, 2 replies; 22+ messages in thread
From: George Dunlap @ 2012-07-26 16:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 26/07/12 15:41, Jan Beulich wrote:
> George,
>
> in the hope that you might have some insight, or might be
> remembering that something like this was reported before (and
> ideally fixed), I'll try to describe a problem a customer of ours
> reported. Unfortunately this is with Xen 4.0.x (plus numerous
> backports), and it is not known whether the same issue exists
> on 4.1.x or -unstable.
>
> For a domain with maxmem=16000M and memory=3200M, what
> gets logged is
>
> (XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
> (XEN) domain_crash called from p2m.c:1150
> (XEN) Domain 3 reported crashed by domain 0 on cpu#6:
> (XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
> (XEN) domain_crash called from p2m.c:1150
>
> Translated to hex, the numbers are 1e0 and 36000. The latter
> one varies across the (rather infrequent) cases where this
> happens (but was always a multiple of 1000 - see below), and
> instant retries to create the affected domain did always succeed
> so far (i.e. the failure is definitely not because of a lack of free
> memory).
>
> Given that the memory= target wasn't reached, yet, I would
> conclude that this happens in the middle of (4.0.x file name used
> here) tools/libxc/xc_hvm_build.c:setup_guest()'s main physmap
> population code. However, the way I read the code there, I
> would think that the sequence of population should be (using
> hex GFNs) 0...9f, c0...7ff, 800-fff, 1000-17ff, etc. That,
> however appears to be inconsistent with the logged numbers
> above - tot_pages should always be at least 7e0 (low 2Mb less
> the VGA hole), especially when pod_entries is divisible by 800
> (the increment by which large page population happens).
>
> As a result of this apparent inconsistency I can't really
> conclude anything from the logged numbers.
>
> The main question, irrespective of any numbers, of course is:
> How would p2m_pod_demand_populate() be invoked at all
> during this early phase of domain construction? Nothing
> should be touching any of the memory... If this nevertheless
> is possible (even if just for a single page), then perhaps the
> tools ought to make sure the pages put into the low 2Mb get
> actually zeroed, so the PoD code has a chance to find victim
> pages.
Yes, this is a very strange circumstance: because p2m_demand_populate() 
shouldn't happen until at least one PoD entry has been created; and that 
shouldn't happen until after c0...7ff have been populated with 4k pages.

Although, it does look as though when populating 4k pages, the code 
doesn't actually look to see if the allocation succeeded or not... oh 
wait, no, it actually checks rc as a condition of the while() loop -- 
but that is then clobbered by the xc_domain_set_pod_target() call.  But 
surely if the 4k allocation failed, the set_target() call should fail as 
well?  And in any case, there shouldn't yet be any PoD entries to cause 
a demand-populate.

We probably should change "if(pod_mode)" to "if(rc == 0 && pod_mode)" or 
something like that, just to be sure.  I'll spin up a patch.
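
To make the clobbering explicit, something along these lines
(sketched by hand here, not the actual patch; the set-target
wrapper's arguments left untouched):

     while ( (rc == 0) && (nr_pages > cur_pages) )
     {
         /* ... 4k and 2Mb population, sets rc ... */
     }
 ...
-    if ( pod_mode )
+    if ( (rc == 0) && pod_mode )
         rc = xc_domain_set_pod_target(/* unchanged arguments */);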

I think what I would try to do is to add a stack trace to the 
demand_populate() failure path, so you can see where the call came from; 
i.e., if it came from a guest access, or from someone in dom0 writing to 
some of the memory. I'd also add a printk to set_pod_target(), so you 
can see if it was actually called and what it was set to.

  -George


* Re: PoD code killing domain before it really gets started
  2012-07-26 16:14 ` George Dunlap
@ 2012-07-27  6:45   ` Jan Beulich
  2012-08-06 13:57   ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2012-07-27  6:45 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel

>>> On 26.07.12 at 18:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> I think what I would try to do is to add a stack trace to the 
> demand_populate() failure path, so you can see where the call came from; 
> i.e., if it came from a guest access, or from someone in dom0 writing to 
> some of the memory. I'd also add a printk to set_pod_target(), so you 
> can see if it was actually called and what it was set to.

Yes, that sounds very similar to the plans I had in case you don't
recall any respective fixes or have any other immediate idea
about the problem. Although guest accesses can be excluded here
afaict, given that the guest didn't even get started yet. But
indeed, with nothing really fitting together, we really need to
make sure...

Meanwhile I also got confirmation that this can happen straight
after boot, which makes it all the more questionable - almost all
pages should hold only zeros, and hence the PoD code should be
able to find victim pages among the populated ones, yet it
obviously doesn't.

Jan


* Re: PoD code killing domain before it really gets started
  2012-07-26 16:14 ` George Dunlap
  2012-07-27  6:45   ` Jan Beulich
@ 2012-08-06 13:57   ` Jan Beulich
  2012-08-06 14:12     ` Jan Beulich
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2012-08-06 13:57 UTC (permalink / raw)
  To: Ian Campbell, George Dunlap, Ian Jackson; +Cc: xen-devel

>>> On 26.07.12 at 18:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> Yes, this is a very strange circumstance: because p2m_demand_populate() 
> shouldn't happen until at least one PoD entry has been created; and that 
> shouldn't happen until after c0...7ff have been populated with 4k pages.

So meanwhile I was told that this very likely is caused by an access
originating in Dom0. Just a few minutes ago I also got hold of call
stacks (as was already seen with the original messages, it
produces two instances per bad access):

...
(XEN) gpmpod(1, 8000, 9) -> 0 [dom0]
(XEN) gpmpod(1, 8200, 9) -> 0 [dom0]

[coming from

printk("gpmpod(%d, %lx, %u) -> %d [dom%d]\n", d->domain_id, gfn, order, rc, current->domain->domain_id);

at the end of guest_physmap_mark_populate_on_demand()]

(XEN) p2m_pod_demand_populate: Dom1 out of PoD memory! (tot=1e0 ents=8200 dom0)

[altered message at the failure point in p2m_pod_demand_populate():

-    printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
-           __func__, d->tot_pages, p2md->pod.entry_count);
+    printk("%s: Dom%d out of PoD memory! (tot=%"PRIx32" ents=%"PRIx32" dom%d)\n",
+           __func__, d->domain_id, d->tot_pages, p2md->pod.entry_count, current->domain->domain_id);
+WARN_ON(1);

]

(XEN) Xen WARN at p2m.c:1155
(XEN) ----[ Xen-4.0.3_21548_04a-0.9.1  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
...
(XEN) Xen call trace:
(XEN)    [<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
(XEN)    [<ffff82c4801676b1>] get_page_and_type_from_pagenr+0x91/0x100
(XEN)    [<ffff82c4801f02d4>] ept_pod_check_and_populate+0x104/0x1a0
(XEN)    [<ffff82c4801f0482>] ept_get_entry+0x112/0x230
(XEN)    [<ffff82c48016be98>] do_mmu_update+0x16d8/0x1930
(XEN)    [<ffff82c4801f8c51>] do_iret+0xc1/0x1a0
(XEN)    [<ffff82c4801f4189>] syscall_enter+0xa9/0xae
(XEN)
(XEN) domain_crash called from p2m.c:1156
(XEN) Domain 1 reported crashed by domain 0 on cpu#2:
(XEN) p2m_pod_demand_populate: Dom1 out of PoD memory! (tot=1e0 ents=8200 dom0)
(XEN) Xen WARN at p2m.c:1155
(XEN) ----[ Xen-4.0.3_21548_04a-0.9.1  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
...
(XEN) Xen call trace:
(XEN)    [<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
(XEN)    [<ffff82c480108733>] send_guest_global_virq+0x93/0xe0
(XEN)    [<ffff82c4801cbfb2>] p2m_pod_demand_populate+0x862/0xab0
(XEN)    [<ffff82c4801f02d4>] ept_pod_check_and_populate+0x104/0x1a0
(XEN)    [<ffff82c4801f0482>] ept_get_entry+0x112/0x230
(XEN)    [<ffff82c48016890b>] mod_l1_entry+0x47b/0x650
(XEN)    [<ffff82c4801f0482>] ept_get_entry+0x112/0x230
(XEN)    [<ffff82c48016b21a>] do_mmu_update+0xa5a/0x1930
(XEN)    [<ffff82c4801f8c51>] do_iret+0xc1/0x1a0
(XEN)    [<ffff82c4801f4189>] syscall_enter+0xa9/0xae
(XEN)
(XEN) domain_crash called from p2m.c:1156

This clarifies at least why there are two events (and, despite
the code having changed quite a bit, this appears to still be the
case for -unstable): the MMU_NORMAL_PT_UPDATE case, sub-case
PGT_l1_page_table, calls (in -unstable terms) get_page_from_gfn()
but ignores the return value (which ought to be NULL here), and
only partially inspects the returned type. As the type matches
none of the ones looked for, it happily proceeds into
mod_l1_entry(), which then calls get_page_from_gfn() again.

> Although, it does look as though when populating 4k pages, the code 
> doesn't actually look to see if the allocation succeeded or not... oh 
> wait, no, it actually checks rc as a condition of the while() loop -- 
> but that is then clobbered by the xc_domain_set_pod_target() call.  But 
> surely if the 4k allocation failed, the set_target() call should fail as 
> well?  And in any case, there shouldn't yet be any PoD entries to cause 
> a demand-populate.
> 
> We probably should change "if(pod_mode)" to "if(rc == 0 && pod_mode)" or 
> something like that, just to be sure.  I'll spin up a patch.

I had also included this adjustment in the debugging patch, but
this clearly isn't related to the problem.

The domain indeed has 0x1e0 pages allocated, and a huge (and
still growing) number of PoD entries. And apparently this fails so
rarely because it's pretty unlikely that there's not a single clear
page that the PoD code can select as victim, plus the Dom0
space code likely also only infrequently happens to kick in at
the wrong time.

So in the end it presumably boils down to deciding whether such
an out-of-band Dom0 access is valid (and I think it
is). If it is, then xc_hvm_build_x86.c:setup_guest() should
make sure any actually allocated pages (those coming from the
calls to xc_domain_populate_physmap_exact()) get cleared
when pod_mode is set.
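
Roughly what I have in mind (hand-written, -unstable names, and
signatures from memory - not a tested patch):

    /* After the 0...9f / c0...7ff populate_physmap calls succeeded: */
    if ( pod_mode )
    {
        for ( i = 0; i < 0x800; i++ )
        {
            void *p;

            if ( i >= 0xa0 && i < 0xc0 )    /* VGA hole */
                continue;
            p = xc_map_foreign_range(xch, dom, PAGE_SIZE,
                                     PROT_READ | PROT_WRITE, page_array[i]);
            if ( p == NULL )
                goto error_out;
            memset(p, 0, PAGE_SIZE);        /* guarantee zeroed victim pages */
            munmap(p, PAGE_SIZE);
        }
    }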

Otoh, as pointed out in an as yet unanswered mail (see
http://lists.xen.org/archives/html/xen-devel/2012-07/msg01331.html),
these allocations could/should - when pod_mode is set -
similarly be done with XENMEMF_populate_on_demand set.
In such a case, _any_ Dom0 access to guest memory prior
to the call to xc_domain_set_pod_target() would kill the
domain, as there is not even a single page to be looked at
as a possible victim. Consequently, I would think that the
guest shouldn't be killed unconditionally when a PoD
operation didn't succeed - in particular not when the
access was from a foreign (i.e. the management) domain.

Jan


* Re: PoD code killing domain before it really gets started
  2012-08-06 13:57   ` Jan Beulich
@ 2012-08-06 14:12     ` Jan Beulich
  2012-08-06 16:03       ` George Dunlap
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2012-08-06 14:12 UTC (permalink / raw)
  To: Ian Campbell, George Dunlap, Ian Jackson; +Cc: xen-devel

>>> On 06.08.12 at 15:57, "Jan Beulich" <JBeulich@suse.com> wrote:
> The domain indeed has 0x1e0 pages allocated, and a huge (still
> growing number) of PoD entries. And apparently this fails so
> rarely because it's pretty unlikely that there's not a single clear
> page that the PoD code can select as victim, plus the Dom0
> space code likely also only infrequently happens to kick in at
> the wrong time.

Just realized that of course it's also suspicious that there
shouldn't be any clear page among those 480 - Dom0 scrubs
its pages as it balloons them out (but I think ballooning isn't
even in use there), and Xen scrubs the free pages on boot, yet
this reportedly has also happened for the very first domain ever
created after boot. Or does the PoD code not touch the low
2Mb for some reason?

Jan


* Re: PoD code killing domain before it really gets started
  2012-08-06 14:12     ` Jan Beulich
@ 2012-08-06 16:03       ` George Dunlap
  2012-08-07  7:34         ` Jan Beulich
  2012-08-07 12:17         ` Jan Beulich
  0 siblings, 2 replies; 22+ messages in thread
From: George Dunlap @ 2012-08-06 16:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Ian Campbell, xen-devel

On Mon, Aug 6, 2012 at 3:12 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.08.12 at 15:57, "Jan Beulich" <JBeulich@suse.com> wrote:
>> The domain indeed has 0x1e0 pages allocated, and a huge (still
>> growing number) of PoD entries. And apparently this fails so
>> rarely because it's pretty unlikely that there's not a single clear
>> page that the PoD code can select as victim, plus the Dom0
>> space code likely also only infrequently happens to kick in at
>> the wrong time.
>
> Just realized that of course it's also suspicious that there
> shouldn't be any clear page among those 480 - Dom0 scrubs
> its pages at balloons them out (but I think ballooning isn't even
> in use there), Xen scrubs the free pages on boot, yet this
> reportedly has happened also for the very first domain ever
> created after boot. Or does the PoD code not touch the low
> 2Mb for some reason?

Hmm -- the sweep code has some fairly complicated heuristics.  Ah -- I
bet this is it: the algorithm implicitly assumes that the first sweep
will happen after the first demand-fault.  It's designed to start at
the last demand-faulted gpfn (tracked by p2m->pod.max_guest) and go
downwards.  When it reaches 0, it stops its sweep (?!), and resets to
max_guest on the next entry.  But if max_guest is 0, this means it
will basically never sweep at all.
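
In toy form (modelling just that heuristic, not the real
p2m_pod_emergency_sweep() code):

#include <stdio.h>

static unsigned long max_guest;      /* last demand-faulted gpfn - still 0 here */
static unsigned long reclaim_single; /* where the previous sweep stopped        */

static void emergency_sweep(void)
{
    unsigned long gfn;

    if ( reclaim_single == 0 )
        reclaim_single = max_guest;

    /* Stops at gfn 0 - so with max_guest == 0 nothing is ever inspected. */
    for ( gfn = reclaim_single; gfn > 0; gfn-- )
        printf("inspect gfn %#lx for a zeroed victim page\n", gfn);

    reclaim_single = gfn;
}

int main(void)
{
    emergency_sweep();   /* before any demand fault: no candidates at all */
    return 0;
}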

I guess there are two problems with that:
* As you've seen, apparently dom0 may access these pages before any
faults happen.
* If it happens that reclaim_single is below the only zeroed page, the
guest will crash even when there is reclaim-able memory available.

A few ways we could fix this:
1. Remove dom0 accesses (what on earth could be looking at a
not-yet-created VM?)
2. Allocate the PoD cache before populating the p2m table
3. Make it so that some accesses fail w/o crashing the guest?  I don't
see how that's really practical.
4. Change the sweep routine so that the lower 2MiB gets swept

#2 would require us to use all PoD entries when building the p2m
table, thus addressing the mail you mentioned from 25 July*.  Given
that you don't want #1, it seems like #2 is the best option.

No matter what we do, the sweep routine for 4.2 should be re-written
to search all of memory at least once (maybe with a timeout for
watchdogs), since it's only called in an actual emergency.

Let me take a look...

 -George

* Sorry for not responding to that one; I must have missed it in my
return-from-travelling e-mail sweep.  If you CC me next time I'll be
sure to get it.


* Re: PoD code killing domain before it really gets started
  2012-08-06 16:03       ` George Dunlap
@ 2012-08-07  7:34         ` Jan Beulich
  2012-08-07 10:00           ` Tim Deegan
  2012-08-07 10:20           ` George Dunlap
  2012-08-07 12:17         ` Jan Beulich
  1 sibling, 2 replies; 22+ messages in thread
From: Jan Beulich @ 2012-08-07  7:34 UTC (permalink / raw)
  To: George Dunlap, Tim Deegan; +Cc: Ian Jackson, Ian Campbell, xen-devel

>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> I guess there are two problems with that:
> * As you've seen, apparently dom0 may access these pages before any
> faults happen.
> * If it happens that reclaim_single is below the only zeroed page, the
> guest will crash even when there is reclaim-able memory available.
> 
> Two ways we could fix this:
> 1. Remove dom0 accesses (what on earth could be looking at a
> not-yet-created VM?)

I'm told it's a monitoring daemon, and yes, they are intending to
adjust it to first query the GFN's type (and not do the access
when it's not populated yet). But wait, I didn't check the code
when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3}
also call get_page_from_gfn() with P2M_ALLOC, so would also
trigger the PoD code (in -unstable at least) - Tim, was that really
a correct adjustment in 25355:974ad81bb68b? It looks to be a
1:1 translation, but is that really necessary? If one wanted to
find out whether a page is PoD to avoid getting it populated,
how would that be done from outside the hypervisor? Would
we need XEN_DOMCTL_getpageframeinfo4 for this?

> 2. Allocate the PoD cache before populating the p2m table
> 3. Make it so that some accesses fail w/o crashing the guest?  I don't
> see how that's really practical.

What's wrong with telling control tools that a certain page is
unpopulated (from which they will be able to infer that it's all
clear from the guest's pov)? Even outside of the current problem,
I would think that's more efficient than allocating the page. Of
course, the control tools need to be able to cope with that. And
it may also be necessary to distinguish between read and
read/write mappings being established (and for r/w ones the
option of populating at access time rather than at creation time
would need to be explored).

> 4. Change the sweep routine so that the lower 2MiB gets swept
> 
> #2 would require us to use all PoD entries when building the p2m
> table, thus addressing the mail you mentioned from 25 July*.  Given
> that you don't want #1, it seems like #2 is the best option.
> 
> No matter what we do, the sweep routine for 4.2 should be re-written
> to search all of memory at least once (maybe with a timeout for
> watchdogs), since it's only called in an actual emergency.
> 
> Let me take a look...

Thanks!

Jan


* Re: PoD code killing domain before it really gets started
  2012-08-07  7:34         ` Jan Beulich
@ 2012-08-07 10:00           ` Tim Deegan
  2012-08-07 10:32             ` George Dunlap
  2012-08-07 11:03             ` Jan Beulich
  2012-08-07 10:20           ` George Dunlap
  1 sibling, 2 replies; 22+ messages in thread
From: Tim Deegan @ 2012-08-07 10:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Ian Jackson, Ian Campbell, xen-devel

Hi,

At 08:34 +0100 on 07 Aug (1344328495), Jan Beulich wrote:
> >>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > I guess there are two problems with that:
> > * As you've seen, apparently dom0 may access these pages before any
> > faults happen.
> > * If it happens that reclaim_single is below the only zeroed page, the
> > guest will crash even when there is reclaim-able memory available.
> > 
> > Two ways we could fix this:
> > 1. Remove dom0 accesses (what on earth could be looking at a
> > not-yet-created VM?)
> 
> I'm told it's a monitoring daemon, and yes, they are intending to
> adjust it to first query the GFN's type (and don't do the access
> when it's not populated, yet). But wait, I didn't check the code
> when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3)
> also call get_page_from_gfn() with P2M_ALLOC, so would also
> trigger the PoD code (in -unstable at least) - Tim, was that really
> a correct adjustment in 25355:974ad81bb68b? It looks to be a
> 1:1 translation, but is that really necessary?

AFAICT 25355:974ad81bb68b doesn't change anything.  Back in 4.1-testing
the lookup was done with gmfn_to_mfn(), which boils down to a lookup
with p2m_alloc.

> If one wanted to find out whether a page is PoD to avoid getting it
> populated, how would that be done from outside the hypervisor? Would
> we need XEN_DOMCTL_getpageframeinfo4 for this?

We'd certainly need _some_ change to the hypercall interface, as there's
no XEN_DOMCTL_PFINFO_ rune for 'PoD', and presumably you'd want to know
the difference between PoD and not-present.
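
Something along these lines, I suppose (name and value purely
illustrative - nothing of the sort exists in domctl.h today):

/* Hypothetical rune: gfn is populate-on-demand, not yet backed by a page. */
#define XEN_DOMCTL_PFINFO_POD (0xbU << XEN_DOMCTL_PFINFO_LTAB_SHIFT)

with getpageframeinfo3 reporting it instead of populating the entry,
so a monitoring daemon could simply skip such gfns.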

> > 2. Allocate the PoD cache before populating the p2m table

Any reason not to do this?

Tim.


* Re: PoD code killing domain before it really gets started
  2012-08-07  7:34         ` Jan Beulich
  2012-08-07 10:00           ` Tim Deegan
@ 2012-08-07 10:20           ` George Dunlap
  2012-08-07 11:05             ` Jan Beulich
  1 sibling, 1 reply; 22+ messages in thread
From: George Dunlap @ 2012-08-07 10:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Tim Deegan, Ian Campbell, xen-devel

On Tue, Aug 7, 2012 at 8:34 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> I guess there are two problems with that:
>> * As you've seen, apparently dom0 may access these pages before any
>> faults happen.
>> * If it happens that reclaim_single is below the only zeroed page, the
>> guest will crash even when there is reclaim-able memory available.
>>
>> Two ways we could fix this:
>> 1. Remove dom0 accesses (what on earth could be looking at a
>> not-yet-created VM?)
>
> I'm told it's a monitoring daemon, and yes, they are intending to
> adjust it to first query the GFN's type (and don't do the access
> when it's not populated, yet). But wait, I didn't check the code
> when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3)
> also call get_page_from_gfn() with P2M_ALLOC, so would also
> trigger the PoD code (in -unstable at least) - Tim, was that really
> a correct adjustment in 25355:974ad81bb68b? It looks to be a
> 1:1 translation, but is that really necessary? If one wanted to
> find out whether a page is PoD to avoid getting it populated,
> how would that be done from outside the hypervisor? Would
> we need XEN_DOMCTL_getpageframeinfo4 for this?
>
>> 2. Allocate the PoD cache before populating the p2m table
>> 3. Make it so that some accesses fail w/o crashing the guest?  I don't
>> see how that's really practical.
>
> What's wrong with telling control tools that a certain page is
> unpopulated (from which they will be able to imply that's it's all
> clear from the guest's pov)?

Because in the general case it's wrong.  The only time crashing the
guest is *not* the right thing to do is in the case we have at hand,
where PoD pages are accessed before the PoD memory is allocated.

Probably the quickest fix would be if there was a simple way for the
monitoring daemon to filter out domains that aren't completely built
yet -- maybe by looking at something in xenstore?

But the current state of things does seem unnecessarily fragile; I
think if it can be done, allocating PoD memory before writing PoD
entries is probably a good thing to do anyway.

 -George


* Re: PoD code killing domain before it really gets started
  2012-08-07 10:00           ` Tim Deegan
@ 2012-08-07 10:32             ` George Dunlap
  2012-08-07 11:03             ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: George Dunlap @ 2012-08-07 10:32 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Ian Jackson, Ian Campbell, Jan Beulich, xen-devel

On Tue, Aug 7, 2012 at 11:00 AM, Tim Deegan <tim@xen.org> wrote:
>> > 2. Allocate the PoD cache before populating the p2m table
>
> Any reason not to do this?

The only reason I can think of is that it makes it more fiddly if we
*didn't* want some entries to be PoD.  For example, if it is the case
that the first 2MiB shouldn't be PoD, we'd have to subtract 2MiB from
the target when doing the allocation.  Doing it afterwards means not
having to worry about it.

I can't think of any reason why the first 2MiB couldn't be PoD,
however; so I think this is probably the best route to take.

 -George


* Re: PoD code killing domain before it really gets started
  2012-08-07 10:00           ` Tim Deegan
  2012-08-07 10:32             ` George Dunlap
@ 2012-08-07 11:03             ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2012-08-07 11:03 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Ian Jackson, Ian Campbell, xen-devel

>>> On 07.08.12 at 12:00, Tim Deegan <tim@xen.org> wrote:
>> > 2. Allocate the PoD cache before populating the p2m table
> 
> Any reason not to do this?

Don't know. I was (maybe wrongly) assuming that the order things
got done in was such for a reason. That's why I had added tools
folks to Cc.

In any case I have asked the reporter to test a respective change.

Jan


* Re: PoD code killing domain before it really gets started
  2012-08-07 10:20           ` George Dunlap
@ 2012-08-07 11:05             ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2012-08-07 11:05 UTC (permalink / raw)
  To: George Dunlap; +Cc: Tim Deegan, Ian Jackson, Ian Campbell, xen-devel

>>> On 07.08.12 at 12:20, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> Probably the quickest fix would be if there was a simple way for the
> monitoring daemon to filter out domains that aren't completely built
> yet -- maybe by looking at something in xenstore?

Given that the getpageframeinfo thing that I thought could be
used for this doesn't work, I'd be curious what (xenstore or
other) indicators we could suggest they look at. I personally
have no idea...

Jan


* Re: PoD code killing domain before it really gets started
  2012-08-06 16:03       ` George Dunlap
  2012-08-07  7:34         ` Jan Beulich
@ 2012-08-07 12:17         ` Jan Beulich
  2012-08-07 13:13           ` George Dunlap
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2012-08-07 12:17 UTC (permalink / raw)
  To: George Dunlap; +Cc: Ian Jackson, Ian Campbell, xen-devel

>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> 2. Allocate the PoD cache before populating the p2m table

So this doesn't work, the call simply has no effect (and never
reaches p2m_pod_set_cache_target()). Apparently because
of

    /* P == B: Nothing to do. */
    if ( p2md->pod.entry_count == 0 )
        goto out;

in p2m_pod_set_mem_target(). Now I'm not sure about the
proper adjustment here: Entirely dropping the conditional is
certainly wrong. Would

    if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
        goto out;

be okay?

But then later in that function we also have

    /* B < T': Set the cache size equal to # of outstanding entries,
     * let the balloon driver fill in the rest. */
    if ( pod_target > p2md->pod.entry_count )
        pod_target = p2md->pod.entry_count;

which in the case at hand would set pod_target to 0, and the
whole operation would again not have any effect afaict. So
maybe this was the reason to do this operation _after_ the
normal address space population?

Jan


* Re: PoD code killing domain before it really gets started
  2012-08-07 12:17         ` Jan Beulich
@ 2012-08-07 13:13           ` George Dunlap
  2012-08-07 13:29             ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: George Dunlap @ 2012-08-07 13:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Ian Campbell, xen-devel

On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> 2. Allocate the PoD cache before populating the p2m table
>
> So this doesn't work, the call simply has no effect (and never
> reaches p2m_pod_set_cache_target()). Apparently because
> of
>
>     /* P == B: Nothing to do. */
>     if ( p2md->pod.entry_count == 0 )
>         goto out;
>
> in p2m_pod_set_mem_target(). Now I'm not sure about the
> proper adjustment here: Entirely dropping the conditional is
> certainly wrong. Would
>
>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>         goto out;
>
> be okay?
>
> But then later in that function we also have
>
>     /* B < T': Set the cache size equal to # of outstanding entries,
>      * let the balloon driver fill in the rest. */
>     if ( pod_target > p2md->pod.entry_count )
>         pod_target = p2md->pod.entry_count;
>
> which in the case at hand would set pod_target to 0, and the
> whole operation would again not have any effect afaict. So
> maybe this was the reason to do this operation _after_ the
> normal address space population?

Snap -- forgot about that.

The main thing is for set_mem_target() to be simple for the toolstack
-- it's just supposed to say how much memory it wants the guest to
use, and Xen is supposed to figure out how much memory the PoD cache
needs.  The interface is that the toolstack is just supposed to call
set_mem_target() after each time it changes the balloon target.  The
idea was to be robust against the user setting arbitrary new targets
before the balloon driver had reached the old target.  So the problem
with allowing (pod_target > entry_count) is that that's the condition
that happens when you are ballooning up.
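
To make the ballooning-up case concrete (a toy model of the
bookkeeping only, not the real p2m_pod_set_mem_target() code):

#include <stdio.h>

/* B = T - P, capped at the number of outstanding PoD entries. */
static long pod_cache_for(long target, long populated, long entries)
{
    long cache = target - populated;

    if ( cache > entries )   /* B < T': let the balloon driver fill the rest */
        cache = entries;
    return cache;
}

int main(void)
{
    /* The problem case: domain being built, no PoD entries written yet. */
    printf("cache = %#lx\n", pod_cache_for(0xc8000, 0x1e0, 0));
    /* Ballooning up later: cache stays capped at the outstanding entries. */
    printf("cache = %#lx\n", pod_cache_for(0xc8000, 0x80000, 0x40000));
    return 0;
}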

Maybe the best thing to do is to introduce a specific call to
initialize the PoD cache that would ignore entry_count?

 -George


* Re: PoD code killing domain before it really gets started
  2012-08-07 13:13           ` George Dunlap
@ 2012-08-07 13:29             ` Jan Beulich
  2012-08-07 15:08               ` George Dunlap
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2012-08-07 13:29 UTC (permalink / raw)
  To: George Dunlap; +Cc: Ian Jackson, Ian Campbell, xen-devel

>>> On 07.08.12 at 15:13, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> 2. Allocate the PoD cache before populating the p2m table
>>
>> So this doesn't work, the call simply has no effect (and never
>> reaches p2m_pod_set_cache_target()). Apparently because
>> of
>>
>>     /* P == B: Nothing to do. */
>>     if ( p2md->pod.entry_count == 0 )
>>         goto out;
>>
>> in p2m_pod_set_mem_target(). Now I'm not sure about the
>> proper adjustment here: Entirely dropping the conditional is
>> certainly wrong. Would
>>
>>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>         goto out;
>>
>> be okay?
>>
>> But then later in that function we also have
>>
>>     /* B < T': Set the cache size equal to # of outstanding entries,
>>      * let the balloon driver fill in the rest. */
>>     if ( pod_target > p2md->pod.entry_count )
>>         pod_target = p2md->pod.entry_count;
>>
>> which in the case at hand would set pod_target to 0, and the
>> whole operation would again not have any effect afaict. So
>> maybe this was the reason to do this operation _after_ the
>> normal address space population?
> 
> Snap -- forgot about that.
> 
> The main thing is for set_mem_target() to be simple for the toolstack
> -- it's just supposed to say how much memory it wants the guest to
> use, and Xen is supposed to figure out how much memory the PoD cache
> needs.  The interface is that the toolstack is just supposed to call
> set_mem_target() after each time it changes the balloon target.  The
> idea was to be robust against the user setting arbitrary new targets
> before the balloon driver had reached the old target.  So the problem
> with allowing (pod_target > entry_count) is that that's the condition
> that happens when you are ballooning up.
> 
> Maybe the best thing to do is to introduce a specific call to
> initialize the PoD cache that would ignore entry_count?

Hmm, that would look more like a hack to me.

How about doing the initial check as suggested earlier

    if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
        goto out;

and the latter check in a similar way

    if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
        pod_target = p2md->pod.entry_count;

(which would still take care of any ballooning activity)? Or are
there any other traps to fall into?

Jan


* Re: PoD code killing domain before it really gets started
       [not found] <mailman.10292.1344326858.1399.xen-devel@lists.xen.org>
@ 2012-08-07 14:40 ` Andres Lagar-Cavilla
  2012-08-07 15:04   ` George Dunlap
  0 siblings, 1 reply; 22+ messages in thread
From: Andres Lagar-Cavilla @ 2012-08-07 14:40 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, Ian.Jackson, tim, JBeulich

>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com>
>>>> wrote:
>> I guess there are two problems with that:
>> * As you've seen, apparently dom0 may access these pages before any
>> faults happen.
>> * If it happens that reclaim_single is below the only zeroed page, the
>> guest will crash even when there is reclaim-able memory available.
>>
>> Two ways we could fix this:
>> 1. Remove dom0 accesses (what on earth could be looking at a
>> not-yet-created VM?)
>
> I'm told it's a monitoring daemon, and yes, they are intending to
> adjust it to first query the GFN's type (and don't do the access
> when it's not populated, yet). But wait, I didn't check the code
> when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3)
> also call get_page_from_gfn() with P2M_ALLOC, so would also
> trigger the PoD code (in -unstable at least) - Tim, was that really
> a correct adjustment in 25355:974ad81bb68b? It looks to be a
> 1:1 translation, but is that really necessary? If one wanted to
> find out whether a page is PoD to avoid getting it populated,
> how would that be done from outside the hypervisor? Would
> we need XEN_DOMCTL_getpageframeinfo4 for this?
>
>> 2. Allocate the PoD cache before populating the p2m table
>> 3. Make it so that some accesses fail w/o crashing the guest?  I don't
>> see how that's really practical.
>
> What's wrong with telling control tools that a certain page is
> unpopulated (from which they will be able to imply that's it's all
> clear from the guest's pov)? Even outside of the current problem,
> I would think that's more efficient than allocating the page. Of
> course, the control tools need to be able to cope with that. And
> it may also be necessary to distinguish between read and
> read/write mappings being established (and for r/w ones the
> option of populating at access time rather than at creation time
> would need to be explored).

I wouldn't be opposed to some form of getpageframeinfo4. It's not just PoD
we are talking about here. Is the page paged out? Is the page shared?

Right now we have global per-domain queries (domaininfo). Or individual
gfn debug memctl's. A batched interface with richer information would be a
blessing for debugging or diagnosis purposes.

The first order of business is exposing the type. Do we really want to
expose the whole range of p2m_* types or just "really useful" ones like
is_shared, is_pod, is_paged, is_normal? An argument for the former is that
the mem event interface already pumps the p2m_* type up the stack.

The other useful bit of information I can think of is exposing the shared
ref count.

My two cents
Andres

>
>> 4. Change the sweep routine so that the lower 2MiB gets swept
>>
>> #2 would require us to use all PoD entries when building the p2m
>> table, thus addressing the mail you mentioned from 25 July*.  Given
>> that you don't want #1, it seems like #2 is the best option.
>>
>> No matter what we do, the sweep routine for 4.2 should be re-written
>> to search all of memory at least once (maybe with a timeout for
>> watchdogs), since it's only called in an actual emergency.
>>
>> Let me take a look...
>
> Thanks!
>
> Jan


* Re: PoD code killing domain before it really gets started
  2012-08-07 14:40 ` Andres Lagar-Cavilla
@ 2012-08-07 15:04   ` George Dunlap
  2012-08-07 15:36     ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 22+ messages in thread
From: George Dunlap @ 2012-08-07 15:04 UTC (permalink / raw)
  To: andres@lagarcavilla.org
  Cc: Ian Jackson, Tim (Xen.org), JBeulich@suse.com,
	xen-devel@lists.xen.org

On 07/08/12 15:40, Andres Lagar-Cavilla wrote:
>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com>
>>>>> wrote:
>>> I guess there are two problems with that:
>>> * As you've seen, apparently dom0 may access these pages before any
>>> faults happen.
>>> * If it happens that reclaim_single is below the only zeroed page, the
>>> guest will crash even when there is reclaim-able memory available.
>>>
>>> Two ways we could fix this:
>>> 1. Remove dom0 accesses (what on earth could be looking at a
>>> not-yet-created VM?)
>> I'm told it's a monitoring daemon, and yes, they are intending to
>> adjust it to first query the GFN's type (and don't do the access
>> when it's not populated, yet). But wait, I didn't check the code
>> when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3)
>> also call get_page_from_gfn() with P2M_ALLOC, so would also
>> trigger the PoD code (in -unstable at least) - Tim, was that really
>> a correct adjustment in 25355:974ad81bb68b? It looks to be a
>> 1:1 translation, but is that really necessary? If one wanted to
>> find out whether a page is PoD to avoid getting it populated,
>> how would that be done from outside the hypervisor? Would
>> we need XEN_DOMCTL_getpageframeinfo4 for this?
>>
>>> 2. Allocate the PoD cache before populating the p2m table
>>> 3. Make it so that some accesses fail w/o crashing the guest?  I don't
>>> see how that's really practical.
>> What's wrong with telling control tools that a certain page is
>> unpopulated (from which they will be able to imply that's it's all
>> clear from the guest's pov)? Even outside of the current problem,
>> I would think that's more efficient than allocating the page. Of
>> course, the control tools need to be able to cope with that. And
>> it may also be necessary to distinguish between read and
>> read/write mappings being established (and for r/w ones the
>> option of populating at access time rather than at creation time
>> would need to be explored).
> I wouldn't be opposed to some form of getpageframeinfo4. It's not just PoD
> we are talking about here. Is the page paged out? Is the page shared?
>
> Right now we have global per-domain queries (domaininfo). Or individual
> gfn debug memctl's. A batched interface with richer information would be a
> blessing for debugging or diagnosis purposes.
>
> The first order of business is exposing the type. Do we really want to
> expose the whole range of p2m_* types or just "really useful" ones like
> is_shared, is_pod, is_paged, is_normal? An argument for the former is that
> the mem event interface already pumps the p2m_* type up the stack.
>
> The other useful bit of information I can think of is exposing the shared
> ref count.
I think just like the gfn_to_mfn() interface, we need an "I care about 
the details" interface and an "I don't care about the details" 
interface.  If a page isn't present, or needs to be un-shared, or is PoD 
and not currently available, then maybe dom0 callers trying to map that 
page should get something like -EAGAIN?  Just something that indicates, 
"This page isn't here at the moment, but may be here soon."  What do you 
think?

  -George


* Re: PoD code killing domain before it really gets started
  2012-08-07 13:29             ` Jan Beulich
@ 2012-08-07 15:08               ` George Dunlap
  2012-08-07 15:36                 ` Jan Beulich
  2012-08-09  8:37                 ` Jan Beulich
  0 siblings, 2 replies; 22+ messages in thread
From: George Dunlap @ 2012-08-07 15:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Ian Campbell, xen-devel

On 07/08/12 14:29, Jan Beulich wrote:
>>>> On 07.08.12 at 15:13, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>>> 2. Allocate the PoD cache before populating the p2m table
>>> So this doesn't work, the call simply has no effect (and never
>>> reaches p2m_pod_set_cache_target()). Apparently because
>>> of
>>>
>>>      /* P == B: Nothing to do. */
>>>      if ( p2md->pod.entry_count == 0 )
>>>          goto out;
>>>
>>> in p2m_pod_set_mem_target(). Now I'm not sure about the
>>> proper adjustment here: Entirely dropping the conditional is
>>> certainly wrong. Would
>>>
>>>      if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>>          goto out;
>>>
>>> be okay?
>>>
>>> But then later in that function we also have
>>>
>>>      /* B < T': Set the cache size equal to # of outstanding entries,
>>>       * let the balloon driver fill in the rest. */
>>>      if ( pod_target > p2md->pod.entry_count )
>>>          pod_target = p2md->pod.entry_count;
>>>
>>> which in the case at hand would set pod_target to 0, and the
>>> whole operation would again not have any effect afaict. So
>>> maybe this was the reason to do this operation _after_ the
>>> normal address space population?
>> Snap -- forgot about that.
>>
>> The main thing is for set_mem_target() to be simple for the toolstack
>> -- it's just supposed to say how much memory it wants the guest to
>> use, and Xen is supposed to figure out how much memory the PoD cache
>> needs.  The interface is that the toolstack is just supposed to call
>> set_mem_target() after each time it changes the balloon target.  The
>> idea was to be robust against the user setting arbitrary new targets
>> before the balloon driver had reached the old target.  So the problem
>> with allowing (pod_target > entry_count) is that that's the condition
>> that happens when you are ballooning up.
>>
>> Maybe the best thing to do is to introduce a specific call to
>> initialize the PoD cache that would ignore entry_count?
> Hmm, would looks more like a hack to me.
>
> How about doing the initial check as suggested earlier
>
>      if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>          goto out;
>
> and the latter check in a similar way
>
>      if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
>          pod_target = p2md->pod.entry_count;
>
> (which would still take care of any ballooning activity)? Or are
> there any other traps to fall into?
The "d->tot_pages > 0" seems more like a hack to me. :-) What's hackish 
about having an interface like this?
* allocate_pod_mem()
* for() { populate_pod_mem() }
* [Boot VM]
* set_pod_target()

Right now set_pod_mem() is used both for initial allocation and for 
adjustments.  But it seems like there's good reason to make a distinction.

  -George


* Re: PoD code killing domain before it really gets started
  2012-08-07 15:08               ` George Dunlap
@ 2012-08-07 15:36                 ` Jan Beulich
  2012-08-09  8:37                 ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2012-08-07 15:36 UTC (permalink / raw)
  To: George Dunlap; +Cc: Ian Jackson, Ian Campbell, xen-devel

>>> On 07.08.12 at 17:08, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 07/08/12 14:29, Jan Beulich wrote:
>>>>> On 07.08.12 at 15:13, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>>>> 2. Allocate the PoD cache before populating the p2m table
>>>> So this doesn't work, the call simply has no effect (and never
>>>> reaches p2m_pod_set_cache_target()). Apparently because
>>>> of
>>>>
>>>>      /* P == B: Nothing to do. */
>>>>      if ( p2md->pod.entry_count == 0 )
>>>>          goto out;
>>>>
>>>> in p2m_pod_set_mem_target(). Now I'm not sure about the
>>>> proper adjustment here: Entirely dropping the conditional is
>>>> certainly wrong. Would
>>>>
>>>>      if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>>>          goto out;
>>>>
>>>> be okay?
>>>>
>>>> But then later in that function we also have
>>>>
>>>>      /* B < T': Set the cache size equal to # of outstanding entries,
>>>>       * let the balloon driver fill in the rest. */
>>>>      if ( pod_target > p2md->pod.entry_count )
>>>>          pod_target = p2md->pod.entry_count;
>>>>
>>>> which in the case at hand would set pod_target to 0, and the
>>>> whole operation would again not have any effect afaict. So
>>>> maybe this was the reason to do this operation _after_ the
>>>> normal address space population?
>>> Snap -- forgot about that.
>>>
>>> The main thing is for set_mem_target() to be simple for the toolstack
>>> -- it's just supposed to say how much memory it wants the guest to
>>> use, and Xen is supposed to figure out how much memory the PoD cache
>>> needs.  The interface is that the toolstack is just supposed to call
>>> set_mem_target() after each time it changes the balloon target.  The
>>> idea was to be robust against the user setting arbitrary new targets
>>> before the balloon driver had reached the old target.  So the problem
>>> with allowing (pod_target > entry_count) is that that's the condition
>>> that happens when you are ballooning up.
>>>
>>> Maybe the best thing to do is to introduce a specific call to
>>> initialize the PoD cache that would ignore entry_count?
>> Hmm, would looks more like a hack to me.
>>
>> How about doing the initial check as suggested earlier
>>
>>      if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>          goto out;
>>
>> and the latter check in a similar way
>>
>>      if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
>>          pod_target = p2md->pod.entry_count;
>>
>> (which would still take care of any ballooning activity)? Or are
>> there any other traps to fall into?
> The "d->tot_pages > 0" seems more like a hack to me. :-) What's hackish 
> about having an interface like this?
> * allocate_pod_mem()
> * for() { populate_pod_mem() }
> * [Boot VM]
> * set_pod_target()

Mostly the fact that it's an almost identical interface to what
we already have. After all, the two !alloc checks would go at
exactly the place where I had put the d->tot_pages > 0.
Plus the new interface likely ought to check that d->tot_pages
is zero. In the end you're proposing a full new interface for
something that can be done with two small adjustments.

> Right now set_pod_mem() is used both for initial allocation and for 
> adjustments.  But it seems like there's good reason to make a distinction.

Yes. But that distinction can well be implicit.

In any case, as for testing this out my approach would be far
easier (even if it looks like a hack to you) - is there anything
wrong with it (i.e. is my assumption about where the !alloc checks
would have to go wrong)? If not, I'd prefer to have that
simpler thing tested, and then we can still settle on the more
involved change you would like to see.

Jan


* Re: PoD code killing domain before it really gets started
  2012-08-07 15:04   ` George Dunlap
@ 2012-08-07 15:36     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 22+ messages in thread
From: Andres Lagar-Cavilla @ 2012-08-07 15:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Jackson, Tim (Xen.org), JBeulich@suse.com,
	xen-devel@lists.xen.org

> On 07/08/12 15:40, Andres Lagar-Cavilla wrote:
>>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com>
>>>>>> wrote:
>>>> I guess there are two problems with that:
>>>> * As you've seen, apparently dom0 may access these pages before any
>>>> faults happen.
>>>> * If it happens that reclaim_single is below the only zeroed page, the
>>>> guest will crash even when there is reclaim-able memory available.
>>>>
>>>> Two ways we could fix this:
>>>> 1. Remove dom0 accesses (what on earth could be looking at a
>>>> not-yet-created VM?)
>>> I'm told it's a monitoring daemon, and yes, they are intending to
>>> adjust it to first query the GFN's type (and don't do the access
>>> when it's not populated, yet). But wait, I didn't check the code
>>> when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3)
>>> also call get_page_from_gfn() with P2M_ALLOC, so would also
>>> trigger the PoD code (in -unstable at least) - Tim, was that really
>>> a correct adjustment in 25355:974ad81bb68b? It looks to be a
>>> 1:1 translation, but is that really necessary? If one wanted to
>>> find out whether a page is PoD to avoid getting it populated,
>>> how would that be done from outside the hypervisor? Would
>>> we need XEN_DOMCTL_getpageframeinfo4 for this?
>>>
>>>> 2. Allocate the PoD cache before populating the p2m table
>>>> 3. Make it so that some accesses fail w/o crashing the guest?  I don't
>>>> see how that's really practical.
>>> What's wrong with telling control tools that a certain page is
>>> unpopulated (from which they will be able to imply that's it's all
>>> clear from the guest's pov)? Even outside of the current problem,
>>> I would think that's more efficient than allocating the page. Of
>>> course, the control tools need to be able to cope with that. And
>>> it may also be necessary to distinguish between read and
>>> read/write mappings being established (and for r/w ones the
>>> option of populating at access time rather than at creation time
>>> would need to be explored).
>> I wouldn't be opposed to some form of getpageframeinfo4. It's not just
>> PoD
>> we are talking about here. Is the page paged out? Is the page shared?
>>
>> Right now we have global per-domain queries (domaininfo). Or individual
>> gfn debug memctl's. A batched interface with richer information would be
>> a
>> blessing for debugging or diagnosis purposes.
>>
>> The first order of business is exposing the type. Do we really want to
>> expose the whole range of p2m_* types or just "really useful" ones like
>> is_shared, is_pod, is_paged, is_normal? An argument for the former is
>> that
>> the mem event interface already pumps the p2m_* type up the stack.
>>
>> The other useful bit of information I can think of is exposing the
>> shared
>> ref count.
> I think just like the gfn_to_mfn() interface, we need a "I care about
> the details" interface and an "I don't care about the details"
> interface.  If a page isn't present, or needs to be un-shared, or is PoD
> and not currently available, then maybe dom0 callers trying to map that
> page should get something like -EAGAIN?  Just something that indicates,
> "This page isn't here at the moment, but may be here soon."  What do you
> think?

Sure.

Right now you get -ENOENT if dom0 tries to mmap a foreign frame that is
paged out. Kernel-level backends get the same with grant mappings. As a
side-effect, the hypervisor has triggered the page in, so one of the next
retries should succeed.

My opinion is that you should expand the use of -ENOENT for this
newly-minted "delayed PoD" case. Iiuc, PoD would just succeed or crash the
guest prior to this conversation.

That way, no new code is needed in the upper-layers (neither libxc nor
kernel backends) to deal with delayed PoD allocations. The retry loops
that paging needs are already in place, and PoD would leverage them.
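
In caller terms, the kind of loop I mean (sketch only - the exact
error propagation through libxc is from memory):

    void *p;
    int tries;

    for ( tries = 0; tries < 100; tries++ )
    {
        p = xc_map_foreign_range(xch, domid, XC_PAGE_SIZE, PROT_READ, gfn);
        if ( p != NULL || errno != ENOENT )
            break;
        /* ENOENT: paged out (or, with this proposal, a deferred PoD
         * allocation); the hypervisor has already kicked things off,
         * so just wait and retry. */
        usleep(1000);
    }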

Sharing can get ENOMEM when breaking shares. Much like paging, the
hypervisor will have kicked the appropriate helper, so a retry with an
expectation of success could be put in place. (Nevermind that there are no
in-tree helpers atm: what to do, balloon dom0 to release more mem?). I can
look into making it uniform with the above cases.

As for the detailed interface, getpageframeinfo4 sounds like a good idea.
Otherwise, you need a new privcmd mmap interface in the kernel ... snore
;)

Andres
>
>   -George
>


* Re: PoD code killing domain before it really gets started
  2012-08-07 15:08               ` George Dunlap
  2012-08-07 15:36                 ` Jan Beulich
@ 2012-08-09  8:37                 ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2012-08-09  8:37 UTC (permalink / raw)
  To: George Dunlap; +Cc: Ian Jackson, Ian Campbell, xen-devel

>>> On 07.08.12 at 17:08, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 07/08/12 14:29, Jan Beulich wrote:
>> Hmm, would looks more like a hack to me.
>>
>> How about doing the initial check as suggested earlier
>>
>>      if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>          goto out;
>>
>> and the latter check in a similar way
>>
>>      if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
>>          pod_target = p2md->pod.entry_count;
>>
>> (which would still take care of any ballooning activity)? Or are
>> there any other traps to fall into?
> The "d->tot_pages > 0" seems more like a hack to me. :-) What's hackish 
> about having an interface like this?
> * allocate_pod_mem()
> * for() { populate_pod_mem() }
> * [Boot VM]
> * set_pod_target()
> 
> Right now set_pod_mem() is used both for initial allocation and for 
> adjustments.  But it seems like there's good reason to make a distinction.

So that one didn't take hypercall preemption into account.

While adjusting this to use "d->tot_pages > p2md->pod.count",
which in turn I converted to simply "populated > 0" (moving the
calculation of "populated" earlier), I noticed that there's quite a
bit of type inconsistency: I first stumbled across
p2m_pod_set_mem_target()'s "pod_target" being "unsigned"
instead of "unsigned long", then noticed that all the fields in
struct p2m_domain's pod sub-structure aren't "long" as at least
the GFNs ought to be (the counts should be not only for consistency,
but also because they're signed, and hence overflow earlier). As a
consequence, the PoD code itself would also need careful review,
as there are a number of questionable uses of plain "int" (and, quite
the opposite, two cases of "order" pointlessly being "unsigned long").
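
I.e. something along these lines would seem to be needed (illustrative
only, limited to the fields named in this thread, and with the exact
existing declarations quoted from memory):

    struct {
        ...
-       int           count, entry_count;
-       unsigned      reclaim_single, max_guest;
+       long          count, entry_count;
+       unsigned long reclaim_single, max_guest;
        ...
    } pod;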

Jan

