All of lore.kernel.org
 help / color / mirror / Atom feed
* Buggy interaction of live migration and p2m updates
@ 2014-11-20 18:28 Andrew Cooper
  2014-11-21  5:41 ` Juergen Gross
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Andrew Cooper @ 2014-11-20 18:28 UTC (permalink / raw)
  To: Xen-devel List
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Ian Jackson, Tim Deegan,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

Hello,

Tim, David and I were discussing this over lunch.  This email is a
(hopefully accurate) account of our findings, and potential solutions. 
(If I have messed up, please shout.)

Currently, correct live migration of PV domains relies on the toolstack
(which has a live mapping of the guests p2m) not observing stale values
when the guest updates its p2m, and the race condition between a p2m
update and an m2p update.  Realistically, this means no updates to the
p2m at all, due to several potential race conditions.  Should any race
conditions happen (e.g. ballooning while live migrating), the effects
could be anything from an aborted migration to VM memory corruption.

It should be noted that migrationv2 does not fix any of this.  It alters
the way in which some race conditions might be observed.  During
development of migrationv2, there was an explicit non-requirement of
fixing the existing Ballooning+LiveMigration issues we were aware,
although at the time, we were not aware of this specific set of issues. 
Our goal was to simply make migrationv2 work in the same circumstances
as previously, but with a bitness-agnostic wire format and
forward-extensible protocol.


As far as these issues are concerned, there are two distinct p2m
modifications which we care about:
1) p2m structure changes (rearranging the layout of the p2m)
2) p2m content changes (altering entries in the p2m)

There is no possible way for the toolstack to prevent a domain from
altering its p2m.  At the moment, ballooning typically only occurs when
requested by the toolstack, but the underlying operations
(increase/decrease_reservation, mem_exchange, etc) can be used by the
guest at any point.  This includes Wei's guest memory fragmentation
changes.  Changes to the content of the p2m also occur for grant map and
unmap operations.


Currently in PV guests, the p2m is implemented using a 3-level tree,
with its root in the guests shared_info page.  It provides a hard VM
memory limit of 4TB for 32bit PV guests (which is far higher than the
128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.

Juergen has a proposed new p2m interface using a virtual linear
mapping.  This is conceptually similar to the previous implementation
(which is fine from the toolstacks point of view), but far less
complicated from the guests point of view, and removes the memory limits
imposed by the p2m structure.

The new virtual linear mapping suffers from the same interaction issues
as the old 3-level tree did, but the introduction of the new interface
affords us an opportunity to make all API modifications at once to
reduce churn.


During live migration, the toolstack maps the guests p2m into a linear
mapping in the toolstacks virtual address space.  This is done once at
the start of migration, and never subsequently altered.  During live
migration, the p2m is cross-verified with the m2p, and frames are sent
using pfns as a reference, as they will be located in different frames
on the receiving side.

Should the guest change the p2m structure during live migration, the
toolstack ends up with a stale p2m with a non-p2m frame in the middle,
resulting in bogus cross-referencing.  Should the guest change an entry
in the p2m, the p2m frame itself will be resent as it would be marked as
dirty in the logdirty bitmap, but the target pfn will remain unsent and
probably stale on the receiving side.


Another factor which needs to be taken into account is Remus/COLO, which
run the domains under live migration conditions for the duration of
their lifetime.

During the live part of migration, the toolstack already has to be able
to tolerate failures to normalise the pagetables, which result as a
consequent of the pagetables being in active.  These failures are fatal
on the final iteration after the guest has been paused, but the same
logic could be extended to p2m/m2p issues, if needed.


There are several potential solutions to these problems.

1) Freeze the guests p2m during live migrate

This is the simplest sounding option, but is quite problematic from the
point of view of the guest.  It is essentially a shared spinlock between
the toolstack and the guest kernel.  It would prevent any grant
map/unmap operations from occurring, and might interact badly with
certain p2m updated in the guest which would previously be expected to
unconditionally succeed.

Pros) (Can't think of any)
Cons) Not easy to implement (even conceptually), requires invasive guest
changes, will cripple Remus/COLO


2) Deep p2m dirty tracking

In the case that a p2m frame is discovered dirty in the logdirty bitmap,
we can be certain that a write has occurred to it, and in the common
case, means that the mapping has changed.  The toolstack could maintain
a non-live copy of the p2m which is updated as new frames are sent. 
When a dirty p2m frame is found, the live and non-live copies can be
consulted to find which pfn mappings have changed, and locally mark all
the altered pfns for retransmit.

Pros) No guest changes required
Cons) Toolstack needs to keep an additional copy of the guests p2m on
the sending side

3) Eagerly check for p2m structure changes.

p2m structure changes are rare after boot, but not impossible.  Each
iteration of live migration, the toolstack can check for dirty
higher-level p2m frames in the dirty bitmap.  In the case that a
structure update occurs, the toolstack can use information it already
has to calculate a subset of pfns affected by the update, and mark them
for resending.  (This can currently be done to the frame granularity
given the p2m frame lit, but in combination with 2), could result in
fewer pfns needing resending.)

Pros) No guest changes required.
Cons) Moderately high toolstack overhead,  Possibility to resend far
more pfns than strictly required.

4) Request p2m structure change updates from the guest

The guest could provide a "p2m generation count" to allow the toolstack
to evaluate whether the structure had changed.  This would allow the
live part of migration to periodically re-evaluate whether it should
remap the p2m to avoid stale mappings.

Pros) Easy to implement alongside the virtual linear mapping support. 
Easy for toolstack and guest
Cons) Only works with new virtual linear guests.


Proposed solution:  A combination of 2, 3 and 4.

For legacy 3-level p2m guests, the toolstack can detect p2m structure
updates by tracking the p2m top and mid levels in the logdirty bitmap,
and invalidating the modified subset of pfns.  It has to eagerly check
the p2m frame list list mfn entry in the shared info to see whether the
guest has swapped onto a completely new p2m.

For a virtual linear map, the intermediate levels are not available to
track, but we can require that the guest increment p2m generation clock
in the shared info.  When the structure changes, the toolstack can remap
the p2m and calculate the altered subset of pfns, and mark for resend.

The toolstack must also track changes in the p2m itself, and compare to
a local copy showing the mapping at the time at which the pfn was last
sent.  This can be used to work out which p2m mappings have changed, and
also be used to confirm whether the pfns on the receiving side are stale
or not.

I believe this covered all cases and race conditions.  In the case that
the p2m is updated before the m2p, the p2m frame will be marked dirty in
the bitmap, and discoverable on the next iteration.  At that point, if
the p2m and m2p are inconsistent, the pfn will be deferred until the
final iteration.  If not, the frame is sent and everything is all ok. 
In the case that the p2m is updated after the m2p, the p2m/m2p will be
consistent when the dirty bitmap is acted on.


Thoughts? (for anyone who has made it this far :)  I think I have
covered everything.)

~Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-20 18:28 Buggy interaction of live migration and p2m updates Andrew Cooper
@ 2014-11-21  5:41 ` Juergen Gross
  2014-11-21 10:32   ` Andrew Cooper
  2014-11-21  9:43 ` Ian Campbell
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Juergen Gross @ 2014-11-21  5:41 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel List
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, David Vrabel,
	Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 11/20/2014 07:28 PM, Andrew Cooper wrote:
> Hello,
>
> Tim, David and I were discussing this over lunch.  This email is a
> (hopefully accurate) account of our findings, and potential solutions.
> (If I have messed up, please shout.)
>
> Currently, correct live migration of PV domains relies on the toolstack
> (which has a live mapping of the guests p2m) not observing stale values
> when the guest updates its p2m, and the race condition between a p2m
> update and an m2p update.  Realistically, this means no updates to the
> p2m at all, due to several potential race conditions.  Should any race
> conditions happen (e.g. ballooning while live migrating), the effects
> could be anything from an aborted migration to VM memory corruption.
>
> It should be noted that migrationv2 does not fix any of this.  It alters
> the way in which some race conditions might be observed.  During
> development of migrationv2, there was an explicit non-requirement of
> fixing the existing Ballooning+LiveMigration issues we were aware,
> although at the time, we were not aware of this specific set of issues.
> Our goal was to simply make migrationv2 work in the same circumstances
> as previously, but with a bitness-agnostic wire format and
> forward-extensible protocol.
>
>
> As far as these issues are concerned, there are two distinct p2m
> modifications which we care about:
> 1) p2m structure changes (rearranging the layout of the p2m)
> 2) p2m content changes (altering entries in the p2m)
>
> There is no possible way for the toolstack to prevent a domain from
> altering its p2m.  At the moment, ballooning typically only occurs when
> requested by the toolstack, but the underlying operations
> (increase/decrease_reservation, mem_exchange, etc) can be used by the
> guest at any point.  This includes Wei's guest memory fragmentation
> changes.  Changes to the content of the p2m also occur for grant map and
> unmap operations.
>
>
> Currently in PV guests, the p2m is implemented using a 3-level tree,
> with its root in the guests shared_info page.  It provides a hard VM
> memory limit of 4TB for 32bit PV guests (which is far higher than the
> 128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.
>
> Juergen has a proposed new p2m interface using a virtual linear
> mapping.  This is conceptually similar to the previous implementation
> (which is fine from the toolstacks point of view), but far less
> complicated from the guests point of view, and removes the memory limits
> imposed by the p2m structure.
>
> The new virtual linear mapping suffers from the same interaction issues
> as the old 3-level tree did, but the introduction of the new interface
> affords us an opportunity to make all API modifications at once to
> reduce churn.
>
>
> During live migration, the toolstack maps the guests p2m into a linear
> mapping in the toolstacks virtual address space.  This is done once at
> the start of migration, and never subsequently altered.  During live
> migration, the p2m is cross-verified with the m2p, and frames are sent
> using pfns as a reference, as they will be located in different frames
> on the receiving side.
>
> Should the guest change the p2m structure during live migration, the
> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
> resulting in bogus cross-referencing.  Should the guest change an entry
> in the p2m, the p2m frame itself will be resent as it would be marked as
> dirty in the logdirty bitmap, but the target pfn will remain unsent and
> probably stale on the receiving side.
>
>
> Another factor which needs to be taken into account is Remus/COLO, which
> run the domains under live migration conditions for the duration of
> their lifetime.
>
> During the live part of migration, the toolstack already has to be able
> to tolerate failures to normalise the pagetables, which result as a
> consequent of the pagetables being in active.  These failures are fatal
> on the final iteration after the guest has been paused, but the same
> logic could be extended to p2m/m2p issues, if needed.
>
>
> There are several potential solutions to these problems.
>
> 1) Freeze the guests p2m during live migrate
>
> This is the simplest sounding option, but is quite problematic from the
> point of view of the guest.  It is essentially a shared spinlock between
> the toolstack and the guest kernel.  It would prevent any grant
> map/unmap operations from occurring, and might interact badly with
> certain p2m updated in the guest which would previously be expected to
> unconditionally succeed.
>
> Pros) (Can't think of any)
> Cons) Not easy to implement (even conceptually), requires invasive guest
> changes, will cripple Remus/COLO
>
>
> 2) Deep p2m dirty tracking
>
> In the case that a p2m frame is discovered dirty in the logdirty bitmap,
> we can be certain that a write has occurred to it, and in the common
> case, means that the mapping has changed.  The toolstack could maintain
> a non-live copy of the p2m which is updated as new frames are sent.
> When a dirty p2m frame is found, the live and non-live copies can be
> consulted to find which pfn mappings have changed, and locally mark all
> the altered pfns for retransmit.
>
> Pros) No guest changes required
> Cons) Toolstack needs to keep an additional copy of the guests p2m on
> the sending side
>
> 3) Eagerly check for p2m structure changes.
>
> p2m structure changes are rare after boot, but not impossible.  Each
> iteration of live migration, the toolstack can check for dirty
> higher-level p2m frames in the dirty bitmap.  In the case that a
> structure update occurs, the toolstack can use information it already
> has to calculate a subset of pfns affected by the update, and mark them
> for resending.  (This can currently be done to the frame granularity
> given the p2m frame lit, but in combination with 2), could result in
> fewer pfns needing resending.)
>
> Pros) No guest changes required.
> Cons) Moderately high toolstack overhead,  Possibility to resend far
> more pfns than strictly required.
>
> 4) Request p2m structure change updates from the guest
>
> The guest could provide a "p2m generation count" to allow the toolstack
> to evaluate whether the structure had changed.  This would allow the
> live part of migration to periodically re-evaluate whether it should
> remap the p2m to avoid stale mappings.
>
> Pros) Easy to implement alongside the virtual linear mapping support.
> Easy for toolstack and guest
> Cons) Only works with new virtual linear guests.
>
>
> Proposed solution:  A combination of 2, 3 and 4.
>
> For legacy 3-level p2m guests, the toolstack can detect p2m structure
> updates by tracking the p2m top and mid levels in the logdirty bitmap,
> and invalidating the modified subset of pfns.  It has to eagerly check
> the p2m frame list list mfn entry in the shared info to see whether the
> guest has swapped onto a completely new p2m.
>
> For a virtual linear map, the intermediate levels are not available to
> track, but we can require that the guest increment p2m generation clock
> in the shared info.  When the structure changes, the toolstack can remap
> the p2m and calculate the altered subset of pfns, and mark for resend.
>
> The toolstack must also track changes in the p2m itself, and compare to
> a local copy showing the mapping at the time at which the pfn was last
> sent.  This can be used to work out which p2m mappings have changed, and
> also be used to confirm whether the pfns on the receiving side are stale
> or not.
>
> I believe this covered all cases and race conditions.  In the case that
> the p2m is updated before the m2p, the p2m frame will be marked dirty in
> the bitmap, and discoverable on the next iteration.  At that point, if
> the p2m and m2p are inconsistent, the pfn will be deferred until the
> final iteration.  If not, the frame is sent and everything is all ok.
> In the case that the p2m is updated after the m2p, the p2m/m2p will be
> consistent when the dirty bitmap is acted on.
>
>
> Thoughts? (for anyone who has made it this far :)  I think I have
> covered everything.)

Sounds okay.

Two remarks regarding the virtual linear map:
- The intermediate levels could be tracked, as they are memory pages as
   well. It is not practical to do so, however, as there might be lots of
   changes not related to the p2m.
- The generation count is being checked by the tools in a lazy manner.
   This will require an increment of the count by the guest only after
   changing the structure of the p2m map, I think.


Juergen

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-20 18:28 Buggy interaction of live migration and p2m updates Andrew Cooper
  2014-11-21  5:41 ` Juergen Gross
@ 2014-11-21  9:43 ` Ian Campbell
  2014-11-21 10:24   ` Andrew Cooper
  2014-11-21 10:43 ` Jan Beulich
  2014-12-01 14:38 ` David Vrabel
  3 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2014-11-21  9:43 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Wei Liu, Ian Jackson, Tim Deegan, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On Thu, 2014-11-20 at 18:28 +0000, Andrew Cooper wrote:
> Realistically, this means no updates to the
> p2m at all, due to several potential race conditions.

>From the rest of the mail it seems as if you are talking primarily about
changes to the p2m *structure*, i.e. which guest frames contain the p2m
pages, rather than changes to the p2m entries themselves. Is that
correct?

I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
here, where does that fit in?

> As far as these issues are concerned, there are two distinct p2m
> modifications which we care about:
> 1) p2m structure changes (rearranging the layout of the p2m)
> 2) p2m content changes (altering entries in the p2m)
> 
> There is no possible way for the toolstack to prevent a domain from
> altering its p2m.  At the moment, ballooning typically only occurs when
> requested by the toolstack, but the underlying operations
> (increase/decrease_reservation, mem_exchange, etc) can be used by the
> guest at any point.  This includes Wei's guest memory fragmentation
> changes.  Changes to the content of the p2m also occur for grant map and
> unmap operations.
> 
> 
> Currently in PV guests, the p2m is implemented using a 3-level tree,
> with its root in the guests shared_info page.  It provides a hard VM
> memory limit of 4TB for 32bit PV guests (which is far higher than the
> 128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.
> 
> Juergen has a proposed new p2m interface using a virtual linear
> mapping.  This is conceptually similar to the previous implementation
> (which is fine from the toolstacks point of view), but far less
> complicated from the guests point of view, and removes the memory limits
> imposed by the p2m structure.
> 
> The new virtual linear mapping suffers from the same interaction issues
> as the old 3-level tree did, but the introduction of the new interface
> affords us an opportunity to make all API modifications at once to
> reduce churn.
> 
> 
> During live migration, the toolstack maps the guests p2m into a linear
> mapping in the toolstacks virtual address space.  This is done once at
> the start of migration, and never subsequently altered.  During live
> migration, the p2m is cross-verified with the m2p, and frames are sent
> using pfns as a reference, as they will be located in different frames
> on the receiving side.
> 
> Should the guest change the p2m structure during live migration, the
> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
> resulting in bogus cross-referencing.  Should the guest change an entry
> in the p2m, the p2m frame itself will be resent as it would be marked as
> dirty in the logdirty bitmap, but the target pfn will remain unsent and
> probably stale on the receiving side.
> 
> 
> Another factor which needs to be taken into account is Remus/COLO, which
> run the domains under live migration conditions for the duration of
> their lifetime.
> 
> During the live part of migration, the toolstack already has to be able
> to tolerate failures to normalise the pagetables, which result as a
> consequent of the pagetables being in active.  These failures are fatal
> on the final iteration after the guest has been paused, but the same
> logic could be extended to p2m/m2p issues, if needed.
> 
> 
> There are several potential solutions to these problems.
> 
> 1) Freeze the guests p2m during live migrate
> 
> This is the simplest sounding option, but is quite problematic from the
> point of view of the guest.  It is essentially a shared spinlock between
> the toolstack and the guest kernel.  It would prevent any grant
> map/unmap operations from occurring, and might interact badly with
> certain p2m updated in the guest which would previously be expected to
> unconditionally succeed.
> 
> Pros) (Can't think of any)
> Cons) Not easy to implement (even conceptually), requires invasive guest
> changes, will cripple Remus/COLO
> 
> 
> 2) Deep p2m dirty tracking
> 
> In the case that a p2m frame is discovered dirty in the logdirty bitmap,
> we can be certain that a write has occurred to it, and in the common
> case, means that the mapping has changed.  The toolstack could maintain
> a non-live copy of the p2m which is updated as new frames are sent. 
> When a dirty p2m frame is found, the live and non-live copies can be
> consulted to find which pfn mappings have changed, and locally mark all
> the altered pfns for retransmit.
> 
> Pros) No guest changes required
> Cons) Toolstack needs to keep an additional copy of the guests p2m on
> the sending side
> 
> 3) Eagerly check for p2m structure changes.
> 
> p2m structure changes are rare after boot, but not impossible.  Each
> iteration of live migration, the toolstack can check for dirty
> higher-level p2m frames in the dirty bitmap.  In the case that a
> structure update occurs, the toolstack can use information it already
> has to calculate a subset of pfns affected by the update, and mark them
> for resending.  (This can currently be done to the frame granularity
> given the p2m frame lit, but in combination with 2), could result in
> fewer pfns needing resending.)
> 
> Pros) No guest changes required.
> Cons) Moderately high toolstack overhead,  Possibility to resend far
> more pfns than strictly required.
> 
> 4) Request p2m structure change updates from the guest
> 
> The guest could provide a "p2m generation count" to allow the toolstack
> to evaluate whether the structure had changed.  This would allow the
> live part of migration to periodically re-evaluate whether it should
> remap the p2m to avoid stale mappings.
> 
> Pros) Easy to implement alongside the virtual linear mapping support. 
> Easy for toolstack and guest
> Cons) Only works with new virtual linear guests.
> 
> 
> Proposed solution:  A combination of 2, 3 and 4.
> 
> For legacy 3-level p2m guests, the toolstack can detect p2m structure
> updates by tracking the p2m top and mid levels in the logdirty bitmap,
> and invalidating the modified subset of pfns.  It has to eagerly check
> the p2m frame list list mfn entry in the shared info to see whether the
> guest has swapped onto a completely new p2m.
> 
> For a virtual linear map, the intermediate levels are not available to
> track, but we can require that the guest increment p2m generation clock
> in the shared info.  When the structure changes, the toolstack can remap
> the p2m and calculate the altered subset of pfns, and mark for resend.
> 
> The toolstack must also track changes in the p2m itself, and compare to
> a local copy showing the mapping at the time at which the pfn was last
> sent.  This can be used to work out which p2m mappings have changed, and
> also be used to confirm whether the pfns on the receiving side are stale
> or not.
> 
> I believe this covered all cases and race conditions.  In the case that
> the p2m is updated before the m2p, the p2m frame will be marked dirty in
> the bitmap, and discoverable on the next iteration.  At that point, if
> the p2m and m2p are inconsistent, the pfn will be deferred until the
> final iteration.  If not, the frame is sent and everything is all ok. 
> In the case that the p2m is updated after the m2p, the p2m/m2p will be
> consistent when the dirty bitmap is acted on.
> 
> 
> Thoughts? (for anyone who has made it this far :)  I think I have
> covered everything.)
> 
> ~Andrew
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21  9:43 ` Ian Campbell
@ 2014-11-21 10:24   ` Andrew Cooper
  2014-11-21 10:46     ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2014-11-21 10:24 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Juergen Gross, Wei Liu, Ian Jackson, Tim Deegan, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang


[-- Attachment #1.1: Type: text/plain, Size: 897 bytes --]

On 21/11/14 09:43, Ian Campbell wrote:
> On Thu, 2014-11-20 at 18:28 +0000, Andrew Cooper wrote:
>> Realistically, this means no updates to the
>> p2m at all, due to several potential race conditions.
> From the rest of the mail it seems as if you are talking primarily about
> changes to the p2m *structure*, i.e. which guest frames contain the p2m
> pages, rather than changes to the p2m entries themselves. Is that
> correct?

I cover both, although the structure changes are the more obvious, and
more complicated to fix.

There are race conditions in both existing implementation between a
p2m/m2p update and the toolstack sampling the dirty bitmap where stale
frames don't get resent.

>
> I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
> here, where does that fit in?
>

It is referenced several times, although not by its exact name.

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 1656 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21  5:41 ` Juergen Gross
@ 2014-11-21 10:32   ` Andrew Cooper
  2014-11-27 15:14     ` Tim Deegan
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2014-11-21 10:32 UTC (permalink / raw)
  To: Juergen Gross, Xen-devel List
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, David Vrabel,
	Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 21/11/14 05:41, Juergen Gross wrote:
> On 11/20/2014 07:28 PM, Andrew Cooper wrote:
>> Hello,
>>
>> Tim, David and I were discussing this over lunch.  This email is a
>> (hopefully accurate) account of our findings, and potential solutions.
>> (If I have messed up, please shout.)
>>
>> Currently, correct live migration of PV domains relies on the toolstack
>> (which has a live mapping of the guests p2m) not observing stale values
>> when the guest updates its p2m, and the race condition between a p2m
>> update and an m2p update.  Realistically, this means no updates to the
>> p2m at all, due to several potential race conditions.  Should any race
>> conditions happen (e.g. ballooning while live migrating), the effects
>> could be anything from an aborted migration to VM memory corruption.
>>
>> It should be noted that migrationv2 does not fix any of this.  It alters
>> the way in which some race conditions might be observed.  During
>> development of migrationv2, there was an explicit non-requirement of
>> fixing the existing Ballooning+LiveMigration issues we were aware,
>> although at the time, we were not aware of this specific set of issues.
>> Our goal was to simply make migrationv2 work in the same circumstances
>> as previously, but with a bitness-agnostic wire format and
>> forward-extensible protocol.
>>
>>
>> As far as these issues are concerned, there are two distinct p2m
>> modifications which we care about:
>> 1) p2m structure changes (rearranging the layout of the p2m)
>> 2) p2m content changes (altering entries in the p2m)
>>
>> There is no possible way for the toolstack to prevent a domain from
>> altering its p2m.  At the moment, ballooning typically only occurs when
>> requested by the toolstack, but the underlying operations
>> (increase/decrease_reservation, mem_exchange, etc) can be used by the
>> guest at any point.  This includes Wei's guest memory fragmentation
>> changes.  Changes to the content of the p2m also occur for grant map and
>> unmap operations.
>>
>>
>> Currently in PV guests, the p2m is implemented using a 3-level tree,
>> with its root in the guests shared_info page.  It provides a hard VM
>> memory limit of 4TB for 32bit PV guests (which is far higher than the
>> 128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.
>>
>> Juergen has a proposed new p2m interface using a virtual linear
>> mapping.  This is conceptually similar to the previous implementation
>> (which is fine from the toolstacks point of view), but far less
>> complicated from the guests point of view, and removes the memory limits
>> imposed by the p2m structure.
>>
>> The new virtual linear mapping suffers from the same interaction issues
>> as the old 3-level tree did, but the introduction of the new interface
>> affords us an opportunity to make all API modifications at once to
>> reduce churn.
>>
>>
>> During live migration, the toolstack maps the guests p2m into a linear
>> mapping in the toolstacks virtual address space.  This is done once at
>> the start of migration, and never subsequently altered.  During live
>> migration, the p2m is cross-verified with the m2p, and frames are sent
>> using pfns as a reference, as they will be located in different frames
>> on the receiving side.
>>
>> Should the guest change the p2m structure during live migration, the
>> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
>> resulting in bogus cross-referencing.  Should the guest change an entry
>> in the p2m, the p2m frame itself will be resent as it would be marked as
>> dirty in the logdirty bitmap, but the target pfn will remain unsent and
>> probably stale on the receiving side.
>>
>>
>> Another factor which needs to be taken into account is Remus/COLO, which
>> run the domains under live migration conditions for the duration of
>> their lifetime.
>>
>> During the live part of migration, the toolstack already has to be able
>> to tolerate failures to normalise the pagetables, which result as a
>> consequent of the pagetables being in active.  These failures are fatal
>> on the final iteration after the guest has been paused, but the same
>> logic could be extended to p2m/m2p issues, if needed.
>>
>>
>> There are several potential solutions to these problems.
>>
>> 1) Freeze the guests p2m during live migrate
>>
>> This is the simplest sounding option, but is quite problematic from the
>> point of view of the guest.  It is essentially a shared spinlock between
>> the toolstack and the guest kernel.  It would prevent any grant
>> map/unmap operations from occurring, and might interact badly with
>> certain p2m updated in the guest which would previously be expected to
>> unconditionally succeed.
>>
>> Pros) (Can't think of any)
>> Cons) Not easy to implement (even conceptually), requires invasive guest
>> changes, will cripple Remus/COLO
>>
>>
>> 2) Deep p2m dirty tracking
>>
>> In the case that a p2m frame is discovered dirty in the logdirty bitmap,
>> we can be certain that a write has occurred to it, and in the common
>> case, means that the mapping has changed.  The toolstack could maintain
>> a non-live copy of the p2m which is updated as new frames are sent.
>> When a dirty p2m frame is found, the live and non-live copies can be
>> consulted to find which pfn mappings have changed, and locally mark all
>> the altered pfns for retransmit.
>>
>> Pros) No guest changes required
>> Cons) Toolstack needs to keep an additional copy of the guests p2m on
>> the sending side
>>
>> 3) Eagerly check for p2m structure changes.
>>
>> p2m structure changes are rare after boot, but not impossible.  Each
>> iteration of live migration, the toolstack can check for dirty
>> higher-level p2m frames in the dirty bitmap.  In the case that a
>> structure update occurs, the toolstack can use information it already
>> has to calculate a subset of pfns affected by the update, and mark them
>> for resending.  (This can currently be done to the frame granularity
>> given the p2m frame lit, but in combination with 2), could result in
>> fewer pfns needing resending.)
>>
>> Pros) No guest changes required.
>> Cons) Moderately high toolstack overhead,  Possibility to resend far
>> more pfns than strictly required.
>>
>> 4) Request p2m structure change updates from the guest
>>
>> The guest could provide a "p2m generation count" to allow the toolstack
>> to evaluate whether the structure had changed.  This would allow the
>> live part of migration to periodically re-evaluate whether it should
>> remap the p2m to avoid stale mappings.
>>
>> Pros) Easy to implement alongside the virtual linear mapping support.
>> Easy for toolstack and guest
>> Cons) Only works with new virtual linear guests.
>>
>>
>> Proposed solution:  A combination of 2, 3 and 4.
>>
>> For legacy 3-level p2m guests, the toolstack can detect p2m structure
>> updates by tracking the p2m top and mid levels in the logdirty bitmap,
>> and invalidating the modified subset of pfns.  It has to eagerly check
>> the p2m frame list list mfn entry in the shared info to see whether the
>> guest has swapped onto a completely new p2m.
>>
>> For a virtual linear map, the intermediate levels are not available to
>> track, but we can require that the guest increment p2m generation clock
>> in the shared info.  When the structure changes, the toolstack can remap
>> the p2m and calculate the altered subset of pfns, and mark for resend.
>>
>> The toolstack must also track changes in the p2m itself, and compare to
>> a local copy showing the mapping at the time at which the pfn was last
>> sent.  This can be used to work out which p2m mappings have changed, and
>> also be used to confirm whether the pfns on the receiving side are stale
>> or not.
>>
>> I believe this covered all cases and race conditions.  In the case that
>> the p2m is updated before the m2p, the p2m frame will be marked dirty in
>> the bitmap, and discoverable on the next iteration.  At that point, if
>> the p2m and m2p are inconsistent, the pfn will be deferred until the
>> final iteration.  If not, the frame is sent and everything is all ok.
>> In the case that the p2m is updated after the m2p, the p2m/m2p will be
>> consistent when the dirty bitmap is acted on.
>>
>>
>> Thoughts? (for anyone who has made it this far :)  I think I have
>> covered everything.)
>
> Sounds okay.
>
> Two remarks regarding the virtual linear map:
> - The intermediate levels could be tracked, as they are memory pages as
>   well. It is not practical to do so, however, as there might be lots of
>   changes not related to the p2m.

The intermediate levels are just pagetables, are they not? Or is there a
separate tracking structure?

> - The generation count is being checked by the tools in a lazy manner.
>   This will require an increment of the count by the guest only after
>   changing the structure of the p2m map, I think.

On further consideration, I think this needs to be a lockless
producer/consumer interface, with increment once at start, and once
again at the end.  The toolstack needs some ability to confirm that it
has got a consistent mapping of the virtual p2m, as it cant practically
detect updates via the logdirty bitmap.

It also occurs to me that the toolstack code needs to gain some use of
ACCESS_ONCE() when reading the live p2m.

~Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-20 18:28 Buggy interaction of live migration and p2m updates Andrew Cooper
  2014-11-21  5:41 ` Juergen Gross
  2014-11-21  9:43 ` Ian Campbell
@ 2014-11-21 10:43 ` Jan Beulich
  2014-11-21 10:54   ` Andrew Cooper
  2014-12-01 14:38 ` David Vrabel
  3 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2014-11-21 10:43 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson,
	Xen-devel List, David Vrabel, Shriram Rajagopalan, Hongyang Yang

>>> On 20.11.14 at 19:28, <andrew.cooper3@citrix.com> wrote:
> Should the guest change the p2m structure during live migration, the
> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
> resulting in bogus cross-referencing.  Should the guest change an entry
> in the p2m, the p2m frame itself will be resent as it would be marked as
> dirty in the logdirty bitmap, but the target pfn will remain unsent and
> probably stale on the receiving side.

MMU_MACHPHYS_UPDATE processing marks the page being changed
as dirty. Perhaps guest_physmap_{add,remove}_page() (or certain
callers thereof) should do so too?

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 10:24   ` Andrew Cooper
@ 2014-11-21 10:46     ` Ian Campbell
  2014-11-21 11:07       ` Andrew Cooper
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2014-11-21 10:46 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Wei Liu, Ian Jackson, Tim Deegan, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On Fri, 2014-11-21 at 10:24 +0000, Andrew Cooper wrote:
> On 21/11/14 09:43, Ian Campbell wrote:
> 
> > On Thu, 2014-11-20 at 18:28 +0000, Andrew Cooper wrote:
> > > Realistically, this means no updates to the
> > > p2m at all, due to several potential race conditions.
> > From the rest of the mail it seems as if you are talking primarily about
> > changes to the p2m *structure*, i.e. which guest frames contain the p2m
> > pages, rather than changes to the p2m entries themselves. Is that
> > correct?
> 
> I cover both, although the structure changes are the more obvious, and
> more complicated to fix.

Sure.

> There are race conditions in both existing implementation between a
> p2m/m2p update and the toolstack sampling the dirty bitmap where stale
> frames don't get resent.
> 
> > 
> > I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
> > here, where does that fit in?
> > 
> 
> It is referenced several times, although not by its exact name.

Hence no explicit mention.

It's ambiguous when you refer to "higher level frames" (which I presume
are the reference you are referring to) because some kernels (perhaps
only historic ones, I've not been keeping up) keep both an N-level tree
of their own internally and the toolstack visible frame_list_list
(sometimes partially overlapping at some level). Is every reference to
"higher level frames" actually intended to be a reference to
pfn_to_mfn_frame_list_list or not?

It seems like sometimes you are talking at times about tracking the
kernel's internal structure and not just pfn_to_mfn_frame_list_list and
I'm not sure why that would be. I'm also not sure why
pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
AFAIK the guest is still obliged to keep that up to date regardless of
the scheme it uses internally for accessing the p2m.

Ian.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 10:43 ` Jan Beulich
@ 2014-11-21 10:54   ` Andrew Cooper
  2014-11-27 15:00     ` Tim Deegan
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2014-11-21 10:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson,
	Xen-devel List, David Vrabel, Shriram Rajagopalan, Hongyang Yang

On 21/11/14 10:43, Jan Beulich wrote:
>>>> On 20.11.14 at 19:28, <andrew.cooper3@citrix.com> wrote:
>> Should the guest change the p2m structure during live migration, the
>> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
>> resulting in bogus cross-referencing.  Should the guest change an entry
>> in the p2m, the p2m frame itself will be resent as it would be marked as
>> dirty in the logdirty bitmap, but the target pfn will remain unsent and
>> probably stale on the receiving side.
> MMU_MACHPHYS_UPDATE processing marks the page being changed
> as dirty. Perhaps guest_physmap_{add,remove}_page() (or certain
> callers thereof) should do so too?
>
> Jan
>

This is certainly needed to fix HVM ballooning and live migration
issues, although now you point it out, it applies just as much to PV
guests as HVM guests.

I believe this might allow the toolstack to avoid keeping a second copy
of the p2m.

~Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 10:46     ` Ian Campbell
@ 2014-11-21 11:07       ` Andrew Cooper
  2014-11-21 11:15         ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2014-11-21 11:07 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Juergen Gross, Wei Liu, Ian Jackson, Tim Deegan, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 21/11/14 10:46, Ian Campbell wrote:
> On Fri, 2014-11-21 at 10:24 +0000, Andrew Cooper wrote:
>> On 21/11/14 09:43, Ian Campbell wrote:
>>> I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
>>> here, where does that fit in?
>>>
>> It is referenced several times, although not by its exact name.
> Hence no explicit mention.
>
> It's ambiguous when you refer to "higher level frames" (which I presume
> are the reference you are referring to) because some kernels (perhaps
> only historic ones, I've not been keeping up) keep both an N-level tree
> of their own internally and the toolstack visible frame_list_list
> (sometimes partially overlapping at some level). Is every reference to
> "higher level frames" actually intended to be a reference to
> pfn_to_mfn_frame_list_list or not?

"higher level frames" would be the toolstack-abi-defined first and
second level lists.  The logdirty infrastructure can be used to detect
writes to these frames, and therefore detect structural changes to the p2m.

I would like to hope that every kernel out there keeps this information
correctly up-to-date and updates it in an appropriate order...

>
> It seems like sometimes you are talking at times about tracking the
> kernel's internal structure and not just pfn_to_mfn_frame_list_list and
> I'm not sure why that would be.

I apologise for giving this impression.  It was not intended.

> I'm also not sure why
> pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
> AFAIK the guest is still obliged to keep that up to date regardless of
> the scheme it uses internally for accessing the p2m.

There are two reasons for the virtual linear p2m, the primary one being
to break the hard 512GB limit given the old 3-level table.

A 64bit PV guest cannot possibly use the pfn_to_mfn_frame_list_list if
it needs to actually exceed 512GB of RAM.  Therefore, to signal the use
the virtual linear method, a PV guest explicitly sets
pfn_to_mfn_frame_list_list to INVALID_MFN, and fills in the brand new
adjacent information.

~Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 11:07       ` Andrew Cooper
@ 2014-11-21 11:15         ` Ian Campbell
  2014-11-21 11:20           ` Juergen Gross
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2014-11-21 11:15 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Wei Liu, Ian Jackson, Tim Deegan, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On Fri, 2014-11-21 at 11:07 +0000, Andrew Cooper wrote:
> On 21/11/14 10:46, Ian Campbell wrote:
> > On Fri, 2014-11-21 at 10:24 +0000, Andrew Cooper wrote:
> >> On 21/11/14 09:43, Ian Campbell wrote:
> >>> I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
> >>> here, where does that fit in?
> >>>
> >> It is referenced several times, although not by its exact name.
> > Hence no explicit mention.
> >
> > It's ambiguous when you refer to "higher level frames" (which I presume
> > are the reference you are referring to) because some kernels (perhaps
> > only historic ones, I've not been keeping up) keep both an N-level tree
> > of their own internally and the toolstack visible frame_list_list
> > (sometimes partially overlapping at some level). Is every reference to
> > "higher level frames" actually intended to be a reference to
> > pfn_to_mfn_frame_list_list or not?
> 
> "higher level frames" would be the toolstack-abi-defined first and
> second level lists.  The logdirty infrastructure can be used to detect
> writes to these frames, and therefore detect structural changes to the p2m.
> 
> I would like to hope that every kernel out there keeps this information
> correctly up-to-date and updates it in an appropriate order...
> 
> >
> > It seems like sometimes you are talking at times about tracking the
> > kernel's internal structure and not just pfn_to_mfn_frame_list_list and
> > I'm not sure why that would be.
> 
> I apologise for giving this impression.  It was not intended.

Great, I just wanted to be sure we were all on the same page, since
scrobbling around in the kernel's internal data structures would clearly
be mad...

> 
> > I'm also not sure why
> > pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
> > AFAIK the guest is still obliged to keep that up to date regardless of
> > the scheme it uses internally for accessing the p2m.
> 
> There are two reasons for the virtual linear p2m, the primary one being
> to break the hard 512GB limit given the old 3-level table.
> 
> A 64bit PV guest cannot possibly use the pfn_to_mfn_frame_list_list if
> it needs to actually exceed 512GB of RAM.  Therefore, to signal the use
> the virtual linear method, a PV guest explicitly sets
> pfn_to_mfn_frame_list_list to INVALID_MFN, and fills in the brand new
> adjacent information.

Oh, I hadn't realised this linear p2m stuff involved a guest ABI change.
Have I somehow completely missed the xen.git side of these patches? I
thought I'd only seen linux.git ones (and hence wasn't looking very
closely). 

Ian.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 11:15         ` Ian Campbell
@ 2014-11-21 11:20           ` Juergen Gross
  2014-11-21 11:24             ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: Juergen Gross @ 2014-11-21 11:20 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper
  Cc: Wei Liu, Tim Deegan, Ian Jackson, Xen-devel List, David Vrabel,
	Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 11/21/2014 12:15 PM, Ian Campbell wrote:
> On Fri, 2014-11-21 at 11:07 +0000, Andrew Cooper wrote:
>> On 21/11/14 10:46, Ian Campbell wrote:
>>> On Fri, 2014-11-21 at 10:24 +0000, Andrew Cooper wrote:
>>>> On 21/11/14 09:43, Ian Campbell wrote:
>>>>> I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
>>>>> here, where does that fit in?
>>>>>
>>>> It is referenced several times, although not by its exact name.
>>> Hence no explicit mention.
>>>
>>> It's ambiguous when you refer to "higher level frames" (which I presume
>>> are the reference you are referring to) because some kernels (perhaps
>>> only historic ones, I've not been keeping up) keep both an N-level tree
>>> of their own internally and the toolstack visible frame_list_list
>>> (sometimes partially overlapping at some level). Is every reference to
>>> "higher level frames" actually intended to be a reference to
>>> pfn_to_mfn_frame_list_list or not?
>>
>> "higher level frames" would be the toolstack-abi-defined first and
>> second level lists.  The logdirty infrastructure can be used to detect
>> writes to these frames, and therefore detect structural changes to the p2m.
>>
>> I would like to hope that every kernel out there keeps this information
>> correctly up-to-date and updates it in an appropriate order...
>>
>>>
>>> It seems like sometimes you are talking at times about tracking the
>>> kernel's internal structure and not just pfn_to_mfn_frame_list_list and
>>> I'm not sure why that would be.
>>
>> I apologise for giving this impression.  It was not intended.
>
> Great, I just wanted to be sure we were all on the same page, since
> scrobbling around in the kernel's internal data structures would clearly
> be mad...
>
>>
>>> I'm also not sure why
>>> pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
>>> AFAIK the guest is still obliged to keep that up to date regardless of
>>> the scheme it uses internally for accessing the p2m.
>>
>> There are two reasons for the virtual linear p2m, the primary one being
>> to break the hard 512GB limit given the old 3-level table.
>>
>> A 64bit PV guest cannot possibly use the pfn_to_mfn_frame_list_list if
>> it needs to actually exceed 512GB of RAM.  Therefore, to signal the use
>> the virtual linear method, a PV guest explicitly sets
>> pfn_to_mfn_frame_list_list to INVALID_MFN, and fills in the brand new
>> adjacent information.
>
> Oh, I hadn't realised this linear p2m stuff involved a guest ABI change.
> Have I somehow completely missed the xen.git side of these patches? I
> thought I'd only seen linux.git ones (and hence wasn't looking very
> closely).

V1 of the patches suggesting such a change have been posted a week ago:

http://lists.xen.org/archives/html/xen-devel/2014-11/msg01276.html

The linear p2m stuff is a prerequisite for this change, not the reason
for it.


Juergen

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 11:20           ` Juergen Gross
@ 2014-11-21 11:24             ` Ian Campbell
  2014-11-21 12:15               ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2014-11-21 11:24 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Wei Liu, Andrew Cooper, Tim Deegan, Xen-devel List, David Vrabel,
	Jan Beulich, Shriram Rajagopalan, Hongyang Yang, Ian Jackson

On Fri, 2014-11-21 at 12:20 +0100, Juergen Gross wrote:
> On 11/21/2014 12:15 PM, Ian Campbell wrote:
> > On Fri, 2014-11-21 at 11:07 +0000, Andrew Cooper wrote:
> >> On 21/11/14 10:46, Ian Campbell wrote:
> >>> On Fri, 2014-11-21 at 10:24 +0000, Andrew Cooper wrote:
> >>>> On 21/11/14 09:43, Ian Campbell wrote:
> >>>>> I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
> >>>>> here, where does that fit in?
> >>>>>
> >>>> It is referenced several times, although not by its exact name.
> >>> Hence no explicit mention.
> >>>
> >>> It's ambiguous when you refer to "higher level frames" (which I presume
> >>> are the reference you are referring to) because some kernels (perhaps
> >>> only historic ones, I've not been keeping up) keep both an N-level tree
> >>> of their own internally and the toolstack visible frame_list_list
> >>> (sometimes partially overlapping at some level). Is every reference to
> >>> "higher level frames" actually intended to be a reference to
> >>> pfn_to_mfn_frame_list_list or not?
> >>
> >> "higher level frames" would be the toolstack-abi-defined first and
> >> second level lists.  The logdirty infrastructure can be used to detect
> >> writes to these frames, and therefore detect structural changes to the p2m.
> >>
> >> I would like to hope that every kernel out there keeps this information
> >> correctly up-to-date and updates it in an appropriate order...
> >>
> >>>
> >>> It seems like sometimes you are talking at times about tracking the
> >>> kernel's internal structure and not just pfn_to_mfn_frame_list_list and
> >>> I'm not sure why that would be.
> >>
> >> I apologise for giving this impression.  It was not intended.
> >
> > Great, I just wanted to be sure we were all on the same page, since
> > scrobbling around in the kernel's internal data structures would clearly
> > be mad...
> >
> >>
> >>> I'm also not sure why
> >>> pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
> >>> AFAIK the guest is still obliged to keep that up to date regardless of
> >>> the scheme it uses internally for accessing the p2m.
> >>
> >> There are two reasons for the virtual linear p2m, the primary one being
> >> to break the hard 512GB limit given the old 3-level table.
> >>
> >> A 64bit PV guest cannot possibly use the pfn_to_mfn_frame_list_list if
> >> it needs to actually exceed 512GB of RAM.  Therefore, to signal the use
> >> the virtual linear method, a PV guest explicitly sets
> >> pfn_to_mfn_frame_list_list to INVALID_MFN, and fills in the brand new
> >> adjacent information.
> >
> > Oh, I hadn't realised this linear p2m stuff involved a guest ABI change.
> > Have I somehow completely missed the xen.git side of these patches? I
> > thought I'd only seen linux.git ones (and hence wasn't looking very
> > closely).
> 
> V1 of the patches suggesting such a change have been posted a week ago:
> 
> http://lists.xen.org/archives/html/xen-devel/2014-11/msg01276.html

Ah, thanks. I had indeed ignored that as "just another iteration of the
linux patches", oops!

> The linear p2m stuff is a prerequisite for this change, not the reason
> for it.

Ian.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 11:24             ` Ian Campbell
@ 2014-11-21 12:15               ` Jan Beulich
  2014-11-21 12:20                 ` Jürgen Groß
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2014-11-21 12:15 UTC (permalink / raw)
  To: Ian Campbell, Juergen Gross
  Cc: Tim Deegan, Wei Liu, Andrew Cooper, Ian Jackson, Xen-devel List,
	David Vrabel, ShriramRajagopalan, Hongyang Yang

>>> On 21.11.14 at 12:24, <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-11-21 at 12:20 +0100, Juergen Gross wrote:
>> On 11/21/2014 12:15 PM, Ian Campbell wrote:
>> > On Fri, 2014-11-21 at 11:07 +0000, Andrew Cooper wrote:
>> >> On 21/11/14 10:46, Ian Campbell wrote:
>> >>> On Fri, 2014-11-21 at 10:24 +0000, Andrew Cooper wrote:
>> >>>> On 21/11/14 09:43, Ian Campbell wrote:
>> >>>>> I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
>> >>>>> here, where does that fit in?
>> >>>>>
>> >>>> It is referenced several times, although not by its exact name.
>> >>> Hence no explicit mention.
>> >>>
>> >>> It's ambiguous when you refer to "higher level frames" (which I presume
>> >>> are the reference you are referring to) because some kernels (perhaps
>> >>> only historic ones, I've not been keeping up) keep both an N-level tree
>> >>> of their own internally and the toolstack visible frame_list_list
>> >>> (sometimes partially overlapping at some level). Is every reference to
>> >>> "higher level frames" actually intended to be a reference to
>> >>> pfn_to_mfn_frame_list_list or not?
>> >>
>> >> "higher level frames" would be the toolstack-abi-defined first and
>> >> second level lists.  The logdirty infrastructure can be used to detect
>> >> writes to these frames, and therefore detect structural changes to the p2m.
>> >>
>> >> I would like to hope that every kernel out there keeps this information
>> >> correctly up-to-date and updates it in an appropriate order...
>> >>
>> >>>
>> >>> It seems like sometimes you are talking at times about tracking the
>> >>> kernel's internal structure and not just pfn_to_mfn_frame_list_list and
>> >>> I'm not sure why that would be.
>> >>
>> >> I apologise for giving this impression.  It was not intended.
>> >
>> > Great, I just wanted to be sure we were all on the same page, since
>> > scrobbling around in the kernel's internal data structures would clearly
>> > be mad...
>> >
>> >>
>> >>> I'm also not sure why
>> >>> pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
>> >>> AFAIK the guest is still obliged to keep that up to date regardless of
>> >>> the scheme it uses internally for accessing the p2m.
>> >>
>> >> There are two reasons for the virtual linear p2m, the primary one being
>> >> to break the hard 512GB limit given the old 3-level table.
>> >>
>> >> A 64bit PV guest cannot possibly use the pfn_to_mfn_frame_list_list if
>> >> it needs to actually exceed 512GB of RAM.  Therefore, to signal the use
>> >> the virtual linear method, a PV guest explicitly sets
>> >> pfn_to_mfn_frame_list_list to INVALID_MFN, and fills in the brand new
>> >> adjacent information.
>> >
>> > Oh, I hadn't realised this linear p2m stuff involved a guest ABI change.
>> > Have I somehow completely missed the xen.git side of these patches? I
>> > thought I'd only seen linux.git ones (and hence wasn't looking very
>> > closely).
>> 
>> V1 of the patches suggesting such a change have been posted a week ago:
>> 
>> http://lists.xen.org/archives/html/xen-devel/2014-11/msg01276.html 
> 
> Ah, thanks. I had indeed ignored that as "just another iteration of the
> linux patches", oops!

So did I, not the least because it was sent to David and Konrad
rather than Cc-ing the hypervisor side maintainers (other than
me).

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 12:15               ` Jan Beulich
@ 2014-11-21 12:20                 ` Jürgen Groß
  0 siblings, 0 replies; 21+ messages in thread
From: Jürgen Groß @ 2014-11-21 12:20 UTC (permalink / raw)
  To: Jan Beulich, Ian Campbell
  Cc: Wei Liu, Ian Jackson, Tim Deegan, Xen-devel List, David Vrabel,
	Andrew Cooper, ShriramRajagopalan, Hongyang Yang

On 11/21/2014 01:15 PM, Jan Beulich wrote:
>>>> On 21.11.14 at 12:24, <Ian.Campbell@citrix.com> wrote:
>> On Fri, 2014-11-21 at 12:20 +0100, Juergen Gross wrote:
>>> On 11/21/2014 12:15 PM, Ian Campbell wrote:
>>>> On Fri, 2014-11-21 at 11:07 +0000, Andrew Cooper wrote:
>>>>> On 21/11/14 10:46, Ian Campbell wrote:
>>>>>> On Fri, 2014-11-21 at 10:24 +0000, Andrew Cooper wrote:
>>>>>>> On 21/11/14 09:43, Ian Campbell wrote:
>>>>>>>> I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
>>>>>>>> here, where does that fit in?
>>>>>>>>
>>>>>>> It is referenced several times, although not by its exact name.
>>>>>> Hence no explicit mention.
>>>>>>
>>>>>> It's ambiguous when you refer to "higher level frames" (which I presume
>>>>>> are the reference you are referring to) because some kernels (perhaps
>>>>>> only historic ones, I've not been keeping up) keep both an N-level tree
>>>>>> of their own internally and the toolstack visible frame_list_list
>>>>>> (sometimes partially overlapping at some level). Is every reference to
>>>>>> "higher level frames" actually intended to be a reference to
>>>>>> pfn_to_mfn_frame_list_list or not?
>>>>>
>>>>> "higher level frames" would be the toolstack-abi-defined first and
>>>>> second level lists.  The logdirty infrastructure can be used to detect
>>>>> writes to these frames, and therefore detect structural changes to the p2m.
>>>>>
>>>>> I would like to hope that every kernel out there keeps this information
>>>>> correctly up-to-date and updates it in an appropriate order...
>>>>>
>>>>>>
>>>>>> It seems like sometimes you are talking at times about tracking the
>>>>>> kernel's internal structure and not just pfn_to_mfn_frame_list_list and
>>>>>> I'm not sure why that would be.
>>>>>
>>>>> I apologise for giving this impression.  It was not intended.
>>>>
>>>> Great, I just wanted to be sure we were all on the same page, since
>>>> scrobbling around in the kernel's internal data structures would clearly
>>>> be mad...
>>>>
>>>>>
>>>>>> I'm also not sure why
>>>>>> pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
>>>>>> AFAIK the guest is still obliged to keep that up to date regardless of
>>>>>> the scheme it uses internally for accessing the p2m.
>>>>>
>>>>> There are two reasons for the virtual linear p2m, the primary one being
>>>>> to break the hard 512GB limit given the old 3-level table.
>>>>>
>>>>> A 64bit PV guest cannot possibly use the pfn_to_mfn_frame_list_list if
>>>>> it needs to actually exceed 512GB of RAM.  Therefore, to signal the use
>>>>> the virtual linear method, a PV guest explicitly sets
>>>>> pfn_to_mfn_frame_list_list to INVALID_MFN, and fills in the brand new
>>>>> adjacent information.
>>>>
>>>> Oh, I hadn't realised this linear p2m stuff involved a guest ABI change.
>>>> Have I somehow completely missed the xen.git side of these patches? I
>>>> thought I'd only seen linux.git ones (and hence wasn't looking very
>>>> closely).
>>>
>>> V1 of the patches suggesting such a change have been posted a week ago:
>>>
>>> http://lists.xen.org/archives/html/xen-devel/2014-11/msg01276.html
>>
>> Ah, thanks. I had indeed ignored that as "just another iteration of the
>> linux patches", oops!
>
> So did I, not the least because it was sent to David and Konrad
> rather than Cc-ing the hypervisor side maintainers (other than
> me).

Oops, sorry.

Will do better for V2.


Juergen

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 10:54   ` Andrew Cooper
@ 2014-11-27 15:00     ` Tim Deegan
  2014-11-27 15:16       ` Andrew Cooper
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Deegan @ 2014-11-27 15:00 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Ian Jackson, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

At 10:54 +0000 on 21 Nov (1416563695), Andrew Cooper wrote:
> On 21/11/14 10:43, Jan Beulich wrote:
> >>>> On 20.11.14 at 19:28, <andrew.cooper3@citrix.com> wrote:
> >> Should the guest change the p2m structure during live migration, the
> >> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
> >> resulting in bogus cross-referencing.  Should the guest change an entry
> >> in the p2m, the p2m frame itself will be resent as it would be marked as
> >> dirty in the logdirty bitmap, but the target pfn will remain unsent and
> >> probably stale on the receiving side.
> > MMU_MACHPHYS_UPDATE processing marks the page being changed
> > as dirty. Perhaps guest_physmap_{add,remove}_page() (or certain
> > callers thereof) should do so too?
> >
> > Jan
> >
> 
> This is certainly needed to fix HVM ballooning and live migration
> issues

Agreed.  We should be marking HVM frames dirty when they have any p2m
update that changes the mapping.  Maybe in paging_write_p2m_entry() or
the various implementation-specific versions.

>, although now you point it out, it applies just as much to PV
> guests as HVM guests.
> 
> I believe this might allow the toolstack to avoid keeping a second copy
> of the p2m.

I don't think so. :(  Because the toolstack is reading the guest's own
p2m, there is still a race where:

 - guest calls physmap_add_page, as part of which Xen marks the pfn dirty;
 - toolstack reads + cleans the dirty bitmap;
 - toolstack reads the guest p2m and DTRT for this pfn;
 - guest updates its p2m with the result of the physmap_add_page call.

After that, if the guest doesn't dirty that pfn again it won't be
fixed up.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-21 10:32   ` Andrew Cooper
@ 2014-11-27 15:14     ` Tim Deegan
  0 siblings, 0 replies; 21+ messages in thread
From: Tim Deegan @ 2014-11-27 15:14 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Ian Jackson, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

At 10:32 +0000 on 21 Nov (1416562351), Andrew Cooper wrote:
> On 21/11/14 05:41, Juergen Gross wrote:
> > On 11/20/2014 07:28 PM, Andrew Cooper wrote:
> >> Hello,
> >>
> >> Tim, David and I were discussing this over lunch.  This email is a
> >> (hopefully accurate) account of our findings, and potential solutions.
> >> (If I have messed up, please shout.)
> >>
> >> Currently, correct live migration of PV domains relies on the toolstack
> >> (which has a live mapping of the guests p2m) not observing stale values
> >> when the guest updates its p2m, and the race condition between a p2m
> >> update and an m2p update.  Realistically, this means no updates to the
> >> p2m at all, due to several potential race conditions.  Should any race
> >> conditions happen (e.g. ballooning while live migrating), the effects
> >> could be anything from an aborted migration to VM memory corruption.
> >>
> >> It should be noted that migrationv2 does not fix any of this.  It alters
> >> the way in which some race conditions might be observed.  During
> >> development of migrationv2, there was an explicit non-requirement of
> >> fixing the existing Ballooning+LiveMigration issues we were aware,
> >> although at the time, we were not aware of this specific set of issues.
> >> Our goal was to simply make migrationv2 work in the same circumstances
> >> as previously, but with a bitness-agnostic wire format and
> >> forward-extensible protocol.
> >>
> >>
> >> As far as these issues are concerned, there are two distinct p2m
> >> modifications which we care about:
> >> 1) p2m structure changes (rearranging the layout of the p2m)
> >> 2) p2m content changes (altering entries in the p2m)
> >>
> >> There is no possible way for the toolstack to prevent a domain from
> >> altering its p2m.  At the moment, ballooning typically only occurs when
> >> requested by the toolstack, but the underlying operations
> >> (increase/decrease_reservation, mem_exchange, etc) can be used by the
> >> guest at any point.  This includes Wei's guest memory fragmentation
> >> changes.  Changes to the content of the p2m also occur for grant map and
> >> unmap operations.
> >>
> >>
> >> Currently in PV guests, the p2m is implemented using a 3-level tree,
> >> with its root in the guests shared_info page.  It provides a hard VM
> >> memory limit of 4TB for 32bit PV guests (which is far higher than the
> >> 128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.
> >>
> >> Juergen has a proposed new p2m interface using a virtual linear
> >> mapping.  This is conceptually similar to the previous implementation
> >> (which is fine from the toolstacks point of view), but far less
> >> complicated from the guests point of view, and removes the memory limits
> >> imposed by the p2m structure.
> >>
> >> The new virtual linear mapping suffers from the same interaction issues
> >> as the old 3-level tree did, but the introduction of the new interface
> >> affords us an opportunity to make all API modifications at once to
> >> reduce churn.
> >>
> >>
> >> During live migration, the toolstack maps the guests p2m into a linear
> >> mapping in the toolstacks virtual address space.  This is done once at
> >> the start of migration, and never subsequently altered.  During live
> >> migration, the p2m is cross-verified with the m2p, and frames are sent
> >> using pfns as a reference, as they will be located in different frames
> >> on the receiving side.
> >>
> >> Should the guest change the p2m structure during live migration, the
> >> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
> >> resulting in bogus cross-referencing.  Should the guest change an entry
> >> in the p2m, the p2m frame itself will be resent as it would be marked as
> >> dirty in the logdirty bitmap, but the target pfn will remain unsent and
> >> probably stale on the receiving side.
> >>
> >>
> >> Another factor which needs to be taken into account is Remus/COLO, which
> >> run the domains under live migration conditions for the duration of
> >> their lifetime.
> >>
> >> During the live part of migration, the toolstack already has to be able
> >> to tolerate failures to normalise the pagetables, which result as a
> >> consequent of the pagetables being in active.  These failures are fatal
> >> on the final iteration after the guest has been paused, but the same
> >> logic could be extended to p2m/m2p issues, if needed.
> >>
> >>
> >> There are several potential solutions to these problems.
> >>
> >> 1) Freeze the guests p2m during live migrate
> >>
> >> This is the simplest sounding option, but is quite problematic from the
> >> point of view of the guest.  It is essentially a shared spinlock between
> >> the toolstack and the guest kernel.  It would prevent any grant
> >> map/unmap operations from occurring, and might interact badly with
> >> certain p2m updated in the guest which would previously be expected to
> >> unconditionally succeed.
> >>
> >> Pros) (Can't think of any)
> >> Cons) Not easy to implement (even conceptually), requires invasive guest
> >> changes, will cripple Remus/COLO
> >>
> >>
> >> 2) Deep p2m dirty tracking
> >>
> >> In the case that a p2m frame is discovered dirty in the logdirty bitmap,
> >> we can be certain that a write has occurred to it, and in the common
> >> case, means that the mapping has changed.  The toolstack could maintain
> >> a non-live copy of the p2m which is updated as new frames are sent.
> >> When a dirty p2m frame is found, the live and non-live copies can be
> >> consulted to find which pfn mappings have changed, and locally mark all
> >> the altered pfns for retransmit.
> >>
> >> Pros) No guest changes required
> >> Cons) Toolstack needs to keep an additional copy of the guests p2m on
> >> the sending side
> >>
> >> 3) Eagerly check for p2m structure changes.
> >>
> >> p2m structure changes are rare after boot, but not impossible.  Each
> >> iteration of live migration, the toolstack can check for dirty
> >> higher-level p2m frames in the dirty bitmap.  In the case that a
> >> structure update occurs, the toolstack can use information it already
> >> has to calculate a subset of pfns affected by the update, and mark them
> >> for resending.  (This can currently be done to the frame granularity
> >> given the p2m frame lit, but in combination with 2), could result in
> >> fewer pfns needing resending.)
> >>
> >> Pros) No guest changes required.
> >> Cons) Moderately high toolstack overhead,  Possibility to resend far
> >> more pfns than strictly required.
> >>
> >> 4) Request p2m structure change updates from the guest
> >>
> >> The guest could provide a "p2m generation count" to allow the toolstack
> >> to evaluate whether the structure had changed.  This would allow the
> >> live part of migration to periodically re-evaluate whether it should
> >> remap the p2m to avoid stale mappings.
> >>
> >> Pros) Easy to implement alongside the virtual linear mapping support.
> >> Easy for toolstack and guest
> >> Cons) Only works with new virtual linear guests.
> >>
> >>
> >> Proposed solution:  A combination of 2, 3 and 4.
> >>
> >> For legacy 3-level p2m guests, the toolstack can detect p2m structure
> >> updates by tracking the p2m top and mid levels in the logdirty bitmap,
> >> and invalidating the modified subset of pfns.  It has to eagerly check
> >> the p2m frame list list mfn entry in the shared info to see whether the
> >> guest has swapped onto a completely new p2m.
> >>
> >> For a virtual linear map, the intermediate levels are not available to
> >> track, but we can require that the guest increment p2m generation clock
> >> in the shared info.  When the structure changes, the toolstack can remap
> >> the p2m and calculate the altered subset of pfns, and mark for resend.
> >>
> >> The toolstack must also track changes in the p2m itself, and compare to
> >> a local copy showing the mapping at the time at which the pfn was last
> >> sent.  This can be used to work out which p2m mappings have changed, and
> >> also be used to confirm whether the pfns on the receiving side are stale
> >> or not.
> >>
> >> I believe this covered all cases and race conditions.  In the case that
> >> the p2m is updated before the m2p, the p2m frame will be marked dirty in
> >> the bitmap, and discoverable on the next iteration.  At that point, if
> >> the p2m and m2p are inconsistent, the pfn will be deferred until the
> >> final iteration.  If not, the frame is sent and everything is all ok.
> >> In the case that the p2m is updated after the m2p, the p2m/m2p will be
> >> consistent when the dirty bitmap is acted on.
> >>
> >>
> >> Thoughts? (for anyone who has made it this far :)  I think I have
> >> covered everything.)
> >
> > Sounds okay.
> >
> > Two remarks regarding the virtual linear map:
> > - The intermediate levels could be tracked, as they are memory pages as
> >   well. It is not practical to do so, however, as there might be lots of
> >   changes not related to the p2m.
> 
> The intermediate levels are just pagetables, are they not? Or is there a
> separate tracking structure?

They are just pagetables, but the toolstack needs to filter out
unrelated changes (esp. in the higher-level pagetables).  That will
require keeping copies of the pagetables that map the p2m, but that
should be OK -- much cheaper than keeping copies of the p2m contents!

> > - The generation count is being checked by the tools in a lazy manner.
> >   This will require an increment of the count by the guest only after
> >   changing the structure of the p2m map, I think.
> 
> On further consideration, I think this needs to be a lockless
> producer/consumer interface, with increment once at start, and once
> again at the end.  The toolstack needs some ability to confirm that it
> has got a consistent mapping of the virtual p2m, as it cant practically
> detect updates via the logdirty bitmap.

I think it probably _can_ just use the log-dirty bitmap to track
changes, but having a double increment would save the tools from
reading garbage even if they have a live mapping of the guest p2m
(e.g. by making every p2m read a read-epoch/read-p2m/read-epoch triple).

OTOH if we want to support older guests we'll have to handle reading
garbage anyway, so maybe we should save ourselves all this mechanism
in the linear map case.

> It also occurs to me that the toolstack code needs to gain some use of
> ACCESS_ONCE() when reading the live p2m.

Good idea.  And document (if not already done) that the guest should
use atomic writes. :)

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-27 15:00     ` Tim Deegan
@ 2014-11-27 15:16       ` Andrew Cooper
  2014-11-27 15:28         ` Tim Deegan
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2014-11-27 15:16 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Ian Jackson, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 27/11/14 15:00, Tim Deegan wrote:
> At 10:54 +0000 on 21 Nov (1416563695), Andrew Cooper wrote:
>> On 21/11/14 10:43, Jan Beulich wrote:
>>>>>> On 20.11.14 at 19:28, <andrew.cooper3@citrix.com> wrote:
>>>> Should the guest change the p2m structure during live migration, the
>>>> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
>>>> resulting in bogus cross-referencing.  Should the guest change an entry
>>>> in the p2m, the p2m frame itself will be resent as it would be marked as
>>>> dirty in the logdirty bitmap, but the target pfn will remain unsent and
>>>> probably stale on the receiving side.
>>> MMU_MACHPHYS_UPDATE processing marks the page being changed
>>> as dirty. Perhaps guest_physmap_{add,remove}_page() (or certain
>>> callers thereof) should do so too?
>>>
>>> Jan
>>>
>> This is certainly needed to fix HVM ballooning and live migration
>> issues
> Agreed.  We should be marking HVM frames dirty when they have any p2m
> update that changes the mapping.  Maybe in paging_write_p2m_entry() or
> the various implementation-specific versions.
>
>> , although now you point it out, it applies just as much to PV
>> guests as HVM guests.
>>
>> I believe this might allow the toolstack to avoid keeping a second copy
>> of the p2m.
> I don't think so. :(  Because the toolstack is reading the guest's own
> p2m, there is still a race where:
>
>  - guest calls physmap_add_page, as part of which Xen marks the pfn dirty;
>  - toolstack reads + cleans the dirty bitmap;
>  - toolstack reads the guest p2m and DTRT for this pfn;
>  - guest updates its p2m with the result of the physmap_add_page call.
>
> After that, if the guest doesn't dirty that pfn again it won't be
> fixed up.

It will (I think).

In the above scenario, step 3 will (certainly in v2) fail the p2m/m2p
consistency check.  This error is currently fatal, but need to be made
nonfatal during the live part, and mark the pfn as deferred.

The deferred scheme is currently used to track pagetables which are
found to be in an inconsistent state, and causes them to be reconsidered
after pause.  As implemented in v2, the deferred bitmap is or'd into the
final dirty bitmap from Xen, before the last sweep through dirty pages.

~Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-27 15:16       ` Andrew Cooper
@ 2014-11-27 15:28         ` Tim Deegan
  2014-11-27 15:41           ` Andrew Cooper
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Deegan @ 2014-11-27 15:28 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Ian Jackson, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

At 15:16 +0000 on 27 Nov (1417097812), Andrew Cooper wrote:
> On 27/11/14 15:00, Tim Deegan wrote:
> > At 10:54 +0000 on 21 Nov (1416563695), Andrew Cooper wrote:
> >> On 21/11/14 10:43, Jan Beulich wrote:
> >>>>>> On 20.11.14 at 19:28, <andrew.cooper3@citrix.com> wrote:
> >>>> Should the guest change the p2m structure during live migration, the
> >>>> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
> >>>> resulting in bogus cross-referencing.  Should the guest change an entry
> >>>> in the p2m, the p2m frame itself will be resent as it would be marked as
> >>>> dirty in the logdirty bitmap, but the target pfn will remain unsent and
> >>>> probably stale on the receiving side.
> >>> MMU_MACHPHYS_UPDATE processing marks the page being changed
> >>> as dirty. Perhaps guest_physmap_{add,remove}_page() (or certain
> >>> callers thereof) should do so too?
> >>>
> >>> Jan
> >>>
> >> This is certainly needed to fix HVM ballooning and live migration
> >> issues
> > Agreed.  We should be marking HVM frames dirty when they have any p2m
> > update that changes the mapping.  Maybe in paging_write_p2m_entry() or
> > the various implementation-specific versions.
> >
> >> , although now you point it out, it applies just as much to PV
> >> guests as HVM guests.
> >>
> >> I believe this might allow the toolstack to avoid keeping a second copy
> >> of the p2m.
> > I don't think so. :(  Because the toolstack is reading the guest's own
> > p2m, there is still a race where:
> >
> >  - guest calls physmap_add_page, as part of which Xen marks the pfn dirty;
> >  - toolstack reads + cleans the dirty bitmap;
> >  - toolstack reads the guest p2m and DTRT for this pfn;
> >  - guest updates its p2m with the result of the physmap_add_page call.
> >
> > After that, if the guest doesn't dirty that pfn again it won't be
> > fixed up.
> 
> It will (I think).
> 
> In the above scenario, step 3 will (certainly in v2) fail the p2m/m2p
> consistency check.  This error is currently fatal, but need to be made
> nonfatal during the live part, and mark the pfn as deferred.

That doesn't work if the guest is mapping a new entry: the guest p2m
will show the pfn as unallocated, which is fine.

There's a similar race if the guest wants to move a frame from one pfn
to another, unless you mandate that the guest must do the m2p update
after all p2m updates.

Tim.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-27 15:28         ` Tim Deegan
@ 2014-11-27 15:41           ` Andrew Cooper
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Cooper @ 2014-11-27 15:41 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Ian Jackson, Xen-devel List,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 27/11/14 15:28, Tim Deegan wrote:
> At 15:16 +0000 on 27 Nov (1417097812), Andrew Cooper wrote:
>> On 27/11/14 15:00, Tim Deegan wrote:
>>> At 10:54 +0000 on 21 Nov (1416563695), Andrew Cooper wrote:
>>>> On 21/11/14 10:43, Jan Beulich wrote:
>>>>>>>> On 20.11.14 at 19:28, <andrew.cooper3@citrix.com> wrote:
>>>>>> Should the guest change the p2m structure during live migration, the
>>>>>> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
>>>>>> resulting in bogus cross-referencing.  Should the guest change an entry
>>>>>> in the p2m, the p2m frame itself will be resent as it would be marked as
>>>>>> dirty in the logdirty bitmap, but the target pfn will remain unsent and
>>>>>> probably stale on the receiving side.
>>>>> MMU_MACHPHYS_UPDATE processing marks the page being changed
>>>>> as dirty. Perhaps guest_physmap_{add,remove}_page() (or certain
>>>>> callers thereof) should do so too?
>>>>>
>>>>> Jan
>>>>>
>>>> This is certainly needed to fix HVM ballooning and live migration
>>>> issues
>>> Agreed.  We should be marking HVM frames dirty when they have any p2m
>>> update that changes the mapping.  Maybe in paging_write_p2m_entry() or
>>> the various implementation-specific versions.
>>>
>>>> , although now you point it out, it applies just as much to PV
>>>> guests as HVM guests.
>>>>
>>>> I believe this might allow the toolstack to avoid keeping a second copy
>>>> of the p2m.
>>> I don't think so. :(  Because the toolstack is reading the guest's own
>>> p2m, there is still a race where:
>>>
>>>  - guest calls physmap_add_page, as part of which Xen marks the pfn dirty;
>>>  - toolstack reads + cleans the dirty bitmap;
>>>  - toolstack reads the guest p2m and DTRT for this pfn;
>>>  - guest updates its p2m with the result of the physmap_add_page call.
>>>
>>> After that, if the guest doesn't dirty that pfn again it won't be
>>> fixed up.
>> It will (I think).
>>
>> In the above scenario, step 3 will (certainly in v2) fail the p2m/m2p
>> consistency check.  This error is currently fatal, but need to be made
>> nonfatal during the live part, and mark the pfn as deferred.
> That doesn't work if the guest is mapping a new entry: the guest p2m
> will show the pfn as unallocated, which is fine.

Hmm - so it will.

>
> There's a similar race if the guest wants to move a frame from one pfn
> to another, unless you mandate that the guest must do the m2p update
> after all p2m updates.

We certainly can't retroactively enforce that, and thusfar are in a
position to provide this safety to older PV guests.

It looks like we absolutely do need a second copy of the guests p2m. 
While unfortunate, I suspect admins will begrudgingly accept extra
memory usage in preference to potential VM memory corruption on migrate.

~Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-11-20 18:28 Buggy interaction of live migration and p2m updates Andrew Cooper
                   ` (2 preceding siblings ...)
  2014-11-21 10:43 ` Jan Beulich
@ 2014-12-01 14:38 ` David Vrabel
  2014-12-01 16:58   ` Andrew Cooper
  3 siblings, 1 reply; 21+ messages in thread
From: David Vrabel @ 2014-12-01 14:38 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel List
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson,
	David Vrabel, Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 20/11/14 18:28, Andrew Cooper wrote:
> 
> 1) Freeze the guests p2m during live migrate
> 2) Deep p2m dirty tracking
> 3) Eagerly check for p2m structure changes.
> 4) Request p2m structure change updates from the guest

> Proposed solution:  A combination of 2, 3 and 4.

Option 5: don't change anything.  PV guests and toolstacks are
well-behaved and ensure that p2m updates do not occur during domain save.

We have never supported migrating non-cooperative PV guests.

I'd be more interested fixing this for HVM guests.

David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Buggy interaction of live migration and p2m updates
  2014-12-01 14:38 ` David Vrabel
@ 2014-12-01 16:58   ` Andrew Cooper
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Cooper @ 2014-12-01 16:58 UTC (permalink / raw)
  To: David Vrabel, Xen-devel List
  Cc: Juergen Gross, Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson,
	Jan Beulich, Shriram Rajagopalan, Hongyang Yang

On 01/12/14 14:38, David Vrabel wrote:
> On 20/11/14 18:28, Andrew Cooper wrote:
>> 1) Freeze the guests p2m during live migrate
>> 2) Deep p2m dirty tracking
>> 3) Eagerly check for p2m structure changes.
>> 4) Request p2m structure change updates from the guest
>> Proposed solution:  A combination of 2, 3 and 4.
> Option 5: don't change anything.  PV guests and toolstacks are
> well-behaved and ensure that p2m updates do not occur during domain save.

We are currently relying on this is true, knowing full well that it
isn't in practice.  Guest balloon drivers will happily balloon if told
to do so by the toolstack at the same point that the domain is going
down for migrate.

Even at the moment, if the toolstack observes the intermediate actions
of a grant being unmapped, the migration will either be aborted, or VM
corruption will occur on the far side.

>
> We have never supported migrating non-cooperative PV guests.

We will never be in the position to migrate a non-cooperative PV domain,
as the final action it needs to do is stuff an MFN in %rdx, to be
replaced on the far side.

>
> I'd be more interested fixing this for HVM guests.

Agreed, and the HVM case is easier to fix.  I am fairly certain the HVM
side can be fixed with a few extra paging_mark_dirty() for p2m updates
in Xen alone. The PV case is however a timebomb waiting to happen.

~Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-12-01 16:58 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-20 18:28 Buggy interaction of live migration and p2m updates Andrew Cooper
2014-11-21  5:41 ` Juergen Gross
2014-11-21 10:32   ` Andrew Cooper
2014-11-27 15:14     ` Tim Deegan
2014-11-21  9:43 ` Ian Campbell
2014-11-21 10:24   ` Andrew Cooper
2014-11-21 10:46     ` Ian Campbell
2014-11-21 11:07       ` Andrew Cooper
2014-11-21 11:15         ` Ian Campbell
2014-11-21 11:20           ` Juergen Gross
2014-11-21 11:24             ` Ian Campbell
2014-11-21 12:15               ` Jan Beulich
2014-11-21 12:20                 ` Jürgen Groß
2014-11-21 10:43 ` Jan Beulich
2014-11-21 10:54   ` Andrew Cooper
2014-11-27 15:00     ` Tim Deegan
2014-11-27 15:16       ` Andrew Cooper
2014-11-27 15:28         ` Tim Deegan
2014-11-27 15:41           ` Andrew Cooper
2014-12-01 14:38 ` David Vrabel
2014-12-01 16:58   ` Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.