Migration memory corruption - PV backends need to quiesce

All of lore.kernel.org
 help / color / mirror / Atom feed

* Migration memory corruption - PV backends need to quiesce
@ 2014-06-27 16:51 Andrew Cooper
  2014-06-27 17:28 ` David Vrabel
  2014-06-30  8:38 ` Ian Campbell
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Cooper @ 2014-06-27 16:51 UTC (permalink / raw)
  To: Xen-devel List
  Cc: Ian Campbell, Tim Deegan, Ian Jackson, Paul Durrant, David Vrabel,
	Jan Beulich

Hello,

After a long time fixing my own memory corruption bugs with migration
v2, I have finally tracked down (what I really really hope is) the last
of the corruption.

There appears to be a systematic problem affecting all PV drivers,
whereby a non-quiescent backend can cause memory corruption in the VM.

Active grant mapped pages are only reflected in the dirty bitmap after
the grant has been unmapped, as mapping the ring read-only would be
catastrophic to performance, and remapping as read-only when logdirty is
enabled is (as far as I understand) impossible, as Xen doesn't track the
PTEs pointing at granted frames.

PV backend drivers hold their mappings of the rings (and persistently
granted frames) open until the domain is destroyed, which is after the
memory image has been sent.  Therefore, any requests which are processed
after the migration code sending the ring frame on its first pass will
not be reflected in the resumed domain, as this frame will never be
marked as dirty in Xen.

Furthermore, as the migration code uses memcpy() on the frames, it is
possible that a backed update intersects with the copy, and a corrupt
descriptor appears on the resumed side.

In addition, after the domain has been paused, the backend might still
process requests.  The migration code excepts the guest be completely
quiesced after it has been suspended, so will only check the dirty
bitmap once.  Any requests which get processed and completed might still
be missed by the migration code.

>From a heavily instrumented Xen and migration code, I am fairly sure I
have confirmed that all pages corrupted on migration are a result of
still-active grant maps, grant copies which complete after domain
suspend, or the xenstore ring which xenstored has a magic mapping of,
and will never be reflected in the dirty bitmap.

Overall, it would appear that there needs to be a hook for all PV
drivers to force quiescence.  In particular, a backend must guarantee to
unmap all active grant maps (so the frames get properly reflected in the
dirty bitmap), and never process subsequent requests (so no new frames
appear dirty in the bitmap after the guest has been paused).

Thoughts/comments?

~Andrew

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-27 16:51 Migration memory corruption - PV backends need to quiesce Andrew Cooper
@ 2014-06-27 17:28 ` David Vrabel
  2014-06-27 18:15   ` Tim Deegan
  2014-06-30  8:38 ` Ian Campbell
  1 sibling, 1 reply; 24+ messages in thread
From: David Vrabel @ 2014-06-27 17:28 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel List
  Cc: Ian Campbell, Tim Deegan, Ian Jackson, Paul Durrant, Jan Beulich

On 27/06/14 17:51, Andrew Cooper wrote:
> 
> Overall, it would appear that there needs to be a hook for all PV
> drivers to force quiescence.  In particular, a backend must guarantee to
> unmap all active grant maps (so the frames get properly reflected in the
> dirty bitmap), and never process subsequent requests (so no new frames
> appear dirty in the bitmap after the guest has been paused).

I think this would be much too expensive for snapshots and things like
remus.  Waiting for all outstanding I/O could take seconds.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-27 17:28 ` David Vrabel
@ 2014-06-27 18:15   ` Tim Deegan
  2014-06-27 18:37     ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Tim Deegan @ 2014-06-27 18:15 UTC (permalink / raw)
  To: David Vrabel
  Cc: Ian Campbell, Andrew Cooper, Ian Jackson, Xen-devel List,
	Paul Durrant, Jan Beulich

At 18:28 +0100 on 27 Jun (1403890088), David Vrabel wrote:
> On 27/06/14 17:51, Andrew Cooper wrote:
> > 
> > Overall, it would appear that there needs to be a hook for all PV
> > drivers to force quiescence.  In particular, a backend must guarantee to
> > unmap all active grant maps (so the frames get properly reflected in the
> > dirty bitmap), and never process subsequent requests (so no new frames
> > appear dirty in the bitmap after the guest has been paused).
> 
> I think this would be much too expensive for snapshots and things like
> remus.  Waiting for all outstanding I/O could take seconds.

The other option we talked about yesterday was a flag to the log-dirty
operation that reports all grant-mapped frames as dirty.  Then the
tools would add such frames to the final pass.  That could take a long
time too, of course.

I'm not sure how you would synchronize the final pass with backends
that were doing grant copy operations -- you could exclude copies for
the duration, but I'm not sure what that would look like for the
backend.

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-27 18:15   ` Tim Deegan
@ 2014-06-27 18:37     ` Andrew Cooper
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Cooper @ 2014-06-27 18:37 UTC (permalink / raw)
  To: Tim Deegan, David Vrabel
  Cc: Ian Campbell, Ian Jackson, Xen-devel List, Paul Durrant,
	Jan Beulich

On 27/06/14 19:15, Tim Deegan wrote:
> At 18:28 +0100 on 27 Jun (1403890088), David Vrabel wrote:
>> On 27/06/14 17:51, Andrew Cooper wrote:
>>> Overall, it would appear that there needs to be a hook for all PV
>>> drivers to force quiescence.  In particular, a backend must guarantee to
>>> unmap all active grant maps (so the frames get properly reflected in the
>>> dirty bitmap), and never process subsequent requests (so no new frames
>>> appear dirty in the bitmap after the guest has been paused).
>> I think this would be much too expensive for snapshots and things like
>> remus.  Waiting for all outstanding I/O could take seconds.
> The other option we talked about yesterday was a flag to the log-dirty
> operation that reports all grant-mapped frames as dirty.  Then the
> tools would add such frames to the final pass.  That could take a long
> time too, of course.
>
> I'm not sure how you would synchronize the final pass with backends
> that were doing grant copy operations -- you could exclude copies for
> the duration, but I'm not sure what that would look like for the
> backend.
>
> Tim.

Hmm - I have a crazy idea.

As identified by David, it is impractical to wait for backends to
complete any outstanding requests and unmap the grants, as this could
take seconds.

However, what the backend can do very quickly is guarantee that it will
never start processing any further requests, and never mark
subsequently-completed requests as complete in the ring.

This means that a the backend will not submit any new grant copy
operations, or regular copies to/from persistent grants, and even if a
hardware device has a dma mapping of an active grant, the request will
not be marked as completed in the ring. Even if the eventual dma'd pages
end up dirty, the frontend will replay the uncompleted requests in the
ring and be mostly fine[1].

Combined with a XEN_DOMCTL_SHADOW_OP_PEEK_INCLUDING_ACTIVE_GRANTS (name
subject to improvement), the migration code can guarantee that there
will be no corruption of the ring, and no relevant corruption of guest
memory.

I *believe* this covers all the cases, and doesn't depend on waiting for
the backends to fully complete all outstanding requests.

~Andrew

[1] The caveat is a pending read followed by a write of the same block
which, once replayed, might be out-of-order if the write did take effect
on the source side.  Any frontends which care about this must wait for
all write requests to complete before entering the suspend state.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-27 16:51 Migration memory corruption - PV backends need to quiesce Andrew Cooper
  2014-06-27 17:28 ` David Vrabel
@ 2014-06-30  8:38 ` Ian Campbell
  2014-06-30  9:02   ` Andrew Cooper
  2014-06-30  9:47   ` David Vrabel
  1 sibling, 2 replies; 24+ messages in thread
From: Ian Campbell @ 2014-06-30  8:38 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tim Deegan, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
> Hello,
> 
> After a long time fixing my own memory corruption bugs with migration
> v2, I have finally tracked down (what I really really hope is) the last
> of the corruption.
> 
> 
> There appears to be a systematic problem affecting all PV drivers,
> whereby a non-quiescent backend can cause memory corruption in the VM.
> 
> Active grant mapped pages are only reflected in the dirty bitmap after
> the grant has been unmapped, as mapping the ring read-only would be
> catastrophic to performance, and remapping as read-only when logdirty is
> enabled is (as far as I understand) impossible, as Xen doesn't track the
> PTEs pointing at granted frames.
> 
> PV backend drivers hold their mappings of the rings (and persistently
> granted frames) open until the domain is destroyed, which is after the
> memory image has been sent.  Therefore, any requests which are processed
> after the migration code sending the ring frame on its first pass will
> not be reflected in the resumed domain, as this frame will never be
> marked as dirty in Xen.
> 
> Furthermore, as the migration code uses memcpy() on the frames, it is
> possible that a backed update intersects with the copy, and a corrupt
> descriptor appears on the resumed side.
> 
> In addition, after the domain has been paused, the backend might still
> process requests.  The migration code excepts the guest be completely
> quiesced after it has been suspended, so will only check the dirty
> bitmap once.  Any requests which get processed and completed might still
> be missed by the migration code.
> 
> From a heavily instrumented Xen and migration code, I am fairly sure I
> have confirmed that all pages corrupted on migration are a result of
> still-active grant maps, grant copies which complete after domain
> suspend, or the xenstore ring which xenstored has a magic mapping of,
> and will never be reflected in the dirty bitmap.
> 
> 
> Overall, it would appear that there needs to be a hook for all PV
> drivers to force quiescence.  In particular, a backend must guarantee to
> unmap all active grant maps (so the frames get properly reflected in the
> dirty bitmap), and never process subsequent requests (so no new frames
> appear dirty in the bitmap after the guest has been paused).
> 
> Thoughts/comments?

I thought PV drivers were already (supposed to be) handling this in the
frontend.

For reasons of checkpoint performance I think Linux's net and blkfront
are handling this on resume rather than on suspend by tearing down on
resume and then requeueing any outstanding I/O after they reattach to
the new backend. In the blkfront case this is explicit, whereas IIRC
netfront just discards any active requests and relies on L3+
retransmition to get the job done. (see netfront_resume and
blkfront_resume/blkif_recover).

Part of the tear down and reconnect should involve invalidating any
inflight descriptors, whether or not they were partially completed or
have corrupted replies in them etc. This ought to be happening before
the new backend sees the ring at all.

Can you give an example of an instance of the corruption which you've
seen?

Do you have a theory why this wasn't seen with the migration v1 code?

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  8:38 ` Ian Campbell
@ 2014-06-30  9:02   ` Andrew Cooper
  2014-06-30  9:21     ` Ian Campbell
  2014-06-30  9:47   ` David Vrabel
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2014-06-30  9:02 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On 30/06/14 09:38, Ian Campbell wrote:
> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
>> Hello,
>>
>> After a long time fixing my own memory corruption bugs with migration
>> v2, I have finally tracked down (what I really really hope is) the last
>> of the corruption.
>>
>>
>> There appears to be a systematic problem affecting all PV drivers,
>> whereby a non-quiescent backend can cause memory corruption in the VM.
>>
>> Active grant mapped pages are only reflected in the dirty bitmap after
>> the grant has been unmapped, as mapping the ring read-only would be
>> catastrophic to performance, and remapping as read-only when logdirty is
>> enabled is (as far as I understand) impossible, as Xen doesn't track the
>> PTEs pointing at granted frames.
>>
>> PV backend drivers hold their mappings of the rings (and persistently
>> granted frames) open until the domain is destroyed, which is after the
>> memory image has been sent.  Therefore, any requests which are processed
>> after the migration code sending the ring frame on its first pass will
>> not be reflected in the resumed domain, as this frame will never be
>> marked as dirty in Xen.
>>
>> Furthermore, as the migration code uses memcpy() on the frames, it is
>> possible that a backed update intersects with the copy, and a corrupt
>> descriptor appears on the resumed side.
>>
>> In addition, after the domain has been paused, the backend might still
>> process requests.  The migration code excepts the guest be completely
>> quiesced after it has been suspended, so will only check the dirty
>> bitmap once.  Any requests which get processed and completed might still
>> be missed by the migration code.
>>
>> From a heavily instrumented Xen and migration code, I am fairly sure I
>> have confirmed that all pages corrupted on migration are a result of
>> still-active grant maps, grant copies which complete after domain
>> suspend, or the xenstore ring which xenstored has a magic mapping of,
>> and will never be reflected in the dirty bitmap.
>>
>>
>> Overall, it would appear that there needs to be a hook for all PV
>> drivers to force quiescence.  In particular, a backend must guarantee to
>> unmap all active grant maps (so the frames get properly reflected in the
>> dirty bitmap), and never process subsequent requests (so no new frames
>> appear dirty in the bitmap after the guest has been paused).
>>
>> Thoughts/comments?
> I thought PV drivers were already (supposed to be) handling this in the
> frontend.
>
> For reasons of checkpoint performance I think Linux's net and blkfront
> are handling this on resume rather than on suspend by tearing down on
> resume and then requeueing any outstanding I/O after they reattach to
> the new backend. In the blkfront case this is explicit, whereas IIRC
> netfront just discards any active requests and relies on L3+
> retransmition to get the job done. (see netfront_resume and
> blkfront_resume/blkif_recover).
>
> Part of the tear down and reconnect should involve invalidating any
> inflight descriptors, whether or not they were partially completed or
> have corrupted replies in them etc. This ought to be happening before
> the new backend sees the ring at all.
>
> Can you give an example of an instance of the corruption which you've
> seen?

The intersection of requests processed by the backends and memcpy() in
the migration code means that the migration code might send a frame
which is actively being written by the backend at the same time.

As these changes are not reflected in the dirty bitmap (or indeed
reflected only after the migration code has stopped looking for further
changes), the corrupt frames never get resent.

>
> Do you have a theory why this wasn't seen with the migration v1 code?

It most certainly is.  This is not a problem unique to migration v2, and
I can prove its presence in the legacy migration code given a bit of
extra debugging.

~Andrew

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  9:02   ` Andrew Cooper
@ 2014-06-30  9:21     ` Ian Campbell
  2014-06-30  9:46       ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-06-30  9:21 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tim Deegan, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On Mon, 2014-06-30 at 10:02 +0100, Andrew Cooper wrote:
> On 30/06/14 09:38, Ian Campbell wrote:
> > On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
> >> Hello,
> >>
> >> After a long time fixing my own memory corruption bugs with migration
> >> v2, I have finally tracked down (what I really really hope is) the last
> >> of the corruption.
> >>
> >>
> >> There appears to be a systematic problem affecting all PV drivers,
> >> whereby a non-quiescent backend can cause memory corruption in the VM.
> >>
> >> Active grant mapped pages are only reflected in the dirty bitmap after
> >> the grant has been unmapped, as mapping the ring read-only would be
> >> catastrophic to performance, and remapping as read-only when logdirty is
> >> enabled is (as far as I understand) impossible, as Xen doesn't track the
> >> PTEs pointing at granted frames.
> >>
> >> PV backend drivers hold their mappings of the rings (and persistently
> >> granted frames) open until the domain is destroyed, which is after the
> >> memory image has been sent.  Therefore, any requests which are processed
> >> after the migration code sending the ring frame on its first pass will
> >> not be reflected in the resumed domain, as this frame will never be
> >> marked as dirty in Xen.
> >>
> >> Furthermore, as the migration code uses memcpy() on the frames, it is
> >> possible that a backed update intersects with the copy, and a corrupt
> >> descriptor appears on the resumed side.
> >>
> >> In addition, after the domain has been paused, the backend might still
> >> process requests.  The migration code excepts the guest be completely
> >> quiesced after it has been suspended, so will only check the dirty
> >> bitmap once.  Any requests which get processed and completed might still
> >> be missed by the migration code.
> >>
> >> From a heavily instrumented Xen and migration code, I am fairly sure I
> >> have confirmed that all pages corrupted on migration are a result of
> >> still-active grant maps, grant copies which complete after domain
> >> suspend, or the xenstore ring which xenstored has a magic mapping of,
> >> and will never be reflected in the dirty bitmap.
> >>
> >>
> >> Overall, it would appear that there needs to be a hook for all PV
> >> drivers to force quiescence.  In particular, a backend must guarantee to
> >> unmap all active grant maps (so the frames get properly reflected in the
> >> dirty bitmap), and never process subsequent requests (so no new frames
> >> appear dirty in the bitmap after the guest has been paused).
> >>
> >> Thoughts/comments?
> > I thought PV drivers were already (supposed to be) handling this in the
> > frontend.
> >
> > For reasons of checkpoint performance I think Linux's net and blkfront
> > are handling this on resume rather than on suspend by tearing down on
> > resume and then requeueing any outstanding I/O after they reattach to
> > the new backend. In the blkfront case this is explicit, whereas IIRC
> > netfront just discards any active requests and relies on L3+
> > retransmition to get the job done. (see netfront_resume and
> > blkfront_resume/blkif_recover).
> >
> > Part of the tear down and reconnect should involve invalidating any
> > inflight descriptors, whether or not they were partially completed or
> > have corrupted replies in them etc. This ought to be happening before
> > the new backend sees the ring at all.
> >
> > Can you give an example of an instance of the corruption which you've
> > seen?
> 
> The intersection of requests processed by the backends and memcpy() in
> the migration code means that the migration code might send a frame
> which is actively being written by the backend at the same time.

The I/O will either complete before or after the migration.

If it is before then there is no problem, I think, the associated grant
unmap will mark the page dirty (unless e.g. persistent grants or
something has broken one of the assumptions here?) before the reply hits
the ring (i.e. before the guest can be aware of it).

If the I/O completes after the migration then it won't actually complete
(at least not anywhere which matters). Instead it will be reissued by
the frtonend and the new attempt will overwrite whatever partial data
was there during the migration with the correct data.

> As these changes are not reflected in the dirty bitmap (or indeed
> reflected only after the migration code has stopped looking for further
> changes), the corrupt frames never get resent.

I could believe that something like persistent grants might have
bypassed some of the mechanisms used here, by ski9ping the unmap on i/o
completion, but I don't think the issue should exist for the older ways
of doing things.

I don't think there is any way for a backend to mark pages as dirty (the
existing h/call is a domctl), but perhaps there needs to be?

Does GNTABOP_copy correctly update the dirty state? Perhaps backends
should prefer that even when a direct memcpy() would be usable?

> > Do you have a theory why this wasn't seen with the migration v1 code?
> 
> It most certainly is.  This is not a problem unique to migration v2, and
> I can prove its presence in the legacy migration code given a bit of
> extra debugging.
> 
> ~Andrew

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  9:21     ` Ian Campbell
@ 2014-06-30  9:46       ` Andrew Cooper
  2014-06-30  9:52         ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2014-06-30  9:46 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On 30/06/14 10:21, Ian Campbell wrote:
> On Mon, 2014-06-30 at 10:02 +0100, Andrew Cooper wrote:
>> On 30/06/14 09:38, Ian Campbell wrote:
>>> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
>>>> Hello,
>>>>
>>>> After a long time fixing my own memory corruption bugs with migration
>>>> v2, I have finally tracked down (what I really really hope is) the last
>>>> of the corruption.
>>>>
>>>>
>>>> There appears to be a systematic problem affecting all PV drivers,
>>>> whereby a non-quiescent backend can cause memory corruption in the VM.
>>>>
>>>> Active grant mapped pages are only reflected in the dirty bitmap after
>>>> the grant has been unmapped, as mapping the ring read-only would be
>>>> catastrophic to performance, and remapping as read-only when logdirty is
>>>> enabled is (as far as I understand) impossible, as Xen doesn't track the
>>>> PTEs pointing at granted frames.
>>>>
>>>> PV backend drivers hold their mappings of the rings (and persistently
>>>> granted frames) open until the domain is destroyed, which is after the
>>>> memory image has been sent.  Therefore, any requests which are processed
>>>> after the migration code sending the ring frame on its first pass will
>>>> not be reflected in the resumed domain, as this frame will never be
>>>> marked as dirty in Xen.
>>>>
>>>> Furthermore, as the migration code uses memcpy() on the frames, it is
>>>> possible that a backed update intersects with the copy, and a corrupt
>>>> descriptor appears on the resumed side.
>>>>
>>>> In addition, after the domain has been paused, the backend might still
>>>> process requests.  The migration code excepts the guest be completely
>>>> quiesced after it has been suspended, so will only check the dirty
>>>> bitmap once.  Any requests which get processed and completed might still
>>>> be missed by the migration code.
>>>>
>>>> From a heavily instrumented Xen and migration code, I am fairly sure I
>>>> have confirmed that all pages corrupted on migration are a result of
>>>> still-active grant maps, grant copies which complete after domain
>>>> suspend, or the xenstore ring which xenstored has a magic mapping of,
>>>> and will never be reflected in the dirty bitmap.
>>>>
>>>>
>>>> Overall, it would appear that there needs to be a hook for all PV
>>>> drivers to force quiescence.  In particular, a backend must guarantee to
>>>> unmap all active grant maps (so the frames get properly reflected in the
>>>> dirty bitmap), and never process subsequent requests (so no new frames
>>>> appear dirty in the bitmap after the guest has been paused).
>>>>
>>>> Thoughts/comments?
>>> I thought PV drivers were already (supposed to be) handling this in the
>>> frontend.
>>>
>>> For reasons of checkpoint performance I think Linux's net and blkfront
>>> are handling this on resume rather than on suspend by tearing down on
>>> resume and then requeueing any outstanding I/O after they reattach to
>>> the new backend. In the blkfront case this is explicit, whereas IIRC
>>> netfront just discards any active requests and relies on L3+
>>> retransmition to get the job done. (see netfront_resume and
>>> blkfront_resume/blkif_recover).
>>>
>>> Part of the tear down and reconnect should involve invalidating any
>>> inflight descriptors, whether or not they were partially completed or
>>> have corrupted replies in them etc. This ought to be happening before
>>> the new backend sees the ring at all.
>>>
>>> Can you give an example of an instance of the corruption which you've
>>> seen?
>> The intersection of requests processed by the backends and memcpy() in
>> the migration code means that the migration code might send a frame
>> which is actively being written by the backend at the same time.
> The I/O will either complete before or after the migration.
>
> If it is before then there is no problem, I think

IO completed before vm suspend is fine.

However, there is a window of time between the vm suspend (after which
the frontend is quiesced) and the migration completing during which IO
is still being processed by the backend.

After the vm suspend, the migration code does not expect the contents of
VM memory to be changing, and it only consults the dirty bitmap once more.

Grants which are active until after this time (mappings of rings,
persistent areas) are never identified as dirty, which means that the
contents being sent will be whatever contents was present during the
first iteration of migration.

Grant copy operations do get correctly reflected in the dirty bitmap,
but any grant copies which happen after the final consultation of the
dirty bitmap will still be ignored.


> , the associated grant
> unmap will mark the page dirty (unless e.g. persistent grants or
> something has broken one of the assumptions here?)

All testing was done without persistent grant support, but persistent
grants will probably make the problem much more obvious, given a much
larger range of guest memory which will be out-of-date.

>  before the reply hits
> the ring (i.e. before the guest can be aware of it).
>
> If the I/O completes after the migration then it won't actually complete
> (at least not anywhere which matters). Instead it will be reissued by
> the frtonend and the new attempt will overwrite whatever partial data
> was there during the migration with the correct data.

Correct, which is why this problem goes mostly unnoticed.  So long as
there is no intersection of the backend writing a frame and the
migration code copying that frame, resulting a corrupt frame being sent,
the replay of the ring will completely hide the issue.

~Andrew

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  8:38 ` Ian Campbell
  2014-06-30  9:02   ` Andrew Cooper
@ 2014-06-30  9:47   ` David Vrabel
  2014-06-30  9:53     ` Ian Campbell
  2014-06-30 10:14     ` Tim Deegan
  1 sibling, 2 replies; 24+ messages in thread
From: David Vrabel @ 2014-06-30  9:47 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper
  Cc: Tim Deegan, Ian Jackson, Xen-devel List, Paul Durrant,
	Jan Beulich

On 30/06/14 09:38, Ian Campbell wrote:
> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
>> Hello,
>>
>> After a long time fixing my own memory corruption bugs with migration
>> v2, I have finally tracked down (what I really really hope is) the last
>> of the corruption.
>>
>>
>> There appears to be a systematic problem affecting all PV drivers,
>> whereby a non-quiescent backend can cause memory corruption in the VM.
>>
>> Active grant mapped pages are only reflected in the dirty bitmap after
>> the grant has been unmapped, as mapping the ring read-only would be
>> catastrophic to performance, and remapping as read-only when logdirty is
>> enabled is (as far as I understand) impossible, as Xen doesn't track the
>> PTEs pointing at granted frames.
>>
>> PV backend drivers hold their mappings of the rings (and persistently
>> granted frames) open until the domain is destroyed, which is after the
>> memory image has been sent.  Therefore, any requests which are processed
>> after the migration code sending the ring frame on its first pass will
>> not be reflected in the resumed domain, as this frame will never be
>> marked as dirty in Xen.
>>
>> Furthermore, as the migration code uses memcpy() on the frames, it is
>> possible that a backed update intersects with the copy, and a corrupt
>> descriptor appears on the resumed side.
>>
>> In addition, after the domain has been paused, the backend might still
>> process requests.  The migration code excepts the guest be completely
>> quiesced after it has been suspended, so will only check the dirty
>> bitmap once.  Any requests which get processed and completed might still
>> be missed by the migration code.
>>
>> From a heavily instrumented Xen and migration code, I am fairly sure I
>> have confirmed that all pages corrupted on migration are a result of
>> still-active grant maps, grant copies which complete after domain
>> suspend, or the xenstore ring which xenstored has a magic mapping of,
>> and will never be reflected in the dirty bitmap.
>>
>>
>> Overall, it would appear that there needs to be a hook for all PV
>> drivers to force quiescence.  In particular, a backend must guarantee to
>> unmap all active grant maps (so the frames get properly reflected in the
>> dirty bitmap), and never process subsequent requests (so no new frames
>> appear dirty in the bitmap after the guest has been paused).
>>
>> Thoughts/comments?
> 
> I thought PV drivers were already (supposed to be) handling this in the
> frontend.
> 
> For reasons of checkpoint performance I think Linux's net and blkfront
> are handling this on resume rather than on suspend by tearing down on
> resume and then requeueing any outstanding I/O after they reattach to
> the new backend. In the blkfront case this is explicit, whereas IIRC
> netfront just discards any active requests and relies on L3+
> retransmition to get the job done. (see netfront_resume and
> blkfront_resume/blkif_recover).
> 
> Part of the tear down and reconnect should involve invalidating any
> inflight descriptors, whether or not they were partially completed or
> have corrupted replies in them etc. This ought to be happening before
> the new backend sees the ring at all.
> 
> Can you give an example of an instance of the corruption which you've
> seen?
> 
> Do you have a theory why this wasn't seen with the migration v1 code?

I think it does not result in any problems (except if persistent grants
are used, see below).

Shared ring updates are strictly ordered with respect to the writes to
data pages (either via grant map or grant copy).  This means that is the
guest sees a response in the ring it is guaranteed that all writes to
the associated pages are also present.

The write of the response and the write of the producer index are
strictly ordered.  If the backend is in the process of writing a
response and the page is saved then the partial (corrupt) response is
not visible to the guest.  The write of the producer index is atomic so
the saver cannot see a partial producer index write.

Using persistent grants means that backend writes may be lost since the
memcpy in the backend does not update the dirty bitmap and guest writes
to the shared ring may make the backends response writes visible.  I
think we may need to disable persistent grant support from blkback until
a mechanism for updating the dirty log is in place.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  9:46       ` Andrew Cooper
@ 2014-06-30  9:52         ` Ian Campbell
  2014-06-30 10:13           ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-06-30  9:52 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tim Deegan, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On Mon, 2014-06-30 at 10:46 +0100, Andrew Cooper wrote:
> On 30/06/14 10:21, Ian Campbell wrote:
> > On Mon, 2014-06-30 at 10:02 +0100, Andrew Cooper wrote:
> >> On 30/06/14 09:38, Ian Campbell wrote:
> >>> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
> >>>> Hello,
> >>>>
> >>>> After a long time fixing my own memory corruption bugs with migration
> >>>> v2, I have finally tracked down (what I really really hope is) the last
> >>>> of the corruption.
> >>>>
> >>>>
> >>>> There appears to be a systematic problem affecting all PV drivers,
> >>>> whereby a non-quiescent backend can cause memory corruption in the VM.
> >>>>
> >>>> Active grant mapped pages are only reflected in the dirty bitmap after
> >>>> the grant has been unmapped, as mapping the ring read-only would be
> >>>> catastrophic to performance, and remapping as read-only when logdirty is
> >>>> enabled is (as far as I understand) impossible, as Xen doesn't track the
> >>>> PTEs pointing at granted frames.
> >>>>
> >>>> PV backend drivers hold their mappings of the rings (and persistently
> >>>> granted frames) open until the domain is destroyed, which is after the
> >>>> memory image has been sent.  Therefore, any requests which are processed
> >>>> after the migration code sending the ring frame on its first pass will
> >>>> not be reflected in the resumed domain, as this frame will never be
> >>>> marked as dirty in Xen.
> >>>>
> >>>> Furthermore, as the migration code uses memcpy() on the frames, it is
> >>>> possible that a backed update intersects with the copy, and a corrupt
> >>>> descriptor appears on the resumed side.
> >>>>
> >>>> In addition, after the domain has been paused, the backend might still
> >>>> process requests.  The migration code excepts the guest be completely
> >>>> quiesced after it has been suspended, so will only check the dirty
> >>>> bitmap once.  Any requests which get processed and completed might still
> >>>> be missed by the migration code.
> >>>>
> >>>> From a heavily instrumented Xen and migration code, I am fairly sure I
> >>>> have confirmed that all pages corrupted on migration are a result of
> >>>> still-active grant maps, grant copies which complete after domain
> >>>> suspend, or the xenstore ring which xenstored has a magic mapping of,
> >>>> and will never be reflected in the dirty bitmap.
> >>>>
> >>>>
> >>>> Overall, it would appear that there needs to be a hook for all PV
> >>>> drivers to force quiescence.  In particular, a backend must guarantee to
> >>>> unmap all active grant maps (so the frames get properly reflected in the
> >>>> dirty bitmap), and never process subsequent requests (so no new frames
> >>>> appear dirty in the bitmap after the guest has been paused).
> >>>>
> >>>> Thoughts/comments?
> >>> I thought PV drivers were already (supposed to be) handling this in the
> >>> frontend.
> >>>
> >>> For reasons of checkpoint performance I think Linux's net and blkfront
> >>> are handling this on resume rather than on suspend by tearing down on
> >>> resume and then requeueing any outstanding I/O after they reattach to
> >>> the new backend. In the blkfront case this is explicit, whereas IIRC
> >>> netfront just discards any active requests and relies on L3+
> >>> retransmition to get the job done. (see netfront_resume and
> >>> blkfront_resume/blkif_recover).
> >>>
> >>> Part of the tear down and reconnect should involve invalidating any
> >>> inflight descriptors, whether or not they were partially completed or
> >>> have corrupted replies in them etc. This ought to be happening before
> >>> the new backend sees the ring at all.
> >>>
> >>> Can you give an example of an instance of the corruption which you've
> >>> seen?
> >> The intersection of requests processed by the backends and memcpy() in
> >> the migration code means that the migration code might send a frame
> >> which is actively being written by the backend at the same time.
> > The I/O will either complete before or after the migration.
> >
> > If it is before then there is no problem, I think
> 
> IO completed before vm suspend is fine.
> 
> However, there is a window of time between the vm suspend (after which
> the frontend is quiesced) and the migration completing during which IO
> is still being processed by the backend.
> 
> After the vm suspend, the migration code does not expect the contents of
> VM memory to be changing, and it only consults the dirty bitmap once more.
> 
> Grants which are active until after this time (mappings of rings,
> persistent areas) are never identified as dirty, which means that the
> contents being sent will be whatever contents was present during the
> first iteration of migration.

Correct. And then that random contents will be fixed by the replay of
the I/O on the resume side.

> Grant copy operations do get correctly reflected in the dirty bitmap,
> but any grant copies which happen after the final consultation of the
> dirty bitmap will still be ignored.

That's fine since there is no way to signal that the I/O has completed
to the frontend, since it is no longer running. As long as the I/O is
incomplete then the frontend will replay it.

> > , the associated grant
> > unmap will mark the page dirty (unless e.g. persistent grants or
> > something has broken one of the assumptions here?)
> 
> All testing was done without persistent grant support, but persistent
> grants will probably make the problem much more obvious, given a much
> larger range of guest memory which will be out-of-date.
> 
> >  before the reply hits
> > the ring (i.e. before the guest can be aware of it).
> >
> > If the I/O completes after the migration then it won't actually complete
> > (at least not anywhere which matters). Instead it will be reissued by
> > the frtonend and the new attempt will overwrite whatever partial data
> > was there during the migration with the correct data.
> 
> Correct, which is why this problem goes mostly unnoticed.  So long as
> there is no intersection of the backend writing a frame and the
> migration code copying that frame, resulting a corrupt frame being sent,
> the replay of the ring will completely hide the issue.

It's not "mostly unnoticed", it is working as designed.

I think you need to consider the content of a frame which is under I/O
at the time of migration to be "undefined" rather than looking for a
hard "correct" vs. "corrupt" distinction.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  9:47   ` David Vrabel
@ 2014-06-30  9:53     ` Ian Campbell
  2014-07-01 10:29       ` David Vrabel
  2014-06-30 10:14     ` Tim Deegan
  1 sibling, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-06-30  9:53 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Tim Deegan, Xen-devel List, Paul Durrant,
	Jan Beulich, Ian Jackson

On Mon, 2014-06-30 at 10:47 +0100, David Vrabel wrote:
> On 30/06/14 09:38, Ian Campbell wrote:
> > On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
> >> Hello,
> >>
> >> After a long time fixing my own memory corruption bugs with migration
> >> v2, I have finally tracked down (what I really really hope is) the last
> >> of the corruption.
> >>
> >>
> >> There appears to be a systematic problem affecting all PV drivers,
> >> whereby a non-quiescent backend can cause memory corruption in the VM.
> >>
> >> Active grant mapped pages are only reflected in the dirty bitmap after
> >> the grant has been unmapped, as mapping the ring read-only would be
> >> catastrophic to performance, and remapping as read-only when logdirty is
> >> enabled is (as far as I understand) impossible, as Xen doesn't track the
> >> PTEs pointing at granted frames.
> >>
> >> PV backend drivers hold their mappings of the rings (and persistently
> >> granted frames) open until the domain is destroyed, which is after the
> >> memory image has been sent.  Therefore, any requests which are processed
> >> after the migration code sending the ring frame on its first pass will
> >> not be reflected in the resumed domain, as this frame will never be
> >> marked as dirty in Xen.
> >>
> >> Furthermore, as the migration code uses memcpy() on the frames, it is
> >> possible that a backed update intersects with the copy, and a corrupt
> >> descriptor appears on the resumed side.
> >>
> >> In addition, after the domain has been paused, the backend might still
> >> process requests.  The migration code excepts the guest be completely
> >> quiesced after it has been suspended, so will only check the dirty
> >> bitmap once.  Any requests which get processed and completed might still
> >> be missed by the migration code.
> >>
> >> From a heavily instrumented Xen and migration code, I am fairly sure I
> >> have confirmed that all pages corrupted on migration are a result of
> >> still-active grant maps, grant copies which complete after domain
> >> suspend, or the xenstore ring which xenstored has a magic mapping of,
> >> and will never be reflected in the dirty bitmap.
> >>
> >>
> >> Overall, it would appear that there needs to be a hook for all PV
> >> drivers to force quiescence.  In particular, a backend must guarantee to
> >> unmap all active grant maps (so the frames get properly reflected in the
> >> dirty bitmap), and never process subsequent requests (so no new frames
> >> appear dirty in the bitmap after the guest has been paused).
> >>
> >> Thoughts/comments?
> > 
> > I thought PV drivers were already (supposed to be) handling this in the
> > frontend.
> > 
> > For reasons of checkpoint performance I think Linux's net and blkfront
> > are handling this on resume rather than on suspend by tearing down on
> > resume and then requeueing any outstanding I/O after they reattach to
> > the new backend. In the blkfront case this is explicit, whereas IIRC
> > netfront just discards any active requests and relies on L3+
> > retransmition to get the job done. (see netfront_resume and
> > blkfront_resume/blkif_recover).
> > 
> > Part of the tear down and reconnect should involve invalidating any
> > inflight descriptors, whether or not they were partially completed or
> > have corrupted replies in them etc. This ought to be happening before
> > the new backend sees the ring at all.
> > 
> > Can you give an example of an instance of the corruption which you've
> > seen?
> > 
> > Do you have a theory why this wasn't seen with the migration v1 code?
> 
> I think it does not result in any problems (except if persistent grants
> are used, see below).
> 
> Shared ring updates are strictly ordered with respect to the writes to
> data pages (either via grant map or grant copy).  This means that is the
> guest sees a response in the ring it is guaranteed that all writes to
> the associated pages are also present.
> 
> The write of the response and the write of the producer index are
> strictly ordered.  If the backend is in the process of writing a
> response and the page is saved then the partial (corrupt) response is
> not visible to the guest.  The write of the producer index is atomic so
> the saver cannot see a partial producer index write.

Exactly.

> Using persistent grants means that backend writes may be lost since the
> memcpy in the backend does not update the dirty bitmap and guest writes
> to the shared ring may make the backends response writes visible.  I
> think we may need to disable persistent grant support from blkback until
> a mechanism for updating the dirty log is in place.

Yes, sadly I think you might be right here.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  9:52         ` Ian Campbell
@ 2014-06-30 10:13           ` Andrew Cooper
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Cooper @ 2014-06-30 10:13 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On 30/06/14 10:52, Ian Campbell wrote:
> On Mon, 2014-06-30 at 10:46 +0100, Andrew Cooper wrote:
>> On 30/06/14 10:21, Ian Campbell wrote:
>>> On Mon, 2014-06-30 at 10:02 +0100, Andrew Cooper wrote:
>>>> On 30/06/14 09:38, Ian Campbell wrote:
>>>>> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
>>>>>> Hello,
>>>>>>
>>>>>> After a long time fixing my own memory corruption bugs with migration
>>>>>> v2, I have finally tracked down (what I really really hope is) the last
>>>>>> of the corruption.
>>>>>>
>>>>>>
>>>>>> There appears to be a systematic problem affecting all PV drivers,
>>>>>> whereby a non-quiescent backend can cause memory corruption in the VM.
>>>>>>
>>>>>> Active grant mapped pages are only reflected in the dirty bitmap after
>>>>>> the grant has been unmapped, as mapping the ring read-only would be
>>>>>> catastrophic to performance, and remapping as read-only when logdirty is
>>>>>> enabled is (as far as I understand) impossible, as Xen doesn't track the
>>>>>> PTEs pointing at granted frames.
>>>>>>
>>>>>> PV backend drivers hold their mappings of the rings (and persistently
>>>>>> granted frames) open until the domain is destroyed, which is after the
>>>>>> memory image has been sent.  Therefore, any requests which are processed
>>>>>> after the migration code sending the ring frame on its first pass will
>>>>>> not be reflected in the resumed domain, as this frame will never be
>>>>>> marked as dirty in Xen.
>>>>>>
>>>>>> Furthermore, as the migration code uses memcpy() on the frames, it is
>>>>>> possible that a backed update intersects with the copy, and a corrupt
>>>>>> descriptor appears on the resumed side.
>>>>>>
>>>>>> In addition, after the domain has been paused, the backend might still
>>>>>> process requests.  The migration code excepts the guest be completely
>>>>>> quiesced after it has been suspended, so will only check the dirty
>>>>>> bitmap once.  Any requests which get processed and completed might still
>>>>>> be missed by the migration code.
>>>>>>
>>>>>> From a heavily instrumented Xen and migration code, I am fairly sure I
>>>>>> have confirmed that all pages corrupted on migration are a result of
>>>>>> still-active grant maps, grant copies which complete after domain
>>>>>> suspend, or the xenstore ring which xenstored has a magic mapping of,
>>>>>> and will never be reflected in the dirty bitmap.
>>>>>>
>>>>>>
>>>>>> Overall, it would appear that there needs to be a hook for all PV
>>>>>> drivers to force quiescence.  In particular, a backend must guarantee to
>>>>>> unmap all active grant maps (so the frames get properly reflected in the
>>>>>> dirty bitmap), and never process subsequent requests (so no new frames
>>>>>> appear dirty in the bitmap after the guest has been paused).
>>>>>>
>>>>>> Thoughts/comments?
>>>>> I thought PV drivers were already (supposed to be) handling this in the
>>>>> frontend.
>>>>>
>>>>> For reasons of checkpoint performance I think Linux's net and blkfront
>>>>> are handling this on resume rather than on suspend by tearing down on
>>>>> resume and then requeueing any outstanding I/O after they reattach to
>>>>> the new backend. In the blkfront case this is explicit, whereas IIRC
>>>>> netfront just discards any active requests and relies on L3+
>>>>> retransmition to get the job done. (see netfront_resume and
>>>>> blkfront_resume/blkif_recover).
>>>>>
>>>>> Part of the tear down and reconnect should involve invalidating any
>>>>> inflight descriptors, whether or not they were partially completed or
>>>>> have corrupted replies in them etc. This ought to be happening before
>>>>> the new backend sees the ring at all.
>>>>>
>>>>> Can you give an example of an instance of the corruption which you've
>>>>> seen?
>>>> The intersection of requests processed by the backends and memcpy() in
>>>> the migration code means that the migration code might send a frame
>>>> which is actively being written by the backend at the same time.
>>> The I/O will either complete before or after the migration.
>>>
>>> If it is before then there is no problem, I think
>> IO completed before vm suspend is fine.
>>
>> However, there is a window of time between the vm suspend (after which
>> the frontend is quiesced) and the migration completing during which IO
>> is still being processed by the backend.
>>
>> After the vm suspend, the migration code does not expect the contents of
>> VM memory to be changing, and it only consults the dirty bitmap once more.
>>
>> Grants which are active until after this time (mappings of rings,
>> persistent areas) are never identified as dirty, which means that the
>> contents being sent will be whatever contents was present during the
>> first iteration of migration.
> Correct. And then that random contents will be fixed by the replay of
> the I/O on the resume side.
>
>> Grant copy operations do get correctly reflected in the dirty bitmap,
>> but any grant copies which happen after the final consultation of the
>> dirty bitmap will still be ignored.
> That's fine since there is no way to signal that the I/O has completed
> to the frontend, since it is no longer running. As long as the I/O is
> incomplete then the frontend will replay it.
>
>>> , the associated grant
>>> unmap will mark the page dirty (unless e.g. persistent grants or
>>> something has broken one of the assumptions here?)
>> All testing was done without persistent grant support, but persistent
>> grants will probably make the problem much more obvious, given a much
>> larger range of guest memory which will be out-of-date.
>>
>>>  before the reply hits
>>> the ring (i.e. before the guest can be aware of it).
>>>
>>> If the I/O completes after the migration then it won't actually complete
>>> (at least not anywhere which matters). Instead it will be reissued by
>>> the frtonend and the new attempt will overwrite whatever partial data
>>> was there during the migration with the correct data.
>> Correct, which is why this problem goes mostly unnoticed.  So long as
>> there is no intersection of the backend writing a frame and the
>> migration code copying that frame, resulting a corrupt frame being sent,
>> the replay of the ring will completely hide the issue.
> It's not "mostly unnoticed", it is working as designed.
>
> I think you need to consider the content of a frame which is under I/O
> at the time of migration to be "undefined" rather than looking for a
> hard "correct" vs. "corrupt" distinction.
>
> Ian.
>

There is no guarentee which direction the ring frame gets memcpy()'d. 
Even with strict ordering of changes to the ring and to the pointers,
you can still end up in a situation where the producer index has been
incremented but the underlying descriptor has not been written (as far
as the frontend is can see).

Fundamentally, it doesn't matter if the ring is out date when it is sent
(as replaying the ring will fix this), but it *must* be consistent,
which means there needs to be a guarantee that the backend does not
write into the ring frame at the same time that it is being sent by the
migration code.

This necessitates quiescing of the backend, or a guarantee that after
the vm suspend it will not update the ring page further, even if
outstanding IO subsequently completes.  This allows a consistent ring
page to be sent in the stream.

~Andrew

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  9:47   ` David Vrabel
  2014-06-30  9:53     ` Ian Campbell
@ 2014-06-30 10:14     ` Tim Deegan
  2014-06-30 10:24       ` Ian Campbell
                         ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Tim Deegan @ 2014-06-30 10:14 UTC (permalink / raw)
  To: David Vrabel
  Cc: Ian Campbell, Andrew Cooper, Ian Jackson, Xen-devel List,
	Paul Durrant, Jan Beulich

At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
> Shared ring updates are strictly ordered with respect to the writes to
> data pages (either via grant map or grant copy).  This means that is the
> guest sees a response in the ring it is guaranteed that all writes to
> the associated pages are also present.

Is the ring update also strictly ordered wrt the grant unmap operation?

> The write of the response and the write of the producer index are
> strictly ordered.  If the backend is in the process of writing a
> response and the page is saved then the partial (corrupt) response is
> not visible to the guest.  The write of the producer index is atomic so
> the saver cannot see a partial producer index write.

Yes.  The (suggested) problem is that live migration does not preserve
that write ordering.  So we have to worry about something like this:

1. Toolstack pauses the domain for the final pass.  Reads the final
   LGD bitmap, which happens to include the shared ring but not the
   data pages.
2. Backend writes the data.
3. Backend unmaps the data page, marking it dirty.
4. Backend writes the ring.
5. Toolstack sends the ring page across in the last pass.
6. Guest resumes, seeing the I/O marked as complete, but without the
   data.

ISTR working though this before and being convinced that the backends
were correctly detaching before the final pass.  That was a long time
ago, though.

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 10:14     ` Tim Deegan
@ 2014-06-30 10:24       ` Ian Campbell
  2014-06-30 10:52       ` Tim Deegan
  2014-06-30 11:01       ` Ian Campbell
  2 siblings, 0 replies; 24+ messages in thread
From: Ian Campbell @ 2014-06-30 10:24 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Andrew Cooper, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On Mon, 2014-06-30 at 12:14 +0200, Tim Deegan wrote:
> At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
> > Shared ring updates are strictly ordered with respect to the writes to
> > data pages (either via grant map or grant copy).  This means that is the
> > guest sees a response in the ring it is guaranteed that all writes to
> > the associated pages are also present.
> 
> Is the ring update also strictly ordered wrt the grant unmap operation?

Probably nothing actually enforces this and we've likely never written
it down but I think in general a well behaved backend should be doing
the unmap before indicating completion (modulo persistent grants).

I think doing otherwise would cause the guest issues when it tried to
reuse the grant slot, because the gref would potentially be marked busy.

> > The write of the response and the write of the producer index are
> > strictly ordered.  If the backend is in the process of writing a
> > response and the page is saved then the partial (corrupt) response is
> > not visible to the guest.  The write of the producer index is atomic so
> > the saver cannot see a partial producer index write.
> 
> Yes.  The (suggested) problem is that live migration does not preserve
> that write ordering.  So we have to worry about something like this:

Thank you for the clear explanation of the issue.

> 1. Toolstack pauses the domain for the final pass.  Reads the final
>    LGD bitmap, which happens to include the shared ring but not the
>    data pages.
> 2. Backend writes the data.
> 3. Backend unmaps the data page, marking it dirty.
> 4. Backend writes the ring.
> 5. Toolstack sends the ring page across in the last pass.
> 6. Guest resumes, seeing the I/O marked as complete, but without the
>    data.
> 
> ISTR working though this before and being convinced that the backends
> were correctly detaching before the final pass.  That was a long time
> ago, though.

It's certainly not impossible that the switch to xl or the move of the
calls to the hotplug scripts from udev to libxl or something like that
has caused this sequencing to become incorrect.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 10:14     ` Tim Deegan
  2014-06-30 10:24       ` Ian Campbell
@ 2014-06-30 10:52       ` Tim Deegan
  2014-06-30 11:07         ` Ian Campbell
  2014-06-30 11:01       ` Ian Campbell
  2 siblings, 1 reply; 24+ messages in thread
From: Tim Deegan @ 2014-06-30 10:52 UTC (permalink / raw)
  To: David Vrabel
  Cc: Ian Campbell, Andrew Cooper, Ian Jackson, Xen-devel List,
	Paul Durrant, Jan Beulich

At 12:14 +0200 on 30 Jun (1404126862), Tim Deegan wrote:
> At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
> > Shared ring updates are strictly ordered with respect to the writes to
> > data pages (either via grant map or grant copy).  This means that is the
> > guest sees a response in the ring it is guaranteed that all writes to
> > the associated pages are also present.
> 
> Is the ring update also strictly ordered wrt the grant unmap operation?
> 
> > The write of the response and the write of the producer index are
> > strictly ordered.  If the backend is in the process of writing a
> > response and the page is saved then the partial (corrupt) response is
> > not visible to the guest.  The write of the producer index is atomic so
> > the saver cannot see a partial producer index write.
> 
> Yes.  The (suggested) problem is that live migration does not preserve
> that write ordering.  So we have to worry about something like this:
> 
> 1. Toolstack pauses the domain for the final pass.  Reads the final
>    LGD bitmap, which happens to include the shared ring but not the
>    data pages.
> 2. Backend writes the data.
> 3. Backend unmaps the data page, marking it dirty.
> 4. Backend writes the ring.
> 5. Toolstack sends the ring page across in the last pass.
> 6. Guest resumes, seeing the I/O marked as complete, but without the
>    data.

It occurs to me that the guest should be able to defend against this
by taking a local copy of the response producer before migration and
using _that_ for the replay logic afterwards.  That is guaranteed to
exclude any I/O that completed after the VM was paused, and as long as
the unmap is guaranteed to happen before the ring update, we're OK.

(That still leaves the question that Andrew raised of memcpy()
breaking atomicity/ordering of updates.)

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 10:14     ` Tim Deegan
  2014-06-30 10:24       ` Ian Campbell
  2014-06-30 10:52       ` Tim Deegan
@ 2014-06-30 11:01       ` Ian Campbell
  2014-06-30 11:08         ` Tim Deegan
  2 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-06-30 11:01 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Andrew Cooper, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On Mon, 2014-06-30 at 12:14 +0200, Tim Deegan wrote:
> At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
> > Shared ring updates are strictly ordered with respect to the writes to
> > data pages (either via grant map or grant copy).  This means that is the
> > guest sees a response in the ring it is guaranteed that all writes to
> > the associated pages are also present.
> 
> Is the ring update also strictly ordered wrt the grant unmap operation?
> 
> > The write of the response and the write of the producer index are
> > strictly ordered.  If the backend is in the process of writing a
> > response and the page is saved then the partial (corrupt) response is
> > not visible to the guest.  The write of the producer index is atomic so
> > the saver cannot see a partial producer index write.
> 
> Yes.  The (suggested) problem is that live migration does not preserve
> that write ordering.  So we have to worry about something like this:
> 
> 1. Toolstack pauses the domain for the final pass.  Reads the final
>    LGD bitmap, which happens to include the shared ring but not the
>    data pages.

WRT this -- writes by the backend don't cause the page to become dirty.
What about writes by the f.e? Are pages which are granted and actively
mapped by another domain handled as normal guest RAM pages for the
purposes of logdirty or are they handle specially?

> 2. Backend writes the data.
> 3. Backend unmaps the data page, marking it dirty.
> 4. Backend writes the ring.
> 5. Toolstack sends the ring page across in the last pass.
> 6. Guest resumes, seeing the I/O marked as complete, but without the
>    data.
> 
> ISTR working though this before and being convinced that the backends
> were correctly detaching before the final pass.  That was a long time
> ago, though.
> 
> Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 10:52       ` Tim Deegan
@ 2014-06-30 11:07         ` Ian Campbell
  2014-06-30 11:12           ` Tim Deegan
  2014-06-30 11:57           ` David Vrabel
  0 siblings, 2 replies; 24+ messages in thread
From: Ian Campbell @ 2014-06-30 11:07 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Andrew Cooper, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

On Mon, 2014-06-30 at 12:52 +0200, Tim Deegan wrote:
> At 12:14 +0200 on 30 Jun (1404126862), Tim Deegan wrote:
> > At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
> > > Shared ring updates are strictly ordered with respect to the writes to
> > > data pages (either via grant map or grant copy).  This means that is the
> > > guest sees a response in the ring it is guaranteed that all writes to
> > > the associated pages are also present.
> > 
> > Is the ring update also strictly ordered wrt the grant unmap operation?
> > 
> > > The write of the response and the write of the producer index are
> > > strictly ordered.  If the backend is in the process of writing a
> > > response and the page is saved then the partial (corrupt) response is
> > > not visible to the guest.  The write of the producer index is atomic so
> > > the saver cannot see a partial producer index write.
> > 
> > Yes.  The (suggested) problem is that live migration does not preserve
> > that write ordering.  So we have to worry about something like this:
> > 
> > 1. Toolstack pauses the domain for the final pass.  Reads the final
> >    LGD bitmap, which happens to include the shared ring but not the
> >    data pages.
> > 2. Backend writes the data.
> > 3. Backend unmaps the data page, marking it dirty.
> > 4. Backend writes the ring.
> > 5. Toolstack sends the ring page across in the last pass.
> > 6. Guest resumes, seeing the I/O marked as complete, but without the
> >    data.
> 
> It occurs to me that the guest should be able to defend against this
> by taking a local copy of the response producer before migration and
> using _that_ for the replay logic afterwards.  That is guaranteed to
> exclude any I/O that completed after the VM was paused, and as long as
> the unmap is guaranteed to happen before the ring update, we're OK.

AIUI blkfront at least maintains it's own shadow copy of the ring at all
times, and the recovery process doesn't use the migrated copy of the
ring at all (at least not the responses). I might be misunderstanding
the code there though.

> (That still leaves the question that Andrew raised of memcpy()
> breaking atomicity/ordering of updates.)

That's the memcpy in the migration code vs the definitely correctly
ordered updates done by the b.e., right?

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 11:01       ` Ian Campbell
@ 2014-06-30 11:08         ` Tim Deegan
  0 siblings, 0 replies; 24+ messages in thread
From: Tim Deegan @ 2014-06-30 11:08 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Andrew Cooper, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

At 12:01 +0100 on 30 Jun (1404126084), Ian Campbell wrote:
> On Mon, 2014-06-30 at 12:14 +0200, Tim Deegan wrote:
> > Yes.  The (suggested) problem is that live migration does not preserve
> > that write ordering.  So we have to worry about something like this:
> > 
> > 1. Toolstack pauses the domain for the final pass.  Reads the final
> >    LGD bitmap, which happens to include the shared ring but not the
> >    data pages.
> 
> WRT this -- writes by the backend don't cause the page to become dirty.
> What about writes by the f.e? Are pages which are granted and actively
> mapped by another domain handled as normal guest RAM pages for the
> purposes of logdirty or are they handle specially?

They're handled as normal RAM, so yes f/e writes will mark that page
dirty.

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 11:07         ` Ian Campbell
@ 2014-06-30 11:12           ` Tim Deegan
  2014-06-30 11:57           ` David Vrabel
  1 sibling, 0 replies; 24+ messages in thread
From: Tim Deegan @ 2014-06-30 11:12 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Andrew Cooper, Ian Jackson, Xen-devel List, Paul Durrant,
	David Vrabel, Jan Beulich

At 12:07 +0100 on 30 Jun (1404126466), Ian Campbell wrote:
> On Mon, 2014-06-30 at 12:52 +0200, Tim Deegan wrote:
> > At 12:14 +0200 on 30 Jun (1404126862), Tim Deegan wrote:
> > > At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
> > > > Shared ring updates are strictly ordered with respect to the writes to
> > > > data pages (either via grant map or grant copy).  This means that is the
> > > > guest sees a response in the ring it is guaranteed that all writes to
> > > > the associated pages are also present.
> > > 
> > > Is the ring update also strictly ordered wrt the grant unmap operation?
> > > 
> > > > The write of the response and the write of the producer index are
> > > > strictly ordered.  If the backend is in the process of writing a
> > > > response and the page is saved then the partial (corrupt) response is
> > > > not visible to the guest.  The write of the producer index is atomic so
> > > > the saver cannot see a partial producer index write.
> > > 
> > > Yes.  The (suggested) problem is that live migration does not preserve
> > > that write ordering.  So we have to worry about something like this:
> > > 
> > > 1. Toolstack pauses the domain for the final pass.  Reads the final
> > >    LGD bitmap, which happens to include the shared ring but not the
> > >    data pages.
> > > 2. Backend writes the data.
> > > 3. Backend unmaps the data page, marking it dirty.
> > > 4. Backend writes the ring.
> > > 5. Toolstack sends the ring page across in the last pass.
> > > 6. Guest resumes, seeing the I/O marked as complete, but without the
> > >    data.
> > 
> > It occurs to me that the guest should be able to defend against this
> > by taking a local copy of the response producer before migration and
> > using _that_ for the replay logic afterwards.  That is guaranteed to
> > exclude any I/O that completed after the VM was paused, and as long as
> > the unmap is guaranteed to happen before the ring update, we're OK.
> 
> AIUI blkfront at least maintains it's own shadow copy of the ring at all
> times, and the recovery process doesn't use the migrated copy of the
> ring at all (at least not the responses). I might be misunderstanding
> the code there though.

That sounds like it shoud be OK then, though we might want to document
exactly why. :)

> > (That still leaves the question that Andrew raised of memcpy()
> > breaking atomicity/ordering of updates.)
> 
> That's the memcpy in the migration code vs the definitely correctly
> ordered updates done by the b.e., right?

Yes.  That's the risk that
 - the memcpy might use non-atomic reads and so corrupt the rsp-prod; or
 - the memcpy might be ordered s.t. that it copies the ring itself
   before it copies rsp-prod, which breaks the required ordering on
   the frontend.

I think that on x86 we're unlikely to have either of those problems
in practice, at least for single-page rings.

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 11:07         ` Ian Campbell
  2014-06-30 11:12           ` Tim Deegan
@ 2014-06-30 11:57           ` David Vrabel
  2014-06-30 12:20             ` Ian Campbell
  1 sibling, 1 reply; 24+ messages in thread
From: David Vrabel @ 2014-06-30 11:57 UTC (permalink / raw)
  To: Ian Campbell, Tim Deegan
  Cc: Andrew Cooper, Ian Jackson, Xen-devel List, Paul Durrant,
	Jan Beulich

On 30/06/14 12:07, Ian Campbell wrote:
> On Mon, 2014-06-30 at 12:52 +0200, Tim Deegan wrote:
>> At 12:14 +0200 on 30 Jun (1404126862), Tim Deegan wrote:
>>> At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
>>>> Shared ring updates are strictly ordered with respect to the writes to
>>>> data pages (either via grant map or grant copy).  This means that is the
>>>> guest sees a response in the ring it is guaranteed that all writes to
>>>> the associated pages are also present.
>>>
>>> Is the ring update also strictly ordered wrt the grant unmap operation?
>>>
>>>> The write of the response and the write of the producer index are
>>>> strictly ordered.  If the backend is in the process of writing a
>>>> response and the page is saved then the partial (corrupt) response is
>>>> not visible to the guest.  The write of the producer index is atomic so
>>>> the saver cannot see a partial producer index write.
>>>
>>> Yes.  The (suggested) problem is that live migration does not preserve
>>> that write ordering.  So we have to worry about something like this:
>>>
>>> 1. Toolstack pauses the domain for the final pass.  Reads the final
>>>    LGD bitmap, which happens to include the shared ring but not the
>>>    data pages.
>>> 2. Backend writes the data.
>>> 3. Backend unmaps the data page, marking it dirty.
>>> 4. Backend writes the ring.
>>> 5. Toolstack sends the ring page across in the last pass.
>>> 6. Guest resumes, seeing the I/O marked as complete, but without the
>>>    data.
>>
>> It occurs to me that the guest should be able to defend against this
>> by taking a local copy of the response producer before migration and
>> using _that_ for the replay logic afterwards.  That is guaranteed to
>> exclude any I/O that completed after the VM was paused, and as long as
>> the unmap is guaranteed to happen before the ring update, we're OK.
> 
> AIUI blkfront at least maintains it's own shadow copy of the ring at all
> times, and the recovery process doesn't use the migrated copy of the
> ring at all (at least not the responses). I might be misunderstanding
> the code there though.

Yes, this is what it looks like to me.  netfront is also safe since it
just discards everything and doesn't replay at all.

This does rather feel like we're discovering problems that were
identified (and fixed) a long time ago.

I think there needs to be better documentation of front/backend drivers
so people know how to write them correctly.  I'll try to carve out some
time for this if no one else volunteers...

Perhaps all we need is a clear statement of: The contents of shared ring
are undefined on resume.  A frontend driver must not use the shared ring
to replay any requests.

>> (That still leaves the question that Andrew raised of memcpy()
>> breaking atomicity/ordering of updates.)
> 
> That's the memcpy in the migration code vs the definitely correctly
> ordered updates done by the b.e., right?

But if the frontend doesn't use the shared ring for replay, it doesn't
matter if this memcpy is correctly ordered or not?

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30 11:57           ` David Vrabel
@ 2014-06-30 12:20             ` Ian Campbell
  0 siblings, 0 replies; 24+ messages in thread
From: Ian Campbell @ 2014-06-30 12:20 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Tim Deegan, Xen-devel List, Paul Durrant,
	Jan Beulich, Ian Jackson

On Mon, 2014-06-30 at 12:57 +0100, David Vrabel wrote:
> On 30/06/14 12:07, Ian Campbell wrote:
> > On Mon, 2014-06-30 at 12:52 +0200, Tim Deegan wrote:
> >> At 12:14 +0200 on 30 Jun (1404126862), Tim Deegan wrote:
> >>> At 10:47 +0100 on 30 Jun (1404121679), David Vrabel wrote:
> >>>> Shared ring updates are strictly ordered with respect to the writes to
> >>>> data pages (either via grant map or grant copy).  This means that is the
> >>>> guest sees a response in the ring it is guaranteed that all writes to
> >>>> the associated pages are also present.
> >>>
> >>> Is the ring update also strictly ordered wrt the grant unmap operation?
> >>>
> >>>> The write of the response and the write of the producer index are
> >>>> strictly ordered.  If the backend is in the process of writing a
> >>>> response and the page is saved then the partial (corrupt) response is
> >>>> not visible to the guest.  The write of the producer index is atomic so
> >>>> the saver cannot see a partial producer index write.
> >>>
> >>> Yes.  The (suggested) problem is that live migration does not preserve
> >>> that write ordering.  So we have to worry about something like this:
> >>>
> >>> 1. Toolstack pauses the domain for the final pass.  Reads the final
> >>>    LGD bitmap, which happens to include the shared ring but not the
> >>>    data pages.
> >>> 2. Backend writes the data.
> >>> 3. Backend unmaps the data page, marking it dirty.
> >>> 4. Backend writes the ring.
> >>> 5. Toolstack sends the ring page across in the last pass.
> >>> 6. Guest resumes, seeing the I/O marked as complete, but without the
> >>>    data.
> >>
> >> It occurs to me that the guest should be able to defend against this
> >> by taking a local copy of the response producer before migration and
> >> using _that_ for the replay logic afterwards.  That is guaranteed to
> >> exclude any I/O that completed after the VM was paused, and as long as
> >> the unmap is guaranteed to happen before the ring update, we're OK.
> > 
> > AIUI blkfront at least maintains it's own shadow copy of the ring at all
> > times, and the recovery process doesn't use the migrated copy of the
> > ring at all (at least not the responses). I might be misunderstanding
> > the code there though.
> 
> Yes, this is what it looks like to me.  netfront is also safe since it
> just discards everything and doesn't replay at all.
> 
> This does rather feel like we're discovering problems that were
> identified (and fixed) a long time ago.
> 
> I think there needs to be better documentation of front/backend drivers
> so people know how to write them correctly.

Yes :-/

> I'll try to carve out some
> time for this if no one else volunteers...

Thanks, that would be wonderful as always.

(Maybe add a TODO to the doc day page too?)

> Perhaps all we need is a clear statement of: The contents of shared ring
> are undefined on resume.  A frontend driver must not use the shared ring
> to replay any requests.

I *think* that should be sufficient.

> 
> >> (That still leaves the question that Andrew raised of memcpy()
> >> breaking atomicity/ordering of updates.)
> > 
> > That's the memcpy in the migration code vs the definitely correctly
> > ordered updates done by the b.e., right?
> 
> But if the frontend doesn't use the shared ring for replay, it doesn't
> matter if this memcpy is correctly ordered or not?

That's my understanding, yes.

> 
> David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-06-30  9:53     ` Ian Campbell
@ 2014-07-01 10:29       ` David Vrabel
  2014-07-02 10:02         ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: David Vrabel @ 2014-07-01 10:29 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Andrew Cooper, Tim Deegan, Xen-devel List, Paul Durrant,
	Jan Beulich, Ian Jackson

On 30/06/14 10:53, Ian Campbell wrote:
> On Mon, 2014-06-30 at 10:47 +0100, David Vrabel wrote:
>
>> Using persistent grants means that backend writes may be lost since the
>> memcpy in the backend does not update the dirty bitmap and guest writes
>> to the shared ring may make the backends response writes visible.  I
>> think we may need to disable persistent grant support from blkback until
>> a mechanism for updating the dirty log is in place.
> 
> Yes, sadly I think you might be right here.

The copy in the frontend from the persistently granted pages to the
pages in the bio dirties those pages so this is actually safe.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-07-01 10:29       ` David Vrabel
@ 2014-07-02 10:02         ` Ian Campbell
  2014-07-02 10:03           ` David Vrabel
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-07-02 10:02 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Tim Deegan, Xen-devel List, Paul Durrant,
	Jan Beulich, Ian Jackson

On Tue, 2014-07-01 at 11:29 +0100, David Vrabel wrote:
> On 30/06/14 10:53, Ian Campbell wrote:
> > On Mon, 2014-06-30 at 10:47 +0100, David Vrabel wrote:
> >
> >> Using persistent grants means that backend writes may be lost since the
> >> memcpy in the backend does not update the dirty bitmap and guest writes
> >> to the shared ring may make the backends response writes visible.  I
> >> think we may need to disable persistent grant support from blkback until
> >> a mechanism for updating the dirty log is in place.
> > 
> > Yes, sadly I think you might be right here.
> 
> The copy in the frontend from the persistently granted pages to the
> pages in the bio dirties those pages so this is actually safe.

What about if the pause+copy happens after the I/O has completed
(response on the ring) and the copy out to the bio pages? Meaning that
on resume the copy would pick up the stale data from the persistent
grant.

I think the answer is that the frontend won't consider things actually
complete until after the copy and therefore that read will be reissued.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Migration memory corruption - PV backends need to quiesce
  2014-07-02 10:02         ` Ian Campbell
@ 2014-07-02 10:03           ` David Vrabel
  0 siblings, 0 replies; 24+ messages in thread
From: David Vrabel @ 2014-07-02 10:03 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Andrew Cooper, Tim Deegan, Xen-devel List, Paul Durrant,
	Jan Beulich, Ian Jackson

On 02/07/14 11:02, Ian Campbell wrote:
> On Tue, 2014-07-01 at 11:29 +0100, David Vrabel wrote:
>> On 30/06/14 10:53, Ian Campbell wrote:
>>> On Mon, 2014-06-30 at 10:47 +0100, David Vrabel wrote:
>>>
>>>> Using persistent grants means that backend writes may be lost since the
>>>> memcpy in the backend does not update the dirty bitmap and guest writes
>>>> to the shared ring may make the backends response writes visible.  I
>>>> think we may need to disable persistent grant support from blkback until
>>>> a mechanism for updating the dirty log is in place.
>>>
>>> Yes, sadly I think you might be right here.
>>
>> The copy in the frontend from the persistently granted pages to the
>> pages in the bio dirties those pages so this is actually safe.
> 
> What about if the pause+copy happens after the I/O has completed
> (response on the ring) and the copy out to the bio pages? Meaning that
> on resume the copy would pick up the stale data from the persistent
> grant.
> 
> I think the answer is that the frontend won't consider things actually
> complete until after the copy and therefore that read will be reissued.

Yes.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-07-02 10:03 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-27 16:51 Migration memory corruption - PV backends need to quiesce Andrew Cooper
2014-06-27 17:28 ` David Vrabel
2014-06-27 18:15   ` Tim Deegan
2014-06-27 18:37     ` Andrew Cooper
2014-06-30  8:38 ` Ian Campbell
2014-06-30  9:02   ` Andrew Cooper
2014-06-30  9:21     ` Ian Campbell
2014-06-30  9:46       ` Andrew Cooper
2014-06-30  9:52         ` Ian Campbell
2014-06-30 10:13           ` Andrew Cooper
2014-06-30  9:47   ` David Vrabel
2014-06-30  9:53     ` Ian Campbell
2014-07-01 10:29       ` David Vrabel
2014-07-02 10:02         ` Ian Campbell
2014-07-02 10:03           ` David Vrabel
2014-06-30 10:14     ` Tim Deegan
2014-06-30 10:24       ` Ian Campbell
2014-06-30 10:52       ` Tim Deegan
2014-06-30 11:07         ` Ian Campbell
2014-06-30 11:12           ` Tim Deegan
2014-06-30 11:57           ` David Vrabel
2014-06-30 12:20             ` Ian Campbell
2014-06-30 11:01       ` Ian Campbell
2014-06-30 11:08         ` Tim Deegan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.