Re: Migration memory corruption - PV backends need to quiesce

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Vrabel <david.vrabel@citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Tim Deegan <tim@xen.org>, Ian Jackson <Ian.Jackson@eu.citrix.com>,
	Xen-devel List <xen-devel@lists.xen.org>,
	Paul Durrant <Paul.Durrant@citrix.com>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: Migration memory corruption - PV backends need to quiesce
Date: Mon, 30 Jun 2014 10:47:59 +0100	[thread overview]
Message-ID: <53B1324F.1090304@citrix.com> (raw)
In-Reply-To: <1404117494.3827.8.camel@kazak.uk.xensource.com>

On 30/06/14 09:38, Ian Campbell wrote:
> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
>> Hello,
>>
>> After a long time fixing my own memory corruption bugs with migration
>> v2, I have finally tracked down (what I really really hope is) the last
>> of the corruption.
>>
>>
>> There appears to be a systematic problem affecting all PV drivers,
>> whereby a non-quiescent backend can cause memory corruption in the VM.
>>
>> Active grant mapped pages are only reflected in the dirty bitmap after
>> the grant has been unmapped, as mapping the ring read-only would be
>> catastrophic to performance, and remapping as read-only when logdirty is
>> enabled is (as far as I understand) impossible, as Xen doesn't track the
>> PTEs pointing at granted frames.
>>
>> PV backend drivers hold their mappings of the rings (and persistently
>> granted frames) open until the domain is destroyed, which is after the
>> memory image has been sent.  Therefore, any requests which are processed
>> after the migration code sending the ring frame on its first pass will
>> not be reflected in the resumed domain, as this frame will never be
>> marked as dirty in Xen.
>>
>> Furthermore, as the migration code uses memcpy() on the frames, it is
>> possible that a backed update intersects with the copy, and a corrupt
>> descriptor appears on the resumed side.
>>
>> In addition, after the domain has been paused, the backend might still
>> process requests.  The migration code excepts the guest be completely
>> quiesced after it has been suspended, so will only check the dirty
>> bitmap once.  Any requests which get processed and completed might still
>> be missed by the migration code.
>>
>> From a heavily instrumented Xen and migration code, I am fairly sure I
>> have confirmed that all pages corrupted on migration are a result of
>> still-active grant maps, grant copies which complete after domain
>> suspend, or the xenstore ring which xenstored has a magic mapping of,
>> and will never be reflected in the dirty bitmap.
>>
>>
>> Overall, it would appear that there needs to be a hook for all PV
>> drivers to force quiescence.  In particular, a backend must guarantee to
>> unmap all active grant maps (so the frames get properly reflected in the
>> dirty bitmap), and never process subsequent requests (so no new frames
>> appear dirty in the bitmap after the guest has been paused).
>>
>> Thoughts/comments?
> 
> I thought PV drivers were already (supposed to be) handling this in the
> frontend.
> 
> For reasons of checkpoint performance I think Linux's net and blkfront
> are handling this on resume rather than on suspend by tearing down on
> resume and then requeueing any outstanding I/O after they reattach to
> the new backend. In the blkfront case this is explicit, whereas IIRC
> netfront just discards any active requests and relies on L3+
> retransmition to get the job done. (see netfront_resume and
> blkfront_resume/blkif_recover).
> 
> Part of the tear down and reconnect should involve invalidating any
> inflight descriptors, whether or not they were partially completed or
> have corrupted replies in them etc. This ought to be happening before
> the new backend sees the ring at all.
> 
> Can you give an example of an instance of the corruption which you've
> seen?
> 
> Do you have a theory why this wasn't seen with the migration v1 code?

I think it does not result in any problems (except if persistent grants
are used, see below).

Shared ring updates are strictly ordered with respect to the writes to
data pages (either via grant map or grant copy).  This means that is the
guest sees a response in the ring it is guaranteed that all writes to
the associated pages are also present.

The write of the response and the write of the producer index are
strictly ordered.  If the backend is in the process of writing a
response and the page is saved then the partial (corrupt) response is
not visible to the guest.  The write of the producer index is atomic so
the saver cannot see a partial producer index write.

Using persistent grants means that backend writes may be lost since the
memcpy in the backend does not update the dirty bitmap and guest writes
to the shared ring may make the backends response writes visible.  I
think we may need to disable persistent grant support from blkback until
a mechanism for updating the dirty log is in place.

David

next prev parent reply	other threads:[~2014-06-30  9:47 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-27 16:51 Migration memory corruption - PV backends need to quiesce Andrew Cooper
2014-06-27 17:28 ` David Vrabel
2014-06-27 18:15   ` Tim Deegan
2014-06-27 18:37     ` Andrew Cooper
2014-06-30  8:38 ` Ian Campbell
2014-06-30  9:02   ` Andrew Cooper
2014-06-30  9:21     ` Ian Campbell
2014-06-30  9:46       ` Andrew Cooper
2014-06-30  9:52         ` Ian Campbell
2014-06-30 10:13           ` Andrew Cooper
2014-06-30  9:47   ` David Vrabel [this message]
2014-06-30  9:53     ` Ian Campbell
2014-07-01 10:29       ` David Vrabel
2014-07-02 10:02         ` Ian Campbell
2014-07-02 10:03           ` David Vrabel
2014-06-30 10:14     ` Tim Deegan
2014-06-30 10:24       ` Ian Campbell
2014-06-30 10:52       ` Tim Deegan
2014-06-30 11:07         ` Ian Campbell
2014-06-30 11:12           ` Tim Deegan
2014-06-30 11:57           ` David Vrabel
2014-06-30 12:20             ` Ian Campbell
2014-06-30 11:01       ` Ian Campbell
2014-06-30 11:08         ` Tim Deegan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53B1324F.1090304@citrix.com \
    --to=david.vrabel@citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=Paul.Durrant@citrix.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=tim@xen.org \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.