From mboxrd@z Thu Jan  1 00:00:00 1970
From: George Dunlap <george.dunlap@eu.citrix.com>
Subject: Re: [PATCH V4] X86/vMCE: handle broken page with regard
 to migration
Date: Thu, 29 Nov 2012 10:50:50 +0000
Message-ID: <50B73E0A.1080600@eu.citrix.com>
References: <d3378692eece3e552ba2.1353630328@ljsromley.bj.intel.com>
	<CAFLBxZZ_W5PMLYiqEwdrpqkjgiMN9NoN2f33ErLriOwNOD9GgA@mail.gmail.com>
	<DE8DF0795D48FD4CA783C40EC82923353996FF@SHSMSX101.ccr.corp.intel.com>
	<1354183356.25834.108.camel@zakaz.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <1354183356.25834.108.camel@zakaz.uk.xensource.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: "Liu, Jinsong" <jinsong.liu@intel.com>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, Jan Beulich <JBeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org

On 29/11/12 10:02, Ian Campbell wrote:
> On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
>> Ping?
> Sorry I've been meaning to reply but didn't manage to yet. Also you
> replied to V4 saying to ignore it, so I was half waiting for V5 but I
> see this should actually be labelled V5 anyway.
>
> I'm afraid I still don't fully grok the reason for the loop that goes
> with:
> +            /*
> +             * At the last iter, count the number of broken pages after sending,
> +             * and if there are more than before sending, do one or more iter
> +             * to make sure the pages are marked broken on the receiving side.
> +             */
>
> Can we go through it one more time? Sorry.
>
> Let me outline the sequence of events and you can point out where I'm
> going wrong. I'm afraid this has turned out to be rather long, again I'm
> sorry for that.
>
> First we do some number of iterations with the guest live. If an MCE
> occurs during this phase then the page will be marked dirty and we will
> pick this up on the next iteration and resend the page with the dirty
> type etc and all is fine. This all looks good to me, so we don't need to
> worry about anything at this stage.
>
> Eventually we get to the last iteration, at which point we pause the
> guest. From here on in the guest itself is not going to cause an MCE
> (e.g. by touching its RAM) because it is not running but we must still
> account for the possibility of an asynchronous MCE of some sort e.g.
> triggered by the error detection mechanisms in the hardware, cosmic rays
> and such like.
>
> The final iteration proceeds roughly as follows.
>
>       1. The domain is paused
>       2. We scan the dirty bitmap and add dirty pages to the batch of
>          pages to process (there may be several batches in the last
>          iteration, we only need to concern ourselves with any one batch
>          here).
>       3. We map all of the pages in the resulting batch with
>          xc_map_foreign_bulk
>       4. We query the types of all the pages in the batch with
>          xc_get_pfn_type_batch
>       5. We iterate over the batch, checking the type of each page, in
>          some cases we do some incidental processing.
>       6. We send the types of the pages in the batch over the wire.
>       7. We iterate over the batch again, and send any normal page (not
>          broken, xtab etc) over the wire. Actually we do this as runs of
>          normal pages, but the key point is we avoid touching any special
>          page (including ones marked as broken by #4)
>
> Is this sequence of events accurate?
>
> Now lets consider the consequences of an MCE occurring at various stages
> here.
>
> Any MCE which happens before #4 is fine, we will pick that up in #4 and
> the following steps will do the right thing.
>
> Note that I am assuming that the mapping step in #3 is safe even for a
> broken page, so long as we don't actually try and use the mapping (more
> on that later), is this true?
>
> If an MCE occurs after #4 then the page will be marked as dirty in the
> bitmap and Xen will internally mark it as broken, but we won't see
> either of those with the current algorithm. There are two cases to think
> about here AFAICT,
>       A. The page was not already dirty at #2. In this case we know that
>          the guest hasn't dirtied the page since the previous iteration
>          and therefore the target has a good copy of this page from that
>          time. The page isn't even in the batch we are processing So we
>          don't particularly care about the MCE here and can, from the PoV
>          of migrating this guest, ignore it.
>       B. The page was already dirty (but not broken, we handled that case
>          above in "Any MCE which happens before #4...") at #2 which means
>          we have do not have an up to date copy on the target. This has
>          two subcases:
>               I. The MCE occurs before (or during) #6 (sending the page)
>                  and therefore we do not have a good up to date copy of
>                  that data at either end.
>              II. The MCE occurs after #6, in which case we already have a
>                  good copy at the target end.
>
> To fix B you have added an 8th step to the above:
>
>          8. Query the types of the pages again, using
>          xc_get_pfn_type_batch, and if there are more pages dirty now
>          than we say at #4 (actually #5 when we scanned the array, but
>          that distinction doesn't matter) then a new MCE must have
>          occurred. Go back to #2 and try again.
>
> This won't do anything for A since the page wasn't in the batch to start
> with and so neither #4 or #8 will look at its type, this is good and
> proper.
>
> So now we consider the two subcases of B. Lets consider B.II first since
> it seems to be the more obvious case.
>
> In case B.II the target end already has a good copy of the data page,
> there is no need to mark the page as broken on the far end, nor to
> arrange for a vMCE to be injected. I don't know if/how we arrange for
> vMCEs to be injected under these circumstances, however even if a vMCE
> does get injected into the guest when it eventually gets unpaused on the
> target then all that will happen is that it will needlessly throw away a
> good page. However this is a rare corner case which is not worth
> concerning ourselves with (it's largely indistinguishable from case A).
> If the MCE had happened even a single cycle earlier then this would have
> been a B.I event instead of a B.II one. In any case there is no need to
> return to #2 and try again, everything will be just fine if we complete
> the migration at this point.
>
> In case B.I the MCE occurs before (or while) we send the page onto the
> wire. We will therefore try to read from this page because we haven't
> looked at the type since #4 and have no idea that it is now broken.
> Reading from the broken page will cause a fault, perhaps causing a vMCE
> to be delivered to dom0, which causes the kernel to kill the process
> doing the migration. Or maybe it kills dom0 or the host entirely. Either
> way the idea of looping again is rather moot.
>
> Have I missed a case which needs thinking about?
>
> I suspect B.I is the case where you are most likely to find a flaw in my
> argument. Is there something else which is done in this case which would
> allow us to continue?

I think your analysis is correct -- the only question is whether B.I 
will 100% cause the migration to crash, or whether there's a chance of 
not crashing on the read.  I had tried to ask that question before, and 
understood Jinsong's response to be that it's not 100% sure that the 
read would cause an error.  However, looking back at the thread, I think 
I may have understood something that was not there. :-)

So I guess the question for Jinsong is this:

The only time this extra loop could help is if there is a page broken 
after being paused but before being sent the last time (B.I in Ian's 
analysis) -- in which case, the migration code is 100% guaranteed to 
read a now-broken page.  What are the chances that this read of a broken 
page will *not* cause a fault which will kill at least the migration 
process, if not dom0?  If the chances are "slim-to-none", then there is 
no point for the extra check.

  -George