All of lore.kernel.org
 help / color / mirror / Atom feed
From: George Dunlap <george.dunlap@eu.citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: "Liu, Jinsong" <jinsong.liu@intel.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Date: Thu, 29 Nov 2012 10:50:50 +0000	[thread overview]
Message-ID: <50B73E0A.1080600@eu.citrix.com> (raw)
In-Reply-To: <1354183356.25834.108.camel@zakaz.uk.xensource.com>

On 29/11/12 10:02, Ian Campbell wrote:
> On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
>> Ping?
> Sorry I've been meaning to reply but didn't manage to yet. Also you
> replied to V4 saying to ignore it, so I was half waiting for V5 but I
> see this should actually be labelled V5 anyway.
>
> I'm afraid I still don't fully grok the reason for the loop that goes
> with:
> +            /*
> +             * At the last iter, count the number of broken pages after sending,
> +             * and if there are more than before sending, do one or more iter
> +             * to make sure the pages are marked broken on the receiving side.
> +             */
>
> Can we go through it one more time? Sorry.
>
> Let me outline the sequence of events and you can point out where I'm
> going wrong. I'm afraid this has turned out to be rather long, again I'm
> sorry for that.
>
> First we do some number of iterations with the guest live. If an MCE
> occurs during this phase then the page will be marked dirty and we will
> pick this up on the next iteration and resend the page with the dirty
> type etc and all is fine. This all looks good to me, so we don't need to
> worry about anything at this stage.
>
> Eventually we get to the last iteration, at which point we pause the
> guest. From here on in the guest itself is not going to cause an MCE
> (e.g. by touching its RAM) because it is not running but we must still
> account for the possibility of an asynchronous MCE of some sort e.g.
> triggered by the error detection mechanisms in the hardware, cosmic rays
> and such like.
>
> The final iteration proceeds roughly as follows.
>
>       1. The domain is paused
>       2. We scan the dirty bitmap and add dirty pages to the batch of
>          pages to process (there may be several batches in the last
>          iteration, we only need to concern ourselves with any one batch
>          here).
>       3. We map all of the pages in the resulting batch with
>          xc_map_foreign_bulk
>       4. We query the types of all the pages in the batch with
>          xc_get_pfn_type_batch
>       5. We iterate over the batch, checking the type of each page, in
>          some cases we do some incidental processing.
>       6. We send the types of the pages in the batch over the wire.
>       7. We iterate over the batch again, and send any normal page (not
>          broken, xtab etc) over the wire. Actually we do this as runs of
>          normal pages, but the key point is we avoid touching any special
>          page (including ones marked as broken by #4)
>
> Is this sequence of events accurate?
>
> Now lets consider the consequences of an MCE occurring at various stages
> here.
>
> Any MCE which happens before #4 is fine, we will pick that up in #4 and
> the following steps will do the right thing.
>
> Note that I am assuming that the mapping step in #3 is safe even for a
> broken page, so long as we don't actually try and use the mapping (more
> on that later), is this true?
>
> If an MCE occurs after #4 then the page will be marked as dirty in the
> bitmap and Xen will internally mark it as broken, but we won't see
> either of those with the current algorithm. There are two cases to think
> about here AFAICT,
>       A. The page was not already dirty at #2. In this case we know that
>          the guest hasn't dirtied the page since the previous iteration
>          and therefore the target has a good copy of this page from that
>          time. The page isn't even in the batch we are processing So we
>          don't particularly care about the MCE here and can, from the PoV
>          of migrating this guest, ignore it.
>       B. The page was already dirty (but not broken, we handled that case
>          above in "Any MCE which happens before #4...") at #2 which means
>          we have do not have an up to date copy on the target. This has
>          two subcases:
>               I. The MCE occurs before (or during) #6 (sending the page)
>                  and therefore we do not have a good up to date copy of
>                  that data at either end.
>              II. The MCE occurs after #6, in which case we already have a
>                  good copy at the target end.
>
> To fix B you have added an 8th step to the above:
>
>          8. Query the types of the pages again, using
>          xc_get_pfn_type_batch, and if there are more pages dirty now
>          than we say at #4 (actually #5 when we scanned the array, but
>          that distinction doesn't matter) then a new MCE must have
>          occurred. Go back to #2 and try again.
>
> This won't do anything for A since the page wasn't in the batch to start
> with and so neither #4 or #8 will look at its type, this is good and
> proper.
>
> So now we consider the two subcases of B. Lets consider B.II first since
> it seems to be the more obvious case.
>
> In case B.II the target end already has a good copy of the data page,
> there is no need to mark the page as broken on the far end, nor to
> arrange for a vMCE to be injected. I don't know if/how we arrange for
> vMCEs to be injected under these circumstances, however even if a vMCE
> does get injected into the guest when it eventually gets unpaused on the
> target then all that will happen is that it will needlessly throw away a
> good page. However this is a rare corner case which is not worth
> concerning ourselves with (it's largely indistinguishable from case A).
> If the MCE had happened even a single cycle earlier then this would have
> been a B.I event instead of a B.II one. In any case there is no need to
> return to #2 and try again, everything will be just fine if we complete
> the migration at this point.
>
> In case B.I the MCE occurs before (or while) we send the page onto the
> wire. We will therefore try to read from this page because we haven't
> looked at the type since #4 and have no idea that it is now broken.
> Reading from the broken page will cause a fault, perhaps causing a vMCE
> to be delivered to dom0, which causes the kernel to kill the process
> doing the migration. Or maybe it kills dom0 or the host entirely. Either
> way the idea of looping again is rather moot.
>
> Have I missed a case which needs thinking about?
>
> I suspect B.I is the case where you are most likely to find a flaw in my
> argument. Is there something else which is done in this case which would
> allow us to continue?

I think your analysis is correct -- the only question is whether B.I 
will 100% cause the migration to crash, or whether there's a chance of 
not crashing on the read.  I had tried to ask that question before, and 
understood Jinsong's response to be that it's not 100% sure that the 
read would cause an error.  However, looking back at the thread, I think 
I may have understood something that was not there. :-)

So I guess the question for Jinsong is this:

The only time this extra loop could help is if there is a page broken 
after being paused but before being sent the last time (B.I in Ian's 
analysis) -- in which case, the migration code is 100% guaranteed to 
read a now-broken page.  What are the chances that this read of a broken 
page will *not* cause a fault which will kill at least the migration 
process, if not dom0?  If the chances are "slim-to-none", then there is 
no point for the extra check.

  -George

  reply	other threads:[~2012-11-29 10:50 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-23  0:25 [PATCH V4] X86/vMCE: handle broken page with regard to migration Liu Jinsong
2012-11-23 16:26 ` George Dunlap
2012-11-25 13:24   ` Liu, Jinsong
2012-11-28 14:37   ` Liu, Jinsong
2012-11-29 10:02     ` Ian Campbell
2012-11-29 10:50       ` George Dunlap [this message]
2012-11-30 18:57         ` Liu, Jinsong
2012-11-30 18:51       ` Liu, Jinsong
2012-12-03 11:24         ` George Dunlap
2012-12-05 15:57         ` Liu, Jinsong
2012-12-05 15:59           ` George Dunlap
2012-12-05 16:36             ` Liu, Jinsong
2012-12-05 16:01           ` Ian Campbell
2012-12-05 16:15             ` Liu, Jinsong
  -- strict thread matches above, loose matches on Subject: below --
2012-11-23  0:13 Liu Jinsong
2012-11-22 16:28 ` Liu, Jinsong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50B73E0A.1080600@eu.citrix.com \
    --to=george.dunlap@eu.citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=jinsong.liu@intel.com \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.