From mboxrd@z Thu Jan 1 00:00:00 1970 From: George Dunlap Subject: Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration Date: Thu, 29 Nov 2012 10:50:50 +0000 Message-ID: <50B73E0A.1080600@eu.citrix.com> References: <1354183356.25834.108.camel@zakaz.uk.xensource.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1354183356.25834.108.camel@zakaz.uk.xensource.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Campbell Cc: "Liu, Jinsong" , "xen-devel@lists.xensource.com" , Ian Jackson , Jan Beulich List-Id: xen-devel@lists.xenproject.org On 29/11/12 10:02, Ian Campbell wrote: > On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote: >> Ping? > Sorry I've been meaning to reply but didn't manage to yet. Also you > replied to V4 saying to ignore it, so I was half waiting for V5 but I > see this should actually be labelled V5 anyway. > > I'm afraid I still don't fully grok the reason for the loop that goes > with: > + /* > + * At the last iter, count the number of broken pages after sending, > + * and if there are more than before sending, do one or more iter > + * to make sure the pages are marked broken on the receiving side. > + */ > > Can we go through it one more time? Sorry. > > Let me outline the sequence of events and you can point out where I'm > going wrong. I'm afraid this has turned out to be rather long, again I'm > sorry for that. > > First we do some number of iterations with the guest live. If an MCE > occurs during this phase then the page will be marked dirty and we will > pick this up on the next iteration and resend the page with the dirty > type etc and all is fine. This all looks good to me, so we don't need to > worry about anything at this stage. > > Eventually we get to the last iteration, at which point we pause the > guest. From here on in the guest itself is not going to cause an MCE > (e.g. by touching its RAM) because it is not running but we must still > account for the possibility of an asynchronous MCE of some sort e.g. > triggered by the error detection mechanisms in the hardware, cosmic rays > and such like. > > The final iteration proceeds roughly as follows. > > 1. The domain is paused > 2. We scan the dirty bitmap and add dirty pages to the batch of > pages to process (there may be several batches in the last > iteration, we only need to concern ourselves with any one batch > here). > 3. We map all of the pages in the resulting batch with > xc_map_foreign_bulk > 4. We query the types of all the pages in the batch with > xc_get_pfn_type_batch > 5. We iterate over the batch, checking the type of each page, in > some cases we do some incidental processing. > 6. We send the types of the pages in the batch over the wire. > 7. We iterate over the batch again, and send any normal page (not > broken, xtab etc) over the wire. Actually we do this as runs of > normal pages, but the key point is we avoid touching any special > page (including ones marked as broken by #4) > > Is this sequence of events accurate? > > Now lets consider the consequences of an MCE occurring at various stages > here. > > Any MCE which happens before #4 is fine, we will pick that up in #4 and > the following steps will do the right thing. > > Note that I am assuming that the mapping step in #3 is safe even for a > broken page, so long as we don't actually try and use the mapping (more > on that later), is this true? > > If an MCE occurs after #4 then the page will be marked as dirty in the > bitmap and Xen will internally mark it as broken, but we won't see > either of those with the current algorithm. There are two cases to think > about here AFAICT, > A. The page was not already dirty at #2. In this case we know that > the guest hasn't dirtied the page since the previous iteration > and therefore the target has a good copy of this page from that > time. The page isn't even in the batch we are processing So we > don't particularly care about the MCE here and can, from the PoV > of migrating this guest, ignore it. > B. The page was already dirty (but not broken, we handled that case > above in "Any MCE which happens before #4...") at #2 which means > we have do not have an up to date copy on the target. This has > two subcases: > I. The MCE occurs before (or during) #6 (sending the page) > and therefore we do not have a good up to date copy of > that data at either end. > II. The MCE occurs after #6, in which case we already have a > good copy at the target end. > > To fix B you have added an 8th step to the above: > > 8. Query the types of the pages again, using > xc_get_pfn_type_batch, and if there are more pages dirty now > than we say at #4 (actually #5 when we scanned the array, but > that distinction doesn't matter) then a new MCE must have > occurred. Go back to #2 and try again. > > This won't do anything for A since the page wasn't in the batch to start > with and so neither #4 or #8 will look at its type, this is good and > proper. > > So now we consider the two subcases of B. Lets consider B.II first since > it seems to be the more obvious case. > > In case B.II the target end already has a good copy of the data page, > there is no need to mark the page as broken on the far end, nor to > arrange for a vMCE to be injected. I don't know if/how we arrange for > vMCEs to be injected under these circumstances, however even if a vMCE > does get injected into the guest when it eventually gets unpaused on the > target then all that will happen is that it will needlessly throw away a > good page. However this is a rare corner case which is not worth > concerning ourselves with (it's largely indistinguishable from case A). > If the MCE had happened even a single cycle earlier then this would have > been a B.I event instead of a B.II one. In any case there is no need to > return to #2 and try again, everything will be just fine if we complete > the migration at this point. > > In case B.I the MCE occurs before (or while) we send the page onto the > wire. We will therefore try to read from this page because we haven't > looked at the type since #4 and have no idea that it is now broken. > Reading from the broken page will cause a fault, perhaps causing a vMCE > to be delivered to dom0, which causes the kernel to kill the process > doing the migration. Or maybe it kills dom0 or the host entirely. Either > way the idea of looping again is rather moot. > > Have I missed a case which needs thinking about? > > I suspect B.I is the case where you are most likely to find a flaw in my > argument. Is there something else which is done in this case which would > allow us to continue? I think your analysis is correct -- the only question is whether B.I will 100% cause the migration to crash, or whether there's a chance of not crashing on the read. I had tried to ask that question before, and understood Jinsong's response to be that it's not 100% sure that the read would cause an error. However, looking back at the thread, I think I may have understood something that was not there. :-) So I guess the question for Jinsong is this: The only time this extra loop could help is if there is a page broken after being paused but before being sent the last time (B.I in Ian's analysis) -- in which case, the migration code is 100% guaranteed to read a now-broken page. What are the chances that this read of a broken page will *not* cause a fault which will kill at least the migration process, if not dom0? If the chances are "slim-to-none", then there is no point for the extra check. -George