From: David Gibson <david@gibson.dropbear.id.au>
To: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Cc: qemu-ppc@nongnu.org, qemu-devel@nongnu.org, aik@ozlabs.ru,
mahesh@linux.vnet.ibm.com, benh@au1.ibm.com, paulus@samba.org,
sam.bobroff@au1.ibm.com
Subject: Re: [Qemu-devel] [PATCH v3 2/5] ppc: spapr: Handle "ibm, nmi-register" and "ibm, nmi-interlock" RTAS calls
Date: Thu, 28 Sep 2017 13:58:15 +1000 [thread overview]
Message-ID: <20170928035815.GX12504@umbus> (raw)
In-Reply-To: <bcc42ba6-6c71-d8d0-a196-7588a4d4a0d2@linux.vnet.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 5575 bytes --]
On Wed, Sep 27, 2017 at 05:23:51PM +0530, Aravinda Prasad wrote:
>
>
> On Wednesday 27 September 2017 12:45 PM, David Gibson wrote:
> > On Thu, Sep 21, 2017 at 02:39:06PM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Tuesday 22 August 2017 07:38 AM, David Gibson wrote:
> >>
> >> [ . . . ]
> >>
> >>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>>>> index 46012b3..eee8d33 100644
> >>>>>> --- a/include/hw/ppc/spapr.h
> >>>>>> +++ b/include/hw/ppc/spapr.h
> >>>>>> @@ -123,6 +123,12 @@ struct sPAPRMachineState {
> >>>>>> * occurs during the unplug process. */
> >>>>>> QTAILQ_HEAD(, sPAPRDIMMState) pending_dimm_unplugs;
> >>>>>>
> >>>>>> + /* State related to "ibm,nmi-register" and "ibm,nmi-interlock" calls */
> >>>>>> + target_ulong guest_machine_check_addr;
> >>>>>> + bool mc_in_progress;
> >>>>>> + int mc_cpu;
> >>>>>
> >>>>> mc_cpu isn't actually used yet in this patch. In any case it and
> >>>>> mc_in_progress could probably be folded together, no?
> >>>>
> >>>> It is possible to fold mc_cpu and mc_in_progress together with the
> >>>> convention that if it is set to -1 mc is not in progress otherwise it is
> >>>> set to the CPU handling the mc.
> >>>>
> >>>>>
> >>>>> These values will also need to be migrated, AFAICT.
> >>>>
> >>>> I am thinking of how to handle the migration when machine check handling
> >>>> is in progress. Probably wait for machine check handling to complete
> >>>> before migrating as the error could be irrelevant once migrated to a new
> >>>> hardware. If that is the case we don't need to migrate these values.
> >>>
> >>> Ok.
> >>
> >> This is what I think about handling machine check during migration based
> >> on my understanding of the VM migration code.
> >>
> >> There are two possibilities here. First, migration can be initiated
> >> while the machine check handling is in progress. Second, A machine check
> >> error can happen when the migration is in progress.
> >>
> >> To handle the first case we can add migrate_add_blocker() call when we
> >> start handling the machine check error and issue migrate_del_blocker()
> >> when done. I think this should solve the issue.
> >>
> >> The second case is bit tricky. The migration has already started and
> >> hence migrate_add_blocker() call will fail. We also cannot wait till the
> >> completion of the migration to handle machine check error as the VM's
> >> data could be corrupt.
> >>
> >> Machine check errors should not be an issue when the migration is in the
> >> RAM copy phase as VM is still active with vCPUs running. The problem is
> >> when we hit a machine check when the migration is about to complete. For
> >> example,
> >>
> >> 1. vCPU2 hits a machine check error during migration.
> >>
> >> 2. KVM causes VM exit on vCPU2 and the NIP of vCPU2 is changed to the
> >> guest registered machine check handler.
> >>
> >> 3. The migration_completion() issues vm_stop() and hence either vCPU2 is
> >> never scheduled again on the source hardware or vCPU2 is preempted while
> >> executing the machine check handler.
> >>
> >> 4. vCPU2 is resumed on the target hardware and either starts or
> >> continues processing the machine check error. This could be a problem as
> >> these errors are specific to the source hardware. For instance, when the
> >> the guest issues memory poisoning upon such error, a clean page on the
> >> target hardware is poisoned while the corrupt page on source hardware is
> >> not poisoned.
> >>
> >> The second case of hitting machine check during the final phase of
> >> migration is rare but wanted to check what others think about it.
> >
> > So, I've had a bit of a think about this. I don't recall if these
> > fwnmi machine checks are expected on guest RAM, or guest IO addresses.
>
> It is expected on guest RAM. I am not sure about guest IO address.
>
> >
> > 1) If RAM
> >
> > What exactly is the guest's notification for? Even without
> > migration, the host's free to move guest memory around in host
> > memory, so it seems any hardware level poking should be done on the
> > host side.
>
> If the error is a correctable error, then host takes care of it by
> moving the page to a different location, the guest need not be and will
> not be notified. Guest will be notified if host is not able to fully
> recover. Hence we hit FWNMI in guest when RAM errors are not recovered
> by the host.
Ok.
> > Is it just to notify the guest that we weren't able to fully recover
> > on the host side and that page may contain corrupted data? If
> > that's so then it seems resuming the handling on the destination is
> > still right. It may be new good RAM, but the contents we migrated
> > could still be corrupt from the machine check event on the source.
>
> Yes. This is what I am doing in my v5 patch set which I am about to
> post. Additionally I block migration when processing machine check errors.
>
> >
> > 2) If IO
> >
> > AFAICT this could only happen with VFIO passthrough devices.. but it
> > shouldn't be possible to migrate if there are any of those.
> >
>
> I am not very sure about IO errors.
Ok. It sounds like that's not the primary case you're interested, so
I guess we can ignore it for now.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2017-09-28 4:13 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-08-16 9:11 [Qemu-devel] [PATCH v3 0/5] target-ppc/spapr: Add FWNMI support in QEMU for PowerKVM guests Aravinda Prasad
2017-08-16 9:12 ` [Qemu-devel] [PATCH v3 1/5] ppc: spapr: Register and handle HCALL to receive updated RTAS region Aravinda Prasad
2017-08-17 1:34 ` David Gibson
2017-08-17 10:27 ` Nikunj A Dadhania
2017-08-21 9:42 ` Aravinda Prasad
2017-08-22 3:33 ` David Gibson
2017-08-22 7:16 ` Aravinda Prasad
2017-08-16 9:12 ` [Qemu-devel] [PATCH v3 2/5] ppc: spapr: Handle "ibm, nmi-register" and "ibm, nmi-interlock" RTAS calls Aravinda Prasad
2017-08-17 1:39 ` David Gibson
2017-08-21 12:35 ` Aravinda Prasad
2017-08-22 2:08 ` David Gibson
2017-08-22 7:12 ` Aravinda Prasad
2017-09-21 9:09 ` Aravinda Prasad
2017-09-27 7:15 ` David Gibson
2017-09-27 11:53 ` Aravinda Prasad
2017-09-28 3:58 ` David Gibson [this message]
2017-08-16 9:12 ` [Qemu-devel] [PATCH v3 3/5] Wrapper function to wait on condition for the main loop mutex Aravinda Prasad
2017-08-16 9:12 ` [Qemu-devel] [PATCH v3 4/5] target/ppc: Handle NMI guest exit Aravinda Prasad
2017-08-17 1:57 ` David Gibson
2017-08-21 12:30 ` Aravinda Prasad
2017-08-23 8:39 ` David Gibson
2017-09-08 8:09 ` [Qemu-devel] [Qemu-ppc] " Aravinda Prasad
2017-09-10 3:22 ` David Gibson
2017-08-16 9:12 ` [Qemu-devel] [PATCH v3 5/5] ppc: spapr: Enable FWNMI capability Aravinda Prasad
2017-08-17 1:59 ` David Gibson
2017-08-21 10:50 ` Aravinda Prasad
2017-08-16 9:29 ` [Qemu-devel] [PATCH v3 0/5] target-ppc/spapr: Add FWNMI support in QEMU for PowerKVM guests no-reply
2017-08-16 9:40 ` no-reply
2017-08-17 3:35 ` Sam Bobroff
2017-08-21 9:10 ` Aravinda Prasad
2017-08-23 0:43 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170928035815.GX12504@umbus \
--to=david@gibson.dropbear.id.au \
--cc=aik@ozlabs.ru \
--cc=aravinda@linux.vnet.ibm.com \
--cc=benh@au1.ibm.com \
--cc=mahesh@linux.vnet.ibm.com \
--cc=paulus@samba.org \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
--cc=sam.bobroff@au1.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).