Date: Thu, 6 Jul 2017 19:46:26 +1000
From: David Gibson
To: Michael Roth
Cc: Daniel Henrique Barboza, lvivier@redhat.com, sursingh@redhat.com, qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nongnu.org, bharata@linux.vnet.ibm.com
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Message-ID: <20170706094626.GQ2180@umbus.fritz.box>
In-Reply-To: <149929450093.3492.12871680935157233569@loki>
References: <20170621091848.28256-1-david@gibson.dropbear.id.au> <20170705110414.GN2180@umbus.fritz.box> <0e4deabe-58e9-6c15-5910-cda9f8e63f9b@linux.vnet.ibm.com> <149929450093.3492.12871680935157233569@loki>

On Wed, Jul 05, 2017 at 05:41:40PM -0500, Michael Roth wrote:
> Quoting Daniel Henrique Barboza (2017-07-05 16:53:57)
> >
> >
> > On 07/05/2017 08:04 AM, David Gibson wrote:
> > > On Tue, Jul 04, 2017 at 06:13:31PM -0300, Daniel Henrique Barboza wrote:
> > >> I just tested this patch set on top of current ppc-for-2.10 branch
> > >> (which contains the patches from part V). It applied cleanly but
> > >> required a couple of trivial fixes to build, probably because it was
> > >> made on top of an older code base.
> > > Right, I fixed that up locally already, but haven't gotten around to
> > > reposting yet. You can look at the 'drcVI' branch on my github tree,
> > > if you're interested.
> > >
> > >> The trivial migration test worked fine. The libvirt scenario
> > >> (attaching a device on target before migration, try to unplug after
> > >> migration) isn't working as expected, but we have a different result
> > >> with this series. Instead of silently failing to unplug with error
> > >> messages on dmesg, the hot unplug works on QEMU level:
> > > Thanks for testing. Just to clarify what you're saying here, you
> > > haven't spotted a regression with this series, but there is a case
> > > which was broken and is still broken with slightly different
> > > symptoms. Yes?
> > In my opinion, yes. It is debatable if the patch series made it worse
> > because the guest is now misbehaving, but the feature per se wasn't
> > working prior to it.
>
> I think it's the removal of awaiting_allocation.

So.. yes, in the sense that I think we've rediscovered the problem
which prompted the awaiting_allocation flag in the first place. I
still don't think awaiting_allocation is a sensible fix for.. well,
anything. So we need to find the right fix for this problem.

> We know currently that
> in the libvirt scenario the DRC is exposed in a pre-hotplug state of
> ISOLATED/UNALLOCATED.

Right, need to understand exactly how we get there.

> In that state, spapr_drc_detach() completes
> immediately because from the perspective of QEMU it apparently has not
> been exposed to the guest yet, or the guest has already quiesced it on
> its end.

Right.

> awaiting_allocation guarded against this, as its intention was to make
> sure that resource was put into an ALLOCATED state prior to getting moved
> back into an UNALLOCATED state,

Right, but that seems broken to me. It means if a cpu is hotplugged
when the guest isn't paying attention (e.g.
early boot, guest is halted/crashed), then you can't remove it until
you either boot an OS that allocates then releases it, or you reset
(and then release).

> so we didn't immediately unplug a CPU
> while the hotplug was in progress.

But in what sense is the hotplug "in progress"? The guest has been
given a notification, but it hasn't touched the device yet. Moving
the state away from UNALLOCATED is never guaranteed to work,
regardless of what notifications have been received.

> So in your scenario the CPU is just mysteriously vanishing out from
> under the guest,

Because the DRC is ending up in UNUSABLE state on the destination when
it should be in some other state (and was on the source)? Yeah, that
sounds plausible.

> which probably explains the hang. The fix for this
> particular scenario is to fix the initial DRC on the target. I think
> we have a plan for that so I wouldn't consider this a regression
> necessarily.

Yeah.. I'm not quite sure how it's ending up wrong; why isn't the
incoming migration state setting it correctly?

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson