From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=YZIX=IK=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.5 required=3.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id EFB7CC433E0
	for <qemu-devel@archiver.kernel.org>; Fri, 12 Mar 2021 01:21:01 +0000 (UTC)
Received: from lists.gnu.org (lists.gnu.org [209.51.188.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 0B56464F10
	for <qemu-devel@archiver.kernel.org>; Fri, 12 Mar 2021 01:21:00 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0B56464F10
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Received: from localhost ([::1]:41384 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>)
	id 1lKWU7-0000ml-TI
	for qemu-devel@archiver.kernel.org; Thu, 11 Mar 2021 20:20:59 -0500
Received: from eggs.gnu.org ([2001:470:142:3::10]:56948)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <dgibson@ozlabs.org>)
 id 1lKWSx-00005U-D2; Thu, 11 Mar 2021 20:19:47 -0500
Received: from ozlabs.org ([203.11.71.1]:32881)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <dgibson@ozlabs.org>)
 id 1lKWSu-0007FL-1w; Thu, 11 Mar 2021 20:19:47 -0500
Received: by ozlabs.org (Postfix, from userid 1007)
 id 4DxSf254Xfz9sVw; Fri, 12 Mar 2021 12:19:38 +1100 (AEDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=gibson.dropbear.id.au; s=201602; t=1615511978;
 bh=M+w9UjJoqaSbWHSyVHdvsp5keL1d2LjhcoVznQ9Ik3A=;
 h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
 b=HTS3QnDX9MHK7+IYoj1vBSYjz5Gc/M2eWhi0f9ByJnA9qwNlQTDEE7bm+PJuorLnE
 XXvw/zsYXKiuxqDyuZMRPyRB4ftOHt+3qW0uSgEr6UkbztI1f6VG6LTJEgU5st2RBb
 m0pEukNhz4s16t09wZDYS1/k9nYPwZcMqaLzzYI0=
Date: Fri, 12 Mar 2021 12:19:33 +1100
From: David Gibson <david@gibson.dropbear.id.au>
To: Daniel Henrique Barboza <danielhb413@gmail.com>
Subject: Re: [RFC] adding a generic QAPI event for failed device hotunplug
Message-ID: <YErBpf7w25xho1so@yekko.fritz.box>
References: <155911cc-8764-1a65-4bb3-2fc0628d52e5@gmail.com>
 <877dmkrcpl.fsf@dusky.pond.sub.org>
 <d9567bf3-8740-e8fe-b29b-a3b0ebdb5809@gmail.com>
 <87blbt60hc.fsf@dusky.pond.sub.org>
 <8b79c207-f653-9eec-77f1-ea46c7c75ad5@gmail.com>
 <YEbp4wKK/QY7uKYw@yekko.fritz.box>
 <87mtvczvzw.fsf@dusky.pond.sub.org>
 <98d44670-5a63-1feb-aad8-9dbc62cf2e7a@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature"; boundary="/85zIeRl38jyKNXG"
Content-Disposition: inline
In-Reply-To: <98d44670-5a63-1feb-aad8-9dbc62cf2e7a@gmail.com>
Received-SPF: pass client-ip=203.11.71.1; envelope-from=dgibson@ozlabs.org;
 helo=ozlabs.org
X-Spam_score_int: -17
X-Spam_score: -1.8
X-Spam_bar: -
X-Spam_report: (-1.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HEADER_FROM_DIFFERENT_DOMAINS=0.25,
 SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=no autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: "Michael S. Tsirkin" <mst@redhat.com>, michael.roth@amd.com,
 Julia Suvorova <jusual@redhat.com>,
 "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
 Markus Armbruster <armbru@redhat.com>,
 "qemu-ppc@nongnu.org" <qemu-ppc@nongnu.org>, Laine Stump <laine@redhat.com>,
 Paolo Bonzini <pbonzini@redhat.com>, Igor Mammedov <imammedo@redhat.com>
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: "Qemu-devel"
 <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>


--/85zIeRl38jyKNXG
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 11, 2021 at 05:50:42PM -0300, Daniel Henrique Barboza wrote:
>=20
>=20
> On 3/9/21 3:22 AM, Markus Armbruster wrote:
> > Cc: Paolo and Julia in addition to Igor, because the thread is wandering
> > towards DeviceState member pending_deleted_event.
> >=20
> > Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
> > please loop in the right person.
> >=20
> > David Gibson <david@gibson.dropbear.id.au> writes:
> >=20
> > > On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wro=
te:
> > > >=20
> > > >=20
> > > > On 3/8/21 2:04 PM, Markus Armbruster wrote:
> > > > > Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> > > > >=20
> > > > > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
> > [...]
> > > > > > > We should document the event's reliability.  Are there unplug=
 operations
> > > > > > > where we can't detect failure?  Are there unplug operations w=
here we
> > > > > > > could, but haven't implemented the event?
> > > > > >=20
> > > > > > The current version of the PowerPC spec that the pseries machin=
e implements
> > > > > > (LOPAR) does not predict a way for the virtual machine to repor=
t a hotunplug
> > > > > > error back to the hypervisor. If something fails, the hyperviso=
r is left
> > > > > > in the dark.
> > > > > >=20
> > > > > > What happened in the 6.0.0 dev cycle is that we faced a reliabl=
e way of
> > > > >=20
> > > > > Do you mean "unreliable way"?
> > > >=20
> > > > I guess a better word would be 'reproducible', as in we discovered =
a reproducible
> > > > way of getting the guest kernel to refuse the CPU hotunplug.
> > >=20
> > > Right.  It's worth noting here that in the PAPR model, there are no
> > > "forced" unplugs.  Unplugs always consist of a request to the guest,
> > > which is then resposible for offlining the device and signalling back
> > > to the hypervisor that it's done with it.
> > >=20
> > > > > > making CPU hotunplug fail in the guest (trying to hotunplug the=
 last online
> > > > > > CPU) and the pseries machine was unprepared for dealing with it=
=2E We ended up
> > > > > > implementing a hotunplug timeout and, if the timeout kicks in, =
we're assuming
> > > > > > that the CPU hotunplug failed in the guest. This is the first s=
cenario we have
> > > > > > today where we want to send a QAPI event informing the CPU hotu=
nplug error,
> > > > > > but this event does not exist in QEMU ATM.
> > > > >=20
> > > > > When the timeout kicks in, how can you know the operation failed?=
  You
> > > > > better be sure when you send an error event.  In other words: what
> > > > > prevents the scenario where the operation is much slower than you
> > > > > expect, the timeout expires, the error event is sent, and then the
> > > > > operation completes successfully?
> > > >=20
> > > > A CPU hotunplug in a pseries guest takes no more than a couple of s=
econds, even
> > > > if the guest is under heavy load. The timeout is set to 15 seconds.
> > >=20
> > > Right.  We're well aware that a timeout is an ugly hack, since it's
> > > not really possible to distinguish it from a guest that's just being
> > > really slow.
> >=20
> > As long as unplug failure cannot be detected reliably, we need a timeout
> > *somewhere*.  I vaguely recall libvirt has one.  Laine?
>=20
> Yeah, Libvirt has a timeout for hotunplug operations. I agree that QEMU d=
oing
> the timeout makes more sense since it has more information about the
> conditions/mechanics involved.

Right.  In particular, I can't really see how libvirt can fully
implement that timeout.  AFAIK qemu has no way of listing or
cancelling "in flight" unplug requests, so it's entirely possible that
the unplug could still complete after libvirt's has "timed out".

> At this moment, the only case I know of hotunplug operations timing out in
> QEMU is what we did with CPU hotunplug in pseries though. I can't tell if
> unplug timeout is desirable mechanism for other machines/device types.
>=20
> > Putting the timeout in QEMU might be better.  QEMU might be in a better
> > position to pick a suitable timeout.
> >=20
> > > It was the best thing we could come up with in the short term though.
> > > Without the changes we're suggesting here, the guest can positively
> > > assert the unplug is complete, but it has no way to signal failure.
> > > So, without a timeout qemu is stuck waiting indefinitely (or at least
> > > until the next machine reset) on the guest for an unplug that might
> > > never complete.
> > >=20
> > > It's particularly nasty if the guest does really fail the hotplug
> > > internally, but then conditions change so that the same unplug could
> > > now succeed.  The unplug request is just a message - there's no guest
> > > visible persistent state saying we want the device unplugged, so if
> > > conditions change the guest won't then re-attempt the unplug.
> > > Meanwhile the user can't re-attempt the device_del, because on the
> > > qemu side it's still stuck in a pending unplug waiting on the guest.
> >=20
> > "Can't re-attempt" feels like a bug.  Let me explain.

You may be right.  Perhaps we should just send another unplug message
if we get a device_del on something that's already pending deletion.
AFAICT repeated unplug messages for the same device should be
harmless.

> So, what David mentioned above is related to pseries internals I believe.
>=20
> Up to 5.2.0 the pseries machine were silently ignoring all 'device_del'
> issued for devices that were in the middle of the unplug process, with the
> exception of DIMMs where we bothered to throw an error message back to the
> user.
>=20
> In 6.0.0 the code was reworked, and now we're always letting the user know
> when the 'device_del' was ignored due to an ongoing hotunplug of the devi=
ce.
> And for CPUs we also tell the timeout remaining. We're still not sending
> multiple hotunplug IRQ pulses to the guest, but at least the user is now
> informed about it.
>=20
> As for the commit mentioned below:
>=20
> >=20
> > Depending on the device, device_del can complete the unplug, or merely
> > initiate it.  Documentation:
> >=20
> > # Notes: When this command completes, the device may not be removed fro=
m the
> > #        guest.  Hot removal is an operation that requires guest cooper=
ation.
> > #        This command merely requests that the guest begin the hot remo=
val
> > #        process.  Completion of the device removal process is signaled=
 with a
> > #        DEVICE_DELETED event. Guest reset will automatically complete =
removal
> > #        for all devices.
> >=20
> > This is not as accurate as it could be.  Behavior depends on the device.
> >=20
> > For some kinds of devices, the command completes the unplug before it
> > sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
> > event gets send with a slight delay because device cleanup uses RCU.
> >=20
> > For others, notably PCI devices, the command only pokes the guest to do
> > its bit.  QEMU reacts to the guest's actions, which eventually leads to
> > the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
> > play ball to the end, the event doesn't get send.
> >=20
> > The "can't retry unplug" behavior is new.  Another device_del used to
> > simply poke the guest again.

Maybe on x86.  I think a repeated device_del was at best a no-op on
PAPR.

> > I think this regressed in commit
> > cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
> > Feels like a must-fix for 6.0.
>=20
>=20
> This doesn't directly affect pseries code because we're not using
> dev->pending_deleted_event to track the pending unplug status. We're

Huh... I didn't know about pending_deleted_event.  Maybe we *should*
be using that.  Handling the migration will be painful, of course.

> using an internal flag that is related to the DRC (the 'hotplug state'
> of the device).
>=20
> The commit seems a bit odd because it is making a change in the common co=
de
> inside qmp_device_del() based on a PCI-e specific behavior. In the end th=
is
> doesn't impact any other device but PCI-e (it is the only device that uses
> dev->pending_deleted_event to mark the start and finish of the unplug pro=
cess),
> but it's not something that I might expect in that function.
>=20
> IMO this verification should be removed from qmp_device_del and moved to
> pcie_cap_slot_unplug_request_cb(). This would fix the problem for PCIe de=
vices
> without making assumptions about everything else.
>=20
>=20
> >=20
> > > As we're discussing we think we have a better way, but it relies on
> > > guest changes, so we can't assume we already have that on the qemu
> > > side.
> > >=20
> > > > I'm aware that there's always that special use case, that we haven'=
t seen yet,
> > > > where this assumption is no longer valid. The plan is to change the=
 spec and the
> > > > guest kernel to signal the CPU hotunplug error back to QEMU before =
the dragon
> > > > lands. For now, timing out the CPU hotunplug process when we're alm=
ost certain
> > > > (but not 100%) that it failed in the guest is the best can do.
> > > >=20
> > > > For both cases I want to use DEVICE_UNPLUG_ERROR right from the sta=
rt, avoiding
> > > > guest visible changes when we change how we're detecting the unplug=
 errors.
> > > >=20
> > > > > > The second scenario is a memory hotunplug error. I found out th=
at the pseries
> > > > > > guest kernel does a reconfiguration step when re-attaching the =
DIMM right
> > > > > > after refusing the hotunplug, and this reconfiguration is visib=
le to QEMU.
> > > > > > I proceeded to make the pseries machine detect this error case,=
 rollback the
> > > > > > unplug operation and fire up the MEM_UNPLUG_ERROR. This case is=
 already covered
> > > > > > by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use =
it in this case as
> > > > > > well instead of the MEM specific one.
> > > > > >=20
> > > > > > This investigation and work in the mem hotunplug error path tri=
ggered a
> > > > > > discussion in qemu-ppc, where we're considering whether we shou=
ld do the same
> > > > > > signalling the kernel does for the DIMM hotunplug error for all=
 other device
> > > > > > hotunplug errors, given that the reconfiguration per se is not =
forbidden by LOPAR
> > > > > > and it's currently a no-op. We would make a LOPAR spec change t=
o make this an
> > > > > > official hotunplug error report mechanism, and all pseries hotu=
nplug operations,
> > > > > > for all devices, would report DEVICE_UNPLUG_ERROR QAPI events i=
n the error path.
> > > > > >=20
> > > > > > Granted, the spec change + Kernel change is not something that =
we will be able
> > > > > > to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_E=
RROR QAPI event
> > > > > > already in place would make it easier for the future as well.
> > > > > >=20
> > > > > >=20
> > > > > > I have a doc draft of these changes/infos that I forgot to post=
=2E I would post
> > > > > > it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the =
QAPI events
> > > > > > information there as well. Does that work for you as far as doc=
umentation
> > > > > > goes?
> > > > >=20
> > > > > A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of mac=
hine
> > > > > and device.
> > >=20
> > > Ack.  Fwiw, I don't think this can ever be more than a "best effort"
> > > notification.  Even with a machine and OS that should support it, a
> > > guest bug or hang could mean that it never acks *or* nacks an unplug
> > > request.
> >=20
> > Since the success event is named DEVICE_DELETED, I'd name the
> > probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
> > statement of fact "still not deleted" instead of an error (that just
> > might not be one).
>=20
>=20
> "DEVICE_NOT_DELETED" sounds way better for what we want to express in the
> pseries case when we can't be 100% sure of a guest side error. However,
> there is at least one case where I'm sure of a guest side error (where I'm
> using MEM_UNPLUG_ERROR), so DEVICE_UNPLUG_ERROR seems fitting in that cas=
e.
>=20
>=20
> Should we add both DEVICE_NOT_DELETED and DEVICE_UNPLUG_ERROR then? I can=
 use
> both in pseries-6.0.0.
>=20
>=20
>=20
>=20
> Thanks,
>=20
>=20
> DHB
>=20
>=20
> >=20
> > > > > I'm not asking you to to implement it for all machines and device=
s.  But
> > > > > I want its design to support growth towards that goal, and its
> > > > > documentation reflect its current state.
> > > > >=20
> > > > > In particular, the doc comment in the QAPI schema should list the
> > > > > limitation.  If the event is only implemented for LOPAR for now, =
say so.
> > > > > If it's only implemented for certain devices, say so.
> > > > >=20
> > > > > Questions?
> > > >=20
> > > >=20
> > > > Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAP=
I event
> > > > in a way that it can be used by other machines in the future, and d=
ocumenting
> > > > where the event is being used ATM.
> > > >=20
> > > >=20
> > > > Thanks,
> > > >=20
> > > >=20
> > > > DHB
> > > >=20
> > > >=20
> > > > >=20
> > > >=20
> >=20
>=20

--=20
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--/85zIeRl38jyKNXG
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEdfRlhq5hpmzETofcbDjKyiDZs5IFAmBKwaIACgkQbDjKyiDZ
s5Juiw/+P/WQilAgik09lIMO0IJFjoMTs9HAqVhfuWaYei+kJHm0XSjUGED0g0tu
DpTts88qv6uCnYtDYtQDeMeU7dE7qVDwMy17HmIm78xK4USCdBwqnrHzhtdqvXrt
O1+u0o2qXznebDLQ16tM+39pspxsyuGqL+s/ffPLC1D3VY6q2Mg/BvolTa2gqKSC
HtZrDS87O9ynRqJED86JRoU/gZOm0bETXU/LLsLZYxWvIkRzMBgsUDuFvBbe+3J7
OayIKOW10SDM+eeUHukPV4xUVga1azbef6mWlNELAFuR2+aJK80dCaDK66RXqoiW
0fiwUJ5V2fIMjUsCqPRq+RbW8VPKeq5PwcNdmoRQ6CCAlBQpjxqzPfMsrAzmqPm1
JshVwzwedX6fhWgy+4YvxFW0gs2R9rCZMz8DO/REHajJ9pXqP+bQW0v3af+0HkEg
xGJLsmxC2NBsY0F8W8wxDdtjLtkMWNEDWkWPoPWXJnxROJ6IAQ7YyI/wMNO8QJjC
PtIp5qoqCFdYEv+F/T8/P8XdYIgORHDDl9RWvxQDg+lPTlJo1Z9IhUCyh/Ot3SL3
wLRaxKYN460QBLErruOBH4/JSuyn2NkaHnqUVx/EqeXFx6w8ff8HSOwslmsNQdN4
q2/CeAVh/InXiZ9CEYDHNOTaFjz3VCyPixUSY7W+PTLypq+JDVY=
=LVLQ
-----END PGP SIGNATURE-----

--/85zIeRl38jyKNXG--