From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:56213)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1aB7fk-0003w1-CM
	for qemu-devel@nongnu.org; Mon, 21 Dec 2015 16:07:13 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1aB7fg-0007Tt-AF
	for qemu-devel@nongnu.org; Mon, 21 Dec 2015 16:07:12 -0500
Received: from mx1.redhat.com ([209.132.183.28]:51464)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1aB7fg-0007Tn-0T
	for qemu-devel@nongnu.org; Mon, 21 Dec 2015 16:07:08 -0500
Message-ID: <1450732026.3781.89.camel@redhat.com>
From: Alex Williamson <alex.williamson@redhat.com>
Date: Mon, 21 Dec 2015 14:07:06 -0700
In-Reply-To: <56737D85.3030305@cn.fujitsu.com>
References: <cover.1447748073.git.chen.fan.fnst@cn.fujitsu.com>
	<35d75210aef922bd049d93e0c8dbe13a2b5b1dd3.1447748073.git.chen.fan.fnst@cn.fujitsu.com>
	<1450384318.2674.110.camel@redhat.com>
	<56737D85.3030305@cn.fujitsu.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH v14 Resend 10/13] pci: add pci device
 pre-post reset callbacks for host bus reset
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Chen Fan <chen.fan.fnst@cn.fujitsu.com>, Cao jin <caoj.fnst@cn.fujitsu.com>, qemu-devel@nongnu.org
Cc: mst@redhat.com

On Fri, 2015-12-18 at 11:29 +0800, Chen Fan wrote:
> On 12/18/2015 04:31 AM, Alex Williamson wrote:
> > On Thu, 2015-12-17 at 09:41 +0800, Cao jin wrote:
> > > From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> > >=20
> > > Particularly, For vfio devices, Once need to recovery devices
> > > by bus reset such as AER, we always need to reset the host bus
> > > to recovery the devices under the bus, so we need to add pci
> > > device
> > > callbacks to specify to do host bus reset.
> > >=20
> > > Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> > > Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > > =C2=A0 hw/pci/pci.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0| 18 ++++++++++++++++++
> > > =C2=A0 hw/pci/pci_bridge.c=C2=A0=C2=A0|=C2=A0=C2=A09 +++++++++
> > > =C2=A0 hw/vfio/pci.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
| 26 ++++++++++++++++++++++++++
> > > =C2=A0 hw/vfio/pci.h=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
|=C2=A0=C2=A02 ++
> > > =C2=A0 include/hw/pci/pci.h |=C2=A0=C2=A07 +++++++
> > > =C2=A0 5 files changed, 62 insertions(+)
> > >=20
> > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > index f6ca6ef..64fa2cc 100644
> > > --- a/hw/pci/pci.c
> > > +++ b/hw/pci/pci.c
> > > @@ -247,6 +247,24 @@ static void pci_do_device_reset(PCIDevice
> > > *dev)
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0msix_reset(dev);
> > > =C2=A0 }
> > > =C2=A0=C2=A0
> > > +void pci_device_pre_reset(PCIBus *bus, PCIDevice *dev, void
> > > *unused)
> > > +{
> > > +=C2=A0=C2=A0=C2=A0=C2=A0PCIDeviceClass *dc =3D PCI_DEVICE_GET_CLAS=
S(dev);
> > > +
> > > +=C2=A0=C2=A0=C2=A0=C2=A0if (dc->pre_reset) {
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0dc->pre_reset(dev)=
;
> > > +=C2=A0=C2=A0=C2=A0=C2=A0}
> > > +}
> > > +
> > > +void pci_device_post_reset(PCIBus *bus, PCIDevice *dev, void
> > > *unused)
> > > +{
> > > +=C2=A0=C2=A0=C2=A0=C2=A0PCIDeviceClass *dc =3D PCI_DEVICE_GET_CLAS=
S(dev);
> > > +
> > > +=C2=A0=C2=A0=C2=A0=C2=A0if (dc->post_reset) {
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0dc->post_reset(dev=
);
> > > +=C2=A0=C2=A0=C2=A0=C2=A0}
> > > +}
> > > +
> > > =C2=A0 /*
> > > =C2=A0=C2=A0=C2=A0* This function is called on #RST and FLR.
> > > =C2=A0=C2=A0=C2=A0* FLR if PCI_EXP_DEVCTL_BCR_FLR is set
> > > diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> > > index 40c97b1..ddb76ab 100644
> > > --- a/hw/pci/pci_bridge.c
> > > +++ b/hw/pci/pci_bridge.c
> > > @@ -267,8 +267,17 @@ void pci_bridge_write_config(PCIDevice *d,
> > > =C2=A0=C2=A0
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0newctl =3D pci_get_word(d->conf=
ig + PCI_BRIDGE_CONTROL);
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0if (~oldctl & newctl & PCI_BRID=
GE_CTL_BUS_RESET) {
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0/*
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* Notify all=
 vfio-pci devices under the bus
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* should do =
physical bus reset.
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0PCIBus *sec_bus =3D=
 pci_bridge_get_sec_bus(s);
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0pci_for_each_devic=
e(sec_bus, pci_bus_num(sec_bus),
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0pci_device_pre_reset, NULL);
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0/* Trig=
ger hot reset on 0->1 transition. */
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0qbus_re=
set_all(&s->sec_bus.qbus);
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0pci_for_each_devic=
e(sec_bus, pci_bus_num(sec_bus),
> > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0pci_device_post_reset, NULL);
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0}
> > > =C2=A0 }
> > > =C2=A0=C2=A0
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index e17dc89..df32618 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -39,6 +39,7 @@
> > > =C2=A0=C2=A0
> > > =C2=A0 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> > > =C2=A0 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool
> > > enabled);
> > > +static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single);
> > > =C2=A0=C2=A0
> > > =C2=A0 /*
> > > =C2=A0=C2=A0=C2=A0* Disabling BAR mmaping can be slow, but toggling=
 it around
> > > INTx
> > > can
> > > @@ -1888,6 +1889,8 @@ static int
> > > vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0/* List all affected devices by=
 bus reset */
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0devices =3D &info->devices[0];
> > > =C2=A0=C2=A0
> > > +=C2=A0=C2=A0=C2=A0=C2=A0vdev->single_depend_dev =3D (info->count =3D=
=3D 1);
> > > +
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0/* Verify that we have all the =
groups required */
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0for (i =3D 0; i < info->count; =
i++) {
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0PCIHost=
DeviceAddress host;
> > > @@ -2029,10 +2032,26 @@ static int
> > > vfio_check_bus_reset(NotifierWithReturn *n, void *opaque)
> > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0return vfio_check_host_bus_rese=
t(vdev);
> > > =C2=A0 }
> > > =C2=A0=C2=A0
> > > +static void vfio_aer_pre_reset(PCIDevice *pdev)
> > > +{
> > > +=C2=A0=C2=A0=C2=A0=C2=A0VFIOPCIDevice *vdev =3D DO_UPCAST(VFIOPCID=
evice, pdev, pdev);
> > > +
> > > +=C2=A0=C2=A0=C2=A0=C2=A0vdev->aer_reset =3D true;
> > > +=C2=A0=C2=A0=C2=A0=C2=A0vfio_pci_hot_reset(vdev, vdev->single_depe=
nd_dev);
> > > +}
> > Doesn't this lead to multiple host bus resets per guest bus reset
> > in
> > many cases?=C2=A0=C2=A0It looks like we'll do it once per vfio-pci de=
vice,
> > even
> > if those devices are on the same host bus.=C2=A0=C2=A0That's a 1 seco=
nd
> > operation
> > per device.=C2=A0=C2=A0Can we avoid that?=C2=A0=C2=A0Maybe some sort =
of sequence ID
> > could
> > help a device figure out whether it's already been reset as part of
> > a
> > dependent device for this particular guest bus reset.=C2=A0=C2=A0Than=
ks,
> That's right, I missed this case, but I don't understand the scenario
> how to
> use a sequence ID to mark the device if been reset. can you detail it
> ?

I don't really have a concrete idea for a sequence ID, it was just a
thought that maybe if each bus reset had a sequence ID then devices
could know whether they've already been reset for that sequence ID.
=C2=A0The basic problem we have is that reset callbacks are per device an=
d
it's difficult to infer which individual resets are part of that bus
reset. =C2=A0In fact, do we propagate resets correctly down secondary
bridges? =C2=A0We're triggering off a VM write of the bridge control bus
reset bit triggering from 0->1 and we then call qbus_reset_all() on
that qbus, which I think is just going to call=C2=A0pci_bridge_reset() fo=
r
any other bridges, which doesn't do anything about resetting deeper
subordinate buses. =C2=A0I think that means that if we had a root port wi=
th
a switch below it and endpoints below that, if the VM triggered a
secondary bus reset at the root port, the endpoints would never see it,
which is not how real hardware works.

> additional, there was a mechanism to compute device whether need to
> be reset
> by hot reset. so I simply modify the code as the following:
>=20
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a9bc67e..42774ca 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2063,13 +2063,19 @@ static void vfio_aer_pre_reset(PCIDevice
> *pdev)
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0VFIOPCIDevice *vdev =3D DO_UPCAST(V=
FIOPCIDevice, pdev, pdev);
>=20
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0vdev->aer_reset =3D true;
> -=C2=A0=C2=A0=C2=A0=C2=A0vfio_pci_hot_reset(vdev, vdev->single_depend_d=
ev);
> +=C2=A0=C2=A0=C2=A0=C2=A0vdev->vbasedev.needs_reset =3D true;
> =C2=A0 }
>=20
> =C2=A0 static void vfio_aer_post_reset(PCIDevice *pdev)
> =C2=A0 {
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0VFIOPCIDevice *vdev =3D DO_UPCAST(V=
FIOPCIDevice, pdev, pdev);
>=20
> +=C2=A0=C2=A0=C2=A0=C2=A0if (!vdev->single_depend_dev && vdev->vbasedev=
.needs_reset) {
> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0vfio_pci_hot_reset(vde=
v, false);
> +=C2=A0=C2=A0=C2=A0=C2=A0} else {
> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0vfio_pci_hot_reset(vde=
v, true);
> +=C2=A0=C2=A0=C2=A0=C2=A0}
> +
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0vdev->aer_reset =3D false;
> =C2=A0 }
>=20
> what do you think of this ?

I think it might be a bigger problem than that subtle change. =C2=A0I won=
der
if we really need a better model of the reset line through the
subordinate buses. =C2=A0When reset is asserted, we'd set a bus_in_reset
flag on the bus and trigger downstream bridges to do the same. =C2=A0Then
when the user de-asserts reset, we'd call qbus_reset_all() and
propagate it through to downstream buses. =C2=A0That way the per device
reset callback could check to see if the bus is in reset and aer
devices can then know to do a bus reset. =C2=A0Finally, the bus_in_reset
flag would be cleared on all the affected buses. =C2=A0I'm sure there are
numerous details missing there, but it seems like it might be a
reasonable approach. =C2=A0Thanks,

Alex