From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:46002) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UJh0A-0005JB-NK for qemu-devel@nongnu.org; Sun, 24 Mar 2013 05:14:10 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UJh06-0001GJ-Oc for qemu-devel@nongnu.org; Sun, 24 Mar 2013 05:14:06 -0400 Received: from mx1.redhat.com ([209.132.183.28]:62528) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UJh06-0001Fw-Dk for qemu-devel@nongnu.org; Sun, 24 Mar 2013 05:14:02 -0400 Received: from int-mx10.intmail.prod.int.phx2.redhat.com (int-mx10.intmail.prod.int.phx2.redhat.com [10.5.11.23]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r2O9E1AT011089 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Sun, 24 Mar 2013 05:14:01 -0400 Date: Sun, 24 Mar 2013 11:14:39 +0200 From: "Michael S. Tsirkin" Message-ID: <20130324091439.GB3935@redhat.com> References: <20130319222330.19359.2521.stgit@bling.home> <20130321165604.GA2994@redhat.com> <1363965913.24132.605.camel@bling.home> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1363965913.24132.605.camel@bling.home> Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH] pcie: Enhance PCIe links List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson Cc: qemu-devel@nongnu.org On Fri, Mar 22, 2013 at 09:25:13AM -0600, Alex Williamson wrote: > On Thu, 2013-03-21 at 18:56 +0200, Michael S. Tsirkin wrote: > > On Tue, Mar 19, 2013 at 04:24:47PM -0600, Alex Williamson wrote: > > > Enable PCIe devices to negotiate links. This upgrades our root por= ts > > > and switches to advertising x16, 8.0GT/s and negotiates the current > > > link status to the best available width and speed. Note that we al= so > > > skip setting link fields for Root Complex Integrated Endpoints as > > > indicated by the PCIe spec. > > >=20 > > > Signed-off-by: Alex Williamson > > > --- > > >=20 > > > This depends on the previous pcie_endpoint_cap_init patch. > > >=20 > > > hw/ioh3420.c | 5 +- > > > hw/pci/pcie.c | 150 +++++++++++++++++++++++++++++++++++= +++++++++--- > > > hw/pci/pcie.h | 7 ++ > > > hw/pci/pcie_regs.h | 17 +++++ > > > hw/usb/hcd-xhci.c | 3 + > > > hw/xio3130_downstream.c | 4 + > > > hw/xio3130_upstream.c | 4 + > > > 7 files changed, 173 insertions(+), 17 deletions(-) > > >=20 > > > diff --git a/hw/ioh3420.c b/hw/ioh3420.c > > > index 5cff61e..0aaec5b 100644 > > > --- a/hw/ioh3420.c > > > +++ b/hw/ioh3420.c > > > @@ -115,7 +115,10 @@ static int ioh3420_initfn(PCIDevice *d) > > > if (rc < 0) { > > > goto err_bridge; > > > } > > > - rc =3D pcie_cap_init(d, IOH_EP_EXP_OFFSET, PCI_EXP_TYPE_ROOT_P= ORT, p->port); > > > + > > > + /* Real hardware only supports up to x4, 2.5GT/s, but we're no= t real hw */ > > > + rc =3D pcie_cap_init(d, IOH_EP_EXP_OFFSET, PCI_EXP_TYPE_ROOT_P= ORT, p->port, > > > + PCI_EXP_LNK_MLW_16, PCI_EXP_LNK_LS_80); > > > if (rc < 0) { > > > goto err_msi; > > > > > I'd like to see some documentation about why this is needed/a good id= ea. > > You could argue that some guest might be surprised if the card width > > does not match reality but it could work in reverse too ... > > Do you see some drivers complaining? Linux prints warnings sometimes = - > > is this what worries you? > > Let's document the motivation here not only what's going on. >=20 > Ok, I can add motivation. This is where I really wish we had a generic > switch that wouldn't risk having existing real world expectations. The > base motivation though is to not create artificial bottlenecks. If I > assign a graphics card to a guest, I want the link to negotiate to the > same bandwidth it has on the host. I'm not entirely sure how to do tha= t > yet, whether I should cap the device capabilities to it's current statu= s > or whether I should force a negotiation at the current speed. The > former may confuse drivers if they expect certain device capabilities, > the latter may cause drivers to attempt to renegotiate higher speeds. > The goal though is to have a virtual platform that advertises sufficien= t > speed on all ports to attach any real or virtual device. >=20 > Perhaps I should stick to hardware limitations for xio3130 & io3420 and > distill these drivers down to generic ones with x32 ports and 8GT/s. Hmm this doesn't actually answer the question ... Why does it matter what is the width negotiated by the guest? > > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c > > > index 62bd0b8..c07d3cc 100644 > > > --- a/hw/pci/pcie.c > > > +++ b/hw/pci/pcie.c > > > @@ -37,11 +37,98 @@ > > > #define PCIE_DEV_PRINTF(dev, fmt, ...) = \ > > > PCIE_DPRINTF("%s:%x "fmt, (dev)->name, (dev)->devfn, ## __VA_A= RGS__) > > > =20 > > > +static uint16_t pcie_link_max_width(PCIDevice *dev) > > > +{ > > > + uint8_t *exp_cap; > > > + uint32_t lnkcap; > > > + > > > + exp_cap =3D dev->config + dev->exp.exp_cap; > > > + lnkcap =3D pci_get_long(exp_cap + PCI_EXP_LNKCAP); > > > + > > > + return lnkcap & PCI_EXP_LNKCAP_MLW; > > > +} > > > + > > > +static uint8_t pcie_link_speed_mask(PCIDevice *dev) > > > +{ > > > + uint8_t *exp_cap; > > > + uint32_t lnkcap, lnkcap2; > > > + uint8_t speeds, mask; > > > + > > > + exp_cap =3D dev->config + dev->exp.exp_cap; > > > + lnkcap =3D pci_get_long(exp_cap + PCI_EXP_LNKCAP); > > > + lnkcap2 =3D pci_get_long(exp_cap + PCI_EXP_LNKCAP2); > > > + > > > + mask =3D (1 << (lnkcap & PCI_EXP_LNKCAP_SLS)) - 1; > > > + > > > + /* > > > + * If LNKCAP2 reports supported link speeds, then LNKCAP index= es > > > + * the highest supported speed. Mask out the rest and return. > > > + */ > > > + speeds =3D (lnkcap2 & 0xfe) >> 1; > >=20 > > what's 0xfe? >=20 > Will add macro >=20 > > > + if (speeds) { > > > + return speeds & mask; > > > + } > > > + > > > + /* > > > + * Otherwise LNKCAP returns the maximum speed and the device s= upports > > > + * all speeds below it. This is really only valid for 2.5 & 5= GT/s > > > + */ > > > + return mask; > > > +} > > > + > > > +void pcie_negotiate_link(PCIDevice *dev) > > > +{ > > > + PCIDevice *parent; > > > + uint16_t flags, width; > > > + uint8_t type, speed; > > > + > > > + /* Skip non-express buses and Root Complex buses. */ > > > + if (!pci_bus_is_express(dev->bus) || pci_bus_is_root(dev->bus)= ) { > > > + return; > > > + } > > > + > > > + /* > > > + * Downstream ports don't negotiate with upstream ports, their= link > > > + * is negotiated by whatever is attached downstream to them. = The > > > + * same is true of root ports, but root ports are always attac= hed to > > > + * the root complex, so fall out above. > > > + */ > > > + flags =3D pci_get_word(dev->config + dev->exp.exp_cap + PCI_EX= P_FLAGS); > > > + type =3D (flags & PCI_EXP_FLAGS_TYPE) >> PCI_EXP_FLAGS_TYPE_SH= IFT; > > > + if (type =3D=3D PCI_EXP_TYPE_DOWNSTREAM) { > > > + return; > > > + } > > > + > > > + parent =3D dev->bus->parent_dev; > > > + > > > + assert(pci_is_express(dev) && dev->exp.exp_cap && > > > + pci_is_express(parent) && parent->exp.exp_cap); > > > + > > > + /* Devices support all widths below max, so use the MIN max */ > > > + width =3D MIN(pcie_link_max_width(dev), pcie_link_max_width(pa= rent)); > >=20 > >=20 > > So I see this in spec: > > 7.8.19.Link Control 2 Register (Offset 30h) > >=20 > >=20 > > 9 > > Hardware Autonomous Width Disable =E2=80=93 When Set, this bit > > disables hardware from changing the Link width for reasons > > other than attempting to correct unreliable Link operation by > > reducing Link width. > > For a Multi-Function device associated with an Upstream Port, > > the bit in Function 0 is of type RW, and only Function 0 controls > > the component=E2=80=99s Link behavior. In all other Functions of = that > > device, this bit is of type RsvdP. > > Components that do not implement the ability autonomously to > > change Link width are permitted to hardwire this bit to 0b. > > Default value of this bit is 0b. > > RW/RsvdP > > (see > > description) > >=20 > > So far we did not implement this according to rules: > > we always used the 1x width so not ability to change > > width. > >=20 > > Now that we do, we need to support the override? > >=20 > > Did not look but something like this could exist for speed too? >=20 > We should definitely only have function 0 perform the link negotiation > with the upstream port, subsequent functions can just copy the result > from function 0. Am I still missing something or is that sufficient? We would also need to implement the bit above. If we ever implement error reporting, we might see devices reduce the width automatically and need to implement that. All of this is bypassed if we simply report x1 to guest ... > > > + /* Choose the highest speed supported by both devices */ > > > + speed =3D ffs(pow2floor(pcie_link_speed_mask(dev) & > > > + pcie_link_speed_mask(parent))); > >=20 > > What's all this trickery about? Not same as fls by chance? >=20 > Yes, I couldn't find fls, I'll see if the compiler can. This rounds > down to the highest power of 2 then finds the first (only) set bit. It's in include/qemu-common.h > > > + > > > + pci_word_test_and_clear_mask(dev->config + dev->exp.exp_cap + > > > + PCI_EXP_LNKSTA, > > > + PCI_EXP_LNKCAP_MLW | PCI_EXP_LNKC= AP_SLS); > > > + pci_word_test_and_set_mask(dev->config + dev->exp.exp_cap + PC= I_EXP_LNKSTA, > > > + width | speed); > >=20 > > This looks strange. Should not be pci_set_word_by_mask, > > since speed is not a mask itself? >=20 > Yep, it looks like set_word_by_mask would replace clear_mask, set_mask. >=20 > > > + > > > + pci_word_test_and_clear_mask(parent->config + parent->exp.exp_= cap + > > > + PCI_EXP_LNKSTA, > > > + PCI_EXP_LNKCAP_MLW | PCI_EXP_LNKC= AP_SLS); > > > + pci_word_test_and_set_mask(parent->config + parent->exp.exp_ca= p + > > > + PCI_EXP_LNKSTA, > > > + width | speed); > > > +} > > > =20 > > > /*****************************************************************= ********** > > > * pci express capability helper functions > > > */ > > > -int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t type, ui= nt8_t port) > > > +int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t type, ui= nt8_t port, > > > + uint16_t width, uint8_t speed) > > > { > > > int pos; > > > uint8_t *exp_cap; > > > @@ -71,14 +158,56 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offs= et, uint8_t type, uint8_t port) > > > */ > > > pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER); > > > =20 > > > - pci_set_long(exp_cap + PCI_EXP_LNKCAP, > > > - (port << PCI_EXP_LNKCAP_PN_SHIFT) | > > > - PCI_EXP_LNKCAP_ASPMS_0S | > > > - PCI_EXP_LNK_MLW_1 | > > > - PCI_EXP_LNK_LS_25); > > > + /* Root Complex devices don't report link fields */ > > > + if (type !=3D PCI_EXP_TYPE_RC_END) { > > > + /* User specifies the link port # */ > > > + pci_set_long(exp_cap + PCI_EXP_LNKCAP, port << PCI_EXP_LNK= CAP_PN_SHIFT); > > > + > > > + pci_long_test_and_set_mask(exp_cap + PCI_EXP_LNKCAP, width= | speed); > > > + > > > + /* Advertise L0s ASPM support */ > > > + pci_long_test_and_set_mask(exp_cap + PCI_EXP_LNKCAP, > > > + PCI_EXP_LNKCAP_ASPMS_L0S); > > > =20 > > > - pci_set_word(exp_cap + PCI_EXP_LNKSTA, > > > - PCI_EXP_LNK_MLW_1 | PCI_EXP_LNK_LS_25); > > > + /* > > > + * Link bandwidth notification is required for all root po= rts and > > > + * downstream ports supporting links wider than x1. > > > + */ > > > + if (width > PCI_EXP_LNK_MLW_1 && (type =3D=3D PCI_EXP_TYPE= _ROOT_PORT || > > > + type =3D=3D PCI_EXP_TYPE_DOWNSTREAM)) { > > > + pci_long_test_and_set_mask(exp_cap + PCI_EXP_LNKCAP, > > > + PCI_EXP_LNKCAP_LBNC); > > > + } > > > + > > > + /* 8.0GT/s adds some requirements */ > > > + if (speed >=3D PCI_EXP_LNK_LS_80) { > > > + /* > > > + * 2.5 & 5.0GT/s can be fully described by LNKCAP, but= 8.0GT/s > > > + * is actually a reference to the highest bit supporte= d in this > > > + * register. We assume the device supports all link s= peeds. > > > + */ > > > + pci_long_test_and_set_mask(exp_cap + PCI_EXP_LNKCAP2, > > > + PCI_EXP_LNK2_LS_25); > > > + pci_long_test_and_set_mask(exp_cap + PCI_EXP_LNKCAP2, > > > + PCI_EXP_LNK2_LS_50); > > > + pci_long_test_and_set_mask(exp_cap + PCI_EXP_LNKCAP2, > > > + PCI_EXP_LNK2_LS_80); > >=20 > > Let's just do a | here. >=20 > Ok >=20 > > > + > > > + /* > > > + * Supporting 8.0GT/s requires that we advertise Data = Link Layer > > > + * Active on all downstream ports supporting hotplug o= r speeds > > > + * great than 5GT/s > >=20 > > typo > >=20 > > > + */ > > > + if (type =3D=3D PCI_EXP_TYPE_DOWNSTREAM) { > > > + pci_long_test_and_set_mask(exp_cap + PCI_EXP_LNKCA= P, > > > + PCI_EXP_LNKCAP_DLLLARC)= ; > > > + pci_word_test_and_set_mask(exp_cap + PCI_EXP_LNKST= A, > > > + PCI_EXP_LNKSTA_DLLLA); > > > + } > > > + } > > > + > > > + pcie_negotiate_link(dev); > > > + } > > > =20 > > > pci_set_long(exp_cap + PCI_EXP_DEVCAP2, > > > PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP); > > > @@ -87,7 +216,8 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset= , uint8_t type, uint8_t port) > > > return pos; > > > } > > > =20 > > > -int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset) > > > +int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset, uint16_= t width, > > > + uint8_t speed) > > > { > > > uint8_t type =3D PCI_EXP_TYPE_ENDPOINT; > > > =20 > > > @@ -100,7 +230,7 @@ int pcie_endpoint_cap_init(PCIDevice *dev, uint= 8_t offset) > > > type =3D PCI_EXP_TYPE_RC_END; > > > } > > > =20 > > > - return pcie_cap_init(dev, offset, type, 0); > > > + return pcie_cap_init(dev, offset, type, 0, width, speed); > > > } > > > =20 > > > void pcie_cap_exit(PCIDevice *dev) > > > diff --git a/hw/pci/pcie.h b/hw/pci/pcie.h > > > index c010007..051dd76 100644 > > > --- a/hw/pci/pcie.h > > > +++ b/hw/pci/pcie.h > > > @@ -94,9 +94,12 @@ struct PCIExpressDevice { > > > }; > > > =20 > > > /* PCI express capability helper functions */ > > > -int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t type, ui= nt8_t port); > > > -int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset); > > > +int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t type, ui= nt8_t port, > > > + uint16_t width, uint8_t speed); > > > +int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset, uint16_= t width, > > > + uint8_t speed); > > > void pcie_cap_exit(PCIDevice *dev); > > > +void pcie_negotiate_link(PCIDevice *dev); > > > uint8_t pcie_cap_get_type(const PCIDevice *dev); > > > void pcie_cap_flags_set_vector(PCIDevice *dev, uint8_t vector); > > > uint8_t pcie_cap_flags_get_vector(PCIDevice *dev); > > > diff --git a/hw/pci/pcie_regs.h b/hw/pci/pcie_regs.h > > > index 4d123d9..4078451 100644 > > > --- a/hw/pci/pcie_regs.h > > > +++ b/hw/pci/pcie_regs.h > > > @@ -34,13 +34,23 @@ > > > /* PCI_EXP_LINK{CAP, STA} */ > > > /* link speed */ > > > #define PCI_EXP_LNK_LS_25 1 > > > +#define PCI_EXP_LNK_LS_50 2 > > > +#define PCI_EXP_LNK_LS_80 3 > > > =20 > > > #define PCI_EXP_LNK_MLW_SHIFT (ffs(PCI_EXP_LNKCAP_MLW) -= 1) > > > #define PCI_EXP_LNK_MLW_1 (1 << PCI_EXP_LNK_MLW_SHIF= T) > > > +#define PCI_EXP_LNK_MLW_2 (2 << PCI_EXP_LNK_MLW_SHIF= T) > > > +#define PCI_EXP_LNK_MLW_4 (4 << PCI_EXP_LNK_MLW_SHIF= T) > > > +#define PCI_EXP_LNK_MLW_8 (8 << PCI_EXP_LNK_MLW_SHIF= T) > > > +#define PCI_EXP_LNK_MLW_12 (12 << PCI_EXP_LNK_MLW_SHI= FT) > > > +#define PCI_EXP_LNK_MLW_16 (16 << PCI_EXP_LNK_MLW_SHI= FT) > > > +#define PCI_EXP_LNK_MLW_32 (32 << PCI_EXP_LNK_MLW_SHI= FT) > > > =20 > > > /* PCI_EXP_LINKCAP */ > > > #define PCI_EXP_LNKCAP_ASPMS_SHIFT (ffs(PCI_EXP_LNKCAP_ASPMS)= - 1) > > > -#define PCI_EXP_LNKCAP_ASPMS_0S (1 << PCI_EXP_LNKCAP_ASPMS= _SHIFT) > > > +#define PCI_EXP_LNKCAP_ASPMS_L0S (1 << PCI_EXP_LNKCAP_ASPMS= _SHIFT) > > > +#define PCI_EXP_LNKCAP_ASPMS_L1 (2 << PCI_EXP_LNKCAP_ASPMS= _SHIFT) > > > +#define PCI_EXP_LNKCAP_ASPMS_L0SL1 (3 << PCI_EXP_LNKCAP_ASPMS= _SHIFT) > > > =20 > > > #define PCI_EXP_LNKCAP_PN_SHIFT (ffs(PCI_EXP_LNKCAP_PN) - = 1) > > > =20 > > > @@ -72,6 +82,11 @@ > > > =20 > > > #define PCI_EXP_DEVCTL2_EETLPPB 0x80 > > > =20 > > > +#define PCI_EXP_LNKCAP2 44 /* Link Capabilities 2 */ > > > +#define PCI_EXP_LNK2_LS_25 (1 << 1) > > > +#define PCI_EXP_LNK2_LS_50 (1 << 2) > > > +#define PCI_EXP_LNK2_LS_80 (1 << 3) > > > + > > > /* ARI */ > > > #define PCI_ARI_VER 1 > > > #define PCI_ARI_SIZEOF 8 > > > diff --git a/hw/usb/hcd-xhci.c b/hw/usb/hcd-xhci.c > > > index 5aa342b..5f57807 100644 > > > --- a/hw/usb/hcd-xhci.c > > > +++ b/hw/usb/hcd-xhci.c > > > @@ -3332,7 +3332,8 @@ static int usb_xhci_initfn(struct PCIDevice *= dev) > > > PCI_BASE_ADDRESS_SPACE_MEMORY|PCI_BASE_ADDRES= S_MEM_TYPE_64, > > > &xhci->mem); > > > =20 > > > - ret =3D pcie_endpoint_cap_init(&xhci->pci_dev, 0xa0); > > > + ret =3D pcie_endpoint_cap_init(&xhci->pci_dev, 0xa0, PCI_EXP_L= NK_MLW_1, > > > + PCI_EXP_LNK_LS_25); > > > assert(ret >=3D 0); > > > =20 > > > if (xhci->flags & (1 << XHCI_FLAG_USE_MSI)) { > > > diff --git a/hw/xio3130_downstream.c b/hw/xio3130_downstream.c > > > index b868f56..086ada9 100644 > > > --- a/hw/xio3130_downstream.c > > > +++ b/hw/xio3130_downstream.c > > > @@ -79,8 +79,10 @@ static int xio3130_downstream_initfn(PCIDevice *= d) > > > if (rc < 0) { > > > goto err_bridge; > > > } > > > + > > > + /* Real hardware only supports x1, 2.5GT/s, but we're not real= hw */ > >=20 > > add "So set it to 8.0 GT/s because ... " >=20 > Point taken. Thanks, >=20 > Alex >=20 > > > rc =3D pcie_cap_init(d, XIO3130_EXP_OFFSET, PCI_EXP_TYPE_DOWNS= TREAM, > > > - p->port); > > > + p->port, PCI_EXP_LNK_MLW_16, PCI_EXP_LNK_LS= _80); > > > if (rc < 0) { > > > goto err_msi; > > > } > > > diff --git a/hw/xio3130_upstream.c b/hw/xio3130_upstream.c > > > index cd5d97d..4b6820b 100644 > > > --- a/hw/xio3130_upstream.c > > > +++ b/hw/xio3130_upstream.c > > > @@ -75,8 +75,10 @@ static int xio3130_upstream_initfn(PCIDevice *d) > > > if (rc < 0) { > > > goto err_bridge; > > > } > > > + > > > + /* Real hardware only supports x1, 2.5GT/s, but we're not real= hw */ > > > rc =3D pcie_cap_init(d, XIO3130_EXP_OFFSET, PCI_EXP_TYPE_UPSTR= EAM, > > > - p->port); > > > + p->port, PCI_EXP_LNK_MLW_16, PCI_EXP_LNK_LS= _80); > > > if (rc < 0) { > > > goto err_msi; > > > } >=20 >=20