* [Qemu-devel] QEMU PCIe link "negotiation"
@ 2018-10-15 20:18 Alex Williamson
2018-10-16 9:49 ` Dr. David Alan Gilbert
2018-10-16 15:21 ` Michael S. Tsirkin
0 siblings, 2 replies; 5+ messages in thread
From: Alex Williamson @ 2018-10-15 20:18 UTC (permalink / raw)
To: qemu-devel; +Cc: Michael S. Tsirkin, Marcel Apfelbaum
Hi,
I'd like to start a discussion about virtual PCIe link width and speeds
in QEMU to figure out how we progress past the 2.5GT/s, x1 width links
we advertise today. This matters for assigned devices as the endpoint
driver may not enable full physical link utilization if the upstream
port only advertises minimal capabilities. One GPU assignment user
has measured that they only see an average transfer rate of 3.2GB/s
with current code, but hacking the downstream port to advertise an
8GT/s, x16 width link allows them to get 12GB/s. Obviously not all
devices and drivers will have this dependency and see these kinds of
improvements, or perhaps any improvement at all.
The first problem seems to be how we expose these link parameters in a
way that makes sense and supports backwards compatibility and
migration. I think we want the flexibility to allow the user to
specify per PCIe device the link width and at least the maximum link
speed, if not the actual discrete link speeds supported. However,
while I want to provide this flexibility, I don't necessarily think it
makes sense to burden the user to always specify these to get
reasonable defaults. So I would propose that we a) add link parameters
to the base PCIe device class and b) set defaults based on the machine
type. Additionally these machine type defaults would only apply to
generic PCIe root ports and switch ports, anything based on real
hardware would be fixed, ex. ioh3420 would stay at 2.5GT/s, x1 unless
overridden by the user. Existing machine types would also stay at this
"legacy" rate, while pc-q35-3.2 might bring all generic devices up to
PCIe 4.0 specs, x32 width and 16GT/s, where the per-endpoint
negotiation would bring us back to negotiated widths and speeds
matching the endpoint. Reasonable?
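To make (a) and (b) a bit more concrete, roughly what I have in mind,
as a sketch only; the property names, the PCIEPort fields and the
encoding are illustrative assumptions, not an existing interface:

/* Sketch only: hypothetical link properties on the generic PCIe port
 * base class.  Speed uses the LNKCAP/LNKSTA encoding (1 = 2.5GT/s,
 * 2 = 5GT/s, 3 = 8GT/s, 4 = 16GT/s), width is the lane count.
 */
static Property pcie_port_link_props[] = {
    DEFINE_PROP_UINT8("x-link-speed", PCIEPort, link_speed, 1),
    DEFINE_PROP_UINT8("x-link-width", PCIEPort, link_width, 1),
    DEFINE_PROP_END_OF_LIST(),
};

/* Older machine types would pin the generic ports to the legacy values
 * via compat properties, so guests see no change on existing machines:
 */
static GlobalProperty pcie_link_compat_props[] = {
    { "pcie-root-port", "x-link-speed", "1" },
    { "pcie-root-port", "x-link-width", "1" },
};

The machine type would then only need to flip the defaults for the
generic ports, leaving ioh3420 and friends untouched.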
Next I think we need to look at how and when we do virtual link
negotiation. We're mostly discussing a virtual link, so I think
negotiation is simply filling in the negotiated link and width with the
highest common factor between endpoint and upstream port. For assigned
devices, this should match the endpoint's existing negotiated link
parameters, however, devices can dynamically change their link speed
(perhaps also width?), so I believe a current link speed of 2.5GT/s
could upshift to 8GT/s without any sort of visible renegotiation. Does
this mean that we should have link parameter callbacks from downstream
port to endpoint? Or maybe the downstream port link status register
should effectively be an alias for LNKSTA of devfn 00.0 of the
downstream device when it exists. We only need to report a consistent
link status value when someone looks at it, so reading directly from
the endpoint probably makes more sense than any sort of interface to
keep the value current.
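A very rough sketch of that aliasing idea, just to show the shape of
it (helper and field names are approximate):

/* Sketch only: satisfy LNKSTA reads on the virtual downstream port by
 * mirroring the negotiated speed/width of the device at devfn 00.0 on
 * the secondary bus, falling back to the port's own value when no
 * device is present.
 */
static uint16_t pcie_port_get_lnksta(PCIDevice *port)
{
    PCIBus *sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(port));
    PCIDevice *ep = pci_find_device(sec_bus, pci_bus_num(sec_bus),
                                    PCI_DEVFN(0, 0));
    uint16_t lnksta = pci_get_word(port->config +
                                   port->exp.exp_cap + PCI_EXP_LNKSTA);

    if (ep && pci_is_express(ep) && ep->exp.exp_cap) {
        uint16_t ep_sta = pci_get_word(ep->config +
                                       ep->exp.exp_cap + PCI_EXP_LNKSTA);

        lnksta &= ~(PCI_EXP_LNKSTA_CLS | PCI_EXP_LNKSTA_NLW);
        lnksta |= ep_sta & (PCI_EXP_LNKSTA_CLS | PCI_EXP_LNKSTA_NLW);
    }

    return lnksta;
}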
If we take the above approach with LNKSTA (probably also LNKSTA2), is
any sort of "negotiation" required? We're automatically negotiated if
the capabilities of the upstream port are a superset of the endpoint's
capabilities. What do we do and what do we care about when the
upstream port is a subset of the endpoint though? For example, an
8GT/s, x16 endpoint is installed into a 2.5GT/s, x1 downstream port.
On real hardware we obviously negotiate the endpoint down to the
downstream port parameters. We could do that with an emulated device,
but this is the scenario we have today with assigned devices and we
simply leave the inconsistency. I don't think we actually want to
(and there would be lots of complications to) force the physical device
to negotiate down to match a virtual downstream port. Do we simply
trigger a warning that this may result in non-optimal performance and
leave the inconsistency?
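If we go the warning route, the check itself would be trivial,
something along these lines when the assigned device is realized
behind a virtual downstream port (again only a sketch, field names
approximate):

/* Sketch: warn when the virtual upstream port advertises less than the
 * assigned endpoint is physically capable of.
 */
static void vfio_pci_check_link(PCIDevice *ep, PCIDevice *port)
{
    uint32_t ep_cap = pci_get_long(ep->config +
                                   ep->exp.exp_cap + PCI_EXP_LNKCAP);
    uint32_t port_cap = pci_get_long(port->config +
                                     port->exp.exp_cap + PCI_EXP_LNKCAP);

    if ((ep_cap & PCI_EXP_LNKCAP_SLS) > (port_cap & PCI_EXP_LNKCAP_SLS) ||
        (ep_cap & PCI_EXP_LNKCAP_MLW) > (port_cap & PCI_EXP_LNKCAP_MLW)) {
        warn_report("%s: upstream port link capabilities are below the "
                    "endpoint's, performance may be degraded", ep->name);
    }
}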
This email is already too long, but I also wonder whether we should
consider additional vfio-pci interfaces to trigger a link retraining or
allow virtualized access to the physical upstream port config space.
Clearly we need to consider multi-function devices and whether there
are useful configurations that could benefit from such access. Thanks
for reading, please discuss,
Alex
* Re: [Qemu-devel] QEMU PCIe link "negotiation"
2018-10-15 20:18 [Qemu-devel] QEMU PCIe link "negotiation" Alex Williamson
@ 2018-10-16 9:49 ` Dr. David Alan Gilbert
2018-10-16 15:50 ` Alex Williamson
2018-10-16 15:21 ` Michael S. Tsirkin
1 sibling, 1 reply; 5+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-16 9:49 UTC (permalink / raw)
To: Alex Williamson; +Cc: qemu-devel, Michael S. Tsirkin
* Alex Williamson (alex.williamson@redhat.com) wrote:
> Hi,
>
> I'd like to start a discussion about virtual PCIe link width and speeds
> in QEMU to figure out how we progress past the 2.5GT/s, x1 width links
> we advertise today. This matters for assigned devices as the endpoint
> driver may not enable full physical link utilization if the upstream
> port only advertises minimal capabilities. One GPU assignment user
> has measured that they only see an average transfer rate of 3.2GB/s
> with current code, but hacking the downstream port to advertise an
> 8GT/s, x16 width link allows them to get 12GB/s. Obviously not all
> devices and drivers will have this dependency and see these kinds of
> improvements, or perhaps any improvement at all.
>
> The first problem seems to be how we expose these link parameters in a
> way that makes sense and supports backwards compatibility and
> migration. I think we want the flexibility to allow the user to
> specify per PCIe device the link width and at least the maximum link
> speed, if not the actual discrete link speeds supported. However,
> while I want to provide this flexibility, I don't necessarily think it
> makes sense to burden the user to always specify these to get
> reasonable defaults. So I would propose that we a) add link parameters
> to the base PCIe device class and b) set defaults based on the machine
> type. Additionally these machine type defaults would only apply to
> generic PCIe root ports and switch ports, anything based on real
> hardware would be fixed, ex. ioh3420 would stay at 2.5GT/s, x1 unless
> overridden by the user. Existing machine types would also stay at this
> "legacy" rate, while pc-q35-3.2 might bring all generic devices up to
> PCIe 4.0 specs, x32 width and 16GT/s, where the per-endpoint
> negotiation would bring us back to negotiated widths and speeds
> matching the endpoint. Reasonable?
>
> Next I think we need to look at how and when we do virtual link
> negotiation. We're mostly discussing a virtual link, so I think
> negotiation is simply filling in the negotiated link and width with the
> highest common factor between endpoint and upstream port. For assigned
> devices, this should match the endpoint's existing negotiated link
> parameters, however, devices can dynamically change their link speed
> (perhaps also width?), so I believe a current link speed of 2.5GT/s
> could upshift to 8GT/s without any sort of visible renegotiation. Does
> this mean that we should have link parameter callbacks from downstream
> port to endpoint? Or maybe the downstream port link status register
> should effectively be an alias for LNKSTA of devfn 00.0 of the
> downstream device when it exists. We only need to report a consistent
> link status value when someone looks at it, so reading directly from
> the endpoint probably makes more sense than any sort of interface to
> keep the value current.
>
> If we take the above approach with LNKSTA (probably also LNKSTA2), is
> any sort of "negotiation" required? We're automatically negotiated if
> the capabilities of the upstream port are a superset of the endpoint's
> capabilities. What do we do and what do we care about when the
> upstream port is a subset of the endpoint though? For example, an
> 8GT/s, x16 endpoint is installed into a 2.5GT/s, x1 downstream port.
> On real hardware we obviously negotiate the endpoint down to the
> downstream port parameters. We could do that with an emulated device,
> but this is the scenario we have today with assigned devices and we
> simply leave the inconsistency. I don't think we actually want to
> (and there would be lots of complications to) force the physical device
> to negotiate down to match a virtual downstream port. Do we simply
> trigger a warning that this may result in non-optimal performance and
> leave the inconsistency?
>
> This email is already too long, but I also wonder whether we should
> consider additional vfio-pci interfaces to trigger a link retraining or
> allow virtualized access to the physical upstream port config space.
> Clearly we need to consider multi-function devices and whether there
> are useful configurations that could benefit from such access. Thanks
> for reading, please discuss,
I'm not sure about the consequences of the actual link speeds, but I
worry we'll hit things looking for PCIe v3 features; in particular
AMD's ROCm code needs PCIe atomics:
https://github.com/RadeonOpenCompute/ROCm_Documentation/blob/master/Installation_Guide/More-about-how-ROCm-uses-PCIe-Atomics.rst
so it feels like getting that to work with passthrough would
need some negotiation of features.
Dave
> Alex
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] QEMU PCIe link "negotiation"
2018-10-15 20:18 [Qemu-devel] QEMU PCIe link "negotiation" Alex Williamson
2018-10-16 9:49 ` Dr. David Alan Gilbert
@ 2018-10-16 15:21 ` Michael S. Tsirkin
2018-10-16 16:46 ` Alex Williamson
1 sibling, 1 reply; 5+ messages in thread
From: Michael S. Tsirkin @ 2018-10-16 15:21 UTC (permalink / raw)
To: Alex Williamson; +Cc: qemu-devel, Marcel Apfelbaum
On Mon, Oct 15, 2018 at 02:18:41PM -0600, Alex Williamson wrote:
> Hi,
>
> I'd like to start a discussion about virtual PCIe link width and speeds
> in QEMU to figure out how we progress past the 2.5GT/s, x1 width links
> we advertise today. This matters for assigned devices as the endpoint
> driver may not enable full physical link utilization if the upstream
> port only advertises minimal capabilities. One GPU assignment user
> has measured that they only see an average transfer rate of 3.2GB/s
> with current code, but hacking the downstream port to advertise an
> 8GT/s, x16 width link allows them to get 12GB/s. Obviously not all
> devices and drivers will have this dependency and see these kinds of
> improvements, or perhaps any improvement at all.
>
> The first problem seems to be how we expose these link parameters in a
> way that makes sense and supports backwards compatibility and
> migration.
Isn't this just for vfio though? So why worry about migration?
> I think we want the flexibility to allow the user to
> specify per PCIe device the link width and at least the maximum link
> speed, if not the actual discrete link speeds supported. However,
> while I want to provide this flexibility, I don't necessarily think it
> makes sense to burden the user to always specify these to get
> reasonable defaults. So I would propose that we a) add link parameters
> to the base PCIe device class and b) set defaults based on the machine
> type. Additionally these machine type defaults would only apply to
> generic PCIe root ports and switch ports, anything based on real
> hardware would be fixed, ex. ioh3420 would stay at 2.5GT/s, x1 unless
> overridden by the user. Existing machine types would also stay at this
> "legacy" rate, while pc-q35-3.2 might bring all generic devices up to
> PCIe 4.0 specs, x32 width and 16GT/s, where the per-endpoint
> negotiation would bring us back to negotiated widths and speeds
> matching the endpoint. Reasonable?
Generally yes. Last time I looked, there's a bunch of stuff the spec
says we need to do for the negotiation. E.g. guest can at any time
request width re-negotiation. Maybe most guests don't do it but it's
still in the spec and we never know whether anyone will do it in the
future.
VFIO is often a compromise but for virtual devices
I'd prefer we are strictly compliant if possible.
> Next I think we need to look at how and when we do virtual link
> negotiation. We're mostly discussing a virtual link, so I think
> negotiation is simply filling in the negotiated link and width with the
> highest common factor between endpoint and upstream port. For assigned
> devices, this should match the endpoint's existing negotiated link
> parameters, however, devices can dynamically change their link speed
> (perhaps also width?), so I believe a current link speed of 2.5GT/s
> could upshift to 8GT/s without any sort of visible renegotiation. Does
> this mean that we should have link parameter callbacks from downstream
> port to endpoint? Or maybe the downstream port link status register
> should effectively be an alias for LNKSTA of devfn 00.0 of the
> downstream device when it exists. We only need to report a consistent
> link status value when someone looks at it, so reading directly from
> the endpoint probably makes more sense than any sort of interface to
> keep the value current.
Don't we need to reflect the physical downstream link speed
somehow though?
> If we take the above approach with LNKSTA (probably also LNKSTA2), is
> any sort of "negotiation" required? We're automatically negotiated if
> the capabilities of the upstream port are a superset of the endpoint's
> capabilities. What do we do and what do we care about when the
> upstream port is a subset of the endpoint though? For example, an
> 8GT/s, x16 endpoint is installed into a 2.5GT/s, x1 downstream port.
> On real hardware we obviously negotiate the endpoint down to the
> downstream port parameters. We could do that with an emulated device,
> but this is the scenario we have today with assigned devices and we
> simply leave the inconsistency. I don't think we actually want to
> (and there would be lots of complications to) force the physical device
> to negotiate down to match a virtual downstream port. Do we simply
> trigger a warning that this may result in non-optimal performance and
> leave the inconsistency?
Also when guest pokes at the width do we need to tweak the
physical device/downstream port?
> This email is already too long, but I also wonder whether we should
> consider additional vfio-pci interfaces to trigger a link retraining or
> allow virtualized access to the physical upstream port config space.
> Clearly we need to consider multi-function devices and whether there
> are useful configurations that could benefit from such access. Thanks
> for reading, please discuss,
>
> Alex
* Re: [Qemu-devel] QEMU PCIe link "negotiation"
2018-10-16 9:49 ` Dr. David Alan Gilbert
@ 2018-10-16 15:50 ` Alex Williamson
0 siblings, 0 replies; 5+ messages in thread
From: Alex Williamson @ 2018-10-16 15:50 UTC (permalink / raw)
To: Dr. David Alan Gilbert; +Cc: qemu-devel, Michael S. Tsirkin
On Tue, 16 Oct 2018 10:49:25 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > This email is already too long, but I also wonder whether we should
> > consider additional vfio-pci interfaces to trigger a link retraining or
> > allow virtualized access to the physical upstream port config space.
> > Clearly we need to consider multi-function devices and whether there
> > are useful configurations that could benefit from such access. Thanks
> > for reading, please discuss,
>
> I'm not sure about the consequences of the actual link speeds, but I
> worry we'll hit things looking for PCIe v3 features; in particular
> AMD's ROCm code needs PCIe atomics:
>
> https://github.com/RadeonOpenCompute/ROCm_Documentation/blob/master/Installation_Guide/More-about-how-ROCm-uses-PCIe-Atomics.rst
>
> so it feels like getting that to work with passthrough would
> need some negotiation of features.
Taking a quick read through the AtomicOps ECN, I think this is
somewhat orthogonal to the link speeds and widths. Support for acting
as a completer or router of atomic ops is indicated through the DEVCAP2
register in the PCIe capability; it's optional and should not be
assumed by other newer PCIe features. So I think we can tackle it
separately, but indeed it does appear to be a difficult feature to
implement correctly for a VM, at least if we attempt to do it
automatically. We might need to burden the user with this sort of
configuration unless AtomicOps support is so ubiquitous that we can
correctly assume that it's available between arbitrary endpoints which
might be assigned to the same VM. A device option on the
pcie-root-port device to expose AtomicOp support is probably feasible
in the short term, though a more thorough reading of the spec is
required.
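Something like the following on the generic root port is what I'm
imagining, purely as a sketch; the option name is made up and whether
the completer/routing bits can ever be set automatically is exactly
the open question:

/* Sketch: hypothetical "x-atomic-ops" option on the generic root port
 * that sets the AtomicOp routing/completer bits in DEVCAP2 (constants
 * as named in Linux's pci_regs.h).
 */
static void gen_rp_set_atomic_caps(PCIDevice *dev)
{
    uint32_t caps = PCI_EXP_DEVCAP2_ATOMIC_ROUTE |
                    PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
                    PCI_EXP_DEVCAP2_ATOMIC_COMP64;

    pci_long_test_and_set_mask(dev->config + dev->exp.exp_cap +
                               PCI_EXP_DEVCAP2, caps);
}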
Thanks,
Alex
* Re: [Qemu-devel] QEMU PCIe link "negotiation"
2018-10-16 15:21 ` Michael S. Tsirkin
@ 2018-10-16 16:46 ` Alex Williamson
0 siblings, 0 replies; 5+ messages in thread
From: Alex Williamson @ 2018-10-16 16:46 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: qemu-devel, Marcel Apfelbaum
On Tue, 16 Oct 2018 11:21:28 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, Oct 15, 2018 at 02:18:41PM -0600, Alex Williamson wrote:
> > Hi,
> >
> > I'd like to start a discussion about virtual PCIe link width and speeds
> > in QEMU to figure out how we progress past the 2.5GT/s, x1 width links
> > we advertise today. This matters for assigned devices as the endpoint
> > driver may not enable full physical link utilization if the upstream
> > port only advertises minimal capabilities. One GPU assignment user
> > has measured that they only see an average transfer rate of 3.2GB/s
> > with current code, but hacking the downstream port to advertise an
> > 8GT/s, x16 width link allows them to get 12GB/s. Obviously not all
> > devices and drivers will have this dependency and see these kinds of
> > improvements, or perhaps any improvement at all.
> >
> > The first problem seems to be how we expose these link parameters in a
> > way that makes sense and supports backwards compatibility and
> > migration.
>
> Isn't this just for vfio though? So why worry about migration?
Migration is coming for vfio devices, mdev devices in the near(er)
term, but I wouldn't be too terribly surprised to see device-specific
migration support either. Regardless, we support hotplug of vfio
devices therefore we cannot only focus on cold-plug scenarios and any
hotplug scenario must work irrespective of whether the VM has been
previously migrated. If we start with a x16/8GT root port with an
assigned GPU, unplug the GPU, migrate, and hot-add a GPU on the target,
it might behave differently if that root port is only exposing x1/2.5GT
capabilities.
I did consider whether devices can dynamically change their speed and
width capabilities; for instance, the supported link speeds vector in
LNKCAP2 is indicated as hardware-init, so I think software would
reasonably expect that these values cannot change; however, the max
link speed and
max link width values in LNKCAP are simply read-only. Flirting with
which registers software might consider dynamic, when they're clearly
not dynamic on real hardware seems troublesome though.
> > I think we want the flexibility to allow the user to
> > specify per PCIe device the link width and at least the maximum link
> > speed, if not the actual discrete link speeds supported. However,
> > while I want to provide this flexibility, I don't necessarily think it
> > makes sense to burden the user to always specify these to get
> > reasonable defaults. So I would propose that we a) add link parameters
> > to the base PCIe device class and b) set defaults based on the machine
> > type. Additionally these machine type defaults would only apply to
> > generic PCIe root ports and switch ports, anything based on real
> > hardware would be fixed, ex. ioh3420 would stay at 2.5GT/s, x1 unless
> > overridden by the user. Existing machine types would also stay at this
> > "legacy" rate, while pc-q35-3.2 might bring all generic devices up to
> > PCIe 4.0 specs, x32 width and 16GT/s, where the per-endpoint
> > negotiation would bring us back to negotiated widths and speeds
> > matching the endpoint. Reasonable?
>
> Generally yes. Last time I looked, there's a bunch of stuff the spec
> says we need to do for the negotiation. E.g. guest can at any time
> request width re-negotiation. Maybe most guests don't do it but it's
> still in the spec and we never know whether anyone will do it in the
> future.
>
> VFIO is often a compromise but for virtual devices
> I'd prefer we are strictly compliant if possible.
I would also want to be as spec compliant as possible and we'll need to
think about how to incorporate things like link change notifications,
these may require additional support from vfio if we can capture the
event on the host and plumb it through the virtual downstream port. In
general though, I think link retraining and width changes will be rather
transparently handled if the downstream port defers to mirroring the
link status of the connected endpoint. I'll try to look specifically
at each interaction for compliance, but if you have any specific
things you think are going to be troublesome, please let me know.
> > Next I think we need to look at how and when we do virtual link
> > negotiation. We're mostly discussing a virtual link, so I think
> > negotiation is simply filling in the negotiated link and width with the
> > highest common factor between endpoint and upstream port. For assigned
> > devices, this should match the endpoint's existing negotiated link
> > parameters, however, devices can dynamically change their link speed
> > (perhaps also width?), so I believe a current link speed of 2.5GT/s
> > could upshift to 8GT/s without any sort of visible renegotiation. Does
> > this mean that we should have link parameter callbacks from downstream
> > port to endpoint? Or maybe the downstream port link status register
> > should effectively be an alias for LNKSTA of devfn 00.0 of the
> > downstream device when it exists. We only need to report a consistent
> > link status value when someone looks at it, so reading directly from
> > the endpoint probably makes more sense than any sort of interface to
> > keep the value current.
>
> Don't we need to reflect the physical downstream link speed
> somehow though?
The negotiated physical downstream port speed and width must match the
endpoint's speed and width, so I think the only concern here is that we
might have a mismatch of capabilities, right? I'm not sure we have an
alternative though. If the root port capabilities need to match the
physical device, then we've essentially precluded hotplug unless we're
going to suggest that we always hot-add a matching root port, into
which we'll then hot-add the assigned device. Therefore I favored the
approach of simply over-spec'ing the virtual devices and I think there
are physical precedents for this as well. For example, there exists a
range of passive adapter and expansion devices for PCIe which can
change the width and may also restrict the speed. A x16 endpoint may
only negotiate a x1 width even though both the endpoint and slot are
x16 capable if one of these[1] is interposed between them. The link
speed may be similarly restricted with one of these[2].
[1]https://www.amazon.com/gp/product/B0039XPS5W/
[2]https://www.amazon.com/Laptop-External-PCI-Graphics-Card/dp/B00Q4VMLF6
In the scheme I propose, the user would have the ability to set the
root port to speeds and widths that match the physical device, but the
default case would be to effectively over-provision the virtual device.
> > If we take the above approach with LNKSTA (probably also LNKSTA2), is
> > any sort of "negotiation" required? We're automatically negotiated if
> > the capabilities of the upstream port are a superset of the endpoint's
> > capabilities. What do we do and what do we care about when the
> > upstream port is a subset of the endpoint though? For example, an
> > 8GT/s, x16 endpoint is installed into a 2.5GT/s, x1 downstream port.
> > On real hardware we obviously negotiate the endpoint down to the
> > downstream port parameters. We could do that with an emulated device,
> > but this is the scenario we have today with assigned devices and we
> > simply leave the inconsistency. I don't think we actually want to
> > (and there would be lots of complications to) force the physical device
> > to negotiate down to match a virtual downstream port. Do we simply
> > trigger a warning that this may result in non-optimal performance and
> > leave the inconsistency?
>
> Also when guest pokes at the width do we need to tweak the
> physical device/downstream port?
How does the guest poke at width? The guest can set a link target
speed in LNKCTL2, but the width seems to be only negotiated at the
physical level. AIUI the standard procedure would be for a driver to
set a target link speed and then retrain the link from the downstream
port to implement that request (roughly the sequence sketched below).
The retraining may or may not achieve the target link speed, and
retraining is transparent to the data layer of the link, so it seems
safe to do nothing on retraining, but
we may want to consider it as a future enhancement. Physical link
retraining also gets us into scenarios where we need to think about
multifunction endpoints and the ownership of the endpoints affected by
a link retraining. It might be like the bus reset support where the
user needs to own all the affected devices to initiate a link
retraining. I don't think anything here precludes that; it'd be an
additional callback from the downstream port to the devfn 00.0 endpoint
to initiate a retraining and vfio would need to figure out what it can
do with that.
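For reference, the retraining sequence I'm describing above, as a
rough sketch only; read_cfg_word()/write_cfg_word() and struct
pcie_port are stand-ins for whatever config accessors a guest driver
actually has, and the register constants are the usual pci_regs.h
names:

/* Sketch of the standard guest-side sequence: program a target speed
 * in the downstream port's LNKCTL2, set Retrain Link in LNKCTL, then
 * poll Link Training in LNKSTA.  "pos" is the PCIe capability offset.
 */
static int retrain_link_to_speed(struct pcie_port *port, int pos,
                                 uint16_t target)
{
    uint16_t ctl2, ctl, sta;
    int timeout = 1000;

    ctl2 = read_cfg_word(port, pos + PCI_EXP_LNKCTL2);
    ctl2 = (ctl2 & ~PCI_EXP_LNKCTL2_TLS) | target;  /* e.g. 0x3 = 8GT/s */
    write_cfg_word(port, pos + PCI_EXP_LNKCTL2, ctl2);

    ctl = read_cfg_word(port, pos + PCI_EXP_LNKCTL);
    write_cfg_word(port, pos + PCI_EXP_LNKCTL, ctl | PCI_EXP_LNKCTL_RL);

    do {
        sta = read_cfg_word(port, pos + PCI_EXP_LNKSTA);
    } while ((sta & PCI_EXP_LNKSTA_LT) && --timeout);

    /* Retraining may legitimately fall short of the requested speed */
    return (sta & PCI_EXP_LNKSTA_CLS) == target ? 0 : -1;
}

Thanks,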
Alex