Revisit VT-d asynchronous flush issue

* Revisit VT-d asynchronous flush issue
@ 2015-11-02  8:03 Tian, Kevin
  2015-11-02 11:39 ` Andrew Cooper
  2015-11-02 14:10 ` Jan Beulich
  0 siblings, 2 replies; 8+ messages in thread
From: Tian, Kevin @ 2015-11-02  8:03 UTC (permalink / raw)
  To: george.dunlap@eu.citrix.com, tim@xen.org, xen-devel@lists.xen.org,
	andrew.cooper3@citrix.com, Dong, Eddie, ian.campbell@citrix.com,
	jbeulich@suse.com, Nakajima, Jun, keir@xen.org, Tian, Kevin,
	Zhang, Yang Z, Xu, Quan, Sankaran, Rajesh

Let's start a new thread with a summary of previous discussion, and 
then our latest experiment data and updated proposal.

From previous discussions, the suggestion was that a spin model is
acceptable only when the spin timeout does not exceed the order of a
scheduling time slice, or of other blocking operations such as WBINVD.
Otherwise an async-flush model is preferred, to prevent misbehaving
guests from causing long spins that impact the whole system.

Below are some thresholds to be considered:

1) scheduling time slice in Credit is 1ms.

2) WBINVD cost is 4.6ms in the worst case on an IVT platform (32 cores,
10GB NIC assigned to the VM, running iperf). Detailed data is appended
at the bottom. Actual cost varies across platforms, due to different
cache size/layout. For example, we also heard from other colleagues
about a 10ms-level cost on another platform.

3) PCI SIG strongly recommends that the Completion Timeout mechanism
not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device Capabilities
2 Register). This means a CPU MMIO read might already take >10ms, which
we simply had not noticed.

Based on the above information, we think a timeout in the range
[1ms, 10ms] would likely not introduce bad system behavior.
Or conservatively, we can define the spin timeout default as 1ms,
while allowing a boot-time override up to 10ms for more flexibility.

Then regarding VT-d flush:

- For context/iotlb/iec flush, our measurements show worst cases
<10us. We also confirmed with the hardware team that 1ms is large
enough for IOMMU internal flush.

- For ATS device-TLB flush, PCI spec defines up to 60s, but:

	* Our hardware team confirms that 1ms should be enough for 
integrated PCI devices w/ ATS.

	* for discrete PCI devices w/ ATS, it's uncertain whether 1ms 
or 10ms is too restrictive to them, but there are only a few devices
now in the market. 

Based on the above information, we propose to continue with the
spin-timeout model w/ some adjustments, which fixes the current timeout
concern and also allows limited ATS support in a lightweight way:

1) reduce spin timeout to 1ms, which can be boot-time changed
up to 10ms.

2) if timeout expires, kill the VM which the target device is assigned 
to. Optionally hypervisor may mark device non-assignable.

It works for devices w/o ATS. It works for integrated devices w/ ATS.
It might or might not work for discrete devices w/ ATS, but we can
defer re-evaluating the gain vs. software complexity of async flush
until we see many discrete devices breaking the timeout assumptions
in the future.
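
The two steps above can be sketched roughly as follows. This is only an
illustration of the spin-timeout idea, not Xen code: every name here
(spin_wait_flush, on_flush_result, struct fake_hw, and so on) is
hypothetical, and the hardware and clock are replaced by a simulated
device so that the sketch is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; none are Xen or VT-d interfaces. */

#define FLUSH_TIMEOUT_DEFAULT_US 1000u  /* proposed 1ms default */

enum flush_result { FLUSH_DONE, FLUSH_TIMED_OUT };

/* Callbacks stand in for the completion check and the time source. */
typedef bool (*flush_done_fn)(void *ctx);
typedef uint64_t (*now_us_fn)(void *ctx);

/* Spin until the flush completes or timeout_us microseconds elapse. */
static enum flush_result spin_wait_flush(flush_done_fn done, now_us_fn now_us,
                                         void *ctx, uint64_t timeout_us)
{
    uint64_t start = now_us(ctx);

    for (;;) {
        if (done(ctx))
            return FLUSH_DONE;
        if (now_us(ctx) - start >= timeout_us)
            return FLUSH_TIMED_OUT;
    }
}

/* Minimal stand-in for the state of the VM that owns the device. */
struct domain_state {
    bool crashed;            /* VM has been killed */
    bool device_assignable;  /* device may be assigned again */
};

/* Step 2: on timeout, kill the owning VM and (optionally) quarantine
 * the device by marking it non-assignable. */
static void on_flush_result(struct domain_state *d, enum flush_result r)
{
    if (r == FLUSH_TIMED_OUT) {
        d->crashed = true;
        d->device_assignable = false;
    }
}

/* Simulated device: the clock advances 1us per read, and the flush
 * completes once the clock reaches completes_at_us. */
struct fake_hw {
    uint64_t now_us;
    uint64_t completes_at_us;
};

static uint64_t fake_now(void *ctx)
{
    struct fake_hw *hw = ctx;
    return ++hw->now_us;
}

static bool fake_done(void *ctx)
{
    struct fake_hw *hw = ctx;
    return hw->now_us >= hw->completes_at_us;
}
```

With the simulated device, a flush that completes after a few
microseconds returns FLUSH_DONE, while one that would need the full 60s
allowed by the spec hits the 1ms limit and returns FLUSH_TIMED_OUT, at
which point on_flush_result() marks the VM as crashed and the device as
non-assignable.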

Thoughts?

----
<detail data>
           Min(us)    Max(us)    Average(us)
context       5.24       5.49           5.36
iotlb         1.90       2.07           2.03
iec           5.54       7.86           6.58
wbinvd     2721.42    4655.71        3571.43

Platform info:
1. Base Board Information
        Manufacturer: Intel Corporation
        Product Name: S2600CP
        Version: E99552-561

2. CPU:
	cpu family : 6
	model : 62
	model name : Genuine Intel(R) CPU  @ 2.80GHz
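
As a quick cross-check of the detail data against the proposed limits,
the comparison can be written down as below. The numbers are copied
from the table above; the struct and function names are made up for
this sketch only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for one row of the measurement table. */
struct flush_sample {
    const char *op;
    double min_us, max_us, avg_us;
};

/* Values taken from the detail data in this mail. */
static const struct flush_sample samples[] = {
    { "context",    5.24,    5.49,    5.36 },
    { "iotlb",      1.90,    2.07,    2.03 },
    { "iec",        5.54,    7.86,    6.58 },
    { "wbinvd",  2721.42, 4655.71, 3571.43 },
};

#define NR_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* True if the worst-case cost stays within the given budget. */
static bool fits_budget(const struct flush_sample *s, double budget_us)
{
    return s->max_us <= budget_us;
}

/* How many times the worst case fits into the budget. */
static double headroom(const struct flush_sample *s, double budget_us)
{
    return budget_us / s->max_us;
}
```

With a 1ms (1000us) budget, the context, iotlb and iec flushes all fit
with more than 100x headroom, while the WBINVD worst case does not fit
in 1ms but does fit in a 10ms budget.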

Thanks
Kevin

* Re: Revisit VT-d asynchronous flush issue
  2015-11-02  8:03 Revisit VT-d asynchronous flush issue Tian, Kevin
@ 2015-11-02 11:39 ` Andrew Cooper
  2015-11-03  2:27   ` Tian, Kevin
  2015-11-02 14:10 ` Jan Beulich
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2015-11-02 11:39 UTC (permalink / raw)
  To: Tian, Kevin, george.dunlap@eu.citrix.com, tim@xen.org,
	xen-devel@lists.xen.org, Dong, Eddie, ian.campbell@citrix.com,
	jbeulich@suse.com, Nakajima, Jun, keir@xen.org, Zhang, Yang Z,
	Xu, Quan, Sankaran, Rajesh

On 02/11/15 08:03, Tian, Kevin wrote:
> Let's start a new thread with a summary of previous discussion, and 
> then our latest experiment data and updated proposal.
>
> From previous discussions, it's suggested that a spin model is accepted, 
> only when spin timeout doesn't exceed the order of a scheduling time 
> slice, or other blocking operations like what WBINVD might take. 
> Otherwise async-flush model is preferred to prevent misbehaving guests 
> taking long spins if possible, to impact whole system.
>
> Below are some thresholds to be considered:
>
> 1) scheduling time slice in Credit is 1ms.

1ms is the minimum scheduling timeslice.  30ms is the current default
(not that this should affect the following reasoning).

>
> 2) WBINVD cost is 4.6ms in worst case on an IVT platform (32 cores, 
> 10GB NIC assigned to the VM, running iperf). Detail data is append in 
> the bottom. Actual cost varies on different platforms, due to different 
> cache size/layout. For example, we also heard from other colleagues 
> about 10ms level cost on another platform.
>
> 3) PCI SIG strongly recommends that Completion Timeout mechanism
> not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device Capabilities
> 2 Register). It means CPU MMIO read might already take >10ms which 
> we just didn't note.
>
> Based on above information, at least we can think a timeout range
> between [1ms, 10ms] would likely not introduce bad system behavior. 
> Or conservatively, we can define the spin timeout default as 1ms, 
> while allowing boot-time override up to 10ms for more flexibility.
>
> Then regarding to VT-d flush:
>
> - For context/iotlb/iec flush, our measurements show worst cases
> <10us. We also confirmed with hardware team, that 1ms is large 
> enough for IOMMU internal flush.
>
> - For ATS device-TLB flush, PCI spec defines up to 60s, but:
>
> 	* Our hardware team confirms that 1ms should be enough for 
> integrated PCI devices w/ ATS.
>
> 	* for discrete PCI devices w/ ATS, it's uncertain whether 1ms 
> or 10ms is too restrictive to them, but there are only a few devices
> now in the market. 
>
> Based on above information, we propose to continue spin-timeout
> model w/ some adjustment, which fixes current timeout concern
> and also allows limited ATS support in a light way:
>
> 1) reduce spin timeout to 1ms, which can be boot-time changed
> up to 10ms.

If this is going to be command line configurable, don't have an upper limit.

Given the uncertainty with external devices, it might be necessary to
experiment with timeouts greater than 10ms.

>
> 2) if timeout expires, kill the VM which the target device is assigned 
> to. Optionally hypervisor may mark device non-assignable.
>
> It works for devices w/o ATS. It works for integrated devices w/ ATS.
> It might or might not work for discrete devices w/ ATS, but we can
> re-evaluate the gain vs. software complexity of async flush until we 
> see many discrete devices breaking the timeout assumptions in the 
> future.
>
> Thoughts?

As presented, this is probably an improvement, but I am concerned about
the case of external devices.

Then again, as none of this currently works at all, we are not in a
worse state.

~Andrew

* Re: Revisit VT-d asynchronous flush issue
  2015-11-02  8:03 Revisit VT-d asynchronous flush issue Tian, Kevin
  2015-11-02 11:39 ` Andrew Cooper
@ 2015-11-02 14:10 ` Jan Beulich
  2015-11-03  9:58   ` George Dunlap
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2015-11-02 14:10 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Jun Nakajima, keir@xen.org, ian.campbell@citrix.com,
	george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	tim@xen.org, xen-devel@lists.xen.org, Yang Z Zhang, Quan Xu,
	Rajesh Sankaran

>>> On 02.11.15 at 09:03, <kevin.tian@intel.com> wrote:
> Based on above information, we propose to continue spin-timeout
> model w/ some adjustment, which fixes current timeout concern
> and also allows limited ATS support in a light way:
> 
> 1) reduce spin timeout to 1ms, which can be boot-time changed
> up to 10ms.
> 
> 2) if timeout expires, kill the VM which the target device is assigned 
> to. Optionally hypervisor may mark device non-assignable.
> 
> It works for devices w/o ATS. It works for integrated devices w/ ATS.
> It might or might not work for discrete devices w/ ATS, but we can
> re-evaluate the gain vs. software complexity of async flush until we 
> see many discrete devices breaking the timeout assumptions in the 
> future.
> 
> Thoughts?

Yes, let's take this approach as a first step at least.

Jan

* Re: Revisit VT-d asynchronous flush issue
  2015-11-02 11:39 ` Andrew Cooper
@ 2015-11-03  2:27   ` Tian, Kevin
  2015-11-03  3:28     ` Xu, Quan
  0 siblings, 1 reply; 8+ messages in thread
From: Tian, Kevin @ 2015-11-03  2:27 UTC (permalink / raw)
  To: Andrew Cooper, george.dunlap@eu.citrix.com, tim@xen.org,
	xen-devel@lists.xen.org, Dong, Eddie, ian.campbell@citrix.com,
	jbeulich@suse.com, Nakajima, Jun, keir@xen.org, Zhang, Yang Z,
	Xu, Quan, Sankaran, Rajesh

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Monday, November 02, 2015 7:40 PM
> 
> >
> > Based on above information, we propose to continue spin-timeout
> > model w/ some adjustment, which fixes current timeout concern
> > and also allows limited ATS support in a light way:
> >
> > 1) reduce spin timeout to 1ms, which can be boot-time changed
> > up to 10ms.
> 
> If this is going to be command line configurable, don't have an upper limit.
> 
> Given the uncertainty with external devices, it might be necessary to
> experiment with timeouts greater than 10ms.

That also works.

> 
> >
> > 2) if timeout expires, kill the VM which the target device is assigned
> > to. Optionally hypervisor may mark device non-assignable.
> >
> > It works for devices w/o ATS. It works for integrated devices w/ ATS.
> > It might or might not work for discrete devices w/ ATS, but we can
> > re-evaluate the gain vs. software complexity of async flush until we
> > see many discrete devices breaking the timeout assumptions in the
> > future.
> >
> > Thoughts?
> 
> As presented, this is probably an improvement, but I am concerning with
> the case of external devices.
> 
> Then again, as none of this currently works at all, we are not in a
> worse state.
> 

Understood. So based on your and Jan's comments, let's go with
this proposal first.

Thanks
Kevin

* Re: Revisit VT-d asynchronous flush issue
  2015-11-03  2:27   ` Tian, Kevin
@ 2015-11-03  3:28     ` Xu, Quan
  0 siblings, 0 replies; 8+ messages in thread
From: Xu, Quan @ 2015-11-03  3:28 UTC (permalink / raw)
  To: Tian, Kevin, Andrew Cooper, george.dunlap@eu.citrix.com,
	tim@xen.org, xen-devel@lists.xen.org, Dong, Eddie,
	ian.campbell@citrix.com, jbeulich@suse.com, Nakajima, Jun,
	keir@xen.org, Zhang, Yang Z, Sankaran, Rajesh

>>> On 03.11.2015 at 10:27, <Tian, Kevin> wrote:
> > > Based on above information, we propose to continue spin-timeout
> > > model w/ some adjustment, which fixes current timeout concern and
> > > also allows limited ATS support in a light way:
> > >
> > > 1) reduce spin timeout to 1ms, which can be boot-time changed up to
> > > 10ms.
> >
> > If this is going to be command line configurable, don't have an upper limit.
> >
> > Given the uncertainty with external devices, it might be necessary to
> > experiment with timeouts greater than 10ms.
> 
> That also works.
> 
> >
> > >
> > > 2) if timeout expires, kill the VM which the target device is
> > > assigned to. Optionally hypervisor may mark device non-assignable.
> > >
> > > It works for devices w/o ATS. It works for integrated devices w/ ATS.
> > > It might or might not work for discrete devices w/ ATS, but we can
> > > re-evaluate the gain vs. software complexity of async flush until we
> > > see many discrete devices breaking the timeout assumptions in the
> > > future.
> > >
> > > Thoughts?
> >
> > As presented, this is probably an improvement, but I am concerning
> > with the case of external devices.
> >
> > Then again, as none of this currently works at all, we are not in a
> > worse state.
> >
> 
> Understood. So based on your and Jan's comments, let's go with this proposal
> first.

Thanks!
I will take Kevin's approach and send out patch set ASAP.

-Quan

* Re: Revisit VT-d asynchronous flush issue
  2015-11-02 14:10 ` Jan Beulich
@ 2015-11-03  9:58   ` George Dunlap
  2015-11-03 10:04     ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: George Dunlap @ 2015-11-03  9:58 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: Jun Nakajima, keir@xen.org, ian.campbell@citrix.com,
	george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	tim@xen.org, xen-devel@lists.xen.org, Yang Z Zhang, Quan Xu,
	Rajesh Sankaran

On 02/11/15 14:10, Jan Beulich wrote:
>>>> On 02.11.15 at 09:03, <kevin.tian@intel.com> wrote:
>> Based on above information, we propose to continue spin-timeout
>> model w/ some adjustment, which fixes current timeout concern
>> and also allows limited ATS support in a light way:
>>
>> 1) reduce spin timeout to 1ms, which can be boot-time changed
>> up to 10ms.

Out of curiosity, is there a reason to limit the timeout to 10ms?

I'm generally a believer that we should do something sensible by
default, but that an admin -- particularly someone who is going to be
messing around with this sort of setting -- should be allowed to "shoot
themselves in the foot" if they want to.

Suppose that there's some particularly grotty piece of hardware that
really does require a 30ms, or 100ms timeout to work effectively?  If we
have a hard limit of 10ms, there's nothing the person can do other than
re-compile Xen.  If we have no hard limit, they can simply set it to
100ms as a work-around until we get asynchronous flushing working.
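
To make the "sensible default, no hard limit" idea concrete, here is a
minimal sketch of how such a boot-time value could be parsed. It is not
Xen's command-line handling; the function name and the choice of
milliseconds as the unit are assumptions made only for illustration.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical default: the proposed 1ms spin timeout. */
#define FLUSH_TIMEOUT_DEFAULT_MS 1ULL

/*
 * Parse a timeout given in milliseconds. Missing, malformed, zero or
 * out-of-range input falls back to the default. Any other value is
 * accepted as-is: there is deliberately no upper clamp, so an admin
 * can pick 30ms or 100ms if some device needs it.
 */
static unsigned long long parse_flush_timeout_ms(const char *arg)
{
    char *end;
    unsigned long long ms;

    /* Require a leading digit so that signs and spaces are rejected. */
    if (arg == NULL || !isdigit((unsigned char)arg[0]))
        return FLUSH_TIMEOUT_DEFAULT_MS;

    errno = 0;
    ms = strtoull(arg, &end, 10);
    if (errno == ERANGE || *end != '\0' || ms == 0)
        return FLUSH_TIMEOUT_DEFAULT_MS;

    return ms;
}
```

In this sketch, "100" yields a 100ms timeout, while an absent or
unparsable value leaves the 1ms default in place.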

Other than that, this sounds sensible to me.

 -George

* Re: Revisit VT-d asynchronous flush issue
  2015-11-03  9:58   ` George Dunlap
@ 2015-11-03 10:04     ` Jan Beulich
  2015-11-05  6:48       ` Tian, Kevin
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2015-11-03 10:04 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, Yang Z Zhang, keir@xen.org, ian.campbell@citrix.com,
	george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	tim@xen.org, xen-devel@lists.xen.org, Rajesh Sankaran,
	Jun Nakajima, Quan Xu

>>> On 03.11.15 at 10:58, <george.dunlap@citrix.com> wrote:
> On 02/11/15 14:10, Jan Beulich wrote:
>>>>> On 02.11.15 at 09:03, <kevin.tian@intel.com> wrote:
>>> Based on above information, we propose to continue spin-timeout
>>> model w/ some adjustment, which fixes current timeout concern
>>> and also allows limited ATS support in a light way:
>>>
>>> 1) reduce spin timeout to 1ms, which can be boot-time changed
>>> up to 10ms.
> 
> Out of curiosity, is there a reason to limit the timeout to 10ms?
> 
> I'm generally a believer that we should do something sensible by
> default, but that an admin -- particularly someone who is going to be
> messing around with this sort of setting -- should be allowed to "shoot
> themselves in the foot" if they want to.
> 
> Suppose that there's some particularly grotty piece of hardware that
> really does require a 30ms, or 100ms timeout to work effectively?  If we
> have a hard limit of 10ms, there's nothing the person can do other than
> re-compile Xen.  If we have no hard limit, they can simply set it to
> 100ms as a work-around until we get asynchronous flushing working.

Andrew requested that too, and I understood that's what's planned
to be implemented.

Jan

* Re: Revisit VT-d asynchronous flush issue
  2015-11-03 10:04     ` Jan Beulich
@ 2015-11-05  6:48       ` Tian, Kevin
  0 siblings, 0 replies; 8+ messages in thread
From: Tian, Kevin @ 2015-11-05  6:48 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap
  Cc: Nakajima, Jun, keir@xen.org, ian.campbell@citrix.com,
	george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	tim@xen.org, xen-devel@lists.xen.org, Zhang, Yang Z, Xu, Quan,
	Sankaran, Rajesh

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, November 03, 2015 6:04 PM
> 
> >>> On 03.11.15 at 10:58, <george.dunlap@citrix.com> wrote:
> > On 02/11/15 14:10, Jan Beulich wrote:
> >>>>> On 02.11.15 at 09:03, <kevin.tian@intel.com> wrote:
> >>> Based on above information, we propose to continue spin-timeout
> >>> model w/ some adjustment, which fixes current timeout concern
> >>> and also allows limited ATS support in a light way:
> >>>
> >>> 1) reduce spin timeout to 1ms, which can be boot-time changed
> >>> up to 10ms.
> >
> > Out of curiosity, is there a reason to limit the timeout to 10ms?
> >
> > I'm generally a believer that we should do something sensible by
> > default, but that an admin -- particularly someone who is going to be
> > messing around with this sort of setting -- should be allowed to "shoot
> > themselves in the foot" if they want to.
> >
> > Suppose that there's some particularly grotty piece of hardware that
> > really does require a 30ms, or 100ms timeout to work effectively?  If we
> > have a hard limit of 10ms, there's nothing the person can do other than
> > re-compile Xen.  If we have no hard limit, they can simply set it to
> > 100ms as a work-around until we get asynchronous flushing working.
> 
> Andrew requested that too, and I understood that's what's planned
> to be implemented.
> 

Yes, that's the deal. :-)

Thanks
Kevin
