From: Andrew Cooper <andrew.cooper3@citrix.com>
To: "Tian, Kevin" <kevin.tian@intel.com>,
	"george.dunlap@eu.citrix.com" <george.dunlap@eu.citrix.com>,
	"tim@xen.org" <tim@xen.org>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	"Dong, Eddie" <eddie.dong@intel.com>,
	"ian.campbell@citrix.com" <ian.campbell@citrix.com>,
	"jbeulich@suse.com" <jbeulich@suse.com>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	"keir@xen.org" <keir@xen.org>,
	"Zhang, Yang Z" <yang.z.zhang@intel.com>,
	"Xu, Quan" <quan.xu@intel.com>,
	"Sankaran, Rajesh" <rajesh.sankaran@intel.com>
Subject: Re: Revisit VT-d asynchronous flush issue
Date: Mon, 2 Nov 2015 11:39:49 +0000
Message-ID: <56374B85.6050301@citrix.com>
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F6E9CF9@SHSMSX101.ccr.corp.intel.com>

On 02/11/15 08:03, Tian, Kevin wrote:
> Let's start a new thread with a summary of previous discussion, and 
> then our latest experiment data and updated proposal.
>
> From previous discussions, the suggestion was that a spin model is
> acceptable only when the spin timeout does not exceed the order of a
> scheduling time slice, or of other blocking operations such as WBINVD.
> Otherwise an async-flush model is preferred, so that misbehaving guests
> cannot cause long spins that impact the whole system.
>
> Below are some thresholds to be considered:
>
> 1) scheduling time slice in Credit is 1ms.

1ms is the minimum scheduling timeslice.  30ms is the current default
(not that this should affect the following reasoning).

>
> 2) WBINVD cost is 4.6ms in the worst case on an IVT platform (32 cores,
> 10GB NIC assigned to the VM, running iperf). Detailed data is appended
> at the bottom. Actual cost varies between platforms, due to different
> cache size/layout. For example, we also heard from other colleagues
> about a cost on the order of 10ms on another platform.
>
> 3) PCI SIG strongly recommends that the Completion Timeout mechanism
> not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device Capabilities
> 2 Register). It means a CPU MMIO read might already take >10ms, which
> we simply had not noticed.
>
> Based on the above information, we can at least assume that a timeout
> in the range [1ms, 10ms] is unlikely to introduce bad system behavior.
> Or, conservatively, we can define the spin timeout default as 1ms,
> while allowing a boot-time override up to 10ms for more flexibility.
>
> Then, regarding VT-d flush:
>
> - For context/iotlb/iec flush, our measurements show worst cases
> <10us. We also confirmed with the hardware team that 1ms is large
> enough for IOMMU internal flush.
>
> - For ATS device-TLB flush, PCI spec defines up to 60s, but:
>
> 	* Our hardware team confirms that 1ms should be enough for 
> integrated PCI devices w/ ATS.
>
> 	* For discrete PCI devices w/ ATS, it's uncertain whether 1ms
> or 10ms is too restrictive for them, but there are only a few such
> devices on the market at present.
>
> Based on the above information, we propose to continue with the
> spin-timeout model w/ some adjustments, which fixes the current timeout
> concern and also allows limited ATS support in a lightweight way:
>
> 1) reduce spin timeout to 1ms, which can be boot-time changed
> up to 10ms.

If this is going to be command line configurable, don't have an upper limit.

Given the uncertainty with external devices, it might be necessary to
experiment with timeouts greater than 10ms.
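
As a rough illustration of a boot-time override with no upper bound, the
sketch below parses a timeout in microseconds and falls back to the
proposed 1ms default only when the value is missing, malformed, or zero.
The function name, the macro, and the choice of microseconds as the unit
are assumptions made for this example; none of them is an existing Xen
option or interface.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Proposed default from the thread: 1ms, expressed here in microseconds. */
#define DEFAULT_FLUSH_TIMEOUT_US 1000UL

/*
 * Hypothetical helper: parse a decimal timeout (in microseconds) from a
 * command-line argument.  There is deliberately no upper limit, so values
 * above 10ms can be tried when experimenting with external devices.
 * Missing, malformed, out-of-range, or zero input yields the default.
 */
static unsigned long parse_flush_timeout_us(const char *arg)
{
    char *end;
    unsigned long val;

    /* Require a leading digit so that signs and whitespace are rejected. */
    if ( arg == NULL || !isdigit((unsigned char)arg[0]) )
        return DEFAULT_FLUSH_TIMEOUT_US;

    errno = 0;
    val = strtoul(arg, &end, 10);

    /* Reject overflow, trailing junk, and a zero timeout. */
    if ( errno != 0 || *end != '\0' || val == 0 )
        return DEFAULT_FLUSH_TIMEOUT_US;

    return val;
}
```

A value such as "50000" (50ms) is accepted as-is, which is the point of
dropping the 10ms cap.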

>
> 2) if the timeout expires, kill the VM to which the target device is
> assigned. Optionally, the hypervisor may mark the device non-assignable.
>
> It works for devices w/o ATS. It works for integrated devices w/ ATS.
> It might or might not work for discrete devices w/ ATS, but we can
> defer re-evaluating the gain vs. software complexity of async flush
> until we see many discrete devices breaking the timeout assumptions
> in the future.
>
> Thoughts?
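
The spin-timeout model proposed above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the VT-d
driver's code: the callback types, the result enum, and the fake clock
and fake device used to exercise the loop are all hypothetical. On
FLUSH_TIMED_OUT the caller would, per point 2), kill the VM that owns the
device and optionally mark the device non-assignable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callbacks: a monotonic clock and a completion check. */
typedef uint64_t (*now_us_fn)(void);
typedef bool (*flush_done_fn)(void *ctx);

enum flush_result { FLUSH_OK, FLUSH_TIMED_OUT };

/*
 * Poll for flush completion until it succeeds or timeout_us microseconds
 * have elapsed since the first clock read.
 */
static enum flush_result spin_for_flush(flush_done_fn done, void *ctx,
                                        now_us_fn now_us,
                                        uint64_t timeout_us)
{
    uint64_t start = now_us();

    for ( ; ; )
    {
        if ( done(ctx) )
            return FLUSH_OK;

        if ( now_us() - start >= timeout_us )
            return FLUSH_TIMED_OUT;
    }
}

/* Fake clock for demonstration: every read advances time by 100us. */
static uint64_t fake_clock_us;

static uint64_t fake_now_us(void)
{
    fake_clock_us += 100;
    return fake_clock_us;
}

/* Fake device for demonstration: completes after a fixed number of polls. */
struct fake_dev {
    unsigned int polls_left;
};

static bool fake_done(void *ctx)
{
    struct fake_dev *dev = ctx;

    if ( dev->polls_left == 0 )
        return true;

    dev->polls_left--;
    return false;
}
```

With the fake clock, a device needing 3 polls completes well inside a
1000us budget, while one needing 1000 polls hits the timeout path.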

As presented, this is probably an improvement, but I am concerned about
the case of external devices.

Then again, as none of this currently works at all, we are not in a
worse state.

~Andrew
