From: Suravee Suthikulanit
Subject: Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
Date: Mon, 9 Mar 2015 14:02:15 -0500
Message-ID: <54FDEE37.7020008@amd.com>
In-Reply-To: <54FDD7C4.5020905@citrix.com>
References: <54EF315902000078000640FF@mail.emea.novell.com>
 <54EF341C020000780006416C@mail.emea.novell.com>
 <54F892CD.2000300@citrix.com>
 <54F96A440200007800066D57@mail.emea.novell.com>
 <54F99A63.6040200@citrix.com>
 <54FDBF73.8040407@amd.com>
 <54FDD7C4.5020905@citrix.com>
To: Andrew Cooper, Jan Beulich
Cc: xen-devel, Dario Faggioli, Aravind Gopalakrishnan
List-Id: xen-devel@lists.xenproject.org

On 3/9/2015 12:26 PM, Andrew Cooper wrote:
> On 09/03/15 15:42, Suravee Suthikulanit wrote:
>> On 3/6/2015 6:15 AM, Andrew Cooper wrote:
>>> On 06/03/2015 07:50, Jan Beulich wrote:
>>>>>>> On 05.03.15 at 18:30, wrote:
>>>>> On 26/02/15 13:56, Jan Beulich wrote:
>>>>>> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>>>> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>>>> @@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
>>>>>>       return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >>
>>>>>> PAGE_SHIFT;
>>>>>>   }
>>>>>>
>>>>>> -static inline struct page_info* alloc_amd_iommu_pgtable(void)
>>>>>> +static inline struct page_info *alloc_amd_iommu_pgtable(struct
>>>>>> domain *d)
>>>>>>   {
>>>>>>       struct page_info *pg;
>>>>>>       void *vaddr;
>>>>>>
>>>>>> -    pg = alloc_domheap_page(NULL, 0);
>>>>>> +    pg = alloc_domheap_page(d,
>>>>>> MEMF_no_owner);
>>>>> Same comment as with the VT-d side of things. This should be based on
>>>>> the proximity information of the IOMMU, not of the owning domain.
>>>> I think I buy this argument on the VT-d side (under the assumption
>>>> that there's going to be at least one IOMMU per node), but I'm not
>>>> sure here: The most modern AMD box I have has just a single
>>>> IOMMU for 4 nodes it reports.
>>>
>>> It is not possible for an IOMMU to cover multiple NUMA nodes worth of
>>> IO, because of the position it has to sit relative to the IO root ports
>>> and QPI/HT links.
>>>
>>> In AMD systems, the IOMMUs lives in the northbridges, meaning one per
>>> numa node (as it is the northbridges which contain the hypertransport
>>> links)
>>>
>>> The BIOS/firmware will only report IOMMUs from northbridges which have
>>> IO connected to their IO hypertransport link (most systems in the wild
>>> have all IO hanging off one or two Numa nodes). On the other hand, I
>>> have an AMD system with 8 IOMMUs in use.
>>
>>
>> Actually, a single IOMMU could handle multiple nodes. For example, in
>> scenario of a multi-chip-module (MCM) setup, there could be at least
>> 2-4 nodes sharing one IOMMU depending on how the platform vendor
>> configuring the system. In the server platforms, IOMMU is in AMD
>> northbridge chipsets (e.g. SR56xx). This website has an example of
>> such system configuration
>> (http://www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html).
>
> Ok - I was basing my example on the last layout I had the manual for,
> which I believe was Bulldozer.
>
> However, my point still stands that there is an IOMMU between any IO and
> RAM. An individual IOMMU will always benefit from having its
> iopagetables on the local numa node, rather than the numa node(s) which
> the domain owning the device is running on.
>

I agree that having the IO page tables on the NUMA node that is closest
to the IOMMU would be beneficial.
However, I am not sure at the moment that this information can be
easily determined. I think ACPI _PXM for devices should be able to
provide this information, but it is optional and often not available.

>>
>> For AMD IOMMU, the IVRS table specifies the PCI bus/device ranges to
>> be handled by each IOMMU. This is probably should be considered here.
>
> Presumably a PCI transaction must never get onto the HT bus without
> having already undergone translation, or there can be no guarantee that
> it would be routed via the IOMMU? Or are you saying that there are
> cases where a transaction will enter the HT bus, route sideways to an
> IOMMU, undergo translation, then route back onto the HT bus to the
> target RAM/processor?
>
> ~Andrew
>

The IOMMU sits between the PCI devices (downstream) and HT (upstream), so
all DMA transactions from downstream must go through the IOMMU. On the
other hand, the I/O page translation is handled by the IOMMU, and it is
separate traffic from the downstream device DMA transactions.

Suravee