From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:46496) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aY8qE-0000BL-Se for qemu-devel@nongnu.org; Tue, 23 Feb 2016 04:01:15 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aY8qA-0003PZ-Od for qemu-devel@nongnu.org; Tue, 23 Feb 2016 04:01:10 -0500 Received: from mail-pa0-x230.google.com ([2607:f8b0:400e:c03::230]:35814) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aY8q9-0003BP-JJ for qemu-devel@nongnu.org; Tue, 23 Feb 2016 04:01:06 -0500 Received: by mail-pa0-x230.google.com with SMTP id ho8so109938596pac.2 for ; Tue, 23 Feb 2016 01:00:24 -0800 (PST) References: <1456121379-13434-1-git-send-email-aik@ozlabs.ru> <20160222062631.GH2808@voom.fritz.box> <56CAF2B3.2030502@ozlabs.ru> <20160222121227.GN2808@voom.fritz.box> <56CBBFD7.3050806@ozlabs.ru> <20160223062056.GU2808@voom.fritz.box> From: Alexey Kardashevskiy Message-ID: <56CC1FA3.5020003@ozlabs.ru> Date: Tue, 23 Feb 2016 20:00:19 +1100 MIME-Version: 1.0 In-Reply-To: <20160223062056.GU2808@voom.fritz.box> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu] memory: Fix IOMMU replay base address List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson Cc: qemu-devel@nongnu.org On 02/23/2016 05:20 PM, David Gibson wrote: > On Tue, Feb 23, 2016 at 01:11:35PM +1100, Alexey Kardashevskiy wrote: >> On 02/22/2016 11:12 PM, David Gibson wrote: >>> On Mon, Feb 22, 2016 at 10:36:19PM +1100, Alexey Kardashevskiy wrote: >>>> On 02/22/2016 05:26 PM, David Gibson wrote: >>>>> On Mon, Feb 22, 2016 at 05:09:39PM +1100, Alexey Kardashevskiy wrote: >>>>>> Since a788f227 "memory: Allow replay of IOMMU mapping notifications" >>>>>> when new VFIO listener is added, all existing IOMMU mappings are replayed. >>>>>> However there is a problem that the base address of an IOMMU memory region >>>>>> (IOMMU MR) is ignored which is not a problem for the existing user (which is >>>>>> pseries) with its default 32bit DMA window starting at 0 but it is if there is >>>>>> another DMA window. >>>>>> >>>>>> This adjusts the replaying address by mr->addr. >>>>> >>>>> Uh.. this doesn't look right to me. AFAICT from the existing >>>>> implementations the 'addr' parameter to the translate function is an >>>>> offset within the memory region, which would make the original version >>>>> correct. >>>> >>>> Ok, then spapr_tce_translate_iommu() needs to be fixed. Or I am missing >>>> something here? The @addr field is definitely ignored now. >>> >>> I think at least one of us is missing something. In the normal AS >>> translation path, IIUC, it will subtract the mr->addr value to give >>> offsets which are passed to translate. It uses those offsets to find >>> the TCE and returns a translated address. Where else the @addr be >>> used? >> >> Ok, this patch is definitely wrong. >> >> The actual problem I am debugging is this: >> 1. run guest with an emulated XHCI to have 2 DMA windows; >> 2. hotplug vfio-pci device. >> 3. observe failing vfio_dma_map(). >> >> This triggers memory_region_register_iommu_notifier() + >> memory_region_iommu_replay() which calls vfio_iommu_map_notify() in a loop. >> For the second window (huge one, 0x800000000000000 bus address) it fails as >> iotlb.iova's returned by spapr_tce_translate_iommu() are offsets within >> IOMMU MR. >> >> But vfio_dma_map() (which is eventually called by vfio_iommu_map_notify()) >> needs an absolute iotlb->iova address. >> >> imho this should be correct: >> >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c >> index 2e02dc6..0d6b926 100644 >> --- a/hw/ppc/spapr_iommu.c >> +++ b/hw/ppc/spapr_iommu.c >> @@ -126,7 +126,7 @@ static IOMMUTLBEntry >> spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr, >> hwaddr page_mask = IOMMU_PAGE_MASK(tcet->page_shift); >> >> tce = tcet->table[addr >> tcet->page_shift]; >> - ret.iova = addr & page_mask; >> + ret.iova = (addr + iommu->addr) & page_mask; >> ret.translated_addr = tce & page_mask; > > I wondered about that change, but I'd have to look closer to see if > the iova field here is expected to be relative to the MR as well. It > would be oddly inconsistent if it wasn't. It is relative and it does not make sense as there is no source MR/AS in iotlb (only target AS) so there is no use in such iova. > >> However this does not help either. >> >> Because at the moment I create hardware windows from a helper which is >> called from spapr_phb_hot_plug_child() (HotplugHandlerClass::plug() >> callback). And this is too late as pci_qdev_realize("vfio-pci") calls: >> >> vfio_connect_container -> ... -> vfio_listener_region_add -> .. -> >> memory_region_iommu_replay >> and the latter fails - there is no huge window in the host kernel yet. >> What is the right way of fixing this?... > > Hmm.. I think we need a backend specific hook which will be called > from vfio_listener_region_add() when it sees a new iommu region, but > before it calls replay. > > Basically that notifier is enumerating to the VFIO subsystem where the > guest has IOMMU windows. With DDW on the host, we need something to > mirror the guest side windows into the host (or reject them, if > they're not compatible. The host kernel API consists of 2 ioctls basically - create and remove a window. There is no "query" which would tell how many windows are there and their parameters. When there is a first vfio-pci device, I basically go through the list of emulated windows and create hardware windows for every active window. But I do this from spapr-phb code where I have the complete picture - how many windows, LIOBNs. This vfio_listener_region_add() thing - it is called per IOMMU MR, i.e. once or twice per PHB, in undefined order, not good (however the order never changes in the reality). And how exactly do I connect VFIO container and spapr PHB (presumably using QOM)? I cannot think of any good interface now... Have another memory listener on PHB's AS in spapr_pci? > Actually.. that hook should replace the test a little bit higher > testing against container->min_iova and container->max_iova. With > host-side DDW the set of allowable IOVAs is no longer a single, simple > range - and it's now changeable. I am missing the purpose of container->min_iova/max_iova, on x86 these are constant values and on pseries - QEMU chooses the windows parameters when creates the windows so this check is more like assert... -- Alexey