From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B0B27C433F5 for ; Fri, 20 May 2022 15:51:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347851AbiETPvZ (ORCPT ); Fri, 20 May 2022 11:51:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54840 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351201AbiETPvX (ORCPT ); Fri, 20 May 2022 11:51:23 -0400 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 821CD5B3F4 for ; Fri, 20 May 2022 08:51:21 -0700 (PDT) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 35C421477; Fri, 20 May 2022 08:51:21 -0700 (PDT) Received: from [10.57.82.55] (unknown [10.57.82.55]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 942EF3F73D; Fri, 20 May 2022 08:51:17 -0700 (PDT) Message-ID: Date: Fri, 20 May 2022 16:51:11 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Thunderbird/91.9.0 Subject: Re: s390-iommu.c default domain conversion Content-Language: en-GB To: Niklas Schnelle , Jason Gunthorpe Cc: Matthew Rosato , linux-s390@vger.kernel.org, alex.williamson@redhat.com, cohuck@redhat.com, farman@linux.ibm.com, pmorel@linux.ibm.com, borntraeger@linux.ibm.com, hca@linux.ibm.com, gor@linux.ibm.com, gerald.schaefer@linux.ibm.com, agordeev@linux.ibm.com, svens@linux.ibm.com, frankja@linux.ibm.com, david@redhat.com, imbrenda@linux.ibm.com, vneethv@linux.ibm.com, oberpar@linux.ibm.com, freude@linux.ibm.com, thuth@redhat.com, pasic@linux.ibm.com References: <20220509233552.GT49344@nvidia.com> <20220510160911.GH49344@nvidia.com> <20220520134414.GH1343366@nvidia.com> <6271dd24bfcf82b0c1b911a163ae9549c24691a4.camel@linux.ibm.com> From: Robin Murphy In-Reply-To: <6271dd24bfcf82b0c1b911a163ae9549c24691a4.camel@linux.ibm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-s390@vger.kernel.org On 2022-05-20 16:17, Niklas Schnelle wrote: > On Fri, 2022-05-20 at 10:44 -0300, Jason Gunthorpe wrote: >> On Fri, May 20, 2022 at 03:05:46PM +0200, Niklas Schnelle wrote: >> >>> I did some testing and created a prototype that gets rid of >>> arch/s390/pci_dma.c and works soley via dma-iommu on top of our IOMMU >>> driver. It looks like the existing dma-iommu code allows us to do this >>> with relatively simple changes to the IOMMU driver only, mostly just >>> implementing iotlb_sync(), iotlb_sync_map() and flush_iotlb_all() so >>> that's great. They also do seem to map quite well to our RPCIT I/O TLB >>> flush so that's great. For now the prototype still uses 4k pages only. >> >> You are going to want to improve that page sizes in the iommu driver >> anyhow for VFIO. > > Ok, we'll look into this. > >> >>> With that the performance on the LPAR machine hypervisor (no paging) is >>> on par with our existing code. On paging hypervisors (z/VM and KVM) >>> i.e. with the hypervisor shadowing the I/O translation tables, it's >>> still slower than our existing code and interestingly strict mode seems >>> to be better than lazy here. One thing I haven't done yet is implement >>> the map_pages() operation or adding larger page sizes. >> >> map_pages() speeds thiings up if there is contiguous memory, I'm not >> sure what work load you are testing with so hard to guess if that is >> interesting or not. > > Our most important driver is mlx5 with both IP and RDMA traffic on > ConnectX-4/5/6 but we also support NVMes. Since you already have the loop in s390_iommu_update_trans(), updating to map/unmap_pages should be trivial, and well worth it. >>> Maybe you have some tips what you'd expect to be most beneficial? >>> Either way we're optimistic this can be solved and this conversion >>> will be a high ranking item on my backlog going forward. >> >> I'm not really sure I understand the differences, do you have a sense >> what is making it slower? Maybe there is some small feature that can >> be added to the core code? It is very strange that strict is faster, >> that should not be, strict requires synchronous flush in the unmap >> cas, lazy does not. Are you sure you are getting the lazy flushes >> enabled? > > The lazy flushes are the timer triggered flush_iotlb_all() in > fq_flush_iotlb(), right? I definitely see that when tracing my > flush_iotlb_all() implementation via that path. That flush_iotlb_all() > in my prototype is basically the same as the global RPCIT we did once > we wrapped around our IOVA address space. I suspect that this just > happens much more often with the timer than our wrap around and > flushing the entire aperture is somewhat slow because it causes the > hypervisor to re-examine the entire I/O translation table. On the other > hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap() > always flushes a relatively small contiguous range as I'm using the > following construct to extend gather: > > if (iommu_iotlb_gather_is_disjoint(gather, iova, size)) > iommu_iotlb_sync(domain, gather); > > iommu_iotlb_gather_add_range(gather, iova, size); > > Maybe the smaller contiguous ranges just help with locality/caching > because the flushed range in the guests I/O tables was just updated. That's entirely believable - both the AMD and Intel drivers force strict mode when virtualised for similar reasons, so feel free to do the same. Robin.