From: "Michael S. Tsirkin"
Subject: Re: decent performance drop for SCSI LLD / SAN initiator when iommu is turned on
Date: Tue, 7 May 2013 15:22:35 +0300
Message-ID: <20130507122235.GA21361@redhat.com>
In-Reply-To: <5188304E.9050603@intel.com>
References: <20130502015603.GC26105@redhat.com> <5188304E.9050603@intel.com>
To: Alexander Duyck
Cc: Roland Dreier, Or Gerlitz, linux-rdma@vger.kernel.org, Yan Burman, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Paolo Bonzini, Asias He
List-Id: linux-rdma@vger.kernel.org

On Mon, May 06, 2013 at 03:35:58PM -0700, Alexander Duyck wrote:
> On 05/06/2013 02:39 PM, Or Gerlitz wrote:
> > On Thu, May 2, 2013 at 4:56 AM, Michael S. Tsirkin wrote:
> >> On Thu, May 02, 2013 at 02:11:15AM +0300, Or Gerlitz wrote:
> >>> So we've noted that when configuring the kernel and booting with the
> >>> Intel IOMMU set to on on a physical node (non-VM, and without
> >>> enabling SR-IOV in the HW device driver), raw performance of the
> >>> iSER (iSCSI RDMA) SAN initiator is reduced notably. E.g., in the
> >>> testbed we looked at today we had ~260K 1KB random IOPS and
> >>> 5.5 GB/s BW for 128KB IOs with the IOMMU turned off for a single
> >>> LUN, and ~150K IOPS and 4 GB/s BW with the IOMMU turned on. No
> >>> change on the target node between runs.
> >> That's why we have iommu=pt.
> >> See the definition of iommu_pass_through in arch/x86/kernel/pci-dma.c.
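(For reference, pass-through is selected on the kernel command line
together with enabling the IOMMU; assuming a GRUB-based distro, the
relevant fragment would be something like:)

```shell
# /etc/default/grub: keep DMA remapping available for device assignment,
# but put host-owned devices in the identity-mapped (pass-through)
# domain so streaming DMA map/unmap stays cheap:
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
```

(After regenerating the grub config and rebooting, `cat /proc/cmdline`
should show both options.)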
> > Hi Michael (hope you feel better),
> >
> > We did some runs with the pt approach you suggested and still didn't
> > get the promised gain. In parallel we came across the 2012 commit
> > f800326dc ("ixgbe: Replace standard receive path with a page based
> > receive"), where they say "[...] we are able to see a considerable
> > performance gain when an IOMMU is enabled because we are no longer
> > unmapping every buffer on receive [...] instead we can simply call
> > sync_single_range [...]". Looking at the commit you can see that
> > they allocate a page/skb, dma_map it once initially, and later in
> > the life cycle of that buffer use dma_sync_for_device/cpu, avoiding
> > dma_map/unmap on the fast path.
> >
> > Now, a few questions on which I'd love to hear people's opinions.
> > First, this approach seems cool for a network device's RX path, but
> > what about the TX path? Any idea how to avoid dma_map there, or why
> > calling dma_map/unmap for every buffer on the TX path doesn't
> > involve a notable perf hit? Second, I don't see how to apply the
> > method to a block device, since these devices don't allocate
> > buffers; rather, they get a scatter-gather list of pages from the
> > upper layers, issue dma_map_sg on them, and submit the IO, calling
> > dma_unmap_sg when it completes.
> >
> > Or.
>
> The Tx path ends up taking a performance hit if the IOMMU is enabled.
> It just isn't as severe due to things like TSO.
>
> One way to work around the performance penalty is to allocate bounce
> buffers and just leave them statically mapped. Then you can simply
> memcpy the data into the buffers and avoid the locking overhead of
> allocating/freeing IOMMU resources. It consumes more memory but works
> around the IOMMU limitations.
>
> Thanks,
>
> Alex

But why isn't iommu=pt effective? AFAIK the whole point of it was to
give up on security for host-controlled devices, but still get a
measure of security for assigned devices.

-- 
MST
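P.S. For anyone following along, the map-once / sync-per-use RX pattern
the ixgbe commit describes looks roughly like the sketch below
(kernel-style C; only the dma_* calls are the real DMA API, the
rx_buffer structure and helper names are illustrative):

```c
/* Sketch of the map-once / sync-per-use RX pattern. The mapping is
 * created once on the slow path; the fast path only transfers
 * ownership of the buffer between CPU and device with sync calls. */

struct rx_buffer {
	struct page *page;
	dma_addr_t dma;		/* mapping lives as long as the buffer */
};

/* Slow path: map once when the buffer is allocated. */
static int rx_buffer_init(struct device *dev, struct rx_buffer *buf)
{
	buf->page = alloc_page(GFP_ATOMIC);
	if (!buf->page)
		return -ENOMEM;

	buf->dma = dma_map_page(dev, buf->page, 0, PAGE_SIZE,
				DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, buf->dma)) {
		__free_page(buf->page);
		return -ENOMEM;
	}
	return 0;
}

/* Fast path: no map/unmap, no IOMMU allocator, just syncs. */
static void rx_buffer_complete(struct device *dev, struct rx_buffer *buf,
			       unsigned int len)
{
	/* Hand the received bytes to the CPU... */
	dma_sync_single_range_for_cpu(dev, buf->dma, 0, len,
				      DMA_FROM_DEVICE);

	/* ... consume/copy the packet data here ... */

	/* ... then give the buffer back to the device for reuse. */
	dma_sync_single_range_for_device(dev, buf->dma, 0, len,
					 DMA_FROM_DEVICE);
}

/* Slow path: dma_unmap_page() only when the buffer is finally freed. */
```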
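And a sketch of the statically-mapped bounce-buffer approach Alex
describes for TX (again kernel-style C; only the DMA API calls are
real, the pool structure, names, and sizes are made up):

```c
/* Sketch: a pool of bounce buffers mapped once at driver init. The TX
 * fast path memcpy()s into a pre-mapped buffer instead of calling
 * dma_map_single()/dma_unmap_single() per packet, so it never touches
 * the IOMMU allocator. Trades memory and one copy for mapping cost. */

#define BOUNCE_BUF_SIZE	2048
#define BOUNCE_POOL	256

struct bounce_buf {
	void *cpu_addr;
	dma_addr_t dma;		/* mapped once, never unmapped on fast path */
};

static struct bounce_buf pool[BOUNCE_POOL];

/* Slow path: allocate and map the whole pool up front. */
static int bounce_pool_init(struct device *dev)
{
	int i;

	for (i = 0; i < BOUNCE_POOL; i++) {
		pool[i].cpu_addr = kmalloc(BOUNCE_BUF_SIZE, GFP_KERNEL);
		if (!pool[i].cpu_addr)
			return -ENOMEM;
		pool[i].dma = dma_map_single(dev, pool[i].cpu_addr,
					     BOUNCE_BUF_SIZE, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, pool[i].dma))
			return -ENOMEM;
	}
	return 0;
}

/* Fast path: copy into a free pre-mapped buffer, sync it toward the
 * device, and return the DMA address to post to the NIC descriptor. */
static dma_addr_t bounce_tx(struct device *dev, struct bounce_buf *buf,
			    const void *data, size_t len)
{
	memcpy(buf->cpu_addr, data, len);
	dma_sync_single_for_device(dev, buf->dma, len, DMA_TO_DEVICE);
	return buf->dma;
}
```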