public inbox for linux-s390@vger.kernel.org
From: David Woodhouse <dwmw2@infradead.org>
To: Benjamin Serebrin <serebrin@google.com>,
	Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Cc: David Miller <davem@davemloft.net>,
	Joerg Roedel <jroedel@suse.de>,
	benh@kernel.crashing.org, arnd@arndb.de, corbet@lwn.net,
	linux-doc@vger.kernel.org, linux-arch@vger.kernel.org,
	luto@kernel.org, borntraeger@de.ibm.com,
	cornelia.huck@de.ibm.com, sebott@linux.vnet.ibm.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	hch@lst.de, kvm@vger.kernel.org, schwidefsky@de.ibm.com,
	linux-s390@vger.kernel.org
Subject: Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
Date: Mon, 16 Nov 2015 08:42:33 +0000	[thread overview]
Message-ID: <1447663353.145626.46.camel@infradead.org> (raw)
In-Reply-To: <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>


On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> We looked into Intel IOMMU performance a while ago and learned a few
> useful things.  We generally did a parallel 200 thread TCP_RR test,
> as this churns through mappings quickly and uses all available cores.
> 
> First, the main bottleneck was software performance[1].

For the Intel IOMMU, *all* we need to do is put a PTE in place. For
real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
need to do an IOTLB flush. It's a single 64-bit write of the PTE.

All else is software overhead.

(I'm deliberately ignoring the stupid chipsets where DMA page tables
aren't cache coherent and we need a clflush too. They make me too sad.)
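(For illustration, the fast path described above can be modelled in a few
lines of user-space C. The bit layout and names below are invented for the
sketch, not the intel-iommu driver's actual ones; the point is that mapping
a fresh IOVA is one 64-bit store:)

```c
#include <stdint.h>

/* Toy model of a second-level IOMMU PTE: page frame in bits 12+,
 * bit 0 = read permission, bit 1 = write permission.  Invented
 * layout for illustration only. */
#define PTE_READ  (1ull << 0)
#define PTE_WRITE (1ull << 1)

static uint64_t make_pte(uint64_t phys_addr)
{
    return (phys_addr & ~0xfffull) | PTE_READ | PTE_WRITE;
}

/* Mapping a previously-unmapped IOVA is a single 64-bit store: the
 * IOMMU cannot have cached a not-present entry, so no IOTLB flush is
 * needed (caching-mode/emulated IOMMUs being the exception noted
 * above).  Everything beyond this store is software overhead. */
static void map_page(uint64_t *pte_slot, uint64_t phys_addr)
{
    *pte_slot = make_pte(phys_addr);
}
```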


>   This study preceded the recent patch to break the locks into pools
> ("Break up monolithic iommu table/lock into finer graularity pools
> and lock").  There were several points of lock contention:
> - the RB tree ...
> - the RB tree ...
> - the RB tree ...
> 
> Omer's paper (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
> has some promising approaches.  The magazine avoids the RB tree issue.

I'm thinking of ditching the RB tree altogether and switching to the
allocator in lib/iommu-common.c (and thus getting the benefit of the
finer granularity pools).
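(The pooled-allocator structure is roughly this, sketched in user-space C
with pthread mutexes standing in for spinlocks; pool count, slot count, and
names are invented for illustration. Each pool has its own lock, and each
CPU starts in a different pool, so concurrent mappers rarely contend:)

```c
#include <stdint.h>
#include <pthread.h>

/* Minimal sketch of the lib/iommu-common.c idea: split the IOVA space
 * into independently-locked pools. */
#define NPOOLS 4
#define SLOTS_PER_POOL 64

struct iova_pool {
    pthread_mutex_t lock;
    uint64_t bitmap;          /* 1 bit per IOVA slot, 0 = free */
};

static struct iova_pool pools[NPOOLS] = {
    { PTHREAD_MUTEX_INITIALIZER, 0 }, { PTHREAD_MUTEX_INITIALIZER, 0 },
    { PTHREAD_MUTEX_INITIALIZER, 0 }, { PTHREAD_MUTEX_INITIALIZER, 0 },
};

/* Allocate one IOVA slot, starting in the caller's "home" pool and
 * falling over to the next pool only when it is full. */
static long alloc_iova(unsigned int cpu)
{
    unsigned int p = cpu % NPOOLS;

    for (unsigned int tries = 0; tries < NPOOLS;
         tries++, p = (p + 1) % NPOOLS) {
        pthread_mutex_lock(&pools[p].lock);
        for (int bit = 0; bit < SLOTS_PER_POOL; bit++) {
            if (!(pools[p].bitmap & (1ull << bit))) {
                pools[p].bitmap |= 1ull << bit;
                pthread_mutex_unlock(&pools[p].lock);
                return (long)p * SLOTS_PER_POOL + bit;
            }
        }
        pthread_mutex_unlock(&pools[p].lock);
    }
    return -1;  /* IOVA space exhausted */
}
```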

> I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> page table cleanup algorithm could do well.

When you say 'dynamic 1:1 mapping', is that the same thing that's been
suggested elsewhere — avoiding the IOVA allocator completely by using a
virtual address which *matches* the physical address, if that virtual
address is available? Simply cmpxchg on the PTE itself, and if it was
already set *then* we fall back to the allocator, obviously configured
to allocate from a range *higher* than the available physical memory.

Jörg has been looking at this too, and was even trying to find space in
the PTE for a use count so a given page could be in more than one
mapping before we call back to the IOVA allocator.
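(A sketch of that combined idea, in C11 atomics: claim the identity-mapped
PTE with a cmpxchg, keep a small use count in otherwise-unused low PTE
bits so the same page can be mapped more than once, and report failure so
the caller can fall back to the IOVA allocator. The bit layout is invented
for illustration, and refcount saturation is ignored:)

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define PTE_PRESENT        (1ull << 0)
#define PTE_REFCOUNT_SHIFT 1
#define PTE_REFCOUNT_ONE   (1ull << PTE_REFCOUNT_SHIFT)

/* Try to claim the PTE whose IOVA equals 'phys' (dynamic 1:1 mapping).
 * Returns true if the slot was empty and is now ours, or already held
 * the same page and its use count was bumped.  Returns false if a
 * different page lives there: the caller falls back to the regular
 * allocator, configured to hand out IOVAs above physical memory. */
static bool try_map_1to1(_Atomic uint64_t *pte, uint64_t phys)
{
    uint64_t want = (phys & ~0xfffull) | PTE_PRESENT | PTE_REFCOUNT_ONE;
    uint64_t old = 0;

    /* Fast path: one cmpxchg on the empty PTE. */
    if (atomic_compare_exchange_strong(pte, &old, want))
        return true;

    /* Slot occupied: if it holds the same frame, bump the use count
     * (refcount overflow is not handled in this sketch). */
    while ((old & ~0xfffull) == (phys & ~0xfffull)) {
        uint64_t bumped = old + PTE_REFCOUNT_ONE;
        if (atomic_compare_exchange_weak(pte, &old, bumped))
            return true;
    }
    return false;   /* different page: fall back to IOVA allocator */
}
```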


> There are correctness fixes and optimizations left in the
> invalidation path: I want strict-ish semantics (a page doesn't go
> back into the freelist until the last IOTLB/IOMMU TLB entry is
> invalidated) with good performance, and that seems to imply that an
> additional page reference should be gotten at dma_map time and put
> back at the completion of the IOMMU flush routine.  (This is worthy
> of much discussion.)  

We already do something like this for page table pages which are freed
by an unmap, FWIW.

> Additionally, we can find ways to optimize the flush routine by
> realizing that if we have frequent maps and unmaps, it may be because
> the device creates and destroys buffers a lot; these kind of
> workloads use an IOVA for one event and then never come back.  Maybe
> TLBs don't do much good and we could just flush the entire IOMMU TLB
> [and ATS cache] for that BDF.

That would be a very interesting thing to look at. Although it would be
nice if we had a better way to measure the performance impact of IOTLB
misses — currently we don't have a lot of visibility at all.
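(The decision logic for that heuristic could be as simple as the sketch
below: count unmaps per device, and past a batch threshold issue one
device-wide (BDF-scoped) flush instead of per-page invalidations. The
threshold and names are invented for illustration:)

```c
#include <stdbool.h>

/* Invented threshold: how many queued unmaps justify one whole-device
 * flush over individual IOTLB invalidations. */
#define BATCH_SIZE 32

struct dev_flush_state {
    unsigned int pending_unmaps;   /* invalidations queued this batch */
};

static void note_unmap(struct dev_flush_state *s)
{
    s->pending_unmaps++;
}

/* true  => issue a single BDF-scoped IOTLB [and ATS] flush;
 * false => issue the queued per-page invalidations instead.
 * Either way the batch counter resets. */
static bool flush_whole_device(struct dev_flush_state *s)
{
    bool whole = s->pending_unmaps >= BATCH_SIZE;

    s->pending_unmaps = 0;
    return whole;
}
```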

> 1: We verified that the IOMMU costs are almost entirely software
> overheads by forcing software 1:1 mode, where we create page tables
> for all physical addresses.  We tested using leaf nodes of size 4KB,
of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> of the page tables, and no IOMMU invalidations.  We did piles of DMA
> maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> addresses.  At 4KB page size, we could see some bandwidth slowdown,
> but at 2MB and 1GB, there was < 1% performance loss as compared with
> IOMMU off.

Was this with ATS on or off? With ATS, the cost of the page walk can be
amortised in some cases — you can look up the physical address *before*
you are ready to actually start the DMA to it, and don't take that
latency at the time you're actually moving the data.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation



  parent reply	other threads:[~2015-11-16  8:42 UTC|newest]

Thread overview: 38+ messages
     [not found] <1445789224-28032-1-git-send-email-shamir.rabinovitch@oracle.com>
     [not found] ` <1445789224-28032-2-git-send-email-shamir.rabinovitch@oracle.com>
2015-10-28  6:30   ` [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS David Woodhouse
2015-10-28 11:10     ` Shamir Rabinovitch
2015-10-28 13:31       ` David Woodhouse
2015-10-28 14:07         ` David Miller
2015-10-28 13:57           ` David Woodhouse
2015-10-29  0:23             ` David Miller
2015-10-29  0:32         ` Benjamin Herrenschmidt
2015-10-29  0:42           ` David Woodhouse
2015-10-29  1:10             ` Benjamin Herrenschmidt
2015-10-29 18:31               ` Andy Lutomirski
2015-10-29 22:35                 ` David Woodhouse
2015-11-01  7:45                   ` Shamir Rabinovitch
2015-11-01 21:10                     ` Benjamin Herrenschmidt
2015-11-02  7:23                       ` Shamir Rabinovitch
2015-11-02 10:00                         ` Benjamin Herrenschmidt
2015-11-02 12:07                           ` Shamir Rabinovitch
2015-11-02 20:13                             ` Benjamin Herrenschmidt
2015-11-02 21:45                               ` Arnd Bergmann
2015-11-02 23:08                                 ` Benjamin Herrenschmidt
2015-11-03 13:11                                   ` Christoph Hellwig
2015-11-03 19:35                                     ` Benjamin Herrenschmidt
2015-11-02 21:49                               ` Shamir Rabinovitch
2015-11-02 22:48                       ` David Woodhouse
2015-11-02 23:10                         ` Benjamin Herrenschmidt
2015-11-05 21:08                   ` David Miller
2015-10-30  1:51                 ` Benjamin Herrenschmidt
2015-10-30 10:32               ` Arnd Bergmann
2015-10-30 23:17                 ` Benjamin Herrenschmidt
2015-10-30 23:24                   ` Arnd Bergmann
2015-11-02 14:51                 ` Joerg Roedel
2015-10-29  7:32             ` Shamir Rabinovitch
2015-11-02 14:44               ` Joerg Roedel
2015-11-02 17:32                 ` Shamir Rabinovitch
2015-11-05 13:42                   ` Joerg Roedel
2015-11-05 21:11                     ` David Miller
2015-11-07 15:06                       ` Shamir Rabinovitch
     [not found]                         ` <CAN+hb0UvztgwNuAh93XdJEe7vgiZgNMc9mHNziHpEopg8Oi4Mg@mail.gmail.com>
2015-11-16  8:42                           ` David Woodhouse [this message]
     [not found]                             ` <CAN+hb0UWpfcS5DvgMxNjY-5JOztw2mO1r2FJAW17fn974mhxPA@mail.gmail.com>
2015-11-16 18:42                               ` Benjamin Serebrin
