iommu.lists.linux-foundation.org archive mirror
* [PATCH 0/7] Intel IOMMU scalability improvements
@ 2015-12-28 16:14 Adam Morrison
       [not found] ` <20151228161421.GA27829-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  2016-03-15 18:09 ` Ben Serebrin via iommu
  0 siblings, 2 replies; 5+ messages in thread
From: Adam Morrison @ 2015-12-28 16:14 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/

This patchset improves the scalability of the Intel IOMMU code by
resolving two spinlock bottlenecks, yielding up to ~10x performance
improvement and approaching iommu=off performance.

For example, here's the throughput obtained by 16 memcached instances
running on a 16-core Sandy Bridge system, accessed using memslap on
another machine that has iommu=off, using the default memslap config
(64-byte keys, 1024-byte values, and 10%/90% SET/GET ops):

    stock iommu=off:
       1,088,996 memcached transactions/sec (=100%, median of 10 runs).
    stock iommu=on:
       123,760 memcached transactions/sec (=11%).
       [perf: 43.56%    0.86%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]
    patched iommu=on:
       1,067,586 memcached transactions/sec (=98%).
       [perf: 0.75%     0.75%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]
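
For reference, the client side of such a run looks roughly like the
command below.  The exact invocation isn't part of this posting;
memaslap is the name memslap goes by in recent libmemcached releases,
and the thread/concurrency/time values here are assumptions:

    # illustrative only; the defaults already give 64-byte keys,
    # 1024-byte values, and 10%/90% SET/GET ops
    memaslap -s <server>:11211 -T 16 -c 128 -t 60s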

The two resolved spinlock bottlenecks (see the C sketch after this list):

 - Deferred IOTLB invalidations are batched in a global data structure
   and serialized under a spinlock (add_unmap() & flush_unmaps()); this
   patchset batches IOTLB invalidations in a per-CPU data structure.

 - IOVA management (alloc_iova() & __free_iova()) is serialized under
   the rbtree spinlock; this patchset adds per-CPU caches of allocated
   IOVAs so that the rbtree doesn't get accessed frequently. (Adding a
   cache above the existing IOVA allocator is less intrusive than dynamic
   identity mapping and helps keep IOMMU page table usage low; see
   Patch 7.)
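
To make the shape of both fixes concrete, here is a minimal C sketch,
under loudly-stated assumptions: every identifier below
(pcpu_flush_queue, iova_cpu_cache, flush_this_queue, rbtree_alloc_iova)
and every size constant is illustrative only, not what the patches
actually use, and lock initialization, the flush timer, and per-size
IOVA caching are elided:

    /* Illustrative sketch only -- not the actual patch code. */
    #include <linux/percpu.h>
    #include <linux/spinlock.h>
    #include <linux/iova.h>

    #define FLUSH_QUEUE_SIZE 128  /* unmaps batched before forcing a flush */
    #define IOVA_CACHE_SIZE   64  /* freed IOVA pfns kept per CPU */

    /* Hypothetical helpers, named here only for readability. */
    struct pcpu_flush_queue;
    static void flush_this_queue(struct pcpu_flush_queue *fq);
    static unsigned long rbtree_alloc_iova(struct iova_domain *iovad,
                                           unsigned long size);

    struct flush_entry {
            unsigned long iova_pfn;  /* first page frame of the range */
            unsigned long pages;     /* length of the range, in pages */
    };

    /* Patch 2's idea: one deferred-invalidation queue per CPU, so
     * queuing an unmap no longer serializes all CPUs on one global
     * spinlock; each lock is contended only by its own CPU and the
     * periodic flush timer. */
    struct pcpu_flush_queue {
            spinlock_t lock;         /* initialization elided */
            unsigned int count;
            struct flush_entry entries[FLUSH_QUEUE_SIZE];
    };
    static DEFINE_PER_CPU(struct pcpu_flush_queue, flush_queues);

    static void queue_deferred_unmap(unsigned long pfn, unsigned long pages)
    {
            struct pcpu_flush_queue *fq = get_cpu_ptr(&flush_queues);
            unsigned long flags;

            spin_lock_irqsave(&fq->lock, flags);
            if (fq->count == FLUSH_QUEUE_SIZE)
                    flush_this_queue(fq);  /* invalidate IOTLB, free IOVAs */
            fq->entries[fq->count].iova_pfn = pfn;
            fq->entries[fq->count].pages = pages;
            fq->count++;
            spin_unlock_irqrestore(&fq->lock, flags);
            put_cpu_ptr(&flush_queues);
    }

    /* Patch 7's idea: a magazine-style per-CPU cache in front of the
     * rbtree allocator, so most alloc/free pairs never take the
     * rbtree lock. */
    struct iova_cpu_cache {
            unsigned int count;
            unsigned long pfns[IOVA_CACHE_SIZE];
    };
    static DEFINE_PER_CPU(struct iova_cpu_cache, iova_caches);

    static unsigned long cached_alloc_iova(struct iova_domain *iovad,
                                           unsigned long size)
    {
            struct iova_cpu_cache *c = get_cpu_ptr(&iova_caches);
            unsigned long pfn = 0;

            if (c->count)
                    pfn = c->pfns[--c->count];  /* fast path: CPU-local */
            put_cpu_ptr(&iova_caches);
            if (!pfn)
                    pfn = rbtree_alloc_iova(iovad, size);  /* slow path */
            return pfn;
    }

The common point of both sketches: the hot path touches only CPU-local
state, and the shared structure (the hardware invalidation interface or
the rbtree) is reached only when a per-CPU buffer fills or drains.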

The paper "Utilizing the IOMMU Scalably" (presented at the 2015 USENIX
Annual Technical Conference) contains many more details and experiments:

  https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf


Omer Peleg (7):
  iommu: refactoring of deferred flush entries
  iommu: per-cpu deferred invalidation queues
  iommu: correct flush_unmaps pfn usage
  iommu: only unmap mapped entries
  iommu: avoid dev iotlb logic in intel-iommu for domains with no dev
    iotlbs
  iommu: change intel-iommu to use IOVA frame numbers
  iommu: introduce per-cpu caching to iova allocation

 drivers/iommu/intel-iommu.c | 264 +++++++++++++++++++++-------------
 drivers/iommu/iova.c        | 334 +++++++++++++++++++++++++++++++++++++-------
 include/linux/iova.h        |  23 ++-
 3 files changed, 470 insertions(+), 151 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/7] Intel IOMMU scalability improvements
       [not found] ` <20151228161421.GA27829-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
@ 2016-01-04 17:35   ` Joerg Roedel
  2016-03-15 18:00   ` Benjamin Serebrin via iommu
  1 sibling, 0 replies; 5+ messages in thread
From: Joerg Roedel @ 2016-01-04 17:35 UTC (permalink / raw)
  To: Adam Morrison
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/

Hi Adam,

On Mon, Dec 28, 2015 at 06:14:21PM +0200, Adam Morrison wrote:
> This patchset improves the scalability of the Intel IOMMU code by
> resolving two spinlock bottlenecks, yielding up to ~10x performance
> improvement and approaching iommu=off performance.
> 
> For example, here's the throughput obtained by 16 memcached instances
> running on a 16-core Sandy Bridge system, accessed using memslap on
> another machine that has iommu=off, using the default memslap config
> (64-byte keys, 1024-byte values, and 10%/90% SET/GET ops):
> 
>     stock iommu=off:
>        1,088,996 memcached transactions/sec (=100%, median of 10 runs).
>     stock iommu=on:
>        123,760 memcached transactions/sec (=11%).
>        [perf: 43.56%    0.86%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]
>     patched iommu=on:
>        1,067,586 memcached transactions/sec (=98%).
>        [perf: 0.75%     0.75%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]
>

Thanks for the patches, the results look pretty impressive. I'll have a
closer look at your changes this week.



	Joerg

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/7] Intel IOMMU scalability improvements
       [not found] ` <20151228161421.GA27829-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  2016-01-04 17:35   ` Joerg Roedel
@ 2016-03-15 18:00   ` Benjamin Serebrin via iommu
       [not found]     ` <CAN+hb0Xt21CMmM7uE0rzjf5p9w-W+5y8at4v1J8+pYd8tamLpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 5+ messages in thread
From: Benjamin Serebrin via iommu @ 2016-03-15 18:00 UTC (permalink / raw)
  To: Adam Morrison
  Cc: Omer Peleg, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Dan Tsafrir, David Woodhouse, Benjamin Serebrin



These are nice.  Thanks very much for doing this work!

We have some preliminary results looking at scaling to high core counts.
We tested the patches on a 2-socket, high-core-count SNB-EP server with a
Broadcom NIC.  Our benchmark uses 200 threads of TCP_RR.  With this
patchset applied, we see similar performance with the IOMMU enabled as
with it disabled, which is good news.  We're working on getting a lab set
up with Haswell servers so we can further test the scalability of the code.
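
For context, a TCP_RR load of this shape is typically generated with
netperf; our exact harness isn't shown here, but an illustrative form,
with <server> standing in for the system under test, is:

    # 200 concurrent request/response streams
    for i in $(seq 200); do netperf -H <server> -t TCP_RR -l 60 & done
    wait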

We still owe the scaling results and, of course, actual code reviews.

On Mon, Dec 28, 2015 at 8:14 AM, Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
wrote:

> This patchset improves the scalability of the Intel IOMMU code by
> resolving two spinlock bottlenecks, yielding up to ~10x performance
> improvement and approaching iommu=off performance.
>
> For example, here's the throughput obtained by 16 memcached instances
> running on a 16-core Sandy Bridge system, accessed using memslap on
> another machine that has iommu=off, using the default memslap config
> (64-byte keys, 1024-byte values, and 10%/90% SET/GET ops):
>
>     stock iommu=off:
>        1,088,996 memcached transactions/sec (=100%, median of 10 runs).
>     stock iommu=on:
>        123,760 memcached transactions/sec (=11%).
>        [perf: 43.56%    0.86%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]
>     patched iommu=on:
>        1,067,586 memcached transactions/sec (=98%).
>        [perf: 0.75%     0.75%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]
>
> The two resolved spinlocks:
>
>  - Deferred IOTLB invalidations are batched in a global data structure
>    and serialized under a spinlock (add_unmap() & flush_unmaps()); this
>    patchset batches IOTLB invalidations in a per-CPU data structure.
>
>  - IOVA management (alloc_iova() & __free_iova()) is serialized under
>    the rbtree spinlock; this patchset adds per-CPU caches of allocated
>    IOVAs so that the rbtree doesn't get accessed frequently. (Adding a
>    cache above the existing IOVA allocator is less intrusive than dynamic
>    identity mapping and helps keep IOMMU page table usage low; see
>    Patch 7.)
>
> The paper "Utilizing the IOMMU Scalably" (presented at the 2015 USENIX
> Annual Technical Conference) contains many more details and experiments:
>
> https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf
>
>
> Omer Peleg (7):
>   iommu: refactoring of deferred flush entries
>   iommu: per-cpu deferred invalidation queues
>   iommu: correct flush_unmaps pfn usage
>   iommu: only unmap mapped entries
>   iommu: avoid dev iotlb logic in intel-iommu for domains with no dev
>     iotlbs
>   iommu: change intel-iommu to use IOVA frame numbers
>   iommu: introduce per-cpu caching to iova allocation
>
>  drivers/iommu/intel-iommu.c | 264 +++++++++++++++++++++-------------
>  drivers/iommu/iova.c        | 334 +++++++++++++++++++++++++++++++++++++-------
>  include/linux/iova.h        |  23 ++-
>  3 files changed, 470 insertions(+), 151 deletions(-)
>
> --
> 1.9.1
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/7] Intel IOMMU scalability improvements
  2015-12-28 16:14 [PATCH 0/7] Intel IOMMU scalability improvements Adam Morrison
       [not found] ` <20151228161421.GA27829-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
@ 2016-03-15 18:09 ` Ben Serebrin via iommu
  1 sibling, 0 replies; 5+ messages in thread
From: Ben Serebrin via iommu @ 2016-03-15 18:09 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

These are nice.  Thanks very much for doing this work!

We have some preliminary results looking at scaling to high core counts.
We tested the patches on a 2-socket, high-core-count SNB-EP server with a
Broadcom NIC.  Our benchmark uses 200 threads of TCP_RR.  With this
patchset applied, we see similar performance with the IOMMU enabled as
with it disabled, which is good news.  We're working on getting a lab set
up with Haswell servers so we can further test the scalability of the code.

We still owe the scaling results and, of course, actual code reviews.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/7] Intel IOMMU scalability improvements
       [not found]     ` <CAN+hb0Xt21CMmM7uE0rzjf5p9w-W+5y8at4v1J8+pYd8tamLpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-04-05  9:22       ` Joerg Roedel
  0 siblings, 0 replies; 5+ messages in thread
From: Joerg Roedel @ 2016-04-05  9:22 UTC (permalink / raw)
  To: Benjamin Serebrin
  Cc: Omer Peleg, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Dan Tsafrir, Adam Morrison, David Woodhouse

On Tue, Mar 15, 2016 at 11:00:34AM -0700, Benjamin Serebrin via iommu wrote:
> These are nice.  Thanks very much for doing this work!
> 
> We have some preliminary results looking at scaling to high core counts.  We
> tested the patches on a 2-socket, high-core-count SNB-EP server with a Broadcom
> NIC.  Our benchmark uses 200 threads of TCP_RR.  With this patchset applied, we
> see similar performance with the IOMMU enabled as with it disabled, which is
> good news.  We're working on getting a lab set up with Haswell servers so we
> can further test the scalability of the code.
> 
> We still owe the scaling results and, of course, actual code reviews.

This sounds very promising. Could you guys rebase the patches to a
recent upstream kernel and repost together with your performance
results? I'd really like to see that merged.


	Joerg

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-04-05  9:22 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-28 16:14 [PATCH 0/7] Intel IOMMU scalability improvements Adam Morrison
     [not found] ` <20151228161421.GA27829-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
2016-01-04 17:35   ` Joerg Roedel
2016-03-15 18:00   ` Benjamin Serebrin via iommu
     [not found]     ` <CAN+hb0Xt21CMmM7uE0rzjf5p9w-W+5y8at4v1J8+pYd8tamLpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-04-05  9:22       ` Joerg Roedel
2016-03-15 18:09 ` Ben Serebrin via iommu

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).