From mboxrd@z Thu Jan  1 00:00:00 1970
From: Adam Morrison
Subject: [PATCH 0/7] Intel IOMMU scalability improvements
Date: Mon, 28 Dec 2015 18:14:21 +0200
Message-ID: <20151228161421.GA27829@cs.technion.ac.il>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Sender: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Errors-To: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
 iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
 dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org,
 omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org
List-Id: iommu@lists.linux-foundation.org

This patchset improves the scalability of the Intel IOMMU code by
resolving two spinlock bottlenecks, yielding up to ~10x performance
improvement and approaching iommu=off performance.

For example, here's the throughput obtained by 16 memcached instances
running on a 16-core Sandy Bridge system, accessed using memslap on
another machine that has iommu=off, using the default memslap config
(64-byte keys, 1024-byte values, and 10%/90% SET/GET ops):

    stock iommu=off:
       1,088,996 memcached transactions/sec (=100%, median of 10 runs).
    stock iommu=on:
       123,760 memcached transactions/sec (=11%).
       [perf: 43.56%  0.86%  memcached  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave]
    patched iommu=on:
       1,067,586 memcached transactions/sec (=98%).
       [perf: 0.75%  0.75%  memcached  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave]

The two resolved spinlocks:

- Deferred IOTLB invalidations are batched in a global data structure
  and serialized under a spinlock (add_unmap() & flush_unmaps()); this
  patchset batches IOTLB invalidations in a per-CPU data structure.

- IOVA management (alloc_iova() & __free_iova()) is serialized under
  the rbtree spinlock; this patchset adds per-CPU caches of allocated
  IOVAs so that the rbtree doesn't get accessed frequently. (Adding a
  cache above the existing IOVA allocator is less intrusive than dynamic
  identity mapping and helps keep IOMMU page table usage low; see
  Patch 7.)

The paper "Utilizing the IOMMU Scalably" (presented at the 2015 USENIX
Annual Technical Conference) contains many more details and experiments:

  https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf

Omer Peleg (7):
  iommu: refactoring of deferred flush entries
  iommu: per-cpu deferred invalidation queues
  iommu: correct flush_unmaps pfn usage
  iommu: only unmap mapped entries
  iommu: avoid dev iotlb logic in intel-iommu for domains with no dev iotlbs
  iommu: change intel-iommu to use IOVA frame numbers
  iommu: introduce per-cpu caching to iova allocation

 drivers/iommu/intel-iommu.c | 264 +++++++++++++++++++++-------------
 drivers/iommu/iova.c        | 334 +++++++++++++++++++++++++++++++++++++-------
 include/linux/iova.h        |  23 ++-
 3 files changed, 470 insertions(+), 151 deletions(-)

--
1.9.1
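
Illustration of the per-CPU IOVA caching idea described above. This is a
minimal user-space sketch, not the code from Patch 7: every name in it
(pool_init(), cached_alloc(), cached_free(), lock_acquisitions(), NCPUS,
CACHE_SIZE, POOL_SIZE) is hypothetical, a flat array stands in for the
rbtree, a pthread mutex stands in for the spinlock, and the CPU index is
passed explicitly instead of being read from the running CPU. It only
shows why a small cache in front of a lock-protected allocator keeps the
lock off the fast path.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NCPUS      4   /* hypothetical CPU count */
#define CACHE_SIZE 8   /* hypothetical per-CPU cache capacity */
#define POOL_SIZE  64  /* hypothetical number of IOVA page frames */

/* Global allocator: stands in for the rbtree guarded by a single lock. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_stack[POOL_SIZE];
static size_t free_top;
static unsigned long global_lock_count; /* how often the lock was taken */

/* One small cache of recently freed frames per CPU. The sketch assumes
 * that each cache is only ever touched by its own CPU, so it needs no lock. */
struct cpu_cache {
    unsigned long pfn[CACHE_SIZE];
    size_t count;
};
static struct cpu_cache caches[NCPUS];

/* Reset the pool to frames 1..POOL_SIZE; frame 0 is reserved as "failure". */
void pool_init(void)
{
    free_top = 0;
    for (unsigned long pfn = POOL_SIZE; pfn > 0; pfn--)
        free_stack[free_top++] = pfn;
    global_lock_count = 0;
    for (int cpu = 0; cpu < NCPUS; cpu++)
        caches[cpu].count = 0;
}

/* Slow path: take the global lock and pop a frame (0 if exhausted). */
static unsigned long global_alloc(void)
{
    unsigned long pfn = 0;

    pthread_mutex_lock(&global_lock);
    global_lock_count++;
    if (free_top > 0)
        pfn = free_stack[--free_top];
    pthread_mutex_unlock(&global_lock);
    return pfn;
}

/* Slow path: take the global lock and push a frame back. */
static void global_free(unsigned long pfn)
{
    pthread_mutex_lock(&global_lock);
    global_lock_count++;
    assert(free_top < POOL_SIZE);
    free_stack[free_top++] = pfn;
    pthread_mutex_unlock(&global_lock);
}

/* Fast path first: reuse a frame cached on this CPU, else fall back. */
unsigned long cached_alloc(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    struct cpu_cache *c = &caches[cpu];

    if (c->count > 0)
        return c->pfn[--c->count];
    return global_alloc();
}

/* Fast path first: park the frame in this CPU's cache if there is room. */
void cached_free(int cpu, unsigned long pfn)
{
    assert(cpu >= 0 && cpu < NCPUS);
    struct cpu_cache *c = &caches[cpu];

    if (c->count < CACHE_SIZE) {
        c->pfn[c->count++] = pfn;
        return;
    }
    global_free(pfn);
}

/* Number of global lock acquisitions since pool_init(). */
unsigned long lock_acquisitions(void)
{
    return global_lock_count;
}
```

In this sketch a map/unmap loop that stays on one CPU takes the global
lock only on its first allocation; later alloc/free pairs are served from
that CPU's cache. A real implementation additionally has to deal with
ranges of different sizes, preemption, and draining the caches when the
global allocator runs dry, none of which is modelled here.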