From mboxrd@z Thu Jan 1 00:00:00 1970 From: Shaohua Li Subject: Re: [PATCH V2 0/4]IOMMU: avoid lock contention in iova allocation Date: Thu, 21 Jan 2016 16:18:10 -0800 Message-ID: <20160122001801.GA4114061@devbig084.prn1.facebook.com> References: <20160110225610.GA16778@cs.technion.ac.il> <20160111033749.GA1811285@devbig084.prn1.facebook.com> <20160120122103.GE18805@8bytes.org> <20160120180956.GA230160@devbig084.prn1.facebook.com> <20160120211421.GA684235@devbig084.prn1.facebook.com> <20160121231326.GA16905@cs.technion.ac.il> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Content-Disposition: inline In-Reply-To: <20160121231326.GA16905-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org Errors-To: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org To: Adam Morrison Cc: Kernel-team-b10kYP2dOMg@public.gmane.org, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org List-Id: iommu@lists.linux-foundation.org On Fri, Jan 22, 2016 at 01:14:56AM +0200, Adam Morrison wrote: > On Wed, Jan 20, 2016 at 01:14:35PM -0800, Shaohua Li wrote: > > > > My understanding from the above is that the only issue with our > > > patchset was not dealing with pfn_limit. I can just fix that and > > > repost, sounds good? > > > > Sure, please do it. For the patches, I'm not comformatable about the > > per-cpu deferred invalidation. One important benefit of IOMMU is > > isolation. Deferred invalidation already loose the isolation, per-cpu > > invalidation loose further. It would be better we can flush all per-cpu > > invalidation entries if one cpu hits its per-cpu limit. Also you'd > > better look at CPU hotplug. We don't want to lose the invalidation > > entries if one cpu is hot removed. > > I'll look into these. > > > The per-cpu iova implementation looks unnecessary complicated. I know > > you are referring the paper, but the whole point is batch > > allocation/free. > > Batched allocation/free isn't enough. It still creates spinlock > contention, even if there is per-cpu invalidation (that gets rid of > async_umap_flush_lock). Here are sample results from our memcached > test (throughput of querying 16 memcached instances on a 16-core box > with an Intel XL710 NIC): > > batched alloc/free, iommu=on: > 313,161 memcached transactions/sec (= 29% of iommu=off) > > batched alloc/free + per-cpu invalidations, iommu=on: > 434,590 memcached transactions/sec (= 40% of iommu=off) > > perf report: > 61.15% 0.33% swapper [kernel.kallsyms] [k] _raw_spin_lock_irqsave > | > ---_raw_spin_lock_irqsave > | > |--87.81%-- free_iova_array > |--11.71%-- alloc_iova > > In contrast, the per-cpu magazine cache in our patchset enables iova > allocation/free to complete without accessing the iova allocator at > all. So we don't touch the rbtree spinlock, and also complete iova > allocation in constant time, which avoids the linear-time allocations > that the iova allocator suffers from. (These were described in the > paper "Efficient intra-operating system protection against harmful > DMAs", presented at the USENIX FAST 2015 conference.) The end result: > > magazines cache + per-cpu invalidations, iommu=on: > 1,067,586 memcached transactions/sec (= 98% of iommu=off) Yes, I know a typical per-cpu cache algorithm will free entry to the per-cpu cache first. I didn't do it because it's not worthy in my test. We definitely should do it if it's worthy though. My point is we don't need implement the per-cpu that complicated. It could be simply: alloc: - refill per-cpu cache if it's empty - alloc from per-cpu cache free - free to per-cpu cache - if per-cpu cache hits threshold, free to global And please don't cache too much. The DMA address below 4G is still precious. Thanks, Shaohua