From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 14 Apr 2017 12:10:27 +0200
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Andrew Morton
Cc: Mel Gorman, willy@infradead.org, peterz@infradead.org,
	pagupta@redhat.com, ttoukan.linux@gmail.com, tariqt@mellanox.com,
	netdev@vger.kernel.org, saeedm@mellanox.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, brouer@redhat.com
Subject: Re: [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator
Message-ID: <20170414121027.079e5a4c@redhat.com>
In-Reply-To: <20170410142616.6d37a11904dd153298cf7f3b@linux-foundation.org>
References: <20170410150821.vcjlz7ntabtfsumm@techsingularity.net>
	<20170410142616.6d37a11904dd153298cf7f3b@linux-foundation.org>

On Mon, 10 Apr 2017 14:26:16 -0700 Andrew Morton wrote:

> On Mon, 10 Apr 2017 16:08:21 +0100 Mel Gorman wrote:
> 
> > IRQ context was excluded from using the Per-Cpu-Pages (PCP) lists caching
> > of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use per-cpu
> > allocator for irq-safe requests").
> > 
> > This unfortunately also excluded SoftIRQ. This hurt performance for the
> > use-case of refilling DMA RX rings in softirq context.
> 
> Out of curiosity: by how much did it "hurt"?
> 
> > Tariq found:
> 
> : I disabled the page-cache (recycle) mechanism to stress the page
> : allocator, and see a drastic degradation in BW, from 47.5 G in v4.10 to
> : 31.4 G in v4.11-rc1 (34% drop).

I've tried to reproduce this in my home testlab, using ConnectX-4 dual
100Gbit/s NICs. Hardware limits mean that I cannot reach 100Gbit/s once
a memory copy is performed. (Word of warning: you need PCIe Gen3 x16
(which I do have) to handle 100Gbit/s, and the memory bandwidth of the
system also needs to sustain something like 2x 12500 MBytes/s, which is
where my system failed.)

The mlx5 driver has a driver-local page recycler, which I can see fail
between 29% and 38% of the time with 8 parallel netperf TCP_STREAMs. I
speculate that adding more streams will make it fail more often. To
factor out the driver recycler, I simply disabled it (as I believe
Tariq also did).

With the mlx5 recycler disabled, 8 parallel netperf TCP_STREAMs:

 Baseline v4.10.0  : 60316 Mbit/s
 Current 4.11.0-rc6: 47491 Mbit/s
 This patch        : 60662 Mbit/s

While this patch does "fix" the performance regression, it does not
bring any noticeable improvement over the baseline (as my micro-bench
also indicated), thus I feel our previous optimization is almost
nullified. (p.s. It does feel wrong to argue against my own patch ;-))
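For completeness, the parallel streams can be generated with something
along these lines (a sketch, not my exact invocation; the DUT address
and runtime are placeholders):

  # launch 8 concurrent netperf TCP_STREAM tests, 60 seconds each,
  # against the device under test, then wait for all to finish
  for i in $(seq 1 8); do
      netperf -H 198.18.0.1 -t TCP_STREAM -l 60 &
  done
  wait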
The reason for the current 4.11.0-rc6 regression is lock contention on
the (per-NUMA) page allocator lock; perf report shows we spend 34.92%
in queued_spin_lock_slowpath (compared to the #2 entry, a copy cost of
13.81% in copy_user_enhanced_fast_string).

> then with this patch he found
> 
> : It looks very good! I get line-rate (94Gbits/sec) with 8 streams, in
> : comparison to less than 55Gbits/sec before.
> 
> Can I take this to mean that the page allocator's per-cpu-pages feature
> ended up doubling the performance of this driver? Better than the
> driver's private page recycling? I'd like to believe that, but am
> having trouble doing so ;)

I would not conclude that. I'm also very suspicious about such big
performance "jumps". Tariq should also benchmark with v4.10 and a
disabled mlx5-recycler, as I believe the results should be the same as
after this patch.

That said, it is possible to see a regression this large when all the
CPUs are contending on the page allocator lock. AFAIK Tariq also
mentioned seeing 60% spent on the lock, which would confirm this
theory.

> > This patch re-allows softirq context, which should be safe by
> > disabling BH/softirq while accessing the list. PCP-list access from
> > both hard-IRQ and NMI context must not be allowed. Peter Zijlstra
> > says in_nmi() code never accesses the page allocator, thus it should
> > be sufficient to only test for !in_irq().
> > 
> > One concern with this change is adding a BH (enable) scheduling point
> > at both PCP alloc and free. If further concerns are highlighted by
> > this patch, the result will be to revert 374ad05ab64d and try again
> > at a later date to offset the irq enable/disable overhead.

(For readers unfamiliar with the PCP code, a rough sketch of the
alloc-side pattern described above is appended after my signature.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
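The sketch below illustrates the pattern only; it is NOT the actual
patch, and buddy_alloc()/pcp_list_alloc() are hypothetical stand-ins
for the real fallback and PCP fast paths:

  #include <linux/interrupt.h>	/* local_bh_disable/enable, in_irq() */
  #include <linux/gfp.h>

  /* Sketch of an order-0 allocation honoring the rules above. */
  static struct page *order0_alloc_sketch(gfp_t gfp_mask)
  {
	struct page *page;

	/*
	 * local_bh_disable() only keeps softirq off this CPU; it does
	 * not protect against hard-IRQ context, so hard-IRQ callers
	 * must take the fallback path. Per the changelog, in_nmi()
	 * code never reaches the page allocator, so testing in_irq()
	 * alone is deemed sufficient.
	 */
	if (in_irq())
		return buddy_alloc(gfp_mask);		/* hypothetical */

	local_bh_disable();
	page = pcp_list_alloc(gfp_mask);		/* hypothetical */
	local_bh_enable();	/* the BH (enable) scheduling point */

	return page;
  }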