From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id C745E6B0003 for ; Thu, 1 Mar 2018 01:27:51 -0500 (EST) Received: by mail-pf0-f199.google.com with SMTP id x7so2905753pfd.19 for ; Wed, 28 Feb 2018 22:27:51 -0800 (PST) Received: from mga12.intel.com (mga12.intel.com. [192.55.52.136]) by mx.google.com with ESMTPS id b60-v6si2536498plc.830.2018.02.28.22.27.50 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 28 Feb 2018 22:27:50 -0800 (PST) From: Aaron Lu Subject: [PATCH v4 0/3] mm: improve zone->lock scalability Date: Thu, 1 Mar 2018 14:28:42 +0800 Message-Id: <20180301062845.26038-1-aaron.lu@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Patch 1/3 is a small cleanup suggested by Matthew Wilcox which doesn't affect scalability or performance; Patch 2/3 moves some code in free_pcppages_bulk() outside of zone->lock and has Mel Gorman's ack; Patch 3/3 uses prefetch in free_pcppages_bulk() outside of zone->lock to speedup page merging under zone->lock but Mel Gorman has some concerns. For details, please see their changelogs. v4: Address David Rientjes' comments to not update pcp->count in front of free_pcppages_bulk() in patch 1/3. Reword code comments in patch 2/3 as suggested by David Rientjes. v3: Added patch 1/3 to update pcp->count inside of free_pcppages_bulk(); Rebase to v4.16-rc2. v2 & v1: https://lkml.org/lkml/2018/1/23/879 Aaron Lu (3): mm/free_pcppages_bulk: update pcp->count inside mm/free_pcppages_bulk: do not hold lock when picking pages to free mm/free_pcppages_bulk: prefetch buddy while not holding lock mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 40 insertions(+), 22 deletions(-) -- 2.14.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 58F1C6B0005 for ; Thu, 1 Mar 2018 01:27:53 -0500 (EST) Received: by mail-pf0-f200.google.com with SMTP id e1so2893858pfi.10 for ; Wed, 28 Feb 2018 22:27:53 -0800 (PST) Received: from mga12.intel.com (mga12.intel.com. [192.55.52.136]) by mx.google.com with ESMTPS id b60-v6si2536498plc.830.2018.02.28.22.27.51 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 28 Feb 2018 22:27:52 -0800 (PST) From: Aaron Lu Subject: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside Date: Thu, 1 Mar 2018 14:28:43 +0800 Message-Id: <20180301062845.26038-2-aaron.lu@intel.com> In-Reply-To: <20180301062845.26038-1-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Matthew Wilcox found that all callers of free_pcppages_bulk() currently update pcp->count immediately after so it's natural to do it inside free_pcppages_bulk(). No functionality or performance change is expected from this patch. Suggested-by: Matthew Wilcox Signed-off-by: Aaron Lu --- mm/page_alloc.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cb416723538f..faa33eac1635 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, page = list_last_entry(list, struct page, lru); /* must delete as __free_one_page list manipulates */ list_del(&page->lru); + pcp->count--; mt = get_pcppage_migratetype(page); /* MIGRATE_ISOLATE page should not go to pcplists */ @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) local_irq_save(flags); batch = READ_ONCE(pcp->batch); to_drain = min(pcp->count, batch); - if (to_drain > 0) { + if (to_drain > 0) free_pcppages_bulk(zone, to_drain, pcp); - pcp->count -= to_drain; - } local_irq_restore(flags); } #endif @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) pset = per_cpu_ptr(zone->pageset, cpu); pcp = &pset->pcp; - if (pcp->count) { + if (pcp->count) free_pcppages_bulk(zone, pcp->count, pcp); - pcp->count = 0; - } local_irq_restore(flags); } @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) if (pcp->count >= pcp->high) { unsigned long batch = READ_ONCE(pcp->batch); free_pcppages_bulk(zone, batch, pcp); - pcp->count -= batch; } } -- 2.14.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 035BB6B0006 for ; Thu, 1 Mar 2018 01:27:56 -0500 (EST) Received: by mail-pf0-f200.google.com with SMTP id g66so2897523pfj.11 for ; Wed, 28 Feb 2018 22:27:55 -0800 (PST) Received: from mga12.intel.com (mga12.intel.com. [192.55.52.136]) by mx.google.com with ESMTPS id b60-v6si2536498plc.830.2018.02.28.22.27.54 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 28 Feb 2018 22:27:54 -0800 (PST) From: Aaron Lu Subject: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Date: Thu, 1 Mar 2018 14:28:44 +0800 Message-Id: <20180301062845.26038-3-aaron.lu@intel.com> In-Reply-To: <20180301062845.26038-1-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, the zone->lock is held and then pages are chosen from PCP's migratetype list. While there is actually no need to do this 'choose part' under lock since it's PCP pages, the only CPU that can touch them is us and irq is also disabled. Moving this part outside could reduce lock held time and improve performance. Test with will-it-scale/page_fault1 full load: kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) v4.16-rc2+ 9034215 7971818 13667135 15677465 this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% What the test does is: starts $nr_cpu processes and each will repeatedly do the following for 5 minutes: 1 mmap 128M anonymouse space; 2 write access to that space; 3 munmap. The score is the aggregated iteration. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Acked-by: Mel Gorman Signed-off-by: Aaron Lu --- mm/page_alloc.c | 39 +++++++++++++++++++++++---------------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index faa33eac1635..dafdcdec9c1f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count, int migratetype = 0; int batch_free = 0; bool isolated_pageblocks; - - spin_lock(&zone->lock); - isolated_pageblocks = has_isolate_pageblock(zone); + struct page *page, *tmp; + LIST_HEAD(head); while (count) { - struct page *page; struct list_head *list; /* @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count, batch_free = count; do { - int mt; /* migratetype of the to-be-freed page */ - page = list_last_entry(list, struct page, lru); - /* must delete as __free_one_page list manipulates */ + /* must delete to avoid corrupting pcp list */ list_del(&page->lru); pcp->count--; - mt = get_pcppage_migratetype(page); - /* MIGRATE_ISOLATE page should not go to pcplists */ - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); - /* Pageblock could have been isolated meanwhile */ - if (unlikely(isolated_pageblocks)) - mt = get_pageblock_migratetype(page); - if (bulkfree_pcp_prepare(page)) continue; - __free_one_page(page, page_to_pfn(page), zone, 0, mt); - trace_mm_page_pcpu_drain(page, 0, mt); + list_add_tail(&page->lru, &head); } while (--count && --batch_free && !list_empty(list)); } + + spin_lock(&zone->lock); + isolated_pageblocks = has_isolate_pageblock(zone); + + /* + * Use safe version since after __free_one_page(), + * page->lru.next will not point to original list. + */ + list_for_each_entry_safe(page, tmp, &head, lru) { + int mt = get_pcppage_migratetype(page); + /* MIGRATE_ISOLATE page should not go to pcplists */ + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); + /* Pageblock could have been isolated meanwhile */ + if (unlikely(isolated_pageblocks)) + mt = get_pageblock_migratetype(page); + + __free_one_page(page, page_to_pfn(page), zone, 0, mt); + trace_mm_page_pcpu_drain(page, 0, mt); + } spin_unlock(&zone->lock); } -- 2.14.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id DBB696B0007 for ; Thu, 1 Mar 2018 01:27:57 -0500 (EST) Received: by mail-pg0-f69.google.com with SMTP id l14so2198177pgn.21 for ; Wed, 28 Feb 2018 22:27:57 -0800 (PST) Received: from mga12.intel.com (mga12.intel.com. [192.55.52.136]) by mx.google.com with ESMTPS id b60-v6si2536498plc.830.2018.02.28.22.27.56 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 28 Feb 2018 22:27:56 -0800 (PST) From: Aaron Lu Subject: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Date: Thu, 1 Mar 2018 14:28:45 +0800 Message-Id: <20180301062845.26038-4-aaron.lu@intel.com> In-Reply-To: <20180301062845.26038-1-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes When a page is freed back to the global pool, its buddy will be checked to see if it's possible to do a merge. This requires accessing buddy's page structure and that access could take a long time if it's cache cold. This patch adds a prefetch to the to-be-freed page's buddy outside of zone->lock in hope of accessing buddy's page structure later under zone->lock will be faster. Since we *always* do buddy merging and check an order-0 page's buddy to try to merge it when it goes into the main allocator, the cacheline will always come in, i.e. the prefetched data will never be unused. In the meantime, there are two concerns: 1 the prefetch could potentially evict existing cachelines, especially for L1D cache since it is not huge; 2 there is some additional instruction overhead, namely calculating buddy pfn twice. For 1, it's hard to say, this microbenchmark though shows good result but the actual benefit of this patch will be workload/CPU dependant; For 2, since the calculation is a XOR on two local variables, it's expected in many cases that cycles spent will be offset by reduced memory latency later. This is especially true for NUMA machines where multiple CPUs are contending on zone->lock and the most time consuming part under zone->lock is the wait of 'struct page' cacheline of the to-be-freed pages and their buddies. Test with will-it-scale/page_fault1 full load: kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) v4.16-rc2+ 9034215 7971818 13667135 15677465 patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% Note: this patch's performance improvement percent is against patch2/3. [changelog stole from Dave Hansen and Mel Gorman's comments] https://lkml.org/lkml/2018/1/24/551 Suggested-by: Ying Huang Signed-off-by: Aaron Lu --- mm/page_alloc.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dafdcdec9c1f..1d838041931e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count, batch_free = count; do { + unsigned long pfn, buddy_pfn; + struct page *buddy; + page = list_last_entry(list, struct page, lru); /* must delete to avoid corrupting pcp list */ list_del(&page->lru); @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, continue; list_add_tail(&page->lru, &head); + + /* + * We are going to put the page back to the global + * pool, prefetch its buddy to speed up later access + * under zone->lock. It is believed the overhead of + * calculating buddy_pfn here can be offset by reduced + * memory latency later. + */ + pfn = page_to_pfn(page); + buddy_pfn = __find_buddy_pfn(pfn, 0); + buddy = page + (buddy_pfn - pfn); + prefetch(buddy); } while (--count && --batch_free && !list_empty(list)); } -- 2.14.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f198.google.com (mail-io0-f198.google.com [209.85.223.198]) by kanga.kvack.org (Postfix) with ESMTP id 5BAA26B0006 for ; Thu, 1 Mar 2018 07:11:15 -0500 (EST) Received: by mail-io0-f198.google.com with SMTP id m19so5637641iob.13 for ; Thu, 01 Mar 2018 04:11:15 -0800 (PST) Received: from mail-sor-f41.google.com (mail-sor-f41.google.com. [209.85.220.41]) by mx.google.com with SMTPS id 124sor2303198iow.118.2018.03.01.04.11.14 for (Google Transport Security); Thu, 01 Mar 2018 04:11:14 -0800 (PST) Date: Thu, 1 Mar 2018 04:11:11 -0800 (PST) From: David Rientjes Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside In-Reply-To: <20180301062845.26038-2-aaron.lu@intel.com> Message-ID: References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-2-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox On Thu, 1 Mar 2018, Aaron Lu wrote: > Matthew Wilcox found that all callers of free_pcppages_bulk() currently > update pcp->count immediately after so it's natural to do it inside > free_pcppages_bulk(). > > No functionality or performance change is expected from this patch. > > Suggested-by: Matthew Wilcox > Signed-off-by: Aaron Lu Acked-by: David Rientjes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 6DC8B6B0005 for ; Thu, 1 Mar 2018 08:45:15 -0500 (EST) Received: by mail-wr0-f198.google.com with SMTP id r15so4113908wrr.16 for ; Thu, 01 Mar 2018 05:45:15 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id o13si2606262wmf.89.2018.03.01.05.45.13 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 01 Mar 2018 05:45:13 -0800 (PST) Date: Thu, 1 Mar 2018 14:45:12 +0100 From: Michal Hocko Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside Message-ID: <20180301134512.GI15057@dhcp22.suse.cz> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-2-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301062845.26038-2-aaron.lu@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu 01-03-18 14:28:43, Aaron Lu wrote: > Matthew Wilcox found that all callers of free_pcppages_bulk() currently > update pcp->count immediately after so it's natural to do it inside > free_pcppages_bulk(). > > No functionality or performance change is expected from this patch. > > Suggested-by: Matthew Wilcox > Signed-off-by: Aaron Lu Makes a lot of sense to me. Acked-by: Michal Hocko > --- > mm/page_alloc.c | 10 +++------- > 1 file changed, 3 insertions(+), 7 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index cb416723538f..faa33eac1635 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, > page = list_last_entry(list, struct page, lru); > /* must delete as __free_one_page list manipulates */ > list_del(&page->lru); > + pcp->count--; > > mt = get_pcppage_migratetype(page); > /* MIGRATE_ISOLATE page should not go to pcplists */ > @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) > local_irq_save(flags); > batch = READ_ONCE(pcp->batch); > to_drain = min(pcp->count, batch); > - if (to_drain > 0) { > + if (to_drain > 0) > free_pcppages_bulk(zone, to_drain, pcp); > - pcp->count -= to_drain; > - } > local_irq_restore(flags); > } > #endif > @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) > pset = per_cpu_ptr(zone->pageset, cpu); > > pcp = &pset->pcp; > - if (pcp->count) { > + if (pcp->count) > free_pcppages_bulk(zone, pcp->count, pcp); > - pcp->count = 0; > - } > local_irq_restore(flags); > } > > @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) > if (pcp->count >= pcp->high) { > unsigned long batch = READ_ONCE(pcp->batch); > free_pcppages_bulk(zone, batch, pcp); > - pcp->count -= batch; > } > } > > -- > 2.14.3 > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id 881206B0007 for ; Thu, 1 Mar 2018 08:55:21 -0500 (EST) Received: by mail-wm0-f70.google.com with SMTP id v64so3314986wma.4 for ; Thu, 01 Mar 2018 05:55:21 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id n4si2542127wmh.9.2018.03.01.05.55.20 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 01 Mar 2018 05:55:20 -0800 (PST) Date: Thu, 1 Mar 2018 14:55:18 +0100 From: Michal Hocko Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180301135518.GJ15057@dhcp22.suse.cz> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301062845.26038-3-aaron.lu@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu 01-03-18 14:28:44, Aaron Lu wrote: > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > the zone->lock is held and then pages are chosen from PCP's migratetype > list. While there is actually no need to do this 'choose part' under > lock since it's PCP pages, the only CPU that can touch them is us and > irq is also disabled. > > Moving this part outside could reduce lock held time and improve > performance. Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > What the test does is: starts $nr_cpu processes and each will repeatedly > do the following for 5 minutes: > 1 mmap 128M anonymouse space; > 2 write access to that space; > 3 munmap. > The score is the aggregated iteration. Iteration count I assume. I am still quite surprised that this would have such a large impact. > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c > > Acked-by: Mel Gorman > Signed-off-by: Aaron Lu The patch makes sense to me Acked-by: Michal Hocko > --- > mm/page_alloc.c | 39 +++++++++++++++++++++++---------------- > 1 file changed, 23 insertions(+), 16 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index faa33eac1635..dafdcdec9c1f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count, > int migratetype = 0; > int batch_free = 0; > bool isolated_pageblocks; > - > - spin_lock(&zone->lock); > - isolated_pageblocks = has_isolate_pageblock(zone); > + struct page *page, *tmp; > + LIST_HEAD(head); > > while (count) { > - struct page *page; > struct list_head *list; > > /* > @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count, > batch_free = count; > > do { > - int mt; /* migratetype of the to-be-freed page */ > - > page = list_last_entry(list, struct page, lru); > - /* must delete as __free_one_page list manipulates */ > + /* must delete to avoid corrupting pcp list */ > list_del(&page->lru); > pcp->count--; > > - mt = get_pcppage_migratetype(page); > - /* MIGRATE_ISOLATE page should not go to pcplists */ > - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > - /* Pageblock could have been isolated meanwhile */ > - if (unlikely(isolated_pageblocks)) > - mt = get_pageblock_migratetype(page); > - > if (bulkfree_pcp_prepare(page)) > continue; > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt); > - trace_mm_page_pcpu_drain(page, 0, mt); > + list_add_tail(&page->lru, &head); > } while (--count && --batch_free && !list_empty(list)); > } > + > + spin_lock(&zone->lock); > + isolated_pageblocks = has_isolate_pageblock(zone); > + > + /* > + * Use safe version since after __free_one_page(), > + * page->lru.next will not point to original list. > + */ > + list_for_each_entry_safe(page, tmp, &head, lru) { > + int mt = get_pcppage_migratetype(page); > + /* MIGRATE_ISOLATE page should not go to pcplists */ > + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > + /* Pageblock could have been isolated meanwhile */ > + if (unlikely(isolated_pageblocks)) > + mt = get_pageblock_migratetype(page); > + > + __free_one_page(page, page_to_pfn(page), zone, 0, mt); > + trace_mm_page_pcpu_drain(page, 0, mt); > + } > spin_unlock(&zone->lock); > } > > -- > 2.14.3 > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 2B7216B0009 for ; Thu, 1 Mar 2018 09:00:51 -0500 (EST) Received: by mail-wr0-f198.google.com with SMTP id v16so4186783wrv.14 for ; Thu, 01 Mar 2018 06:00:51 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id r123si2774930wmf.51.2018.03.01.06.00.49 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 01 Mar 2018 06:00:49 -0800 (PST) Date: Thu, 1 Mar 2018 15:00:44 +0100 From: Michal Hocko Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180301140044.GK15057@dhcp22.suse.cz> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301062845.26038-4-aaron.lu@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu 01-03-18 14:28:45, Aaron Lu wrote: > When a page is freed back to the global pool, its buddy will be checked > to see if it's possible to do a merge. This requires accessing buddy's > page structure and that access could take a long time if it's cache cold. > > This patch adds a prefetch to the to-be-freed page's buddy outside of > zone->lock in hope of accessing buddy's page structure later under > zone->lock will be faster. Since we *always* do buddy merging and check > an order-0 page's buddy to try to merge it when it goes into the main > allocator, the cacheline will always come in, i.e. the prefetched data > will never be unused. > > In the meantime, there are two concerns: > 1 the prefetch could potentially evict existing cachelines, especially > for L1D cache since it is not huge; > 2 there is some additional instruction overhead, namely calculating > buddy pfn twice. > > For 1, it's hard to say, this microbenchmark though shows good result but > the actual benefit of this patch will be workload/CPU dependant; > For 2, since the calculation is a XOR on two local variables, it's expected > in many cases that cycles spent will be offset by reduced memory latency > later. This is especially true for NUMA machines where multiple CPUs are > contending on zone->lock and the most time consuming part under zone->lock > is the wait of 'struct page' cacheline of the to-be-freed pages and their > buddies. > > Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > Note: this patch's performance improvement percent is against patch2/3. I am really surprised that this has such a big impact. Is this a win on other architectures as well? > [changelog stole from Dave Hansen and Mel Gorman's comments] > https://lkml.org/lkml/2018/1/24/551 Please use http://lkml.kernel.org/r/ for references because lkml.org is quite unstable. It would be http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com here. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id 874DA6B0003 for ; Thu, 1 Mar 2018 19:01:09 -0500 (EST) Received: by mail-wr0-f197.google.com with SMTP id 96so2844941wrk.12 for ; Thu, 01 Mar 2018 16:01:09 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id q37si3646753wrb.536.2018.03.01.16.01.07 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 01 Mar 2018 16:01:08 -0800 (PST) Date: Thu, 1 Mar 2018 16:01:05 -0800 From: Andrew Morton Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-Id: <20180301160105.aca958fac871998d582307d4@linux-foundation.org> In-Reply-To: <20180301062845.26038-3-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu wrote: > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > the zone->lock is held and then pages are chosen from PCP's migratetype > list. While there is actually no need to do this 'choose part' under > lock since it's PCP pages, the only CPU that can touch them is us and > irq is also disabled. > > Moving this part outside could reduce lock held time and improve > performance. Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > What the test does is: starts $nr_cpu processes and each will repeatedly > do the following for 5 minutes: > 1 mmap 128M anonymouse space; > 2 write access to that space; > 3 munmap. > The score is the aggregated iteration. But it's a loss for uniprocessor systems: it adds more code and adds an additional pass across a list. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id 1D4226B0003 for ; Thu, 1 Mar 2018 19:09:56 -0500 (EST) Received: by mail-wr0-f199.google.com with SMTP id j28so5113679wrd.17 for ; Thu, 01 Mar 2018 16:09:56 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id z4si3385701wrh.1.2018.03.01.16.09.54 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 01 Mar 2018 16:09:54 -0800 (PST) Date: Thu, 1 Mar 2018 16:09:50 -0800 From: Andrew Morton Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-Id: <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> In-Reply-To: <20180301062845.26038-4-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu, 1 Mar 2018 14:28:45 +0800 Aaron Lu wrote: > When a page is freed back to the global pool, its buddy will be checked > to see if it's possible to do a merge. This requires accessing buddy's > page structure and that access could take a long time if it's cache cold. > > This patch adds a prefetch to the to-be-freed page's buddy outside of > zone->lock in hope of accessing buddy's page structure later under > zone->lock will be faster. Since we *always* do buddy merging and check > an order-0 page's buddy to try to merge it when it goes into the main > allocator, the cacheline will always come in, i.e. the prefetched data > will never be unused. > > In the meantime, there are two concerns: > 1 the prefetch could potentially evict existing cachelines, especially > for L1D cache since it is not huge; > 2 there is some additional instruction overhead, namely calculating > buddy pfn twice. > > For 1, it's hard to say, this microbenchmark though shows good result but > the actual benefit of this patch will be workload/CPU dependant; > For 2, since the calculation is a XOR on two local variables, it's expected > in many cases that cycles spent will be offset by reduced memory latency > later. This is especially true for NUMA machines where multiple CPUs are > contending on zone->lock and the most time consuming part under zone->lock > is the wait of 'struct page' cacheline of the to-be-freed pages and their > buddies. > > Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > Note: this patch's performance improvement percent is against patch2/3. > > ... > > @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, > continue; > > list_add_tail(&page->lru, &head); > + > + /* > + * We are going to put the page back to the global > + * pool, prefetch its buddy to speed up later access > + * under zone->lock. It is believed the overhead of > + * calculating buddy_pfn here can be offset by reduced > + * memory latency later. > + */ > + pfn = page_to_pfn(page); > + buddy_pfn = __find_buddy_pfn(pfn, 0); > + buddy = page + (buddy_pfn - pfn); > + prefetch(buddy); What is the typical list length here? Maybe it's approximately the pcp batch size which is typically 128 pages? If so, I'm a bit surprised that it is effective to prefetch 128 page frames before using any them for real. I guess they'll fit in the L2 cache. Thoughts? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id B0F1D6B0003 for ; Fri, 2 Mar 2018 02:14:36 -0500 (EST) Received: by mail-pg0-f72.google.com with SMTP id k62so3758121pgd.11 for ; Thu, 01 Mar 2018 23:14:36 -0800 (PST) Received: from mga17.intel.com (mga17.intel.com. [192.55.52.151]) by mx.google.com with ESMTPS id 61-v6si4443032plf.640.2018.03.01.23.14.35 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 01 Mar 2018 23:14:35 -0800 (PST) Date: Fri, 2 Mar 2018 15:15:34 +0800 From: Aaron Lu Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180302071533.GA6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301135518.GJ15057@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301135518.GJ15057@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu, Mar 01, 2018 at 02:55:18PM +0100, Michal Hocko wrote: > On Thu 01-03-18 14:28:44, Aaron Lu wrote: > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > the zone->lock is held and then pages are chosen from PCP's migratetype > > list. While there is actually no need to do this 'choose part' under > > lock since it's PCP pages, the only CPU that can touch them is us and > > irq is also disabled. > > > > Moving this part outside could reduce lock held time and improve > > performance. Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > > > What the test does is: starts $nr_cpu processes and each will repeatedly > > do the following for 5 minutes: > > 1 mmap 128M anonymouse space; > > 2 write access to that space; > > 3 munmap. > > The score is the aggregated iteration. > > Iteration count I assume. Correct. > I am still quite surprised that this would have such a large impact. Most likely due to the cachelines for these page structures are warmed up outside of zone->lock. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 20AAA6B0007 for ; Fri, 2 Mar 2018 02:31:46 -0500 (EST) Received: by mail-pg0-f71.google.com with SMTP id m198so2726711pga.4 for ; Thu, 01 Mar 2018 23:31:46 -0800 (PST) Received: from mga17.intel.com (mga17.intel.com. [192.55.52.151]) by mx.google.com with ESMTPS id l184si3644118pge.224.2018.03.01.23.31.44 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 01 Mar 2018 23:31:45 -0800 (PST) From: "Huang\, Ying" Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301135518.GJ15057@dhcp22.suse.cz> Date: Fri, 02 Mar 2018 15:31:40 +0800 In-Reply-To: <20180301135518.GJ15057@dhcp22.suse.cz> (Michal Hocko's message of "Thu, 1 Mar 2018 14:55:18 +0100") Message-ID: <87r2p3c4rn.fsf@yhuang-dev.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ascii Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Aaron Lu , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Michal Hocko writes: > On Thu 01-03-18 14:28:44, Aaron Lu wrote: >> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, >> the zone->lock is held and then pages are chosen from PCP's migratetype >> list. While there is actually no need to do this 'choose part' under >> lock since it's PCP pages, the only CPU that can touch them is us and >> irq is also disabled. >> >> Moving this part outside could reduce lock held time and improve >> performance. Test with will-it-scale/page_fault1 full load: >> >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) >> v4.16-rc2+ 9034215 7971818 13667135 15677465 >> this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% >> >> What the test does is: starts $nr_cpu processes and each will repeatedly >> do the following for 5 minutes: >> 1 mmap 128M anonymouse space; >> 2 write access to that space; >> 3 munmap. >> The score is the aggregated iteration. > > Iteration count I assume. I am still quite surprised that this would > have such a large impact. The test is run with full load, this means near or more than 100 processes will allocate memory in parallel. According to Amdahl's law, the performance of a parallel program will be dominated by the serial part. For this case, the part protected by zone->lock. So small changes to code under zone->lock could make bigger changes to overall score. Best Regards, Huang, Ying -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id A673E6B0003 for ; Fri, 2 Mar 2018 03:00:26 -0500 (EST) Received: by mail-pf0-f199.google.com with SMTP id z5so4901424pfe.16 for ; Fri, 02 Mar 2018 00:00:26 -0800 (PST) Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTPS id p2si3689885pga.143.2018.03.02.00.00.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Mar 2018 00:00:25 -0800 (PST) Date: Fri, 2 Mar 2018 16:01:25 +0800 From: Aaron Lu Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180302080125.GB6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301160105.aca958fac871998d582307d4@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301160105.aca958fac871998d582307d4@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote: > On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu wrote: > > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > the zone->lock is held and then pages are chosen from PCP's migratetype > > list. While there is actually no need to do this 'choose part' under > > lock since it's PCP pages, the only CPU that can touch them is us and > > irq is also disabled. > > > > Moving this part outside could reduce lock held time and improve > > performance. Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > > > What the test does is: starts $nr_cpu processes and each will repeatedly > > do the following for 5 minutes: > > 1 mmap 128M anonymouse space; > > 2 write access to that space; > > 3 munmap. > > The score is the aggregated iteration. > > But it's a loss for uniprocessor systems: it adds more code and adds an > additional pass across a list. Performance wise, I assume the loss is pretty small and can not be measured. On my Sandybridge desktop, with will-it-scale/page_fault1/single process run to emulate uniprocessor system, the score is(average of 3 runs): base(patch 1/3): 649710 this patch: 653554 +0.6% prefetch(patch 3/3): 650336 (in noise range compared to base) On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single process run: base(patch 1/3): 498649 this patch: 504171 +1.1% prefetch(patch 3/3): 506334 +1.5% (compared to base) It looks like we don't need to worry too much about performance for uniprocessor system. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 3A6E16B0007 for ; Fri, 2 Mar 2018 03:26:58 -0500 (EST) Received: by mail-pf0-f197.google.com with SMTP id x7so4958051pfd.19 for ; Fri, 02 Mar 2018 00:26:58 -0800 (PST) Received: from mga04.intel.com (mga04.intel.com. [192.55.52.120]) by mx.google.com with ESMTPS id l3si3684015pgp.141.2018.03.02.00.26.56 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Mar 2018 00:26:56 -0800 (PST) Date: Fri, 2 Mar 2018 16:27:56 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180302082756.GC6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu, Mar 01, 2018 at 04:09:50PM -0800, Andrew Morton wrote: > On Thu, 1 Mar 2018 14:28:45 +0800 Aaron Lu wrote: > > > When a page is freed back to the global pool, its buddy will be checked > > to see if it's possible to do a merge. This requires accessing buddy's > > page structure and that access could take a long time if it's cache cold. > > > > This patch adds a prefetch to the to-be-freed page's buddy outside of > > zone->lock in hope of accessing buddy's page structure later under > > zone->lock will be faster. Since we *always* do buddy merging and check > > an order-0 page's buddy to try to merge it when it goes into the main > > allocator, the cacheline will always come in, i.e. the prefetched data > > will never be unused. > > > > In the meantime, there are two concerns: > > 1 the prefetch could potentially evict existing cachelines, especially > > for L1D cache since it is not huge; > > 2 there is some additional instruction overhead, namely calculating > > buddy pfn twice. > > > > For 1, it's hard to say, this microbenchmark though shows good result but > > the actual benefit of this patch will be workload/CPU dependant; > > For 2, since the calculation is a XOR on two local variables, it's expected > > in many cases that cycles spent will be offset by reduced memory latency > > later. This is especially true for NUMA machines where multiple CPUs are > > contending on zone->lock and the most time consuming part under zone->lock > > is the wait of 'struct page' cacheline of the to-be-freed pages and their > > buddies. > > > > Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > > Note: this patch's performance improvement percent is against patch2/3. > > > > ... > > > > @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, > > continue; > > > > list_add_tail(&page->lru, &head); > > + > > + /* > > + * We are going to put the page back to the global > > + * pool, prefetch its buddy to speed up later access > > + * under zone->lock. It is believed the overhead of > > + * calculating buddy_pfn here can be offset by reduced > > + * memory latency later. > > + */ > > + pfn = page_to_pfn(page); > > + buddy_pfn = __find_buddy_pfn(pfn, 0); > > + buddy = page + (buddy_pfn - pfn); > > + prefetch(buddy); > > What is the typical list length here? Maybe it's approximately the pcp > batch size which is typically 128 pages? Most of time it is pcp->batch, unless when pcp's pages need to be all drained like in drain_local_pages(zone). The pcp->batch has a default value of 31 and its upper limit is 96 for x86_64. For this test, it is 31 here, I didn't manipulate /proc/sys/vm/percpu_pagelist_fraction to change it. With this said, the count here could be pcp->count when pcp's pages need to be all drained and though pcp->count's default value is (6*pcp->batch)=186, user can increase that value through the above mentioned procfs interface and the resulting pcp->count could be too big for prefetch. Ying also mentioned this today and suggested adding an upper limit here to avoid prefetching too much. Perhaps just prefetch the last pcp->batch pages if count here > pcp->batch? Since pcp->batch has an upper limit, we won't need to worry prefetching too much. > > If so, I'm a bit surprised that it is effective to prefetch 128 page > frames before using any them for real. I guess they'll fit in the L2 > cache. Thoughts? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id DB92D6B0007 for ; Fri, 2 Mar 2018 03:30:20 -0500 (EST) Received: by mail-pg0-f69.google.com with SMTP id m19so3859733pgv.5 for ; Fri, 02 Mar 2018 00:30:20 -0800 (PST) Received: from mga07.intel.com (mga07.intel.com. [134.134.136.100]) by mx.google.com with ESMTPS id u25si4479139pfm.164.2018.03.02.00.30.19 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Mar 2018 00:30:19 -0800 (PST) Date: Fri, 2 Mar 2018 16:31:19 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180302083119.GD6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301140044.GK15057@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Thu, Mar 01, 2018 at 03:00:44PM +0100, Michal Hocko wrote: > On Thu 01-03-18 14:28:45, Aaron Lu wrote: > > When a page is freed back to the global pool, its buddy will be checked > > to see if it's possible to do a merge. This requires accessing buddy's > > page structure and that access could take a long time if it's cache cold. > > > > This patch adds a prefetch to the to-be-freed page's buddy outside of > > zone->lock in hope of accessing buddy's page structure later under > > zone->lock will be faster. Since we *always* do buddy merging and check > > an order-0 page's buddy to try to merge it when it goes into the main > > allocator, the cacheline will always come in, i.e. the prefetched data > > will never be unused. > > > > In the meantime, there are two concerns: > > 1 the prefetch could potentially evict existing cachelines, especially > > for L1D cache since it is not huge; > > 2 there is some additional instruction overhead, namely calculating > > buddy pfn twice. > > > > For 1, it's hard to say, this microbenchmark though shows good result but > > the actual benefit of this patch will be workload/CPU dependant; > > For 2, since the calculation is a XOR on two local variables, it's expected > > in many cases that cycles spent will be offset by reduced memory latency > > later. This is especially true for NUMA machines where multiple CPUs are > > contending on zone->lock and the most time consuming part under zone->lock > > is the wait of 'struct page' cacheline of the to-be-freed pages and their > > buddies. > > > > Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > > Note: this patch's performance improvement percent is against patch2/3. > > I am really surprised that this has such a big impact. Is this a win on > other architectures as well? For NUMA machines, I guess so. But I didn't test other archs so can't say for sure. > > > [changelog stole from Dave Hansen and Mel Gorman's comments] > > https://lkml.org/lkml/2018/1/24/551 > > Please use http://lkml.kernel.org/r/ for references because > lkml.org is quite unstable. It would be > http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com > here. Good to know this, thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 761356B0007 for ; Fri, 2 Mar 2018 10:34:04 -0500 (EST) Received: by mail-pf0-f198.google.com with SMTP id g66so5450759pfj.11 for ; Fri, 02 Mar 2018 07:34:04 -0800 (PST) Received: from mga06.intel.com (mga06.intel.com. [134.134.136.31]) by mx.google.com with ESMTPS id a88si5132882pfk.40.2018.03.02.07.34.03 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Mar 2018 07:34:03 -0800 (PST) Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301135518.GJ15057@dhcp22.suse.cz> <20180302071533.GA6356@intel.com> From: Dave Hansen Message-ID: Date: Fri, 2 Mar 2018 07:34:01 -0800 MIME-Version: 1.0 In-Reply-To: <20180302071533.GA6356@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu , Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On 03/01/2018 11:15 PM, Aaron Lu wrote: > >> I am still quite surprised that this would have such a large impact. > Most likely due to the cachelines for these page structures are warmed > up outside of zone->lock. The workload here is a pretty tight microbenchmark and single biggest bottleneck is cache misses on 'struct page'. It's not memory bandwidth bound. So, anything you can give the CPU keep it fed and not waiting on cache misses will be a win. There's never going to be a real-world workload that sees this kind of increase, but the change in the micro isn't super-surprising because it so directly targets the bottleneck. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id C8ABA6B000E for ; Fri, 2 Mar 2018 12:57:14 -0500 (EST) Received: by mail-wr0-f199.google.com with SMTP id 5so6857594wrb.15 for ; Fri, 02 Mar 2018 09:57:14 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id c128si954775wma.181.2018.03.02.09.57.13 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 02 Mar 2018 09:57:13 -0800 (PST) Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> From: Vlastimil Babka Message-ID: Date: Fri, 2 Mar 2018 18:55:25 +0100 MIME-Version: 1.0 In-Reply-To: <20180301140044.GK15057@dhcp22.suse.cz> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko , Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes On 03/01/2018 03:00 PM, Michal Hocko wrote: > On Thu 01-03-18 14:28:45, Aaron Lu wrote: >> When a page is freed back to the global pool, its buddy will be checked >> to see if it's possible to do a merge. This requires accessing buddy's >> page structure and that access could take a long time if it's cache cold. >> >> This patch adds a prefetch to the to-be-freed page's buddy outside of >> zone->lock in hope of accessing buddy's page structure later under >> zone->lock will be faster. Since we *always* do buddy merging and check >> an order-0 page's buddy to try to merge it when it goes into the main >> allocator, the cacheline will always come in, i.e. the prefetched data >> will never be unused. >> >> In the meantime, there are two concerns: >> 1 the prefetch could potentially evict existing cachelines, especially >> for L1D cache since it is not huge; >> 2 there is some additional instruction overhead, namely calculating >> buddy pfn twice. >> >> For 1, it's hard to say, this microbenchmark though shows good result but >> the actual benefit of this patch will be workload/CPU dependant; >> For 2, since the calculation is a XOR on two local variables, it's expected >> in many cases that cycles spent will be offset by reduced memory latency >> later. This is especially true for NUMA machines where multiple CPUs are >> contending on zone->lock and the most time consuming part under zone->lock >> is the wait of 'struct page' cacheline of the to-be-freed pages and their >> buddies. >> >> Test with will-it-scale/page_fault1 full load: >> >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) >> v4.16-rc2+ 9034215 7971818 13667135 15677465 >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% >> Note: this patch's performance improvement percent is against patch2/3. > > I am really surprised that this has such a big impact. It's even stranger to me. Struct page is 64 bytes these days, exactly a a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache line (that forms an aligned 128 bytes block with the one we touch). Which is exactly a order-0 buddy struct page! Maybe that implicit prefetching stopped at L2 and explicit goes all the way to L1, can't remember. Would that make such a difference? It would be nice to do some perf tests with cache counters to see what is really going on... Vlastimil > Is this a win on > other architectures as well? > >> [changelog stole from Dave Hansen and Mel Gorman's comments] >> https://lkml.org/lkml/2018/1/24/551 > > Please use http://lkml.kernel.org/r/ for references because > lkml.org is quite unstable. It would be > http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com > here. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 028C36B000E for ; Fri, 2 Mar 2018 13:00:28 -0500 (EST) Received: by mail-pf0-f197.google.com with SMTP id d5so2795428pfn.12 for ; Fri, 02 Mar 2018 10:00:27 -0800 (PST) Received: from mga17.intel.com (mga17.intel.com. [192.55.52.151]) by mx.google.com with ESMTPS id bj2-v6si5117666plb.286.2018.03.02.10.00.26 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Mar 2018 10:00:27 -0800 (PST) Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> From: Dave Hansen Message-ID: <2433e857-fea7-9af4-d124-538ad17de454@intel.com> Date: Fri, 2 Mar 2018 10:00:24 -0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka , Michal Hocko , Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes On 03/02/2018 09:55 AM, Vlastimil Babka wrote: > It's even stranger to me. Struct page is 64 bytes these days, exactly a > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > line (that forms an aligned 128 bytes block with the one we touch). I believe that was a behavior that was specific to the Pentium 4 "Netburst" era. I don't think the 128-byte line behavior exists on modern Intel cpus. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id C82016B0005 for ; Fri, 2 Mar 2018 13:09:48 -0500 (EST) Received: by mail-wr0-f197.google.com with SMTP id 5so6878404wrb.15 for ; Fri, 02 Mar 2018 10:09:48 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id r4si5088006wrg.527.2018.03.02.10.09.47 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 02 Mar 2018 10:09:47 -0800 (PST) Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <2433e857-fea7-9af4-d124-538ad17de454@intel.com> From: Vlastimil Babka Message-ID: <787e53b9-20d3-a2df-44fc-ea8c31a6ced3@suse.cz> Date: Fri, 2 Mar 2018 19:08:00 +0100 MIME-Version: 1.0 In-Reply-To: <2433e857-fea7-9af4-d124-538ad17de454@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , Michal Hocko , Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes On 03/02/2018 07:00 PM, Dave Hansen wrote: > On 03/02/2018 09:55 AM, Vlastimil Babka wrote: >> It's even stranger to me. Struct page is 64 bytes these days, exactly a >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache >> line (that forms an aligned 128 bytes block with the one we touch). > > I believe that was a behavior that was specific to the Pentium 4 > "Netburst" era. I don't think the 128-byte line behavior exists on > modern Intel cpus. I remember it on Core 2 something (Nehalem IIRC). And this page suggests up to Broadwell, and it can be disabled. And it's an L2 prefetcher indeed. https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200]) by kanga.kvack.org (Postfix) with ESMTP id 1DF906B0005 for ; Fri, 2 Mar 2018 16:23:37 -0500 (EST) Received: by mail-wr0-f200.google.com with SMTP id 96so4770130wrk.12 for ; Fri, 02 Mar 2018 13:23:37 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id j12si1338612wmc.98.2018.03.02.13.23.35 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Mar 2018 13:23:35 -0800 (PST) Date: Fri, 2 Mar 2018 13:23:32 -0800 From: Andrew Morton Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-Id: <20180302132332.2c69559686ff24d15ff44ae8@linux-foundation.org> In-Reply-To: <20180302080125.GB6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301160105.aca958fac871998d582307d4@linux-foundation.org> <20180302080125.GB6356@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Fri, 2 Mar 2018 16:01:25 +0800 Aaron Lu wrote: > On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote: > > On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu wrote: > > > > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > > the zone->lock is held and then pages are chosen from PCP's migratetype > > > list. While there is actually no need to do this 'choose part' under > > > lock since it's PCP pages, the only CPU that can touch them is us and > > > irq is also disabled. > > > > > > Moving this part outside could reduce lock held time and improve > > > performance. Test with will-it-scale/page_fault1 full load: > > > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > > > > > What the test does is: starts $nr_cpu processes and each will repeatedly > > > do the following for 5 minutes: > > > 1 mmap 128M anonymouse space; > > > 2 write access to that space; > > > 3 munmap. > > > The score is the aggregated iteration. > > > > But it's a loss for uniprocessor systems: it adds more code and adds an > > additional pass across a list. > > Performance wise, I assume the loss is pretty small and can not > be measured. > > On my Sandybridge desktop, with will-it-scale/page_fault1/single process > run to emulate uniprocessor system, the score is(average of 3 runs): > > base(patch 1/3): 649710 > this patch: 653554 +0.6% Does that mean we got faster or slower? > prefetch(patch 3/3): 650336 (in noise range compared to base) > > On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single > process run: > > base(patch 1/3): 498649 > this patch: 504171 +1.1% > prefetch(patch 3/3): 506334 +1.5% (compared to base) > > It looks like we don't need to worry too much about performance for > uniprocessor system. Well. We can say that of hundreds of patches. And we end up with a fatter and slower kernel than we otherwise would. Please take a look, see if there's a tidy way of avoiding this. Probably there isn't, in which case oh well. But let's at least try. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id B3F176B0005 for ; Fri, 2 Mar 2018 16:25:38 -0500 (EST) Received: by mail-pf0-f198.google.com with SMTP id x6so624433pfx.16 for ; Fri, 02 Mar 2018 13:25:38 -0800 (PST) Received: from mga12.intel.com (mga12.intel.com. [192.55.52.136]) by mx.google.com with ESMTPS id y2si4561439pgr.167.2018.03.02.13.25.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Mar 2018 13:25:37 -0800 (PST) Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301160105.aca958fac871998d582307d4@linux-foundation.org> <20180302080125.GB6356@intel.com> <20180302132332.2c69559686ff24d15ff44ae8@linux-foundation.org> From: Dave Hansen Message-ID: <1110bcae-f4c5-68ff-77df-28934a21dd86@intel.com> Date: Fri, 2 Mar 2018 13:25:35 -0800 MIME-Version: 1.0 In-Reply-To: <20180302132332.2c69559686ff24d15ff44ae8@linux-foundation.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On 03/02/2018 01:23 PM, Andrew Morton wrote: >> On my Sandybridge desktop, with will-it-scale/page_fault1/single process >> run to emulate uniprocessor system, the score is(average of 3 runs): >> >> base(patch 1/3): 649710 >> this patch: 653554 +0.6% > Does that mean we got faster or slower? Faster. The unit of measurement here is iterations through a loop. More iterations are better. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 817846B0055 for ; Mon, 5 Mar 2018 06:41:04 -0500 (EST) Received: by mail-pf0-f198.google.com with SMTP id s21so1427567pfm.15 for ; Mon, 05 Mar 2018 03:41:04 -0800 (PST) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTPS id x87si10127947pff.17.2018.03.05.03.41.02 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Mar 2018 03:41:02 -0800 (PST) Date: Mon, 5 Mar 2018 19:41:59 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180305114159.GA32573@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: > On 03/01/2018 03:00 PM, Michal Hocko wrote: > > On Thu 01-03-18 14:28:45, Aaron Lu wrote: > >> When a page is freed back to the global pool, its buddy will be checked > >> to see if it's possible to do a merge. This requires accessing buddy's > >> page structure and that access could take a long time if it's cache cold. > >> > >> This patch adds a prefetch to the to-be-freed page's buddy outside of > >> zone->lock in hope of accessing buddy's page structure later under > >> zone->lock will be faster. Since we *always* do buddy merging and check > >> an order-0 page's buddy to try to merge it when it goes into the main > >> allocator, the cacheline will always come in, i.e. the prefetched data > >> will never be unused. > >> > >> In the meantime, there are two concerns: > >> 1 the prefetch could potentially evict existing cachelines, especially > >> for L1D cache since it is not huge; > >> 2 there is some additional instruction overhead, namely calculating > >> buddy pfn twice. > >> > >> For 1, it's hard to say, this microbenchmark though shows good result but > >> the actual benefit of this patch will be workload/CPU dependant; > >> For 2, since the calculation is a XOR on two local variables, it's expected > >> in many cases that cycles spent will be offset by reduced memory latency > >> later. This is especially true for NUMA machines where multiple CPUs are > >> contending on zone->lock and the most time consuming part under zone->lock > >> is the wait of 'struct page' cacheline of the to-be-freed pages and their > >> buddies. > >> > >> Test with will-it-scale/page_fault1 full load: > >> > >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > >> v4.16-rc2+ 9034215 7971818 13667135 15677465 > >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > >> Note: this patch's performance improvement percent is against patch2/3. > > > > I am really surprised that this has such a big impact. > > It's even stranger to me. Struct page is 64 bytes these days, exactly a > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > line (that forms an aligned 128 bytes block with the one we touch). > Which is exactly a order-0 buddy struct page! Maybe that implicit > prefetching stopped at L2 and explicit goes all the way to L1, can't The Intel Architecture Optimization Manual section 7.3.2 says: prefetchT0 - fetch data into all cache levels Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: 1st, 2nd and 3rd level cache. prefetchT2 - fetch data into 2nd and 3rd level caches (identical to prefetchT1) Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: 2nd and 3rd level cache. prefetchNTA - fetch data into non-temporal cache close to the processor, minimizing cache pollution Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: must fetch into 3rd level cache with fast replacement. I tried 'prefetcht0' and 'prefetcht2' instead of the default 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about the same performance number as prefetchNTA. I had expected prefetchT0 to deliver a better score if it was indeed due to L1D since prefetchT2 will not place data into L1 while prefetchT0 will, but looks like it is not the case here. It feels more like the buddy cacheline isn't in any level of the caches without prefetch for some reason. > remember. Would that make such a difference? It would be nice to do some > perf tests with cache counters to see what is really going on... Compare prefetchT2 to no-prefetch, I saw these metrics change: no-prefetch change prefetchT2 metrics \ \ stddev stddev ------------------------------------------------------------------------ 0.18 +0.0 0.18 perf-stat.branch-miss-rate% 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references 3.52 -1.1% 3.48 perf-stat.cpi 0.02 -0.0 0.01 +-3% perf-stat.dTLB-load-miss-rate% 8.677e+08 -7.3% 8.048e+08 +-3% perf-stat.dTLB-load-misses 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate% 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores 6.126e+09 +10.1% 6.745e+09 +-3% perf-stat.iTLB-load-misses 3464 -8.4% 3172 +-3% perf-stat.instructions-per-iTLB-miss 0.28 +1.1% 0.29 perf-stat.ipc 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults 2182469 -4.2% 2090977 perf-stat.path-length Not sure if this is useful though... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id EF0556B005C for ; Mon, 5 Mar 2018 06:48:32 -0500 (EST) Received: by mail-pg0-f69.google.com with SMTP id d19so7273567pgn.20 for ; Mon, 05 Mar 2018 03:48:32 -0800 (PST) Received: from mga04.intel.com (mga04.intel.com. [192.55.52.120]) by mx.google.com with ESMTPS id e22-v6si7629987plj.143.2018.03.05.03.48.31 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Mar 2018 03:48:31 -0800 (PST) Date: Mon, 5 Mar 2018 19:48:27 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180305114826.GA2184@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180305114159.GA32573@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes On Mon, Mar 05, 2018 at 07:41:59PM +0800, Aaron Lu wrote: > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: > > On 03/01/2018 03:00 PM, Michal Hocko wrote: > > > On Thu 01-03-18 14:28:45, Aaron Lu wrote: > > >> When a page is freed back to the global pool, its buddy will be checked > > >> to see if it's possible to do a merge. This requires accessing buddy's > > >> page structure and that access could take a long time if it's cache cold. > > >> > > >> This patch adds a prefetch to the to-be-freed page's buddy outside of > > >> zone->lock in hope of accessing buddy's page structure later under > > >> zone->lock will be faster. Since we *always* do buddy merging and check > > >> an order-0 page's buddy to try to merge it when it goes into the main > > >> allocator, the cacheline will always come in, i.e. the prefetched data > > >> will never be unused. > > >> > > >> In the meantime, there are two concerns: > > >> 1 the prefetch could potentially evict existing cachelines, especially > > >> for L1D cache since it is not huge; > > >> 2 there is some additional instruction overhead, namely calculating > > >> buddy pfn twice. > > >> > > >> For 1, it's hard to say, this microbenchmark though shows good result but > > >> the actual benefit of this patch will be workload/CPU dependant; > > >> For 2, since the calculation is a XOR on two local variables, it's expected > > >> in many cases that cycles spent will be offset by reduced memory latency > > >> later. This is especially true for NUMA machines where multiple CPUs are > > >> contending on zone->lock and the most time consuming part under zone->lock > > >> is the wait of 'struct page' cacheline of the to-be-freed pages and their > > >> buddies. > > >> > > >> Test with will-it-scale/page_fault1 full load: > > >> > > >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > >> v4.16-rc2+ 9034215 7971818 13667135 15677465 > > >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > > >> Note: this patch's performance improvement percent is against patch2/3. > > > > > > I am really surprised that this has such a big impact. > > > > It's even stranger to me. Struct page is 64 bytes these days, exactly a > > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > > line (that forms an aligned 128 bytes block with the one we touch). > > Which is exactly a order-0 buddy struct page! Maybe that implicit > > prefetching stopped at L2 and explicit goes all the way to L1, can't > > The Intel Architecture Optimization Manual section 7.3.2 says: > > prefetchT0 - fetch data into all cache levels > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 1st, 2nd and 3rd level cache. > > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to > prefetchT1) > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 2nd and 3rd level cache. > > prefetchNTA - fetch data into non-temporal cache close to the processor, > minimizing cache pollution > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: must fetch into 3rd level cache with fast replacement. > > I tried 'prefetcht0' and 'prefetcht2' instead of the default > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about ~~~~~~~ Correction: should be Broadwell here. > the same performance number as prefetchNTA. I had expected prefetchT0 to > deliver a better score if it was indeed due to L1D since prefetchT2 will > not place data into L1 while prefetchT0 will, but looks like it is not > the case here. > > It feels more like the buddy cacheline isn't in any level of the caches > without prefetch for some reason. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id AF4B46B0005 for ; Tue, 6 Mar 2018 02:56:01 -0500 (EST) Received: by mail-pf0-f200.google.com with SMTP id d5so8240657pfn.12 for ; Mon, 05 Mar 2018 23:56:01 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id m12-v6si7641055pls.74.2018.03.05.23.55.59 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 05 Mar 2018 23:55:59 -0800 (PST) Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> From: Vlastimil Babka Message-ID: Date: Tue, 6 Mar 2018 08:55:57 +0100 MIME-Version: 1.0 In-Reply-To: <20180305114159.GA32573@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes On 03/05/2018 12:41 PM, Aaron Lu wrote: > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: >> On 03/01/2018 03:00 PM, Michal Hocko wrote: >>> >>> I am really surprised that this has such a big impact. >> >> It's even stranger to me. Struct page is 64 bytes these days, exactly a >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache >> line (that forms an aligned 128 bytes block with the one we touch). >> Which is exactly a order-0 buddy struct page! Maybe that implicit >> prefetching stopped at L2 and explicit goes all the way to L1, can't > > The Intel Architecture Optimization Manual section 7.3.2 says: > > prefetchT0 - fetch data into all cache levels > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 1st, 2nd and 3rd level cache. > > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to > prefetchT1) > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 2nd and 3rd level cache. > > prefetchNTA - fetch data into non-temporal cache close to the processor, > minimizing cache pollution > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: must fetch into 3rd level cache with fast replacement. > > I tried 'prefetcht0' and 'prefetcht2' instead of the default > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about > the same performance number as prefetchNTA. I had expected prefetchT0 to > deliver a better score if it was indeed due to L1D since prefetchT2 will > not place data into L1 while prefetchT0 will, but looks like it is not > the case here. > > It feels more like the buddy cacheline isn't in any level of the caches > without prefetch for some reason. So the adjacent line prefetch might be disabled? Could you check bios or the MSR mentioned in https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors >> remember. Would that make such a difference? It would be nice to do some >> perf tests with cache counters to see what is really going on... > > Compare prefetchT2 to no-prefetch, I saw these metrics change: > > no-prefetch change prefetchT2 metrics > \ \ > stddev stddev > ------------------------------------------------------------------------ > 0.18 +0.0 0.18 perf-stat.branch-miss-rate% > 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses > 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses > 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references > 3.52 -1.1% 3.48 perf-stat.cpi > 0.02 -0.0 0.01 A+-3% perf-stat.dTLB-load-miss-rate% > 8.677e+08 -7.3% 8.048e+08 A+-3% perf-stat.dTLB-load-misses > 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate% > 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses > 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores > 6.126e+09 +10.1% 6.745e+09 A+-3% perf-stat.iTLB-load-misses > 3464 -8.4% 3172 A+-3% perf-stat.instructions-per-iTLB-miss > 0.28 +1.1% 0.29 perf-stat.ipc > 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults > 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads > 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses > 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores > 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults > 2182469 -4.2% 2090977 perf-stat.path-length > > Not sure if this is useful though... Looks like most stats increased in absolute values as the work done increased and this is a time-limited benchmark? Although number of instructions (calculated from itlb misses and insns-per-itlb-miss) shows less than 1% increase, so dunno. And the improvement comes from reduced dTLB-load-misses? That makes no sense for order-0 buddy struct pages which always share a page. And the memmap mapping should use huge pages. BTW what is path-length? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl0-f71.google.com (mail-pl0-f71.google.com [209.85.160.71]) by kanga.kvack.org (Postfix) with ESMTP id E32CC6B0005 for ; Tue, 6 Mar 2018 07:26:39 -0500 (EST) Received: by mail-pl0-f71.google.com with SMTP id w22-v6so2040721pll.2 for ; Tue, 06 Mar 2018 04:26:39 -0800 (PST) Received: from mga01.intel.com (mga01.intel.com. [192.55.52.88]) by mx.google.com with ESMTPS id k23-v6si11091696pls.42.2018.03.06.04.26.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 06 Mar 2018 04:26:38 -0800 (PST) Date: Tue, 6 Mar 2018 20:27:33 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180306122733.GA9664@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote: > On 03/05/2018 12:41 PM, Aaron Lu wrote: > > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: > >> On 03/01/2018 03:00 PM, Michal Hocko wrote: > >>> > >>> I am really surprised that this has such a big impact. > >> > >> It's even stranger to me. Struct page is 64 bytes these days, exactly a > >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > >> line (that forms an aligned 128 bytes block with the one we touch). > >> Which is exactly a order-0 buddy struct page! Maybe that implicit > >> prefetching stopped at L2 and explicit goes all the way to L1, can't > > > > The Intel Architecture Optimization Manual section 7.3.2 says: > > > > prefetchT0 - fetch data into all cache levels > > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > > microarchitectures: 1st, 2nd and 3rd level cache. > > > > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to > > prefetchT1) > > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > > microarchitectures: 2nd and 3rd level cache. > > > > prefetchNTA - fetch data into non-temporal cache close to the processor, > > minimizing cache pollution > > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > > microarchitectures: must fetch into 3rd level cache with fast replacement. > > > > I tried 'prefetcht0' and 'prefetcht2' instead of the default > > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about > > the same performance number as prefetchNTA. I had expected prefetchT0 to > > deliver a better score if it was indeed due to L1D since prefetchT2 will > > not place data into L1 while prefetchT0 will, but looks like it is not > > the case here. > > > > It feels more like the buddy cacheline isn't in any level of the caches > > without prefetch for some reason. > > So the adjacent line prefetch might be disabled? Could you check bios or > the MSR mentioned in > https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors root@lkp-bdw-ep2 ~# rdmsr 0x1a4 0 Looks like this feature isn't disabled(the doc you linked says value 1 means disable). > >> remember. Would that make such a difference? It would be nice to do some > >> perf tests with cache counters to see what is really going on... > > > > Compare prefetchT2 to no-prefetch, I saw these metrics change: > > > > no-prefetch change prefetchT2 metrics > > \ \ > > stddev stddev > > ------------------------------------------------------------------------ > > 0.18 +0.0 0.18 perf-stat.branch-miss-rate% > > 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses > > 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses > > 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references > > 3.52 -1.1% 3.48 perf-stat.cpi > > 0.02 -0.0 0.01 +-3% perf-stat.dTLB-load-miss-rate% > > 8.677e+08 -7.3% 8.048e+08 +-3% perf-stat.dTLB-load-misses > > 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate% > > 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses > > 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores > > 6.126e+09 +10.1% 6.745e+09 +-3% perf-stat.iTLB-load-misses > > 3464 -8.4% 3172 +-3% perf-stat.instructions-per-iTLB-miss > > 0.28 +1.1% 0.29 perf-stat.ipc > > 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults > > 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads > > 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses > > 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores > > 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults > > 2182469 -4.2% 2090977 perf-stat.path-length > > > > Not sure if this is useful though... > > Looks like most stats increased in absolute values as the work done > increased and this is a time-limited benchmark? Although number of Yes it is. > instructions (calculated from itlb misses and insns-per-itlb-miss) shows > less than 1% increase, so dunno. And the improvement comes from reduced > dTLB-load-misses? That makes no sense for order-0 buddy struct pages > which always share a page. And the memmap mapping should use huge pages. THP is disabled to stress order 0 pages(should have mentioned this in patch's description, sorry about this). > BTW what is path-length? It's the instruction path length: the number of machine code instructions required to execute a section of a computer program. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl0-f69.google.com (mail-pl0-f69.google.com [209.85.160.69]) by kanga.kvack.org (Postfix) with ESMTP id CBB836B0005 for ; Tue, 6 Mar 2018 07:53:07 -0500 (EST) Received: by mail-pl0-f69.google.com with SMTP id l5-v6so9740115pli.8 for ; Tue, 06 Mar 2018 04:53:07 -0800 (PST) Received: from bombadil.infradead.org (bombadil.infradead.org. [2607:7c80:54:e::133]) by mx.google.com with ESMTPS id l11-v6si11541906pln.323.2018.03.06.04.53.06 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 06 Mar 2018 04:53:06 -0800 (PST) Date: Tue, 6 Mar 2018 04:53:04 -0800 From: Matthew Wilcox Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180306125303.GA13722@bombadil.infradead.org> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> <20180306122733.GA9664@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180306122733.GA9664@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: Vlastimil Babka , Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , David Rientjes On Tue, Mar 06, 2018 at 08:27:33PM +0800, Aaron Lu wrote: > On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote: > > So the adjacent line prefetch might be disabled? Could you check bios or > > the MSR mentioned in > > https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors > > root@lkp-bdw-ep2 ~# rdmsr 0x1a4 > 0 Technically 0x1a4 is per-core, so you should run rdmsr -a 0x1a4 in order to check all the cores. But I can't imagine they're being set differently on each core. > > instructions (calculated from itlb misses and insns-per-itlb-miss) shows > > less than 1% increase, so dunno. And the improvement comes from reduced > > dTLB-load-misses? That makes no sense for order-0 buddy struct pages > > which always share a page. And the memmap mapping should use huge pages. > > THP is disabled to stress order 0 pages(should have mentioned this in > patch's description, sorry about this). THP isn't related to memmap; the kernel uses huge pages (usually the 1G pages) in order to map its own memory. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 91BF06B0007 for ; Fri, 9 Mar 2018 03:23:40 -0500 (EST) Received: by mail-pf0-f197.google.com with SMTP id g66so1303432pfj.11 for ; Fri, 09 Mar 2018 00:23:40 -0800 (PST) Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTPS id v10-v6si469971plz.427.2018.03.09.00.23.39 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 09 Mar 2018 00:23:39 -0800 (PST) Date: Fri, 9 Mar 2018 16:24:31 +0800 From: Aaron Lu Subject: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180309082431.GB30868@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180302082756.GC6356@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Fri, Mar 02, 2018 at 04:27:56PM +0800, Aaron Lu wrote: > With this said, the count here could be pcp->count when pcp's pages > need to be all drained and though pcp->count's default value is > (6*pcp->batch)=186, user can increase that value through the above > mentioned procfs interface and the resulting pcp->count could be too > big for prefetch. Ying also mentioned this today and suggested adding > an upper limit here to avoid prefetching too much. Perhaps just prefetch > the last pcp->batch pages if count here > pcp->batch? Since pcp->batch > has an upper limit, we won't need to worry prefetching too much. The following patch adds an upper limit on prefetching, the upper limit is set to pcp->batch since 1) it is the most likely value of input param 'count' in free_pcppages_bulk() and 2) it has an upper limit itself. From: Aaron Lu Subject: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock When a page is freed back to the global pool, its buddy will be checked to see if it's possible to do a merge. This requires accessing buddy's page structure and that access could take a long time if it's cache cold. This patch adds a prefetch to the to-be-freed page's buddy outside of zone->lock in hope of accessing buddy's page structure later under zone->lock will be faster. Since we *always* do buddy merging and check an order-0 page's buddy to try to merge it when it goes into the main allocator, the cacheline will always come in, i.e. the prefetched data will never be unused. Normally, the number of to-be-freed pages(i.e. count) equals to pcp->batch (default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on x86_64) but in the case of pcp's pages getting all drained, it will be pcp->count which has an upper limit of pcp->high. pcp->high, although has a default value of 186 (pcp->batch=31 * 6), can be changed by user through /proc/sys/vm/percpu_pagelist_fraction and there is no software upper limit so could be large, like several thousand. For this reason, only the last pcp->batch number of page's buddy structure is prefetched to avoid excessive prefetching. pcp-batch is used because: 1 most often, count == pcp->batch; 2 it has an upper limit itself so we won't prefetch excessively. Considering the possible large value of pcp->high, it also makes sense to free the last added page first for cache hot's reason. That's where the change of list_add_tail() to list_add() comes in as we will free them from head to tail one by one. In the meantime, there are two concerns: 1 the prefetch could potentially evict existing cachelines, especially for L1D cache since it is not huge; 2 there is some additional instruction overhead, namely calculating buddy pfn twice. For 1, it's hard to say, this microbenchmark though shows good result but the actual benefit of this patch will be workload/CPU dependant; For 2, since the calculation is a XOR on two local variables, it's expected in many cases that cycles spent will be offset by reduced memory latency later. This is especially true for NUMA machines where multiple CPUs are contending on zone->lock and the most time consuming part under zone->lock is the wait of 'struct page' cacheline of the to-be-freed pages and their buddies. Test with will-it-scale/page_fault1 full load: kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) v4.16-rc2+ 9034215 7971818 13667135 15677465 patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9% Note: this patch's performance improvement percent is against patch2/3. (Changelog stolen from Dave Hansen and Mel Gorman's comments at http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com) Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com Signed-off-by: Aaron Lu Suggested-by: Ying Huang Reviewed-by: Andrew Morton --- mm/page_alloc.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dafdcdec9c1f..5f31f7bab583 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count, batch_free = count; do { + unsigned long pfn, buddy_pfn; + struct page *buddy; + page = list_last_entry(list, struct page, lru); /* must delete to avoid corrupting pcp list */ list_del(&page->lru); @@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count, if (bulkfree_pcp_prepare(page)) continue; - list_add_tail(&page->lru, &head); + list_add(&page->lru, &head); + + /* + * We are going to put the page back to the global + * pool, prefetch its buddy to speed up later access + * under zone->lock. It is believed the overhead of + * an additional test and calculating buddy_pfn here + * can be offset by reduced memory latency later. To + * avoid excessive prefetching due to large count, only + * prefetch buddy for the last pcp->batch nr of pages. + */ + if (count > pcp->batch) + continue; + pfn = page_to_pfn(page); + buddy_pfn = __find_buddy_pfn(pfn, 0); + buddy = page + (buddy_pfn - pfn); + prefetch(buddy); } while (--count && --batch_free && !list_empty(list)); } -- 2.14.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 643386B0005 for ; Fri, 9 Mar 2018 16:58:49 -0500 (EST) Received: by mail-wr0-f198.google.com with SMTP id h33so5799392wrh.10 for ; Fri, 09 Mar 2018 13:58:49 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id g6si1336805wra.310.2018.03.09.13.58.47 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 09 Mar 2018 13:58:47 -0800 (PST) Date: Fri, 9 Mar 2018 13:58:32 -0800 From: Andrew Morton Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-Id: <20180309135832.988ab6d3d986658d531a79ef@linux-foundation.org> In-Reply-To: <20180309082431.GB30868@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes > > When a page is freed back to the global pool, its buddy will be checked > to see if it's possible to do a merge. This requires accessing buddy's > page structure and that access could take a long time if it's cache cold. > > This patch adds a prefetch to the to-be-freed page's buddy outside of > zone->lock in hope of accessing buddy's page structure later under > zone->lock will be faster. Since we *always* do buddy merging and check > an order-0 page's buddy to try to merge it when it goes into the main > allocator, the cacheline will always come in, i.e. the prefetched data > will never be unused. > > Normally, the number of to-be-freed pages(i.e. count) equals to > pcp->batch (default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on > x86_64) but in the case of pcp's pages getting all drained, it will be > pcp->count which has an upper limit of pcp->high. pcp->high, although > has a default value of 186 (pcp->batch=31 * 6), can be changed by user > through /proc/sys/vm/percpu_pagelist_fraction and there is no software > upper limit so could be large, like several thousand. For this reason, > only the last pcp->batch number of page's buddy structure is prefetched > to avoid excessive prefetching. pcp-batch is used because: > 1 most often, count == pcp->batch; > 2 it has an upper limit itself so we won't prefetch excessively. > > Considering the possible large value of pcp->high, it also makes > sense to free the last added page first for cache hot's reason. > That's where the change of list_add_tail() to list_add() comes in > as we will free them from head to tail one by one. > > In the meantime, there are two concerns: > 1 the prefetch could potentially evict existing cachelines, especially > for L1D cache since it is not huge; > 2 there is some additional instruction overhead, namely calculating > buddy pfn twice. > > For 1, it's hard to say, this microbenchmark though shows good result but > the actual benefit of this patch will be workload/CPU dependant; > For 2, since the calculation is a XOR on two local variables, it's expected > in many cases that cycles spent will be offset by reduced memory latency > later. This is especially true for NUMA machines where multiple CPUs are > contending on zone->lock and the most time consuming part under zone->lock > is the wait of 'struct page' cacheline of the to-be-freed pages and their > buddies. > > Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9% > Note: this patch's performance improvement percent is against patch2/3. > > (Changelog stolen from Dave Hansen and Mel Gorman's comments at > http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com) > > Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com > > ... > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count, > batch_free = count; > > do { > + unsigned long pfn, buddy_pfn; > + struct page *buddy; > + > page = list_last_entry(list, struct page, lru); > /* must delete to avoid corrupting pcp list */ > list_del(&page->lru); > @@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count, > if (bulkfree_pcp_prepare(page)) > continue; > > - list_add_tail(&page->lru, &head); > + list_add(&page->lru, &head); The result here will be that free_pcppages_bulk() frees the pages in the reverse order? I don't immediately see a downside to that. In the (distant) past we had issues when successive alloc_page() calls would return pages in descending address order - that totally screwed up scatter-gather page merging. But this is the page-freeing path. Still, something to be thought about and monitored. > + > + /* > + * We are going to put the page back to the global > + * pool, prefetch its buddy to speed up later access > + * under zone->lock. It is believed the overhead of > + * an additional test and calculating buddy_pfn here > + * can be offset by reduced memory latency later. To > + * avoid excessive prefetching due to large count, only > + * prefetch buddy for the last pcp->batch nr of pages. > + */ > + if (count > pcp->batch) > + continue; > + pfn = page_to_pfn(page); > + buddy_pfn = __find_buddy_pfn(pfn, 0); > + buddy = page + (buddy_pfn - pfn); > + prefetch(buddy); > } while (--count && --batch_free && !list_empty(list)); This loop hurts my brain, mainly the handling of `count': while (count) { do { batch_free++; } while (list_empty(list)); /* This is the only non-empty list. Free them all. */ if (batch_free == MIGRATE_PCPTYPES) batch_free = count; do { } while (--count && --batch_free && !list_empty(list)); } I guess it kinda makes sense - both loops terminate on count==0. But still. Can it be clarified? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl0-f70.google.com (mail-pl0-f70.google.com [209.85.160.70]) by kanga.kvack.org (Postfix) with ESMTP id 6E07B6B0007 for ; Sat, 10 Mar 2018 09:45:38 -0500 (EST) Received: by mail-pl0-f70.google.com with SMTP id 4-v6so5921792plb.1 for ; Sat, 10 Mar 2018 06:45:38 -0800 (PST) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTPS id y66si2776061pff.331.2018.03.10.06.45.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 10 Mar 2018 06:45:36 -0800 (PST) Date: Sat, 10 Mar 2018 22:46:27 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180310144627.GA12254@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> <20180309135832.988ab6d3d986658d531a79ef@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180309135832.988ab6d3d986658d531a79ef@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Fri, Mar 09, 2018 at 01:58:32PM -0800, Andrew Morton wrote: > > > > When a page is freed back to the global pool, its buddy will be checked > > to see if it's possible to do a merge. This requires accessing buddy's > > page structure and that access could take a long time if it's cache cold. > > > > This patch adds a prefetch to the to-be-freed page's buddy outside of > > zone->lock in hope of accessing buddy's page structure later under > > zone->lock will be faster. Since we *always* do buddy merging and check > > an order-0 page's buddy to try to merge it when it goes into the main > > allocator, the cacheline will always come in, i.e. the prefetched data > > will never be unused. > > > > Normally, the number of to-be-freed pages(i.e. count) equals to > > pcp->batch (default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on > > x86_64) but in the case of pcp's pages getting all drained, it will be > > pcp->count which has an upper limit of pcp->high. pcp->high, although > > has a default value of 186 (pcp->batch=31 * 6), can be changed by user > > through /proc/sys/vm/percpu_pagelist_fraction and there is no software > > upper limit so could be large, like several thousand. For this reason, > > only the last pcp->batch number of page's buddy structure is prefetched > > to avoid excessive prefetching. pcp-batch is used because: > > 1 most often, count == pcp->batch; > > 2 it has an upper limit itself so we won't prefetch excessively. > > > > Considering the possible large value of pcp->high, it also makes > > sense to free the last added page first for cache hot's reason. > > That's where the change of list_add_tail() to list_add() comes in > > as we will free them from head to tail one by one. > > > > In the meantime, there are two concerns: > > 1 the prefetch could potentially evict existing cachelines, especially > > for L1D cache since it is not huge; > > 2 there is some additional instruction overhead, namely calculating > > buddy pfn twice. > > > > For 1, it's hard to say, this microbenchmark though shows good result but > > the actual benefit of this patch will be workload/CPU dependant; > > For 2, since the calculation is a XOR on two local variables, it's expected > > in many cases that cycles spent will be offset by reduced memory latency > > later. This is especially true for NUMA machines where multiple CPUs are > > contending on zone->lock and the most time consuming part under zone->lock > > is the wait of 'struct page' cacheline of the to-be-freed pages and their > > buddies. > > > > Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9% > > Note: this patch's performance improvement percent is against patch2/3. > > > > (Changelog stolen from Dave Hansen and Mel Gorman's comments at > > http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com) > > > > Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com > > > > ... > > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count, > > batch_free = count; > > > > do { > > + unsigned long pfn, buddy_pfn; > > + struct page *buddy; > > + > > page = list_last_entry(list, struct page, lru); > > /* must delete to avoid corrupting pcp list */ > > list_del(&page->lru); > > @@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count, > > if (bulkfree_pcp_prepare(page)) > > continue; > > > > - list_add_tail(&page->lru, &head); > > + list_add(&page->lru, &head); > > The result here will be that free_pcppages_bulk() frees the pages in > the reverse order? Yes, so that the last touched page will be freed first as the list of pages will be freed from head to tail later. This change is for the case when count is large(which is a rare case): since the last touched pages and their buddies are more likely to be still cache hot when later these pages are freed under lock, it seems natural to free them first. We can revert this change if it causes any trouble without affecting performance much as count is not a large value most of the time. > > I don't immediately see a downside to that. In the (distant) past we > had issues when successive alloc_page() calls would return pages in > descending address order - that totally screwed up scatter-gather page > merging. But this is the page-freeing path. Still, something to be > thought about and monitored. OK. > > > + > > + /* > > + * We are going to put the page back to the global > > + * pool, prefetch its buddy to speed up later access > > + * under zone->lock. It is believed the overhead of > > + * an additional test and calculating buddy_pfn here > > + * can be offset by reduced memory latency later. To > > + * avoid excessive prefetching due to large count, only > > + * prefetch buddy for the last pcp->batch nr of pages. > > + */ > > + if (count > pcp->batch) > > + continue; > > + pfn = page_to_pfn(page); > > + buddy_pfn = __find_buddy_pfn(pfn, 0); > > + buddy = page + (buddy_pfn - pfn); > > + prefetch(buddy); > > } while (--count && --batch_free && !list_empty(list)); > > This loop hurts my brain, mainly the handling of `count': > > while (count) { > do { > batch_free++; > } while (list_empty(list)); > > /* This is the only non-empty list. Free them all. */ > if (batch_free == MIGRATE_PCPTYPES) > batch_free = count; > > do { > } while (--count && --batch_free && !list_empty(list)); > } > > I guess it kinda makes sense - both loops terminate on count==0. But > still. Can it be clarified? That's right, count == 0 is the final termination condition. count is decremented when a page is to be freed so: if (count > pcp->batch) continue; prefetch(); means to only prefetch for the last pcp->batch pages. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 970AB6B0006 for ; Mon, 12 Mar 2018 09:22:34 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id j12so5730613pff.18 for ; Mon, 12 Mar 2018 06:22:34 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 3-v6si821232plv.323.2018.03.12.06.22.33 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 12 Mar 2018 06:22:33 -0700 (PDT) Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-2-aaron.lu@intel.com> From: Vlastimil Babka Message-ID: Date: Mon, 12 Mar 2018 14:22:28 +0100 MIME-Version: 1.0 In-Reply-To: <20180301062845.26038-2-aaron.lu@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , Matthew Wilcox , David Rientjes On 03/01/2018 07:28 AM, Aaron Lu wrote: > Matthew Wilcox found that all callers of free_pcppages_bulk() currently > update pcp->count immediately after so it's natural to do it inside > free_pcppages_bulk(). > > No functionality or performance change is expected from this patch. Well, it's N decrements instead of one decrement by N / assignment of zero. But I assume the difference is negligible anyway, right? > Suggested-by: Matthew Wilcox > Signed-off-by: Aaron Lu Acked-by: Vlastimil Babka > --- > mm/page_alloc.c | 10 +++------- > 1 file changed, 3 insertions(+), 7 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index cb416723538f..faa33eac1635 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, > page = list_last_entry(list, struct page, lru); > /* must delete as __free_one_page list manipulates */ > list_del(&page->lru); > + pcp->count--; > > mt = get_pcppage_migratetype(page); > /* MIGRATE_ISOLATE page should not go to pcplists */ > @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) > local_irq_save(flags); > batch = READ_ONCE(pcp->batch); > to_drain = min(pcp->count, batch); > - if (to_drain > 0) { > + if (to_drain > 0) > free_pcppages_bulk(zone, to_drain, pcp); > - pcp->count -= to_drain; > - } > local_irq_restore(flags); > } > #endif > @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) > pset = per_cpu_ptr(zone->pageset, cpu); > > pcp = &pset->pcp; > - if (pcp->count) { > + if (pcp->count) > free_pcppages_bulk(zone, pcp->count, pcp); > - pcp->count = 0; > - } > local_irq_restore(flags); > } > > @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) > if (pcp->count >= pcp->high) { > unsigned long batch = READ_ONCE(pcp->batch); > free_pcppages_bulk(zone, batch, pcp); > - pcp->count -= batch; > } > } > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 32D746B000C for ; Mon, 12 Mar 2018 10:22:58 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id d12so9449284wri.4 for ; Mon, 12 Mar 2018 07:22:58 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id r9si3317909wme.195.2018.03.12.07.22.55 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 12 Mar 2018 07:22:56 -0700 (PDT) Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> From: Vlastimil Babka Message-ID: <9cad642d-9fe5-b2c3-456c-279065c32337@suse.cz> Date: Mon, 12 Mar 2018 15:22:53 +0100 MIME-Version: 1.0 In-Reply-To: <20180301062845.26038-3-aaron.lu@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , Matthew Wilcox , David Rientjes On 03/01/2018 07:28 AM, Aaron Lu wrote: > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > the zone->lock is held and then pages are chosen from PCP's migratetype > list. While there is actually no need to do this 'choose part' under > lock since it's PCP pages, the only CPU that can touch them is us and > irq is also disabled. > > Moving this part outside could reduce lock held time and improve > performance. Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > What the test does is: starts $nr_cpu processes and each will repeatedly > do the following for 5 minutes: > 1 mmap 128M anonymouse space; > 2 write access to that space; > 3 munmap. > The score is the aggregated iteration. > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c > > Acked-by: Mel Gorman > Signed-off-by: Aaron Lu > --- > mm/page_alloc.c | 39 +++++++++++++++++++++++---------------- > 1 file changed, 23 insertions(+), 16 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index faa33eac1635..dafdcdec9c1f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count, > int migratetype = 0; > int batch_free = 0; > bool isolated_pageblocks; > - > - spin_lock(&zone->lock); > - isolated_pageblocks = has_isolate_pageblock(zone); > + struct page *page, *tmp; > + LIST_HEAD(head); > > while (count) { > - struct page *page; > struct list_head *list; > > /* > @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count, > batch_free = count; > > do { > - int mt; /* migratetype of the to-be-freed page */ > - > page = list_last_entry(list, struct page, lru); > - /* must delete as __free_one_page list manipulates */ > + /* must delete to avoid corrupting pcp list */ > list_del(&page->lru); Well, since bulkfree_pcp_prepare() doesn't care about page->lru, you could maybe use list_move_tail() instead of list_del() + list_add_tail()? That avoids temporarily writing poison values. Hm actually, you are reversing the list in the process, because page is obtained by list_last_entry and you use list_add_tail. That could have unintended performance consequences? Also maybe list_cut_position() could be faster than shuffling pages one by one? I guess not really, because batch_free will be generally low? > pcp->count--; > > - mt = get_pcppage_migratetype(page); > - /* MIGRATE_ISOLATE page should not go to pcplists */ > - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > - /* Pageblock could have been isolated meanwhile */ > - if (unlikely(isolated_pageblocks)) > - mt = get_pageblock_migratetype(page); > - > if (bulkfree_pcp_prepare(page)) > continue; > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt); > - trace_mm_page_pcpu_drain(page, 0, mt); > + list_add_tail(&page->lru, &head); > } while (--count && --batch_free && !list_empty(list)); > } > + > + spin_lock(&zone->lock); > + isolated_pageblocks = has_isolate_pageblock(zone); > + > + /* > + * Use safe version since after __free_one_page(), > + * page->lru.next will not point to original list. > + */ > + list_for_each_entry_safe(page, tmp, &head, lru) { > + int mt = get_pcppage_migratetype(page); > + /* MIGRATE_ISOLATE page should not go to pcplists */ > + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > + /* Pageblock could have been isolated meanwhile */ > + if (unlikely(isolated_pageblocks)) > + mt = get_pageblock_migratetype(page); > + > + __free_one_page(page, page_to_pfn(page), zone, 0, mt); > + trace_mm_page_pcpu_drain(page, 0, mt); > + } > spin_unlock(&zone->lock); > } > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 784CB6B0006 for ; Mon, 12 Mar 2018 11:05:52 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id y145so4325977wmd.4 for ; Mon, 12 Mar 2018 08:05:52 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 19si3396467wmq.208.2018.03.12.08.05.50 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 12 Mar 2018 08:05:50 -0700 (PDT) Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> <20180309135832.988ab6d3d986658d531a79ef@linux-foundation.org> From: Vlastimil Babka Message-ID: <8818156a-f4a9-ac8a-7179-e0b5e4225e38@suse.cz> Date: Mon, 12 Mar 2018 16:05:47 +0100 MIME-Version: 1.0 In-Reply-To: <20180309135832.988ab6d3d986658d531a79ef@linux-foundation.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , Matthew Wilcox , David Rientjes On 03/09/2018 10:58 PM, Andrew Morton wrote: >> >> When a page is freed back to the global pool, its buddy will be checked >> to see if it's possible to do a merge. This requires accessing buddy's >> page structure and that access could take a long time if it's cache cold. >> >> This patch adds a prefetch to the to-be-freed page's buddy outside of >> zone->lock in hope of accessing buddy's page structure later under >> zone->lock will be faster. Since we *always* do buddy merging and check >> an order-0 page's buddy to try to merge it when it goes into the main >> allocator, the cacheline will always come in, i.e. the prefetched data >> will never be unused. >> >> Normally, the number of to-be-freed pages(i.e. count) equals to >> pcp->batch (default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on >> x86_64) but in the case of pcp's pages getting all drained, it will be >> pcp->count which has an upper limit of pcp->high. pcp->high, although >> has a default value of 186 (pcp->batch=31 * 6), can be changed by user >> through /proc/sys/vm/percpu_pagelist_fraction and there is no software >> upper limit so could be large, like several thousand. For this reason, >> only the last pcp->batch number of page's buddy structure is prefetched >> to avoid excessive prefetching. pcp-batch is used because: >> 1 most often, count == pcp->batch; >> 2 it has an upper limit itself so we won't prefetch excessively. >> >> Considering the possible large value of pcp->high, it also makes >> sense to free the last added page first for cache hot's reason. >> That's where the change of list_add_tail() to list_add() comes in >> as we will free them from head to tail one by one. >> >> In the meantime, there are two concerns: >> 1 the prefetch could potentially evict existing cachelines, especially >> for L1D cache since it is not huge; >> 2 there is some additional instruction overhead, namely calculating >> buddy pfn twice. >> >> For 1, it's hard to say, this microbenchmark though shows good result but >> the actual benefit of this patch will be workload/CPU dependant; >> For 2, since the calculation is a XOR on two local variables, it's expected >> in many cases that cycles spent will be offset by reduced memory latency >> later. This is especially true for NUMA machines where multiple CPUs are >> contending on zone->lock and the most time consuming part under zone->lock >> is the wait of 'struct page' cacheline of the to-be-freed pages and their >> buddies. >> >> Test with will-it-scale/page_fault1 full load: >> >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) >> v4.16-rc2+ 9034215 7971818 13667135 15677465 >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% >> this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9% >> Note: this patch's performance improvement percent is against patch2/3. >> >> (Changelog stolen from Dave Hansen and Mel Gorman's comments at >> http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com) >> >> Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com >> >> ... >> >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count, >> batch_free = count; >> >> do { >> + unsigned long pfn, buddy_pfn; >> + struct page *buddy; >> + >> page = list_last_entry(list, struct page, lru); >> /* must delete to avoid corrupting pcp list */ >> list_del(&page->lru); >> @@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count, >> if (bulkfree_pcp_prepare(page)) >> continue; >> >> - list_add_tail(&page->lru, &head); >> + list_add(&page->lru, &head); > > The result here will be that free_pcppages_bulk() frees the pages in > the reverse order? I actually think it restores the order, compared to the previous version (see my earlier reply). > I don't immediately see a downside to that. In the (distant) past we > had issues when successive alloc_page() calls would return pages in > descending address order - that totally screwed up scatter-gather page > merging. But this is the page-freeing path. Still, something to be > thought about and monitored. > >> + >> + /* >> + * We are going to put the page back to the global >> + * pool, prefetch its buddy to speed up later access >> + * under zone->lock. It is believed the overhead of >> + * an additional test and calculating buddy_pfn here >> + * can be offset by reduced memory latency later. To >> + * avoid excessive prefetching due to large count, only >> + * prefetch buddy for the last pcp->batch nr of pages. >> + */ >> + if (count > pcp->batch) >> + continue; You could also go to the locked part after pcp->batch pages and then return back, but maybe let's not complicate it further for corner cases :) >> + pfn = page_to_pfn(page); >> + buddy_pfn = __find_buddy_pfn(pfn, 0); >> + buddy = page + (buddy_pfn - pfn); >> + prefetch(buddy); >> } while (--count && --batch_free && !list_empty(list)); > > This loop hurts my brain, mainly the handling of `count': > > while (count) { > do { > batch_free++; > } while (list_empty(list)); > > /* This is the only non-empty list. Free them all. */ > if (batch_free == MIGRATE_PCPTYPES) > batch_free = count; > > do { > } while (--count && --batch_free && !list_empty(list)); > } > > I guess it kinda makes sense - both loops terminate on count==0. But > still. Can it be clarified? Yeah this is rather far from straightforward :( From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl0-f69.google.com (mail-pl0-f69.google.com [209.85.160.69]) by kanga.kvack.org (Postfix) with ESMTP id 318C26B000E for ; Mon, 12 Mar 2018 13:32:40 -0400 (EDT) Received: by mail-pl0-f69.google.com with SMTP id b4-v6so2676263plx.20 for ; Mon, 12 Mar 2018 10:32:40 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTPS id h2-v6si6437747pls.270.2018.03.12.10.32.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 12 Mar 2018 10:32:38 -0700 (PDT) Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> From: Dave Hansen Message-ID: <988ce376-bdc4-0989-5133-612bfa3f7c45@intel.com> Date: Mon, 12 Mar 2018 10:32:32 -0700 MIME-Version: 1.0 In-Reply-To: <20180309082431.GB30868@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu , Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On 03/09/2018 12:24 AM, Aaron Lu wrote: > + /* > + * We are going to put the page back to the global > + * pool, prefetch its buddy to speed up later access > + * under zone->lock. It is believed the overhead of > + * an additional test and calculating buddy_pfn here > + * can be offset by reduced memory latency later. To > + * avoid excessive prefetching due to large count, only > + * prefetch buddy for the last pcp->batch nr of pages. > + */ > + if (count > pcp->batch) > + continue; > + pfn = page_to_pfn(page); > + buddy_pfn = __find_buddy_pfn(pfn, 0); > + buddy = page + (buddy_pfn - pfn); > + prefetch(buddy); FWIW, I think this needs to go into a helper function. Is that possible? There's too much logic happening here. Also, 'count' going from batch_size->0 is totally non-obvious from the patch context. It makes this hunk look totally wrong by itself. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id C5F146B0007 for ; Mon, 12 Mar 2018 22:10:38 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id r1so7469792pgq.7 for ; Mon, 12 Mar 2018 19:10:38 -0700 (PDT) Received: from mga06.intel.com (mga06.intel.com. [134.134.136.31]) by mx.google.com with ESMTPS id s14-v6si4434411plq.674.2018.03.12.19.10.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 12 Mar 2018 19:10:37 -0700 (PDT) Date: Tue, 13 Mar 2018 10:11:41 +0800 From: Aaron Lu Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside Message-ID: <20180313021141.GA13782@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-2-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , Matthew Wilcox , David Rientjes On Mon, Mar 12, 2018 at 02:22:28PM +0100, Vlastimil Babka wrote: > On 03/01/2018 07:28 AM, Aaron Lu wrote: > > Matthew Wilcox found that all callers of free_pcppages_bulk() currently > > update pcp->count immediately after so it's natural to do it inside > > free_pcppages_bulk(). > > > > No functionality or performance change is expected from this patch. > > Well, it's N decrements instead of one decrement by N / assignment of > zero. But I assume the difference is negligible anyway, right? Yes. > > > Suggested-by: Matthew Wilcox > > Signed-off-by: Aaron Lu > > Acked-by: Vlastimil Babka Thanks! > > --- > > mm/page_alloc.c | 10 +++------- > > 1 file changed, 3 insertions(+), 7 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index cb416723538f..faa33eac1635 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, > > page = list_last_entry(list, struct page, lru); > > /* must delete as __free_one_page list manipulates */ > > list_del(&page->lru); > > + pcp->count--; > > > > mt = get_pcppage_migratetype(page); > > /* MIGRATE_ISOLATE page should not go to pcplists */ > > @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) > > local_irq_save(flags); > > batch = READ_ONCE(pcp->batch); > > to_drain = min(pcp->count, batch); > > - if (to_drain > 0) { > > + if (to_drain > 0) > > free_pcppages_bulk(zone, to_drain, pcp); > > - pcp->count -= to_drain; > > - } > > local_irq_restore(flags); > > } > > #endif > > @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) > > pset = per_cpu_ptr(zone->pageset, cpu); > > > > pcp = &pset->pcp; > > - if (pcp->count) { > > + if (pcp->count) > > free_pcppages_bulk(zone, pcp->count, pcp); > > - pcp->count = 0; > > - } > > local_irq_restore(flags); > > } > > > > @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) > > if (pcp->count >= pcp->high) { > > unsigned long batch = READ_ONCE(pcp->batch); > > free_pcppages_bulk(zone, batch, pcp); > > - pcp->count -= batch; > > } > > } > > > > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id D57606B0006 for ; Mon, 12 Mar 2018 23:33:50 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id o19so7537680pgn.12 for ; Mon, 12 Mar 2018 20:33:50 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTPS id n10si6003700pge.342.2018.03.12.20.33.49 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 12 Mar 2018 20:33:49 -0700 (PDT) Date: Tue, 13 Mar 2018 11:34:53 +0800 From: Aaron Lu Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180313033453.GB13782@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <9cad642d-9fe5-b2c3-456c-279065c32337@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <9cad642d-9fe5-b2c3-456c-279065c32337@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , Matthew Wilcox , David Rientjes On Mon, Mar 12, 2018 at 03:22:53PM +0100, Vlastimil Babka wrote: > On 03/01/2018 07:28 AM, Aaron Lu wrote: > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > the zone->lock is held and then pages are chosen from PCP's migratetype > > list. While there is actually no need to do this 'choose part' under > > lock since it's PCP pages, the only CPU that can touch them is us and > > irq is also disabled. > > > > Moving this part outside could reduce lock held time and improve > > performance. Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > > > What the test does is: starts $nr_cpu processes and each will repeatedly > > do the following for 5 minutes: > > 1 mmap 128M anonymouse space; > > 2 write access to that space; > > 3 munmap. > > The score is the aggregated iteration. > > > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c > > > > Acked-by: Mel Gorman > > Signed-off-by: Aaron Lu > > --- > > mm/page_alloc.c | 39 +++++++++++++++++++++++---------------- > > 1 file changed, 23 insertions(+), 16 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index faa33eac1635..dafdcdec9c1f 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count, > > int migratetype = 0; > > int batch_free = 0; > > bool isolated_pageblocks; > > - > > - spin_lock(&zone->lock); > > - isolated_pageblocks = has_isolate_pageblock(zone); > > + struct page *page, *tmp; > > + LIST_HEAD(head); > > > > while (count) { > > - struct page *page; > > struct list_head *list; > > > > /* > > @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count, > > batch_free = count; > > > > do { > > - int mt; /* migratetype of the to-be-freed page */ > > - > > page = list_last_entry(list, struct page, lru); > > - /* must delete as __free_one_page list manipulates */ > > + /* must delete to avoid corrupting pcp list */ > > list_del(&page->lru); > > Well, since bulkfree_pcp_prepare() doesn't care about page->lru, you > could maybe use list_move_tail() instead of list_del() + > list_add_tail()? That avoids temporarily writing poison values. Good point, except bulkfree_pcp_prepare() could return error and then the page will need to be removed from the to-be-freed list, like this: do { page = list_last_entry(list, struct page, lru); list_move_tail(&page->lru, &head); pcp->count--; if (bulkfree_pcp_prepare(page)) list_del(&page->lru); } while (--count && --batch_free && !list_empty(list)); Considering bulkfree_pcp_prepare() returning error is the rare case, this list_del() should rarely happen. At the same time, this part is outside of zone->lock and can hardly impact performance...so I'm not sure. > Hm actually, you are reversing the list in the process, because page is > obtained by list_last_entry and you use list_add_tail. That could have > unintended performance consequences? True the order is changed when these to-be-freed pages are in this temporary list, but then they are iterated and freed one by one from head to tail so the order they landed in free_list is the same as before the patch(also the same as they are in pcp list). > > Also maybe list_cut_position() could be faster than shuffling pages one > by one? I guess not really, because batch_free will be generally low? We will need to know where to cut if list_cut_position() is to be used and to find that out, the list will need to be iterated first. I guess that's too much trouble. Since this part of code is per-cpu(outside of zone->lock) and these pages are in pcp(meaning their cachelines are not likely in remote), I didn't worry too much about not being able to list_cut_position() but iterate. On allocation side though, when manipulating the global free_list under zone->lock, this is a big problem since pages there are freed from different CPUs and the cache could be cold for the allocating CPU. That is why I'm proposing clusted allocation sometime ago as an RFC patch where list_cut_position() is so good that it could eliminate the cacheline miss issue since we do not need to iterate cold pages one by one. I wish there is a data structure that has the flexibility of list while at the same time we can locate the Nth element in the list without the need to iterate. That's what I'm looking for when developing clustered allocation for order 0 pages. In the end, I had to use another place to record where the Nth element is. I hope to send out v2 of that RFC series soon but I'm still collecting data for it. I would appreciate if people could take a look then :-) batch_free's value depends on what the system is doing. When user application is making use of memory, the common case is, only migratetype of MIGRATE_MOVABLE has pages to free and then batch_free will be 1 in the first round and (pcp->batch-1) in the 2nd round. Here is some data I collected recently on how often only MIGRATE_MOVABLE list has pages to free in free_pcppages_bulk(): On my desktop, after boot: free_pcppages_bulk: 6268 single_mt_movable: 2566 (41%) free_pcppages_bulk means the number of time this function gets called. single_mt_movable means number of times when only MIGRATE_MOVABLE list has pages to free. After kbuild with a distro kconfig: free_pcppages_bulk: 9100508 single_mt_movable: 8435483 (92.75%) If we change the initial value of migratetype in free_pcppages_bulk() from 0(MIGRATE_UNMOVABLE) to 1(MIGRATE_MOVABLE), then batch_free will be pcp->batch in the 1st round and we can save something but the saving is negligible when running a workload so I didn't send a patch for it yet. > > pcp->count--; > > > > - mt = get_pcppage_migratetype(page); > > - /* MIGRATE_ISOLATE page should not go to pcplists */ > > - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > > - /* Pageblock could have been isolated meanwhile */ > > - if (unlikely(isolated_pageblocks)) > > - mt = get_pageblock_migratetype(page); > > - > > if (bulkfree_pcp_prepare(page)) > > continue; > > > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt); > > - trace_mm_page_pcpu_drain(page, 0, mt); > > + list_add_tail(&page->lru, &head); > > } while (--count && --batch_free && !list_empty(list)); > > } > > + > > + spin_lock(&zone->lock); > > + isolated_pageblocks = has_isolate_pageblock(zone); > > + > > + /* > > + * Use safe version since after __free_one_page(), > > + * page->lru.next will not point to original list. > > + */ > > + list_for_each_entry_safe(page, tmp, &head, lru) { > > + int mt = get_pcppage_migratetype(page); > > + /* MIGRATE_ISOLATE page should not go to pcplists */ > > + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > > + /* Pageblock could have been isolated meanwhile */ > > + if (unlikely(isolated_pageblocks)) > > + mt = get_pageblock_migratetype(page); > > + > > + __free_one_page(page, page_to_pfn(page), zone, 0, mt); > > + trace_mm_page_pcpu_drain(page, 0, mt); > > + } > > spin_unlock(&zone->lock); > > } > > > > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl0-f71.google.com (mail-pl0-f71.google.com [209.85.160.71]) by kanga.kvack.org (Postfix) with ESMTP id D7D2D6B0009 for ; Mon, 12 Mar 2018 23:34:16 -0400 (EDT) Received: by mail-pl0-f71.google.com with SMTP id o61-v6so9578194pld.5 for ; Mon, 12 Mar 2018 20:34:16 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTPS id k17si6554549pff.157.2018.03.12.20.34.15 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 12 Mar 2018 20:34:15 -0700 (PDT) Date: Tue, 13 Mar 2018 11:35:19 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180313033519.GC13782@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> <988ce376-bdc4-0989-5133-612bfa3f7c45@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <988ce376-bdc4-0989-5133-612bfa3f7c45@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Mon, Mar 12, 2018 at 10:32:32AM -0700, Dave Hansen wrote: > On 03/09/2018 12:24 AM, Aaron Lu wrote: > > + /* > > + * We are going to put the page back to the global > > + * pool, prefetch its buddy to speed up later access > > + * under zone->lock. It is believed the overhead of > > + * an additional test and calculating buddy_pfn here > > + * can be offset by reduced memory latency later. To > > + * avoid excessive prefetching due to large count, only > > + * prefetch buddy for the last pcp->batch nr of pages. > > + */ > > + if (count > pcp->batch) > > + continue; > > + pfn = page_to_pfn(page); > > + buddy_pfn = __find_buddy_pfn(pfn, 0); > > + buddy = page + (buddy_pfn - pfn); > > + prefetch(buddy); > > FWIW, I think this needs to go into a helper function. Is that possible? I'll give it a try. > > There's too much logic happening here. Also, 'count' going from > batch_size->0 is totally non-obvious from the patch context. It makes > this hunk look totally wrong by itself. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id A6F916B0005 for ; Tue, 13 Mar 2018 03:03:00 -0400 (EDT) Received: by mail-pf0-f197.google.com with SMTP id g66so6985184pfj.11 for ; Tue, 13 Mar 2018 00:03:00 -0700 (PDT) Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTPS id e15si3063478pfl.284.2018.03.13.00.02.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 13 Mar 2018 00:02:59 -0700 (PDT) Date: Tue, 13 Mar 2018 15:04:04 +0800 From: Aaron Lu Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180313070404.GA7501@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> <988ce376-bdc4-0989-5133-612bfa3f7c45@intel.com> <20180313033519.GC13782@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180313033519.GC13782@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes On Tue, Mar 13, 2018 at 11:35:19AM +0800, Aaron Lu wrote: > On Mon, Mar 12, 2018 at 10:32:32AM -0700, Dave Hansen wrote: > > On 03/09/2018 12:24 AM, Aaron Lu wrote: > > > + /* > > > + * We are going to put the page back to the global > > > + * pool, prefetch its buddy to speed up later access > > > + * under zone->lock. It is believed the overhead of > > > + * an additional test and calculating buddy_pfn here > > > + * can be offset by reduced memory latency later. To > > > + * avoid excessive prefetching due to large count, only > > > + * prefetch buddy for the last pcp->batch nr of pages. > > > + */ > > > + if (count > pcp->batch) > > > + continue; > > > + pfn = page_to_pfn(page); > > > + buddy_pfn = __find_buddy_pfn(pfn, 0); > > > + buddy = page + (buddy_pfn - pfn); > > > + prefetch(buddy); > > > > FWIW, I think this needs to go into a helper function. Is that possible? > > I'll give it a try. > > > > > There's too much logic happening here. Also, 'count' going from > > batch_size->0 is totally non-obvious from the patch context. It makes > > this hunk look totally wrong by itself. I tried to avoid adding one more local variable but looks like it caused a lot of pain. What about the following? It doesn't use count any more but prefetch_nr to indicate how many prefetches have happened. Also, I think it's not worth the risk of disordering pages in free_list by changing list_add_tail() to list_add() as Andrew reminded so I dropped that change too. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dafdcdec9c1f..00ea4483f679 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1099,6 +1099,15 @@ static bool bulkfree_pcp_prepare(struct page *page) } #endif /* CONFIG_DEBUG_VM */ +static inline void prefetch_buddy(struct page *page) +{ + unsigned long pfn = page_to_pfn(page); + unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0); + struct page *buddy = page + (buddy_pfn - pfn); + + prefetch(buddy); +} + /* * Frees a number of pages from the PCP lists * Assumes all pages on list are in same zone, and of same order. @@ -1115,6 +1124,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, { int migratetype = 0; int batch_free = 0; + int prefetch_nr = 0; bool isolated_pageblocks; struct page *page, *tmp; LIST_HEAD(head); @@ -1150,6 +1160,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, continue; list_add_tail(&page->lru, &head); + + /* + * We are going to put the page back to the global + * pool, prefetch its buddy to speed up later access + * under zone->lock. It is believed the overhead of + * an additional test and calculating buddy_pfn here + * can be offset by reduced memory latency later. To + * avoid excessive prefetching due to large count, only + * prefetch buddy for the first pcp->batch nr of pages. + */ + if (prefetch_nr++ < pcp->batch) + prefetch_buddy(page); } while (--count && --batch_free && !list_empty(list)); } -- 2.14.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ot0-f199.google.com (mail-ot0-f199.google.com [74.125.82.199]) by kanga.kvack.org (Postfix) with ESMTP id BE6A16B0003 for ; Tue, 20 Mar 2018 05:52:11 -0400 (EDT) Received: by mail-ot0-f199.google.com with SMTP id f32-v6so579776otc.13 for ; Tue, 20 Mar 2018 02:52:11 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id p7si351031oib.509.2018.03.20.02.52.10 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 20 Mar 2018 02:52:10 -0700 (PDT) Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> <988ce376-bdc4-0989-5133-612bfa3f7c45@intel.com> <20180313033519.GC13782@intel.com> <20180313070404.GA7501@intel.com> From: Vlastimil Babka Message-ID: <5600c827-d22b-136c-6b90-a4b52f40af31@suse.cz> Date: Tue, 20 Mar 2018 10:50:18 +0100 MIME-Version: 1.0 In-Reply-To: <20180313070404.GA7501@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu , Dave Hansen Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , Matthew Wilcox , David Rientjes On 03/13/2018 08:04 AM, Aaron Lu wrote: > On Tue, Mar 13, 2018 at 11:35:19AM +0800, Aaron Lu wrote: >> On Mon, Mar 12, 2018 at 10:32:32AM -0700, Dave Hansen wrote: >>> On 03/09/2018 12:24 AM, Aaron Lu wrote: >>>> + /* >>>> + * We are going to put the page back to the global >>>> + * pool, prefetch its buddy to speed up later access >>>> + * under zone->lock. It is believed the overhead of >>>> + * an additional test and calculating buddy_pfn here >>>> + * can be offset by reduced memory latency later. To >>>> + * avoid excessive prefetching due to large count, only >>>> + * prefetch buddy for the last pcp->batch nr of pages. >>>> + */ >>>> + if (count > pcp->batch) >>>> + continue; >>>> + pfn = page_to_pfn(page); >>>> + buddy_pfn = __find_buddy_pfn(pfn, 0); >>>> + buddy = page + (buddy_pfn - pfn); >>>> + prefetch(buddy); >>> >>> FWIW, I think this needs to go into a helper function. Is that possible? >> >> I'll give it a try. >> >>> >>> There's too much logic happening here. Also, 'count' going from >>> batch_size->0 is totally non-obvious from the patch context. It makes >>> this hunk look totally wrong by itself. > > I tried to avoid adding one more local variable but looks like it caused > a lot of pain. What about the following? It doesn't use count any more > but prefetch_nr to indicate how many prefetches have happened. > > Also, I think it's not worth the risk of disordering pages in free_list > by changing list_add_tail() to list_add() as Andrew reminded so I > dropped that change too. Looks fine, you can add Acked-by: Vlastimil Babka > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index dafdcdec9c1f..00ea4483f679 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1099,6 +1099,15 @@ static bool bulkfree_pcp_prepare(struct page *page) > } > #endif /* CONFIG_DEBUG_VM */ > > +static inline void prefetch_buddy(struct page *page) > +{ > + unsigned long pfn = page_to_pfn(page); > + unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0); > + struct page *buddy = page + (buddy_pfn - pfn); > + > + prefetch(buddy); > +} > + > /* > * Frees a number of pages from the PCP lists > * Assumes all pages on list are in same zone, and of same order. > @@ -1115,6 +1124,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, > { > int migratetype = 0; > int batch_free = 0; > + int prefetch_nr = 0; > bool isolated_pageblocks; > struct page *page, *tmp; > LIST_HEAD(head); > @@ -1150,6 +1160,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, > continue; > > list_add_tail(&page->lru, &head); > + > + /* > + * We are going to put the page back to the global > + * pool, prefetch its buddy to speed up later access > + * under zone->lock. It is believed the overhead of > + * an additional test and calculating buddy_pfn here > + * can be offset by reduced memory latency later. To > + * avoid excessive prefetching due to large count, only > + * prefetch buddy for the first pcp->batch nr of pages. > + */ > + if (prefetch_nr++ < pcp->batch) > + prefetch_buddy(page); > } while (--count && --batch_free && !list_empty(list)); > } > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl0-f70.google.com (mail-pl0-f70.google.com [209.85.160.70]) by kanga.kvack.org (Postfix) with ESMTP id 3822E6B0003 for ; Tue, 20 Mar 2018 07:30:42 -0400 (EDT) Received: by mail-pl0-f70.google.com with SMTP id f59-v6so947218plb.7 for ; Tue, 20 Mar 2018 04:30:42 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTPS id c10si1043566pgv.591.2018.03.20.04.30.40 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 20 Mar 2018 04:30:40 -0700 (PDT) Date: Tue, 20 Mar 2018 19:31:46 +0800 From: Aaron Lu Subject: [PATCH v4 3/3 update2] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180320113146.GB24737@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> <20180302082756.GC6356@intel.com> <20180309082431.GB30868@intel.com> <988ce376-bdc4-0989-5133-612bfa3f7c45@intel.com> <20180313033519.GC13782@intel.com> <20180313070404.GA7501@intel.com> <5600c827-d22b-136c-6b90-a4b52f40af31@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5600c827-d22b-136c-6b90-a4b52f40af31@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Dave Hansen , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , Matthew Wilcox , David Rientjes On Tue, Mar 20, 2018 at 10:50:18AM +0100, Vlastimil Babka wrote: > On 03/13/2018 08:04 AM, Aaron Lu wrote: > > I tried to avoid adding one more local variable but looks like it caused > > a lot of pain. What about the following? It doesn't use count any more > > but prefetch_nr to indicate how many prefetches have happened. > > > > Also, I think it's not worth the risk of disordering pages in free_list > > by changing list_add_tail() to list_add() as Andrew reminded so I > > dropped that change too. > > Looks fine, you can add > > Acked-by: Vlastimil Babka Thanks, here is the updated patch. From: Aaron Lu Date: Thu, 18 Jan 2018 11:19:59 +0800 Subject: [PATCH v4 3/3 update2] mm/free_pcppages_bulk: prefetch buddy while not holding lock When a page is freed back to the global pool, its buddy will be checked to see if it's possible to do a merge. This requires accessing buddy's page structure and that access could take a long time if it's cache cold. This patch adds a prefetch to the to-be-freed page's buddy outside of zone->lock in hope of accessing buddy's page structure later under zone->lock will be faster. Since we *always* do buddy merging and check an order-0 page's buddy to try to merge it when it goes into the main allocator, the cacheline will always come in, i.e. the prefetched data will never be unused. Normally, the number of prefetch will be pcp->batch(default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on x86_64) but in the case of pcp's pages get all drained, it will be pcp->count which has an upper limit of pcp->high. pcp->high, although has a default value of 186 (pcp->batch=31 * 6), can be changed by user through /proc/sys/vm/percpu_pagelist_fraction and there is no software upper limit so could be large, like several thousand. For this reason, only the first pcp->batch number of page's buddy structure is prefetched to avoid excessive prefetching. In the meantime, there are two concerns: 1 the prefetch could potentially evict existing cachelines, especially for L1D cache since it is not huge; 2 there is some additional instruction overhead, namely calculating buddy pfn twice. For 1, it's hard to say, this microbenchmark though shows good result but the actual benefit of this patch will be workload/CPU dependant; For 2, since the calculation is a XOR on two local variables, it's expected in many cases that cycles spent will be offset by reduced memory latency later. This is especially true for NUMA machines where multiple CPUs are contending on zone->lock and the most time consuming part under zone->lock is the wait of 'struct page' cacheline of the to-be-freed pages and their buddies. Test with will-it-scale/page_fault1 full load: kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) v4.16-rc2+ 9034215 7971818 13667135 15677465 patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9% Note: this patch's performance improvement percent is against patch2/3. (Changelog stolen from Dave Hansen and Mel Gorman's comments at http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com) Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com Signed-off-by: Aaron Lu Suggested-by: Ying Huang Acked-by: Vlastimil Babka Reviewed-by: Andrew Morton --- update2: Use a helper function to prefetch buddy as suggested by Dave Hansen. Drop the change of list_add_tail() to avoid disordering page. mm/page_alloc.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dafdcdec9c1f..00ea4483f679 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1099,6 +1099,15 @@ static bool bulkfree_pcp_prepare(struct page *page) } #endif /* CONFIG_DEBUG_VM */ +static inline void prefetch_buddy(struct page *page) +{ + unsigned long pfn = page_to_pfn(page); + unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0); + struct page *buddy = page + (buddy_pfn - pfn); + + prefetch(buddy); +} + /* * Frees a number of pages from the PCP lists * Assumes all pages on list are in same zone, and of same order. @@ -1115,6 +1124,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, { int migratetype = 0; int batch_free = 0; + int prefetch_nr = 0; bool isolated_pageblocks; struct page *page, *tmp; LIST_HEAD(head); @@ -1150,6 +1160,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, continue; list_add_tail(&page->lru, &head); + + /* + * We are going to put the page back to the global + * pool, prefetch its buddy to speed up later access + * under zone->lock. It is believed the overhead of + * an additional test and calculating buddy_pfn here + * can be offset by reduced memory latency later. To + * avoid excessive prefetching due to large count, only + * prefetch buddy for the first pcp->batch nr of pages. + */ + if (prefetch_nr++ < pcp->batch) + prefetch_buddy(page); } while (--count && --batch_free && !list_empty(list)); } -- 2.14.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id D00256B0003 for ; Thu, 22 Mar 2018 11:17:26 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id q11so1885953pfd.8 for ; Thu, 22 Mar 2018 08:17:26 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org. [2607:7c80:54:e::133]) by mx.google.com with ESMTPS id c20si5060531pfi.18.2018.03.22.08.17.24 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 22 Mar 2018 08:17:25 -0700 (PDT) Date: Thu, 22 Mar 2018 08:17:19 -0700 From: Matthew Wilcox Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180322151719.GA28468@bombadil.infradead.org> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <9cad642d-9fe5-b2c3-456c-279065c32337@suse.cz> <20180313033453.GB13782@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180313033453.GB13782@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Aaron Lu Cc: Vlastimil Babka , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , David Rientjes On Tue, Mar 13, 2018 at 11:34:53AM +0800, Aaron Lu wrote: > I wish there is a data structure that has the flexibility of list while > at the same time we can locate the Nth element in the list without the > need to iterate. That's what I'm looking for when developing clustered > allocation for order 0 pages. In the end, I had to use another place to > record where the Nth element is. I hope to send out v2 of that RFC > series soon but I'm still collecting data for it. I would appreciate if > people could take a look then :-) Sorry, I missed this. There is such a data structure -- the IDR, or possibly a bare radix tree, or we can build a better data structure on top of the radix tree (I talked about one called the XQueue a while ago). The IDR will automatically grow to whatever needed size, it stores pointers, you can find out quickly where the last allocated index is, you can remove from the middle of the array. Disadvantage is that it requires memory allocation to store the array of pointers, *but* it can always hold at least one entry. So if you have no memory, you can always return the one element in your IDR to the free pool and allocate from that page. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 521766B0026 for ; Sun, 25 Mar 2018 23:02:47 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id 203so7443598pfz.19 for ; Sun, 25 Mar 2018 20:02:47 -0700 (PDT) Received: from mga12.intel.com (mga12.intel.com. [192.55.52.136]) by mx.google.com with ESMTPS id g8si9453328pgv.740.2018.03.25.20.02.46 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 25 Mar 2018 20:02:46 -0700 (PDT) Date: Mon, 26 Mar 2018 11:03:45 +0800 From: Aaron Lu Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180326030344.GA30075@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <9cad642d-9fe5-b2c3-456c-279065c32337@suse.cz> <20180313033453.GB13782@intel.com> <20180322151719.GA28468@bombadil.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180322151719.GA28468@bombadil.infradead.org> Sender: owner-linux-mm@kvack.org List-ID: To: Matthew Wilcox Cc: Vlastimil Babka , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Mel Gorman , David Rientjes On Thu, Mar 22, 2018 at 08:17:19AM -0700, Matthew Wilcox wrote: > On Tue, Mar 13, 2018 at 11:34:53AM +0800, Aaron Lu wrote: > > I wish there is a data structure that has the flexibility of list while > > at the same time we can locate the Nth element in the list without the > > need to iterate. That's what I'm looking for when developing clustered > > allocation for order 0 pages. In the end, I had to use another place to > > record where the Nth element is. I hope to send out v2 of that RFC > > series soon but I'm still collecting data for it. I would appreciate if > > people could take a look then :-) > > Sorry, I missed this. There is such a data structure -- the IDR, or > possibly a bare radix tree, or we can build a better data structure on > top of the radix tree (I talked about one called the XQueue a while ago). > > The IDR will automatically grow to whatever needed size, it stores > pointers, you can find out quickly where the last allocated index is, > you can remove from the middle of the array. Disadvantage is that it > requires memory allocation to store the array of pointers, *but* it > can always hold at least one entry. So if you have no memory, you can > always return the one element in your IDR to the free pool and allocate > from that page. Thanks for the pointer, will take a look later. Currently I'm focusing on finding real workloads that have zone lock contention issue. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966346AbeCAG1v (ORCPT ); Thu, 1 Mar 2018 01:27:51 -0500 Received: from mga03.intel.com ([134.134.136.65]:12103 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S966268AbeCAG1u (ORCPT ); Thu, 1 Mar 2018 01:27:50 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,407,1515484800"; d="scan'208";a="33761273" From: Aaron Lu To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: [PATCH v4 0/3] mm: improve zone->lock scalability Date: Thu, 1 Mar 2018 14:28:42 +0800 Message-Id: <20180301062845.26038-1-aaron.lu@intel.com> X-Mailer: git-send-email 2.14.3 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Patch 1/3 is a small cleanup suggested by Matthew Wilcox which doesn't affect scalability or performance; Patch 2/3 moves some code in free_pcppages_bulk() outside of zone->lock and has Mel Gorman's ack; Patch 3/3 uses prefetch in free_pcppages_bulk() outside of zone->lock to speedup page merging under zone->lock but Mel Gorman has some concerns. For details, please see their changelogs. v4: Address David Rientjes' comments to not update pcp->count in front of free_pcppages_bulk() in patch 1/3. Reword code comments in patch 2/3 as suggested by David Rientjes. v3: Added patch 1/3 to update pcp->count inside of free_pcppages_bulk(); Rebase to v4.16-rc2. v2 & v1: https://lkml.org/lkml/2018/1/23/879 Aaron Lu (3): mm/free_pcppages_bulk: update pcp->count inside mm/free_pcppages_bulk: do not hold lock when picking pages to free mm/free_pcppages_bulk: prefetch buddy while not holding lock mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 40 insertions(+), 22 deletions(-) -- 2.14.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966360AbeCAG1y (ORCPT ); Thu, 1 Mar 2018 01:27:54 -0500 Received: from mga03.intel.com ([134.134.136.65]:12103 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S966268AbeCAG1w (ORCPT ); Thu, 1 Mar 2018 01:27:52 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,407,1515484800"; d="scan'208";a="33761278" From: Aaron Lu To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside Date: Thu, 1 Mar 2018 14:28:43 +0800 Message-Id: <20180301062845.26038-2-aaron.lu@intel.com> X-Mailer: git-send-email 2.14.3 In-Reply-To: <20180301062845.26038-1-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Matthew Wilcox found that all callers of free_pcppages_bulk() currently update pcp->count immediately after so it's natural to do it inside free_pcppages_bulk(). No functionality or performance change is expected from this patch. Suggested-by: Matthew Wilcox Signed-off-by: Aaron Lu --- mm/page_alloc.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cb416723538f..faa33eac1635 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, page = list_last_entry(list, struct page, lru); /* must delete as __free_one_page list manipulates */ list_del(&page->lru); + pcp->count--; mt = get_pcppage_migratetype(page); /* MIGRATE_ISOLATE page should not go to pcplists */ @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) local_irq_save(flags); batch = READ_ONCE(pcp->batch); to_drain = min(pcp->count, batch); - if (to_drain > 0) { + if (to_drain > 0) free_pcppages_bulk(zone, to_drain, pcp); - pcp->count -= to_drain; - } local_irq_restore(flags); } #endif @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) pset = per_cpu_ptr(zone->pageset, cpu); pcp = &pset->pcp; - if (pcp->count) { + if (pcp->count) free_pcppages_bulk(zone, pcp->count, pcp); - pcp->count = 0; - } local_irq_restore(flags); } @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) if (pcp->count >= pcp->high) { unsigned long batch = READ_ONCE(pcp->batch); free_pcppages_bulk(zone, batch, pcp); - pcp->count -= batch; } } -- 2.14.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966389AbeCAG2C (ORCPT ); Thu, 1 Mar 2018 01:28:02 -0500 Received: from mga03.intel.com ([134.134.136.65]:12103 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S966378AbeCAG14 (ORCPT ); Thu, 1 Mar 2018 01:27:56 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,407,1515484800"; d="scan'208";a="33761289" From: Aaron Lu To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Date: Thu, 1 Mar 2018 14:28:45 +0800 Message-Id: <20180301062845.26038-4-aaron.lu@intel.com> X-Mailer: git-send-email 2.14.3 In-Reply-To: <20180301062845.26038-1-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When a page is freed back to the global pool, its buddy will be checked to see if it's possible to do a merge. This requires accessing buddy's page structure and that access could take a long time if it's cache cold. This patch adds a prefetch to the to-be-freed page's buddy outside of zone->lock in hope of accessing buddy's page structure later under zone->lock will be faster. Since we *always* do buddy merging and check an order-0 page's buddy to try to merge it when it goes into the main allocator, the cacheline will always come in, i.e. the prefetched data will never be unused. In the meantime, there are two concerns: 1 the prefetch could potentially evict existing cachelines, especially for L1D cache since it is not huge; 2 there is some additional instruction overhead, namely calculating buddy pfn twice. For 1, it's hard to say, this microbenchmark though shows good result but the actual benefit of this patch will be workload/CPU dependant; For 2, since the calculation is a XOR on two local variables, it's expected in many cases that cycles spent will be offset by reduced memory latency later. This is especially true for NUMA machines where multiple CPUs are contending on zone->lock and the most time consuming part under zone->lock is the wait of 'struct page' cacheline of the to-be-freed pages and their buddies. Test with will-it-scale/page_fault1 full load: kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) v4.16-rc2+ 9034215 7971818 13667135 15677465 patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% Note: this patch's performance improvement percent is against patch2/3. [changelog stole from Dave Hansen and Mel Gorman's comments] https://lkml.org/lkml/2018/1/24/551 Suggested-by: Ying Huang Signed-off-by: Aaron Lu --- mm/page_alloc.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dafdcdec9c1f..1d838041931e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count, batch_free = count; do { + unsigned long pfn, buddy_pfn; + struct page *buddy; + page = list_last_entry(list, struct page, lru); /* must delete to avoid corrupting pcp list */ list_del(&page->lru); @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, continue; list_add_tail(&page->lru, &head); + + /* + * We are going to put the page back to the global + * pool, prefetch its buddy to speed up later access + * under zone->lock. It is believed the overhead of + * calculating buddy_pfn here can be offset by reduced + * memory latency later. + */ + pfn = page_to_pfn(page); + buddy_pfn = __find_buddy_pfn(pfn, 0); + buddy = page + (buddy_pfn - pfn); + prefetch(buddy); } while (--count && --batch_free && !list_empty(list)); } -- 2.14.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966376AbeCAG14 (ORCPT ); Thu, 1 Mar 2018 01:27:56 -0500 Received: from mga03.intel.com ([134.134.136.65]:12103 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S966268AbeCAG1y (ORCPT ); Thu, 1 Mar 2018 01:27:54 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,407,1515484800"; d="scan'208";a="33761283" From: Aaron Lu To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Date: Thu, 1 Mar 2018 14:28:44 +0800 Message-Id: <20180301062845.26038-3-aaron.lu@intel.com> X-Mailer: git-send-email 2.14.3 In-Reply-To: <20180301062845.26038-1-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, the zone->lock is held and then pages are chosen from PCP's migratetype list. While there is actually no need to do this 'choose part' under lock since it's PCP pages, the only CPU that can touch them is us and irq is also disabled. Moving this part outside could reduce lock held time and improve performance. Test with will-it-scale/page_fault1 full load: kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) v4.16-rc2+ 9034215 7971818 13667135 15677465 this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% What the test does is: starts $nr_cpu processes and each will repeatedly do the following for 5 minutes: 1 mmap 128M anonymouse space; 2 write access to that space; 3 munmap. The score is the aggregated iteration. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Acked-by: Mel Gorman Signed-off-by: Aaron Lu --- mm/page_alloc.c | 39 +++++++++++++++++++++++---------------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index faa33eac1635..dafdcdec9c1f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count, int migratetype = 0; int batch_free = 0; bool isolated_pageblocks; - - spin_lock(&zone->lock); - isolated_pageblocks = has_isolate_pageblock(zone); + struct page *page, *tmp; + LIST_HEAD(head); while (count) { - struct page *page; struct list_head *list; /* @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count, batch_free = count; do { - int mt; /* migratetype of the to-be-freed page */ - page = list_last_entry(list, struct page, lru); - /* must delete as __free_one_page list manipulates */ + /* must delete to avoid corrupting pcp list */ list_del(&page->lru); pcp->count--; - mt = get_pcppage_migratetype(page); - /* MIGRATE_ISOLATE page should not go to pcplists */ - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); - /* Pageblock could have been isolated meanwhile */ - if (unlikely(isolated_pageblocks)) - mt = get_pageblock_migratetype(page); - if (bulkfree_pcp_prepare(page)) continue; - __free_one_page(page, page_to_pfn(page), zone, 0, mt); - trace_mm_page_pcpu_drain(page, 0, mt); + list_add_tail(&page->lru, &head); } while (--count && --batch_free && !list_empty(list)); } + + spin_lock(&zone->lock); + isolated_pageblocks = has_isolate_pageblock(zone); + + /* + * Use safe version since after __free_one_page(), + * page->lru.next will not point to original list. + */ + list_for_each_entry_safe(page, tmp, &head, lru) { + int mt = get_pcppage_migratetype(page); + /* MIGRATE_ISOLATE page should not go to pcplists */ + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); + /* Pageblock could have been isolated meanwhile */ + if (unlikely(isolated_pageblocks)) + mt = get_pageblock_migratetype(page); + + __free_one_page(page, page_to_pfn(page), zone, 0, mt); + trace_mm_page_pcpu_drain(page, 0, mt); + } spin_unlock(&zone->lock); } -- 2.14.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030317AbeCAMLQ (ORCPT ); Thu, 1 Mar 2018 07:11:16 -0500 Received: from mail-io0-f169.google.com ([209.85.223.169]:40133 "EHLO mail-io0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030253AbeCAMLO (ORCPT ); Thu, 1 Mar 2018 07:11:14 -0500 X-Google-Smtp-Source: AG47ELu6oI7rinhXRTQwdgc3+j/i0J9g9fu/bLE4ke/3nVUE8brhttHUiMPZQam4xnZYao/nUw8yGg== Date: Thu, 1 Mar 2018 04:11:11 -0800 (PST) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Aaron Lu cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside In-Reply-To: <20180301062845.26038-2-aaron.lu@intel.com> Message-ID: References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-2-aaron.lu@intel.com> User-Agent: Alpine 2.20 (DEB 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 1 Mar 2018, Aaron Lu wrote: > Matthew Wilcox found that all callers of free_pcppages_bulk() currently > update pcp->count immediately after so it's natural to do it inside > free_pcppages_bulk(). > > No functionality or performance change is expected from this patch. > > Suggested-by: Matthew Wilcox > Signed-off-by: Aaron Lu Acked-by: David Rientjes From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030979AbeCANpP (ORCPT ); Thu, 1 Mar 2018 08:45:15 -0500 Received: from mx2.suse.de ([195.135.220.15]:59538 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030801AbeCANpO (ORCPT ); Thu, 1 Mar 2018 08:45:14 -0500 Date: Thu, 1 Mar 2018 14:45:12 +0100 From: Michal Hocko To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside Message-ID: <20180301134512.GI15057@dhcp22.suse.cz> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-2-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301062845.26038-2-aaron.lu@intel.com> User-Agent: Mutt/1.9.3 (2018-01-21) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 01-03-18 14:28:43, Aaron Lu wrote: > Matthew Wilcox found that all callers of free_pcppages_bulk() currently > update pcp->count immediately after so it's natural to do it inside > free_pcppages_bulk(). > > No functionality or performance change is expected from this patch. > > Suggested-by: Matthew Wilcox > Signed-off-by: Aaron Lu Makes a lot of sense to me. Acked-by: Michal Hocko > --- > mm/page_alloc.c | 10 +++------- > 1 file changed, 3 insertions(+), 7 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index cb416723538f..faa33eac1635 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, > page = list_last_entry(list, struct page, lru); > /* must delete as __free_one_page list manipulates */ > list_del(&page->lru); > + pcp->count--; > > mt = get_pcppage_migratetype(page); > /* MIGRATE_ISOLATE page should not go to pcplists */ > @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) > local_irq_save(flags); > batch = READ_ONCE(pcp->batch); > to_drain = min(pcp->count, batch); > - if (to_drain > 0) { > + if (to_drain > 0) > free_pcppages_bulk(zone, to_drain, pcp); > - pcp->count -= to_drain; > - } > local_irq_restore(flags); > } > #endif > @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) > pset = per_cpu_ptr(zone->pageset, cpu); > > pcp = &pset->pcp; > - if (pcp->count) { > + if (pcp->count) > free_pcppages_bulk(zone, pcp->count, pcp); > - pcp->count = 0; > - } > local_irq_restore(flags); > } > > @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) > if (pcp->count >= pcp->high) { > unsigned long batch = READ_ONCE(pcp->batch); > free_pcppages_bulk(zone, batch, pcp); > - pcp->count -= batch; > } > } > > -- > 2.14.3 > -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031016AbeCANzX (ORCPT ); Thu, 1 Mar 2018 08:55:23 -0500 Received: from mx2.suse.de ([195.135.220.15]:60288 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030826AbeCANzU (ORCPT ); Thu, 1 Mar 2018 08:55:20 -0500 Date: Thu, 1 Mar 2018 14:55:18 +0100 From: Michal Hocko To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180301135518.GJ15057@dhcp22.suse.cz> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301062845.26038-3-aaron.lu@intel.com> User-Agent: Mutt/1.9.3 (2018-01-21) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 01-03-18 14:28:44, Aaron Lu wrote: > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > the zone->lock is held and then pages are chosen from PCP's migratetype > list. While there is actually no need to do this 'choose part' under > lock since it's PCP pages, the only CPU that can touch them is us and > irq is also disabled. > > Moving this part outside could reduce lock held time and improve > performance. Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > What the test does is: starts $nr_cpu processes and each will repeatedly > do the following for 5 minutes: > 1 mmap 128M anonymouse space; > 2 write access to that space; > 3 munmap. > The score is the aggregated iteration. Iteration count I assume. I am still quite surprised that this would have such a large impact. > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c > > Acked-by: Mel Gorman > Signed-off-by: Aaron Lu The patch makes sense to me Acked-by: Michal Hocko > --- > mm/page_alloc.c | 39 +++++++++++++++++++++++---------------- > 1 file changed, 23 insertions(+), 16 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index faa33eac1635..dafdcdec9c1f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count, > int migratetype = 0; > int batch_free = 0; > bool isolated_pageblocks; > - > - spin_lock(&zone->lock); > - isolated_pageblocks = has_isolate_pageblock(zone); > + struct page *page, *tmp; > + LIST_HEAD(head); > > while (count) { > - struct page *page; > struct list_head *list; > > /* > @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count, > batch_free = count; > > do { > - int mt; /* migratetype of the to-be-freed page */ > - > page = list_last_entry(list, struct page, lru); > - /* must delete as __free_one_page list manipulates */ > + /* must delete to avoid corrupting pcp list */ > list_del(&page->lru); > pcp->count--; > > - mt = get_pcppage_migratetype(page); > - /* MIGRATE_ISOLATE page should not go to pcplists */ > - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > - /* Pageblock could have been isolated meanwhile */ > - if (unlikely(isolated_pageblocks)) > - mt = get_pageblock_migratetype(page); > - > if (bulkfree_pcp_prepare(page)) > continue; > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt); > - trace_mm_page_pcpu_drain(page, 0, mt); > + list_add_tail(&page->lru, &head); > } while (--count && --batch_free && !list_empty(list)); > } > + > + spin_lock(&zone->lock); > + isolated_pageblocks = has_isolate_pageblock(zone); > + > + /* > + * Use safe version since after __free_one_page(), > + * page->lru.next will not point to original list. > + */ > + list_for_each_entry_safe(page, tmp, &head, lru) { > + int mt = get_pcppage_migratetype(page); > + /* MIGRATE_ISOLATE page should not go to pcplists */ > + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > + /* Pageblock could have been isolated meanwhile */ > + if (unlikely(isolated_pageblocks)) > + mt = get_pageblock_migratetype(page); > + > + __free_one_page(page, page_to_pfn(page), zone, 0, mt); > + trace_mm_page_pcpu_drain(page, 0, mt); > + } > spin_unlock(&zone->lock); > } > > -- > 2.14.3 > -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031045AbeCAOAw (ORCPT ); Thu, 1 Mar 2018 09:00:52 -0500 Received: from mx2.suse.de ([195.135.220.15]:60648 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031003AbeCAOAu (ORCPT ); Thu, 1 Mar 2018 09:00:50 -0500 Date: Thu, 1 Mar 2018 15:00:44 +0100 From: Michal Hocko To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180301140044.GK15057@dhcp22.suse.cz> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301062845.26038-4-aaron.lu@intel.com> User-Agent: Mutt/1.9.3 (2018-01-21) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 01-03-18 14:28:45, Aaron Lu wrote: > When a page is freed back to the global pool, its buddy will be checked > to see if it's possible to do a merge. This requires accessing buddy's > page structure and that access could take a long time if it's cache cold. > > This patch adds a prefetch to the to-be-freed page's buddy outside of > zone->lock in hope of accessing buddy's page structure later under > zone->lock will be faster. Since we *always* do buddy merging and check > an order-0 page's buddy to try to merge it when it goes into the main > allocator, the cacheline will always come in, i.e. the prefetched data > will never be unused. > > In the meantime, there are two concerns: > 1 the prefetch could potentially evict existing cachelines, especially > for L1D cache since it is not huge; > 2 there is some additional instruction overhead, namely calculating > buddy pfn twice. > > For 1, it's hard to say, this microbenchmark though shows good result but > the actual benefit of this patch will be workload/CPU dependant; > For 2, since the calculation is a XOR on two local variables, it's expected > in many cases that cycles spent will be offset by reduced memory latency > later. This is especially true for NUMA machines where multiple CPUs are > contending on zone->lock and the most time consuming part under zone->lock > is the wait of 'struct page' cacheline of the to-be-freed pages and their > buddies. > > Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > Note: this patch's performance improvement percent is against patch2/3. I am really surprised that this has such a big impact. Is this a win on other architectures as well? > [changelog stole from Dave Hansen and Mel Gorman's comments] > https://lkml.org/lkml/2018/1/24/551 Please use http://lkml.kernel.org/r/ for references because lkml.org is quite unstable. It would be http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com here. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1163669AbeCBABI (ORCPT ); Thu, 1 Mar 2018 19:01:08 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:42430 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1163565AbeCBABH (ORCPT ); Thu, 1 Mar 2018 19:01:07 -0500 Date: Thu, 1 Mar 2018 16:01:05 -0800 From: Andrew Morton To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-Id: <20180301160105.aca958fac871998d582307d4@linux-foundation.org> In-Reply-To: <20180301062845.26038-3-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> X-Mailer: Sylpheed 3.6.0 (GTK+ 2.24.31; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu wrote: > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > the zone->lock is held and then pages are chosen from PCP's migratetype > list. While there is actually no need to do this 'choose part' under > lock since it's PCP pages, the only CPU that can touch them is us and > irq is also disabled. > > Moving this part outside could reduce lock held time and improve > performance. Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > What the test does is: starts $nr_cpu processes and each will repeatedly > do the following for 5 minutes: > 1 mmap 128M anonymouse space; > 2 write access to that space; > 3 munmap. > The score is the aggregated iteration. But it's a loss for uniprocessor systems: it adds more code and adds an additional pass across a list. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1163765AbeCBAJz (ORCPT ); Thu, 1 Mar 2018 19:09:55 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:46008 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1163688AbeCBAJx (ORCPT ); Thu, 1 Mar 2018 19:09:53 -0500 Date: Thu, 1 Mar 2018 16:09:50 -0800 From: Andrew Morton To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-Id: <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> In-Reply-To: <20180301062845.26038-4-aaron.lu@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> X-Mailer: Sylpheed 3.6.0 (GTK+ 2.24.31; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 1 Mar 2018 14:28:45 +0800 Aaron Lu wrote: > When a page is freed back to the global pool, its buddy will be checked > to see if it's possible to do a merge. This requires accessing buddy's > page structure and that access could take a long time if it's cache cold. > > This patch adds a prefetch to the to-be-freed page's buddy outside of > zone->lock in hope of accessing buddy's page structure later under > zone->lock will be faster. Since we *always* do buddy merging and check > an order-0 page's buddy to try to merge it when it goes into the main > allocator, the cacheline will always come in, i.e. the prefetched data > will never be unused. > > In the meantime, there are two concerns: > 1 the prefetch could potentially evict existing cachelines, especially > for L1D cache since it is not huge; > 2 there is some additional instruction overhead, namely calculating > buddy pfn twice. > > For 1, it's hard to say, this microbenchmark though shows good result but > the actual benefit of this patch will be workload/CPU dependant; > For 2, since the calculation is a XOR on two local variables, it's expected > in many cases that cycles spent will be offset by reduced memory latency > later. This is especially true for NUMA machines where multiple CPUs are > contending on zone->lock and the most time consuming part under zone->lock > is the wait of 'struct page' cacheline of the to-be-freed pages and their > buddies. > > Test with will-it-scale/page_fault1 full load: > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > v4.16-rc2+ 9034215 7971818 13667135 15677465 > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > Note: this patch's performance improvement percent is against patch2/3. > > ... > > @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, > continue; > > list_add_tail(&page->lru, &head); > + > + /* > + * We are going to put the page back to the global > + * pool, prefetch its buddy to speed up later access > + * under zone->lock. It is believed the overhead of > + * calculating buddy_pfn here can be offset by reduced > + * memory latency later. > + */ > + pfn = page_to_pfn(page); > + buddy_pfn = __find_buddy_pfn(pfn, 0); > + buddy = page + (buddy_pfn - pfn); > + prefetch(buddy); What is the typical list length here? Maybe it's approximately the pcp batch size which is typically 128 pages? If so, I'm a bit surprised that it is effective to prefetch 128 page frames before using any them for real. I guess they'll fit in the L2 cache. Thoughts? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936112AbeCBHOj (ORCPT ); Fri, 2 Mar 2018 02:14:39 -0500 Received: from mga09.intel.com ([134.134.136.24]:16922 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934521AbeCBHOh (ORCPT ); Fri, 2 Mar 2018 02:14:37 -0500 X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,411,1515484800"; d="scan'208";a="30889912" Date: Fri, 2 Mar 2018 15:15:34 +0800 From: Aaron Lu To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180302071533.GA6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301135518.GJ15057@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301135518.GJ15057@dhcp22.suse.cz> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 01, 2018 at 02:55:18PM +0100, Michal Hocko wrote: > On Thu 01-03-18 14:28:44, Aaron Lu wrote: > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > the zone->lock is held and then pages are chosen from PCP's migratetype > > list. While there is actually no need to do this 'choose part' under > > lock since it's PCP pages, the only CPU that can touch them is us and > > irq is also disabled. > > > > Moving this part outside could reduce lock held time and improve > > performance. Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > > > What the test does is: starts $nr_cpu processes and each will repeatedly > > do the following for 5 minutes: > > 1 mmap 128M anonymouse space; > > 2 write access to that space; > > 3 munmap. > > The score is the aggregated iteration. > > Iteration count I assume. Correct. > I am still quite surprised that this would have such a large impact. Most likely due to the cachelines for these page structures are warmed up outside of zone->lock. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936146AbeCBHbs (ORCPT ); Fri, 2 Mar 2018 02:31:48 -0500 Received: from mga18.intel.com ([134.134.136.126]:46928 "EHLO mga18.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935717AbeCBHbo (ORCPT ); Fri, 2 Mar 2018 02:31:44 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,411,1515484800"; d="scan'208";a="22274212" From: "Huang\, Ying" To: Michal Hocko Cc: Aaron Lu , , , Andrew Morton , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , "Vlastimil Babka" , Mel Gorman , "Matthew Wilcox" , David Rientjes Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301135518.GJ15057@dhcp22.suse.cz> Date: Fri, 02 Mar 2018 15:31:40 +0800 In-Reply-To: <20180301135518.GJ15057@dhcp22.suse.cz> (Michal Hocko's message of "Thu, 1 Mar 2018 14:55:18 +0100") Message-ID: <87r2p3c4rn.fsf@yhuang-dev.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Michal Hocko writes: > On Thu 01-03-18 14:28:44, Aaron Lu wrote: >> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, >> the zone->lock is held and then pages are chosen from PCP's migratetype >> list. While there is actually no need to do this 'choose part' under >> lock since it's PCP pages, the only CPU that can touch them is us and >> irq is also disabled. >> >> Moving this part outside could reduce lock held time and improve >> performance. Test with will-it-scale/page_fault1 full load: >> >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) >> v4.16-rc2+ 9034215 7971818 13667135 15677465 >> this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% >> >> What the test does is: starts $nr_cpu processes and each will repeatedly >> do the following for 5 minutes: >> 1 mmap 128M anonymouse space; >> 2 write access to that space; >> 3 munmap. >> The score is the aggregated iteration. > > Iteration count I assume. I am still quite surprised that this would > have such a large impact. The test is run with full load, this means near or more than 100 processes will allocate memory in parallel. According to Amdahl's law, the performance of a parallel program will be dominated by the serial part. For this case, the part protected by zone->lock. So small changes to code under zone->lock could make bigger changes to overall score. Best Regards, Huang, Ying From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1034316AbeCBIAa (ORCPT ); Fri, 2 Mar 2018 03:00:30 -0500 Received: from mga04.intel.com ([192.55.52.120]:24578 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030476AbeCBIA2 (ORCPT ); Fri, 2 Mar 2018 03:00:28 -0500 X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,411,1515484800"; d="scan'208";a="21512780" Date: Fri, 2 Mar 2018 16:01:25 +0800 From: Aaron Lu To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-ID: <20180302080125.GB6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301160105.aca958fac871998d582307d4@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301160105.aca958fac871998d582307d4@linux-foundation.org> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote: > On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu wrote: > > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > the zone->lock is held and then pages are chosen from PCP's migratetype > > list. While there is actually no need to do this 'choose part' under > > lock since it's PCP pages, the only CPU that can touch them is us and > > irq is also disabled. > > > > Moving this part outside could reduce lock held time and improve > > performance. Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > > > What the test does is: starts $nr_cpu processes and each will repeatedly > > do the following for 5 minutes: > > 1 mmap 128M anonymouse space; > > 2 write access to that space; > > 3 munmap. > > The score is the aggregated iteration. > > But it's a loss for uniprocessor systems: it adds more code and adds an > additional pass across a list. Performance wise, I assume the loss is pretty small and can not be measured. On my Sandybridge desktop, with will-it-scale/page_fault1/single process run to emulate uniprocessor system, the score is(average of 3 runs): base(patch 1/3): 649710 this patch: 653554 +0.6% prefetch(patch 3/3): 650336 (in noise range compared to base) On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single process run: base(patch 1/3): 498649 this patch: 504171 +1.1% prefetch(patch 3/3): 506334 +1.5% (compared to base) It looks like we don't need to worry too much about performance for uniprocessor system. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1422914AbeCBI06 (ORCPT ); Fri, 2 Mar 2018 03:26:58 -0500 Received: from mga09.intel.com ([134.134.136.24]:19901 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1422857AbeCBI05 (ORCPT ); Fri, 2 Mar 2018 03:26:57 -0500 X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,411,1515484800"; d="scan'208";a="30902997" Date: Fri, 2 Mar 2018 16:27:56 +0800 From: Aaron Lu To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180302082756.GC6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301160950.b561d6b8b561217bad511229@linux-foundation.org> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 01, 2018 at 04:09:50PM -0800, Andrew Morton wrote: > On Thu, 1 Mar 2018 14:28:45 +0800 Aaron Lu wrote: > > > When a page is freed back to the global pool, its buddy will be checked > > to see if it's possible to do a merge. This requires accessing buddy's > > page structure and that access could take a long time if it's cache cold. > > > > This patch adds a prefetch to the to-be-freed page's buddy outside of > > zone->lock in hope of accessing buddy's page structure later under > > zone->lock will be faster. Since we *always* do buddy merging and check > > an order-0 page's buddy to try to merge it when it goes into the main > > allocator, the cacheline will always come in, i.e. the prefetched data > > will never be unused. > > > > In the meantime, there are two concerns: > > 1 the prefetch could potentially evict existing cachelines, especially > > for L1D cache since it is not huge; > > 2 there is some additional instruction overhead, namely calculating > > buddy pfn twice. > > > > For 1, it's hard to say, this microbenchmark though shows good result but > > the actual benefit of this patch will be workload/CPU dependant; > > For 2, since the calculation is a XOR on two local variables, it's expected > > in many cases that cycles spent will be offset by reduced memory latency > > later. This is especially true for NUMA machines where multiple CPUs are > > contending on zone->lock and the most time consuming part under zone->lock > > is the wait of 'struct page' cacheline of the to-be-freed pages and their > > buddies. > > > > Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > > Note: this patch's performance improvement percent is against patch2/3. > > > > ... > > > > @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count, > > continue; > > > > list_add_tail(&page->lru, &head); > > + > > + /* > > + * We are going to put the page back to the global > > + * pool, prefetch its buddy to speed up later access > > + * under zone->lock. It is believed the overhead of > > + * calculating buddy_pfn here can be offset by reduced > > + * memory latency later. > > + */ > > + pfn = page_to_pfn(page); > > + buddy_pfn = __find_buddy_pfn(pfn, 0); > > + buddy = page + (buddy_pfn - pfn); > > + prefetch(buddy); > > What is the typical list length here? Maybe it's approximately the pcp > batch size which is typically 128 pages? Most of time it is pcp->batch, unless when pcp's pages need to be all drained like in drain_local_pages(zone). The pcp->batch has a default value of 31 and its upper limit is 96 for x86_64. For this test, it is 31 here, I didn't manipulate /proc/sys/vm/percpu_pagelist_fraction to change it. With this said, the count here could be pcp->count when pcp's pages need to be all drained and though pcp->count's default value is (6*pcp->batch)=186, user can increase that value through the above mentioned procfs interface and the resulting pcp->count could be too big for prefetch. Ying also mentioned this today and suggested adding an upper limit here to avoid prefetching too much. Perhaps just prefetch the last pcp->batch pages if count here > pcp->batch? Since pcp->batch has an upper limit, we won't need to worry prefetching too much. > > If so, I'm a bit surprised that it is effective to prefetch 128 page > frames before using any them for real. I guess they'll fit in the L2 > cache. Thoughts? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1422920AbeCBIaV (ORCPT ); Fri, 2 Mar 2018 03:30:21 -0500 Received: from mga02.intel.com ([134.134.136.20]:42242 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1422857AbeCBIaU (ORCPT ); Fri, 2 Mar 2018 03:30:20 -0500 X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,411,1515484800"; d="scan'208";a="34088912" Date: Fri, 2 Mar 2018 16:31:19 +0800 From: Aaron Lu To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180302083119.GD6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180301140044.GK15057@dhcp22.suse.cz> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 01, 2018 at 03:00:44PM +0100, Michal Hocko wrote: > On Thu 01-03-18 14:28:45, Aaron Lu wrote: > > When a page is freed back to the global pool, its buddy will be checked > > to see if it's possible to do a merge. This requires accessing buddy's > > page structure and that access could take a long time if it's cache cold. > > > > This patch adds a prefetch to the to-be-freed page's buddy outside of > > zone->lock in hope of accessing buddy's page structure later under > > zone->lock will be faster. Since we *always* do buddy merging and check > > an order-0 page's buddy to try to merge it when it goes into the main > > allocator, the cacheline will always come in, i.e. the prefetched data > > will never be unused. > > > > In the meantime, there are two concerns: > > 1 the prefetch could potentially evict existing cachelines, especially > > for L1D cache since it is not huge; > > 2 there is some additional instruction overhead, namely calculating > > buddy pfn twice. > > > > For 1, it's hard to say, this microbenchmark though shows good result but > > the actual benefit of this patch will be workload/CPU dependant; > > For 2, since the calculation is a XOR on two local variables, it's expected > > in many cases that cycles spent will be offset by reduced memory latency > > later. This is especially true for NUMA machines where multiple CPUs are > > contending on zone->lock and the most time consuming part under zone->lock > > is the wait of 'struct page' cacheline of the to-be-freed pages and their > > buddies. > > > > Test with will-it-scale/page_fault1 full load: > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > > Note: this patch's performance improvement percent is against patch2/3. > > I am really surprised that this has such a big impact. Is this a win on > other architectures as well? For NUMA machines, I guess so. But I didn't test other archs so can't say for sure. > > > [changelog stole from Dave Hansen and Mel Gorman's comments] > > https://lkml.org/lkml/2018/1/24/551 > > Please use http://lkml.kernel.org/r/ for references because > lkml.org is quite unstable. It would be > http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com > here. Good to know this, thanks! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1424157AbeCBPeI (ORCPT ); Fri, 2 Mar 2018 10:34:08 -0500 Received: from mga11.intel.com ([192.55.52.93]:38783 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423376AbeCBPeD (ORCPT ); Fri, 2 Mar 2018 10:34:03 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,412,1515484800"; d="scan'208";a="205024748" Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free To: Aaron Lu , Michal Hocko References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301135518.GJ15057@dhcp22.suse.cz> <20180302071533.GA6356@intel.com> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes From: Dave Hansen Message-ID: Date: Fri, 2 Mar 2018 07:34:01 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: <20180302071533.GA6356@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/01/2018 11:15 PM, Aaron Lu wrote: > >> I am still quite surprised that this would have such a large impact. > Most likely due to the cachelines for these page structures are warmed > up outside of zone->lock. The workload here is a pretty tight microbenchmark and single biggest bottleneck is cache misses on 'struct page'. It's not memory bandwidth bound. So, anything you can give the CPU keep it fed and not waiting on cache misses will be a win. There's never going to be a real-world workload that sees this kind of increase, but the change in the micro isn't super-surprising because it so directly targets the bottleneck. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946965AbeCBR5T (ORCPT ); Fri, 2 Mar 2018 12:57:19 -0500 Received: from mx2.suse.de ([195.135.220.15]:56635 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946955AbeCBR5O (ORCPT ); Fri, 2 Mar 2018 12:57:14 -0500 Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock To: Michal Hocko , Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> From: Vlastimil Babka Message-ID: Date: Fri, 2 Mar 2018 18:55:25 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: <20180301140044.GK15057@dhcp22.suse.cz> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/01/2018 03:00 PM, Michal Hocko wrote: > On Thu 01-03-18 14:28:45, Aaron Lu wrote: >> When a page is freed back to the global pool, its buddy will be checked >> to see if it's possible to do a merge. This requires accessing buddy's >> page structure and that access could take a long time if it's cache cold. >> >> This patch adds a prefetch to the to-be-freed page's buddy outside of >> zone->lock in hope of accessing buddy's page structure later under >> zone->lock will be faster. Since we *always* do buddy merging and check >> an order-0 page's buddy to try to merge it when it goes into the main >> allocator, the cacheline will always come in, i.e. the prefetched data >> will never be unused. >> >> In the meantime, there are two concerns: >> 1 the prefetch could potentially evict existing cachelines, especially >> for L1D cache since it is not huge; >> 2 there is some additional instruction overhead, namely calculating >> buddy pfn twice. >> >> For 1, it's hard to say, this microbenchmark though shows good result but >> the actual benefit of this patch will be workload/CPU dependant; >> For 2, since the calculation is a XOR on two local variables, it's expected >> in many cases that cycles spent will be offset by reduced memory latency >> later. This is especially true for NUMA machines where multiple CPUs are >> contending on zone->lock and the most time consuming part under zone->lock >> is the wait of 'struct page' cacheline of the to-be-freed pages and their >> buddies. >> >> Test with will-it-scale/page_fault1 full load: >> >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) >> v4.16-rc2+ 9034215 7971818 13667135 15677465 >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% >> Note: this patch's performance improvement percent is against patch2/3. > > I am really surprised that this has such a big impact. It's even stranger to me. Struct page is 64 bytes these days, exactly a a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache line (that forms an aligned 128 bytes block with the one we touch). Which is exactly a order-0 buddy struct page! Maybe that implicit prefetching stopped at L2 and explicit goes all the way to L1, can't remember. Would that make such a difference? It would be nice to do some perf tests with cache counters to see what is really going on... Vlastimil > Is this a win on > other architectures as well? > >> [changelog stole from Dave Hansen and Mel Gorman's comments] >> https://lkml.org/lkml/2018/1/24/551 > > Please use http://lkml.kernel.org/r/ for references because > lkml.org is quite unstable. It would be > http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com > here. > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1947125AbeCBSAa (ORCPT ); Fri, 2 Mar 2018 13:00:30 -0500 Received: from mga02.intel.com ([134.134.136.20]:9986 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1947104AbeCBSA0 (ORCPT ); Fri, 2 Mar 2018 13:00:26 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,413,1515484800"; d="scan'208";a="35471163" Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock To: Vlastimil Babka , Michal Hocko , Aaron Lu References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes From: Dave Hansen Message-ID: <2433e857-fea7-9af4-d124-538ad17de454@intel.com> Date: Fri, 2 Mar 2018 10:00:24 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/02/2018 09:55 AM, Vlastimil Babka wrote: > It's even stranger to me. Struct page is 64 bytes these days, exactly a > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > line (that forms an aligned 128 bytes block with the one we touch). I believe that was a behavior that was specific to the Pentium 4 "Netburst" era. I don't think the 128-byte line behavior exists on modern Intel cpus. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1426699AbeCBSJt (ORCPT ); Fri, 2 Mar 2018 13:09:49 -0500 Received: from mx2.suse.de ([195.135.220.15]:57488 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423570AbeCBSJs (ORCPT ); Fri, 2 Mar 2018 13:09:48 -0500 Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock To: Dave Hansen , Michal Hocko , Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <2433e857-fea7-9af4-d124-538ad17de454@intel.com> From: Vlastimil Babka Message-ID: <787e53b9-20d3-a2df-44fc-ea8c31a6ced3@suse.cz> Date: Fri, 2 Mar 2018 19:08:00 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: <2433e857-fea7-9af4-d124-538ad17de454@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/02/2018 07:00 PM, Dave Hansen wrote: > On 03/02/2018 09:55 AM, Vlastimil Babka wrote: >> It's even stranger to me. Struct page is 64 bytes these days, exactly a >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache >> line (that forms an aligned 128 bytes block with the one we touch). > > I believe that was a behavior that was specific to the Pentium 4 > "Netburst" era. I don't think the 128-byte line behavior exists on > modern Intel cpus. I remember it on Core 2 something (Nehalem IIRC). And this page suggests up to Broadwell, and it can be disabled. And it's an L2 prefetcher indeed. https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933751AbeCBVXh (ORCPT ); Fri, 2 Mar 2018 16:23:37 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:55404 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933732AbeCBVXe (ORCPT ); Fri, 2 Mar 2018 16:23:34 -0500 Date: Fri, 2 Mar 2018 13:23:32 -0800 From: Andrew Morton To: Aaron Lu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free Message-Id: <20180302132332.2c69559686ff24d15ff44ae8@linux-foundation.org> In-Reply-To: <20180302080125.GB6356@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301160105.aca958fac871998d582307d4@linux-foundation.org> <20180302080125.GB6356@intel.com> X-Mailer: Sylpheed 3.6.0 (GTK+ 2.24.31; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2 Mar 2018 16:01:25 +0800 Aaron Lu wrote: > On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote: > > On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu wrote: > > > > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > > the zone->lock is held and then pages are chosen from PCP's migratetype > > > list. While there is actually no need to do this 'choose part' under > > > lock since it's PCP pages, the only CPU that can touch them is us and > > > irq is also disabled. > > > > > > Moving this part outside could reduce lock held time and improve > > > performance. Test with will-it-scale/page_fault1 full load: > > > > > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > > v4.16-rc2+ 9034215 7971818 13667135 15677465 > > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > > > > > What the test does is: starts $nr_cpu processes and each will repeatedly > > > do the following for 5 minutes: > > > 1 mmap 128M anonymouse space; > > > 2 write access to that space; > > > 3 munmap. > > > The score is the aggregated iteration. > > > > But it's a loss for uniprocessor systems: it adds more code and adds an > > additional pass across a list. > > Performance wise, I assume the loss is pretty small and can not > be measured. > > On my Sandybridge desktop, with will-it-scale/page_fault1/single process > run to emulate uniprocessor system, the score is(average of 3 runs): > > base(patch 1/3): 649710 > this patch: 653554 +0.6% Does that mean we got faster or slower? > prefetch(patch 3/3): 650336 (in noise range compared to base) > > On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single > process run: > > base(patch 1/3): 498649 > this patch: 504171 +1.1% > prefetch(patch 3/3): 506334 +1.5% (compared to base) > > It looks like we don't need to worry too much about performance for > uniprocessor system. Well. We can say that of hundreds of patches. And we end up with a fatter and slower kernel than we otherwise would. Please take a look, see if there's a tidy way of avoiding this. Probably there isn't, in which case oh well. But let's at least try. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933781AbeCBVZj (ORCPT ); Fri, 2 Mar 2018 16:25:39 -0500 Received: from mga02.intel.com ([134.134.136.20]:24470 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933763AbeCBVZi (ORCPT ); Fri, 2 Mar 2018 16:25:38 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,413,1515484800"; d="scan'208";a="22426863" Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free To: Andrew Morton , Aaron Lu References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-3-aaron.lu@intel.com> <20180301160105.aca958fac871998d582307d4@linux-foundation.org> <20180302080125.GB6356@intel.com> <20180302132332.2c69559686ff24d15ff44ae8@linux-foundation.org> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Kemi Wang , Tim Chen , Andi Kleen , Michal Hocko , Vlastimil Babka , Mel Gorman , Matthew Wilcox , David Rientjes From: Dave Hansen Message-ID: <1110bcae-f4c5-68ff-77df-28934a21dd86@intel.com> Date: Fri, 2 Mar 2018 13:25:35 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: <20180302132332.2c69559686ff24d15ff44ae8@linux-foundation.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/02/2018 01:23 PM, Andrew Morton wrote: >> On my Sandybridge desktop, with will-it-scale/page_fault1/single process >> run to emulate uniprocessor system, the score is(average of 3 runs): >> >> base(patch 1/3): 649710 >> this patch: 653554 +0.6% > Does that mean we got faster or slower? Faster. The unit of measurement here is iterations through a loop. More iterations are better. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934302AbeCELlG (ORCPT ); Mon, 5 Mar 2018 06:41:06 -0500 Received: from mga05.intel.com ([192.55.52.43]:19047 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933550AbeCELlC (ORCPT ); Mon, 5 Mar 2018 06:41:02 -0500 X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,426,1515484800"; d="scan'208";a="25262073" Date: Mon, 5 Mar 2018 19:41:59 +0800 From: Aaron Lu To: Vlastimil Babka Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180305114159.GA32573@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: > On 03/01/2018 03:00 PM, Michal Hocko wrote: > > On Thu 01-03-18 14:28:45, Aaron Lu wrote: > >> When a page is freed back to the global pool, its buddy will be checked > >> to see if it's possible to do a merge. This requires accessing buddy's > >> page structure and that access could take a long time if it's cache cold. > >> > >> This patch adds a prefetch to the to-be-freed page's buddy outside of > >> zone->lock in hope of accessing buddy's page structure later under > >> zone->lock will be faster. Since we *always* do buddy merging and check > >> an order-0 page's buddy to try to merge it when it goes into the main > >> allocator, the cacheline will always come in, i.e. the prefetched data > >> will never be unused. > >> > >> In the meantime, there are two concerns: > >> 1 the prefetch could potentially evict existing cachelines, especially > >> for L1D cache since it is not huge; > >> 2 there is some additional instruction overhead, namely calculating > >> buddy pfn twice. > >> > >> For 1, it's hard to say, this microbenchmark though shows good result but > >> the actual benefit of this patch will be workload/CPU dependant; > >> For 2, since the calculation is a XOR on two local variables, it's expected > >> in many cases that cycles spent will be offset by reduced memory latency > >> later. This is especially true for NUMA machines where multiple CPUs are > >> contending on zone->lock and the most time consuming part under zone->lock > >> is the wait of 'struct page' cacheline of the to-be-freed pages and their > >> buddies. > >> > >> Test with will-it-scale/page_fault1 full load: > >> > >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > >> v4.16-rc2+ 9034215 7971818 13667135 15677465 > >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > >> Note: this patch's performance improvement percent is against patch2/3. > > > > I am really surprised that this has such a big impact. > > It's even stranger to me. Struct page is 64 bytes these days, exactly a > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > line (that forms an aligned 128 bytes block with the one we touch). > Which is exactly a order-0 buddy struct page! Maybe that implicit > prefetching stopped at L2 and explicit goes all the way to L1, can't The Intel Architecture Optimization Manual section 7.3.2 says: prefetchT0 - fetch data into all cache levels Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: 1st, 2nd and 3rd level cache. prefetchT2 - fetch data into 2nd and 3rd level caches (identical to prefetchT1) Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: 2nd and 3rd level cache. prefetchNTA - fetch data into non-temporal cache close to the processor, minimizing cache pollution Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: must fetch into 3rd level cache with fast replacement. I tried 'prefetcht0' and 'prefetcht2' instead of the default 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about the same performance number as prefetchNTA. I had expected prefetchT0 to deliver a better score if it was indeed due to L1D since prefetchT2 will not place data into L1 while prefetchT0 will, but looks like it is not the case here. It feels more like the buddy cacheline isn't in any level of the caches without prefetch for some reason. > remember. Would that make such a difference? It would be nice to do some > perf tests with cache counters to see what is really going on... Compare prefetchT2 to no-prefetch, I saw these metrics change: no-prefetch change prefetchT2 metrics \ \ stddev stddev ------------------------------------------------------------------------ 0.18 +0.0 0.18 perf-stat.branch-miss-rate% 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references 3.52 -1.1% 3.48 perf-stat.cpi 0.02 -0.0 0.01 ±3% perf-stat.dTLB-load-miss-rate% 8.677e+08 -7.3% 8.048e+08 ±3% perf-stat.dTLB-load-misses 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate% 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores 6.126e+09 +10.1% 6.745e+09 ±3% perf-stat.iTLB-load-misses 3464 -8.4% 3172 ±3% perf-stat.instructions-per-iTLB-miss 0.28 +1.1% 0.29 perf-stat.ipc 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults 2182469 -4.2% 2090977 perf-stat.path-length Not sure if this is useful though... From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934850AbeCELsc (ORCPT ); Mon, 5 Mar 2018 06:48:32 -0500 Received: from mga04.intel.com ([192.55.52.120]:31704 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932891AbeCELsb (ORCPT ); Mon, 5 Mar 2018 06:48:31 -0500 X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,426,1515484800"; d="scan'208";a="32476668" Date: Mon, 5 Mar 2018 19:48:27 +0800 From: Aaron Lu To: Vlastimil Babka Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180305114826.GA2184@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180305114159.GA32573@intel.com> User-Agent: Mutt/1.9.1 (2017-09-22) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 05, 2018 at 07:41:59PM +0800, Aaron Lu wrote: > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: > > On 03/01/2018 03:00 PM, Michal Hocko wrote: > > > On Thu 01-03-18 14:28:45, Aaron Lu wrote: > > >> When a page is freed back to the global pool, its buddy will be checked > > >> to see if it's possible to do a merge. This requires accessing buddy's > > >> page structure and that access could take a long time if it's cache cold. > > >> > > >> This patch adds a prefetch to the to-be-freed page's buddy outside of > > >> zone->lock in hope of accessing buddy's page structure later under > > >> zone->lock will be faster. Since we *always* do buddy merging and check > > >> an order-0 page's buddy to try to merge it when it goes into the main > > >> allocator, the cacheline will always come in, i.e. the prefetched data > > >> will never be unused. > > >> > > >> In the meantime, there are two concerns: > > >> 1 the prefetch could potentially evict existing cachelines, especially > > >> for L1D cache since it is not huge; > > >> 2 there is some additional instruction overhead, namely calculating > > >> buddy pfn twice. > > >> > > >> For 1, it's hard to say, this microbenchmark though shows good result but > > >> the actual benefit of this patch will be workload/CPU dependant; > > >> For 2, since the calculation is a XOR on two local variables, it's expected > > >> in many cases that cycles spent will be offset by reduced memory latency > > >> later. This is especially true for NUMA machines where multiple CPUs are > > >> contending on zone->lock and the most time consuming part under zone->lock > > >> is the wait of 'struct page' cacheline of the to-be-freed pages and their > > >> buddies. > > >> > > >> Test with will-it-scale/page_fault1 full load: > > >> > > >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) > > >> v4.16-rc2+ 9034215 7971818 13667135 15677465 > > >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% > > >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9% > > >> Note: this patch's performance improvement percent is against patch2/3. > > > > > > I am really surprised that this has such a big impact. > > > > It's even stranger to me. Struct page is 64 bytes these days, exactly a > > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > > line (that forms an aligned 128 bytes block with the one we touch). > > Which is exactly a order-0 buddy struct page! Maybe that implicit > > prefetching stopped at L2 and explicit goes all the way to L1, can't > > The Intel Architecture Optimization Manual section 7.3.2 says: > > prefetchT0 - fetch data into all cache levels > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 1st, 2nd and 3rd level cache. > > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to > prefetchT1) > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 2nd and 3rd level cache. > > prefetchNTA - fetch data into non-temporal cache close to the processor, > minimizing cache pollution > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: must fetch into 3rd level cache with fast replacement. > > I tried 'prefetcht0' and 'prefetcht2' instead of the default > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about ~~~~~~~ Correction: should be Broadwell here. > the same performance number as prefetchNTA. I had expected prefetchT0 to > deliver a better score if it was indeed due to L1D since prefetchT2 will > not place data into L1 while prefetchT0 will, but looks like it is not > the case here. > > It feels more like the buddy cacheline isn't in any level of the caches > without prefetch for some reason. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753069AbeCFH4B (ORCPT ); Tue, 6 Mar 2018 02:56:01 -0500 Received: from mx2.suse.de ([195.135.220.15]:35524 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750864AbeCFHz7 (ORCPT ); Tue, 6 Mar 2018 02:55:59 -0500 Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock To: Aaron Lu Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> From: Vlastimil Babka Message-ID: Date: Tue, 6 Mar 2018 08:55:57 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: <20180305114159.GA32573@intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/05/2018 12:41 PM, Aaron Lu wrote: > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: >> On 03/01/2018 03:00 PM, Michal Hocko wrote: >>> >>> I am really surprised that this has such a big impact. >> >> It's even stranger to me. Struct page is 64 bytes these days, exactly a >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache >> line (that forms an aligned 128 bytes block with the one we touch). >> Which is exactly a order-0 buddy struct page! Maybe that implicit >> prefetching stopped at L2 and explicit goes all the way to L1, can't > > The Intel Architecture Optimization Manual section 7.3.2 says: > > prefetchT0 - fetch data into all cache levels > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 1st, 2nd and 3rd level cache. > > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to > prefetchT1) > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: 2nd and 3rd level cache. > > prefetchNTA - fetch data into non-temporal cache close to the processor, > minimizing cache pollution > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > microarchitectures: must fetch into 3rd level cache with fast replacement. > > I tried 'prefetcht0' and 'prefetcht2' instead of the default > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about > the same performance number as prefetchNTA. I had expected prefetchT0 to > deliver a better score if it was indeed due to L1D since prefetchT2 will > not place data into L1 while prefetchT0 will, but looks like it is not > the case here. > > It feels more like the buddy cacheline isn't in any level of the caches > without prefetch for some reason. So the adjacent line prefetch might be disabled? Could you check bios or the MSR mentioned in https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors >> remember. Would that make such a difference? It would be nice to do some >> perf tests with cache counters to see what is really going on... > > Compare prefetchT2 to no-prefetch, I saw these metrics change: > > no-prefetch change prefetchT2 metrics > \ \ > stddev stddev > ------------------------------------------------------------------------ > 0.18 +0.0 0.18 perf-stat.branch-miss-rate% > 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses > 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses > 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references > 3.52 -1.1% 3.48 perf-stat.cpi > 0.02 -0.0 0.01 ±3% perf-stat.dTLB-load-miss-rate% > 8.677e+08 -7.3% 8.048e+08 ±3% perf-stat.dTLB-load-misses > 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate% > 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses > 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores > 6.126e+09 +10.1% 6.745e+09 ±3% perf-stat.iTLB-load-misses > 3464 -8.4% 3172 ±3% perf-stat.instructions-per-iTLB-miss > 0.28 +1.1% 0.29 perf-stat.ipc > 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults > 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads > 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses > 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores > 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults > 2182469 -4.2% 2090977 perf-stat.path-length > > Not sure if this is useful though... Looks like most stats increased in absolute values as the work done increased and this is a time-limited benchmark? Although number of instructions (calculated from itlb misses and insns-per-itlb-miss) shows less than 1% increase, so dunno. And the improvement comes from reduced dTLB-load-misses? That makes no sense for order-0 buddy struct pages which always share a page. And the memmap mapping should use huge pages. BTW what is path-length? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753689AbeCFM0k (ORCPT ); Tue, 6 Mar 2018 07:26:40 -0500 Received: from mga18.intel.com ([134.134.136.126]:13330 "EHLO mga18.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753672AbeCFM0i (ORCPT ); Tue, 6 Mar 2018 07:26:38 -0500 X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.47,431,1515484800"; d="scan'208";a="34998970" Date: Tue, 6 Mar 2018 20:27:33 +0800 From: Aaron Lu To: Vlastimil Babka Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , Matthew Wilcox , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180306122733.GA9664@intel.com> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote: > On 03/05/2018 12:41 PM, Aaron Lu wrote: > > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote: > >> On 03/01/2018 03:00 PM, Michal Hocko wrote: > >>> > >>> I am really surprised that this has such a big impact. > >> > >> It's even stranger to me. Struct page is 64 bytes these days, exactly a > >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache > >> line (that forms an aligned 128 bytes block with the one we touch). > >> Which is exactly a order-0 buddy struct page! Maybe that implicit > >> prefetching stopped at L2 and explicit goes all the way to L1, can't > > > > The Intel Architecture Optimization Manual section 7.3.2 says: > > > > prefetchT0 - fetch data into all cache levels > > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > > microarchitectures: 1st, 2nd and 3rd level cache. > > > > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to > > prefetchT1) > > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > > microarchitectures: 2nd and 3rd level cache. > > > > prefetchNTA - fetch data into non-temporal cache close to the processor, > > minimizing cache pollution > > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer > > microarchitectures: must fetch into 3rd level cache with fast replacement. > > > > I tried 'prefetcht0' and 'prefetcht2' instead of the default > > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about > > the same performance number as prefetchNTA. I had expected prefetchT0 to > > deliver a better score if it was indeed due to L1D since prefetchT2 will > > not place data into L1 while prefetchT0 will, but looks like it is not > > the case here. > > > > It feels more like the buddy cacheline isn't in any level of the caches > > without prefetch for some reason. > > So the adjacent line prefetch might be disabled? Could you check bios or > the MSR mentioned in > https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors root@lkp-bdw-ep2 ~# rdmsr 0x1a4 0 Looks like this feature isn't disabled(the doc you linked says value 1 means disable). > >> remember. Would that make such a difference? It would be nice to do some > >> perf tests with cache counters to see what is really going on... > > > > Compare prefetchT2 to no-prefetch, I saw these metrics change: > > > > no-prefetch change prefetchT2 metrics > > \ \ > > stddev stddev > > ------------------------------------------------------------------------ > > 0.18 +0.0 0.18 perf-stat.branch-miss-rate% > > 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses > > 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses > > 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references > > 3.52 -1.1% 3.48 perf-stat.cpi > > 0.02 -0.0 0.01 ±3% perf-stat.dTLB-load-miss-rate% > > 8.677e+08 -7.3% 8.048e+08 ±3% perf-stat.dTLB-load-misses > > 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate% > > 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses > > 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores > > 6.126e+09 +10.1% 6.745e+09 ±3% perf-stat.iTLB-load-misses > > 3464 -8.4% 3172 ±3% perf-stat.instructions-per-iTLB-miss > > 0.28 +1.1% 0.29 perf-stat.ipc > > 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults > > 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads > > 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses > > 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores > > 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults > > 2182469 -4.2% 2090977 perf-stat.path-length > > > > Not sure if this is useful though... > > Looks like most stats increased in absolute values as the work done > increased and this is a time-limited benchmark? Although number of Yes it is. > instructions (calculated from itlb misses and insns-per-itlb-miss) shows > less than 1% increase, so dunno. And the improvement comes from reduced > dTLB-load-misses? That makes no sense for order-0 buddy struct pages > which always share a page. And the memmap mapping should use huge pages. THP is disabled to stress order 0 pages(should have mentioned this in patch's description, sorry about this). > BTW what is path-length? It's the instruction path length: the number of machine code instructions required to execute a section of a computer program. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753601AbeCFMxI (ORCPT ); Tue, 6 Mar 2018 07:53:08 -0500 Received: from bombadil.infradead.org ([198.137.202.133]:34034 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753223AbeCFMxG (ORCPT ); Tue, 6 Mar 2018 07:53:06 -0500 Date: Tue, 6 Mar 2018 04:53:04 -0800 From: Matthew Wilcox To: Aaron Lu Cc: Vlastimil Babka , Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Huang Ying , Dave Hansen , Kemi Wang , Tim Chen , Andi Kleen , Mel Gorman , David Rientjes Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock Message-ID: <20180306125303.GA13722@bombadil.infradead.org> References: <20180301062845.26038-1-aaron.lu@intel.com> <20180301062845.26038-4-aaron.lu@intel.com> <20180301140044.GK15057@dhcp22.suse.cz> <20180305114159.GA32573@intel.com> <20180306122733.GA9664@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180306122733.GA9664@intel.com> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 06, 2018 at 08:27:33PM +0800, Aaron Lu wrote: > On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote: > > So the adjacent line prefetch might be disabled? Could you check bios or > > the MSR mentioned in > > https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors > > root@lkp-bdw-ep2 ~# rdmsr 0x1a4 > 0 Technically 0x1a4 is per-core, so you should run rdmsr -a 0x1a4 in order to check all the cores. But I can't imagine they're being set differently on each core. > > instructions (calculated from itlb misses and insns-per-itlb-miss) shows > > less than 1% increase, so dunno. And the improvement comes from reduced > > dTLB-load-misses? That makes no sense for order-0 buddy struct pages > > which always share a page. And the memmap mapping should use huge pages. > > THP is disabled to stress order 0 pages(should have mentioned this in > patch's description, sorry about this). THP isn't related to memmap; the kernel uses huge pages (usually the 1G pages) in order to map its own memory.