From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mel Gorman
Subject: [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations
Date: Fri, 18 Apr 2014 15:50:27 +0100
Message-ID: <1397832643-14275-1-git-send-email-mgorman@suse.de>
Cc: Linux-FSDevel
To: Linux-MM
Return-path:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

I was investigating a performance bug that looked like dd to tmpfs had
regressed. The bulk of the problem turned out to be a difference in Kconfig
but it got me looking at the unnecessary overhead in tmpfs, mark_page_accessed
and parts of the allocator. This series is the result.

The primary test workload was dd to a tmpfs file that was 1/10th the size of
memory so that dirty balancing and reclaim should not be factors.

loopdd Throughput
                     3.15.0-rc1            3.15.0-rc1
                        vanilla        microopt-v1r11
Min         3993.6000 (  0.00%)   4096.0000 (  2.56%)
Mean        4766.7200 (  0.00%)   4896.4267 (  2.72%)
Stddev       164.5053 (  0.00%)    167.7316 (  1.96%)
Max         4812.8000 (  0.00%)   5120.0000 (  6.38%)

Respectable increase in throughput. The figures are misleading though because
dd reports in GB/sec so there is a lot of noise. The actual time to completion
is easier to see.

loopdd Time
                     3.15.0-rc1            3.15.0-rc1
                        vanilla        microopt-v1r11
Min    time    0.3521 (  0.00%)      0.3317 (  5.80%)
Mean   time    0.3570 (  0.00%)      0.3458 (  3.14%)
Stddev time    0.0140 (  0.00%)      0.0112 ( 20.59%)
Max    time    0.4230 (  0.00%)      0.4083 (  3.49%)

The time to dd the data is noticeably reduced.

          3.15.0-rc1      3.15.0-rc1
             vanilla  microopt-v1r11
User           10.86           10.78
System         70.21           67.12
Elapsed        92.43           89.42

And the system CPU overhead is lower.

A series of tests against various filesystems as well as a general benchmark
are still running but I thought I would send the series out as-is for comment.

Documentation/sysctl/vm.txt | 17 ++-- arch/ia64/include/asm/topology.h | 3 +- arch/powerpc/include/asm/topology.h | 8 +- include/linux/cpuset.h | 29 +++++++ include/linux/mmzone.h | 14 ++- include/linux/page-flags.h | 2 + include/linux/pageblock-flags.h | 18 +++- include/linux/swap.h | 7 +- include/linux/topology.h | 3 +- kernel/cpuset.c | 8 +- mm/filemap.c | 58 ++++++++----- mm/page_alloc.c | 164 ++++++++++++++++++++---------------- mm/shmem.c | 8 +- mm/swap.c | 13 ++- 14 files changed, 226 insertions(+), 126 deletions(-) -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 01/16] mm: Disable zone_reclaim_mode by default Date: Fri, 18 Apr 2014 15:50:28 +0100 Message-ID: <1397832643-14275-2-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. 
Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Zhang Yanfei --- Documentation/sysctl/vm.txt | 17 +++++++++-------- arch/ia64/include/asm/topology.h | 3 ++- arch/powerpc/include/asm/topology.h | 8 ++------ include/linux/topology.h | 3 ++- mm/page_alloc.c | 2 -- 5 files changed, 15 insertions(+), 18 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index dd9d0e3..5b6da0f 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -772,16 +772,17 @@ This is value ORed together of 2 = Zone reclaim writes dirty pages out 4 = Zone reclaim swaps pages -zone_reclaim_mode is set during bootup to 1 if it is determined that pages -from remote zones will cause a measurable performance reduction. The -page allocator will then reclaim easily reusable pages (those page -cache pages that are currently not used) before allocating off node pages. - -It may be beneficial to switch off zone reclaim if the system is -used for a file server and all of memory should be used for caching files -from disk. In that case the caching effect is more important than +zone_reclaim_mode is disabled by default. For file servers or workloads +that benefit from having their data cached, zone_reclaim_mode should be +left disabled as the caching effect is likely to be more important than data locality. +zone_reclaim may be enabled if it's known that the workload is partitioned +such that each partition fits within a NUMA node and that accessing remote +memory would cause a measurable performance reduction. The page allocator +will then reclaim easily reusable pages (those page cache pages that are +currently not used) before allocating off node pages. + Allowing zone reclaim to write out pages stops processes that are writing large amounts of data from dirtying pages on other nodes. 
Zone reclaim will write out dirty pages if a zone fills up and so effectively diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index 5cb55a1..3555fdd 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -21,7 +21,8 @@ #define PENALTY_FOR_NODE_WITH_CPUS 255 /* - * Distance above which we begin to use zone reclaim + * Nodes within this distance are eligible for reclaim by zone_reclaim() when + * zone_reclaim_mode is enabled. */ #define RECLAIM_DISTANCE 15 diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index c920215..6c8a8c5 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -9,12 +9,8 @@ struct device_node; #ifdef CONFIG_NUMA /* - * Before going off node we want the VM to try and reclaim from the local - * node. It does this if the remote distance is larger than RECLAIM_DISTANCE. - * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of - * 20, we never reclaim and go off node straight away. - * - * To fix this we choose a smaller value of RECLAIM_DISTANCE. + * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that + * all zones on all nodes will be eligible for zone_reclaim(). */ #define RECLAIM_DISTANCE 10 diff --git a/include/linux/topology.h b/include/linux/topology.h index 7062330..53261e2 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -58,7 +58,8 @@ int arch_update_cpu_topology(void); /* * If the distance between nodes in a system is larger than RECLAIM_DISTANCE * (in whatever arch specific measurement units returned by node_distance()) - * then switch on zone reclaim on boot. + * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim() + * on nodes within this distance. 
*/ #define RECLAIM_DISTANCE 30 #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5dba293..628f1e7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1860,8 +1860,6 @@ static void __paginginit init_zone_allows_reclaim(int nid) for_each_node_state(i, N_MEMORY) if (node_distance(nid, i) <= RECLAIM_DISTANCE) node_set(i, NODE_DATA(nid)->reclaim_nodes); - else - zone_reclaim_mode = 1; } #else /* CONFIG_NUMA */ -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 02/16] mm: page_alloc: Do not cache reclaim distances Date: Fri, 18 Apr 2014 15:50:29 +0100 Message-ID: <1397832643-14275-3-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52085 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751532AbaDROup (ORCPT ); Fri, 18 Apr 2014 10:50:45 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim() path is already slow and it is the path that takes the hit by computing the distance on demand. 
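As an illustration only (not part of the patch), the idea of replacing a cached per-node mask with an on-demand comparison can be sketched in userspace C. The distance table, NR_NODES and node_allows_reclaim() are hypothetical names invented for this sketch; RECLAIM_DISTANCE uses the generic value of 30 and the strict "<" comparison mirrors the new zone_allows_reclaim() in the diff.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4              /* hypothetical machine size */
#define RECLAIM_DISTANCE 30     /* generic default used in this series */

/* Hypothetical NUMA distance table: two pairs of close nodes. */
static const int toy_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Stand-in for the kernel's node_distance(); assumption for this sketch. */
static int node_distance(int a, int b)
{
	return toy_distance[a][b];
}

/*
 * Computed on every call instead of being looked up in a cached nodemask.
 * Nothing is stored per node, so machines that never enable zone reclaim
 * carry no extra state.
 */
static bool node_allows_reclaim(int local_nid, int nid)
{
	return node_distance(local_nid, nid) < RECLAIM_DISTANCE;
}
```

With this table, nodes 0 and 1 may reclaim from each other while node 2 is too far from node 0 to qualify.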
Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Zhang Yanfei --- include/linux/mmzone.h | 1 - mm/page_alloc.c | 18 ++---------------- 2 files changed, 2 insertions(+), 17 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fac5509..c1dbe0b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -763,7 +763,6 @@ typedef struct pglist_data { unsigned long node_spanned_pages; /* total size of physical page range, including holes */ int node_id; - nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */ wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 628f1e7..3c8200c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1850,16 +1850,8 @@ static bool zone_local(struct zone *local_zone, struct zone *zone) static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) { - return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes); -} - -static void __paginginit init_zone_allows_reclaim(int nid) -{ - int i; - - for_each_node_state(i, N_MEMORY) - if (node_distance(nid, i) <= RECLAIM_DISTANCE) - node_set(i, NODE_DATA(nid)->reclaim_nodes); + return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < + RECLAIM_DISTANCE; } #else /* CONFIG_NUMA */ @@ -1892,10 +1884,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) { return true; } - -static inline void init_zone_allows_reclaim(int nid) -{ -} #endif /* CONFIG_NUMA */ /* @@ -4919,8 +4907,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, pgdat->node_id = nid; pgdat->node_start_pfn = node_start_pfn; - if (node_state(nid, N_MEMORY)) - init_zone_allows_reclaim(nid); #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); #endif -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 
03/16] mm: page_alloc: Do not update zlc unless the zlc is active Date: Fri, 18 Apr 2014 15:50:30 +0100 Message-ID: <1397832643-14275-4-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52088 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751691AbaDROup (ORCPT ); Fri, 18 Apr 2014 10:50:45 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: The zlc is used on NUMA machines to quickly skip over zones that are full. However it is always updated, even for the first zone scanned when the zlc might not even be active. As it's a write to a bitmap that potentially bounces cache line it's deceptively expensive and most machines will not care. Only update the zlc if it was active. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3c8200c5..d8c9c4a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2030,7 +2030,7 @@ try_this_zone: if (page) break; this_zone_full: - if (IS_ENABLED(CONFIG_NUMA)) + if (IS_ENABLED(CONFIG_NUMA) && zlc_active) zlc_mark_zone_full(zonelist, z); } -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 04/16] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full" Date: Fri, 18 Apr 2014 15:50:31 +0100 Message-ID: <1397832643-14275-5-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52091 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753121AbaDROuq (ORCPT ); Fri, 18 Apr 2014 10:50:46 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: 
linux-fsdevel-owner@vger.kernel.org List-ID: If a zone cannot be used for a dirty page then it gets marked "full" which is cached in the zlc and later potentially skipped by allocation requests that have nothing to do with dirty zones. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d8c9c4a..ad702e9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1962,7 +1962,7 @@ zonelist_scan: */ if ((alloc_flags & ALLOC_WMARK_LOW) && (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone)) - goto this_zone_full; + continue; mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; if (!zone_watermark_ok(zone, order, mark, -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 09/16] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path Date: Fri, 18 Apr 2014 15:50:36 +0100 Message-ID: <1397832643-14275-10-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52102 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753531AbaDROus (ORCPT ); Fri, 18 Apr 2014 10:50:48 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: ALLOC_NO_WATERMARKS is set in a few cases. Always by kswapd, always for __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these cases is a relatively rare event but the ALLOC_NO_WATERMARKS check is an unlikely branch in the fast path. This patch moves the check out of the fast path so that it is only made after it has been determined that the watermarks have not been met. This helps the common fast path at the cost of making the slow path slower and hitting kswapd with a performance cost. It's a reasonable tradeoff. 
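As an illustration only (not part of the patch), the reordering can be sketched in userspace C: the rare flag is tested only after the common watermark test has failed. struct toy_zone, toy_zone_usable(), the flag value and the flag_checks counter are all hypothetical and exist only to make the effect observable.

```c
#include <assert.h>
#include <stdbool.h>

#define ALLOC_NO_WATERMARKS 0x04	/* illustrative value for this sketch */

/* Hypothetical, heavily simplified stand-in for struct zone. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long watermark;
};

/* Counts how often the rare-flag branch is evaluated. */
static int flag_checks;

static bool toy_zone_usable(const struct toy_zone *z, int alloc_flags)
{
	/* Fast path: the watermark is met, the rare flag is never examined. */
	if (z->free_pages > z->watermark)
		return true;

	/* Slow path only: callers allowed to ignore watermarks pay here. */
	flag_checks++;
	if (alloc_flags & ALLOC_NO_WATERMARKS)
		return true;

	return false;
}
```

A caller with ALLOC_NO_WATERMARKS set now also pays for the watermark comparison before being let through, which is the slow-path cost the changelog accepts.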
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 770735a..737577c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1930,9 +1930,6 @@ zonelist_scan: (alloc_flags & ALLOC_CPUSET) && !cpuset_zone_allowed_softwall(zone, gfp_mask)) continue; - BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); - if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) - goto try_this_zone; /* * Distribute pages in proportion to the individual * zone size to ensure fair page aging. The zone a @@ -1979,6 +1976,11 @@ zonelist_scan: classzone_idx, alloc_flags)) { int ret; + /* Checked here to keep the fast path fast */ + BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); + if (alloc_flags & ALLOC_NO_WATERMARKS) + goto try_this_zone; + if (IS_ENABLED(CONFIG_NUMA) && !did_zlc_setup && nr_online_nodes > 1) { /* -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 12/16] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Date: Fri, 18 Apr 2014 15:50:39 +0100 Message-ID: <1397832643-14275-13-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52091 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753716AbaDROut (ORCPT ); Fri, 18 Apr 2014 10:50:49 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: shmem_getpage_gfp uses an atomic operation to set the SwapBacked flag before the page is even added to the LRU or visible. This is unnecessary as what could it possibly race against? Use an unlocked variant. 
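As an illustration only (not part of the patch), the difference between the locked and unlocked flag setters can be sketched with C11 atomics in userspace. struct toy_page, the bit position and both helper names are hypothetical; the kernel's SetPageSwapBacked()/__SetPageSwapBacked() are generated by the page-flags macros and are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_PG_SWAPBACKED (1UL << 3)	/* illustrative bit, not the kernel's */

/* Hypothetical stand-in for struct page with only a flags word. */
struct toy_page {
	atomic_ulong flags;
};

/* Atomic read-modify-write: safe when other CPUs may touch the flags. */
static void toy_set_flag(struct toy_page *p, unsigned long bit)
{
	atomic_fetch_or(&p->flags, bit);
}

/*
 * Separate relaxed load and store: cheaper, but a concurrent update between
 * the two would be lost. Only valid while the page is private to the caller,
 * for example before it has been published on any list.
 */
static void toy_set_flag_nonatomic(struct toy_page *p, unsigned long bit)
{
	unsigned long v = atomic_load_explicit(&p->flags, memory_order_relaxed);

	atomic_store_explicit(&p->flags, v | bit, memory_order_relaxed);
}
```

The sketch assumes the caller owns the only reference to the page when the non-atomic helper is used, which is the condition the changelog relies on.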
Signed-off-by: Mel Gorman --- include/linux/page-flags.h | 1 + mm/shmem.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d1fe1a7..4d4b39a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -208,6 +208,7 @@ PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ PAGEFLAG(SavePinned, savepinned); /* Xen */ PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) + __SETPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) diff --git a/mm/shmem.c b/mm/shmem.c index 9f70e02..f47fb38 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1132,7 +1132,7 @@ repeat: goto decused; } - SetPageSwapBacked(page); + __SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_charge_file(page, current->mm, gfp & GFP_RECLAIM_MASK); -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 07/16] mm: page_alloc: Only check the zone id check if pages are buddies Date: Fri, 18 Apr 2014 15:50:34 +0100 Message-ID: <1397832643-14275-8-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52098 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753418AbaDROur (ORCPT ); Fri, 18 Apr 2014 10:50:47 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: A node/zone index is used to check if pages are compatible for merging but this happens unconditionally even if the buddy page is not free. Defer the calculation as long as possible. Ideally we would check the zone boundary but nodes can overlap. 
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 88a6dac..c5933a5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -508,16 +508,26 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, if (!pfn_valid_within(page_to_pfn(buddy))) return 0; - if (page_zone_id(page) != page_zone_id(buddy)) - return 0; - if (page_is_guard(buddy) && page_order(buddy) == order) { VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy); + + if (page_zone_id(page) != page_zone_id(buddy)) + return 0; + return 1; } if (PageBuddy(buddy) && page_order(buddy) == order) { VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy); + + /* + * zone check is done late to avoid uselessly + * calculating zone/node ids for pages that could + * never merge. + */ + if (page_zone_id(page) != page_zone_id(buddy)) + return 0; + return 1; } return 0; -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 13/16] mm: Do not use atomic operations when releasing pages Date: Fri, 18 Apr 2014 15:50:40 +0100 Message-ID: <1397832643-14275-14-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52093 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753770AbaDROuu (ORCPT ); Fri, 18 Apr 2014 10:50:50 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: There should be no references to the page any more and a parallel mark should not be reordered against us. Use the non-locked variant to clear page active. 
Signed-off-by: Mel Gorman --- mm/swap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swap.c b/mm/swap.c index 9ce43ba..fed4caf 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, int cold) } /* Clear Active bit in case of parallel mark_page_accessed */ - ClearPageActive(page); + __ClearPageActive(page); list_add(&page->lru, &pages_to_free); } -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 10/16] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps Date: Fri, 18 Apr 2014 15:50:37 +0100 Message-ID: <1397832643-14275-11-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52106 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753666AbaDROut (ORCPT ); Fri, 18 Apr 2014 10:50:49 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: The test_bit operations in get/set pageblock flags are expensive. This patch reads the bitmap on a word basis and uses shifts and masks to isolate the bits of interest. Similarly, masks are used to update a local copy of the bitmap word, and cmpxchg is then used to write it back if no other changes have been made in parallel. In a test running dd onto tmpfs the overhead of the pageblock-related functions went from 1.27% in profiles to 0.5%. 
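As an illustration only (not part of the patch), the word-based read and the compare-and-exchange update can be sketched in userspace C with C11 atomics. get_bits() and set_bits() are hypothetical names. For simplicity the sketch numbers bits from the least significant end of each word, whereas the patch shifts from the most significant end, and it assumes a group of bits never straddles a word boundary and is narrower than a word.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Read nr_bits flags starting at bitidx with one word load, shift and mask. */
static unsigned long get_bits(atomic_ulong *bitmap, unsigned long bitidx,
			      unsigned long nr_bits)
{
	unsigned long mask = (1UL << nr_bits) - 1;
	unsigned long word = atomic_load(&bitmap[bitidx / BITS_PER_LONG]);

	return (word >> (bitidx % BITS_PER_LONG)) & mask;
}

/*
 * Update nr_bits flags in a local copy of the word, then publish the copy
 * with compare-and-exchange. If another thread changed the word in the
 * meantime, old_word is refreshed by the failed exchange and we retry.
 */
static void set_bits(atomic_ulong *bitmap, unsigned long bitidx,
		     unsigned long nr_bits, unsigned long flags)
{
	unsigned long shift = bitidx % BITS_PER_LONG;
	unsigned long mask = ((1UL << nr_bits) - 1) << shift;
	atomic_ulong *wordp = &bitmap[bitidx / BITS_PER_LONG];
	unsigned long old_word = atomic_load(wordp);
	unsigned long new_word;

	do {
		new_word = (old_word & ~mask) | ((flags << shift) & mask);
	} while (!atomic_compare_exchange_weak(wordp, &old_word, new_word));
}
```

The retry loop leaves neighbouring bits in the same word untouched, so concurrent updates to different flag groups in one word are not lost.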
Signed-off-by: Mel Gorman --- include/linux/mmzone.h | 6 +++++- include/linux/pageblock-flags.h | 21 ++++++++++++++++---- mm/page_alloc.c | 43 +++++++++++++++++++++++++---------------- 3 files changed, 48 insertions(+), 22 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c1dbe0b..c97b4bc 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -75,9 +75,13 @@ enum { extern int page_group_by_mobility_disabled; +#define NR_MIGRATETYPE_BITS 3 +#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1) + static inline int get_pageblock_migratetype(struct page *page) { - return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end); + BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2); + return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK); } struct free_area { diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 2ee8cd2..c89ac75 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -30,9 +30,12 @@ enum pageblock_bits { PB_migrate, PB_migrate_end = PB_migrate + 3 - 1, /* 3 bits required for migrate types */ -#ifdef CONFIG_COMPACTION PB_migrate_skip,/* If set the block is skipped by compaction */ -#endif /* CONFIG_COMPACTION */ + + /* + * Assume the bits will always align on a word. If this assumption + * changes then get/set pageblock needs updating. + */ NR_PAGEBLOCK_BITS }; @@ -62,9 +65,19 @@ extern int pageblock_order; /* Forward declaration */ struct page; +unsigned long get_pageblock_flags_mask(struct page *page, + unsigned long nr_flag_bits, + unsigned long mask); + /* Declarations for getting and setting flags. 
See mm/page_alloc.c */ -unsigned long get_pageblock_flags_group(struct page *page, - int start_bitidx, int end_bitidx); +static inline unsigned long get_pageblock_flags_group(struct page *page, + int start_bitidx, int end_bitidx) +{ + unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1; + unsigned long mask = (1 << nr_flag_bits) - 1; + + return get_pageblock_flags_mask(page, nr_flag_bits, mask); +} void set_pageblock_flags_group(struct page *page, unsigned long flags, int start_bitidx, int end_bitidx); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 737577c..6047866 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6012,25 +6012,24 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn) * @end_bitidx: The last bit of interest * returns pageblock_bits flags */ -unsigned long get_pageblock_flags_group(struct page *page, - int start_bitidx, int end_bitidx) +unsigned long get_pageblock_flags_mask(struct page *page, + unsigned long nr_flag_bits, + unsigned long mask) { struct zone *zone; unsigned long *bitmap; - unsigned long pfn, bitidx; - unsigned long flags = 0; - unsigned long value = 1; + unsigned long pfn, bitidx, word_bitidx; + unsigned long word; zone = page_zone(page); pfn = page_to_pfn(page); bitmap = get_pageblock_bitmap(zone, pfn); bitidx = pfn_to_bitidx(zone, pfn); + word_bitidx = bitidx / BITS_PER_LONG; + bitidx &= (BITS_PER_LONG-1); - for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) - if (test_bit(bitidx + start_bitidx, bitmap)) - flags |= value; - - return flags; + word = bitmap[word_bitidx]; + return (word >> (BITS_PER_LONG - (bitidx + nr_flag_bits))) & mask; } /** @@ -6045,20 +6044,30 @@ void set_pageblock_flags_group(struct page *page, unsigned long flags, { struct zone *zone; unsigned long *bitmap; - unsigned long pfn, bitidx; - unsigned long value = 1; + unsigned long pfn, bitidx, word_bitidx; + unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1; + unsigned long mask = (1 << nr_flag_bits) - 1; 
+ unsigned long old_word, new_word; + + BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); zone = page_zone(page); pfn = page_to_pfn(page); bitmap = get_pageblock_bitmap(zone, pfn); bitidx = pfn_to_bitidx(zone, pfn); + word_bitidx = bitidx / BITS_PER_LONG; + bitidx &= (BITS_PER_LONG-1); + VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page); - for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) - if (flags & value) - __set_bit(bitidx + start_bitidx, bitmap); - else - __clear_bit(bitidx + start_bitidx, bitmap); + end_bitidx = bitidx + (end_bitidx - start_bitidx); + mask <<= (BITS_PER_LONG - end_bitidx - 1); + flags <<= (BITS_PER_LONG - end_bitidx - 1); + + do { + old_word = ACCESS_ONCE(bitmap[word_bitidx]); + new_word = (old_word & ~mask) | flags; + } while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word); } /* -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 11/16] mm: page_alloc: Reduce number of times page_to_pfn is called Date: Fri, 18 Apr 2014 15:50:38 +0100 Message-ID: <1397832643-14275-12-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52088 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751992AbaDROut (ORCPT ); Fri, 18 Apr 2014 10:50:49 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: In the free path we calculate page_to_pfn multiple times. Reduce that. 
Signed-off-by: Mel Gorman --- include/linux/mmzone.h | 9 +++++++-- include/linux/pageblock-flags.h | 15 ++++++--------- mm/page_alloc.c | 26 +++++++++++++++----------- 3 files changed, 28 insertions(+), 22 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c97b4bc..14ed8d1 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -78,10 +78,15 @@ extern int page_group_by_mobility_disabled; #define NR_MIGRATETYPE_BITS 3 #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1) -static inline int get_pageblock_migratetype(struct page *page) +#define get_pageblock_migratetype(page) \ + get_pfnblock_flags_mask(page, page_to_pfn(page), \ + NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK) + +static inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn) { BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2); - return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK); + return get_pfnblock_flags_mask(page, pfn, + NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK); } struct free_area { diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index c89ac75..6a9dd5b 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -65,19 +65,16 @@ extern int pageblock_order; /* Forward declaration */ struct page; -unsigned long get_pageblock_flags_mask(struct page *page, +unsigned long get_pfnblock_flags_mask(struct page *page, + unsigned long pfn, unsigned long nr_flag_bits, unsigned long mask); /* Declarations for getting and setting flags. 
See mm/page_alloc.c */ -static inline unsigned long get_pageblock_flags_group(struct page *page, - int start_bitidx, int end_bitidx) -{ - unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1; - unsigned long mask = (1 << nr_flag_bits) - 1; - - return get_pageblock_flags_mask(page, nr_flag_bits, mask); -} +#define get_pageblock_flags_group(page, start_bitidx, end_bitidx) \ + get_pfnblock_flags_mask(page, page_to_pfn(page), \ + end_bitidx - start_bitidx + 1, \ + (1 << (end_bitidx - start_bitidx + 1)) - 1) void set_pageblock_flags_group(struct page *page, unsigned long flags, int start_bitidx, int end_bitidx); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6047866..377e58a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -559,6 +559,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, */ static inline void __free_one_page(struct page *page, + unsigned long pfn, struct zone *zone, unsigned int order, int migratetype) { @@ -575,7 +576,7 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON(migratetype == -1); - page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1); + page_idx = pfn & ((1 << MAX_ORDER) - 1); VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page); @@ -710,7 +711,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, list_del(&page->lru); mt = get_freepage_migratetype(page); /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */ - __free_one_page(page, zone, 0, mt); + __free_one_page(page, page_to_pfn(page), zone, 0, mt); trace_mm_page_pcpu_drain(page, 0, mt); if (likely(!is_migrate_isolate_page(page))) { __mod_zone_page_state(zone, NR_FREE_PAGES, 1); @@ -722,13 +723,15 @@ static void free_pcppages_bulk(struct zone *zone, int count, spin_unlock(&zone->lock); } -static void free_one_page(struct zone *zone, struct page *page, int order, +static void free_one_page(struct zone *zone, + struct page *page, unsigned long pfn, + int order, int migratetype) { 
spin_lock(&zone->lock); zone->pages_scanned = 0; - __free_one_page(page, zone, order, migratetype); + __free_one_page(page, pfn, zone, order, migratetype); if (unlikely(!is_migrate_isolate(migratetype))) __mod_zone_freepage_state(zone, 1 << order, migratetype); spin_unlock(&zone->lock); @@ -765,15 +768,16 @@ static void __free_pages_ok(struct page *page, unsigned int order) { unsigned long flags; int migratetype; + unsigned long pfn = page_to_pfn(page); if (!free_pages_prepare(page, order)) return; local_irq_save(flags); __count_vm_events(PGFREE, 1 << order); - migratetype = get_pageblock_migratetype(page); + migratetype = get_pfnblock_migratetype(page, pfn); set_freepage_migratetype(page, migratetype); - free_one_page(page_zone(page), page, order, migratetype); + free_one_page(page_zone(page), page, pfn, order, migratetype); local_irq_restore(flags); } @@ -1376,12 +1380,13 @@ void free_hot_cold_page(struct page *page, int cold) struct zone *zone = page_zone(page); struct per_cpu_pages *pcp; unsigned long flags; + unsigned long pfn = page_to_pfn(page); int migratetype; if (!free_pages_prepare(page, 0)) return; - migratetype = get_pageblock_migratetype(page); + migratetype = get_pfnblock_migratetype(page, pfn); set_freepage_migratetype(page, migratetype); local_irq_save(flags); __count_vm_event(PGFREE); @@ -1395,7 +1400,7 @@ void free_hot_cold_page(struct page *page, int cold) */ if (migratetype >= MIGRATE_PCPTYPES) { if (unlikely(is_migrate_isolate(migratetype))) { - free_one_page(zone, page, 0, migratetype); + free_one_page(zone, page, pfn, 0, migratetype); goto out; } migratetype = MIGRATE_MOVABLE; @@ -6012,17 +6017,16 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn) * @end_bitidx: The last bit of interest * returns pageblock_bits flags */ -unsigned long get_pageblock_flags_mask(struct page *page, +unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn, unsigned long nr_flag_bits, unsigned long mask) { struct zone 
*zone; unsigned long *bitmap; - unsigned long pfn, bitidx, word_bitidx; + unsigned long bitidx, word_bitidx; unsigned long word; zone = page_zone(page); - pfn = page_to_pfn(page); bitmap = get_pageblock_bitmap(zone, pfn); bitidx = pfn_to_bitidx(zone, pfn); word_bitidx = bitidx / BITS_PER_LONG; -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Date: Fri, 18 Apr 2014 15:50:33 +0100 Message-ID: <1397832643-14275-7-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52095 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753172AbaDROur (ORCPT ); Fri, 18 Apr 2014 10:50:47 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: There is no need to calculate zone_idx(preferred_zone) multiple times or use the pgdat to figure it out. 
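As a rough illustration of the difference, the userspace sketch below models the two lookups. The model_* names and the simplified structures are assumptions made for this example and are not the kernel definitions. The point it tries to show is that deriving the index from the zone goes through the zone's pgdat, while the zonelist reference already carries the index.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins; these are NOT the kernel structures. */
#define MODEL_MAX_NR_ZONES 4

struct model_pgdat;

struct model_zone {
	struct model_pgdat *zone_pgdat;	/* back-pointer to the owning node */
};

struct model_pgdat {
	struct model_zone node_zones[MODEL_MAX_NR_ZONES];
};

/* A zonelist entry caches the zone index next to the zone pointer. */
struct model_zoneref {
	struct model_zone *zone;
	int zone_idx;
};

/* Derive the index through the pgdat: an extra dependent load per call. */
static int model_zone_idx(const struct model_zone *zone)
{
	return (int)(zone - zone->zone_pgdat->node_zones);
}

/* Read the index that is already cached in the zonelist reference. */
static int model_zonelist_zone_idx(const struct model_zoneref *ref)
{
	return ref->zone_idx;
}

/* Initialise a node and build one zoneref pointing at zone number idx. */
static struct model_zoneref model_setup(struct model_pgdat *pgdat, int idx)
{
	struct model_zoneref ref;
	int i;

	assert(idx >= 0 && idx < MODEL_MAX_NR_ZONES);
	for (i = 0; i < MODEL_MAX_NR_ZONES; i++)
		pgdat->node_zones[i].zone_pgdat = pgdat;
	ref.zone = &pgdat->node_zones[idx];
	ref.zone_idx = idx;
	return ref;
}
```

Both lookups give the same answer in this model; the second just avoids touching the zone and its pgdat, so the value can be computed once by the caller and passed down.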
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 43 ++++++++++++++++++++++++------------------- 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3f2a9dd..88a6dac 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1893,17 +1893,15 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) static struct page * get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, struct zonelist *zonelist, int high_zoneidx, int alloc_flags, - struct zone *preferred_zone, int migratetype) + struct zone *preferred_zone, int classzone_idx, int migratetype) { struct zoneref *z; struct page *page = NULL; - int classzone_idx; struct zone *zone; nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ int zlc_active = 0; /* set if using zonelist_cache */ int did_zlc_setup = 0; /* just call zlc_setup() one time */ - classzone_idx = zone_idx(preferred_zone); zonelist_scan: /* * Scan zonelist, looking for a zone with enough free. 
@@ -2160,7 +2158,7 @@ static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int migratetype) + int classzone_idx, int migratetype) { struct page *page; @@ -2178,7 +2176,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) goto out; @@ -2213,7 +2211,7 @@ static struct page * __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone, - int migratetype, bool sync_migration, + int classzone_idx, int migratetype, bool sync_migration, bool *contended_compaction, bool *deferred_compaction, unsigned long *did_some_progress) { @@ -2241,7 +2239,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) { preferred_zone->compact_blockskip_flush = false; compaction_defer_reset(preferred_zone, order, true); @@ -2314,7 +2312,7 @@ static inline struct page * __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone, - int migratetype, unsigned long *did_some_progress) + int classzone_idx, int migratetype, unsigned long *did_some_progress) { struct page *page = NULL; bool drained = false; @@ -2332,7 +2330,8 @@ retry: page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, - preferred_zone, 
migratetype); + preferred_zone, classzone_idx, + migratetype); /* * If an allocation failed after direct reclaim, it could be because @@ -2355,14 +2354,14 @@ static inline struct page * __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int migratetype) + int classzone_idx, int migratetype) { struct page *page; do { page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, ALLOC_NO_WATERMARKS, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (!page && gfp_mask & __GFP_NOFAIL) wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50); @@ -2463,7 +2462,7 @@ static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int migratetype) + int classzone_idx, int migratetype) { const gfp_t wait = gfp_mask & __GFP_WAIT; struct page *page = NULL; @@ -2520,7 +2519,7 @@ rebalance: /* This is the last chance, in general, before the goto nopage. 
*/ page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) goto got_pg; @@ -2535,7 +2534,7 @@ rebalance: page = __alloc_pages_high_priority(gfp_mask, order, zonelist, high_zoneidx, nodemask, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) { goto got_pg; } @@ -2568,6 +2567,7 @@ rebalance: zonelist, high_zoneidx, nodemask, alloc_flags, preferred_zone, + classzone_idx, migratetype, sync_migration, &contended_compaction, &deferred_compaction, @@ -2591,7 +2591,8 @@ rebalance: zonelist, high_zoneidx, nodemask, alloc_flags, preferred_zone, - migratetype, &did_some_progress); + classzone_idx, migratetype, + &did_some_progress); if (page) goto got_pg; @@ -2610,7 +2611,7 @@ rebalance: page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, - migratetype); + classzone_idx, migratetype); if (page) goto got_pg; @@ -2653,6 +2654,7 @@ rebalance: zonelist, high_zoneidx, nodemask, alloc_flags, preferred_zone, + classzone_idx, migratetype, sync_migration, &contended_compaction, &deferred_compaction, @@ -2680,11 +2682,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, { enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; + struct zoneref *preferred_zoneref; struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); unsigned int cpuset_mems_cookie; int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR; struct mem_cgroup *memcg = NULL; + int classzone_idx; gfp_mask &= gfp_allowed_mask; @@ -2714,11 +2718,12 @@ retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); /* The preferred zone is used for statistics later */ - first_zones_zonelist(zonelist, high_zoneidx, + preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx, nodemask ? 
: &cpuset_current_mems_allowed, &preferred_zone); if (!preferred_zone) goto out; + classzone_idx = zonelist_zone_idx(preferred_zoneref); #ifdef CONFIG_CMA if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) @@ -2728,7 +2733,7 @@ retry: /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, alloc_flags, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (unlikely(!page)) { /* * The first pass makes sure allocations are spread @@ -2754,7 +2759,7 @@ retry: gfp_mask = memalloc_noio_flags(gfp_mask); page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); } trace_mm_page_alloc(page, order, gfp_mask, migratetype); -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 05/16] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets Date: Fri, 18 Apr 2014 15:50:32 +0100 Message-ID: <1397832643-14275-6-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org If cpusets are not in use then we still check a global variable on every page allocation. Use jump labels to avoid the overhead. Signed-off-by: Mel Gorman --- include/linux/cpuset.h | 29 +++++++++++++++++++++++++++++ kernel/cpuset.c | 8 ++++++-- mm/page_alloc.c | 3 ++- 3 files changed, 37 insertions(+), 3 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index b19d3dc..9c840e3 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -17,6 +17,35 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? 
*/ +#ifdef HAVE_JUMP_LABEL +extern struct static_key cpusets_enabled_key; +static inline bool cpusets_enabled(void) +{ + return static_key_false(&cpusets_enabled_key); +} +#else +static inline bool cpusets_enabled(void) +{ + return number_of_cpusets > 1; +} +#endif + +static inline void cpuset_inc(void) +{ + number_of_cpusets++; +#ifdef HAVE_JUMP_LABEL + static_key_slow_inc(&cpusets_enabled_key); +#endif +} + +static inline void cpuset_dec(void) +{ + number_of_cpusets--; +#ifdef HAVE_JUMP_LABEL + static_key_slow_dec(&cpusets_enabled_key); +#endif +} + extern int cpuset_init(void); extern void cpuset_init_smp(void); extern void cpuset_update_active_cpus(bool cpu_online); diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 3d54c41..34ada52 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -68,6 +68,10 @@ */ int number_of_cpusets __read_mostly; +#ifdef HAVE_JUMP_LABEL +struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE; +#endif + /* See "Frequency meter" comments, below. */ struct fmeter { @@ -1888,7 +1892,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css) if (is_spread_slab(parent)) set_bit(CS_SPREAD_SLAB, &cs->flags); - number_of_cpusets++; + cpuset_inc(); if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags)) goto out_unlock; @@ -1939,7 +1943,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css) if (is_sched_load_balance(cs)) update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); - number_of_cpusets--; + cpuset_dec(); clear_bit(CS_ONLINE, &cs->flags); mutex_unlock(&cpuset_mutex); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ad702e9..3f2a9dd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1916,7 +1916,8 @@ zonelist_scan: if (IS_ENABLED(CONFIG_NUMA) && zlc_active && !zlc_zone_worth_trying(zonelist, z, allowednodes)) continue; - if ((alloc_flags & ALLOC_CPUSET) && + if (cpusets_enabled() && + (alloc_flags & ALLOC_CPUSET) && !cpuset_zone_allowed_softwall(zone, gfp_mask)) continue; 
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate Date: Fri, 18 Apr 2014 15:50:43 +0100 Message-ID: <1397832643-14275-17-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52100 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753153AbaDROuv (ORCPT ); Fri, 18 Apr 2014 10:50:51 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: The write_end handler is likely to call SetPageUptodate which is an atomic operation so prefetch the line. Signed-off-by: Mel Gorman --- mm/filemap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index c28f69c..40713da 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2551,6 +2551,9 @@ again: copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); flush_dcache_page(page); + if (!PageUptodate(page)) + prefetchw(&page->flags); + status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata); if (unlikely(status < 0)) -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 15/16] mm: Non-atomically mark page accessed in write_begin where possible Date: Fri, 18 Apr 2014 15:50:42 +0100 Message-ID: <1397832643-14275-16-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52098 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751532AbaDROuv (ORCPT ); Fri, 
18 Apr 2014 10:50:51 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID:

aops->write_begin may allocate a new page and make it visible just to have mark_page_accessed called almost immediately after. Once it's visible, atomic operations are necessary, which is noticeable overhead when writing to an in-memory filesystem like tmpfs but should also be noticeable with fast storage. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may be called before the page is visible and can be done non-atomically.

In this patch, new allocations in grab_cache_page_write_begin() or find_or_create_page() use init_page_accessed() and existing pages use mark_page_accessed(). This places a burden because filesystems need to ensure they either use these helpers or update the helpers they do use to call init_page_accessed() or mark_page_accessed() as appropriate.

There is also a snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced whereas previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change.

In a profiled run measuring dd to tmpfs the overhead of mark_page_accessed was

25142 0.7055 vmlinux-3.15.0-rc1-vanilla vmlinux-3.15.0-rc1-vanilla shmem_write_end
107830 3.0256 vmlinux-3.15.0-rc1-vanilla vmlinux-3.15.0-rc1-vanilla mark_page_accessed

3.73% overall.
With the patch applied, it becomes 118185 3.1712 vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 shmem_write_end 2395 0.0643 vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 init_page_accessed 159 0.0043 vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 mark_page_accessed 3.23% overall. shmem_write_end increases in apparent cost because the SetPageUptodate is now to a cache line that mark_page_accessed had not dirtied for it. Even with that taken into account, it's still fewer atomic operations overall. Signed-off-by: Mel Gorman --- include/linux/page-flags.h | 1 + include/linux/swap.h | 1 + mm/filemap.c | 55 +++++++++++++++++++++++++++------------------- mm/shmem.c | 6 ++++- mm/swap.c | 11 ++++++++++ 5 files changed, 51 insertions(+), 23 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4d4b39a..2093eb7 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -198,6 +198,7 @@ struct page; /* forward declaration */ TESTPAGEFLAG(Locked, locked) PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) + __SETPAGEFLAG(Referenced, referenced) PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4a9ac85..e54312d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -314,6 +314,7 @@ extern void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *head); extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); +extern void init_page_accessed(struct page *page); extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); extern void lru_add_drain_all(void); diff --git a/mm/filemap.c b/mm/filemap.c index 
a82fbe4..c28f69c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1059,24 +1059,31 @@ struct page *find_or_create_page(struct address_space *mapping, int err; repeat: page = find_lock_page(mapping, index); - if (!page) { - page = __page_cache_alloc(gfp_mask); - if (!page) - return NULL; - /* - * We want a regular kernel memory (not highmem or DMA etc) - * allocation for the radix tree nodes, but we need to honour - * the context-specific requirements the caller has asked for. - * GFP_RECLAIM_MASK collects those requirements. - */ - err = add_to_page_cache_lru(page, mapping, index, - (gfp_mask & GFP_RECLAIM_MASK)); - if (unlikely(err)) { - page_cache_release(page); - page = NULL; - if (err == -EEXIST) - goto repeat; - } + if (page) { + mark_page_accessed(page); + return page; + } + + page = __page_cache_alloc(gfp_mask); + if (!page) + return NULL; + + /* Init accessed to avoid atomic mark_page_accessed later */ + init_page_accessed(page); + + /* + * We want a regular kernel memory (not highmem or DMA etc) + * allocation for the radix tree nodes, but we need to honour + * the context-specific requirements the caller has asked for. + * GFP_RECLAIM_MASK collects those requirements. 
+ */ + err = add_to_page_cache_lru(page, mapping, index, + (gfp_mask & GFP_RECLAIM_MASK)); + if (unlikely(err)) { + page_cache_release(page); + page = NULL; + if (err == -EEXIST) + goto repeat; } return page; } @@ -2372,7 +2379,6 @@ int pagecache_write_end(struct file *file, struct address_space *mapping, { const struct address_space_operations *aops = mapping->a_ops; - mark_page_accessed(page); return aops->write_end(file, mapping, pos, len, copied, page, fsdata); } EXPORT_SYMBOL(pagecache_write_end); @@ -2466,12 +2472,18 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, gfp_notmask = __GFP_FS; repeat: page = find_lock_page(mapping, index); - if (page) + if (page) { + mark_page_accessed(page); goto found; + } page = __page_cache_alloc(gfp_mask & ~gfp_notmask); if (!page) return NULL; + + /* Init accessed to avoid atomic mark_page_accessed later */ + init_page_accessed(page); + status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL & ~gfp_notmask); if (unlikely(status)) { @@ -2530,7 +2542,7 @@ again: status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata); - if (unlikely(status)) + if (unlikely(status < 0)) break; if (mapping_writably_mapped(mapping)) @@ -2539,7 +2551,6 @@ again: copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); flush_dcache_page(page); - mark_page_accessed(page); status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata); if (unlikely(status < 0)) diff --git a/mm/shmem.c b/mm/shmem.c index f47fb38..700a4ad 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1372,9 +1372,13 @@ shmem_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) { + int ret; struct inode *inode = mapping->host; pgoff_t index = pos >> PAGE_CACHE_SHIFT; - return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); + ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); + if (*pagep) + 
init_page_accessed(*pagep); + return ret; } static int diff --git a/mm/swap.c b/mm/swap.c index fed4caf..2490dfe 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -583,6 +583,17 @@ void mark_page_accessed(struct page *page) EXPORT_SYMBOL(mark_page_accessed); /* + * Used to mark_page_accessed(page) that is not visible yet and when it is + * still safe to use non-atomic ops + */ +void init_page_accessed(struct page *page) +{ + if (!PageReferenced(page)) + __SetPageReferenced(page); +} +EXPORT_SYMBOL(init_page_accessed); + +/* * Queue the page for addition to the LRU via pagevec. The decision on whether * to add the page to the [in]active [file|anon] list is deferred until the * pagevec is drained. This gives a chance for the caller of __lru_cache_add() -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once Date: Fri, 18 Apr 2014 15:50:35 +0100 Message-ID: <1397832643-14275-9-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52100 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753449AbaDROus (ORCPT ); Fri, 18 Apr 2014 10:50:48 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: Currently it's calculated once per zone in the zonelist. 
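A minimal sketch of the idea, assuming a simplified userspace model: the flag test does not depend on the zone being scanned, so it can be evaluated once before the loop. The model_* names and flag values below are made up for this example and do not match the kernel's; only the shape of the consider_zone_dirty check follows the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for illustration only; not the kernel's. */
#define MODEL_ALLOC_WMARK_LOW 0x01u
#define MODEL_GFP_WRITE       0x02u

struct model_zone {
	bool dirty_ok;	/* stand-in for the result of zone_dirty_ok() */
};

/*
 * Return the index of the first usable zone, or -1 if none qualifies.
 * The flag test is loop-invariant, so it is computed once up front
 * instead of once per zone.
 */
static int model_first_usable_zone(const struct model_zone *zones, size_t nr,
				   unsigned int alloc_flags,
				   unsigned int gfp_mask)
{
	const bool consider_zone_dirty =
		(alloc_flags & MODEL_ALLOC_WMARK_LOW) &&
		(gfp_mask & MODEL_GFP_WRITE);
	size_t i;

	assert(zones != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (consider_zone_dirty && !zones[i].dirty_ok)
			continue;	/* skip zones over their dirty limit */
		return (int)i;
	}
	return -1;
}
```

A compiler may or may not hoist such a test on its own; writing it out by hand makes the intent explicit.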
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c5933a5..770735a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1911,6 +1911,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ int zlc_active = 0; /* set if using zonelist_cache */ int did_zlc_setup = 0; /* just call zlc_setup() one time */ + bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) && + (gfp_mask & __GFP_WRITE); zonelist_scan: /* @@ -1969,8 +1971,7 @@ zonelist_scan: * will require awareness of zones in the * dirty-throttling and the flusher threads. */ - if ((alloc_flags & ALLOC_WMARK_LOW) && - (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone)) + if (consider_zone_dirty && !zone_dirty_ok(zone)) continue; mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 14/16] mm: Do not use unnecessary atomic operations when adding pages to the LRU Date: Fri, 18 Apr 2014 15:50:41 +0100 Message-ID: <1397832643-14275-15-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Cc: Linux-FSDevel To: Linux-MM Return-path: Received: from cantor2.suse.de ([195.135.220.15]:52095 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753916AbaDROuu (ORCPT ); Fri, 18 Apr 2014 10:50:50 -0400 In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: When adding pages to the LRU we clear the active bit unconditionally. As the page could be reachable from other paths we cannot use unlocked operations without risk of corruption such as a parallel mark_page_accessed. This patch tests if it is necessary to clear the active flag before using an atomic operation. 
In the unlikely event this races with mark_page_accessed, the consequence is simply that a page may be promoted to the active list that might have been left on the inactive list before the patch. This is a marginal consequence. Signed-off-by: Mel Gorman --- include/linux/swap.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 3507115..4a9ac85 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -329,13 +329,15 @@ extern void add_page_to_unevictable_list(struct page *page); */ static inline void lru_cache_add_anon(struct page *page) { - ClearPageActive(page); + if (PageActive(page)) + ClearPageActive(page); __lru_cache_add(page); } static inline void lru_cache_add_file(struct page *page) { - ClearPageActive(page); + if (PageActive(page)) + ClearPageActive(page); __lru_cache_add(page); } -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andi Kleen Subject: Re: [PATCH 01/16] mm: Disable zone_reclaim_mode by default Date: Fri, 18 Apr 2014 10:26:28 -0700 Message-ID: <87tx9q35x7.fsf@tassilo.jf.intel.com> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-2-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: In-Reply-To: <1397832643-14275-2-git-send-email-mgorman@suse.de> (Mel Gorman's message of "Fri, 18 Apr 2014 15:50:28 +0100") Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Mel Gorman writes: > zone_reclaim_mode causes processes to prefer reclaiming memory from local > node instead of spilling over to other nodes. This made sense initially when > NUMA machines were almost exclusively HPC and the workload was partitioned > into nodes. The NUMA penalties were sufficiently high to justify reclaiming > the memory. 
On current machines and workloads it is often the case that > zone_reclaim_mode destroys performance but not all users know how to detect > this. Non local memory also often destroys performance. > Favour the common case and disable it by default. Users that are > sophisticated enough to know they need zone_reclaim_mode will detect it. While I'm not totally against this change, it will destroy many carefully tuned configurations as the default NUMA behavior may be completely different now. So it seems like a big hammer, and it's not even clear what problem you're exactly solving here. -Andi -- ak@linux.intel.com -- Speaking for myself only -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 03/16] mm: page_alloc: Do not update zlc unless the zlc is active Date: Fri, 18 Apr 2014 13:52:16 -0400 Message-ID: <20140418175216.GA29210@cmpxchg.org> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-4-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <1397832643-14275-4-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 03:50:30PM +0100, Mel Gorman wrote: > The zlc is used on NUMA machines to quickly skip over zones that are full. > However it is always updated, even for the first zone scanned when the > zlc might not even be active. As it's a write to a bitmap that potentially > bounces cache line it's deceptively expensive and most machines will not > care. Only update the zlc if it was active. 
> > Signed-off-by: Mel Gorman Acked-by: Johannes Weiner -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 04/16] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full" Date: Fri, 18 Apr 2014 13:52:56 -0400 Message-ID: <20140418175256.GB29210@cmpxchg.org> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-5-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <1397832643-14275-5-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 03:50:31PM +0100, Mel Gorman wrote: > If a zone cannot be used for a dirty page then it gets marked "full" > which is cached in the zlc and later potentially skipped by allocation > requests that have nothing to do with dirty zones. Urgh. Thanks for the fix. > Signed-off-by: Mel Gorman Acked-by: Johannes Weiner -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Date: Fri, 18 Apr 2014 14:03:09 -0400 Message-ID: <20140418180309.GC29210@cmpxchg.org> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-7-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <1397832643-14275-7-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 03:50:33PM +0100, Mel Gorman wrote: > @@ -2463,7 +2462,7 @@ static inline struct page * > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > struct zonelist *zonelist, enum zone_type high_zoneidx, > nodemask_t *nodemask, struct zone *preferred_zone, > - int migratetype) > + int classzone_idx, int migratetype) > { > const gfp_t wait = gfp_mask & __GFP_WAIT; > struct page *page = NULL; There is another potential update of preferred_zone in this function after which the classzone_idx should probably be refreshed: /* * Find the true preferred zone if the allocation is unconstrained by * cpusets. */ if (!(alloc_flags & ALLOC_CPUSET) && !nodemask) first_zones_zonelist(zonelist, high_zoneidx, NULL, &preferred_zone); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 07/16] mm: page_alloc: Only check the zone id check if pages are buddies Date: Fri, 18 Apr 2014 14:05:12 -0400 Message-ID: <20140418180512.GD29210@cmpxchg.org> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-8-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <1397832643-14275-8-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 03:50:34PM +0100, Mel Gorman wrote: > A node/zone index is used to check if pages are compatible for merging > but this happens unconditionally even if the buddy page is not free. Defer > the calculation as long as possible. Ideally we would check the zone boundary > but nodes can overlap. > > Signed-off-by: Mel Gorman Acked-by: Johannes Weiner -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once Date: Fri, 18 Apr 2014 14:08:36 -0400 Message-ID: <20140418180836.GE29210@cmpxchg.org> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-9-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <1397832643-14275-9-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 03:50:35PM +0100, Mel Gorman wrote: > Currently it's calculated once per zone in the zonelist. 
> > Signed-off-by: Mel Gorman I would have assumed the compiler can detect such a loop invariant... Alas, Acked-by: Johannes Weiner -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 09/16] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path Date: Fri, 18 Apr 2014 14:10:19 -0400 Message-ID: <20140418181019.GF29210@cmpxchg.org> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-10-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <1397832643-14275-10-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 03:50:36PM +0100, Mel Gorman wrote: > ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for > __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these cases > are relatively rare events but the ALLOC_NO_WATERMARK check is an unlikely > branch in the fast path. This patch moves the check out of the fast path > and after it has been determined that the watermarks have not been met. This > helps the common fast path at the cost of making the slow path slower and > hitting kswapd with a performance cost. It's a reasonable tradeoff. > > Signed-off-by: Mel Gorman Acked-by: Johannes Weiner -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 12/16] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Date: Fri, 18 Apr 2014 14:13:00 -0400 Message-ID: <20140418181300.GG29210@cmpxchg.org> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-13-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <1397832643-14275-13-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 03:50:39PM +0100, Mel Gorman wrote: > shmem_getpage_gfp uses an atomic operation to set the SwapBacked field > before it's even added to the LRU or visible. This is unnecessary as what > could it possibly race against? Use an unlocked variant. > > Signed-off-by: Mel Gorman Acked-by: Johannes Weiner From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vlastimil Babka Subject: Re: [PATCH 10/16] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps Date: Fri, 18 Apr 2014 19:16:45 +0200 Message-ID: <53515DFD.4090009@suse.cz> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-11-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: Linux-FSDevel To: Mel Gorman , Linux-MM layout Return-path: In-Reply-To: <1397832643-14275-11-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 04/18/2014 04:50 PM, Mel Gorman wrote: > The test_bit operations in get/set pageblock flags are expensive.
This patch > reads the bitmap on a word basis and use shifts and masks to isolate the bits > of interest. Similarly masks are used to set a local copy of the bitmap and then > use cmpxchg to update the bitmap if there have been no other changes made in > parallel. > > In a test running dd onto tmpfs the overhead of the pageblock-related > functions went from 1.27% in profiles to 0.5%. > > Signed-off-by: Mel Gorman > --- > include/linux/mmzone.h | 6 +++++- > include/linux/pageblock-flags.h | 21 ++++++++++++++++---- > mm/page_alloc.c | 43 +++++++++++++++++++++++++---------------- > 3 files changed, 48 insertions(+), 22 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index c1dbe0b..c97b4bc 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -75,9 +75,13 @@ enum { > > extern int page_group_by_mobility_disabled; > > +#define NR_MIGRATETYPE_BITS 3 > +#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1) > + > static inline int get_pageblock_migratetype(struct page *page) > { > - return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end); > + BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2); > + return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK); > } > > struct free_area { > diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h > index 2ee8cd2..c89ac75 100644 > --- a/include/linux/pageblock-flags.h > +++ b/include/linux/pageblock-flags.h > @@ -30,9 +30,12 @@ enum pageblock_bits { > PB_migrate, > PB_migrate_end = PB_migrate + 3 - 1, > /* 3 bits required for migrate types */ > -#ifdef CONFIG_COMPACTION > PB_migrate_skip,/* If set the block is skipped by compaction */ > -#endif /* CONFIG_COMPACTION */ > + > + /* > + * Assume the bits will always align on a word. If this assumption > + * changes then get/set pageblock needs updating. 
> + */ > NR_PAGEBLOCK_BITS > }; > > @@ -62,9 +65,19 @@ extern int pageblock_order; > /* Forward declaration */ > struct page; > > +unsigned long get_pageblock_flags_mask(struct page *page, > + unsigned long nr_flag_bits, > + unsigned long mask); > + > /* Declarations for getting and setting flags. See mm/page_alloc.c */ > -unsigned long get_pageblock_flags_group(struct page *page, > - int start_bitidx, int end_bitidx); > +static inline unsigned long get_pageblock_flags_group(struct page *page, > + int start_bitidx, int end_bitidx) > +{ > + unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1; > + unsigned long mask = (1 << nr_flag_bits) - 1; > + > + return get_pageblock_flags_mask(page, nr_flag_bits, mask); > +} > void set_pageblock_flags_group(struct page *page, unsigned long flags, > int start_bitidx, int end_bitidx); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 737577c..6047866 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6012,25 +6012,24 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn) > * @end_bitidx: The last bit of interest > * returns pageblock_bits flags > */ > -unsigned long get_pageblock_flags_group(struct page *page, > - int start_bitidx, int end_bitidx) > +unsigned long get_pageblock_flags_mask(struct page *page, > + unsigned long nr_flag_bits, > + unsigned long mask) I don't think this can work with just nr_flag_bits and mask, without taking start_bitidx into account. This probably only works when start_bitidx == 0, which is true for PB_migrate, but not PB_migrate_skip. 
> { > struct zone *zone; > unsigned long *bitmap; > - unsigned long pfn, bitidx; > - unsigned long flags = 0; > - unsigned long value = 1; > + unsigned long pfn, bitidx, word_bitidx; > + unsigned long word; > > zone = page_zone(page); > pfn = page_to_pfn(page); > bitmap = get_pageblock_bitmap(zone, pfn); > bitidx = pfn_to_bitidx(zone, pfn); > + word_bitidx = bitidx / BITS_PER_LONG; > + bitidx &= (BITS_PER_LONG-1); > > - for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) > - if (test_bit(bitidx + start_bitidx, bitmap)) > - flags |= value; > - > - return flags; > + word = bitmap[word_bitidx]; > + return (word >> (BITS_PER_LONG - (bitidx + nr_flag_bits))) & mask; Ugh, so for bitidx == 0, this shifts by 61 bits, so bits 61-63 are read. Now consider this being called by get_pageblock_skip(). That will have nr_flag_bits == 1, so shift by 63 -> bit 63 is read, but you probably wanted bit 60? Or 60-62 for migratetype and 63 for the skip bit. I'm not sure anymore which one matches the old bitmap layout and how endianness plays a role here :) Friday evening... But, changing the order of bits, and 4-bits within words doesn't matter I guess, except making sure that the bitmap is now being allocated aligned to whole words so that we don't read/write past the end of it.
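The shift arithmetic in question can be checked outside the kernel. What follows is a minimal userspace sketch, not the kernel code: it assumes a 64-bit word and the 4 bits per pageblock used by the patch, and the helper names shift_as_posted(), shift_with_start() and read_field() are hypothetical. It shows that ignoring start_bitidx makes the 1-bit skip lookup land on bit 63, inside the 3-bit migratetype field, while including start_bitidx moves it to bit 60.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 64u /* assumption: 64-bit unsigned long, as on x86-64 */

/* Shift computed by the patch as posted: start_bitidx plays no part. */
static unsigned shift_as_posted(unsigned bitidx, unsigned nr_flag_bits)
{
    assert(bitidx + nr_flag_bits <= WORD_BITS);
    return WORD_BITS - (bitidx + nr_flag_bits);
}

/*
 * Hypothetical variant that also accounts for where the field starts
 * inside the pageblock's group of bits (0 for PB_migrate, 3 for
 * PB_migrate_skip in the layout under discussion).
 */
static unsigned shift_with_start(unsigned bitidx, unsigned start_bitidx,
                                 unsigned nr_flag_bits)
{
    assert(bitidx + start_bitidx + nr_flag_bits <= WORD_BITS);
    return WORD_BITS - (bitidx + start_bitidx + nr_flag_bits);
}

/* Extract nr_flag_bits bits from word after shifting right by shift. */
static uint64_t read_field(uint64_t word, unsigned shift,
                           unsigned nr_flag_bits)
{
    assert(nr_flag_bits < WORD_BITS && shift < WORD_BITS);
    uint64_t mask = (UINT64_C(1) << nr_flag_bits) - 1;
    return (word >> shift) & mask;
}
```

With a migratetype of 5 stored at bits 61-63 and the skip bit clear, read_field(UINT64_C(5) << 61, shift_as_posted(0, 1), 1) returns 1, a false "skip", whereas shift_with_start(0, 3, 1) yields a shift of 60 and the same read returns 0. How v2 of the patch actually resolves this is not shown in this thread.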
> } > > /** > @@ -6045,20 +6044,30 @@ void set_pageblock_flags_group(struct page *page, unsigned long flags, > { > struct zone *zone; > unsigned long *bitmap; > - unsigned long pfn, bitidx; > - unsigned long value = 1; > + unsigned long pfn, bitidx, word_bitidx; > + unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1; > + unsigned long mask = (1 << nr_flag_bits) - 1; > + unsigned long old_word, new_word; > + > + BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); > > zone = page_zone(page); > pfn = page_to_pfn(page); > bitmap = get_pageblock_bitmap(zone, pfn ); > bitidx = pfn_to_bitidx(zone, pfn); > + word_bitidx = bitidx / BITS_PER_LONG; > + bitidx &= (BITS_PER_LONG-1); > + > VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page); > > - for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) > - if (flags & value) > - __set_bit(bitidx + start_bitidx, bitmap); > - else > - __clear_bit(bitidx + start_bitidx, bitmap); > + end_bitidx = bitidx + (end_bitidx - start_bitidx); > + mask <<= (BITS_PER_LONG - end_bitidx - 1); > + flags <<= (BITS_PER_LONG - end_bitidx - 1); Again, for bitidx == 0 and migratetype this will shift by 61, for skip bit it will shift by 63 and overlap. Again, start_bitidx is not considered except when subtracted from end_bitidx. It would be also better if the code did not differ so much from the get_ version, which makes it harder to decide they operate on the same bits. > + do { > + old_word = ACCESS_ONCE(bitmap[word_bitidx]); > + new_word = (old_word & ~mask) | flags; > + } while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word); It seems that cmpxchg is not available for SMP that's not x86 :( > } > > /* > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hugh Dickins Subject: Re: [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate Date: Fri, 18 Apr 2014 12:16:23 -0700 (PDT) Message-ID: References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-17-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Linux-MM , Linux-FSDevel To: Mel Gorman Return-path: In-Reply-To: <1397832643-14275-17-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, 18 Apr 2014, Mel Gorman wrote: > The write_end handler is likely to call SetPageUptodate which is an atomic > operation so prefetch the line. > > Signed-off-by: Mel Gorman This one seems a little odd to me: it feels as if you're compensating for your mark_page_accessed() movement, but in too shmem-specific a way. I see write_ends do SetPageUptodate more often than I was expecting (with __block_commit_write() doing so even when PageUptodate already), but even so... Given that the write_end is likely to want to SetPageDirty, and sure to want to clear_bit_unlock(PG_locked, &page->flags), wouldn't it be better and less mysterious just to prefetchw(&page->flags) here unconditionally? (But I'm also afraid that this sets a precedent for an avalanche of dubious prefetchw patches all over.) Hugh > --- > mm/filemap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/filemap.c b/mm/filemap.c > index c28f69c..40713da 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2551,6 +2551,9 @@ again: > copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); > flush_dcache_page(page); > > + if (!PageUptodate(page)) > + prefetchw(&page->flags); > + > status = a_ops->write_end(file, mapping, pos, bytes, copied, > page, fsdata); > if (unlikely(status < 0)) > -- > 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Hansen Subject: Re: [PATCH 01/16] mm: Disable zone_reclaim_mode by default Date: Fri, 18 Apr 2014 14:15:31 -0700 Message-ID: <535195F3.8040009@intel.com> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-2-git-send-email-mgorman@suse.de> <87tx9q35x7.fsf@tassilo.jf.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Linux-MM , Linux-FSDevel To: Andi Kleen , Mel Gorman Return-path: Received: from mga11.intel.com ([192.55.52.93]:29621 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754640AbaDRVPc (ORCPT ); Fri, 18 Apr 2014 17:15:32 -0400 In-Reply-To: <87tx9q35x7.fsf@tassilo.jf.intel.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On 04/18/2014 10:26 AM, Andi Kleen wrote: > Mel Gorman writes: >> Favour the common case and disable it by default. Users that are >> sophisticated enough to know they need zone_reclaim_mode will detect it. > > While I'm not totally against this change, it will destroy many > carefully tuned configurations as the default NUMA behavior may be completely > different now. So it seems like a big hammer, and it's not even clear > what problem you're exactly solving here. I'm not 100% sure what the common case _is_. Folks who want good NUMA affinity are happy now and are happy by default. Folks who want to fill memory with page cache are mad and mad by default, and they're the ones complaining. It's hard to count the happy ones. :) But, on the other hand, the current situation is easy to debug. Someone complains that they have too much free memory, and it ends up being pretty easy to solve just looking at statistics, and things go horribly wrong quickly. If we apply this patch, it's much less obvious when things are going wrong, and we have no statistics to help. 
We'll need to get folks running more things like numatop: https://01.org/numatop That said, as a recipient of angry calls from customers who don't like zone_reclaim_mode, I _do_ think this is the path we should take at the moment. Maybe we'll be reverting it in a few years once all of our customers are angry about lack of NUMA locality. Acked-by: Dave Hansen From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Date: Sat, 19 Apr 2014 12:18:05 +0100 Message-ID: <20140419111805.GB4225@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-7-git-send-email-mgorman@suse.de> <20140418180309.GC29210@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Linux-MM , Linux-FSDevel To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <20140418180309.GC29210@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 02:03:09PM -0400, Johannes Weiner wrote: > On Fri, Apr 18, 2014 at 03:50:33PM +0100, Mel Gorman wrote: > > @@ -2463,7 +2462,7 @@ static inline struct page * > > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > struct zonelist *zonelist, enum zone_type high_zoneidx, > > nodemask_t *nodemask, struct zone *preferred_zone, > > - int migratetype) > > + int classzone_idx, int migratetype) > > { > > const gfp_t wait = gfp_mask & __GFP_WAIT; > > struct page *page = NULL; > > There is another potential update of preferred_zone in this function > after which the classzone_idx should probably be refreshed: > > /* > * Find the true preferred zone if the allocation is unconstrained by > * cpusets. > */ > if (!(alloc_flags & ALLOC_CPUSET) && !nodemask) > first_zones_zonelist(zonelist, high_zoneidx, NULL, > &preferred_zone); Thanks, I'll fix it up for v2. 
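The hazard Johannes describes is the usual one for a cached derived value. A minimal sketch, assuming hypothetical stand-in types (zone_sketch, alloc_ctx) rather than the kernel's struct zone and zonelist code: routing every change of the preferred zone through one helper keeps the cached index from drifting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone; only the index matters here. */
struct zone_sketch {
    int idx;
};

/* Hypothetical allocation context carrying a zone and its cached index. */
struct alloc_ctx {
    const struct zone_sketch *preferred_zone;
    int classzone_idx; /* cached copy of preferred_zone->idx */
};

/*
 * Update the preferred zone and refresh the cached index in one step,
 * so that no caller can change one without the other.
 */
static void set_preferred_zone(struct alloc_ctx *ctx,
                               const struct zone_sketch *zone)
{
    assert(ctx != NULL && zone != NULL);
    ctx->preferred_zone = zone;
    ctx->classzone_idx = zone->idx;
}
```

Assigning ctx->preferred_zone directly, as the slow path effectively does when it recomputes preferred_zone, would leave classzone_idx describing the old zone until something refreshes it.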
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once Date: Sat, 19 Apr 2014 12:19:34 +0100 Message-ID: <20140419111934.GC4225@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-9-git-send-email-mgorman@suse.de> <20140418180836.GE29210@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Linux-MM , Linux-FSDevel To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <20140418180836.GE29210@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Apr 18, 2014 at 02:08:36PM -0400, Johannes Weiner wrote: > On Fri, Apr 18, 2014 at 03:50:35PM +0100, Mel Gorman wrote: > > Currently it's calculated once per zone in the zonelist. > > > > Signed-off-by: Mel Gorman > > I would have assumed the compiler can detect such a loop invariant... > Alas, > Surprisingly it didn't in my case but the benefit of the patch is marginal at best. I can drop it if it makes the code more obscure to people's eyes. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate Date: Sat, 19 Apr 2014 12:23:48 +0100 Message-ID: <20140419112347.GD4225@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> <1397832643-14275-17-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Linux-MM , Linux-FSDevel To: Hugh Dickins Return-path: Received: from cantor2.suse.de ([195.135.220.15]:33641 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751081AbaDSLYl (ORCPT ); Sat, 19 Apr 2014 07:24:41 -0400 Content-Disposition: inline In-Reply-To: Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Fri, Apr 18, 2014 at 12:16:23PM -0700, Hugh Dickins wrote: > On Fri, 18 Apr 2014, Mel Gorman wrote: > > > The write_end handler is likely to call SetPageUptodate which is an atomic > > operation so prefetch the line. > > > > Signed-off-by: Mel Gorman > > This one seems a little odd to me: it feels as if you're compensating > for your mark_page_accessed() movement, Not as such. We take the penalty anyway, it's just a case of when. As the penalty was semi-obviously in one place it seemed like a reasonable thing to do. > but in too shmem-specific a way. > > I see write_ends do SetPageUptodate more often than I was expecting > (with __block_commit_write() doing so even when PageUptodate already), > but even so... > Good point. I'll search for those and clean them up. > Given that the write_end is likely to want to SetPageDirty, and sure > to want to clear_bit_unlock(PG_locked, &page->flags), wouldn't it be > better and less mysterious just to prefetchw(&page->flags) here > unconditionally? > Again, good point. I'm travelling at the moment but will audit the write_end handlers when I get back and see if filesystems generally benefit or if I was aiming at shmem too much. 
> (But I'm also afraid that this sets a precedent for an avalanche of > dubious prefetchw patches all over.) > I'll include figures the next time to see if it's justified. However, even in that case I recognise that not all CPUs treat prefetchw the same and we might still want to drop this patch as a result regardless of what result I see on one test machine. Thanks Hugh -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f49.google.com (mail-ee0-f49.google.com [74.125.83.49]) by kanga.kvack.org (Postfix) with ESMTP id DB6A36B0035 for ; Fri, 18 Apr 2014 10:50:46 -0400 (EDT) Received: by mail-ee0-f49.google.com with SMTP id c41so1687513eek.36 for ; Fri, 18 Apr 2014 07:50:46 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id z42si40611009eel.62.2014.04.18.07.50.45 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:45 -0700 (PDT) From: Mel Gorman Subject: [PATCH 02/16] mm: page_alloc: Do not cache reclaim distances Date: Fri, 18 Apr 2014 15:50:29 +0100 Message-Id: <1397832643-14275-3-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim_mode() path is already slow and it is the path that takes the hit. 
Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Zhang Yanfei --- include/linux/mmzone.h | 1 - mm/page_alloc.c | 18 ++---------------- 2 files changed, 2 insertions(+), 17 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fac5509..c1dbe0b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -763,7 +763,6 @@ typedef struct pglist_data { unsigned long node_spanned_pages; /* total size of physical page range, including holes */ int node_id; - nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */ wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 628f1e7..3c8200c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1850,16 +1850,8 @@ static bool zone_local(struct zone *local_zone, struct zone *zone) static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) { - return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes); -} - -static void __paginginit init_zone_allows_reclaim(int nid) -{ - int i; - - for_each_node_state(i, N_MEMORY) - if (node_distance(nid, i) <= RECLAIM_DISTANCE) - node_set(i, NODE_DATA(nid)->reclaim_nodes); + return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < + RECLAIM_DISTANCE; } #else /* CONFIG_NUMA */ @@ -1892,10 +1884,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) { return true; } - -static inline void init_zone_allows_reclaim(int nid) -{ -} #endif /* CONFIG_NUMA */ /* @@ -4919,8 +4907,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, pgdat->node_id = nid; pgdat->node_start_pfn = node_start_pfn; - if (node_state(nid, N_MEMORY)) - init_zone_allows_reclaim(nid); #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); #endif -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to 
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f53.google.com (mail-ee0-f53.google.com [74.125.83.53]) by kanga.kvack.org (Postfix) with ESMTP id 376A16B0039 for ; Fri, 18 Apr 2014 10:50:47 -0400 (EDT) Received: by mail-ee0-f53.google.com with SMTP id b57so1695535eek.12 for ; Fri, 18 Apr 2014 07:50:46 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id w48si40610352eel.26.2014.04.18.07.50.45 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:45 -0700 (PDT) From: Mel Gorman Subject: [PATCH 03/16] mm: page_alloc: Do not update zlc unless the zlc is active Date: Fri, 18 Apr 2014 15:50:30 +0100 Message-Id: <1397832643-14275-4-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel The zlc is used on NUMA machines to quickly skip over zones that are full. However it is always updated, even for the first zone scanned when the zlc might not even be active. As it's a write to a bitmap that potentially bounces cache line it's deceptively expensive and most machines will not care. Only update the zlc if it was active. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3c8200c5..d8c9c4a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2030,7 +2030,7 @@ try_this_zone: if (page) break; this_zone_full: - if (IS_ENABLED(CONFIG_NUMA)) + if (IS_ENABLED(CONFIG_NUMA) && zlc_active) zlc_mark_zone_full(zonelist, z); } -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f47.google.com (mail-ee0-f47.google.com [74.125.83.47]) by kanga.kvack.org (Postfix) with ESMTP id 5A1506B003A for ; Fri, 18 Apr 2014 10:50:47 -0400 (EDT) Received: by mail-ee0-f47.google.com with SMTP id b15so1682320eek.6 for ; Fri, 18 Apr 2014 07:50:46 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id c48si23130614eeb.7.2014.04.18.07.50.45 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:46 -0700 (PDT) From: Mel Gorman Subject: [PATCH 04/16] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full" Date: Fri, 18 Apr 2014 15:50:31 +0100 Message-Id: <1397832643-14275-5-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel If a zone cannot be used for a dirty page then it gets marked "full" which is cached in the zlc and later potentially skipped by allocation requests that have nothing to do with dirty zones. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d8c9c4a..ad702e9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1962,7 +1962,7 @@ zonelist_scan: */ if ((alloc_flags & ALLOC_WMARK_LOW) && (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone)) - goto this_zone_full; + continue; mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; if (!zone_watermark_ok(zone, order, mark, -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f46.google.com (mail-ee0-f46.google.com [74.125.83.46]) by kanga.kvack.org (Postfix) with ESMTP id E889C6B003B for ; Fri, 18 Apr 2014 10:50:48 -0400 (EDT) Received: by mail-ee0-f46.google.com with SMTP id t10so1662670eei.5 for ; Fri, 18 Apr 2014 07:50:48 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 43si40568740eer.147.2014.04.18.07.50.46 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:47 -0700 (PDT) From: Mel Gorman Subject: [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Date: Fri, 18 Apr 2014 15:50:33 +0100 Message-Id: <1397832643-14275-7-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel There is no need to calculate zone_idx(preferred_zone) multiple times or use the pgdat to figure it out. 
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 43 ++++++++++++++++++++++++------------------- 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3f2a9dd..88a6dac 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1893,17 +1893,15 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) static struct page * get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, struct zonelist *zonelist, int high_zoneidx, int alloc_flags, - struct zone *preferred_zone, int migratetype) + struct zone *preferred_zone, int classzone_idx, int migratetype) { struct zoneref *z; struct page *page = NULL; - int classzone_idx; struct zone *zone; nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ int zlc_active = 0; /* set if using zonelist_cache */ int did_zlc_setup = 0; /* just call zlc_setup() one time */ - classzone_idx = zone_idx(preferred_zone); zonelist_scan: /* * Scan zonelist, looking for a zone with enough free. 
@@ -2160,7 +2158,7 @@ static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int migratetype) + int classzone_idx, int migratetype) { struct page *page; @@ -2178,7 +2176,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) goto out; @@ -2213,7 +2211,7 @@ static struct page * __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone, - int migratetype, bool sync_migration, + int classzone_idx, int migratetype, bool sync_migration, bool *contended_compaction, bool *deferred_compaction, unsigned long *did_some_progress) { @@ -2241,7 +2239,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) { preferred_zone->compact_blockskip_flush = false; compaction_defer_reset(preferred_zone, order, true); @@ -2314,7 +2312,7 @@ static inline struct page * __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone, - int migratetype, unsigned long *did_some_progress) + int classzone_idx, int migratetype, unsigned long *did_some_progress) { struct page *page = NULL; bool drained = false; @@ -2332,7 +2330,8 @@ retry: page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, - preferred_zone, 
migratetype); + preferred_zone, classzone_idx, + migratetype); /* * If an allocation failed after direct reclaim, it could be because @@ -2355,14 +2354,14 @@ static inline struct page * __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int migratetype) + int classzone_idx, int migratetype) { struct page *page; do { page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, ALLOC_NO_WATERMARKS, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (!page && gfp_mask & __GFP_NOFAIL) wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50); @@ -2463,7 +2462,7 @@ static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int migratetype) + int classzone_idx, int migratetype) { const gfp_t wait = gfp_mask & __GFP_WAIT; struct page *page = NULL; @@ -2520,7 +2519,7 @@ rebalance: /* This is the last chance, in general, before the goto nopage. 
*/ page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) goto got_pg; @@ -2535,7 +2534,7 @@ rebalance: page = __alloc_pages_high_priority(gfp_mask, order, zonelist, high_zoneidx, nodemask, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (page) { goto got_pg; } @@ -2568,6 +2567,7 @@ rebalance: zonelist, high_zoneidx, nodemask, alloc_flags, preferred_zone, + classzone_idx, migratetype, sync_migration, &contended_compaction, &deferred_compaction, @@ -2591,7 +2591,8 @@ rebalance: zonelist, high_zoneidx, nodemask, alloc_flags, preferred_zone, - migratetype, &did_some_progress); + classzone_idx, migratetype, + &did_some_progress); if (page) goto got_pg; @@ -2610,7 +2611,7 @@ rebalance: page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, - migratetype); + classzone_idx, migratetype); if (page) goto got_pg; @@ -2653,6 +2654,7 @@ rebalance: zonelist, high_zoneidx, nodemask, alloc_flags, preferred_zone, + classzone_idx, migratetype, sync_migration, &contended_compaction, &deferred_compaction, @@ -2680,11 +2682,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, { enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; + struct zoneref *preferred_zoneref; struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); unsigned int cpuset_mems_cookie; int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR; struct mem_cgroup *memcg = NULL; + int classzone_idx; gfp_mask &= gfp_allowed_mask; @@ -2714,11 +2718,12 @@ retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); /* The preferred zone is used for statistics later */ - first_zones_zonelist(zonelist, high_zoneidx, + preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx, nodemask ? 
: &cpuset_current_mems_allowed, &preferred_zone); if (!preferred_zone) goto out; + classzone_idx = zonelist_zone_idx(preferred_zoneref); #ifdef CONFIG_CMA if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) @@ -2728,7 +2733,7 @@ retry: /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, alloc_flags, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); if (unlikely(!page)) { /* * The first pass makes sure allocations are spread @@ -2754,7 +2759,7 @@ retry: gfp_mask = memalloc_noio_flags(gfp_mask); page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, - preferred_zone, migratetype); + preferred_zone, classzone_idx, migratetype); } trace_mm_page_alloc(page, order, gfp_mask, migratetype); -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f50.google.com (mail-ee0-f50.google.com [74.125.83.50]) by kanga.kvack.org (Postfix) with ESMTP id 402756B003A for ; Fri, 18 Apr 2014 10:50:49 -0400 (EDT) Received: by mail-ee0-f50.google.com with SMTP id c13so1724918eek.9 for ; Fri, 18 Apr 2014 07:50:48 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id t3si40600065eeg.91.2014.04.18.07.50.47 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:47 -0700 (PDT) From: Mel Gorman Subject: [PATCH 07/16] mm: page_alloc: Only check the zone id check if pages are buddies Date: Fri, 18 Apr 2014 15:50:34 +0100 Message-Id: <1397832643-14275-8-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel A node/zone index is used to check if pages are compatible for merging but this happens unconditionally even if the buddy page is not free. Defer the calculation as long as possible. Ideally we would check the zone boundary but nodes can overlap. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 88a6dac..c5933a5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -508,16 +508,26 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, if (!pfn_valid_within(page_to_pfn(buddy))) return 0; - if (page_zone_id(page) != page_zone_id(buddy)) - return 0; - if (page_is_guard(buddy) && page_order(buddy) == order) { VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy); + + if (page_zone_id(page) != page_zone_id(buddy)) + return 0; + return 1; } if (PageBuddy(buddy) && page_order(buddy) == order) { VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy); + + /* + * zone check is done late to avoid uselessly + * calculating zone/node ids for pages that could + * never merge. + */ + if (page_zone_id(page) != page_zone_id(buddy)) + return 0; + return 1; } return 0; -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f43.google.com (mail-ee0-f43.google.com [74.125.83.43]) by kanga.kvack.org (Postfix) with ESMTP id F3F886B0044 for ; Fri, 18 Apr 2014 10:50:49 -0400 (EDT) Received: by mail-ee0-f43.google.com with SMTP id e53so1702797eek.30 for ; Fri, 18 Apr 2014 07:50:49 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id r9si40622686eew.78.2014.04.18.07.50.48 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:48 -0700 (PDT) From: Mel Gorman Subject: [PATCH 09/16] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path Date: Fri, 18 Apr 2014 15:50:36 +0100 Message-Id: <1397832643-14275-10-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel ALLOC_NO_WATERMARKS is set in a few cases: always by kswapd, always for __GFP_MEMALLOC, and sometimes for swap-over-nfs, tasks etc. Each of these cases is a relatively rare event, but the ALLOC_NO_WATERMARKS check is an unlikely branch in the fast path. This patch moves the check out of the fast path so that it runs only after it has been determined that the watermarks have not been met. This helps the common fast path at the cost of making the slow path slower and hitting kswapd with a performance cost. It's a reasonable tradeoff. 
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 770735a..737577c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1930,9 +1930,6 @@ zonelist_scan: (alloc_flags & ALLOC_CPUSET) && !cpuset_zone_allowed_softwall(zone, gfp_mask)) continue; - BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); - if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) - goto try_this_zone; /* * Distribute pages in proportion to the individual * zone size to ensure fair page aging. The zone a @@ -1979,6 +1976,11 @@ zonelist_scan: classzone_idx, alloc_flags)) { int ret; + /* Checked here to keep the fast path fast */ + BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); + if (alloc_flags & ALLOC_NO_WATERMARKS) + goto try_this_zone; + if (IS_ENABLED(CONFIG_NUMA) && !did_zlc_setup && nr_online_nodes > 1) { /* -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f51.google.com (mail-ee0-f51.google.com [74.125.83.51]) by kanga.kvack.org (Postfix) with ESMTP id 8377B6B004D for ; Fri, 18 Apr 2014 10:50:50 -0400 (EDT) Received: by mail-ee0-f51.google.com with SMTP id c13so1700787eek.38 for ; Fri, 18 Apr 2014 07:50:50 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id z2si40615447eeo.34.2014.04.18.07.50.48 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:49 -0700 (PDT) From: Mel Gorman Subject: [PATCH 10/16] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps Date: Fri, 18 Apr 2014 15:50:37 +0100 Message-Id: <1397832643-14275-11-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel The test_bit operations in get/set pageblock flags are expensive. This patch reads the bitmap on a word basis and uses shifts and masks to isolate the bits of interest. Similarly, masks are used to update a local copy of the bitmap word, and cmpxchg then writes it back only if no other changes were made in parallel; otherwise the update is retried. In a test running dd onto tmpfs the overhead of the pageblock-related functions went from 1.27% in profiles to 0.5%. 
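The word-based approach described above can be sketched in userspace C11. This is only an illustration under stated assumptions, not the kernel code: the names field_get and field_set are hypothetical, C11 atomic_compare_exchange_weak stands in for the kernel's cmpxchg, and the field is positioned counting from the least significant bit, whereas the patch counts from the most significant end of the word. It also assumes a field never straddles a word boundary and is narrower than a word.

```c
#include <limits.h>
#include <stdatomic.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Read an nr_bits-wide field starting at bit index bitidx using a single
 * word load plus shift and mask, instead of one test per bit.
 * Assumption: the field does not cross a word boundary and
 * nr_bits < BITS_PER_WORD.
 */
static unsigned long field_get(_Atomic unsigned long *bitmap,
                               unsigned long bitidx, unsigned long nr_bits)
{
    unsigned long mask = (1UL << nr_bits) - 1;
    unsigned long word = atomic_load_explicit(&bitmap[bitidx / BITS_PER_WORD],
                                              memory_order_relaxed);

    return (word >> (bitidx % BITS_PER_WORD)) & mask;
}

/*
 * Update the same field: build the new word from a local copy, then
 * publish it with compare-and-swap. If another thread changed the word
 * in the meantime, the CAS fails, old_word is refreshed, and we retry.
 */
static void field_set(_Atomic unsigned long *bitmap, unsigned long bitidx,
                      unsigned long nr_bits, unsigned long flags)
{
    unsigned long shift = bitidx % BITS_PER_WORD;
    unsigned long mask = ((1UL << nr_bits) - 1) << shift;
    _Atomic unsigned long *wordp = &bitmap[bitidx / BITS_PER_WORD];
    unsigned long old_word = atomic_load_explicit(wordp, memory_order_relaxed);
    unsigned long new_word;

    do {
        new_word = (old_word & ~mask) | ((flags << shift) & mask);
    } while (!atomic_compare_exchange_weak(wordp, &old_word, new_word));
}
```

The retry loop is what keeps neighbouring fields in the same word intact when two writers update different fields concurrently.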
Signed-off-by: Mel Gorman --- include/linux/mmzone.h | 6 +++++- include/linux/pageblock-flags.h | 21 ++++++++++++++++---- mm/page_alloc.c | 43 +++++++++++++++++++++++++---------------- 3 files changed, 48 insertions(+), 22 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c1dbe0b..c97b4bc 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -75,9 +75,13 @@ enum { extern int page_group_by_mobility_disabled; +#define NR_MIGRATETYPE_BITS 3 +#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1) + static inline int get_pageblock_migratetype(struct page *page) { - return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end); + BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2); + return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK); } struct free_area { diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 2ee8cd2..c89ac75 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -30,9 +30,12 @@ enum pageblock_bits { PB_migrate, PB_migrate_end = PB_migrate + 3 - 1, /* 3 bits required for migrate types */ -#ifdef CONFIG_COMPACTION PB_migrate_skip,/* If set the block is skipped by compaction */ -#endif /* CONFIG_COMPACTION */ + + /* + * Assume the bits will always align on a word. If this assumption + * changes then get/set pageblock needs updating. + */ NR_PAGEBLOCK_BITS }; @@ -62,9 +65,19 @@ extern int pageblock_order; /* Forward declaration */ struct page; +unsigned long get_pageblock_flags_mask(struct page *page, + unsigned long nr_flag_bits, + unsigned long mask); + /* Declarations for getting and setting flags. 
See mm/page_alloc.c */ -unsigned long get_pageblock_flags_group(struct page *page, - int start_bitidx, int end_bitidx); +static inline unsigned long get_pageblock_flags_group(struct page *page, + int start_bitidx, int end_bitidx) +{ + unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1; + unsigned long mask = (1 << nr_flag_bits) - 1; + + return get_pageblock_flags_mask(page, nr_flag_bits, mask); +} void set_pageblock_flags_group(struct page *page, unsigned long flags, int start_bitidx, int end_bitidx); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 737577c..6047866 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6012,25 +6012,24 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn) * @end_bitidx: The last bit of interest * returns pageblock_bits flags */ -unsigned long get_pageblock_flags_group(struct page *page, - int start_bitidx, int end_bitidx) +unsigned long get_pageblock_flags_mask(struct page *page, + unsigned long nr_flag_bits, + unsigned long mask) { struct zone *zone; unsigned long *bitmap; - unsigned long pfn, bitidx; - unsigned long flags = 0; - unsigned long value = 1; + unsigned long pfn, bitidx, word_bitidx; + unsigned long word; zone = page_zone(page); pfn = page_to_pfn(page); bitmap = get_pageblock_bitmap(zone, pfn); bitidx = pfn_to_bitidx(zone, pfn); + word_bitidx = bitidx / BITS_PER_LONG; + bitidx &= (BITS_PER_LONG-1); - for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) - if (test_bit(bitidx + start_bitidx, bitmap)) - flags |= value; - - return flags; + word = bitmap[word_bitidx]; + return (word >> (BITS_PER_LONG - (bitidx + nr_flag_bits))) & mask; } /** @@ -6045,20 +6044,30 @@ void set_pageblock_flags_group(struct page *page, unsigned long flags, { struct zone *zone; unsigned long *bitmap; - unsigned long pfn, bitidx; - unsigned long value = 1; + unsigned long pfn, bitidx, word_bitidx; + unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1; + unsigned long mask = (1 << nr_flag_bits) - 1; 
+ unsigned long old_word, new_word; + + BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); zone = page_zone(page); pfn = page_to_pfn(page); bitmap = get_pageblock_bitmap(zone, pfn); bitidx = pfn_to_bitidx(zone, pfn); + word_bitidx = bitidx / BITS_PER_LONG; + bitidx &= (BITS_PER_LONG-1); + VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page); - for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1) - if (flags & value) - __set_bit(bitidx + start_bitidx, bitmap); - else - __clear_bit(bitidx + start_bitidx, bitmap); + end_bitidx = bitidx + (end_bitidx - start_bitidx); + mask <<= (BITS_PER_LONG - end_bitidx - 1); + flags <<= (BITS_PER_LONG - end_bitidx - 1); + + do { + old_word = ACCESS_ONCE(bitmap[word_bitidx]); + new_word = (old_word & ~mask) | flags; + } while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word); } /* -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f48.google.com (mail-ee0-f48.google.com [74.125.83.48]) by kanga.kvack.org (Postfix) with ESMTP id 267636B0055 for ; Fri, 18 Apr 2014 10:50:51 -0400 (EDT) Received: by mail-ee0-f48.google.com with SMTP id b57so1676010eek.7 for ; Fri, 18 Apr 2014 07:50:50 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id z2si40594441eeo.124.2014.04.18.07.50.49 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:49 -0700 (PDT) From: Mel Gorman Subject: [PATCH 12/16] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Date: Fri, 18 Apr 2014 15:50:39 +0100 Message-Id: <1397832643-14275-13-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel shmem_getpage_gfp uses an atomic operation to set the SwapBacked flag before the page is even added to the LRU or visible. This is unnecessary: what could it possibly race against? Use an unlocked variant. Signed-off-by: Mel Gorman --- include/linux/page-flags.h | 1 + mm/shmem.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d1fe1a7..4d4b39a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -208,6 +208,7 @@ PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ PAGEFLAG(SavePinned, savepinned); /* Xen */ PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) + __SETPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) diff --git a/mm/shmem.c b/mm/shmem.c index 9f70e02..f47fb38 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1132,7 +1132,7 @@ repeat: goto decused; } - SetPageSwapBacked(page); + __SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_charge_file(page, current->mm, gfp & GFP_RECLAIM_MASK); -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f43.google.com (mail-ee0-f43.google.com [74.125.83.43]) by kanga.kvack.org (Postfix) with ESMTP id 897DD6B005A for ; Fri, 18 Apr 2014 10:50:51 -0400 (EDT) Received: by mail-ee0-f43.google.com with SMTP id e53so1681813eek.16 for ; Fri, 18 Apr 2014 07:50:51 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id d5si40543694eei.268.2014.04.18.07.50.49 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:50 -0700 (PDT) From: Mel Gorman Subject: [PATCH 13/16] mm: Do not use atomic operations when releasing pages Date: Fri, 18 Apr 2014 15:50:40 +0100 Message-Id: <1397832643-14275-14-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel There should be no references to the page any more and a parallel mark should not be reordered against us. Use the non-locked variant to clear PageActive. Signed-off-by: Mel Gorman --- mm/swap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swap.c b/mm/swap.c index 9ce43ba..fed4caf 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, int cold) } /* Clear Active bit in case of parallel mark_page_accessed */ - ClearPageActive(page); + __ClearPageActive(page); list_add(&page->lru, &pages_to_free); } -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f46.google.com (mail-ee0-f46.google.com [74.125.83.46]) by kanga.kvack.org (Postfix) with ESMTP id 0F1CD6B005C for ; Fri, 18 Apr 2014 10:50:51 -0400 (EDT) Received: by mail-ee0-f46.google.com with SMTP id t10so1698085eei.33 for ; Fri, 18 Apr 2014 07:50:51 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id q5si40538969eem.291.2014.04.18.07.50.50 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:50 -0700 (PDT) From: Mel Gorman Subject: [PATCH 14/16] mm: Do not use unnecessary atomic operations when adding pages to the LRU Date: Fri, 18 Apr 2014 15:50:41 +0100 Message-Id: <1397832643-14275-15-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel When adding pages to the LRU we clear the active bit unconditionally. As the page could be reachable from other paths we cannot use unlocked operations without risk of corruption such as a parallel mark_page_accessed. This patch tests whether it is necessary to clear the active flag before using an atomic operation. In the unlikely event this races with mark_page_accessed, the consequence is simply that a page which would have been left on the inactive list before the patch may be promoted to the active list. This is a marginal consequence. 
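The test-before-clear idea above can be modelled in userspace C11 as follows. This is a hedged sketch rather than the kernel implementation: TOY_FLAG_ACTIVE and clear_active_if_set are hypothetical names, and a relaxed C11 load stands in for the plain PageActive() test. The point is that the common case (flag already clear) costs only a read, and the atomic read-modify-write is issued only when the flag is actually set.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for the page's Active flag. */
#define TOY_FLAG_ACTIVE (1UL << 0)

/*
 * Clear the active bit only if it is currently set.
 * Returns true if an atomic clear was issued, false if it was skipped.
 * A concurrent setter may slip in between the test and the clear; as in
 * the patch description, that window is accepted as harmless here.
 */
static bool clear_active_if_set(_Atomic unsigned long *flags)
{
    /* Cheap read first: no atomic read-modify-write in the common case. */
    if (!(atomic_load_explicit(flags, memory_order_relaxed) & TOY_FLAG_ACTIVE))
        return false;

    /* The bit is set, so pay for the atomic read-modify-write. */
    atomic_fetch_and(flags, ~TOY_FLAG_ACTIVE);
    return true;
}
```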
Signed-off-by: Mel Gorman --- include/linux/swap.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 3507115..4a9ac85 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -329,13 +329,15 @@ extern void add_page_to_unevictable_list(struct page *page); */ static inline void lru_cache_add_anon(struct page *page) { - ClearPageActive(page); + if (PageActive(page)) + ClearPageActive(page); __lru_cache_add(page); } static inline void lru_cache_add_file(struct page *page) { - ClearPageActive(page); + if (PageActive(page)) + ClearPageActive(page); __lru_cache_add(page); } -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f44.google.com (mail-ee0-f44.google.com [74.125.83.44]) by kanga.kvack.org (Postfix) with ESMTP id 8CE766B0062 for ; Fri, 18 Apr 2014 10:50:52 -0400 (EDT) Received: by mail-ee0-f44.google.com with SMTP id e49so1718920eek.3 for ; Fri, 18 Apr 2014 07:50:52 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id p8si40591802eew.156.2014.04.18.07.50.51 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:51 -0700 (PDT) From: Mel Gorman Subject: [PATCH 15/16] mm: Non-atomically mark page accessed in write_begin where possible Date: Fri, 18 Apr 2014 15:50:42 +0100 Message-Id: <1397832643-14275-16-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel aops->write_begin may allocate a new page and make it visible just to have mark_page_accessed called almost immediately after. 
Once it's visible, atomic operations are necessary, which is noticeable overhead when writing to an in-memory filesystem like tmpfs but should also be noticeable with fast storage. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may be called before the page is visible and can be done non-atomically. In this patch, new allocations in grab_cache_page_write_begin() or find_or_create_page() use init_page_accessed() and existing pages use mark_page_accessed(). This places a burden because filesystems need to ensure they either use these helpers or update the helpers they do use to call init_page_accessed() or mark_page_accessed() as appropriate. There is also a snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced whereas previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. In a profiled run measuring dd to tmpfs the overhead of mark_page_accessed was 25142 0.7055 vmlinux-3.15.0-rc1-vanilla vmlinux-3.15.0-rc1-vanilla shmem_write_end 107830 3.0256 vmlinux-3.15.0-rc1-vanilla vmlinux-3.15.0-rc1-vanilla mark_page_accessed 3.73% overall. With the patch applied, it becomes 118185 3.1712 vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 shmem_write_end 2395 0.0643 vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 init_page_accessed 159 0.0043 vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 mark_page_accessed 3.23% overall. shmem_write_end increases in apparent cost because the SetPageUptodate is now to a cache line that mark_page_accessed had not dirtied for it. 
Even with that taken into account, it's still fewer atomic operations overall. Signed-off-by: Mel Gorman --- include/linux/page-flags.h | 1 + include/linux/swap.h | 1 + mm/filemap.c | 55 +++++++++++++++++++++++++++------------------- mm/shmem.c | 6 ++++- mm/swap.c | 11 ++++++++++ 5 files changed, 51 insertions(+), 23 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4d4b39a..2093eb7 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -198,6 +198,7 @@ struct page; /* forward declaration */ TESTPAGEFLAG(Locked, locked) PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) + __SETPAGEFLAG(Referenced, referenced) PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4a9ac85..e54312d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -314,6 +314,7 @@ extern void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *head); extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); +extern void init_page_accessed(struct page *page); extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); extern void lru_add_drain_all(void); diff --git a/mm/filemap.c b/mm/filemap.c index a82fbe4..c28f69c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1059,24 +1059,31 @@ struct page *find_or_create_page(struct address_space *mapping, int err; repeat: page = find_lock_page(mapping, index); - if (!page) { - page = __page_cache_alloc(gfp_mask); - if (!page) - return NULL; - /* - * We want a regular kernel memory (not highmem or DMA etc) - * allocation for the radix tree nodes, but we need to honour - * the context-specific requirements the caller has asked for. 
- * GFP_RECLAIM_MASK collects those requirements. - */ - err = add_to_page_cache_lru(page, mapping, index, - (gfp_mask & GFP_RECLAIM_MASK)); - if (unlikely(err)) { - page_cache_release(page); - page = NULL; - if (err == -EEXIST) - goto repeat; - } + if (page) { + mark_page_accessed(page); + return page; + } + + page = __page_cache_alloc(gfp_mask); + if (!page) + return NULL; + + /* Init accessed so avoid atomic mark_page_accessed later */ + init_page_accessed(page); + + /* + * We want a regular kernel memory (not highmem or DMA etc) + * allocation for the radix tree nodes, but we need to honour + * the context-specific requirements the caller has asked for. + * GFP_RECLAIM_MASK collects those requirements. + */ + err = add_to_page_cache_lru(page, mapping, index, + (gfp_mask & GFP_RECLAIM_MASK)); + if (unlikely(err)) { + page_cache_release(page); + page = NULL; + if (err == -EEXIST) + goto repeat; } return page; } @@ -2372,7 +2379,6 @@ int pagecache_write_end(struct file *file, struct address_space *mapping, { const struct address_space_operations *aops = mapping->a_ops; - mark_page_accessed(page); return aops->write_end(file, mapping, pos, len, copied, page, fsdata); } EXPORT_SYMBOL(pagecache_write_end); @@ -2466,12 +2472,18 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, gfp_notmask = __GFP_FS; repeat: page = find_lock_page(mapping, index); - if (page) + if (page) { + mark_page_accessed(page); goto found; + } page = __page_cache_alloc(gfp_mask & ~gfp_notmask); if (!page) return NULL; + + /* Init accessed so avoid atomic mark_page_accessed later */ + init_page_accessed(page); + status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL & ~gfp_notmask); if (unlikely(status)) { @@ -2530,7 +2542,7 @@ again: status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata); - if (unlikely(status)) + if (unlikely(status < 0)) break; if (mapping_writably_mapped(mapping)) @@ -2539,7 +2551,6 @@ again: copied = 
iov_iter_copy_from_user_atomic(page, i, offset, bytes); flush_dcache_page(page); - mark_page_accessed(page); status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata); if (unlikely(status < 0)) diff --git a/mm/shmem.c b/mm/shmem.c index f47fb38..700a4ad 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1372,9 +1372,13 @@ shmem_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) { + int ret; struct inode *inode = mapping->host; pgoff_t index = pos >> PAGE_CACHE_SHIFT; - return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); + ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); + if (*pagep) + init_page_accessed(*pagep); + return ret; } static int diff --git a/mm/swap.c b/mm/swap.c index fed4caf..2490dfe 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -583,6 +583,17 @@ void mark_page_accessed(struct page *page) EXPORT_SYMBOL(mark_page_accessed); /* + * Used to mark_page_accessed(page) that is not visible yet and when it is + * still safe to use non-atomic ops + */ +void init_page_accessed(struct page *page) +{ + if (!PageReferenced(page)) + __SetPageReferenced(page); +} +EXPORT_SYMBOL(init_page_accessed); + +/* * Queue the page for addition to the LRU via pagevec. The decision on whether * to add the page to the [in]active [file|anon] list is deferred until the * pagevec is drained. This gives a chance for the caller of __lru_cache_add() -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
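The split between init_page_accessed() and mark_page_accessed() in the patch above can be illustrated with a small userspace model. This is a sketch under loud assumptions: struct toy_page, TOY_PG_REFERENCED, toy_init_accessed and toy_mark_accessed are hypothetical names, and the "visible" path is reduced to a single atomic OR, whereas the real mark_page_accessed() also handles promotion to the active list. In C11 terms, the pre-publication path uses a relaxed load and store (no atomic read-modify-write), which is only safe while no other thread can reach the object.

```c
#include <stdatomic.h>

/* Hypothetical flag bit standing in for PG_referenced. */
#define TOY_PG_REFERENCED (1UL << 0)

/* Hypothetical stand-in for struct page; only the flags word is modelled. */
struct toy_page {
    _Atomic unsigned long flags;
};

/*
 * Pre-publication path: the page is not yet reachable by anyone else,
 * so a separate load and store is enough. Nothing can change flags
 * between the two steps.
 */
static void toy_init_accessed(struct toy_page *page)
{
    unsigned long f = atomic_load_explicit(&page->flags, memory_order_relaxed);

    if (!(f & TOY_PG_REFERENCED))
        atomic_store_explicit(&page->flags, f | TOY_PG_REFERENCED,
                              memory_order_relaxed);
}

/*
 * Post-publication path: other threads may be updating flags, so the
 * bit must be set with an atomic read-modify-write.
 */
static void toy_mark_accessed(struct toy_page *page)
{
    atomic_fetch_or(&page->flags, TOY_PG_REFERENCED);
}
```

A caller in this model would use toy_init_accessed() on a freshly allocated object before inserting it into any shared structure, and toy_mark_accessed() on objects found by lookup.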
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f47.google.com (mail-ee0-f47.google.com [74.125.83.47]) by kanga.kvack.org (Postfix) with ESMTP id D06F46B005A for ; Fri, 18 Apr 2014 10:50:52 -0400 (EDT) Received: by mail-ee0-f47.google.com with SMTP id b15so1682403eek.6 for ; Fri, 18 Apr 2014 07:50:52 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id o46si40610989eem.69.2014.04.18.07.50.51 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:51 -0700 (PDT) From: Mel Gorman Subject: [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate Date: Fri, 18 Apr 2014 15:50:43 +0100 Message-Id: <1397832643-14275-17-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel The write_end handler is likely to call SetPageUptodate which is an atomic operation so prefetch the line. Signed-off-by: Mel Gorman --- mm/filemap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index c28f69c..40713da 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2551,6 +2551,9 @@ again: copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); flush_dcache_page(page); + if (!PageUptodate(page)) + prefetchw(&page->flags); + status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata); if (unlikely(status < 0)) -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f43.google.com (mail-ee0-f43.google.com [74.125.83.43]) by kanga.kvack.org (Postfix) with ESMTP id EB0B56B0068 for ; Fri, 18 Apr 2014 10:50:55 -0400 (EDT) Received: by mail-ee0-f43.google.com with SMTP id e53so1681809eek.16 for ; Fri, 18 Apr 2014 07:50:50 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id z42si40598713eel.92.2014.04.18.07.50.49 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 07:50:49 -0700 (PDT) From: Mel Gorman Subject: [PATCH 11/16] mm: page_alloc: Reduce number of times page_to_pfn is called Date: Fri, 18 Apr 2014 15:50:38 +0100 Message-Id: <1397832643-14275-12-git-send-email-mgorman@suse.de> In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de> References: <1397832643-14275-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Linux-FSDevel In the free path we calculate page_to_pfn multiple times. Reduce that. 

Signed-off-by: Mel Gorman
---
 include/linux/mmzone.h          |  9 +++++++--
 include/linux/pageblock-flags.h | 15 ++++++---------
 mm/page_alloc.c                 | 26 +++++++++++++++-----------
 3 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c97b4bc..14ed8d1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -78,10 +78,15 @@ extern int page_group_by_mobility_disabled;
 #define NR_MIGRATETYPE_BITS 3
 #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
 
-static inline int get_pageblock_migratetype(struct page *page)
+#define get_pageblock_migratetype(page)					\
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+			NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK)
+
+static inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn)
 {
 	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
-	return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
+	return get_pfnblock_flags_mask(page, pfn,
+			NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
 }
 
 struct free_area {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index c89ac75..6a9dd5b 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -65,19 +65,16 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page,
+				unsigned long pfn,
 				unsigned long nr_flag_bits,
 				unsigned long mask);
 
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-static inline unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
-{
-	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
-	unsigned long mask = (1 << nr_flag_bits) - 1;
-
-	return get_pageblock_flags_mask(page, nr_flag_bits, mask);
-}
+#define get_pageblock_flags_group(page, start_bitidx, end_bitidx) \
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+			end_bitidx - start_bitidx + 1,			\
+			(1 << (end_bitidx - start_bitidx + 1)) - 1)
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
 					int start_bitidx, int end_bitidx);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6047866..377e58a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -559,6 +559,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
 		struct zone *zone, unsigned int order,
 		int migratetype)
 {
@@ -575,7 +576,7 @@ static inline void __free_one_page(struct page *page,
 
 	VM_BUG_ON(migratetype == -1);
 
-	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+	page_idx = pfn & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -710,7 +711,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			list_del(&page->lru);
 			mt = get_freepage_migratetype(page);
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
-			__free_one_page(page, zone, 0, mt);
+			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
 			if (likely(!is_migrate_isolate_page(page))) {
 				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
@@ -722,13 +723,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order,
+static void free_one_page(struct zone *zone,
+				struct page *page, unsigned long pfn,
+				int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone->pages_scanned = 0;
 
-	__free_one_page(page, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype);
 	if (unlikely(!is_migrate_isolate(migratetype)))
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock(&zone->lock);
@@ -765,15 +768,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	unsigned long pfn = page_to_pfn(page);
 
 	if (!free_pages_prepare(page, order))
 		return;
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
-	free_one_page(page_zone(page), page, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
 }
 
@@ -1376,12 +1380,13 @@ void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
 	if (!free_pages_prepare(page, 0))
 		return;
 
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
@@ -1395,7 +1400,7 @@ void free_hot_cold_page(struct page *page, int cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
@@ -6012,17 +6017,16 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
 					unsigned long nr_flag_bits,
 					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long bitidx, word_bitidx;
 	unsigned long word;
 
 	zone = page_zone(page);
-	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
-- 
1.8.4.5

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f46.google.com (unknown [74.125.83.46])
	by kanga.kvack.org (Postfix) with ESMTP id 893FF6B0068
	for ; Fri, 18 Apr 2014 10:50:59 -0400 (EDT)
Received: by mail-ee0-f46.google.com with SMTP id t10so1698049eei.33
	for ; Fri, 18 Apr 2014 07:50:48 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15])
	by mx.google.com with ESMTPS id m49si40636408eeo.11.2014.04.18.07.50.47
	for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
	Fri, 18 Apr 2014 07:50:48 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once
Date: Fri, 18 Apr 2014 15:50:35 +0100
Message-Id: <1397832643-14275-9-git-send-email-mgorman@suse.de>
In-Reply-To: <1397832643-14275-1-git-send-email-mgorman@suse.de>
References: <1397832643-14275-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Linux-FSDevel

Currently it's calculated once per zone in the zonelist.
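
A minimal userspace sketch of the pattern, under stated assumptions: the
flag test depends only on the caller's arguments, so it can be evaluated
once before the scan instead of once per zone. The struct zone layout, the
SKETCH_* flag values, the toy zone_dirty_ok and the first_usable_zone
helper are all hypothetical stand-ins chosen for the example, not the
kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
#define SKETCH_ALLOC_WMARK_LOW	0x1u
#define SKETCH_GFP_WRITE	0x2u

struct zone {
	unsigned long nr_dirty;
	unsigned long dirty_limit;
};

/* Toy per-zone check: the only part that really depends on the zone. */
static bool zone_dirty_ok(const struct zone *zone)
{
	return zone->nr_dirty <= zone->dirty_limit;
}

/* Return the first zone that passes the dirty check, or NULL. */
static const struct zone *first_usable_zone(const struct zone *zones,
					    size_t nr_zones,
					    unsigned int alloc_flags,
					    unsigned int gfp_mask)
{
	/* Loop-invariant: computed once, outside the zone scan. */
	const bool consider_zone_dirty =
		(alloc_flags & SKETCH_ALLOC_WMARK_LOW) &&
		(gfp_mask & SKETCH_GFP_WRITE);
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		if (consider_zone_dirty && !zone_dirty_ok(&zones[i]))
			continue;
		return &zones[i];
	}
	return NULL;
}
```

With two zones where only the second is under its dirty limit, a write
allocation at the low watermark skips the first zone, while an allocation
without the write flag takes the first zone without consulting
zone_dirty_ok at all. A compiler may already hoist such a test, so any gain
is likely small.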

Signed-off-by: Mel Gorman
---
 mm/page_alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5933a5..770735a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1911,6 +1911,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
+				(gfp_mask & __GFP_WRITE);
 
 zonelist_scan:
 	/*
@@ -1969,8 +1971,7 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+		if (consider_zone_dirty && !zone_dirty_ok(zone))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
-- 
1.8.4.5

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f42.google.com (mail-pb0-f42.google.com [209.85.160.42])
	by kanga.kvack.org (Postfix) with ESMTP id BC3F26B0035
	for ; Fri, 18 Apr 2014 17:15:33 -0400 (EDT)
Received: by mail-pb0-f42.google.com with SMTP id rr13so1816701pbb.15
	for ; Fri, 18 Apr 2014 14:15:33 -0700 (PDT)
Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93])
	by mx.google.com with ESMTP id yd10si16940557pab.412.2014.04.18.14.15.32
	for ; Fri, 18 Apr 2014 14:15:32 -0700 (PDT)
Message-ID: <535195F3.8040009@intel.com>
Date: Fri, 18 Apr 2014 14:15:31 -0700
From: Dave Hansen
MIME-Version: 1.0
Subject: Re: [PATCH 01/16] mm: Disable zone_reclaim_mode by default
References: <1397832643-14275-1-git-send-email-mgorman@suse.de>
	<1397832643-14275-2-git-send-email-mgorman@suse.de>
	<87tx9q35x7.fsf@tassilo.jf.intel.com>
In-Reply-To: <87tx9q35x7.fsf@tassilo.jf.intel.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andi Kleen , Mel Gorman
Cc: Linux-MM , Linux-FSDevel

On 04/18/2014 10:26 AM, Andi Kleen wrote:
> Mel Gorman writes:
>> Favour the common case and disable it by default. Users that are
>> sophisticated enough to know they need zone_reclaim_mode will detect it.
>
> While I'm not totally against this change, it will destroy many
> carefully tuned configurations as the default NUMA behavior may be completely
> different now. So it seems like a big hammer, and it's not even clear
> what problem you're exactly solving here.

I'm not 100% sure what the common case _is_. Folks who want good NUMA
affinity are happy now and are happy by default. Folks who want to fill
memory with page cache are mad and mad by default, and they're the ones
complaining. It's hard to count the happy ones. :)

But, on the other hand, the current situation is easy to debug. Someone
complains that they have too much free memory, and it ends up being
pretty easy to solve just looking at statistics, and things go horribly
wrong quickly.

If we apply this patch, it's much less obvious when things are going
wrong, and we have no statistics to help. We'll need to get folks
running more things like numatop:

	https://01.org/numatop

That said, as a recipient of angry calls from customers who don't like
zone_reclaim_mode, I _do_ think this is the path we should take at the
moment. Maybe we'll be reverting it in a few years once all of our
customers are angry about lack of NUMA locality.

Acked-by: Dave Hansen