From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ea0-f170.google.com (mail-ea0-f170.google.com [209.85.215.170]) by kanga.kvack.org (Postfix) with ESMTP id DBBDB6B003A for ; Tue, 17 Dec 2013 11:48:26 -0500 (EST)
Received: by mail-ea0-f170.google.com with SMTP id k10so3042046eaj.29 for ; Tue, 17 Dec 2013 08:48:26 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id i1si19252922eev.131.2013.12.17.08.48.25 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 08:48:25 -0800 (PST)
From: Mel Gorman
Subject: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
Date: Tue, 17 Dec 2013 16:48:18 +0000
Message-Id: <1387298904-8824-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

This series is currently untested and is being posted to sync up discussions on the treatment of page cache pages, particularly the sysv part. I have not thought it through in detail but posting patches is the easiest way to highlight where I think a problem might be.

Changelog since v2
o Drop an accounting patch, behaviour is deliberate
o Special case tmpfs and shmem pages for discussion

Changelog since v1
o Fix a lot of brain damage in the configurable policy patch
o Yoink a page cache annotation patch
o Only account batch pages against allocations eligible for the fair policy
o Add patch that default distributes file pages on remote nodes

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a bug whereby new pages could be reclaimed before old pages because of how the page allocator and kswapd interacted on the per-zone LRU lists. Unfortunately a side-effect missed during review was that it's now very easy to allocate remote memory on NUMA machines.
The problem is that it is not a simple case of just restoring local allocation policies as there are genuine reasons why global page aging may be preferable. It's still a major change to default behaviour so this patch makes the policy configurable and sets what I think is a sensible default.

The patches are on top of some NUMA balancing patches currently in -mm. It's untested and posted to discuss patches 4 and 6.

 Documentation/sysctl/vm.txt |  29 ++++++++++
 include/linux/gfp.h         |   4 +-
 include/linux/mmzone.h      |   2 +
 include/linux/pagemap.h     |   2 +-
 include/linux/swap.h        |   2 +
 kernel/sysctl.c             |   8 +++
 mm/filemap.c                |   2 +
 mm/page_alloc.c             | 136 +++++++++++++++++++++++++++++++++++++-------
 mm/shmem.c                  |  14 +++++
 9 files changed, 176 insertions(+), 23 deletions(-)

-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f42.google.com (mail-ee0-f42.google.com [74.125.83.42]) by kanga.kvack.org (Postfix) with ESMTP id 732FD6B003C for ; Tue, 17 Dec 2013 11:48:27 -0500 (EST)
Received: by mail-ee0-f42.google.com with SMTP id e53so3011223eek.29 for ; Tue, 17 Dec 2013 08:48:26 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id l2si1379448een.83.2013.12.17.08.48.26 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 08:48:26 -0800 (PST) From: Mel Gorman Subject: [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Date: Tue, 17 Dec 2013 16:48:19 +0000 Message-Id: <1387298904-8824-2-git-send-email-mgorman@suse.de> In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman From: Johannes Weiner Dave Hansen noted a regression in a microbenchmark that loops around open() and close() on an 8-node NUMA machine and bisected it down to 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). That change forces the slab allocations of the file descriptor to spread out to all 8 nodes, causing remote references in the page allocator and slab. The round-robin policy is only there to provide fairness among memory allocations that are reclaimed involuntarily based on pressure in each zone. It does not make sense to apply it to unreclaimable kernel allocations that are freed manually, in this case instantly after the allocation, and incur the remote reference costs twice for no reason. Only round-robin allocations that are usually freed through page reclaim or slab shrinking. Cc: Bisected-by: Dave Hansen Signed-off-by: Johannes Weiner Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/page_alloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 580a5f0..f861d02 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1920,7 +1920,8 @@ zonelist_scan: * back to remote zones that do not partake in the * fairness round-robin cycle of this zonelist. 
*/ - if (alloc_flags & ALLOC_WMARK_LOW) { + if ((alloc_flags & ALLOC_WMARK_LOW) && + (gfp_mask & GFP_MOVABLE_MASK)) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; if (zone_reclaim_mode && -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f180.google.com (mail-ea0-f180.google.com [209.85.215.180]) by kanga.kvack.org (Postfix) with ESMTP id 33DE86B003C for ; Tue, 17 Dec 2013 11:48:28 -0500 (EST) Received: by mail-ea0-f180.google.com with SMTP id f15so3048281eak.39 for ; Tue, 17 Dec 2013 08:48:27 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id s42si168942eew.182.2013.12.17.08.48.27 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 08:48:27 -0800 (PST) From: Mel Gorman Subject: [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper Date: Tue, 17 Dec 2013 16:48:20 +0000 Message-Id: <1387298904-8824-3-git-send-email-mgorman@suse.de> In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman This patch moves the decision on whether to round-robin allocations between zones and nodes into its own helper functions. It'll make some later patches easier to understand and it will be automatically inlined. 
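Illustration only, not part of the patch: the order of checks in the new helper can be modelled in userspace. This is a minimal sketch under stated assumptions. struct zone_model, the MODEL_* flag values and zone_reclaim_mode_model are hypothetical stand-ins rather than kernel definitions, and locality is reduced to comparing node IDs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel flags; the values are illustrative only. */
#define MODEL_ALLOC_WMARK_LOW 0x1u /* models ALLOC_WMARK_LOW (allocator fast path) */
#define MODEL_GFP_MOVABLE     0x2u /* models any bit in GFP_MOVABLE_MASK */

/* Hypothetical, heavily reduced view of a zone. */
struct zone_model {
	int node;         /* NUMA node the zone belongs to */
	long alloc_batch; /* models zone_page_state(zone, NR_ALLOC_BATCH) */
};

static int zone_reclaim_mode_model; /* models zone_reclaim_mode */

/*
 * Returns true if the zone should be skipped so that page ages are spread
 * to other zones. The checks follow the same order as the helper above.
 */
static bool zone_distribute_age_model(unsigned int gfp_mask,
				      const struct zone_model *preferred,
				      const struct zone_model *zone,
				      unsigned int alloc_flags)
{
	assert(preferred && zone);

	/* Only round robin in the allocator fast path */
	if (!(alloc_flags & MODEL_ALLOC_WMARK_LOW))
		return false;

	/* Only round robin pages likely to be LRU or reclaimable slab */
	if (!(gfp_mask & MODEL_GFP_MOVABLE))
		return false;

	/* Move on to the next zone once this zone's batch is exhausted */
	if (zone->alloc_batch <= 0)
		return true;

	/* With zone reclaim enabled, stay on the preferred node */
	if (zone_reclaim_mode_model && zone->node != preferred->node)
		return true;

	return false;
}
```

In this model a remote zone with batch remaining is only skipped while zone_reclaim_mode_model is non-zero, which is the behaviour the later patches in the series set out to make configurable.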
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Acked-by: Johannes Weiner --- mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 42 insertions(+), 21 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f861d02..64020eb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1872,6 +1872,42 @@ static inline void init_zone_allows_reclaim(int nid) #endif /* CONFIG_NUMA */ /* + * Distribute pages in proportion to the individual zone size to ensure fair + * page aging. The zone a page was allocated in should have no effect on the + * time the page has in memory before being reclaimed. + * + * Returns true if this zone should be skipped to spread the page ages to + * other zones. + */ +static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone, + struct zone *zone, int alloc_flags) +{ + /* Only round robin in the allocator fast path */ + if (!(alloc_flags & ALLOC_WMARK_LOW)) + return false; + + /* Only round robin pages likely to be LRU or reclaimable slab */ + if (!(gfp_mask & GFP_MOVABLE_MASK)) + return false; + + /* Distribute to the next zone if this zone has exhausted its batch */ + if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) + return true; + + /* + * When zone_reclaim_mode is enabled, try to stay in local zones in the + * fastpath. If that fails, the slowpath is entered, which will do + * another pass starting with the local zones, but ultimately fall back + * back to remote zones that do not partake in the fairness round-robin + * cycle of this zonelist. + */ + if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + return true; + + return false; +} + +/* * get_page_from_freelist goes through the zonelist trying to allocate * a page. 
*/ @@ -1907,27 +1943,12 @@ zonelist_scan: BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; - /* - * Distribute pages in proportion to the individual - * zone size to ensure fair page aging. The zone a - * page was allocated in should have no effect on the - * time the page has in memory before being reclaimed. - * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. If that fails, the - * slowpath is entered, which will do another pass - * starting with the local zones, but ultimately fall - * back to remote zones that do not partake in the - * fairness round-robin cycle of this zonelist. - */ - if ((alloc_flags & ALLOC_WMARK_LOW) && - (gfp_mask & GFP_MOVABLE_MASK)) { - if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) - continue; - if (zone_reclaim_mode && - !zone_local(preferred_zone, zone)) - continue; - } + + /* Distribute pages to ensure fair page aging */ + if (zone_distribute_age(gfp_mask, preferred_zone, zone, + alloc_flags)) + continue; + /* * When allocating a page cache page for writing, we * want to get it from a zone that is within its dirty -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f175.google.com (mail-ea0-f175.google.com [209.85.215.175]) by kanga.kvack.org (Postfix) with ESMTP id E6AE16B003D for ; Tue, 17 Dec 2013 11:48:28 -0500 (EST) Received: by mail-ea0-f175.google.com with SMTP id z10so3014340ead.20 for ; Tue, 17 Dec 2013 08:48:28 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id f8si5687558eep.225.2013.12.17.08.48.28 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 08:48:28 -0800 (PST) From: Mel Gorman Subject: [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality Date: Tue, 17 Dec 2013 16:48:21 +0000 Message-Id: <1387298904-8824-4-git-send-email-mgorman@suse.de> In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman zone_local is using node_distance which is a more expensive call than necessary. On x86, it's another function call in the allocator fast path and increases cache footprint. This patch makes the assumption zones on a local node will share the same node ID. The necessary information should already be cache hot. Signed-off-by: Mel Gorman Acked-by: Rik van Riel --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 64020eb..fd9677e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return zone_to_nid(zone) == numa_node_id(); } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f170.google.com (mail-ea0-f170.google.com [209.85.215.170]) by kanga.kvack.org (Postfix) with ESMTP id B44BE6B0044 for ; Tue, 17 Dec 2013 11:48:29 -0500 (EST) Received: by mail-ea0-f170.google.com with SMTP id k10so3045015eaj.1 for ; Tue, 17 Dec 2013 08:48:29 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id u49si5977975eep.85.2013.12.17.08.48.28 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 08:48:29 -0800 (PST) From: Mel Gorman Subject: [PATCH 4/6] mm: Annotate page cache allocations Date: Tue, 17 Dec 2013 16:48:22 +0000 Message-Id: <1387298904-8824-5-git-send-email-mgorman@suse.de> In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman The fair zone allocation policy needs to distinguish between anonymous, slab and file-backed pages. This patch annotates many of the page cache allocations by adjusting __page_cache_alloc. This does not guarantee that all page cache allocations are being properly annotated. One case for special consideration is shmem. sysv shared memory and MAP_SHARED anonymous pages are backed by this and they should be treated as anon by the fair allocation policy. It is also used by tmpfs which arguably should be treated as file by the fair allocation policy. The primary top-level shmem allocation function is shmem_getpage_gfp which ultimately uses alloc_pages_vma() and not __page_cache_alloc. This is correct for sysv and MAP_SHARED but tmpfs is still treated as anonymous. This patch special cases shmem to annotate tmpfs allocations as files for the fair zone allocation policy. 
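Illustration only, not part of the patch: a userspace sketch of how a single GFP bit can carry the annotation and how a consumer might classify a request from it. The 0x2000000u bit and the shift of 26 are the values this patch proposes. gfp_model_t, the MODEL_* names, the value chosen for the reclaimable bit and both functions are hypothetical.

```c
#include <assert.h>

typedef unsigned int gfp_model_t; /* hypothetical stand-in for gfp_t */

/* Bit value this patch proposes for ___GFP_PAGECACHE (bit 25). */
#define MODEL_GFP_PAGECACHE   0x2000000u
/* Illustrative value only; stands in for __GFP_RECLAIMABLE. */
#define MODEL_GFP_RECLAIMABLE 0x80000u
/* __GFP_BITS_SHIFT after this patch; 25 would not cover bit 25. */
#define MODEL_GFP_BITS_SHIFT  26
#define MODEL_GFP_BITS_MASK   ((1u << MODEL_GFP_BITS_SHIFT) - 1)

enum alloc_class { CLASS_ANON, CLASS_FILE, CLASS_SLAB };

/* Models __page_cache_alloc(): tag the request before passing it on. */
static gfp_model_t annotate_page_cache(gfp_model_t gfp)
{
	return gfp | MODEL_GFP_PAGECACHE;
}

/* Classify a request the way a fair-policy consumer could. */
static enum alloc_class classify(gfp_model_t gfp)
{
	/* Every flag must fit inside the declared bit range */
	assert((gfp & ~MODEL_GFP_BITS_MASK) == 0);

	if (gfp & MODEL_GFP_PAGECACHE)
		return CLASS_FILE;
	if (gfp & MODEL_GFP_RECLAIMABLE)
		return CLASS_SLAB;
	return CLASS_ANON; /* untagged requests are treated as anon */
}
```

Under this model a tmpfs allocation would pass through annotate_page_cache() and classify as file, while sysv and MAP_SHARED allocations stay untagged and classify as anon.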
NOTE: At the time of writing it has not been double checked that it annotates the different shmem request types. Furthermore, this patch was originally based on a patch from Johannes and does not have his Signed-off-by. Without his Signed-off-by, I cannot sign it off. Cannot-sign-off-without-Johannes --- include/linux/gfp.h | 4 +++- include/linux/pagemap.h | 2 +- mm/filemap.c | 2 ++ mm/shmem.c | 14 ++++++++++++++ 4 files changed, 20 insertions(+), 2 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 9b4dd49..f69e4cb 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x400000u #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u +#define ___GFP_PAGECACHE 0x2000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -92,6 +93,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ /* * This may seem redundant, but it's a way of annotating false positives vs. 
@@ -99,7 +101,7 @@ struct vm_area_struct; */ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75..bda4845 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); #else static inline struct page *__page_cache_alloc(gfp_t gfp) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp | __GFP_PAGECACHE, 0); } #endif diff --git a/mm/filemap.c b/mm/filemap.c index b7749a9..5bb9225 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp) int n; struct page *page; + gfp |= __GFP_PAGECACHE; + if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { diff --git a/mm/shmem.c b/mm/shmem.c index 8297623..02d7a9c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, return page; } +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */ +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info) +{ + /* If no internal shm_mount then it must be tmpfs */ + if (IS_ERR(shm_mnt)) + return true; + + /* Consider it to be tmpfs if the superblock is not the internal mount */ + return info->vfs_inode.i_sb != shm_mnt->mnt_sb; +} + static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp, pvma.vm_ops = NULL; pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); + if (shmem_inode_on_tmpfs(info)) + gfp |= __GFP_PAGECACHE; + page = alloc_page_vma(gfp, &pvma, 0); /* Drop reference taken by mpol_shared_policy_lookup() 
*/ -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f51.google.com (mail-ee0-f51.google.com [74.125.83.51]) by kanga.kvack.org (Postfix) with ESMTP id A0CF56B004D for ; Tue, 17 Dec 2013 11:48:30 -0500 (EST) Received: by mail-ee0-f51.google.com with SMTP id b15so3024242eek.10 for ; Tue, 17 Dec 2013 08:48:30 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 5si6507588eei.249.2013.12.17.08.48.29 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 08:48:29 -0800 (PST) From: Mel Gorman Subject: [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable Date: Tue, 17 Dec 2013 16:48:23 +0000 Message-Id: <1387298904-8824-6-git-send-email-mgorman@suse.de> In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a bug whereby new pages could be reclaimed before old pages because of how the page allocator and kswapd interacted on the per-zone LRU lists. Unfortunately it was missed during review that a consequence is that we also round-robin between NUMA nodes. This is bad for two reasons 1. It alters the semantics of MPOL_LOCAL without telling anyone 2. It incurs an immediate remote memory performance hit in exchange for a potential performance gain when memory needs to be reclaimed later No cookies for the reviewers on this one. This patch makes the behaviour of the fair zone allocator policy configurable. 
By default it will only distribute pages that are going to exist on the LRU between zones local to the allocating process. This preserves the historical semantics of MPOL_LOCAL. By default, slab pages are not distributed between zones after this patch is applied. It can be argued that they should get similar treatment but they have different lifecycles to LRU pages, the shrinkers are not zone-aware and the interaction between the page allocator and kswapd is different for slabs. If it turns out to be an almost universal win, we can change the default. Signed-off-by: Mel Gorman --- Documentation/sysctl/vm.txt | 32 ++++++++++++++ include/linux/mmzone.h | 2 + include/linux/swap.h | 2 + kernel/sysctl.c | 8 ++++ mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------ 5 files changed, 134 insertions(+), 12 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 1fbd4eb..8eaa562 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm: - swappiness - user_reserve_kbytes - vfs_cache_pressure +- zone_distribute_mode - zone_reclaim_mode ============================================================== @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes. ============================================================== +zone_distribute_mode + +Pages allocation and reclaim are managed on a per-zone basis. When the +system needs to reclaim memory, candidate pages are selected from these +per-zone lists. Historically, a potential consequence was that recently +allocated pages were considered reclaim candidates. From a zone-local +perspective, page aging was preserved but from a system-wide perspective +there was an age inversion problem. + +A similar problem occurs on a node level where young pages may be reclaimed +from the local node instead of allocating remote memory. 
Unforuntately, the +cost of accessing remote nodes is higher so the system must choose by default +between favouring page aging or node locality. zone_distribute_mode controls +how the system will distribute page ages between zones. + +0 = Never round-robin based on age + +Otherwise the values are ORed together + +1 = Distribute anon pages between zones local to the allocating node +2 = Distribute file pages between zones local to the allocating node +4 = Distribute slab pages between zones local to the allocating node + +The following three flags effectively alter MPOL_DEFAULT, be careful. + +8 = Distribute anon pages between zones remote to the allocating node +16 = Distribute file pages between zones remote to the allocating node +32 = Distribute slab pages between zones remote to the allocating node + +============================================================== + zone_reclaim_mode: Zone_reclaim_mode allows someone to set more or less aggressive approaches to diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b835d3f..20a75e3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -897,6 +897,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int sysctl_zone_distribute_mode_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); extern int numa_zonelist_order_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6..44329b0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -318,6 +318,8 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; +extern unsigned __bitwise__ zone_distribute_mode; + #ifdef CONFIG_NUMA extern int zone_reclaim_mode; extern int 
sysctl_min_unmapped_ratio; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a6047..b75c08f 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1349,6 +1349,14 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, #endif + { + .procname = "zone_distribute_mode", + .data = &zone_distribute_mode, + .maxlen = sizeof(zone_distribute_mode), + .mode = 0644, + .proc_handler = sysctl_zone_distribute_mode_handler, + .extra1 = &zero, + }, #ifdef CONFIG_NUMA { .procname = "zone_reclaim_mode", diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fd9677e..c2a2229 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1871,6 +1871,49 @@ static inline void init_zone_allows_reclaim(int nid) } #endif /* CONFIG_NUMA */ +/* Controls how page ages are distributed across zones automatically */ +unsigned __bitwise__ zone_distribute_mode __read_mostly; + +/* See zone_distribute_mode docmentation in Documentation/sysctl/vm.txt */ +#define DISTRIBUTE_DISABLE (0) +#define DISTRIBUTE_LOCAL_ANON (1UL << 0) +#define DISTRIBUTE_LOCAL_FILE (1UL << 1) +#define DISTRIBUTE_LOCAL_SLAB (1UL << 2) +#define DISTRIBUTE_REMOTE_ANON (1UL << 3) +#define DISTRIBUTE_REMOTE_FILE (1UL << 4) +#define DISTRIBUTE_REMOTE_SLAB (1UL << 5) + +#define DISTRIBUTE_STUPID_ANON (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON) +#define DISTRIBUTE_STUPID_FILE (DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE) +#define DISTRIBUTE_STUPID_SLAB (DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB) +#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB) + +/* Only these GFP flags are affected by the fair zone allocation policy */ +#define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE)) + +int sysctl_zone_distribute_mode_handler(ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + /* If you are an admin reading this comment, what were you 
thinking? */ + if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) == + DISTRIBUTE_STUPID_ANON)) + zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON; + if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) == + DISTRIBUTE_STUPID_FILE)) + zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE; + if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) == + DISTRIBUTE_STUPID_SLAB)) + zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB; + + return 0; +} + /* * Distribute pages in proportion to the individual zone size to ensure fair * page aging. The zone a page was allocated in should have no effect on the @@ -1882,26 +1925,60 @@ static inline void init_zone_allows_reclaim(int nid) static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone, struct zone *zone, int alloc_flags) { + bool zone_is_local; + bool is_file, is_slab, is_anon; + /* Only round robin in the allocator fast path */ if (!(alloc_flags & ALLOC_WMARK_LOW)) return false; - /* Only round robin pages likely to be LRU or reclaimable slab */ - if (!(gfp_mask & GFP_MOVABLE_MASK)) + /* Only a subset of GFP flags are considered for fair zone policy */ + if (!(gfp_mask & DISTRIBUTE_GFP_MASK)) return false; - /* Distribute to the next zone if this zone has exhausted its batch */ - if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) - return true; - /* - * When zone_reclaim_mode is enabled, try to stay in local zones in the - * fastpath. If that fails, the slowpath is entered, which will do - * another pass starting with the local zones, but ultimately fall back - * back to remote zones that do not partake in the fairness round-robin - * cycle of this zonelist. + * Classify the type of allocation. From this point on, the fair zone + * allocation policy is being applied. If the allocation does not meet + * the criteria the zone must be skipped. 
*/ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + is_file = gfp_mask & __GFP_PAGECACHE; + is_slab = gfp_mask & __GFP_RECLAIMABLE; + is_anon = (!is_file && !is_slab); + WARN_ON_ONCE(is_slab && is_file); + + zone_is_local = zone_local(preferred_zone, zone); + if (zone_is_local) { + /* Distribute between zones local to the node if requested */ + if (is_anon && (zone_distribute_mode & DISTRIBUTE_LOCAL_ANON)) + goto check_batch; + if (is_file && (zone_distribute_mode & DISTRIBUTE_LOCAL_FILE)) + goto check_batch; + if (is_slab && (zone_distribute_mode & DISTRIBUTE_LOCAL_SLAB)) + goto check_batch; + } else { + /* + * When zone_reclaim_mode is enabled, stick to local zones. If + * that fails, the slowpath is entered, which will do another + * pass starting with the local zones, but ultimately fall back + * to remote zones that do not partake in the fairness + * round-robin cycle of this zonelist. + */ + if (zone_reclaim_mode) + return false; + + if (is_anon && (zone_distribute_mode & DISTRIBUTE_REMOTE_ANON)) + goto check_batch; + if (is_file && (zone_distribute_mode & DISTRIBUTE_REMOTE_FILE)) + goto check_batch; + if (is_slab && (zone_distribute_mode & DISTRIBUTE_REMOTE_SLAB)) + goto check_batch; + } + + return true; + +check_batch: + /* Distribute to the next zone if this zone has exhausted its batch */ + if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) return true; return false; @@ -3797,6 +3874,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) __build_all_zonelists(NULL); mminit_verify_zonelist(); cpuset_init_current_mems_allowed(); + zone_distribute_mode = DISTRIBUTE_DEFAULT; } else { #ifdef CONFIG_MEMORY_HOTPLUG if (zone) -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f48.google.com (mail-ee0-f48.google.com [74.125.83.48]) by kanga.kvack.org (Postfix) with ESMTP id B8C9A6B0055 for ; Tue, 17 Dec 2013 11:48:31 -0500 (EST)
Received: by mail-ee0-f48.google.com with SMTP id e49so3005142eek.7 for ; Tue, 17 Dec 2013 08:48:31 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id s42si5349338eew.245.2013.12.17.08.48.30 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 08:48:31 -0800 (PST)
From: Mel Gorman
Subject: [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache
Date: Tue, 17 Dec 2013 16:48:24 +0000
Message-Id: <1387298904-8824-7-git-send-email-mgorman@suse.de>
In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de>
References: <1387298904-8824-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

This patch introduces a vm.pagecache_interleave sysctl that allows the administrator to alter the default memory allocation policy for file-backed pages. It replaces the more configurable zone_distribute_mode interface, which is expected to be too complex to expose to users and gives an unnecessary level of control.

By default it is disabled but there is strong evidence that users on NUMA machines will want to enable this. The default is expected to change once the documentation is in sync. Ideally it would also be possible to control on a per-process basis by allowing processes to select either an MPOL_LOCAL or MPOL_INTERLEAVE_PAGECACHE memory policy as memory policies are the traditional way for controlling allocation behaviour.
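Illustration only, not part of the patch: a userspace sketch of how a tool might check whether the proposed knob exists and what it is set to. It assumes the sysctl appears under /proc/sys/vm with the name used in this patch; read_sysctl_uint() and PAGECACHE_INTERLEAVE_PATH are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed location, based on the procname this patch registers. */
#define PAGECACHE_INTERLEAVE_PATH "/proc/sys/vm/pagecache_interleave"

/*
 * Read an unsigned integer sysctl from the given path.
 * Returns 0 and stores the value on success, or -1 if the file is
 * missing or does not start with an unsigned integer.
 */
static int read_sysctl_uint(const char *path, unsigned int *value)
{
	FILE *fp;
	int ok;

	assert(path && value);

	fp = fopen(path, "r");
	if (!fp)
		return -1; /* kernel without the patch, or no permission */

	ok = (fscanf(fp, "%u", value) == 1);
	fclose(fp);

	return ok ? 0 : -1;
}
```

A caller would pass PAGECACHE_INTERLEAVE_PATH and treat a return of -1 the same as the disabled default, since a kernel without this patch simply has no such file.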
Signed-off-by: Mel Gorman --- Documentation/sysctl/vm.txt | 61 +++++++++++++++++++++------------------------ include/linux/mmzone.h | 2 +- include/linux/swap.h | 2 +- kernel/sysctl.c | 8 +++--- mm/page_alloc.c | 18 +++++-------- 5 files changed, 41 insertions(+), 50 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 8eaa562..655ed0a 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -49,6 +49,7 @@ Currently, these files are in /proc/sys/vm: - oom_kill_allocating_task - overcommit_memory - overcommit_ratio +- pagecache_interleave - page-cluster - panic_on_oom - percpu_pagelist_fraction @@ -56,7 +57,6 @@ Currently, these files are in /proc/sys/vm: - swappiness - user_reserve_kbytes - vfs_cache_pressure -- zone_distribute_mode - zone_reclaim_mode ============================================================== @@ -608,6 +608,34 @@ of physical RAM. See above. ============================================================== +pagecache_interleave: + +This setting is only relevant to NUMA machines. + +Historically, the default behaviour of the system is to allocate memory +local to the process. The behaviour is usually modified through the use +of memory policies while zone_reclaim_mode controls how strict the local +memory allocation policy is. + +Issues arise when the allocating process is frequently running on the same +node. The kernel's memory reclaim daemon runs one instance per NUMA node. +A consequence is that relatively new memory may be reclaimed by kswapd when +the allocating process is running on a specific node. The user-visible +impact is that the system appears to do more IO than necessary when a +workload is accessing files that are larger than a given NUMA node. + +One way of addressing this is to use the interleave memory policy but that +is not always possible. + +Another option is to enable this setting. 
When enabled, the default +memory allocation changes from MPOL_LOCAL to interleaving file-backed +pages by default. The downside is that some file accesses will now be +to remote memory even though the local node had available resources. +The upside is that workloads working on files larger than a NUMA node +will not reclaim active pages prematurely. + +============================================================== + page-cluster page-cluster controls the number of pages up to which consecutive pages @@ -725,37 +753,6 @@ causes the kernel to prefer to reclaim dentries and inodes. ============================================================== -zone_distribute_mode - -Pages allocation and reclaim are managed on a per-zone basis. When the -system needs to reclaim memory, candidate pages are selected from these -per-zone lists. Historically, a potential consequence was that recently -allocated pages were considered reclaim candidates. From a zone-local -perspective, page aging was preserved but from a system-wide perspective -there was an age inversion problem. - -A similar problem occurs on a node level where young pages may be reclaimed -from the local node instead of allocating remote memory. Unforuntately, the -cost of accessing remote nodes is higher so the system must choose by default -between favouring page aging or node locality. zone_distribute_mode controls -how the system will distribute page ages between zones. - -0 = Never round-robin based on age - -Otherwise the values are ORed together - -1 = Distribute anon pages between zones local to the allocating node -2 = Distribute file pages between zones local to the allocating node -4 = Distribute slab pages between zones local to the allocating node - -The following three flags effectively alter MPOL_DEFAULT, be careful. 
- -8 = Distribute anon pages between zones remote to the allocating node -16 = Distribute file pages between zones remote to the allocating node -32 = Distribute slab pages between zones remote to the allocating node - -============================================================== - zone_reclaim_mode: Zone_reclaim_mode allows someone to set more or less aggressive approaches to diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 20a75e3..2fb9e2d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -897,7 +897,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); -int sysctl_zone_distribute_mode_handler(struct ctl_table *, int, +int sysctl_zone_pagecache_interleave_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); extern int numa_zonelist_order_handler(struct ctl_table *, int, diff --git a/include/linux/swap.h b/include/linux/swap.h index 44329b0..2b522cf 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -318,7 +318,7 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; -extern unsigned __bitwise__ zone_distribute_mode; +extern unsigned int zone_pagecache_interleave; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index b75c08f..385d7cb 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1350,11 +1350,11 @@ static struct ctl_table vm_table[] = { }, #endif { - .procname = "zone_distribute_mode", - .data = &zone_distribute_mode, - .maxlen = sizeof(zone_distribute_mode), + .procname = "pagecache_interleave", + .data = &zone_pagecache_interleave, + .maxlen = sizeof(zone_pagecache_interleave), .mode = 0644, - .proc_handler = sysctl_zone_distribute_mode_handler, + .proc_handler = 
sysctl_zone_pagecache_interleave_handler, .extra1 = &zero, }, #ifdef CONFIG_NUMA diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c2a2229..b6c8e63 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1872,7 +1872,8 @@ static inline void init_zone_allows_reclaim(int nid) #endif /* CONFIG_NUMA */ /* Controls how page ages are distributed across zones automatically */ -unsigned __bitwise__ zone_distribute_mode __read_mostly; +static unsigned __bitwise__ zone_distribute_mode __read_mostly; +unsigned int zone_pagecache_interleave; /* See zone_distribute_mode docmentation in Documentation/sysctl/vm.txt */ #define DISTRIBUTE_DISABLE (0) @@ -1891,7 +1892,7 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly; /* Only these GFP flags are affected by the fair zone allocation policy */ #define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE)) -int sysctl_zone_distribute_mode_handler(ctl_table *table, int write, +int sysctl_zone_pagecache_interleave_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { int rc; @@ -1900,16 +1901,9 @@ int sysctl_zone_distribute_mode_handler(ctl_table *table, int write, if (rc) return rc; - /* If you are an admin reading this comment, what were you thinking? */ - if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) == - DISTRIBUTE_STUPID_ANON)) - zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON; - if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) == - DISTRIBUTE_STUPID_FILE)) - zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE; - if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) == - DISTRIBUTE_STUPID_SLAB)) - zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB; + zone_distribute_mode = DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB; + if (zone_pagecache_interleave) + zone_distribute_mode |= DISTRIBUTE_REMOTE_FILE; return 0; } -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-bk0-f54.google.com (mail-bk0-f54.google.com [209.85.214.54]) by kanga.kvack.org (Postfix) with ESMTP id 1E1866B0035 for ; Tue, 17 Dec 2013 15:02:21 -0500 (EST) Received: by mail-bk0-f54.google.com with SMTP id v16so2996114bkz.41 for ; Tue, 17 Dec 2013 12:02:20 -0800 (PST) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id xy9si5667728bkb.46.2013.12.17.12.02.19 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 12:02:19 -0800 (PST) Date: Tue, 17 Dec 2013 15:02:10 -0500 From: Johannes Weiner Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131217200210.GG21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Hi Mel, On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > This series is currently untested and is being posted to sync up discussions > on the treatment of page cache pages, particularly the sysv part. I have > not thought it through in detail but postings patches is the easiest way > to highlight where I think a problem might be. 
>
> Changelog since v2
> o Drop an accounting patch, behaviour is deliberate
> o Special case tmpfs and shmem pages for discussion
>
> Changelog since v1
> o Fix lot of brain damage in the configurable policy patch
> o Yoink a page cache annotation patch
> o Only account batch pages against allocations eligible for the fair policy
> o Add patch that default distributes file pages on remote nodes
>
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of how
> the page allocator and kswapd interacted on the per-zone LRU lists.

Not just that, it was about ensuring predictable cache replacement and
maximizing the cache's effectiveness. This implicitly fixed the
kswapd interaction bug, but that was not the sole reason (I realize
that the original changelog is incomplete and I apologize for that).

I had offline discussions with Andrea back then, and his first
suggestion, too, was to make this a zone fairness placement that is
exclusive to the local node, but eventually he agreed that the problem
applies just as much on the global level and that we should apply
fairness throughout the system as long as we honor zone_reclaim_mode
and hard bindings. During our discussions now, it turned out that
zone_reclaim_mode is a terrible predictor for preferred locality, but
we also more or less agreed that the locality issues in the first
place are not really applicable to cache loads dominated by IO cost.

So I think the main discrepancy between the original patch and what we
truly want is that aging fairness is really only relevant for actual
cache backed by secondary storage, because cache replacement is an
ongoing operation that involves IO. This is in contrast to memory
types that involve IO only in extreme cases (anon, tmpfs, shmem) or no
IO at all (slab, kernel allocations), for which we prefer NUMA
locality.

> Unfortunately a side-effect missed during review was that it's now very
> easy to allocate remote memory on NUMA machines. The problem is that
> it is not a simple case of just restoring local allocation policies as
> there are genuine reasons why global page aging may be prefereable. It's
> still a major change to default behaviour so this patch makes the policy
> configurable and sets what I think is a sensible default.
>
> The patches are on top of some NUMA balancing patches currently in -mm.
> It's untested and posted to discuss patches 4 and 6.

It might be easier in dealing with -stable if we start with the
critical fix(es) to restore sane functionality as much and as compactly
as possible and then place the cleanups on top?

In my local tree, I have the following as the first patch:

---
From: Johannes Weiner
Subject: [patch] mm: page_alloc: restrict fair allocator policy to page cache

81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
in order to ensure predictable page cache replacement and to maximize
the cache's effectiveness in reducing IO regardless of zone or node
topology. However, it was overzealous in round-robin placing every
type of allocation over all allowable nodes, instead of preferring
locality, which resulted in severe regressions on certain NUMA
workloads that have nothing to do with page cache.

This patch drastically reduces the impact of the original change by
having the round-robin placement policy only apply to page cache
backed by secondary storage, and no longer to anonymous memory, shmem,
tmpfs, or slab allocations.

This still changes the long-standing behavior of page cache adhering
to the configured memory policy and preferring local allocations by
default, so make it configurable in case somebody relies on it.
However, we also expect the majority of users to prefer maximum cache
effectiveness and a predictable replacement behavior over memory
locality, so reflect this in the default setting of the sysctl.
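
The placement rule described in this changelog can be sketched in
userspace as follows. This is only a simplified model under stated
assumptions, not the kernel code: struct zone_model,
pick_zone_fastpath() and account_alloc() are hypothetical names,
watermark checks are ignored, and the local_only flag stands in for
"zone_reclaim_mode or pagecache_mempolicy_mode is enabled".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct zone. */
struct zone_model {
	int node;	/* NUMA node the zone belongs to */
	long batch;	/* stand-in for the NR_ALLOC_BATCH counter */
};

/*
 * Model of the fast-path zonelist scan: zones[] is ordered by
 * preference (local node first). Only page cache allocations are
 * subject to the fairness checks; everything else simply takes the
 * first zone, i.e. prefers locality. Returns the index of the chosen
 * zone, or -1 if the fast path fails and the slow path would be taken.
 */
static int pick_zone_fastpath(const struct zone_model *zones, size_t nr,
			      int preferred_node, bool pagecache,
			      bool local_only)
{
	size_t i;

	assert(zones != NULL);

	for (i = 0; i < nr; i++) {
		if (pagecache) {
			/* Zone has used up its share of this round-robin cycle. */
			if (zones[i].batch <= 0)
				continue;
			/* Remote zones do not take part when locality is enforced. */
			if (local_only && zones[i].node != preferred_node)
				continue;
		}
		return (int)i;
	}
	return -1;
}

/* Every allocation eats into the batch, whether or not it was placed fairly. */
static void account_alloc(struct zone_model *zone, long pages)
{
	zone->batch -= pages;
}
```

With two zones on nodes 0 and 1 and the local batch exhausted, a page
cache allocation in this model moves on to the remote zone, an
allocation of any other type still lands on the local zone, and a page
cache allocation with local_only set falls through to the slow path.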
--- Documentation/sysctl/vm.txt | 21 +++++++++++++++++ Documentation/vm/numa_memory_policy.txt | 8 +++++++ include/linux/gfp.h | 4 +++- include/linux/pagemap.h | 2 +- include/linux/swap.h | 2 ++ kernel/sysctl.c | 8 +++++++ mm/filemap.c | 2 ++ mm/page_alloc.c | 41 +++++++++++++++++++++++++-------- 8 files changed, 76 insertions(+), 12 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 1fbd4eb7b64a..50d250f7470f 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: - memory_failure_early_kill - memory_failure_recovery - min_free_kbytes +- pagecache_mempolicy_mode - min_slab_ratio - min_unmapped_ratio - mmap_min_addr @@ -404,6 +405,26 @@ Setting this too high will OOM your machine instantly. ============================================================= +pagecache_mempolicy_mode: + +This is available only on NUMA kernels. + +Per default, the configured memory policy is applicable to anonymous +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an +interleaving fashion over all allowed nodes (hardbindings and +zone_reclaim_mode excluded). + +The assumption is that, when it comes to pagecache, users generally +prefer predictable replacement behavior regardless of NUMA topology +and maximizing the cache's effectiveness in reducing IO over memory +locality. + +This behavior can be changed by enabling pagecache_mempolicy_mode, in +which case page cache allocations will be placed according to the +configured memory policy (Documentation/vm/numa_memory_policy.txt). + +============================================================= + min_slab_ratio: This is available only on NUMA kernels. 
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 4e7da6543424..64d48b6378db 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt @@ -16,6 +16,14 @@ programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. +Note that, per default, the memory policies as described below apply to process +memory and shmem/tmpfs/ramfs only. Pagecache backed by secondary storage will +be interleaved fairly over all allowable nodes (respecting hardbindings and +zone_reclaim_mode) in order to maximize the cache's effectiveness in reducing IO +and to ensure predictable cache replacement. Special setups that require +pagecache to adhere to the configured memory policy can change this behavior by +enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt). + MEMORY POLICY CONCEPTS Scope of Memory Policies diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 9b4dd491f7e8..f69e4cb78ccf 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x400000u #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u +#define ___GFP_PAGECACHE 0x2000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -92,6 +93,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ /* * This may seem redundant, but it's a way of annotating false positives vs. 
@@ -99,7 +101,7 @@ struct vm_area_struct; */ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75a078b..bda48453af8e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); #else static inline struct page *__page_cache_alloc(gfp_t gfp) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp | __GFP_PAGECACHE, 0); } #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6c219f..3458994b0881 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; +extern int pagecache_mempolicy_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 +#define pagecache_mempolicy_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a604726d0b..a8c56c1dc98e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, { + .procname = "pagecache_mempolicy_mode", + .data = &pagecache_mempolicy_mode, + .maxlen = sizeof(pagecache_mempolicy_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + }, + { .procname = "min_unmapped_ratio", .data = &sysctl_min_unmapped_ratio, .maxlen = sizeof(sysctl_min_unmapped_ratio), diff --git a/mm/filemap.c b/mm/filemap.c index b7749a92021c..5bb922506906 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -517,6 +517,8 
@@ struct page *__page_cache_alloc(gfp_t gfp) int n; struct page *page; + gfp |= __GFP_PAGECACHE; + if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 580a5f075ed0..b28370932950 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1547,7 +1547,15 @@ again: get_pageblock_migratetype(page)); } + /* + * All allocations eat into the round-robin batch, even + * allocations that are not subject to round-robin placement + * themselves. This makes sure that allocations that ARE + * subject to round-robin placement compensate for the + * allocations that aren't, to have equal placement overall. + */ __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); + __count_zone_vm_events(PGALLOC, zone, 1 << order); zone_statistics(preferred_zone, zone, gfp_flags); local_irq_restore(flags); @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, #ifdef CONFIG_NUMA /* + * pagecache_mempolicy_mode - whether page cache should honor the + * configured memory policy and allocate from the zonelist in order of + * preference, or whether it should be interleaved fairly over all + * allowed zones in the given zonelist to maximize cache effects and + * ensure predictable cache replacement. + */ +int pagecache_mempolicy_mode __read_mostly; + +/* * zlc_setup - Setup for "zonelist cache". Uses cached zone data to * skip over zones that are not allowed by the cpuset, or that have * been recently (in last second) found to be nearly full. 
See further @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return local_zone->node == zone->node; } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) @@ -1908,22 +1925,25 @@ zonelist_scan: if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; /* - * Distribute pages in proportion to the individual - * zone size to ensure fair page aging. The zone a - * page was allocated in should have no effect on the - * time the page has in memory before being reclaimed. + * Distribute page cache pages in proportion to the + * individual zone size to ensure fair page aging. + * The zone a page was allocated in should have no + * effect on the time the page has in memory before + * being reclaimed. * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. If that fails, the + * When pagecache_mempolicy_mode or zone_reclaim_mode + * is enabled, try to allocate from zones within the + * preferred node in the fastpath. If that fails, the * slowpath is entered, which will do another pass * starting with the local zones, but ultimately fall * back to remote zones that do not partake in the * fairness round-robin cycle of this zonelist. */ - if (alloc_flags & ALLOC_WMARK_LOW) { + if ((alloc_flags & ALLOC_WMARK_LOW) && + (gfp_mask & __GFP_PAGECACHE)) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; - if (zone_reclaim_mode && + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && !zone_local(preferred_zone, zone)) continue; } @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, * thrash fairness information for zones that are not * actually part of this zonelist's round-robin cycle. 
*/ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && + !zone_local(preferred_zone, zone)) continue; mod_zone_page_state(zone, NR_ALLOC_BATCH, high_wmark_pages(zone) - -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-bk0-f53.google.com (mail-bk0-f53.google.com [209.85.214.53]) by kanga.kvack.org (Postfix) with ESMTP id E64A26B0035 for ; Wed, 18 Dec 2013 01:18:00 -0500 (EST) Received: by mail-bk0-f53.google.com with SMTP id na10so41699bkb.40 for ; Tue, 17 Dec 2013 22:18:00 -0800 (PST) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id dg6si6048601bkc.154.2013.12.17.22.17.59 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 17 Dec 2013 22:17:59 -0800 (PST) Date: Wed, 18 Dec 2013 01:17:50 -0500 From: Johannes Weiner Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218061750.GK21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131217200210.GG21724@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote: > Hi Mel, > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > > This series is currently untested and is being posted to sync up discussions > > on the treatment of page cache pages, particularly the sysv part. I have > > not thought it through in detail but postings patches is the easiest way > > to highlight where I think a problem might be. 
> > > > Changelog since v2 > > o Drop an accounting patch, behaviour is deliberate > > o Special case tmpfs and shmem pages for discussion > > > > Changelog since v1 > > o Fix lot of brain damage in the configurable policy patch > > o Yoink a page cache annotation patch > > o Only account batch pages against allocations eligible for the fair policy > > o Add patch that default distributes file pages on remote nodes > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a > > bug whereby new pages could be reclaimed before old pages because of how > > the page allocator and kswapd interacted on the per-zone LRU lists. > > Not just that, it was about ensuring predictable cache replacement and > maximizing the cache's effectiveness. This implicitely fixed the > kswapd interaction bug, but that was not the sole reason (I realize > that the original changelog is incomplete and I apologize for that). > > I have had offline discussions with Andrea back then and his first > suggestion was too to make this a zone fairness placement that is > exclusive to the local node, but eventually he agreed that the problem > applies just as much on the global level and that we should apply > fairness throughout the system as long as we honor zone_reclaim_mode > and hard bindings. During our discussions now, it turned out that > zone_reclaim_mode is a terrible predictor for preferred locality, but > we also more or less agreed that the locality issues in the first > place are not really applicable to cache loads dominated by IO cost. > > So I think the main discrepancy between the original patch and what we > truly want is that aging fairness is really only relevant for actual > cache backed by secondary storage, because cache replacement is an > ongoing operation that involves IO. As opposed to memory types that > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all > (slab, kernel allocations), in which case we prefer NUMA locality. 
>
> > Unfortunately a side-effect missed during review was that it's now very
> > easy to allocate remote memory on NUMA machines. The problem is that
> > it is not a simple case of just restoring local allocation policies as
> > there are genuine reasons why global page aging may be prefereable. It's
> > still a major change to default behaviour so this patch makes the policy
> > configurable and sets what I think is a sensible default.
> >
> > The patches are on top of some NUMA balancing patches currently in -mm.
> > It's untested and posted to discuss patches 4 and 6.
>
> It might be easier in dealing with -stable if we start with the
> critical fix(es) to restore sane functionality as much and as compact
> as possible and then place the cleanups on top?
>
> In my local tree, I have the following as the first patch:

Updated version with your tmpfs __GFP_PAGECACHE parts added, and
documentation and changelog updated as necessary.

I remain unconvinced that tmpfs pages should be round-robined, but I
agree with you that it is the conservative change to make for 3.12 and
3.12 and we can figure out the rest later. I sure hope that this
doesn't drive most people on NUMA to disable pagecache interleaving
right away as I expect most tmpfs workloads to see little to no
reclaim and prefer locality... :/

---
From: Johannes Weiner
Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache

81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
in order to ensure predictable pagecache replacement and to maximize
the cache's effectiveness in reducing IO regardless of zone or node
topology. However, it was overzealous in round-robin placing every
type of allocation over all allowable nodes, instead of preferring
locality, which resulted in severe regressions on certain NUMA
workloads that have nothing to do with pagecache.

This patch drastically reduces the impact of the original change by
having the round-robin placement policy only apply to pagecache
allocations and no longer to anonymous memory, shmem, slab and other
types of kernel allocations.

This still changes the long-standing behavior of pagecache adhering to
the configured memory policy and preferring local allocations by
default, so make it configurable in case somebody relies on it.
However, we also expect the majority of users to prefer maximum cache
effectiveness and a predictable replacement behavior over memory
locality, so reflect this in the default setting of the sysctl.

No-signoff-without-Mel's
Cc: # 3.12
---
 Documentation/sysctl/vm.txt             | 20 ++++++++++++++++
 Documentation/vm/numa_memory_policy.txt |  7 ++++++
 include/linux/gfp.h                     |  4 +++-
 include/linux/pagemap.h                 |  2 +-
 include/linux/swap.h                    |  2 ++
 kernel/sysctl.c                         |  8 +++++++
 mm/filemap.c                            |  2 ++
 mm/page_alloc.c                         | 41 +++++++++++++++++++++++++--------
 mm/shmem.c                              | 14 +++++++++++
 9 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb7b64a..308c342f62ad 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
 - memory_failure_early_kill
 - memory_failure_recovery
 - min_free_kbytes
+- pagecache_mempolicy_mode
 - min_slab_ratio
 - min_unmapped_ratio
 - mmap_min_addr
@@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly.
 
 =============================================================
 
+pagecache_mempolicy_mode:
+
+This is available only on NUMA kernels.
+
+Per default, pagecache is allocated in an interleaving fashion over
+all allowed nodes (hardbindings and zone_reclaim_mode excluded),
+regardless of the selected memory policy.
+ +The assumption is that, when it comes to pagecache, users generally +prefer predictable replacement behavior regardless of NUMA topology +and maximizing the cache's effectiveness in reducing IO over memory +locality. + +This behavior can be changed by enabling pagecache_mempolicy_mode, in +which case page cache allocations will be placed according to the +configured memory policy (Documentation/vm/numa_memory_policy.txt). + +============================================================= + min_slab_ratio: This is available only on NUMA kernels. diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 4e7da6543424..72247e565908 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt @@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. +Note that, per default, the memory policies do not apply to pagecache. Instead +it will be interleaved fairly over all allowable nodes (respecting hardbindings +and zone_reclaim_mode) in order to maximize the cache's effectiveness in +reducing IO and to ensure predictable cache replacement. Special setups that +require pagecache to adhere to the configured memory policy can change this +behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt). 
+ MEMORY POLICY CONCEPTS Scope of Memory Policies diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 9b4dd491f7e8..f69e4cb78ccf 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x400000u #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u +#define ___GFP_PAGECACHE 0x2000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -92,6 +93,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ /* * This may seem redundant, but it's a way of annotating false positives vs. @@ -99,7 +101,7 @@ struct vm_area_struct; */ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75a078b..bda48453af8e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); #else static inline struct page *__page_cache_alloc(gfp_t gfp) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp | __GFP_PAGECACHE, 0); } #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6c219f..3458994b0881 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; +extern int pagecache_mempolicy_mode; extern int 
sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 +#define pagecache_mempolicy_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a604726d0b..a8c56c1dc98e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, { + .procname = "pagecache_mempolicy_mode", + .data = &pagecache_mempolicy_mode, + .maxlen = sizeof(pagecache_mempolicy_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + }, + { .procname = "min_unmapped_ratio", .data = &sysctl_min_unmapped_ratio, .maxlen = sizeof(sysctl_min_unmapped_ratio), diff --git a/mm/filemap.c b/mm/filemap.c index b7749a92021c..5bb922506906 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp) int n; struct page *page; + gfp |= __GFP_PAGECACHE; + if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 580a5f075ed0..f7c0ecb5bb8b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1547,7 +1547,15 @@ again: get_pageblock_migratetype(page)); } + /* + * All allocations eat into the round-robin batch, even + * allocations that are not subject to round-robin placement + * themselves. This makes sure that allocations that ARE + * subject to round-robin placement compensate for the + * allocations that aren't, to have equal placement overall. 
+ */ __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); + __count_zone_vm_events(PGALLOC, zone, 1 << order); zone_statistics(preferred_zone, zone, gfp_flags); local_irq_restore(flags); @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, #ifdef CONFIG_NUMA /* + * pagecache_mempolicy_mode - whether pagecache allocations should + * honor the configured memory policy and allocate from the zonelist + * in order of preference, or whether they should interleave fairly + * over all allowed zones in the given zonelist to maximize cache + * effects and ensure predictable cache replacement. + */ +int pagecache_mempolicy_mode __read_mostly; + +/* * zlc_setup - Setup for "zonelist cache". Uses cached zone data to * skip over zones that are not allowed by the cpuset, or that have * been recently (in last second) found to be nearly full. See further @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return local_zone->node == zone->node; } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) @@ -1908,22 +1925,25 @@ zonelist_scan: if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; /* - * Distribute pages in proportion to the individual - * zone size to ensure fair page aging. The zone a - * page was allocated in should have no effect on the - * time the page has in memory before being reclaimed. + * Distribute pagecache pages in proportion to the + * individual zone size to ensure fair page aging. + * The zone a page was allocated in should have no + * effect on the time the page has in memory before + * being reclaimed. * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. 
If that fails, the + * When pagecache_mempolicy_mode or zone_reclaim_mode + * is enabled, try to allocate from zones within the + * preferred node in the fastpath. If that fails, the * slowpath is entered, which will do another pass * starting with the local zones, but ultimately fall * back to remote zones that do not partake in the * fairness round-robin cycle of this zonelist. */ - if (alloc_flags & ALLOC_WMARK_LOW) { + if ((alloc_flags & ALLOC_WMARK_LOW) && + (gfp_mask & __GFP_PAGECACHE)) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; - if (zone_reclaim_mode && + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && !zone_local(preferred_zone, zone)) continue; } @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, * thrash fairness information for zones that are not * actually part of this zonelist's round-robin cycle. */ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && + !zone_local(preferred_zone, zone)) continue; mod_zone_page_state(zone, NR_ALLOC_BATCH, high_wmark_pages(zone) - diff --git a/mm/shmem.c b/mm/shmem.c index 8297623fcaed..02d7a9c03463 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, return page; } +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */ +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info) +{ + /* If no internal shm_mount then it must be tmpfs */ + if (IS_ERR(shm_mnt)) + return true; + + /* Consider it to be tmpfs if the superblock is not the internal mount */ + return info->vfs_inode.i_sb != shm_mnt->mnt_sb; +} + static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp, pvma.vm_ops = NULL; pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); + if (shmem_inode_on_tmpfs(info)) + gfp |= 
__GFP_PAGECACHE; + page = alloc_page_vma(gfp, &pvma, 0); /* Drop reference taken by mpol_shared_policy_lookup() */ -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f42.google.com (mail-ee0-f42.google.com [74.125.83.42]) by kanga.kvack.org (Postfix) with ESMTP id C6BA36B0035 for ; Wed, 18 Dec 2013 08:47:54 -0500 (EST) Received: by mail-ee0-f42.google.com with SMTP id e53so3511974eek.15 for ; Wed, 18 Dec 2013 05:47:54 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id m44si13653eeo.247.2013.12.18.05.47.52 for ; Wed, 18 Dec 2013 05:47:53 -0800 (PST) Message-ID: <52B1A781.50002@redhat.com> Date: Wed, 18 Dec 2013 08:47:45 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> In-Reply-To: <20131218061750.GK21724@cmpxchg.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Mel Gorman , Andrew Morton , Dave Hansen , Linux-MM , LKML On 12/18/2013 01:17 AM, Johannes Weiner wrote: > Updated version with your tmpfs __GFP_PAGECACHE parts added and > documentation, changelog updated as necessary. I remain unconvinced > that tmpfs pages should be round-robined, but I agree with you that it > is the conservative change to do for 3.12 and 3.12 and we can figure > out the rest later. I sure hope that this doesn't drive most people > on NUMA to disable pagecache interleaving right away as I expect most > tmpfs workloads to see little to no reclaim and prefer locality... 
:/ Actually, I suspect most tmpfs heavy workloads will be things like databases with shared memory segments. Those tend to benefit from having all of the system's memory bandwidth available. The worker threads/processes tend to live all over the system, too... -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-bk0-f42.google.com (mail-bk0-f42.google.com [209.85.214.42]) by kanga.kvack.org (Postfix) with ESMTP id 187446B0035 for ; Wed, 18 Dec 2013 09:18:07 -0500 (EST) Received: by mail-bk0-f42.google.com with SMTP id w11so221823bkz.15 for ; Wed, 18 Dec 2013 06:18:07 -0800 (PST) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id tq3si235796bkb.139.2013.12.18.06.18.06 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 06:18:06 -0800 (PST) Date: Wed, 18 Dec 2013 09:17:58 -0500 From: Johannes Weiner Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218141758.GL21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <52B1A781.50002@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52B1A781.50002@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Mel Gorman , Andrew Morton , Dave Hansen , Linux-MM , LKML On Wed, Dec 18, 2013 at 08:47:45AM -0500, Rik van Riel wrote: > On 12/18/2013 01:17 AM, Johannes Weiner wrote: > > > Updated version with your tmpfs __GFP_PAGECACHE parts added and > > documentation, changelog updated as necessary. 
I remain unconvinced > > that tmpfs pages should be round-robined, but I agree with you that it > > is the conservative change to do for 3.12 and 3.12 and we can figure > > out the rest later. I sure hope that this doesn't drive most people > > on NUMA to disable pagecache interleaving right away as I expect most > > tmpfs workloads to see little to no reclaim and prefer locality... :/ > > Actually, I suspect most tmpfs heavy workloads will be things like > databases with shared memory segments. Those tend to benefit from > having all of the system's memory bandwidth available. The worker > threads/processes tend to live all over the system, too... Shared memory segments are explicitly excluded from the interleaving, though. The distinction is between the internal tmpfs mount that sysv shmem uses (mempolicy) and tmpfs mounts that use the actual filesystem interface (pagecache interleave). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f47.google.com (mail-ee0-f47.google.com [74.125.83.47]) by kanga.kvack.org (Postfix) with ESMTP id 2E5D46B0035 for ; Wed, 18 Dec 2013 09:51:14 -0500 (EST) Received: by mail-ee0-f47.google.com with SMTP id e51so3019192eek.20 for ; Wed, 18 Dec 2013 06:51:13 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id a9si296567eew.159.2013.12.18.06.51.12 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 06:51:12 -0800 (PST) Date: Wed, 18 Dec 2013 15:51:11 +0100 From: Michal Hocko Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218145111.GA27510@dhcp22.suse.cz> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131217200210.GG21724@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Tue 17-12-13 15:02:10, Johannes Weiner wrote: [...] > +pagecache_mempolicy_mode: > + > +This is available only on NUMA kernels. > + > +Per default, the configured memory policy is applicable to anonymous > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > +interleaving fashion over all allowed nodes (hardbindings and > +zone_reclaim_mode excluded). > + > +The assumption is that, when it comes to pagecache, users generally > +prefer predictable replacement behavior regardless of NUMA topology > +and maximizing the cache's effectiveness in reducing IO over memory > +locality. Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing semantically? The setting is per-cpuset rather than global which makes it harder to use but essentially it tries to distribute page cache pages across all the nodes. This is really getting confusing. We have zone_reclaim_mode to keep memory local in general, pagecache_mempolicy_mode to keep page cache local and PF_SPREAD_PAGE to spread the page cache around nodes. > + > +This behavior can be changed by enabling pagecache_mempolicy_mode, in > +which case page cache allocations will be placed according to the > +configured memory policy (Documentation/vm/numa_memory_policy.txt). 
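To make the interaction of the first two knobs concrete, here is a minimal userspace sketch of the fastpath skip test as it appears in the quoted page_alloc.c hunk. It is only a model under stated assumptions: zone_model, policy_model, zone_local_model and fastpath_skips_zone are made-up names for illustration, not kernel API, and PF_SPREAD_PAGE (the per-cpuset spreading) is deliberately not modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two struct zone fields the check reads. */
struct zone_model {
	int node;          /* NUMA node the zone belongs to */
	long alloc_batch;  /* models zone_page_state(zone, NR_ALLOC_BATCH) */
};

/* Hypothetical stand-in for the two global sysctls. */
struct policy_model {
	bool zone_reclaim_mode;
	bool pagecache_mempolicy_mode;
};

/* Mirrors the patched zone_local(): same node means local. */
static bool zone_local_model(const struct zone_model *preferred,
			     const struct zone_model *zone)
{
	return preferred->node == zone->node;
}

/*
 * Returns true when the modelled fastpath would "continue" past this zone.
 * wmark_low models (alloc_flags & ALLOC_WMARK_LOW) and is_pagecache models
 * (gfp_mask & __GFP_PAGECACHE).
 */
static bool fastpath_skips_zone(const struct policy_model *p,
				bool wmark_low, bool is_pagecache,
				const struct zone_model *preferred,
				const struct zone_model *zone)
{
	/* Only pagecache allocations take part in the fairness round-robin. */
	if (!(wmark_low && is_pagecache))
		return false;

	/* Batch exhausted: move on so the other zones get their share. */
	if (zone->alloc_batch <= 0)
		return true;

	/* Either knob restricts the round-robin to the preferred node. */
	if ((p->zone_reclaim_mode || p->pagecache_mempolicy_mode) &&
	    !zone_local_model(preferred, zone))
		return true;

	return false;
}
```

With both knobs off, a remote zone that still has batch left is eligible for a pagecache allocation; turning on either knob makes the same zone get skipped in the fastpath, which is the sense in which the two settings overlap.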
-- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ea0-f178.google.com (mail-ea0-f178.google.com [209.85.215.178]) by kanga.kvack.org (Postfix) with ESMTP id 077B36B0035 for ; Wed, 18 Dec 2013 10:00:42 -0500 (EST) Received: by mail-ea0-f178.google.com with SMTP id d10so3694510eaj.9 for ; Wed, 18 Dec 2013 07:00:42 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 5si321928eei.186.2013.12.18.07.00.42 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 07:00:42 -0800 (PST) Date: Wed, 18 Dec 2013 15:00:38 +0000 From: Mel Gorman Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218150038.GP11295@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131218061750.GK21724@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote: > On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote: > > Hi Mel, > > > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > > > This series is currently untested and is being posted to sync up discussions > > > on the treatment of page cache pages, particularly the sysv part. I have > > > not thought it through in detail but postings patches is the easiest way > > > to highlight where I think a problem might be. 
> > > > > > Changelog since v2 > > > o Drop an accounting patch, behaviour is deliberate > > > o Special case tmpfs and shmem pages for discussion > > > > > > Changelog since v1 > > > o Fix lot of brain damage in the configurable policy patch > > > o Yoink a page cache annotation patch > > > o Only account batch pages against allocations eligible for the fair policy > > > o Add patch that default distributes file pages on remote nodes > > > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a > > > bug whereby new pages could be reclaimed before old pages because of how > > > the page allocator and kswapd interacted on the per-zone LRU lists. > > > > Not just that, it was about ensuring predictable cache replacement and > > maximizing the cache's effectiveness. This implicitely fixed the > > kswapd interaction bug, but that was not the sole reason (I realize > > that the original changelog is incomplete and I apologize for that). > > > > I have had offline discussions with Andrea back then and his first > > suggestion was too to make this a zone fairness placement that is > > exclusive to the local node, but eventually he agreed that the problem > > applies just as much on the global level and that we should apply > > fairness throughout the system as long as we honor zone_reclaim_mode > > and hard bindings. During our discussions now, it turned out that > > zone_reclaim_mode is a terrible predictor for preferred locality, but > > we also more or less agreed that the locality issues in the first > > place are not really applicable to cache loads dominated by IO cost. > > > > So I think the main discrepancy between the original patch and what we > > truly want is that aging fairness is really only relevant for actual > > cache backed by secondary storage, because cache replacement is an > > ongoing operation that involves IO. 
As opposed to memory types that > > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all > > (slab, kernel allocations), in which case we prefer NUMA locality. > > > > > Unfortunately a side-effect missed during review was that it's now very > > > easy to allocate remote memory on NUMA machines. The problem is that > > > it is not a simple case of just restoring local allocation policies as > > > there are genuine reasons why global page aging may be prefereable. It's > > > still a major change to default behaviour so this patch makes the policy > > > configurable and sets what I think is a sensible default. > > > > > > The patches are on top of some NUMA balancing patches currently in -mm. > > > It's untested and posted to discuss patches 4 and 6. > > > > It might be easier in dealing with -stable if we start with the > > critical fix(es) to restore sane functionality as much and as compact > > as possible and then place the cleanups on top? > > > > In my local tree, I have the following as the first patch: > > Updated version with your tmpfs __GFP_PAGECACHE parts added and > documentation, changelog updated as necessary. I remain unconvinced > that tmpfs pages should be round-robined, but I agree with you that it > is the conservative change to do for 3.12 and 3.12 and we can figure > out the rest later. I assume you mean 3.12 and 3.13 here. > I sure hope that this doesn't drive most people > on NUMA to disable pagecache interleaving right away as I expect most > tmpfs workloads to see little to no reclaim and prefer locality... :/ > I hope you're right but I expect the experience will be like zone_reclaim_mode. We're going to be looking out for bug reports that are "fixed" by disabling pagecache locality and pushing back on them by fixing the real problem. This was the experience with zone_reclaim_mode when it started going wrong. It was also the experience with THP for a very long time.
Disabling THP was a workaround for all sorts of problems and it was very important to fix them and push back on anyone documenting disabling THP as a standard workaround. > --- > From: Johannes Weiner > Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache > Monolithic patch with multiple changes but meh. I'm not pushed because I know what the breakout looks like. FWIW, I had intended the entire of my broken-out series for 3.12 and 3.13 once it got ironed out. I find the series easier to understand but of course I would. > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged > in order to ensure predictable pagecache replacement and to maximize > the cache's effectiveness of reducing IO regardless of zone or node > topology. > > However, it was overzealous in round-robin placing every type of > allocation over all allowable nodes, instead of preferring locality, > which resulted in severe regressions on certain NUMA workloads that > have nothing to do with pagecache. > > This patch drastically reduces the impact of the original change by > having the round-robin placement policy only apply to pagecache > allocations and no longer to anonymous memory, shmem, slab and other > types of kernel allocations. > > This still changes the long-standing behavior of pagecache adhering to > the configured memory policy and preferring local allocations per > default, so make it configurable in case somebody relies on it. > However, we also expect the majority of users to prefer maximium cache > effectiveness and a predictable replacement behavior over memory > locality, so reflect this in the default setting of the sysctl. 
> > No-signoff-without-Mel's > Cc: # 3.12 > --- > Documentation/sysctl/vm.txt | 20 ++++++++++++++++ > Documentation/vm/numa_memory_policy.txt | 7 ++++++ > include/linux/gfp.h | 4 +++- > include/linux/pagemap.h | 2 +- > include/linux/swap.h | 2 ++ > kernel/sysctl.c | 8 +++++++ > mm/filemap.c | 2 ++ > mm/page_alloc.c | 41 +++++++++++++++++++++++++-------- > mm/shmem.c | 14 +++++++++++ > 9 files changed, 88 insertions(+), 12 deletions(-) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 1fbd4eb7b64a..308c342f62ad 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: > - memory_failure_early_kill > - memory_failure_recovery > - min_free_kbytes > +- pagecache_mempolicy_mode > - min_slab_ratio > - min_unmapped_ratio > - mmap_min_addr Sure about the name? This is a boolean and "mode" implies it might be a bitmask. That said, in complaining about yours I can see that my own naming also sucks. > @@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly. > > ============================================================= > > +pagecache_mempolicy_mode: > + > +This is available only on NUMA kernels. > + > +Per default, pagecache is allocated in an interleaving fashion over > +all allowed nodes (hardbindings and zone_reclaim_mode excluded), > +regardless of the selected memory policy. > + > +The assumption is that, when it comes to pagecache, users generally > +prefer predictable replacement behavior regardless of NUMA topology > +and maximizing the cache's effectiveness in reducing IO over memory > +locality. > + > +This behavior can be changed by enabling pagecache_mempolicy_mode, in > +which case page cache allocations will be placed according to the > +configured memory policy (Documentation/vm/numa_memory_policy.txt). 
> + Ok this indicates that pagecache will still be interleaved on zones local to the node the process is allocating on. Good, because that preserves a very important aspect of your original patch. The current description feels a little backwards though -- "Enable this to *not* interleave pagecache". This documented behaviour says to me that pagecache_obey_mempolicy might be a better name if enabling it uses the system default memory policy. However, even that might put us in a corner. Ultimately we want this to be controllable on a per-process basis using memory policies. Merging what I have in v3, the unreleased v4 and this patch, I ended up with this. The observation about cpusets was raised by Michal Hocko on IRC. ---8<--- mpol_interleave_files This is available only on NUMA kernels. Historically, the default behaviour of the system is to allocate memory local to the process. The behaviour was usually modified through the use of memory policies while zone_reclaim_mode controls how strict the local memory allocation policy is. Issues arise when the allocating process is frequently running on the same node. The kernel's memory reclaim daemon runs one instance per NUMA node. A consequence is that relatively new memory may be reclaimed by kswapd when the allocating process is running on a specific node. The user-visible impact is that the system appears to do more IO than necessary when a workload is accessing files that are larger than a given NUMA node. To address this problem, the default system memory policy is modified by this tunable. When this tunable is enabled, the system default memory policy will interleave batches of file-backed pages over all allowed zones and nodes. The assumption is that, when it comes to file pages, users generally prefer predictable replacement behavior regardless of NUMA topology and maximizing the page cache's effectiveness in reducing IO over memory locality. 
The tunable zone_reclaim_mode overrides this and enabling zone_reclaim_mode functionally disables mpol_interleave_files. A process running within a memory cpuset will obey the cpuset policy and ignore mpol_interleave_files. At the time of writing, this parameter cannot be overridden by a process using set_mempolicy to set the task memory policy. Similarly, numactl setting the task memory policy will not override this setting. This may change in the future. The tunable is enabled by default and has two recognised values: 0: Use the MPOL_LOCAL policy as the system-wide default 1: Batch interleave file-backed allocations over all allowed nodes Once enabled, the downside is that some file accesses will now be to remote memory even though the local node had available resources. This will hurt workloads with small or short lived files that fit easily within one node. The upside is that workloads working on files larger than a NUMA node will not reclaim active pages prematurely. ---8<--- > +============================================================= > + > min_slab_ratio: > > This is available only on NUMA kernels. > diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt > index 4e7da6543424..72247e565908 100644 > --- a/Documentation/vm/numa_memory_policy.txt > +++ b/Documentation/vm/numa_memory_policy.txt > @@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of. When > both cpusets and policies are applied to a task, the restrictions of the cpuset > takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. > > +Note that, per default, the memory policies do not apply to pagecache. Instead > +it will be interleaved fairly over all allowable nodes (respecting hardbindings > +and zone_reclaim_mode) in order to maximize the cache's effectiveness in > +reducing IO and to ensure predictable cache replacement. 
Special setups that > +require pagecache to adhere to the configured memory policy can change this > +behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt). > + Manual pages should also be updated. > MEMORY POLICY CONCEPTS > > Scope of Memory Policies > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 9b4dd491f7e8..f69e4cb78ccf 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -35,6 +35,7 @@ struct vm_area_struct; > #define ___GFP_NO_KSWAPD 0x400000u > #define ___GFP_OTHER_NODE 0x800000u > #define ___GFP_WRITE 0x1000000u > +#define ___GFP_PAGECACHE 0x2000000u > /* If the above are modified, __GFP_BITS_SHIFT may need updating */ > > /* > @@ -92,6 +93,7 @@ struct vm_area_struct; > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ > #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ > #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ > +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ > > /* > * This may seem redundant, but it's a way of annotating false positives vs. 
> @@ -99,7 +101,7 @@ struct vm_area_struct; > */ > #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) > > -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ > +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) > > /* This equals 0, but use constants in case they ever change */ > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index e3dea75a078b..bda48453af8e 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); > #else > static inline struct page *__page_cache_alloc(gfp_t gfp) > { > - return alloc_pages(gfp, 0); > + return alloc_pages(gfp | __GFP_PAGECACHE, 0); > } > #endif > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 46ba0c6c219f..3458994b0881 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages; > > #ifdef CONFIG_NUMA > extern int zone_reclaim_mode; > +extern int pagecache_mempolicy_mode; > extern int sysctl_min_unmapped_ratio; > extern int sysctl_min_slab_ratio; > extern int zone_reclaim(struct zone *, gfp_t, unsigned int); > #else > #define zone_reclaim_mode 0 > +#define pagecache_mempolicy_mode 0 > static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) > { > return 0; > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 34a604726d0b..a8c56c1dc98e 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = { > .extra1 = &zero, > }, > { > + .procname = "pagecache_mempolicy_mode", > + .data = &pagecache_mempolicy_mode, > + .maxlen = sizeof(pagecache_mempolicy_mode), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + .extra1 = &zero, > + }, > + { > .procname = "min_unmapped_ratio", > .data = &sysctl_min_unmapped_ratio, > .maxlen = sizeof(sysctl_min_unmapped_ratio), > diff --git 
a/mm/filemap.c b/mm/filemap.c > index b7749a92021c..5bb922506906 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp) > int n; > struct page *page; > > + gfp |= __GFP_PAGECACHE; > + > if (cpuset_do_page_mem_spread()) { > unsigned int cpuset_mems_cookie; > do { > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 580a5f075ed0..f7c0ecb5bb8b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1547,7 +1547,15 @@ again: > get_pageblock_migratetype(page)); > } > > + /* > + * All allocations eat into the round-robin batch, even > + * allocations that are not subject to round-robin placement > + * themselves. This makes sure that allocations that ARE > + * subject to round-robin placement compensate for the > + * allocations that aren't, to have equal placement overall. > + */ > __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); > + > __count_zone_vm_events(PGALLOC, zone, 1 << order); > zone_statistics(preferred_zone, zone, gfp_flags); > local_irq_restore(flags); Thanks. > @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > > #ifdef CONFIG_NUMA > /* > + * pagecache_mempolicy_mode - whether pagecache allocations should > + * honor the configured memory policy and allocate from the zonelist > + * in order of preference, or whether they should interleave fairly > + * over all allowed zones in the given zonelist to maximize cache > + * effects and ensure predictable cache replacement. > + */ > +int pagecache_mempolicy_mode __read_mostly; > + > +/* > * zlc_setup - Setup for "zonelist cache". Uses cached zone data to > * skip over zones that are not allowed by the cpuset, or that have > * been recently (in last second) found to be nearly full. 
See further > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > static bool zone_local(struct zone *local_zone, struct zone *zone) > { > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > + return local_zone->node == zone->node; > } Does that not break on !CONFIG_NUMA? It's why I used zone_to_nid > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > @@ -1908,22 +1925,25 @@ zonelist_scan: > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) > goto try_this_zone; > /* > - * Distribute pages in proportion to the individual > - * zone size to ensure fair page aging. The zone a > - * page was allocated in should have no effect on the > - * time the page has in memory before being reclaimed. > + * Distribute pagecache pages in proportion to the > + * individual zone size to ensure fair page aging. > + * The zone a page was allocated in should have no > + * effect on the time the page has in memory before > + * being reclaimed. > * > - * When zone_reclaim_mode is enabled, try to stay in > - * local zones in the fastpath. If that fails, the > + * When pagecache_mempolicy_mode or zone_reclaim_mode > + * is enabled, try to allocate from zones within the > + * preferred node in the fastpath. If that fails, the > * slowpath is entered, which will do another pass > * starting with the local zones, but ultimately fall > * back to remote zones that do not partake in the > * fairness round-robin cycle of this zonelist. > */ > - if (alloc_flags & ALLOC_WMARK_LOW) { > + if ((alloc_flags & ALLOC_WMARK_LOW) && > + (gfp_mask & __GFP_PAGECACHE)) { > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > continue; NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in some cases. Looks a bit fishy even though I can't quite put my finger on it. I also got details wrong here in the v3 of the series. 
In an unreleased v4 of the series I had corrected the treatment of slab pages in line with your wishes and reused the broken out helper in prepare_slowpath to keep the decision in sync. It's still in development but even if it gets rejected it'll act as a comparison point to yours. > - if (zone_reclaim_mode && > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && > !zone_local(preferred_zone, zone)) > continue; > } Documentation says "enabling pagecache_mempolicy_mode, in which case page cache allocations will be placed according to the configured memory policy". Should that be !pagecache_mempolicy_mode? I'm getting confused with the double nots. Breaking this out would be more comprehensible. On a semi-related note, we might encounter a problem later where the interleaving causes us to skip over usable zones and zones with available batches are !zone_dirty_ok. We'd fall back to the slowpath resetting the batches so it will not be particularly visible but there might be some interactions there. > @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, > * thrash fairness information for zones that are not > * actually part of this zonelist's round-robin cycle. 
> */
> - if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
> + if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> + !zone_local(preferred_zone, zone))
> continue;
> mod_zone_page_state(zone, NR_ALLOC_BATCH,
> high_wmark_pages(zone) -
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8297623fcaed..02d7a9c03463 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
> return page;
> }
>
> +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
> +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
> +{
> + /* If no internal shm_mount then it must be tmpfs */
> + if (IS_ERR(shm_mnt))
> + return true;
> +
> + /* Consider it to be tmpfs if the superblock is not the internal mount */
> + return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
> +}
> +
> static struct page *shmem_alloc_page(gfp_t gfp,
> struct shmem_inode_info *info, pgoff_t index)
> {
> @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
> pvma.vm_ops = NULL;
> pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
>
> + if (shmem_inode_on_tmpfs(info))
> + gfp |= __GFP_PAGECACHE;
> +
> page = alloc_page_vma(gfp, &pvma, 0);
>
> /* Drop reference taken by mpol_shared_policy_lookup() */

For what it's worth, this is what I've currently kicked off tests for

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12

--
Mel Gorman
SUSE Labs

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-bk0-f47.google.com (mail-bk0-f47.google.com [209.85.214.47]) by kanga.kvack.org (Postfix) with ESMTP id 01AC16B0035 for ; Wed, 18 Dec 2013 10:18:56 -0500 (EST) Received: by mail-bk0-f47.google.com with SMTP id mx12so249852bkb.6 for ; Wed, 18 Dec 2013 07:18:56 -0800 (PST) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id lv5si293290bkb.202.2013.12.18.07.18.55 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 07:18:56 -0800 (PST) Date: Wed, 18 Dec 2013 10:18:46 -0500 From: Johannes Weiner Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218151846.GM21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218145111.GA27510@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote: > On Tue 17-12-13 15:02:10, Johannes Weiner wrote: > [...] > > +pagecache_mempolicy_mode: > > + > > +This is available only on NUMA kernels. > > + > > +Per default, the configured memory policy is applicable to anonymous > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > > +interleaving fashion over all allowed nodes (hardbindings and > > +zone_reclaim_mode excluded). > > + > > +The assumption is that, when it comes to pagecache, users generally > > +prefer predictable replacement behavior regardless of NUMA topology > > +and maximizing the cache's effectiveness in reducing IO over memory > > +locality. 
> > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing > semantically? The setting is per-cpuset rather than global which makes > it harder to use but essentially it tries to distribute page cache pages > across all the nodes. > > This is really getting confusing. We have zone_reclaim_mode to keep > memory local in general, pagecache_mempolicy_mode to keep page cache > local and PF_SPREAD_PAGE to spread the page cache around nodes. zone_reclaim_mode is a global setting to go through great lengths to stay on local nodes, intended to be used depending on the hardware, not the workload. Mempolicy on the other hand is to optimize placement for maximum locality depending on access patterns of a workload or even just the subset of a workload. I'm trying to change whether this applies to page cache (due to different locality / cache effectiveness tradeoff) and we want to provide pagecache_mempolicy_mode to revert in the field in case this is a mistake. PF_SPREAD_PAGE becomes implied per default and should eventually be removed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f46.google.com (mail-ee0-f46.google.com [74.125.83.46]) by kanga.kvack.org (Postfix) with ESMTP id 7A1346B0035 for ; Wed, 18 Dec 2013 11:09:40 -0500 (EST) Received: by mail-ee0-f46.google.com with SMTP id d49so3635659eek.19 for ; Wed, 18 Dec 2013 08:09:39 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id w6si618496eeg.153.2013.12.18.08.09.39 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 08:09:39 -0800 (PST)
Date: Wed, 18 Dec 2013 16:09:36 +0000
From: Mel Gorman
Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
Message-ID: <20131218160936.GX11295@suse.de>
References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <20131218150038.GP11295@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20131218150038.GP11295@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote:
>
> For what it's worth, this is what I've currently kicked off tests for
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12
>

Pushed a dirty tree by accident. Now mm-pgalloc-interleave-zones-v4r13

--
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ea0-f175.google.com (mail-ea0-f175.google.com [209.85.215.175]) by kanga.kvack.org (Postfix) with ESMTP id E8C7E6B0035 for ; Wed, 18 Dec 2013 11:20:53 -0500 (EST)
Received: by mail-ea0-f175.google.com with SMTP id z10so3639458ead.6 for ; Wed, 18 Dec 2013 08:20:53 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id v6si642656eel.196.2013.12.18.08.20.52 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 08:20:52 -0800 (PST) Date: Wed, 18 Dec 2013 17:20:50 +0100 From: Michal Hocko Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218162050.GB27510@dhcp22.suse.cz> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> <20131218151846.GM21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218151846.GM21724@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Wed 18-12-13 10:18:46, Johannes Weiner wrote: > On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote: > > On Tue 17-12-13 15:02:10, Johannes Weiner wrote: > > [...] > > > +pagecache_mempolicy_mode: > > > + > > > +This is available only on NUMA kernels. > > > + > > > +Per default, the configured memory policy is applicable to anonymous > > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > > > +interleaving fashion over all allowed nodes (hardbindings and > > > +zone_reclaim_mode excluded). > > > + > > > +The assumption is that, when it comes to pagecache, users generally > > > +prefer predictable replacement behavior regardless of NUMA topology > > > +and maximizing the cache's effectiveness in reducing IO over memory > > > +locality. > > > > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing > > semantically? The setting is per-cpuset rather than global which makes > > it harder to use but essentially it tries to distribute page cache pages > > across all the nodes. > > > > This is really getting confusing. 
We have zone_reclaim_mode to keep > > memory local in general, pagecache_mempolicy_mode to keep page cache > > local and PF_SPREAD_PAGE to spread the page cache around nodes. > > zone_reclaim_mode is a global setting to go through great lengths to > stay on local nodes, intended to be used depending on the hardware, > not the workload. > > Mempolicy on the other hand is to optimize placement for maximum > locality depending on access patterns of a workload or even just the > subset of a workload. I'm trying to change whether this applies to > page cache (due to different locality / cache effectiveness tradeoff) > and we want to provide pagecache_mempolicy_mode to revert in the field > in case this is a mistake. > > PF_SPREAD_PAGE becomes implied per default and should eventually be > removed. I guess many loads do not care about page cache locality and the default spreading would be OK for them but what about those that do care? Currently we have a per-process (cpuset in fact) flag but this will change it to all or nothing. Is this really a good step? Btw. I do not mind having PF_SPREAD_PAGE enabled by default. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-bk0-f42.google.com (mail-bk0-f42.google.com [209.85.214.42]) by kanga.kvack.org (Postfix) with ESMTP id 99F4D6B0035 for ; Wed, 18 Dec 2013 14:23:54 -0500 (EST) Received: by mail-bk0-f42.google.com with SMTP id w11so371173bkz.1 for ; Wed, 18 Dec 2013 11:23:53 -0800 (PST) Received: from zene.cmpxchg.org (zene.cmpxchg.org. 
[2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id lu3si527706bkb.214.2013.12.18.11.23.53 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 11:23:53 -0800 (PST) Date: Wed, 18 Dec 2013 14:20:15 -0500 From: Johannes Weiner Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218192015.GA20038@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> <20131218151846.GM21724@cmpxchg.org> <20131218162050.GB27510@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218162050.GB27510@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote: > On Wed 18-12-13 10:18:46, Johannes Weiner wrote: > > On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote: > > > On Tue 17-12-13 15:02:10, Johannes Weiner wrote: > > > [...] > > > > +pagecache_mempolicy_mode: > > > > + > > > > +This is available only on NUMA kernels. > > > > + > > > > +Per default, the configured memory policy is applicable to anonymous > > > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > > > > +interleaving fashion over all allowed nodes (hardbindings and > > > > +zone_reclaim_mode excluded). > > > > + > > > > +The assumption is that, when it comes to pagecache, users generally > > > > +prefer predictable replacement behavior regardless of NUMA topology > > > > +and maximizing the cache's effectiveness in reducing IO over memory > > > > +locality. > > > > > > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing > > > semantically? 
The setting is per-cpuset rather than global which makes
> > > it harder to use but essentially it tries to distribute page cache pages
> > > across all the nodes.
> > >
> > > This is really getting confusing. We have zone_reclaim_mode to keep
> > > memory local in general, pagecache_mempolicy_mode to keep page cache
> > > local and PF_SPREAD_PAGE to spread the page cache around nodes.

You are right that the user interface we are exposing is kind of cruddy and I'm less and less convinced that this is the right direction.

> > zone_reclaim_mode is a global setting to go through great lengths to
> > stay on local nodes, intended to be used depending on the hardware,
> > not the workload.
> >
> > Mempolicy on the other hand is to optimize placement for maximum
> > locality depending on access patterns of a workload or even just the
> > subset of a workload. I'm trying to change whether this applies to
> > page cache (due to different locality / cache effectiveness tradeoff)
> > and we want to provide pagecache_mempolicy_mode to revert in the field
> > in case this is a mistake.
> >
> > PF_SPREAD_PAGE becomes implied per default and should eventually be
> > removed.
>
> I guess many loads do not care about page cache locality and the default
> spreading would be OK for them but what about those that do care?

Mel suggested that the page cache spreading be implemented as just another memory policy and I rejected it on the grounds that we can have strange aging artifacts if it's not the default.

But you are right that there might be usecases that really have high cache locality and don't incur any reclaim. The aging artifacts are non-existent to them but they would care about the NUMA locality. And basically, the same aging artifacts apply to anon e.g., just that the trade-off balance is different, as reclaim is much less common. And we do offer interleaving for anon as well.

So the situation is not all that different than I had myself convinced it would be...
So the more I'm thinking about it, the more I'm leaning towards making it a mempolicy after all, provided that we can set a sane default. Maybe we can make the new default a hybrid policy that keeps anon, shmem, slab, kernel, etc. local but interleaves pagecache. This should make sense to most usecases while providing the ability for custom placement policies per-process or per-VMA without having to make the decision on a global level or through an unusual interface. > Currently we have a per-process (cpuset in fact) flag but this will > change it to all or nothing. Is this really a good step? > Btw. I do not mind having PF_SPREAD_PAGE enabled by default. I don't want to muck around with cpusets too much, tbh... but I agree that the behavior of PF_SPREAD_PAGE should be the default. Except it should honor zone_reclaim_mode and round-robin nodes that are within RECLAIM_DISTANCE of the local one. I will have spotty access to internet starting tomorrow night until New Year's. Is there a chance we can maybe revert the NUMA aspects of the original patch for now and leave it as a node-local zone fairness thing? The NUMA behavior was so broken on 3.12 that I doubt that people have come to rely on the cache fairness on such machines in that one release. So we should be able to release 3.12-stable and 3.13 with node-local zone fairness without regressing anybody, and then give the NUMA aspect of it another try in 3.14. Something like the following should restore NUMA behavior while still fixing the kswapd vs. page allocator interaction bug of thrashing on the highest zone. 
PS: zone_local() is in a CONFIG_NUMA block, which is why accessing zone->node is safe :-) --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dd886fac451a..317ea747d2cd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return local_zone->node == zone->node; } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) @@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, * page was allocated in should have no effect on the * time the page has in memory before being reclaimed. * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. If that fails, the - * slowpath is entered, which will do another pass - * starting with the local zones, but ultimately fall - * back to remote zones that do not partake in the - * fairness round-robin cycle of this zonelist. + * Try to stay in local zones in the fastpath. If + * that fails, the slowpath is entered, which will do + * another pass starting with the local zones, but + * ultimately fall back to remote zones that do not + * partake in the fairness round-robin cycle of this + * zonelist. */ if (alloc_flags & ALLOC_WMARK_LOW) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; - if (zone_reclaim_mode && - !zone_local(preferred_zone, zone)) + if (!zone_local(preferred_zone, zone)) continue; } /* @@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, * thrash fairness information for zones that are not * actually part of this zonelist's round-robin cycle. 
*/ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + if (!zone_local(preferred_zone, zone)) continue; mod_zone_page_state(zone, NR_ALLOC_BATCH, high_wmark_pages(zone) - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-bk0-f54.google.com (mail-bk0-f54.google.com [209.85.214.54]) by kanga.kvack.org (Postfix) with ESMTP id 0A05C6B0036 for ; Wed, 18 Dec 2013 14:51:48 -0500 (EST) Received: by mail-bk0-f54.google.com with SMTP id v16so379014bkz.13 for ; Wed, 18 Dec 2013 11:51:48 -0800 (PST) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id ul10si554696bkb.173.2013.12.18.11.51.47 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 18 Dec 2013 11:51:47 -0800 (PST) Date: Wed, 18 Dec 2013 14:48:13 -0500 From: Johannes Weiner Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218194813.GB20038@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <20131218150038.GP11295@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218150038.GP11295@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote: > On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote: > > On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote: > > > Hi Mel, > > > > > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > > > > This series is currently untested and is being posted to sync up discussions > > > > on the treatment of page cache pages, particularly 
the sysv part. I have > > > > not thought it through in detail but postings patches is the easiest way > > > > to highlight where I think a problem might be. > > > > > > > > Changelog since v2 > > > > o Drop an accounting patch, behaviour is deliberate > > > > o Special case tmpfs and shmem pages for discussion > > > > > > > > Changelog since v1 > > > > o Fix lot of brain damage in the configurable policy patch > > > > o Yoink a page cache annotation patch > > > > o Only account batch pages against allocations eligible for the fair policy > > > > o Add patch that default distributes file pages on remote nodes > > > > > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a > > > > bug whereby new pages could be reclaimed before old pages because of how > > > > the page allocator and kswapd interacted on the per-zone LRU lists. > > > > > > Not just that, it was about ensuring predictable cache replacement and > > > maximizing the cache's effectiveness. This implicitely fixed the > > > kswapd interaction bug, but that was not the sole reason (I realize > > > that the original changelog is incomplete and I apologize for that). > > > > > > I have had offline discussions with Andrea back then and his first > > > suggestion was too to make this a zone fairness placement that is > > > exclusive to the local node, but eventually he agreed that the problem > > > applies just as much on the global level and that we should apply > > > fairness throughout the system as long as we honor zone_reclaim_mode > > > and hard bindings. During our discussions now, it turned out that > > > zone_reclaim_mode is a terrible predictor for preferred locality, but > > > we also more or less agreed that the locality issues in the first > > > place are not really applicable to cache loads dominated by IO cost. 
> > > > > > So I think the main discrepancy between the original patch and what we > > > truly want is that aging fairness is really only relevant for actual > > > cache backed by secondary storage, because cache replacement is an > > > ongoing operation that involves IO. As opposed to memory types that > > > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all > > > (slab, kernel allocations), in which case we prefer NUMA locality. > > > > > > > Unfortunately a side-effect missed during review was that it's now very > > > > easy to allocate remote memory on NUMA machines. The problem is that > > > > it is not a simple case of just restoring local allocation policies as > > > > there are genuine reasons why global page aging may be prefereable. It's > > > > still a major change to default behaviour so this patch makes the policy > > > > configurable and sets what I think is a sensible default. > > > > > > > > The patches are on top of some NUMA balancing patches currently in -mm. > > > > It's untested and posted to discuss patches 4 and 6. > > > > > > It might be easier in dealing with -stable if we start with the > > > critical fix(es) to restore sane functionality as much and as compact > > > as possible and then place the cleanups on top? > > > > > > In my local tree, I have the following as the first patch: > > > > Updated version with your tmpfs __GFP_PAGECACHE parts added and > > documentation, changelog updated as necessary. I remain unconvinced > > that tmpfs pages should be round-robined, but I agree with you that it > > is the conservative change to do for 3.12 and 3.12 and we can figure > > out the rest later. > > Assume you with 3.12 and 3.13 here. Yes :) > > --- > > From: Johannes Weiner > > Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache > > > > Monolithic patch with multiple changes but meh. I'm not pushed because I > know what the breakout looks like. 
FWIW, I had intended the entire of my > broken-out series for 3.12 and 3.13 once it got ironed out. I find the > series easier to understand but of course I would. And of course I can live without the cleanups to make code I wrote more readable ;-) I'm happy to defer on this, let's keep logical changes separated. > > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged > > in order to ensure predictable pagecache replacement and to maximize > > the cache's effectiveness of reducing IO regardless of zone or node > > topology. > > > > However, it was overzealous in round-robin placing every type of > > allocation over all allowable nodes, instead of preferring locality, > > which resulted in severe regressions on certain NUMA workloads that > > have nothing to do with pagecache. > > > > This patch drastically reduces the impact of the original change by > > having the round-robin placement policy only apply to pagecache > > allocations and no longer to anonymous memory, shmem, slab and other > > types of kernel allocations. > > > > This still changes the long-standing behavior of pagecache adhering to > > the configured memory policy and preferring local allocations per > > default, so make it configurable in case somebody relies on it. > > However, we also expect the majority of users to prefer maximium cache > > effectiveness and a predictable replacement behavior over memory > > locality, so reflect this in the default setting of the sysctl. 
> > > > No-signoff-without-Mel's > > Cc: # 3.12 > > --- > > Documentation/sysctl/vm.txt | 20 ++++++++++++++++ > > Documentation/vm/numa_memory_policy.txt | 7 ++++++ > > include/linux/gfp.h | 4 +++- > > include/linux/pagemap.h | 2 +- > > include/linux/swap.h | 2 ++ > > kernel/sysctl.c | 8 +++++++ > > mm/filemap.c | 2 ++ > > mm/page_alloc.c | 41 +++++++++++++++++++++++++-------- > > mm/shmem.c | 14 +++++++++++ > > 9 files changed, 88 insertions(+), 12 deletions(-) > > > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > > index 1fbd4eb7b64a..308c342f62ad 100644 > > --- a/Documentation/sysctl/vm.txt > > +++ b/Documentation/sysctl/vm.txt > > @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: > > - memory_failure_early_kill > > - memory_failure_recovery > > - min_free_kbytes > > +- pagecache_mempolicy_mode > > - min_slab_ratio > > - min_unmapped_ratio > > - mmap_min_addr > > Sure about the name? > > This is a boolean and "mode" implies it might be a bitmask. That said, I > recognise that my own naming also sucked because complaining about yours > I can see that mine also sucks. Is it because of how we use zone_reclaim_mode? I don't see anything wrong with a "mode" toggle that switches between only two modes of operation instead of three or more. But English being a second language and all... > > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > > > static bool zone_local(struct zone *local_zone, struct zone *zone) > > { > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > > + return local_zone->node == zone->node; > > } > > Does that not break on !CONFIG_NUMA? > > It's why I used zone_to_nid There is a separate definition for !CONFIG_NUMA, it fit nicely next to the zlc stuff. 
> > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > > @@ -1908,22 +1925,25 @@ zonelist_scan: > > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) > > goto try_this_zone; > > /* > > - * Distribute pages in proportion to the individual > > - * zone size to ensure fair page aging. The zone a > > - * page was allocated in should have no effect on the > > - * time the page has in memory before being reclaimed. > > + * Distribute pagecache pages in proportion to the > > + * individual zone size to ensure fair page aging. > > + * The zone a page was allocated in should have no > > + * effect on the time the page has in memory before > > + * being reclaimed. > > * > > - * When zone_reclaim_mode is enabled, try to stay in > > - * local zones in the fastpath. If that fails, the > > + * When pagecache_mempolicy_mode or zone_reclaim_mode > > + * is enabled, try to allocate from zones within the > > + * preferred node in the fastpath. If that fails, the > > * slowpath is entered, which will do another pass > > * starting with the local zones, but ultimately fall > > * back to remote zones that do not partake in the > > * fairness round-robin cycle of this zonelist. > > */ > > - if (alloc_flags & ALLOC_WMARK_LOW) { > > + if ((alloc_flags & ALLOC_WMARK_LOW) && > > + (gfp_mask & __GFP_PAGECACHE)) { > > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > > continue; > > NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or > pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in > some cases. Looks a bit fishy even though I can't quite put my finger on it. > > I also got details wrong here in the v3 of the series. In an unreleased > v4 of the series I had corrected the treatment of slab pages in line > with your wishes and reused the broken out helper in prepare_slowpath to > keep the decision in sync. > > It's still in development but even if it gets rejected it'll act as a > comparison point to yours. 
> > > - if (zone_reclaim_mode && > > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && > > !zone_local(preferred_zone, zone)) > > continue; > > } > > Documention says "enabling pagecache_mempolicy_mode, in which case page cache > allocations will be placed according to the configured memory policy". Should > that be !pagecache_mempolicy_mode? I'm getting confused with the double nots. Yes, it's a bit weird. We want to consider the round-robin batches for local zones but at the same time avoid exhausted batches from pushing the allocation off-node when either of those modes are enabled. So in the fastpath we filter for both and in the slowpath, once kswapd has been woken at the same time that the batches have been reset to launch the new aging cycle, we try in order of zonelist preference. However, to answer your question above, if the slowpath still has to fall back to a remote zone, we don't want to reset its batch because we didn't verify it was actually exhausted in the fastpath and we could risk cutting short the aging cycle for that particular zone. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f50.google.com (mail-ee0-f50.google.com [74.125.83.50]) by kanga.kvack.org (Postfix) with ESMTP id AD9646B0031 for ; Thu, 19 Dec 2013 06:20:56 -0500 (EST) Received: by mail-ee0-f50.google.com with SMTP id c41so406038eek.9 for ; Thu, 19 Dec 2013 03:20:56 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id u49si3853119eep.127.2013.12.19.03.20.55 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 19 Dec 2013 03:20:55 -0800 (PST) Date: Thu, 19 Dec 2013 11:20:51 +0000 From: Mel Gorman Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131219112051.GH11295@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <20131218150038.GP11295@suse.de> <20131218194813.GB20038@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131218194813.GB20038@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Wed, Dec 18, 2013 at 02:48:13PM -0500, Johannes Weiner wrote: > > > > > > Sure about the name? > > > > This is a boolean and "mode" implies it might be a bitmask. That said, I > > recognise that my own naming also sucked because complaining about yours > > I can see that mine also sucks. > > Is it because of how we use zone_reclaim_mode? I don't see anything > wrong with a "mode" toggle that switches between only two modes of > operation instead of three or more. But English being a second > language and all... > It's not just zone_reclaim_mode. Most references to mode in the VM (but not all because who needs consistency) refer to either a mask or multiple potential values. isolate_mode_t, gfp masks referred to as mode, memory policies described as mode, migration modes etc. Intuitively, I expect "mode" to not be a binary value. 
> > > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > > > > > static bool zone_local(struct zone *local_zone, struct zone *zone) > > > { > > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > > > + return local_zone->node == zone->node; > > > } > > > > Does that not break on !CONFIG_NUMA? > > > > It's why I used zone_to_nid > > There is a separate definition for !CONFIG_NUMA, it fit nicely next to > the zlc stuff. > Ah, fair enough. > > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > > > @@ -1908,22 +1925,25 @@ zonelist_scan: > > > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) > > > goto try_this_zone; > > > /* > > > - * Distribute pages in proportion to the individual > > > - * zone size to ensure fair page aging. The zone a > > > - * page was allocated in should have no effect on the > > > - * time the page has in memory before being reclaimed. > > > + * Distribute pagecache pages in proportion to the > > > + * individual zone size to ensure fair page aging. > > > + * The zone a page was allocated in should have no > > > + * effect on the time the page has in memory before > > > + * being reclaimed. > > > * > > > - * When zone_reclaim_mode is enabled, try to stay in > > > - * local zones in the fastpath. If that fails, the > > > + * When pagecache_mempolicy_mode or zone_reclaim_mode > > > + * is enabled, try to allocate from zones within the > > > + * preferred node in the fastpath. If that fails, the > > > * slowpath is entered, which will do another pass > > > * starting with the local zones, but ultimately fall > > > * back to remote zones that do not partake in the > > > * fairness round-robin cycle of this zonelist. 
> > > */ > > > - if (alloc_flags & ALLOC_WMARK_LOW) { > > > + if ((alloc_flags & ALLOC_WMARK_LOW) && > > > + (gfp_mask & __GFP_PAGECACHE)) { > > > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > > > continue; > > > > NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or > > pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in > > some cases. Looks a bit fishy even though I can't quite put my finger on it. > > > > I also got details wrong here in the v3 of the series. In an unreleased > > v4 of the series I had corrected the treatment of slab pages in line > > with your wishes and reused the broken out helper in prepare_slowpath to > > keep the decision in sync. > > > > It's still in development but even if it gets rejected it'll act as a > > comparison point to yours. > > > > > - if (zone_reclaim_mode && > > > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && > > > !zone_local(preferred_zone, zone)) > > > continue; > > > } > > > > Documention says "enabling pagecache_mempolicy_mode, in which case page cache > > allocations will be placed according to the configured memory policy". Should > > that be !pagecache_mempolicy_mode? I'm getting confused with the double nots. > > Yes, it's a bit weird. > > We want to consider the round-robin batches for local zones but at the > same time avoid exhausted batches from pushing the allocation off-node > when either of those modes are enabled. So in the fastpath we filter > for both and in the slowpath, once kswapd has been woken at the same > time that the batches have been reset to launch the new aging cycle, > we try in order of zonelist preference. > > However, to answer your question above, if the slowpath still has to > fall back to a remote zone, we don't want to reset its batch because > we didn't verify it was actually exhausted in the fastpath and we > could risk cutting short the aging cycle for that particular zone. Understood, thanks. 
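The batch handling discussed above can be sketched as a small userspace model. This is only an illustration and not the kernel code: `struct zone_model`, `fastpath_skip()` and `reset_local_batches()` are hypothetical names, and the refill value (high watermark minus low watermark) is an assumption about what the slowpath reset restores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few struct zone fields that matter here. */
struct zone_model {
	int node;          /* NUMA node the zone belongs to */
	long alloc_batch;  /* remaining round-robin batch (models NR_ALLOC_BATCH) */
	long high_wmark;   /* high watermark, in pages */
	long low_wmark;    /* low watermark, in pages */
};

/*
 * Fastpath filter: returns true if the zone should be skipped, either
 * because its batch is exhausted or because it is not on the preferred node.
 */
static bool fastpath_skip(const struct zone_model *zone, int preferred_node)
{
	if (zone->alloc_batch <= 0)
		return true;
	return zone->node != preferred_node;
}

/*
 * Slowpath preparation: refill the batches of zones on the preferred node
 * only. Remote zones are left alone because the fastpath never checked
 * whether their batch was really exhausted, so resetting them could cut
 * their aging cycle short.
 */
static void reset_local_batches(struct zone_model *zones, size_t nr,
				int preferred_node)
{
	assert(zones != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (zones[i].node != preferred_node)
			continue;
		/* Assumed refill value: the gap between the two watermarks. */
		zones[i].alloc_batch = zones[i].high_wmark - zones[i].low_wmark;
	}
}
```

In this model a remote zone keeps whatever batch it had when the slowpath is entered, which is the property the reply above is concerned with.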
-- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f45.google.com (mail-ee0-f45.google.com [74.125.83.45]) by kanga.kvack.org (Postfix) with ESMTP id 37BD66B0037 for ; Thu, 19 Dec 2013 07:59:24 -0500 (EST) Received: by mail-ee0-f45.google.com with SMTP id d49so461314eek.4 for ; Thu, 19 Dec 2013 04:59:23 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id v6si4240350eel.91.2013.12.19.04.59.22 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 19 Dec 2013 04:59:22 -0800 (PST) Date: Thu, 19 Dec 2013 13:59:21 +0100 From: Michal Hocko Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131219125921.GF10855@dhcp22.suse.cz> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> <20131218151846.GM21724@cmpxchg.org> <20131218162050.GB27510@dhcp22.suse.cz> <20131218192015.GA20038@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218192015.GA20038@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML On Wed 18-12-13 14:20:15, Johannes Weiner wrote: > On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote: [...] > > Currently we have a per-process (cpuset in fact) flag but this will > > change it to all or nothing. Is this really a good step? > > Btw. I do not mind having PF_SPREAD_PAGE enabled by default. > > I don't want to muck around with cpusets too much, tbh... but I agree > that the behavior of PF_SPREAD_PAGE should be the default. 
Except it > should honor zone_reclaim_mode and round-robin nodes that are within > RECLAIM_DISTANCE of the local one. Agreed. > I will have spotty access to internet starting tomorrow night until > New Year's. Is there a chance we can maybe revert the NUMA aspects of > the original patch for now and leave it as a node-local zone fairness > thing? Yes, that sounds perfectly reasonable to me. > The NUMA behavior was so broken on 3.12 that I doubt that > people have come to rely on the cache fairness on such machines in > that one release. So we should be able to release 3.12-stable and > 3.13 with node-local zone fairness without regressing anybody, and > then give the NUMA aspect of it another try in 3.14. > > Something like the following should restore NUMA behavior while still > fixing the kswapd vs. page allocator interaction bug of thrashing on > the highest zone. Yes, it looks good to me. I guess zone_local could have stayed as it was because it shouldn't be a big deal to fall back to a different node if the distance is LOCAL, but taking a conservative approach is not harmful. > PS: zone_local() is in a CONFIG_NUMA block, which > is why accessing zone->node is safe :-) > > --- > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index dd886fac451a..317ea747d2cd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > static bool zone_local(struct zone *local_zone, struct zone *zone) > { > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > + return local_zone->node == zone->node; > } > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > @@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, > * page was allocated in should have no effect on the > * time the page has in memory before being reclaimed.
> * > - * When zone_reclaim_mode is enabled, try to stay in > - * local zones in the fastpath. If that fails, the > - * slowpath is entered, which will do another pass > - * starting with the local zones, but ultimately fall > - * back to remote zones that do not partake in the > - * fairness round-robin cycle of this zonelist. > + * Try to stay in local zones in the fastpath. If > + * that fails, the slowpath is entered, which will do > + * another pass starting with the local zones, but > + * ultimately fall back to remote zones that do not > + * partake in the fairness round-robin cycle of this > + * zonelist. > */ > if (alloc_flags & ALLOC_WMARK_LOW) { > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > continue; > - if (zone_reclaim_mode && > - !zone_local(preferred_zone, zone)) > + if (!zone_local(preferred_zone, zone)) > continue; > } > /* > @@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, > * thrash fairness information for zones that are not > * actually part of this zonelist's round-robin cycle. > */ > - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) > + if (!zone_local(preferred_zone, zone)) > continue; > mod_zone_page_state(zone, NR_ALLOC_BATCH, > high_wmark_pages(zone) - > > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754880Ab3LQQse (ORCPT ); Tue, 17 Dec 2013 11:48:34 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60500 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753185Ab3LQQs2 (ORCPT ); Tue, 17 Dec 2013 11:48:28 -0500 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper Date: Tue, 17 Dec 2013 16:48:20 +0000 Message-Id: <1387298904-8824-3-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch moves the decision on whether to round-robin allocations between zones and nodes into its own helper functions. It'll make some later patches easier to understand and it will be automatically inlined.
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Acked-by: Johannes Weiner --- mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 42 insertions(+), 21 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f861d02..64020eb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1872,6 +1872,42 @@ static inline void init_zone_allows_reclaim(int nid) #endif /* CONFIG_NUMA */ /* + * Distribute pages in proportion to the individual zone size to ensure fair + * page aging. The zone a page was allocated in should have no effect on the + * time the page has in memory before being reclaimed. + * + * Returns true if this zone should be skipped to spread the page ages to + * other zones. + */ +static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone, + struct zone *zone, int alloc_flags) +{ + /* Only round robin in the allocator fast path */ + if (!(alloc_flags & ALLOC_WMARK_LOW)) + return false; + + /* Only round robin pages likely to be LRU or reclaimable slab */ + if (!(gfp_mask & GFP_MOVABLE_MASK)) + return false; + + /* Distribute to the next zone if this zone has exhausted its batch */ + if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) + return true; + + /* + * When zone_reclaim_mode is enabled, try to stay in local zones in the + * fastpath. If that fails, the slowpath is entered, which will do + * another pass starting with the local zones, but ultimately fall back + * back to remote zones that do not partake in the fairness round-robin + * cycle of this zonelist. + */ + if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + return true; + + return false; +} + +/* * get_page_from_freelist goes through the zonelist trying to allocate * a page. 
*/ @@ -1907,27 +1943,12 @@ zonelist_scan: BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; - /* - * Distribute pages in proportion to the individual - * zone size to ensure fair page aging. The zone a - * page was allocated in should have no effect on the - * time the page has in memory before being reclaimed. - * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. If that fails, the - * slowpath is entered, which will do another pass - * starting with the local zones, but ultimately fall - * back to remote zones that do not partake in the - * fairness round-robin cycle of this zonelist. - */ - if ((alloc_flags & ALLOC_WMARK_LOW) && - (gfp_mask & GFP_MOVABLE_MASK)) { - if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) - continue; - if (zone_reclaim_mode && - !zone_local(preferred_zone, zone)) - continue; - } + + /* Distribute pages to ensure fair page aging */ + if (zone_distribute_age(gfp_mask, preferred_zone, zone, + alloc_flags)) + continue; + /* * When allocating a page cache page for writing, we * want to get it from a zone that is within its dirty -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754825Ab3LQQsd (ORCPT ); Tue, 17 Dec 2013 11:48:33 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60500 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754616Ab3LQQs3 (ORCPT ); Tue, 17 Dec 2013 11:48:29 -0500 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 4/6] mm: Annotate page cache allocations Date: Tue, 17 Dec 2013 16:48:22 +0000 Message-Id: <1387298904-8824-5-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The fair zone allocation policy needs to distinguish between anonymous, slab and file-backed pages. This patch annotates many of the page cache allocations by adjusting __page_cache_alloc. This does not guarantee that all page cache allocations are being properly annotated. One case for special consideration is shmem. sysv shared memory and MAP_SHARED anonymous pages are backed by this and they should be treated as anon by the fair allocation policy. It is also used by tmpfs which arguably should be treated as file by the fair allocation policy. The primary top-level shmem allocation function is shmem_getpage_gfp which ultimately uses alloc_pages_vma() and not __page_cache_alloc. This is correct for sysv and MAP_SHARED but tmpfs is still treated as anonymous. This patch special cases shmem to annotate tmpfs allocations as files for the fair zone allocation policy. NOTE: At time of writing it has not been double checked that it annotates the different shmem request types. Furthermore, this patch was originally based on a patch from Johannes and does not have his signed-off-by.
Without his signed-off, I cannot sign it off Cannot-sign-off-without-Johannes --- include/linux/gfp.h | 4 +++- include/linux/pagemap.h | 2 +- mm/filemap.c | 2 ++ mm/shmem.c | 14 ++++++++++++++ 4 files changed, 20 insertions(+), 2 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 9b4dd49..f69e4cb 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x400000u #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u +#define ___GFP_PAGECACHE 0x2000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -92,6 +93,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ /* * This may seem redundant, but it's a way of annotating false positives vs. 
@@ -99,7 +101,7 @@ struct vm_area_struct; */ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75..bda4845 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); #else static inline struct page *__page_cache_alloc(gfp_t gfp) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp | __GFP_PAGECACHE, 0); } #endif diff --git a/mm/filemap.c b/mm/filemap.c index b7749a9..5bb9225 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp) int n; struct page *page; + gfp |= __GFP_PAGECACHE; + if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { diff --git a/mm/shmem.c b/mm/shmem.c index 8297623..02d7a9c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, return page; } +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */ +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info) +{ + /* If no internal shm_mount then it must be tmpfs */ + if (IS_ERR(shm_mnt)) + return true; + + /* Consider it to be tmpfs if the superblock is not the internal mount */ + return info->vfs_inode.i_sb != shm_mnt->mnt_sb; +} + static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp, pvma.vm_ops = NULL; pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); + if (shmem_inode_on_tmpfs(info)) + gfp |= __GFP_PAGECACHE; + page = alloc_page_vma(gfp, &pvma, 0); /* Drop reference taken by mpol_shared_policy_lookup() 
*/ -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754956Ab3LQQtA (ORCPT ); Tue, 17 Dec 2013 11:49:00 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60521 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754742Ab3LQQsb (ORCPT ); Tue, 17 Dec 2013 11:48:31 -0500 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache Date: Tue, 17 Dec 2013 16:48:24 +0000 Message-Id: <1387298904-8824-7-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch introduces a vm.pagecache_interleave sysctl that allows the administrator to alter the default memory allocation policy for file-backed pages. It removes a more configurable interface that is expected to be too complex to expose to users and gives an unnecessary level of control. By default it is disabled but there is strong evidence that users on NUMA machines will want to enable this. The default is expected to change once the documentation is in sync. Ideally it would also be possible to control on a per-process basis by allowing processes to select either an MPOL_LOCAL or MPOL_INTERLEAVE_PAGECACHE memory policy as memory policies are the traditional way for controlling allocation behaviour.
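As a rough userspace illustration of how the boolean sysctl maps onto the internal distribution flags, consider the sketch below. The numeric values are assumed to match the bits listed in the zone_distribute_mode documentation (1, 2, 4, 8, 16, 32), and `distribute_mode_for()` is a hypothetical name for the part of the sysctl handler that recomputes the mask.

```c
#include <assert.h>

/* Assumed to match the bit values documented for zone_distribute_mode. */
#define DISTRIBUTE_LOCAL_ANON    1u
#define DISTRIBUTE_LOCAL_FILE    2u
#define DISTRIBUTE_LOCAL_SLAB    4u
#define DISTRIBUTE_REMOTE_ANON   8u
#define DISTRIBUTE_REMOTE_FILE  16u
#define DISTRIBUTE_REMOTE_SLAB  32u

/*
 * Hypothetical helper: derive the internal distribution mask from the
 * boolean pagecache_interleave setting. Local distribution is always on;
 * only file pages may additionally be spread across remote nodes.
 */
static unsigned int distribute_mode_for(unsigned int pagecache_interleave)
{
	unsigned int mode = DISTRIBUTE_LOCAL_ANON |
			    DISTRIBUTE_LOCAL_FILE |
			    DISTRIBUTE_LOCAL_SLAB;

	if (pagecache_interleave)
		mode |= DISTRIBUTE_REMOTE_FILE;

	/* Anon and slab pages are never distributed to remote nodes here. */
	assert(!(mode & (DISTRIBUTE_REMOTE_ANON | DISTRIBUTE_REMOTE_SLAB)));
	return mode;
}
```

Any non-zero setting is treated as "enabled" in this model, which matches a plain truth test on the sysctl value.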
Signed-off-by: Mel Gorman --- Documentation/sysctl/vm.txt | 61 +++++++++++++++++++++------------------------ include/linux/mmzone.h | 2 +- include/linux/swap.h | 2 +- kernel/sysctl.c | 8 +++--- mm/page_alloc.c | 18 +++++-------- 5 files changed, 41 insertions(+), 50 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 8eaa562..655ed0a 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -49,6 +49,7 @@ Currently, these files are in /proc/sys/vm: - oom_kill_allocating_task - overcommit_memory - overcommit_ratio +- pagecache_interleave - page-cluster - panic_on_oom - percpu_pagelist_fraction @@ -56,7 +57,6 @@ Currently, these files are in /proc/sys/vm: - swappiness - user_reserve_kbytes - vfs_cache_pressure -- zone_distribute_mode - zone_reclaim_mode ============================================================== @@ -608,6 +608,34 @@ of physical RAM. See above. ============================================================== +pagecache_interleave: + +This setting is only relevant to NUMA machines. + +Historically, the default behaviour of the system is to allocate memory +local to the process. The behaviour is usually modified through the use +of memory policies while zone_reclaim_mode controls how strict the local +memory allocation policy is. + +Issues arise when the allocating process is frequently running on the same +node. The kernels memory reclaim daemon runs one instance per NUMA node. +A consequence is that relatively new memory may be reclaimed by kswapd when +the allocating process is running on a specific node. The user-visible +impact is that the system appears to do more IO than necessary when a +workload is accessing files that are larger than a given NUMA node. + +One way of addressing this is to use the interleave memory policy but that +is not always possible. + +Another option is to enable this setting. 
When enabled, the default +memory allocation changes from MPOL_LOCAL to interleaving file-backed +pages by default. The downside is that some file accesses will now be +to remote memory even though the local node had available resources. +The upside is that workloads working on files larger than a NUMA node +will not reclaim active pages prematurely. + +============================================================== + page-cluster page-cluster controls the number of pages up to which consecutive pages @@ -725,37 +753,6 @@ causes the kernel to prefer to reclaim dentries and inodes. ============================================================== -zone_distribute_mode - -Pages allocation and reclaim are managed on a per-zone basis. When the -system needs to reclaim memory, candidate pages are selected from these -per-zone lists. Historically, a potential consequence was that recently -allocated pages were considered reclaim candidates. From a zone-local -perspective, page aging was preserved but from a system-wide perspective -there was an age inversion problem. - -A similar problem occurs on a node level where young pages may be reclaimed -from the local node instead of allocating remote memory. Unforuntately, the -cost of accessing remote nodes is higher so the system must choose by default -between favouring page aging or node locality. zone_distribute_mode controls -how the system will distribute page ages between zones. - -0 = Never round-robin based on age - -Otherwise the values are ORed together - -1 = Distribute anon pages between zones local to the allocating node -2 = Distribute file pages between zones local to the allocating node -4 = Distribute slab pages between zones local to the allocating node - -The following three flags effectively alter MPOL_DEFAULT, be careful. 
- -8 = Distribute anon pages between zones remote to the allocating node -16 = Distribute file pages between zones remote to the allocating node -32 = Distribute slab pages between zones remote to the allocating node - -============================================================== - zone_reclaim_mode: Zone_reclaim_mode allows someone to set more or less aggressive approaches to diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 20a75e3..2fb9e2d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -897,7 +897,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); -int sysctl_zone_distribute_mode_handler(struct ctl_table *, int, +int sysctl_zone_pagecache_interleave_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); extern int numa_zonelist_order_handler(struct ctl_table *, int, diff --git a/include/linux/swap.h b/include/linux/swap.h index 44329b0..2b522cf 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -318,7 +318,7 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; -extern unsigned __bitwise__ zone_distribute_mode; +extern unsigned int zone_pagecache_interleave; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index b75c08f..385d7cb 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1350,11 +1350,11 @@ static struct ctl_table vm_table[] = { }, #endif { - .procname = "zone_distribute_mode", - .data = &zone_distribute_mode, - .maxlen = sizeof(zone_distribute_mode), + .procname = "pagecache_interleave", + .data = &zone_pagecache_interleave, + .maxlen = sizeof(zone_pagecache_interleave), .mode = 0644, - .proc_handler = sysctl_zone_distribute_mode_handler, + .proc_handler = 
sysctl_zone_pagecache_interleave_handler, .extra1 = &zero, }, #ifdef CONFIG_NUMA diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c2a2229..b6c8e63 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1872,7 +1872,8 @@ static inline void init_zone_allows_reclaim(int nid) #endif /* CONFIG_NUMA */ /* Controls how page ages are distributed across zones automatically */ -unsigned __bitwise__ zone_distribute_mode __read_mostly; +static unsigned __bitwise__ zone_distribute_mode __read_mostly; +unsigned int zone_pagecache_interleave; /* See zone_distribute_mode docmentation in Documentation/sysctl/vm.txt */ #define DISTRIBUTE_DISABLE (0) @@ -1891,7 +1892,7 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly; /* Only these GFP flags are affected by the fair zone allocation policy */ #define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE)) -int sysctl_zone_distribute_mode_handler(ctl_table *table, int write, +int sysctl_zone_pagecache_interleave_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { int rc; @@ -1900,16 +1901,9 @@ int sysctl_zone_distribute_mode_handler(ctl_table *table, int write, if (rc) return rc; - /* If you are an admin reading this comment, what were you thinking? 
*/ - if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) == - DISTRIBUTE_STUPID_ANON)) - zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON; - if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) == - DISTRIBUTE_STUPID_FILE)) - zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE; - if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) == - DISTRIBUTE_STUPID_SLAB)) - zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB; + zone_distribute_mode = DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB; + if (zone_pagecache_interleave) + zone_distribute_mode |= DISTRIBUTE_REMOTE_FILE; return 0; } -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754736Ab3LQQsb (ORCPT ); Tue, 17 Dec 2013 11:48:31 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60506 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754612Ab3LQQs2 (ORCPT ); Tue, 17 Dec 2013 11:48:28 -0500 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality Date: Tue, 17 Dec 2013 16:48:21 +0000 Message-Id: <1387298904-8824-4-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org zone_local is using node_distance which is a more expensive call than necessary. On x86, it's another function call in the allocator fast path and increases cache footprint. This patch makes the assumption zones on a local node will share the same node ID. The necessary information should already be cache hot. 
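The trade-off can be seen in a small userspace comparison of the two locality checks. This is a sketch, not kernel code: the three-node distance table is invented for the example, `struct zone_model` is a hypothetical reduction of struct zone, and the node-ID check is written against an explicit local zone rather than the current CPU's node.

```c
#include <assert.h>
#include <stdbool.h>

#define LOCAL_DISTANCE 10
#define NR_NODES 3

/* Invented SLIT-style distance table for a three-node machine. */
static const int node_distance_table[NR_NODES][NR_NODES] = {
	{ 10, 20, 20 },
	{ 20, 10, 20 },
	{ 20, 20, 10 },
};

/* Hypothetical reduction of struct zone to its node ID. */
struct zone_model {
	int node;
};

/* Distance-based check: needs a table lookup (a function call on x86). */
static bool zone_local_by_distance(const struct zone_model *local_zone,
				   const struct zone_model *zone)
{
	return node_distance_table[local_zone->node][zone->node] ==
	       LOCAL_DISTANCE;
}

/* Node-ID check: a single integer comparison on data already at hand. */
static bool zone_local_by_nid(const struct zone_model *local_zone,
			      const struct zone_model *zone)
{
	return local_zone->node == zone->node;
}

/* Returns true if both checks agree for every pair of nodes in the table. */
static bool locality_checks_agree(void)
{
	for (int i = 0; i < NR_NODES; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			struct zone_model a = { i };
			struct zone_model b = { j };

			if (zone_local_by_distance(&a, &b) !=
			    zone_local_by_nid(&a, &b))
				return false;
		}
	}
	return true;
}
```

The two checks agree for this table because only the diagonal holds LOCAL_DISTANCE. They would diverge on a machine whose table reports LOCAL_DISTANCE between two different nodes, which is the case the node-ID approximation gives up.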
Signed-off-by: Mel Gorman Acked-by: Rik van Riel --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 64020eb..fd9677e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return zone_to_nid(zone) == numa_node_id(); } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755019Ab3LQQte (ORCPT ); Tue, 17 Dec 2013 11:49:34 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60506 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752609Ab3LQQsa (ORCPT ); Tue, 17 Dec 2013 11:48:30 -0500 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable Date: Tue, 17 Dec 2013 16:48:23 +0000 Message-Id: <1387298904-8824-6-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a bug whereby new pages could be reclaimed before old pages because of how the page allocator and kswapd interacted on the per-zone LRU lists. Unfortunately it was missed during review that a consequence is that we also round-robin between NUMA nodes. This is bad for two reasons 1. It alters the semantics of MPOL_LOCAL without telling anyone 2. 
It incurs an immediate remote memory performance hit in exchange for a potential performance gain when memory needs to be reclaimed later. No cookies for the reviewers on this one. This patch makes the behaviour of the fair zone allocator policy configurable. By default it will only distribute pages that are going to exist on the LRU between zones local to the allocating process. This preserves the historical semantics of MPOL_LOCAL. By default, slab pages are not distributed between zones after this patch is applied. It can be argued that they should get similar treatment but they have different lifecycles to LRU pages, the shrinkers are not zone-aware and the interaction between the page allocator and kswapd is different for slabs. If it turns out to be an almost universal win, we can change the default. Signed-off-by: Mel Gorman --- Documentation/sysctl/vm.txt | 32 ++++++++++++++ include/linux/mmzone.h | 2 + include/linux/swap.h | 2 + kernel/sysctl.c | 8 ++++ mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------ 5 files changed, 134 insertions(+), 12 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 1fbd4eb..8eaa562 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm: - swappiness - user_reserve_kbytes - vfs_cache_pressure +- zone_distribute_mode - zone_reclaim_mode ============================================================== @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes. ============================================================== +zone_distribute_mode + +Page allocation and reclaim are managed on a per-zone basis. When the +system needs to reclaim memory, candidate pages are selected from these +per-zone lists. Historically, a potential consequence was that recently +allocated pages were considered reclaim candidates.
From a zone-local +perspective, page aging was preserved but from a system-wide perspective +there was an age inversion problem. + +A similar problem occurs on a node level where young pages may be reclaimed +from the local node instead of allocating remote memory. Unfortunately, the +cost of accessing remote nodes is higher so the system must choose by default +between favouring page aging or node locality. zone_distribute_mode controls +how the system will distribute page ages between zones. + +0 = Never round-robin based on age + +Otherwise the values are ORed together + +1 = Distribute anon pages between zones local to the allocating node +2 = Distribute file pages between zones local to the allocating node +4 = Distribute slab pages between zones local to the allocating node + +The following three flags effectively alter MPOL_DEFAULT, be careful. + +8 = Distribute anon pages between zones remote to the allocating node +16 = Distribute file pages between zones remote to the allocating node +32 = Distribute slab pages between zones remote to the allocating node + +============================================================== + zone_reclaim_mode: Zone_reclaim_mode allows someone to set more or less aggressive approaches to diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b835d3f..20a75e3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -897,6 +897,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int sysctl_zone_distribute_mode_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); extern int numa_zonelist_order_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6..44329b0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -318,6 +318,8 @@ extern int
vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; +extern unsigned __bitwise__ zone_distribute_mode; + #ifdef CONFIG_NUMA extern int zone_reclaim_mode; extern int sysctl_min_unmapped_ratio; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a6047..b75c08f 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1349,6 +1349,14 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, #endif + { + .procname = "zone_distribute_mode", + .data = &zone_distribute_mode, + .maxlen = sizeof(zone_distribute_mode), + .mode = 0644, + .proc_handler = sysctl_zone_distribute_mode_handler, + .extra1 = &zero, + }, #ifdef CONFIG_NUMA { .procname = "zone_reclaim_mode", diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fd9677e..c2a2229 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1871,6 +1871,49 @@ static inline void init_zone_allows_reclaim(int nid) } #endif /* CONFIG_NUMA */ +/* Controls how page ages are distributed across zones automatically */ +unsigned __bitwise__ zone_distribute_mode __read_mostly; + +/* See zone_distribute_mode documentation in Documentation/sysctl/vm.txt */ +#define DISTRIBUTE_DISABLE (0) +#define DISTRIBUTE_LOCAL_ANON (1UL << 0) +#define DISTRIBUTE_LOCAL_FILE (1UL << 1) +#define DISTRIBUTE_LOCAL_SLAB (1UL << 2) +#define DISTRIBUTE_REMOTE_ANON (1UL << 3) +#define DISTRIBUTE_REMOTE_FILE (1UL << 4) +#define DISTRIBUTE_REMOTE_SLAB (1UL << 5) + +#define DISTRIBUTE_STUPID_ANON (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON) +#define DISTRIBUTE_STUPID_FILE (DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE) +#define DISTRIBUTE_STUPID_SLAB (DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB) +#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB) + +/* Only these GFP flags are affected by the fair zone allocation policy */ +#define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE)) + +int sysctl_zone_distribute_mode_handler(ctl_table
*table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + /* If you are an admin reading this comment, what were you thinking? */ + if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) == + DISTRIBUTE_STUPID_ANON)) + zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON; + if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) == + DISTRIBUTE_STUPID_FILE)) + zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE; + if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) == + DISTRIBUTE_STUPID_SLAB)) + zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB; + + return 0; +} + /* * Distribute pages in proportion to the individual zone size to ensure fair * page aging. The zone a page was allocated in should have no effect on the @@ -1882,26 +1925,60 @@ static inline void init_zone_allows_reclaim(int nid) static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone, struct zone *zone, int alloc_flags) { + bool zone_is_local; + bool is_file, is_slab, is_anon; + /* Only round robin in the allocator fast path */ if (!(alloc_flags & ALLOC_WMARK_LOW)) return false; - /* Only round robin pages likely to be LRU or reclaimable slab */ - if (!(gfp_mask & GFP_MOVABLE_MASK)) + /* Only a subset of GFP flags are considered for fair zone policy */ + if (!(gfp_mask & DISTRIBUTE_GFP_MASK)) return false; - /* Distribute to the next zone if this zone has exhausted its batch */ - if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) - return true; - /* - * When zone_reclaim_mode is enabled, try to stay in local zones in the - * fastpath. If that fails, the slowpath is entered, which will do - * another pass starting with the local zones, but ultimately fall back - * back to remote zones that do not partake in the fairness round-robin - * cycle of this zonelist. + * Classify the type of allocation. 
From this point on, the fair zone + * allocation policy is being applied. If the allocation does not meet + * the criteria the zone must be skipped. */ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + is_file = gfp_mask & __GFP_PAGECACHE; + is_slab = gfp_mask & __GFP_RECLAIMABLE; + is_anon = (!is_file && !is_slab); + WARN_ON_ONCE(is_slab && is_file); + + zone_is_local = zone_local(preferred_zone, zone); + if (zone_is_local) { + /* Distribute between zones local to the node if requested */ + if (is_anon && (zone_distribute_mode & DISTRIBUTE_LOCAL_ANON)) + goto check_batch; + if (is_file && (zone_distribute_mode & DISTRIBUTE_LOCAL_FILE)) + goto check_batch; + if (is_slab && (zone_distribute_mode & DISTRIBUTE_LOCAL_SLAB)) + goto check_batch; + } else { + /* + * When zone_reclaim_mode is enabled, stick to local zones. If + * that fails, the slowpath is entered, which will do another + * pass starting with the local zones, but ultimately fall back + * to remote zones that do not partake in the fairness + * round-robin cycle of this zonelist.
+ */ + if (zone_reclaim_mode) + return false; + + if (is_anon && (zone_distribute_mode & DISTRIBUTE_REMOTE_ANON)) + goto check_batch; + if (is_file && (zone_distribute_mode & DISTRIBUTE_REMOTE_FILE)) + goto check_batch; + if (is_slab && (zone_distribute_mode & DISTRIBUTE_REMOTE_SLAB)) + goto check_batch; + } + + return true; + +check_batch: + /* Distribute to the next zone if this zone has exhausted its batch */ + if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) return true; return false; @@ -3797,6 +3874,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) __build_all_zonelists(NULL); mminit_verify_zonelist(); cpuset_init_current_mems_allowed(); + zone_distribute_mode = DISTRIBUTE_DEFAULT; } else { #ifdef CONFIG_MEMORY_HOTPLUG if (zone) -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755115Ab3LQQtw (ORCPT ); Tue, 17 Dec 2013 11:49:52 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60493 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754194Ab3LQQs1 (ORCPT ); Tue, 17 Dec 2013 11:48:27 -0500 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML , Mel Gorman Subject: [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Date: Tue, 17 Dec 2013 16:48:19 +0000 Message-Id: <1387298904-8824-2-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Johannes Weiner Dave Hansen noted a regression in a microbenchmark that loops around open() and close() on an 8-node NUMA machine and bisected it down to 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").
That change forces the slab allocations of the file descriptor to spread out to all 8 nodes, causing remote references in the page allocator and slab. The round-robin policy is only there to provide fairness among memory allocations that are reclaimed involuntarily based on pressure in each zone. It does not make sense to apply it to unreclaimable kernel allocations that are freed manually, in this case instantly after the allocation, and incur the remote reference costs twice for no reason. Only round-robin allocations that are usually freed through page reclaim or slab shrinking. Cc: Bisected-by: Dave Hansen Signed-off-by: Johannes Weiner Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- mm/page_alloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 580a5f0..f861d02 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1920,7 +1920,8 @@ zonelist_scan: * back to remote zones that do not partake in the * fairness round-robin cycle of this zonelist. 
*/ - if (alloc_flags & ALLOC_WMARK_LOW) { + if ((alloc_flags & ALLOC_WMARK_LOW) && + (gfp_mask & GFP_MOVABLE_MASK)) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; if (zone_reclaim_mode && -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752848Ab3LQUCX (ORCPT ); Tue, 17 Dec 2013 15:02:23 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:50279 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752451Ab3LQUCV (ORCPT ); Tue, 17 Dec 2013 15:02:21 -0500 Date: Tue, 17 Dec 2013 15:02:10 -0500 From: Johannes Weiner To: Mel Gorman Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131217200210.GG21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1387298904-8824-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > This series is currently untested and is being posted to sync up discussions > on the treatment of page cache pages, particularly the sysv part. I have > not thought it through in detail but postings patches is the easiest way > to highlight where I think a problem might be. 
> > Changelog since v2 > o Drop an accounting patch, behaviour is deliberate > o Special case tmpfs and shmem pages for discussion > > Changelog since v1 > o Fix lot of brain damage in the configurable policy patch > o Yoink a page cache annotation patch > o Only account batch pages against allocations eligible for the fair policy > o Add patch that default distributes file pages on remote nodes > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a > bug whereby new pages could be reclaimed before old pages because of how > the page allocator and kswapd interacted on the per-zone LRU lists. Not just that, it was about ensuring predictable cache replacement and maximizing the cache's effectiveness. This implicitly fixed the kswapd interaction bug, but that was not the sole reason (I realize that the original changelog is incomplete and I apologize for that). I had offline discussions with Andrea back then and his first suggestion was also to make this a zone fairness placement that is exclusive to the local node, but eventually he agreed that the problem applies just as much on the global level and that we should apply fairness throughout the system as long as we honor zone_reclaim_mode and hard bindings. During our discussions now, it turned out that zone_reclaim_mode is a terrible predictor for preferred locality, but we also more or less agreed that the locality issues in the first place are not really applicable to cache loads dominated by IO cost. So I think the main discrepancy between the original patch and what we truly want is that aging fairness is really only relevant for actual cache backed by secondary storage, because cache replacement is an ongoing operation that involves IO. As opposed to memory types that involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all (slab, kernel allocations), in which case we prefer NUMA locality.
> Unfortunately a side-effect missed during review was that it's now very > easy to allocate remote memory on NUMA machines. The problem is that > it is not a simple case of just restoring local allocation policies as > there are genuine reasons why global page aging may be prefereable. It's > still a major change to default behaviour so this patch makes the policy > configurable and sets what I think is a sensible default. > > The patches are on top of some NUMA balancing patches currently in -mm. > It's untested and posted to discuss patches 4 and 6. It might be easier in dealing with -stable if we start with the critical fix(es) to restore sane functionality as much and as compactly as possible and then place the cleanups on top? In my local tree, I have the following as the first patch: --- From: Johannes Weiner Subject: [patch] mm: page_alloc: restrict fair allocator policy to page cache 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged in order to ensure predictable page cache replacement and to maximize the cache's effectiveness of reducing IO regardless of zone or node topology. However, it was overzealous in round-robin placing every type of allocation over all allowable nodes, instead of preferring locality, which resulted in severe regressions on certain NUMA workloads that have nothing to do with page cache. This patch drastically reduces the impact of the original change by having the round-robin placement policy only apply to page cache backed by secondary storage, and no longer to anonymous memory, shmem, tmpfs, slab allocations. This still changes the long-standing behavior of page cache adhering to the configured memory policy and preferring local allocations per default, so make it configurable in case somebody relies on it. However, we also expect the majority of users to prefer maximum cache effectiveness and a predictable replacement behavior over memory locality, so reflect this in the default setting of the sysctl.
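A minimal userspace sketch of the fast-path decision this changelog describes, modelled on the zonelist_scan hunk of the patch. It is not kernel code: the flag values, `struct zone_state`, `struct policy` and `fair_policy_skips_zone()` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

/* Illustrative flag values; the real kernel constants differ. */
#define ALLOC_WMARK_LOW_MODEL 0x1u
#define GFP_PAGECACHE_MODEL   0x2u

/* Hypothetical stand-in for the per-zone state the check looks at. */
struct zone_state {
	int node;          /* node the zone belongs to */
	long alloc_batch;  /* remaining round-robin batch (NR_ALLOC_BATCH) */
};

/* Hypothetical container for the two sysctls involved. */
struct policy {
	int zone_reclaim_mode;
	int pagecache_mempolicy_mode;
};

/*
 * Returns true when the allocator fast path should skip this zone and
 * move on to the next one in the zonelist.
 */
static bool fair_policy_skips_zone(const struct policy *p,
				   unsigned int alloc_flags,
				   unsigned int gfp_mask,
				   const struct zone_state *preferred,
				   const struct zone_state *zone)
{
	/* The fairness policy only applies in the fast path ... */
	if (!(alloc_flags & ALLOC_WMARK_LOW_MODEL))
		return false;
	/* ... and only to page cache allocations. */
	if (!(gfp_mask & GFP_PAGECACHE_MODEL))
		return false;
	/* This zone has used up its share of the round-robin cycle. */
	if (zone->alloc_batch <= 0)
		return true;
	/* Either sysctl keeps page cache on the preferred node. */
	if ((p->zone_reclaim_mode || p->pagecache_mempolicy_mode) &&
	    preferred->node != zone->node)
		return true;
	return false;
}
```

With both sysctls at 0 a remote zone with batch remaining is eligible, so page cache interleaves across nodes. Setting pagecache_mempolicy_mode makes the same remote zone get skipped, and allocations without the page cache flag are never affected by the check.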
--- Documentation/sysctl/vm.txt | 21 +++++++++++++++++ Documentation/vm/numa_memory_policy.txt | 8 +++++++ include/linux/gfp.h | 4 +++- include/linux/pagemap.h | 2 +- include/linux/swap.h | 2 ++ kernel/sysctl.c | 8 +++++++ mm/filemap.c | 2 ++ mm/page_alloc.c | 41 +++++++++++++++++++++++++-------- 8 files changed, 76 insertions(+), 12 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 1fbd4eb7b64a..50d250f7470f 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: - memory_failure_early_kill - memory_failure_recovery - min_free_kbytes +- pagecache_mempolicy_mode - min_slab_ratio - min_unmapped_ratio - mmap_min_addr @@ -404,6 +405,26 @@ Setting this too high will OOM your machine instantly. ============================================================= +pagecache_mempolicy_mode: + +This is available only on NUMA kernels. + +Per default, the configured memory policy is applicable to anonymous +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an +interleaving fashion over all allowed nodes (hardbindings and +zone_reclaim_mode excluded). + +The assumption is that, when it comes to pagecache, users generally +prefer predictable replacement behavior regardless of NUMA topology +and maximizing the cache's effectiveness in reducing IO over memory +locality. + +This behavior can be changed by enabling pagecache_mempolicy_mode, in +which case page cache allocations will be placed according to the +configured memory policy (Documentation/vm/numa_memory_policy.txt). + +============================================================= + min_slab_ratio: This is available only on NUMA kernels. 
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 4e7da6543424..64d48b6378db 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt @@ -16,6 +16,14 @@ programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. +Note that, per default, the memory policies as described below apply to process +memory and shmem/tmpfs/ramfs only. Pagecache backed by secondary storage will +be interleaved fairly over all allowable nodes (respecting hardbindings and +zone_reclaim_mode) in order to maximize the cache's effectiveness in reducing IO +and to ensure predictable cache replacement. Special setups that require +pagecache to adhere to the configured memory policy can change this behavior by +enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt). + MEMORY POLICY CONCEPTS Scope of Memory Policies diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 9b4dd491f7e8..f69e4cb78ccf 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x400000u #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u +#define ___GFP_PAGECACHE 0x2000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -92,6 +93,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ /* * This may seem redundant, but it's a way of annotating false positives vs. 
@@ -99,7 +101,7 @@ struct vm_area_struct; */ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75a078b..bda48453af8e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); #else static inline struct page *__page_cache_alloc(gfp_t gfp) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp | __GFP_PAGECACHE, 0); } #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6c219f..3458994b0881 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; +extern int pagecache_mempolicy_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 +#define pagecache_mempolicy_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a604726d0b..a8c56c1dc98e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, { + .procname = "pagecache_mempolicy_mode", + .data = &pagecache_mempolicy_mode, + .maxlen = sizeof(pagecache_mempolicy_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + }, + { .procname = "min_unmapped_ratio", .data = &sysctl_min_unmapped_ratio, .maxlen = sizeof(sysctl_min_unmapped_ratio), diff --git a/mm/filemap.c b/mm/filemap.c index b7749a92021c..5bb922506906 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -517,6 +517,8 
@@ struct page *__page_cache_alloc(gfp_t gfp) int n; struct page *page; + gfp |= __GFP_PAGECACHE; + if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 580a5f075ed0..b28370932950 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1547,7 +1547,15 @@ again: get_pageblock_migratetype(page)); } + /* + * All allocations eat into the round-robin batch, even + * allocations that are not subject to round-robin placement + * themselves. This makes sure that allocations that ARE + * subject to round-robin placement compensate for the + * allocations that aren't, to have equal placement overall. + */ __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); + __count_zone_vm_events(PGALLOC, zone, 1 << order); zone_statistics(preferred_zone, zone, gfp_flags); local_irq_restore(flags); @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, #ifdef CONFIG_NUMA /* + * pagecache_mempolicy_mode - whether page cache should honor the + * configured memory policy and allocate from the zonelist in order of + * preference, or whether it should be interleaved fairly over all + * allowed zones in the given zonelist to maximize cache effects and + * ensure predictable cache replacement. + */ +int pagecache_mempolicy_mode __read_mostly; + +/* * zlc_setup - Setup for "zonelist cache". Uses cached zone data to * skip over zones that are not allowed by the cpuset, or that have * been recently (in last second) found to be nearly full. 
See further @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return local_zone->node == zone->node; } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) @@ -1908,22 +1925,25 @@ zonelist_scan: if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; /* - * Distribute pages in proportion to the individual - * zone size to ensure fair page aging. The zone a - * page was allocated in should have no effect on the - * time the page has in memory before being reclaimed. + * Distribute page cache pages in proportion to the + * individual zone size to ensure fair page aging. + * The zone a page was allocated in should have no + * effect on the time the page has in memory before + * being reclaimed. * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. If that fails, the + * When pagecache_mempolicy_mode or zone_reclaim_mode + * is enabled, try to allocate from zones within the + * preferred node in the fastpath. If that fails, the * slowpath is entered, which will do another pass * starting with the local zones, but ultimately fall * back to remote zones that do not partake in the * fairness round-robin cycle of this zonelist. */ - if (alloc_flags & ALLOC_WMARK_LOW) { + if ((alloc_flags & ALLOC_WMARK_LOW) && + (gfp_mask & __GFP_PAGECACHE)) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; - if (zone_reclaim_mode && + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && !zone_local(preferred_zone, zone)) continue; } @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, * thrash fairness information for zones that are not * actually part of this zonelist's round-robin cycle. 
*/ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && + !zone_local(preferred_zone, zone)) continue; mod_zone_page_state(zone, NR_ALLOC_BATCH, high_wmark_pages(zone) - -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752790Ab3LRGSM (ORCPT ); Wed, 18 Dec 2013 01:18:12 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:50324 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752296Ab3LRGSB (ORCPT ); Wed, 18 Dec 2013 01:18:01 -0500 Date: Wed, 18 Dec 2013 01:17:50 -0500 From: Johannes Weiner To: Mel Gorman Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218061750.GK21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131217200210.GG21724@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote: > Hi Mel, > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > > This series is currently untested and is being posted to sync up discussions > > on the treatment of page cache pages, particularly the sysv part. I have > > not thought it through in detail but postings patches is the easiest way > > to highlight where I think a problem might be. 
> > > > Changelog since v2 > > o Drop an accounting patch, behaviour is deliberate > > o Special case tmpfs and shmem pages for discussion > > > > Changelog since v1 > > o Fix lot of brain damage in the configurable policy patch > > o Yoink a page cache annotation patch > > o Only account batch pages against allocations eligible for the fair policy > > o Add patch that default distributes file pages on remote nodes > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a > > bug whereby new pages could be reclaimed before old pages because of how > > the page allocator and kswapd interacted on the per-zone LRU lists. > > Not just that, it was about ensuring predictable cache replacement and > maximizing the cache's effectiveness. This implicitly fixed the > kswapd interaction bug, but that was not the sole reason (I realize > that the original changelog is incomplete and I apologize for that). > > I had offline discussions with Andrea back then and his first > suggestion was also to make this a zone fairness placement that is > exclusive to the local node, but eventually he agreed that the problem > applies just as much on the global level and that we should apply > fairness throughout the system as long as we honor zone_reclaim_mode > and hard bindings. During our discussions now, it turned out that > zone_reclaim_mode is a terrible predictor for preferred locality, but > we also more or less agreed that the locality issues in the first > place are not really applicable to cache loads dominated by IO cost. > > So I think the main discrepancy between the original patch and what we > truly want is that aging fairness is really only relevant for actual > cache backed by secondary storage, because cache replacement is an > ongoing operation that involves IO. As opposed to memory types that > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all > (slab, kernel allocations), in which case we prefer NUMA locality.
> > > Unfortunately a side-effect missed during review was that it's now very > > easy to allocate remote memory on NUMA machines. The problem is that > > it is not a simple case of just restoring local allocation policies as > > there are genuine reasons why global page aging may be prefereable. It's > > still a major change to default behaviour so this patch makes the policy > > configurable and sets what I think is a sensible default. > > > > The patches are on top of some NUMA balancing patches currently in -mm. > > It's untested and posted to discuss patches 4 and 6. > > It might be easier in dealing with -stable if we start with the > critical fix(es) to restore sane functionality as much and as compact > as possible and then place the cleanups on top? > > In my local tree, I have the following as the first patch: Updated version with your tmpfs __GFP_PAGECACHE parts added and documentation, changelog updated as necessary. I remain unconvinced that tmpfs pages should be round-robined, but I agree with you that it is the conservative change to do for 3.12 and 3.12 and we can figure out the rest later. I sure hope that this doesn't drive most people on NUMA to disable pagecache interleaving right away as I expect most tmpfs workloads to see little to no reclaim and prefer locality... :/ --- From: Johannes Weiner Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged in order to ensure predictable pagecache replacement and to maximize the cache's effectiveness of reducing IO regardless of zone or node topology. However, it was overzealous in round-robin placing every type of allocation over all allowable nodes, instead of preferring locality, which resulted in severe regressions on certain NUMA workloads that have nothing to do with pagecache. 
This patch drastically reduces the impact of the original change by having the round-robin placement policy only apply to pagecache allocations and no longer to anonymous memory, shmem, slab and other types of kernel allocations. This still changes the long-standing behavior of pagecache adhering to the configured memory policy and preferring local allocations per default, so make it configurable in case somebody relies on it. However, we also expect the majority of users to prefer maximum cache effectiveness and a predictable replacement behavior over memory locality, so reflect this in the default setting of the sysctl. No-signoff-without-Mel's Cc: # 3.12 --- Documentation/sysctl/vm.txt | 20 ++++++++++++++++ Documentation/vm/numa_memory_policy.txt | 7 ++++++ include/linux/gfp.h | 4 +++- include/linux/pagemap.h | 2 +- include/linux/swap.h | 2 ++ kernel/sysctl.c | 8 +++++++ mm/filemap.c | 2 ++ mm/page_alloc.c | 41 +++++++++++++++++++++++++-------- mm/shmem.c | 14 +++++++++++ 9 files changed, 88 insertions(+), 12 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 1fbd4eb7b64a..308c342f62ad 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: - memory_failure_early_kill - memory_failure_recovery - min_free_kbytes +- pagecache_mempolicy_mode - min_slab_ratio - min_unmapped_ratio - mmap_min_addr @@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly. ============================================================= +pagecache_mempolicy_mode: + +This is available only on NUMA kernels. + +Per default, pagecache is allocated in an interleaving fashion over +all allowed nodes (hardbindings and zone_reclaim_mode excluded), +regardless of the selected memory policy.
+ +The assumption is that, when it comes to pagecache, users generally +prefer predictable replacement behavior regardless of NUMA topology +and maximizing the cache's effectiveness in reducing IO over memory +locality. + +This behavior can be changed by enabling pagecache_mempolicy_mode, in +which case page cache allocations will be placed according to the +configured memory policy (Documentation/vm/numa_memory_policy.txt). + +============================================================= + min_slab_ratio: This is available only on NUMA kernels. diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 4e7da6543424..72247e565908 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt @@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. +Note that, per default, the memory policies do not apply to pagecache. Instead +it will be interleaved fairly over all allowable nodes (respecting hardbindings +and zone_reclaim_mode) in order to maximize the cache's effectiveness in +reducing IO and to ensure predictable cache replacement. Special setups that +require pagecache to adhere to the configured memory policy can change this +behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt). 
+ MEMORY POLICY CONCEPTS Scope of Memory Policies diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 9b4dd491f7e8..f69e4cb78ccf 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x400000u #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u +#define ___GFP_PAGECACHE 0x2000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -92,6 +93,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ /* * This may seem redundant, but it's a way of annotating false positives vs. @@ -99,7 +101,7 @@ struct vm_area_struct; */ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e3dea75a078b..bda48453af8e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); #else static inline struct page *__page_cache_alloc(gfp_t gfp) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp | __GFP_PAGECACHE, 0); } #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6c219f..3458994b0881 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; +extern int pagecache_mempolicy_mode; extern int 
sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 +#define pagecache_mempolicy_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a604726d0b..a8c56c1dc98e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, { + .procname = "pagecache_mempolicy_mode", + .data = &pagecache_mempolicy_mode, + .maxlen = sizeof(pagecache_mempolicy_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + }, + { .procname = "min_unmapped_ratio", .data = &sysctl_min_unmapped_ratio, .maxlen = sizeof(sysctl_min_unmapped_ratio), diff --git a/mm/filemap.c b/mm/filemap.c index b7749a92021c..5bb922506906 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp) int n; struct page *page; + gfp |= __GFP_PAGECACHE; + if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 580a5f075ed0..f7c0ecb5bb8b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1547,7 +1547,15 @@ again: get_pageblock_migratetype(page)); } + /* + * All allocations eat into the round-robin batch, even + * allocations that are not subject to round-robin placement + * themselves. This makes sure that allocations that ARE + * subject to round-robin placement compensate for the + * allocations that aren't, to have equal placement overall. 
+ */ __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); + __count_zone_vm_events(PGALLOC, zone, 1 << order); zone_statistics(preferred_zone, zone, gfp_flags); local_irq_restore(flags); @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, #ifdef CONFIG_NUMA /* + * pagecache_mempolicy_mode - whether pagecache allocations should + * honor the configured memory policy and allocate from the zonelist + * in order of preference, or whether they should interleave fairly + * over all allowed zones in the given zonelist to maximize cache + * effects and ensure predictable cache replacement. + */ +int pagecache_mempolicy_mode __read_mostly; + +/* * zlc_setup - Setup for "zonelist cache". Uses cached zone data to * skip over zones that are not allowed by the cpuset, or that have * been recently (in last second) found to be nearly full. See further @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return local_zone->node == zone->node; } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) @@ -1908,22 +1925,25 @@ zonelist_scan: if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; /* - * Distribute pages in proportion to the individual - * zone size to ensure fair page aging. The zone a - * page was allocated in should have no effect on the - * time the page has in memory before being reclaimed. + * Distribute pagecache pages in proportion to the + * individual zone size to ensure fair page aging. + * The zone a page was allocated in should have no + * effect on the time the page has in memory before + * being reclaimed. * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. 
If that fails, the + * When pagecache_mempolicy_mode or zone_reclaim_mode + * is enabled, try to allocate from zones within the + * preferred node in the fastpath. If that fails, the * slowpath is entered, which will do another pass * starting with the local zones, but ultimately fall * back to remote zones that do not partake in the * fairness round-robin cycle of this zonelist. */ - if (alloc_flags & ALLOC_WMARK_LOW) { + if ((alloc_flags & ALLOC_WMARK_LOW) && + (gfp_mask & __GFP_PAGECACHE)) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; - if (zone_reclaim_mode && + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && !zone_local(preferred_zone, zone)) continue; } @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, * thrash fairness information for zones that are not * actually part of this zonelist's round-robin cycle. */ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && + !zone_local(preferred_zone, zone)) continue; mod_zone_page_state(zone, NR_ALLOC_BATCH, high_wmark_pages(zone) - diff --git a/mm/shmem.c b/mm/shmem.c index 8297623fcaed..02d7a9c03463 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, return page; } +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */ +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info) +{ + /* If no internal shm_mount then it must be tmpfs */ + if (IS_ERR(shm_mnt)) + return true; + + /* Consider it to be tmpfs if the superblock is not the internal mount */ + return info->vfs_inode.i_sb != shm_mnt->mnt_sb; +} + static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp, pvma.vm_ops = NULL; pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); + if (shmem_inode_on_tmpfs(info)) + gfp |= 
__GFP_PAGECACHE; + page = alloc_page_vma(gfp, &pvma, 0); /* Drop reference taken by mpol_shared_policy_lookup() */ -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754637Ab3LRNr5 (ORCPT ); Wed, 18 Dec 2013 08:47:57 -0500 Received: from mx1.redhat.com ([209.132.183.28]:42261 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753195Ab3LRNry (ORCPT ); Wed, 18 Dec 2013 08:47:54 -0500 Message-ID: <52B1A781.50002@redhat.com> Date: Wed, 18 Dec 2013 08:47:45 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Johannes Weiner CC: Mel Gorman , Andrew Morton , Dave Hansen , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> In-Reply-To: <20131218061750.GK21724@cmpxchg.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/18/2013 01:17 AM, Johannes Weiner wrote: > Updated version with your tmpfs __GFP_PAGECACHE parts added and > documentation, changelog updated as necessary. I remain unconvinced > that tmpfs pages should be round-robined, but I agree with you that it > is the conservative change to do for 3.12 and 3.12 and we can figure > out the rest later. I sure hope that this doesn't drive most people > on NUMA to disable pagecache interleaving right away as I expect most > tmpfs workloads to see little to no reclaim and prefer locality... :/ Actually, I suspect most tmpfs heavy workloads will be things like databases with shared memory segments. Those tend to benefit from having all of the system's memory bandwidth available. 
The worker threads/processes tend to live all over the system, too... -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754651Ab3LROSJ (ORCPT ); Wed, 18 Dec 2013 09:18:09 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:50346 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753705Ab3LROSI (ORCPT ); Wed, 18 Dec 2013 09:18:08 -0500 Date: Wed, 18 Dec 2013 09:17:58 -0500 From: Johannes Weiner To: Rik van Riel Cc: Mel Gorman , Andrew Morton , Dave Hansen , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218141758.GL21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <52B1A781.50002@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52B1A781.50002@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 18, 2013 at 08:47:45AM -0500, Rik van Riel wrote: > On 12/18/2013 01:17 AM, Johannes Weiner wrote: > > > Updated version with your tmpfs __GFP_PAGECACHE parts added and > > documentation, changelog updated as necessary. I remain unconvinced > > that tmpfs pages should be round-robined, but I agree with you that it > > is the conservative change to do for 3.12 and 3.12 and we can figure > > out the rest later. I sure hope that this doesn't drive most people > > on NUMA to disable pagecache interleaving right away as I expect most > > tmpfs workloads to see little to no reclaim and prefer locality... :/ > > Actually, I suspect most tmpfs heavy workloads will be things like > databases with shared memory segments. Those tend to benefit from > having all of the system's memory bandwidth available. 
The worker > threads/processes tend to live all over the system, too... Shared memory segments are explicitely excluded from the interleaving, though. The distinction is between the internal tmpfs mount that sysv shmem uses (mempolicy) and tmpfs mounts that use the actual filesystem interface (pagecache interleave). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755078Ab3LROvO (ORCPT ); Wed, 18 Dec 2013 09:51:14 -0500 Received: from cantor2.suse.de ([195.135.220.15]:56060 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754820Ab3LROvN (ORCPT ); Wed, 18 Dec 2013 09:51:13 -0500 Date: Wed, 18 Dec 2013 15:51:11 +0100 From: Michal Hocko To: Johannes Weiner Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218145111.GA27510@dhcp22.suse.cz> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131217200210.GG21724@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 17-12-13 15:02:10, Johannes Weiner wrote: [...] > +pagecache_mempolicy_mode: > + > +This is available only on NUMA kernels. > + > +Per default, the configured memory policy is applicable to anonymous > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > +interleaving fashion over all allowed nodes (hardbindings and > +zone_reclaim_mode excluded). > + > +The assumption is that, when it comes to pagecache, users generally > +prefer predictable replacement behavior regardless of NUMA topology > +and maximizing the cache's effectiveness in reducing IO over memory > +locality. 
Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing semantically? The setting is per-cpuset rather than global which makes it harder to use but essentially it tries to distribute page cache pages across all the nodes. This is really getting confusing. We have zone_reclaim_mode to keep memory local in general, pagecache_mempolicy_mode to keep page cache local and PF_SPREAD_PAGE to spread the page cache around nodes. > + > +This behavior can be changed by enabling pagecache_mempolicy_mode, in > +which case page cache allocations will be placed according to the > +configured memory policy (Documentation/vm/numa_memory_policy.txt). -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755318Ab3LRPAo (ORCPT ); Wed, 18 Dec 2013 10:00:44 -0500 Received: from cantor2.suse.de ([195.135.220.15]:56332 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754607Ab3LRPAm (ORCPT ); Wed, 18 Dec 2013 10:00:42 -0500 Date: Wed, 18 Dec 2013 15:00:38 +0000 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218150038.GP11295@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131218061750.GK21724@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote: > On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote: > > Hi Mel, > > > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > > > This series is currently untested and is being posted to 
sync up discussions > > > on the treatment of page cache pages, particularly the sysv part. I have > > > not thought it through in detail but postings patches is the easiest way > > > to highlight where I think a problem might be. > > > > > > Changelog since v2 > > > o Drop an accounting patch, behaviour is deliberate > > > o Special case tmpfs and shmem pages for discussion > > > > > > Changelog since v1 > > > o Fix lot of brain damage in the configurable policy patch > > > o Yoink a page cache annotation patch > > > o Only account batch pages against allocations eligible for the fair policy > > > o Add patch that default distributes file pages on remote nodes > > > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a > > > bug whereby new pages could be reclaimed before old pages because of how > > > the page allocator and kswapd interacted on the per-zone LRU lists. > > > > Not just that, it was about ensuring predictable cache replacement and > > maximizing the cache's effectiveness. This implicitely fixed the > > kswapd interaction bug, but that was not the sole reason (I realize > > that the original changelog is incomplete and I apologize for that). > > > > I have had offline discussions with Andrea back then and his first > > suggestion was too to make this a zone fairness placement that is > > exclusive to the local node, but eventually he agreed that the problem > > applies just as much on the global level and that we should apply > > fairness throughout the system as long as we honor zone_reclaim_mode > > and hard bindings. During our discussions now, it turned out that > > zone_reclaim_mode is a terrible predictor for preferred locality, but > > we also more or less agreed that the locality issues in the first > > place are not really applicable to cache loads dominated by IO cost. 
> > > > So I think the main discrepancy between the original patch and what we > > truly want is that aging fairness is really only relevant for actual > > cache backed by secondary storage, because cache replacement is an > > ongoing operation that involves IO. As opposed to memory types that > > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all > > (slab, kernel allocations), in which case we prefer NUMA locality. > > > > > Unfortunately a side-effect missed during review was that it's now very > > > easy to allocate remote memory on NUMA machines. The problem is that > > > it is not a simple case of just restoring local allocation policies as > > > there are genuine reasons why global page aging may be preferable. It's > > > still a major change to default behaviour so this patch makes the policy > > > configurable and sets what I think is a sensible default. > > > > > > The patches are on top of some NUMA balancing patches currently in -mm. > > > It's untested and posted to discuss patches 4 and 6. > > > > It might be easier in dealing with -stable if we start with the > > critical fix(es) to restore sane functionality as much and as compact > > as possible and then place the cleanups on top? > > > > In my local tree, I have the following as the first patch: > > Updated version with your tmpfs __GFP_PAGECACHE parts added and > documentation, changelog updated as necessary. I remain unconvinced > that tmpfs pages should be round-robined, but I agree with you that it > is the conservative change to do for 3.12 and 3.12 and we can figure > out the rest later. I assume you mean 3.12 and 3.13 here. > I sure hope that this doesn't drive most people > on NUMA to disable pagecache interleaving right away as I expect most > tmpfs workloads to see little to no reclaim and prefer locality... :/ > I hope you're right but I expect the experience will be like zone_reclaim_mode.
We're going to be looking out for bug reports that are "fixed" by disabling pagecache locality and pushing back on them by fixing the real problem. This was the experience with zone_reclaim_mode when it started going wrong. It was also the experience with THP for a very long time. Disabling THP was a workaround for all sorts of problems and it was very important to fix them and push back on anyone documenting disabling THP as a standard workaround. > --- > From: Johannes Weiner > Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache > Monolithic patch with multiple changes but meh. I'm not pushed because I know what the breakout looks like. FWIW, I had intended the entirety of my broken-out series for 3.12 and 3.13 once it got ironed out. I find the series easier to understand but of course I would. > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged > in order to ensure predictable pagecache replacement and to maximize > the cache's effectiveness of reducing IO regardless of zone or node > topology. > > However, it was overzealous in round-robin placing every type of > allocation over all allowable nodes, instead of preferring locality, > which resulted in severe regressions on certain NUMA workloads that > have nothing to do with pagecache. > > This patch drastically reduces the impact of the original change by > having the round-robin placement policy only apply to pagecache > allocations and no longer to anonymous memory, shmem, slab and other > types of kernel allocations. > > This still changes the long-standing behavior of pagecache adhering to > the configured memory policy and preferring local allocations per > default, so make it configurable in case somebody relies on it. > However, we also expect the majority of users to prefer maximum cache > effectiveness and a predictable replacement behavior over memory > locality, so reflect this in the default setting of the sysctl.
> > No-signoff-without-Mel's > Cc: # 3.12 > --- > Documentation/sysctl/vm.txt | 20 ++++++++++++++++ > Documentation/vm/numa_memory_policy.txt | 7 ++++++ > include/linux/gfp.h | 4 +++- > include/linux/pagemap.h | 2 +- > include/linux/swap.h | 2 ++ > kernel/sysctl.c | 8 +++++++ > mm/filemap.c | 2 ++ > mm/page_alloc.c | 41 +++++++++++++++++++++++++-------- > mm/shmem.c | 14 +++++++++++ > 9 files changed, 88 insertions(+), 12 deletions(-) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 1fbd4eb7b64a..308c342f62ad 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: > - memory_failure_early_kill > - memory_failure_recovery > - min_free_kbytes > +- pagecache_mempolicy_mode > - min_slab_ratio > - min_unmapped_ratio > - mmap_min_addr Are you sure about the name? This is a boolean and "mode" implies it might be a bitmask. That said, in complaining about yours I can see that my own naming also sucks. > @@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly. > > ============================================================= > > +pagecache_mempolicy_mode: > + > +This is available only on NUMA kernels. > + > +Per default, pagecache is allocated in an interleaving fashion over > +all allowed nodes (hardbindings and zone_reclaim_mode excluded), > +regardless of the selected memory policy. > + > +The assumption is that, when it comes to pagecache, users generally > +prefer predictable replacement behavior regardless of NUMA topology > +and maximizing the cache's effectiveness in reducing IO over memory > +locality. > + > +This behavior can be changed by enabling pagecache_mempolicy_mode, in > +which case page cache allocations will be placed according to the > +configured memory policy (Documentation/vm/numa_memory_policy.txt).
> + Ok, this indicates that pagecache will still be interleaved on zones local to the node the process is allocating on. Good because that preserves a very important aspect of your original patch. The current description feels a little backwards though -- "Enable this to *not* interleave pagecache". This documented behaviour says to me that pagecache_obey_mempolicy might be a better name if enabling it uses the system default memory policy. However, even that might put us in a corner. Ultimately we want this to be controllable on a per-process basis using memory policies. Merging what I have in v3, the unreleased v4 and this patch, I ended up with the following. The observation about cpusets was raised by Michal Hocko on IRC. ---8<--- mpol_interleave_files This is available only on NUMA kernels. Historically, the default behaviour of the system is to allocate memory local to the process. The behaviour was usually modified through the use of memory policies while zone_reclaim_mode controls how strict the local memory allocation policy is. Issues arise when the allocating process is frequently running on the same node. The kernel's memory reclaim daemon runs one instance per NUMA node. A consequence is that relatively new memory may be reclaimed by kswapd when the allocating process is running on a specific node. The user-visible impact is that the system appears to do more IO than necessary when a workload is accessing files that are larger than a given NUMA node. To address this problem, the default system memory policy is modified by this tunable. When this tunable is enabled, the system default memory policy will interleave batches of file-backed pages over all allowed zones and nodes. The assumption is that, when it comes to file pages, users generally prefer predictable replacement behavior regardless of NUMA topology and maximizing the page cache's effectiveness in reducing IO over memory locality.
The tunable zone_reclaim_mode overrides this and enabling zone_reclaim_mode functionally disables mpol_interleave_files. A process running within a memory cpuset will obey the cpuset policy and ignore mpol_interleave_files. At the time of writing, this parameter cannot be overridden by a process using set_mempolicy to set the task memory policy. Similarly, numactl setting the task memory policy will not override this setting. This may change in the future. The tunable is enabled by default and has two recognised values: 0: Use the MPOL_LOCAL policy as the system-wide default 1: Batch interleave file-backed allocations over all allowed nodes Once enabled, the downside is that some file accesses will now be to remote memory even though the local node had available resources. This will hurt workloads with small or short-lived files that fit easily within one node. The upside is that workloads working on files larger than a NUMA node will not reclaim active pages prematurely. ---8<--- > +============================================================= > + > min_slab_ratio: > > This is available only on NUMA kernels. > diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt > index 4e7da6543424..72247e565908 100644 > --- a/Documentation/vm/numa_memory_policy.txt > +++ b/Documentation/vm/numa_memory_policy.txt > @@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of. When > both cpusets and policies are applied to a task, the restrictions of the cpuset > takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. > > +Note that, per default, the memory policies do not apply to pagecache. Instead > +it will be interleaved fairly over all allowable nodes (respecting hardbindings > +and zone_reclaim_mode) in order to maximize the cache's effectiveness in > +reducing IO and to ensure predictable cache replacement.
Special setups that > +require pagecache to adhere to the configured memory policy can change this > +behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt). > + Manual pages should also be updated. > MEMORY POLICY CONCEPTS > > Scope of Memory Policies > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 9b4dd491f7e8..f69e4cb78ccf 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -35,6 +35,7 @@ struct vm_area_struct; > #define ___GFP_NO_KSWAPD 0x400000u > #define ___GFP_OTHER_NODE 0x800000u > #define ___GFP_WRITE 0x1000000u > +#define ___GFP_PAGECACHE 0x2000000u > /* If the above are modified, __GFP_BITS_SHIFT may need updating */ > > /* > @@ -92,6 +93,7 @@ struct vm_area_struct; > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ > #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ > #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ > +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */ > > /* > * This may seem redundant, but it's a way of annotating false positives vs. 
> @@ -99,7 +101,7 @@ struct vm_area_struct; > */ > #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) > > -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ > +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */ > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) > > /* This equals 0, but use constants in case they ever change */ > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index e3dea75a078b..bda48453af8e 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp); > #else > static inline struct page *__page_cache_alloc(gfp_t gfp) > { > - return alloc_pages(gfp, 0); > + return alloc_pages(gfp | __GFP_PAGECACHE, 0); > } > #endif > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 46ba0c6c219f..3458994b0881 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages; > > #ifdef CONFIG_NUMA > extern int zone_reclaim_mode; > +extern int pagecache_mempolicy_mode; > extern int sysctl_min_unmapped_ratio; > extern int sysctl_min_slab_ratio; > extern int zone_reclaim(struct zone *, gfp_t, unsigned int); > #else > #define zone_reclaim_mode 0 > +#define pagecache_mempolicy_mode 0 > static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) > { > return 0; > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 34a604726d0b..a8c56c1dc98e 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = { > .extra1 = &zero, > }, > { > + .procname = "pagecache_mempolicy_mode", > + .data = &pagecache_mempolicy_mode, > + .maxlen = sizeof(pagecache_mempolicy_mode), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + .extra1 = &zero, > + }, > + { > .procname = "min_unmapped_ratio", > .data = &sysctl_min_unmapped_ratio, > .maxlen = sizeof(sysctl_min_unmapped_ratio), > diff --git 
a/mm/filemap.c b/mm/filemap.c > index b7749a92021c..5bb922506906 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp) > int n; > struct page *page; > > + gfp |= __GFP_PAGECACHE; > + > if (cpuset_do_page_mem_spread()) { > unsigned int cpuset_mems_cookie; > do { > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 580a5f075ed0..f7c0ecb5bb8b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1547,7 +1547,15 @@ again: > get_pageblock_migratetype(page)); > } > > + /* > + * All allocations eat into the round-robin batch, even > + * allocations that are not subject to round-robin placement > + * themselves. This makes sure that allocations that ARE > + * subject to round-robin placement compensate for the > + * allocations that aren't, to have equal placement overall. > + */ > __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); > + > __count_zone_vm_events(PGALLOC, zone, 1 << order); > zone_statistics(preferred_zone, zone, gfp_flags); > local_irq_restore(flags); Thanks. > @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > > #ifdef CONFIG_NUMA > /* > + * pagecache_mempolicy_mode - whether pagecache allocations should > + * honor the configured memory policy and allocate from the zonelist > + * in order of preference, or whether they should interleave fairly > + * over all allowed zones in the given zonelist to maximize cache > + * effects and ensure predictable cache replacement. > + */ > +int pagecache_mempolicy_mode __read_mostly; > + > +/* > * zlc_setup - Setup for "zonelist cache". Uses cached zone data to > * skip over zones that are not allowed by the cpuset, or that have > * been recently (in last second) found to be nearly full. 
See further > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > static bool zone_local(struct zone *local_zone, struct zone *zone) > { > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > + return local_zone->node == zone->node; > } Does that not break on !CONFIG_NUMA? It's why I used zone_to_nid > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > @@ -1908,22 +1925,25 @@ zonelist_scan: > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) > goto try_this_zone; > /* > - * Distribute pages in proportion to the individual > - * zone size to ensure fair page aging. The zone a > - * page was allocated in should have no effect on the > - * time the page has in memory before being reclaimed. > + * Distribute pagecache pages in proportion to the > + * individual zone size to ensure fair page aging. > + * The zone a page was allocated in should have no > + * effect on the time the page has in memory before > + * being reclaimed. > * > - * When zone_reclaim_mode is enabled, try to stay in > - * local zones in the fastpath. If that fails, the > + * When pagecache_mempolicy_mode or zone_reclaim_mode > + * is enabled, try to allocate from zones within the > + * preferred node in the fastpath. If that fails, the > * slowpath is entered, which will do another pass > * starting with the local zones, but ultimately fall > * back to remote zones that do not partake in the > * fairness round-robin cycle of this zonelist. > */ > - if (alloc_flags & ALLOC_WMARK_LOW) { > + if ((alloc_flags & ALLOC_WMARK_LOW) && > + (gfp_mask & __GFP_PAGECACHE)) { > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > continue; NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in some cases. Looks a bit fishy even though I can't quite put my finger on it. I also got details wrong here in the v3 of the series. 
In an unreleased v4 of the series I had corrected the treatment of slab pages in line with your wishes and reused the broken out helper in prepare_slowpath to keep the decision in sync. It's still in development but even if it gets rejected it'll act as a comparison point to yours. > - if (zone_reclaim_mode && > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && > !zone_local(preferred_zone, zone)) > continue; > } Documentation says "enabling pagecache_mempolicy_mode, in which case page cache allocations will be placed according to the configured memory policy". Should that be !pagecache_mempolicy_mode? I'm getting confused with the double nots. Breaking this out would be more comprehensible. On a semi-related note, we might encounter a problem later where the interleaving causes us to skip over usable zones and zones with available batches are !zone_dirty_ok. We'd fall back to the slowpath resetting the batches so it will not be particularly visible but there might be some interactions there. > @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, > * thrash fairness information for zones that are not > * actually part of this zonelist's round-robin cycle. 
> */ > - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && > + !zone_local(preferred_zone, zone)) > continue; > mod_zone_page_state(zone, NR_ALLOC_BATCH, > high_wmark_pages(zone) - > diff --git a/mm/shmem.c b/mm/shmem.c > index 8297623fcaed..02d7a9c03463 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, > return page; > } > > +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */ > +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info) > +{ > + /* If no internal shm_mount then it must be tmpfs */ > + if (IS_ERR(shm_mnt)) > + return true; > + > + /* Consider it to be tmpfs if the superblock is not the internal mount */ > + return info->vfs_inode.i_sb != shm_mnt->mnt_sb; > +} > + > static struct page *shmem_alloc_page(gfp_t gfp, > struct shmem_inode_info *info, pgoff_t index) > { > @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp, > pvma.vm_ops = NULL; > pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); > > + if (shmem_inode_on_tmpfs(info)) > + gfp |= __GFP_PAGECACHE; > + > page = alloc_page_vma(gfp, &pvma, 0); > > /* Drop reference taken by mpol_shared_policy_lookup() */ For what it's worth, this is what I've currently kicked off tests for git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12 -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755186Ab3LRPS6 (ORCPT ); Wed, 18 Dec 2013 10:18:58 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:50355 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754878Ab3LRPS5 (ORCPT ); Wed, 18 Dec 2013 10:18:57 -0500 Date: Wed, 18 Dec 2013 10:18:46 -0500 From: Johannes Weiner To: Michal Hocko Cc: Mel Gorman , Andrew Morton , Dave Hansen 
, Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218151846.GM21724@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218145111.GA27510@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote: > On Tue 17-12-13 15:02:10, Johannes Weiner wrote: > [...] > > +pagecache_mempolicy_mode: > > + > > +This is available only on NUMA kernels. > > + > > +Per default, the configured memory policy is applicable to anonymous > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > > +interleaving fashion over all allowed nodes (hardbindings and > > +zone_reclaim_mode excluded). > > + > > +The assumption is that, when it comes to pagecache, users generally > > +prefer predictable replacement behavior regardless of NUMA topology > > +and maximizing the cache's effectiveness in reducing IO over memory > > +locality. > > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing > semantically? The setting is per-cpuset rather than global which makes > it harder to use but essentially it tries to distribute page cache pages > across all the nodes. > > This is really getting confusing. We have zone_reclaim_mode to keep > memory local in general, pagecache_mempolicy_mode to keep page cache > local and PF_SPREAD_PAGE to spread the page cache around nodes. zone_reclaim_mode is a global setting to go through great lengths to stay on local nodes, intended to be used depending on the hardware, not the workload. Mempolicy on the other hand is to optimize placement for maximum locality depending on access patterns of a workload or even just the subset of a workload. 
I'm trying to change whether this applies to page cache (due to different locality / cache effectiveness tradeoff) and we want to provide pagecache_mempolicy_mode to revert in the field in case this is a mistake. PF_SPREAD_PAGE becomes implied per default and should eventually be removed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754967Ab3LRQJl (ORCPT ); Wed, 18 Dec 2013 11:09:41 -0500 Received: from cantor2.suse.de ([195.135.220.15]:58205 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754862Ab3LRQJj (ORCPT ); Wed, 18 Dec 2013 11:09:39 -0500 Date: Wed, 18 Dec 2013 16:09:36 +0000 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218160936.GX11295@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <20131218150038.GP11295@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131218150038.GP11295@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote: > > For what it's worth, this is what I've currently kicked off tests for > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12 > Pushed a dirty tree by accident. 
Now mm-pgalloc-interleave-zones-v4r13 -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754102Ab3LRQUy (ORCPT ); Wed, 18 Dec 2013 11:20:54 -0500 Received: from cantor2.suse.de ([195.135.220.15]:58696 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751443Ab3LRQUw (ORCPT ); Wed, 18 Dec 2013 11:20:52 -0500 Date: Wed, 18 Dec 2013 17:20:50 +0100 From: Michal Hocko To: Johannes Weiner Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218162050.GB27510@dhcp22.suse.cz> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> <20131218151846.GM21724@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218151846.GM21724@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 18-12-13 10:18:46, Johannes Weiner wrote: > On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote: > > On Tue 17-12-13 15:02:10, Johannes Weiner wrote: > > [...] > > > +pagecache_mempolicy_mode: > > > + > > > +This is available only on NUMA kernels. > > > + > > > +Per default, the configured memory policy is applicable to anonymous > > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > > > +interleaving fashion over all allowed nodes (hardbindings and > > > +zone_reclaim_mode excluded). > > > + > > > +The assumption is that, when it comes to pagecache, users generally > > > +prefer predictable replacement behavior regardless of NUMA topology > > > +and maximizing the cache's effectiveness in reducing IO over memory > > > +locality. 
> > > > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing > > semantically? The setting is per-cpuset rather than global which makes > > it harder to use but essentially it tries to distribute page cache pages > > across all the nodes. > > > > This is really getting confusing. We have zone_reclaim_mode to keep > > memory local in general, pagecache_mempolicy_mode to keep page cache > > local and PF_SPREAD_PAGE to spread the page cache around nodes. > > zone_reclaim_mode is a global setting to go through great lengths to > stay on local nodes, intended to be used depending on the hardware, > not the workload. > > Mempolicy on the other hand is to optimize placement for maximum > locality depending on access patterns of a workload or even just the > subset of a workload. I'm trying to change whether this applies to > page cache (due to different locality / cache effectiveness tradeoff) > and we want to provide pagecache_mempolicy_mode to revert in the field > in case this is a mistake. > > PF_SPREAD_PAGE becomes implied per default and should eventually be > removed. I guess many loads do not care about page cache locality and the default spreading would be OK for them but what about those that do care? Currently we have a per-process (cpuset in fact) flag but this will change it to all or nothing. Is this really a good step? Btw. I do not mind having PF_SPREAD_PAGE enabled by default. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751954Ab3LRTX4 (ORCPT ); Wed, 18 Dec 2013 14:23:56 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:50369 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750872Ab3LRTXz (ORCPT ); Wed, 18 Dec 2013 14:23:55 -0500 Date: Wed, 18 Dec 2013 14:20:15 -0500 From: Johannes Weiner To: Michal Hocko Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218192015.GA20038@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> <20131218151846.GM21724@cmpxchg.org> <20131218162050.GB27510@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218162050.GB27510@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote: > On Wed 18-12-13 10:18:46, Johannes Weiner wrote: > > On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote: > > > On Tue 17-12-13 15:02:10, Johannes Weiner wrote: > > > [...] > > > > +pagecache_mempolicy_mode: > > > > + > > > > +This is available only on NUMA kernels. > > > > + > > > > +Per default, the configured memory policy is applicable to anonymous > > > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an > > > > +interleaving fashion over all allowed nodes (hardbindings and > > > > +zone_reclaim_mode excluded). 
> > > > + > > > > +The assumption is that, when it comes to pagecache, users generally > > > > +prefer predictable replacement behavior regardless of NUMA topology > > > > +and maximizing the cache's effectiveness in reducing IO over memory > > > > +locality. > > > > > > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing > > > semantically? The setting is per-cpuset rather than global which makes > > > it harder to use but essentially it tries to distribute page cache pages > > > across all the nodes. > > > > > > This is really getting confusing. We have zone_reclaim_mode to keep > > > memory local in general, pagecache_mempolicy_mode to keep page cache > > > local and PF_SPREAD_PAGE to spread the page cache around nodes. You are right that the user interface we are exposing is kind of cruddy and I'm less and less convinced that this is the right direction. > > zone_reclaim_mode is a global setting to go through great lengths to > > stay on local nodes, intended to be used depending on the hardware, > > not the workload. > > > > Mempolicy on the other hand is to optimize placement for maximum > > locality depending on access patterns of a workload or even just the > > subset of a workload. I'm trying to change whether this applies to > > page cache (due to different locality / cache effectiveness tradeoff) > > and we want to provide pagecache_mempolicy_mode to revert in the field > > in case this is a mistake. > > > > PF_SPREAD_PAGE becomes implied per default and should eventually be > > removed. > > I guess many loads do not care about page cache locality and the default > spreading would be OK for them but what about those that do care? Mel suggested that the page cache spreading be implemented as just another memory policy and I rejected it on the grounds that we can have strange aging artifacts if it's not the default. But you are right that there might be usecases that really have high cache locality and don't incur any reclaim. 
The aging artifacts are non-existent to them but they would care about the NUMA locality. And basically, the same aging artifacts apply to anon, e.g., just that the trade-off balance is different, as reclaim is much less common. And we do offer interleaving for anon as well. So the situation is not all that different than I had myself convinced it would be... So the more I'm thinking about it, the more I'm leaning towards making it a mempolicy after all, provided that we can set a sane default. Maybe we can make the new default a hybrid policy that keeps anon, shmem, slab, kernel, etc. local but interleaves pagecache. This should make sense to most usecases while providing the ability for custom placement policies per-process or per-VMA without having to make the decision on a global level or through an unusual interface. > Currently we have a per-process (cpuset in fact) flag but this will > change it to all or nothing. Is this really a good step? > Btw. I do not mind having PF_SPREAD_PAGE enabled by default. I don't want to muck around with cpusets too much, tbh... but I agree that the behavior of PF_SPREAD_PAGE should be the default. Except it should honor zone_reclaim_mode and round-robin nodes that are within RECLAIM_DISTANCE of the local one. I will have spotty access to internet starting tomorrow night until New Year's. Is there a chance we can maybe revert the NUMA aspects of the original patch for now and leave it as a node-local zone fairness thing? The NUMA behavior was so broken on 3.12 that I doubt that people have come to rely on the cache fairness on such machines in that one release. So we should be able to release 3.12-stable and 3.13 with node-local zone fairness without regressing anybody, and then give the NUMA aspect of it another try in 3.14. Something like the following should restore NUMA behavior while still fixing the kswapd vs. page allocator interaction bug of thrashing on the highest zone. 
PS: zone_local() is in a CONFIG_NUMA block, which is why accessing zone->node is safe :-) --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dd886fac451a..317ea747d2cd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) static bool zone_local(struct zone *local_zone, struct zone *zone) { - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; + return local_zone->node == zone->node; } static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) @@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, * page was allocated in should have no effect on the * time the page has in memory before being reclaimed. * - * When zone_reclaim_mode is enabled, try to stay in - * local zones in the fastpath. If that fails, the - * slowpath is entered, which will do another pass - * starting with the local zones, but ultimately fall - * back to remote zones that do not partake in the - * fairness round-robin cycle of this zonelist. + * Try to stay in local zones in the fastpath. If + * that fails, the slowpath is entered, which will do + * another pass starting with the local zones, but + * ultimately fall back to remote zones that do not + * partake in the fairness round-robin cycle of this + * zonelist. */ if (alloc_flags & ALLOC_WMARK_LOW) { if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) continue; - if (zone_reclaim_mode && - !zone_local(preferred_zone, zone)) + if (!zone_local(preferred_zone, zone)) continue; } /* @@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, * thrash fairness information for zones that are not * actually part of this zonelist's round-robin cycle. 
*/ - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) + if (!zone_local(preferred_zone, zone)) continue; mod_zone_page_state(zone, NR_ALLOC_BATCH, high_wmark_pages(zone) - From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752214Ab3LRTvu (ORCPT ); Wed, 18 Dec 2013 14:51:50 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:50375 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751140Ab3LRTvt (ORCPT ); Wed, 18 Dec 2013 14:51:49 -0500 Date: Wed, 18 Dec 2013 14:48:13 -0500 From: Johannes Weiner To: Mel Gorman Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131218194813.GB20038@cmpxchg.org> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <20131218150038.GP11295@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218150038.GP11295@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote: > On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote: > > On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote: > > > Hi Mel, > > > > > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote: > > > > This series is currently untested and is being posted to sync up discussions > > > > on the treatment of page cache pages, particularly the sysv part. I have > > > > not thought it through in detail but postings patches is the easiest way > > > > to highlight where I think a problem might be. 
> > > > > > > > Changelog since v2 > > > > o Drop an accounting patch, behaviour is deliberate > > > > o Special case tmpfs and shmem pages for discussion > > > > > > > > Changelog since v1 > > > > o Fix lot of brain damage in the configurable policy patch > > > > o Yoink a page cache annotation patch > > > > o Only account batch pages against allocations eligible for the fair policy > > > > o Add patch that default distributes file pages on remote nodes > > > > > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a > > > > bug whereby new pages could be reclaimed before old pages because of how > > > > the page allocator and kswapd interacted on the per-zone LRU lists. > > > > > > Not just that, it was about ensuring predictable cache replacement and > > > maximizing the cache's effectiveness. This implicitely fixed the > > > kswapd interaction bug, but that was not the sole reason (I realize > > > that the original changelog is incomplete and I apologize for that). > > > > > > I have had offline discussions with Andrea back then and his first > > > suggestion was too to make this a zone fairness placement that is > > > exclusive to the local node, but eventually he agreed that the problem > > > applies just as much on the global level and that we should apply > > > fairness throughout the system as long as we honor zone_reclaim_mode > > > and hard bindings. During our discussions now, it turned out that > > > zone_reclaim_mode is a terrible predictor for preferred locality, but > > > we also more or less agreed that the locality issues in the first > > > place are not really applicable to cache loads dominated by IO cost. > > > > > > So I think the main discrepancy between the original patch and what we > > > truly want is that aging fairness is really only relevant for actual > > > cache backed by secondary storage, because cache replacement is an > > > ongoing operation that involves IO. 
As opposed to memory types that > > > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all > > > (slab, kernel allocations), in which case we prefer NUMA locality. > > > > > > > Unfortunately a side-effect missed during review was that it's now very > > > > easy to allocate remote memory on NUMA machines. The problem is that > > > > it is not a simple case of just restoring local allocation policies as > > > > there are genuine reasons why global page aging may be prefereable. It's > > > > still a major change to default behaviour so this patch makes the policy > > > > configurable and sets what I think is a sensible default. > > > > > > > > The patches are on top of some NUMA balancing patches currently in -mm. > > > > It's untested and posted to discuss patches 4 and 6. > > > > > > It might be easier in dealing with -stable if we start with the > > > critical fix(es) to restore sane functionality as much and as compact > > > as possible and then place the cleanups on top? > > > > > > In my local tree, I have the following as the first patch: > > > > Updated version with your tmpfs __GFP_PAGECACHE parts added and > > documentation, changelog updated as necessary. I remain unconvinced > > that tmpfs pages should be round-robined, but I agree with you that it > > is the conservative change to do for 3.12 and 3.12 and we can figure > > out the rest later. > > Assume you mean 3.12 and 3.13 here. Yes :) > > --- > > From: Johannes Weiner > > Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache > > > > Monolithic patch with multiple changes but meh. I'm not pushed because I > know what the breakout looks like. FWIW, I had intended the entire of my > broken-out series for 3.12 and 3.13 once it got ironed out. I find the > series easier to understand but of course I would. And of course I can live without the cleanups to make code I wrote more readable ;-) I'm happy to defer on this, let's keep logical changes separated. 
> > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged > > in order to ensure predictable pagecache replacement and to maximize > > the cache's effectiveness of reducing IO regardless of zone or node > > topology. > > > > However, it was overzealous in round-robin placing every type of > > allocation over all allowable nodes, instead of preferring locality, > > which resulted in severe regressions on certain NUMA workloads that > > have nothing to do with pagecache. > > > > This patch drastically reduces the impact of the original change by > > having the round-robin placement policy only apply to pagecache > > allocations and no longer to anonymous memory, shmem, slab and other > > types of kernel allocations. > > > > This still changes the long-standing behavior of pagecache adhering to > > the configured memory policy and preferring local allocations per > > default, so make it configurable in case somebody relies on it. > > However, we also expect the majority of users to prefer maximium cache > > effectiveness and a predictable replacement behavior over memory > > locality, so reflect this in the default setting of the sysctl. 
> > > > No-signoff-without-Mel's > > Cc: # 3.12 > > --- > > Documentation/sysctl/vm.txt | 20 ++++++++++++++++ > > Documentation/vm/numa_memory_policy.txt | 7 ++++++ > > include/linux/gfp.h | 4 +++- > > include/linux/pagemap.h | 2 +- > > include/linux/swap.h | 2 ++ > > kernel/sysctl.c | 8 +++++++ > > mm/filemap.c | 2 ++ > > mm/page_alloc.c | 41 +++++++++++++++++++++++++-------- > > mm/shmem.c | 14 +++++++++++ > > 9 files changed, 88 insertions(+), 12 deletions(-) > > > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > > index 1fbd4eb7b64a..308c342f62ad 100644 > > --- a/Documentation/sysctl/vm.txt > > +++ b/Documentation/sysctl/vm.txt > > @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: > > - memory_failure_early_kill > > - memory_failure_recovery > > - min_free_kbytes > > +- pagecache_mempolicy_mode > > - min_slab_ratio > > - min_unmapped_ratio > > - mmap_min_addr > > Sure about the name? > > This is a boolean and "mode" implies it might be a bitmask. That said, I > recognise that my own naming also sucked because complaining about yours > I can see that mine also sucks. Is it because of how we use zone_reclaim_mode? I don't see anything wrong with a "mode" toggle that switches between only two modes of operation instead of three or more. But English being a second language and all... > > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > > > static bool zone_local(struct zone *local_zone, struct zone *zone) > > { > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > > + return local_zone->node == zone->node; > > } > > Does that not break on !CONFIG_NUMA? > > It's why I used zone_to_nid There is a separate definition for !CONFIG_NUMA, it fit nicely next to the zlc stuff. 
> > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > > @@ -1908,22 +1925,25 @@ zonelist_scan: > > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) > > goto try_this_zone; > > /* > > - * Distribute pages in proportion to the individual > > - * zone size to ensure fair page aging. The zone a > > - * page was allocated in should have no effect on the > > - * time the page has in memory before being reclaimed. > > + * Distribute pagecache pages in proportion to the > > + * individual zone size to ensure fair page aging. > > + * The zone a page was allocated in should have no > > + * effect on the time the page has in memory before > > + * being reclaimed. > > * > > - * When zone_reclaim_mode is enabled, try to stay in > > - * local zones in the fastpath. If that fails, the > > + * When pagecache_mempolicy_mode or zone_reclaim_mode > > + * is enabled, try to allocate from zones within the > > + * preferred node in the fastpath. If that fails, the > > * slowpath is entered, which will do another pass > > * starting with the local zones, but ultimately fall > > * back to remote zones that do not partake in the > > * fairness round-robin cycle of this zonelist. > > */ > > - if (alloc_flags & ALLOC_WMARK_LOW) { > > + if ((alloc_flags & ALLOC_WMARK_LOW) && > > + (gfp_mask & __GFP_PAGECACHE)) { > > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > > continue; > > NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or > pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in > some cases. Looks a bit fishy even though I can't quite put my finger on it. > > I also got details wrong here in the v3 of the series. In an unreleased > v4 of the series I had corrected the treatment of slab pages in line > with your wishes and reused the broken out helper in prepare_slowpath to > keep the decision in sync. > > It's still in development but even if it gets rejected it'll act as a > comparison point to yours. 
> > > - if (zone_reclaim_mode && > > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && > > !zone_local(preferred_zone, zone)) > > continue; > > } > > Documentation says "enabling pagecache_mempolicy_mode, in which case page cache > allocations will be placed according to the configured memory policy". Should > that be !pagecache_mempolicy_mode? I'm getting confused with the double nots. Yes, it's a bit weird. We want to consider the round-robin batches for local zones but at the same time avoid exhausted batches from pushing the allocation off-node when either of those modes is enabled. So in the fastpath we filter for both and in the slowpath, once kswapd has been woken at the same time that the batches have been reset to launch the new aging cycle, we try in order of zonelist preference. However, to answer your question above, if the slowpath still has to fall back to a remote zone, we don't want to reset its batch because we didn't verify it was actually exhausted in the fastpath and we could risk cutting short the aging cycle for that particular zone. 
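The fastpath filter and the slowpath batch reset described in the reply above can be modelled in user space. What follows is a rough sketch under stated assumptions, not the kernel code: struct zone_model, fastpath_pick() and slowpath_reset_batches() are hypothetical names, alloc_batch stands in for NR_ALLOC_BATCH, and batch_size is an assumed per-zone value that the reset restores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone; only what the sketch needs. */
struct zone_model {
    int node;          /* NUMA node the zone belongs to */
    long alloc_batch;  /* models the NR_ALLOC_BATCH counter */
    long batch_size;   /* assumed value a reset restores */
};

static bool zone_local_model(const struct zone_model *preferred,
                             const struct zone_model *zone)
{
    return preferred->node == zone->node;
}

/*
 * Fastpath model: skip zones whose batch is exhausted and, when
 * local_only is set, zones outside the preferred node.
 * Returns the index of the zone used, or -1 if none qualified.
 */
static int fastpath_pick(struct zone_model *zones, size_t nr,
                         const struct zone_model *preferred, bool local_only)
{
    assert(zones != NULL && preferred != NULL);
    for (size_t i = 0; i < nr; i++) {
        if (zones[i].alloc_batch <= 0)
            continue;
        if (local_only && !zone_local_model(preferred, &zones[i]))
            continue;
        zones[i].alloc_batch--;  /* account one page against the batch */
        return (int)i;
    }
    return -1;
}

/*
 * Slowpath model: start a new aging cycle by refilling batches, but
 * leave remote zones alone when local_only is set, because the
 * fastpath never verified that their batches were exhausted.
 */
static void slowpath_reset_batches(struct zone_model *zones, size_t nr,
                                   const struct zone_model *preferred,
                                   bool local_only)
{
    assert(zones != NULL && preferred != NULL);
    for (size_t i = 0; i < nr; i++) {
        if (local_only && !zone_local_model(preferred, &zones[i]))
            continue;
        zones[i].alloc_batch = zones[i].batch_size;
    }
}
```

With local_only set, a fastpath miss leaves the remote zone's partially consumed batch untouched across the reset, which is the "don't cut short the aging cycle" property argued for above.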
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753332Ab3LSLU6 (ORCPT ); Thu, 19 Dec 2013 06:20:58 -0500 Received: from cantor2.suse.de ([195.135.220.15]:48841 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751525Ab3LSLU4 (ORCPT ); Thu, 19 Dec 2013 06:20:56 -0500 Date: Thu, 19 Dec 2013 11:20:51 +0000 From: Mel Gorman To: Johannes Weiner Cc: Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131219112051.GH11295@suse.de> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218061750.GK21724@cmpxchg.org> <20131218150038.GP11295@suse.de> <20131218194813.GB20038@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131218194813.GB20038@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 18, 2013 at 02:48:13PM -0500, Johannes Weiner wrote: > > > > > > Sure about the name? > > > > This is a boolean and "mode" implies it might be a bitmask. That said, I > > recognise that my own naming also sucked because complaining about yours > > I can see that mine also sucks. > > Is it because of how we use zone_reclaim_mode? I don't see anything > wrong with a "mode" toggle that switches between only two modes of > operation instead of three or more. But English being a second > language and all... > It's not just zone_reclaim_mode. Most references to mode in the VM (but not all because who needs consistency) refer to either a mask or multiple potential values. isolate_mode_t, gfp masks referred to as mode, memory policies described as mode, migration modes etc. Intuitively, I expect "mode" to not be a binary value. 
> > > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > > > > > static bool zone_local(struct zone *local_zone, struct zone *zone) > > > { > > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > > > + return local_zone->node == zone->node; > > > } > > > > Does that not break on !CONFIG_NUMA? > > > > It's why I used zone_to_nid > > There is a separate definition for !CONFIG_NUMA, it fit nicely next to > the zlc stuff. > Ah, fair enough. > > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > > > @@ -1908,22 +1925,25 @@ zonelist_scan: > > > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) > > > goto try_this_zone; > > > /* > > > - * Distribute pages in proportion to the individual > > > - * zone size to ensure fair page aging. The zone a > > > - * page was allocated in should have no effect on the > > > - * time the page has in memory before being reclaimed. > > > + * Distribute pagecache pages in proportion to the > > > + * individual zone size to ensure fair page aging. > > > + * The zone a page was allocated in should have no > > > + * effect on the time the page has in memory before > > > + * being reclaimed. > > > * > > > - * When zone_reclaim_mode is enabled, try to stay in > > > - * local zones in the fastpath. If that fails, the > > > + * When pagecache_mempolicy_mode or zone_reclaim_mode > > > + * is enabled, try to allocate from zones within the > > > + * preferred node in the fastpath. If that fails, the > > > * slowpath is entered, which will do another pass > > > * starting with the local zones, but ultimately fall > > > * back to remote zones that do not partake in the > > > * fairness round-robin cycle of this zonelist. 
> > > */ > > > - if (alloc_flags & ALLOC_WMARK_LOW) { > > > + if ((alloc_flags & ALLOC_WMARK_LOW) && > > > + (gfp_mask & __GFP_PAGECACHE)) { > > > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > > > continue; > > > > NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or > > pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in > > some cases. Looks a bit fishy even though I can't quite put my finger on it. > > > > I also got details wrong here in the v3 of the series. In an unreleased > > v4 of the series I had corrected the treatment of slab pages in line > > with your wishes and reused the broken out helper in prepare_slowpath to > > keep the decision in sync. > > > > It's still in development but even if it gets rejected it'll act as a > > comparison point to yours. > > > > > - if (zone_reclaim_mode && > > > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) && > > > !zone_local(preferred_zone, zone)) > > > continue; > > > } > > > > Documentation says "enabling pagecache_mempolicy_mode, in which case page cache > > allocations will be placed according to the configured memory policy". Should > > that be !pagecache_mempolicy_mode? I'm getting confused with the double nots. > > Yes, it's a bit weird. > > We want to consider the round-robin batches for local zones but at the > same time avoid exhausted batches from pushing the allocation off-node > when either of those modes is enabled. So in the fastpath we filter > for both and in the slowpath, once kswapd has been woken at the same > time that the batches have been reset to launch the new aging cycle, > we try in order of zonelist preference. > > However, to answer your question above, if the slowpath still has to > fall back to a remote zone, we don't want to reset its batch because > we didn't verify it was actually exhausted in the fastpath and we > could risk cutting short the aging cycle for that particular zone. Understood, thanks. 
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756876Ab3LSM71 (ORCPT ); Thu, 19 Dec 2013 07:59:27 -0500 Received: from cantor2.suse.de ([195.135.220.15]:54824 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755717Ab3LSM7X (ORCPT ); Thu, 19 Dec 2013 07:59:23 -0500 Date: Thu, 19 Dec 2013 13:59:21 +0100 From: Michal Hocko To: Johannes Weiner Cc: Mel Gorman , Andrew Morton , Dave Hansen , Rik van Riel , Linux-MM , LKML Subject: Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Message-ID: <20131219125921.GF10855@dhcp22.suse.cz> References: <1387298904-8824-1-git-send-email-mgorman@suse.de> <20131217200210.GG21724@cmpxchg.org> <20131218145111.GA27510@dhcp22.suse.cz> <20131218151846.GM21724@cmpxchg.org> <20131218162050.GB27510@dhcp22.suse.cz> <20131218192015.GA20038@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131218192015.GA20038@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 18-12-13 14:20:15, Johannes Weiner wrote: > On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote: [...] > > Currently we have a per-process (cpuset in fact) flag but this will > > change it to all or nothing. Is this really a good step? > > Btw. I do not mind having PF_SPREAD_PAGE enabled by default. > > I don't want to muck around with cpusets too much, tbh... but I agree > that the behavior of PF_SPREAD_PAGE should be the default. Except it > should honor zone_reclaim_mode and round-robin nodes that are within > RECLAIM_DISTANCE of the local one. Agreed. > I will have spotty access to internet starting tomorrow night until > New Year's. 
Is there a chance we can maybe revert the NUMA aspects of > the original patch for now and leave it as a node-local zone fairness > thing? Yes, that sounds perfectly reasonable to me. > The NUMA behavior was so broken on 3.12 that I doubt that > people have come to rely on the cache fairness on such machines in > that one release. So we should be able to release 3.12-stable and > 3.13 with node-local zone fairness without regressing anybody, and > then give the NUMA aspect of it another try in 3.14. > > Something like the following should restore NUMA behavior while still > fixing the kswapd vs. page allocator interaction bug of thrashing on > the highest zone. Yes, it looks good to me. I guess zone_local could have stayed as it was because it shouldn't be a big deal to fall back to a different node if the distance is LOCAL, but taking a conservative approach is not harmful. > PS: zone_local() is in a CONFIG_NUMA block, which > is why accessing zone->node is safe :-) > > --- > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index dd886fac451a..317ea747d2cd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist) > > static bool zone_local(struct zone *local_zone, struct zone *zone) > { > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE; > + return local_zone->node == zone->node; > } > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > @@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, > * page was allocated in should have no effect on the > * time the page has in memory before being reclaimed. > * > - * When zone_reclaim_mode is enabled, try to stay in > - * local zones in the fastpath. 
If that fails, the > - * slowpath is entered, which will do another pass > - * starting with the local zones, but ultimately fall > - * back to remote zones that do not partake in the > - * fairness round-robin cycle of this zonelist. > + * Try to stay in local zones in the fastpath. If > + * that fails, the slowpath is entered, which will do > + * another pass starting with the local zones, but > + * ultimately fall back to remote zones that do not > + * partake in the fairness round-robin cycle of this > + * zonelist. > */ > if (alloc_flags & ALLOC_WMARK_LOW) { > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0) > continue; > - if (zone_reclaim_mode && > - !zone_local(preferred_zone, zone)) > + if (!zone_local(preferred_zone, zone)) > continue; > } > /* > @@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order, > * thrash fairness information for zones that are not > * actually part of this zonelist's round-robin cycle. > */ > - if (zone_reclaim_mode && !zone_local(preferred_zone, zone)) > + if (!zone_local(preferred_zone, zone)) > continue; > mod_zone_page_state(zone, NR_ALLOC_BATCH, > high_wmark_pages(zone) - > > -- Michal Hocko SUSE Labs
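The remark about zone_local() above turns on when the two definitions disagree. A minimal sketch, assuming a made-up distance table in which two distinct nodes report local distance to each other; the table, the MODEL_ constants and both helper names are hypothetical, and the helpers take node ids rather than struct zone pointers to keep the example self-contained:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_LOCAL_DISTANCE 10  /* same value as the kernel's LOCAL_DISTANCE */
#define MODEL_NR_NODES 3

/* Hypothetical table: nodes 0 and 1 report local distance to each other. */
static const int model_distance[MODEL_NR_NODES][MODEL_NR_NODES] = {
    { 10, 10, 20 },
    { 10, 10, 20 },
    { 20, 20, 10 },
};

/* Old definition: local means "at local distance". */
static bool zone_local_by_distance(int local_node, int node)
{
    assert(local_node >= 0 && local_node < MODEL_NR_NODES);
    assert(node >= 0 && node < MODEL_NR_NODES);
    return model_distance[local_node][node] == MODEL_LOCAL_DISTANCE;
}

/* New definition from the patch: local means "same node". */
static bool zone_local_by_node(int local_node, int node)
{
    return local_node == node;
}
```

On a table like this one the distance check would treat node 1 as local to node 0 while the node comparison would not, so the node comparison is the stricter, more conservative of the two.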