From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f41.google.com (mail-pa0-f41.google.com [209.85.220.41]) by kanga.kvack.org (Postfix) with ESMTP id 794586B025E for ; Mon, 28 Mar 2016 01:27:13 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id td3so89738747pab.2 for ; Sun, 27 Mar 2016 22:27:13 -0700 (PDT) Received: from mail-pa0-x232.google.com (mail-pa0-x232.google.com. [2607:f8b0:400e:c03::232]) by mx.google.com with ESMTPS id 18si23915042pfr.205.2016.03.27.22.27.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:12 -0700 (PDT) Received: by mail-pa0-x232.google.com with SMTP id td3so89738569pab.2 for ; Sun, 27 Mar 2016 22:27:12 -0700 (PDT) From: js1304@gmail.com Subject: mm/slab: reduce lock contention in alloc path Date: Mon, 28 Mar 2016 14:26:50 +0900 Message-Id: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim While processing concurrent allocation, SLAB could be contended a lot because it did a lots of work with holding a lock. This patchset try to reduce the number of critical section to reduce lock contention. Major changes are lockless decision to allocate more slab and lockless cpu cache refill from the newly allocated slab. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that performance improves greatly (roughly more than 50%) for the object class whose size is more than 128 bytes. Thanks. Joonsoo Kim (11): mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() mm/slab: remove BAD_ALIEN_MAGIC again mm/slab: drain the free slab as much as possible mm/slab: factor out kmem_cache_node initialization code mm/slab: clean-up kmem_cache_node setup mm/slab: don't keep free slabs if free_objects exceeds free_limit mm/slab: racy access/modify the slab color mm/slab: make cache_grow() handle the page allocated on arbitrary node mm/slab: separate cache_grow() to two parts mm/slab: refill cpu cache through a new slab without holding a node lock mm/slab: lockless decision to grow cache mm/slab.c | 495 ++++++++++++++++++++++++++++--------------------------- mm/slab_common.c | 4 + 2 files changed, 255 insertions(+), 244 deletions(-) -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f182.google.com (mail-pf0-f182.google.com [209.85.192.182]) by kanga.kvack.org (Postfix) with ESMTP id 117996B025F for ; Mon, 28 Mar 2016 01:27:16 -0400 (EDT) Received: by mail-pf0-f182.google.com with SMTP id n5so128680584pfn.2 for ; Sun, 27 Mar 2016 22:27:16 -0700 (PDT) Received: from mail-pf0-x232.google.com (mail-pf0-x232.google.com. [2607:f8b0:400e:c00::232]) by mx.google.com with ESMTPS id by10si6171648pab.168.2016.03.27.22.27.15 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:15 -0700 (PDT) Received: by mail-pf0-x232.google.com with SMTP id n5so128680433pfn.2 for ; Sun, 27 Mar 2016 22:27:15 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() Date: Mon, 28 Mar 2016 14:26:51 +0900 Message-Id: <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim Major kmem_cache metadata in slab subsystem is synchronized with the slab_mutex. In SLAB, if some of them is changed, node's shared array cache would be freed and re-populated. If __kmem_cache_shrink() is called at the same time, it will call drain_array() with n->shared without holding node lock so problem can happen. We can fix this small theoretical race condition by holding node lock in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() looks more appropriate solution because stable state would make things less error-prone and this is not performance critical path. In addtion, annotate on SLAB functions. Signed-off-by: Joonsoo Kim --- mm/slab.c | 2 ++ mm/slab_common.c | 4 ++++ 2 files changed, 6 insertions(+) diff --git a/mm/slab.c b/mm/slab.c index a53a0f6..043606a 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2218,6 +2218,7 @@ static void do_drain(void *arg) ac->avail = 0; } +/* Should be called with slab_mutex to prevent from freeing shared array */ static void drain_cpu_caches(struct kmem_cache *cachep) { struct kmem_cache_node *n; @@ -3871,6 +3872,7 @@ skip_setup: * Drain an array if it contains any elements taking the node lock only if * necessary. Note that the node listlock also protects the array_cache * if drain_array() is used on the shared array. + * Should be called with slab_mutex to prevent from freeing shared array. */ static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n, struct array_cache *ac, int force, int node) diff --git a/mm/slab_common.c b/mm/slab_common.c index a65dad7..5bed565 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -755,7 +755,11 @@ int kmem_cache_shrink(struct kmem_cache *cachep) get_online_cpus(); get_online_mems(); kasan_cache_shrink(cachep); + + mutex_lock(&slab_mutex); ret = __kmem_cache_shrink(cachep, false); + mutex_unlock(&slab_mutex); + put_online_mems(); put_online_cpus(); return ret; -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f175.google.com (mail-pf0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id 07E586B0260 for ; Mon, 28 Mar 2016 01:27:19 -0400 (EDT) Received: by mail-pf0-f175.google.com with SMTP id x3so128944894pfb.1 for ; Sun, 27 Mar 2016 22:27:19 -0700 (PDT) Received: from mail-pa0-x229.google.com (mail-pa0-x229.google.com. [2607:f8b0:400e:c03::229]) by mx.google.com with ESMTPS id n62si12756293pfi.139.2016.03.27.22.27.18 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:18 -0700 (PDT) Received: by mail-pa0-x229.google.com with SMTP id td3so89740101pab.2 for ; Sun, 27 Mar 2016 22:27:18 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again Date: Mon, 28 Mar 2016 14:26:52 +0900 Message-Id: <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' because it causes a problem on m68k which has many node but !CONFIG_NUMA. In this case, although alien cache isn't used at all but to cope with some initialization path, garbage value is used and that is BAD_ALIEN_MAGIC. Now, this patch set use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. Signed-off-by: Joonsoo Kim --- mm/slab.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 043606a..a5a205b 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -421,8 +421,6 @@ static struct kmem_cache kmem_cache_boot = { .name = "kmem_cache", }; -#define BAD_ALIEN_MAGIC 0x01020304ul - static DEFINE_PER_CPU(struct delayed_work, slab_reap_work); static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep) @@ -637,7 +635,7 @@ static int transfer_objects(struct array_cache *to, static inline struct alien_cache **alloc_alien_cache(int node, int limit, gfp_t gfp) { - return (struct alien_cache **)BAD_ALIEN_MAGIC; + return NULL; } static inline void free_alien_cache(struct alien_cache **ac_ptr) @@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void) sizeof(struct rcu_head)); kmem_cache = &kmem_cache_boot; - if (num_possible_nodes() == 1) + if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1) use_alien_caches = 0; for (i = 0; i < NUM_INIT_LISTS; i++) -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id 22AA86B0261 for ; Mon, 28 Mar 2016 01:27:22 -0400 (EDT) Received: by mail-pa0-f42.google.com with SMTP id td3so89740987pab.2 for ; Sun, 27 Mar 2016 22:27:22 -0700 (PDT) Received: from mail-pf0-x22e.google.com (mail-pf0-x22e.google.com. [2607:f8b0:400e:c00::22e]) by mx.google.com with ESMTPS id zo2si17526599pac.221.2016.03.27.22.27.21 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:21 -0700 (PDT) Received: by mail-pf0-x22e.google.com with SMTP id 4so129069542pfd.0 for ; Sun, 27 Mar 2016 22:27:21 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 03/11] mm/slab: drain the free slab as much as possible Date: Mon, 28 Mar 2016 14:26:53 +0900 Message-Id: <1459142821-20303-4-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim slabs_tofree() implies freeing all free slab. We can do it with just providing INT_MAX. Signed-off-by: Joonsoo Kim --- mm/slab.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index a5a205b..ba2eacf 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -888,12 +888,6 @@ static int init_cache_node_node(int node) return 0; } -static inline int slabs_tofree(struct kmem_cache *cachep, - struct kmem_cache_node *n) -{ - return (n->free_objects + cachep->num - 1) / cachep->num; -} - static void cpuup_canceled(long cpu) { struct kmem_cache *cachep; @@ -958,7 +952,7 @@ free_slab: n = get_node(cachep, node); if (!n) continue; - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); } } @@ -1110,7 +1104,7 @@ static int __meminit drain_cache_node_node(int node) if (!n) continue; - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); if (!list_empty(&n->slabs_full) || !list_empty(&n->slabs_partial)) { @@ -2280,7 +2274,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate) check_irq_on(); for_each_kmem_cache_node(cachep, node, n) { - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); ret += !list_empty(&n->slabs_full) || !list_empty(&n->slabs_partial); -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) by kanga.kvack.org (Postfix) with ESMTP id 29BB56B0262 for ; Mon, 28 Mar 2016 01:27:25 -0400 (EDT) Received: by mail-pa0-f46.google.com with SMTP id fe3so90065433pab.1 for ; Sun, 27 Mar 2016 22:27:25 -0700 (PDT) Received: from mail-pa0-x22c.google.com (mail-pa0-x22c.google.com. [2607:f8b0:400e:c03::22c]) by mx.google.com with ESMTPS id d82si10735402pfj.52.2016.03.27.22.27.24 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:24 -0700 (PDT) Received: by mail-pa0-x22c.google.com with SMTP id zm5so4455458pac.0 for ; Sun, 27 Mar 2016 22:27:24 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code Date: Mon, 28 Mar 2016 14:26:54 +0900 Message-Id: <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim It can be reused on other place, so factor out it. Following patch will use it. Signed-off-by: Joonsoo Kim --- mm/slab.c | 68 ++++++++++++++++++++++++++++++++++++--------------------------- 1 file changed, 39 insertions(+), 29 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index ba2eacf..569d7db 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -841,6 +841,40 @@ static inline gfp_t gfp_exact_node(gfp_t flags) } #endif +static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp) +{ + struct kmem_cache_node *n; + + /* + * Set up the kmem_cache_node for cpu before we can + * begin anything. Make sure some other cpu on this + * node has not already allocated this + */ + n = get_node(cachep, node); + if (n) + return 0; + + n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node); + if (!n) + return -ENOMEM; + + kmem_cache_node_init(n); + n->next_reap = jiffies + REAPTIMEOUT_NODE + + ((unsigned long)cachep) % REAPTIMEOUT_NODE; + + n->free_limit = + (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; + + /* + * The kmem_cache_nodes don't come and go as CPUs + * come and go. slab_mutex is sufficient + * protection here. + */ + cachep->node[node] = n; + + return 0; +} + /* * Allocates and initializes node for a node on each slab cache, used for * either memory or cpu hotplug. If memory is being hot-added, the kmem_cache_node @@ -852,39 +886,15 @@ static inline gfp_t gfp_exact_node(gfp_t flags) */ static int init_cache_node_node(int node) { + int ret; struct kmem_cache *cachep; - struct kmem_cache_node *n; - const size_t memsize = sizeof(struct kmem_cache_node); list_for_each_entry(cachep, &slab_caches, list) { - /* - * Set up the kmem_cache_node for cpu before we can - * begin anything. Make sure some other cpu on this - * node has not already allocated this - */ - n = get_node(cachep, node); - if (!n) { - n = kmalloc_node(memsize, GFP_KERNEL, node); - if (!n) - return -ENOMEM; - kmem_cache_node_init(n); - n->next_reap = jiffies + REAPTIMEOUT_NODE + - ((unsigned long)cachep) % REAPTIMEOUT_NODE; - - /* - * The kmem_cache_nodes don't come and go as CPUs - * come and go. slab_mutex is sufficient - * protection here. - */ - cachep->node[node] = n; - } - - spin_lock_irq(&n->list_lock); - n->free_limit = - (1 + nr_cpus_node(node)) * - cachep->batchcount + cachep->num; - spin_unlock_irq(&n->list_lock); + ret = init_cache_node(cachep, node, GFP_KERNEL); + if (ret) + return ret; } + return 0; } -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f169.google.com (mail-pf0-f169.google.com [209.85.192.169]) by kanga.kvack.org (Postfix) with ESMTP id 1831C6B0264 for ; Mon, 28 Mar 2016 01:27:28 -0400 (EDT) Received: by mail-pf0-f169.google.com with SMTP id x3so128947436pfb.1 for ; Sun, 27 Mar 2016 22:27:28 -0700 (PDT) Received: from mail-pa0-x22e.google.com (mail-pa0-x22e.google.com. [2607:f8b0:400e:c03::22e]) by mx.google.com with ESMTPS id q195si12765221pfq.247.2016.03.27.22.27.27 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:27 -0700 (PDT) Received: by mail-pa0-x22e.google.com with SMTP id tt10so90304536pab.3 for ; Sun, 27 Mar 2016 22:27:27 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup Date: Mon, 28 Mar 2016 14:26:55 +0900 Message-Id: <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim There are mostly same code for setting up kmem_cache_node either in cpuup_prepare() or alloc_kmem_cache_node(). Factor out and clean-up them. Signed-off-by: Joonsoo Kim --- mm/slab.c | 167 +++++++++++++++++++++++++------------------------------------- 1 file changed, 67 insertions(+), 100 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 569d7db..b96f381 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -898,6 +898,62 @@ static int init_cache_node_node(int node) return 0; } +static int setup_kmem_cache_node(struct kmem_cache *cachep, + int node, gfp_t gfp, bool force_change) +{ + int ret = -ENOMEM; + struct kmem_cache_node *n; + struct array_cache *old_shared = NULL; + struct array_cache *new_shared = NULL; + struct alien_cache **new_alien = NULL; + LIST_HEAD(list); + + if (use_alien_caches) { + new_alien = alloc_alien_cache(node, cachep->limit, gfp); + if (!new_alien) + goto fail; + } + + if (cachep->shared) { + new_shared = alloc_arraycache(node, + cachep->shared * cachep->batchcount, 0xbaadf00d, gfp); + if (!new_shared) + goto fail; + } + + ret = init_cache_node(cachep, node, gfp); + if (ret) + goto fail; + + n = get_node(cachep, node); + spin_lock_irq(&n->list_lock); + if (n->shared) { + free_block(cachep, n->shared->entry, + n->shared->avail, node, &list); + } + + if (!n->shared || force_change) { + old_shared = n->shared; + n->shared = new_shared; + new_shared = NULL; + } + + if (!n->alien) { + n->alien = new_alien; + new_alien = NULL; + } + + spin_unlock_irq(&n->list_lock); + slabs_destroy(cachep, &list); + +fail: + kfree(old_shared); + kfree(new_shared); + free_alien_cache(new_alien); + + return ret; +} + static void cpuup_canceled(long cpu) { struct kmem_cache *cachep; @@ -969,7 +1025,6 @@ free_slab: static int cpuup_prepare(long cpu) { struct kmem_cache *cachep; - struct kmem_cache_node *n = NULL; int node = cpu_to_mem(cpu); int err; @@ -988,44 +1043,9 @@ static int cpuup_prepare(long cpu) * array caches */ list_for_each_entry(cachep, &slab_caches, list) { - struct array_cache *shared = NULL; - struct alien_cache **alien = NULL; - - if (cachep->shared) { - shared = alloc_arraycache(node, - cachep->shared * cachep->batchcount, - 0xbaadf00d, GFP_KERNEL); - if (!shared) - goto bad; - } - if (use_alien_caches) { - alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL); - if (!alien) { - kfree(shared); - goto bad; - } - } - n = get_node(cachep, node); - BUG_ON(!n); - - spin_lock_irq(&n->list_lock); - if (!n->shared) { - /* - * We are serialised from CPU_DEAD or - * CPU_UP_CANCELLED by the cpucontrol lock - */ - n->shared = shared; - shared = NULL; - } -#ifdef CONFIG_NUMA - if (!n->alien) { - n->alien = alien; - alien = NULL; - } -#endif - spin_unlock_irq(&n->list_lock); - kfree(shared); - free_alien_cache(alien); + err = setup_kmem_cache_node(cachep, node, GFP_KERNEL, false); + if (err) + goto bad; } return 0; @@ -3652,72 +3672,19 @@ EXPORT_SYMBOL(kfree); /* * This initializes kmem_cache_node or resizes various caches for all nodes. */ -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp) +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp) { + int ret; int node; struct kmem_cache_node *n; - struct array_cache *new_shared; - struct alien_cache **new_alien = NULL; for_each_online_node(node) { - - if (use_alien_caches) { - new_alien = alloc_alien_cache(node, cachep->limit, gfp); - if (!new_alien) - goto fail; - } - - new_shared = NULL; - if (cachep->shared) { - new_shared = alloc_arraycache(node, - cachep->shared*cachep->batchcount, - 0xbaadf00d, gfp); - if (!new_shared) { - free_alien_cache(new_alien); - goto fail; - } - } - - n = get_node(cachep, node); - if (n) { - struct array_cache *shared = n->shared; - LIST_HEAD(list); - - spin_lock_irq(&n->list_lock); - - if (shared) - free_block(cachep, shared->entry, - shared->avail, node, &list); - - n->shared = new_shared; - if (!n->alien) { - n->alien = new_alien; - new_alien = NULL; - } - n->free_limit = (1 + nr_cpus_node(node)) * - cachep->batchcount + cachep->num; - spin_unlock_irq(&n->list_lock); - slabs_destroy(cachep, &list); - kfree(shared); - free_alien_cache(new_alien); - continue; - } - n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node); - if (!n) { - free_alien_cache(new_alien); - kfree(new_shared); + ret = setup_kmem_cache_node(cachep, node, gfp, true); + if (ret) goto fail; - } - kmem_cache_node_init(n); - n->next_reap = jiffies + REAPTIMEOUT_NODE + - ((unsigned long)cachep) % REAPTIMEOUT_NODE; - n->shared = new_shared; - n->alien = new_alien; - n->free_limit = (1 + nr_cpus_node(node)) * - cachep->batchcount + cachep->num; - cachep->node[node] = n; } + return 0; fail: @@ -3759,7 +3726,7 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, cachep->shared = shared; if (!prev) - goto alloc_node; + goto setup_node; for_each_online_cpu(cpu) { LIST_HEAD(list); @@ -3776,8 +3743,8 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, } free_percpu(prev); -alloc_node: - return alloc_kmem_cache_node(cachep, gfp); +setup_node: + return setup_kmem_cache_node_node(cachep, gfp); } static int do_tune_cpucache(struct kmem_cache *cachep, int limit, -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f48.google.com (mail-pa0-f48.google.com [209.85.220.48]) by kanga.kvack.org (Postfix) with ESMTP id 4F6C56B0265 for ; Mon, 28 Mar 2016 01:27:31 -0400 (EDT) Received: by mail-pa0-f48.google.com with SMTP id fe3so90066997pab.1 for ; Sun, 27 Mar 2016 22:27:31 -0700 (PDT) Received: from mail-pf0-x22d.google.com (mail-pf0-x22d.google.com. [2607:f8b0:400e:c00::22d]) by mx.google.com with ESMTPS id oq6si157194pab.84.2016.03.27.22.27.30 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:30 -0700 (PDT) Received: by mail-pf0-x22d.google.com with SMTP id u190so129232592pfb.3 for ; Sun, 27 Mar 2016 22:27:30 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit Date: Mon, 28 Mar 2016 14:26:56 +0900 Message-Id: <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim Currently, determination to free a slab is done whenever free object is put into the slab. This has a problem that free slabs are not freed even if we have free slabs and have more free_objects than free_limit when processed slab isn't a free slab. This would cause to keep too much memory in the slab subsystem. This patch try to fix it by checking number of free object after all free work is done. If there is free slab at that time, we can free it so we keep free slab as minimal as possible. Signed-off-by: Joonsoo Kim --- mm/slab.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index b96f381..df11757 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3258,6 +3258,9 @@ static void free_block(struct kmem_cache *cachep, void **objpp, { int i; struct kmem_cache_node *n = get_node(cachep, node); + struct page *page; + + n->free_objects += nr_objects; for (i = 0; i < nr_objects; i++) { void *objp; @@ -3270,17 +3273,11 @@ static void free_block(struct kmem_cache *cachep, void **objpp, check_spinlock_acquired_node(cachep, node); slab_put_obj(cachep, page, objp); STATS_DEC_ACTIVE(cachep); - n->free_objects++; /* fixup slab chains */ - if (page->active == 0) { - if (n->free_objects > n->free_limit) { - n->free_objects -= cachep->num; - list_add_tail(&page->lru, list); - } else { - list_add(&page->lru, &n->slabs_free); - } - } else { + if (page->active == 0) + list_add(&page->lru, &n->slabs_free); + else { /* Unconditionally move a slab to the end of the * partial list on free - maximum time for the * other objects to be freed, too. @@ -3288,6 +3285,14 @@ static void free_block(struct kmem_cache *cachep, void **objpp, list_add_tail(&page->lru, &n->slabs_partial); } } + + while (n->free_objects > n->free_limit && !list_empty(&n->slabs_free)) { + n->free_objects -= cachep->num; + + page = list_last_entry(&n->slabs_free, struct page, lru); + list_del(&page->lru); + list_add(&page->lru, list); + } } static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac) -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f171.google.com (mail-pf0-f171.google.com [209.85.192.171]) by kanga.kvack.org (Postfix) with ESMTP id 921A96B0268 for ; Mon, 28 Mar 2016 01:27:34 -0400 (EDT) Received: by mail-pf0-f171.google.com with SMTP id x3so128949218pfb.1 for ; Sun, 27 Mar 2016 22:27:34 -0700 (PDT) Received: from mail-pa0-x22f.google.com (mail-pa0-x22f.google.com. [2607:f8b0:400e:c03::22f]) by mx.google.com with ESMTPS id z23si17846386pfi.42.2016.03.27.22.27.33 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:33 -0700 (PDT) Received: by mail-pa0-x22f.google.com with SMTP id td3so89743958pab.2 for ; Sun, 27 Mar 2016 22:27:33 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 07/11] mm/slab: racy access/modify the slab color Date: Mon, 28 Mar 2016 14:26:57 +0900 Message-Id: <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim Slab color isn't needed to be changed strictly. Because locking for changing slab color could cause more lock contention so this patch implements racy access/modify the slab color. This is a preparation step to implement lockless allocation path when there is no free objects in the kmem_cache. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 It shows that contention is reduced for object size >= 1024 and performance increases by roughly 15%. Signed-off-by: Joonsoo Kim --- mm/slab.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index df11757..52fc5e3 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep, } local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); - /* Take the node list lock to change the colour_next on this node */ check_irq_off(); - n = get_node(cachep, nodeid); - spin_lock(&n->list_lock); - - /* Get colour for the slab, and cal the next value. */ - offset = n->colour_next; - n->colour_next++; - if (n->colour_next >= cachep->colour) - n->colour_next = 0; - spin_unlock(&n->list_lock); - - offset *= cachep->colour_off; - if (gfpflags_allow_blocking(local_flags)) local_irq_enable(); @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep, if (!page) goto failed; + n = get_node(cachep, nodeid); + + /* Get colour for the slab, and cal the next value. */ + n->colour_next++; + if (n->colour_next >= cachep->colour) + n->colour_next = 0; + + offset = n->colour_next; + if (offset >= cachep->colour) + offset = 0; + + offset *= cachep->colour_off; + /* Get slab management. */ freelist = alloc_slabmgmt(cachep, page, offset, local_flags & ~GFP_CONSTRAINT_MASK, nodeid); -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f173.google.com (mail-pf0-f173.google.com [209.85.192.173]) by kanga.kvack.org (Postfix) with ESMTP id BF14C6B026A for ; Mon, 28 Mar 2016 01:27:37 -0400 (EDT) Received: by mail-pf0-f173.google.com with SMTP id n5so128686211pfn.2 for ; Sun, 27 Mar 2016 22:27:37 -0700 (PDT) Received: from mail-pf0-x22f.google.com (mail-pf0-x22f.google.com. [2607:f8b0:400e:c00::22f]) by mx.google.com with ESMTPS id 73si12745595pfp.195.2016.03.27.22.27.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:37 -0700 (PDT) Received: by mail-pf0-x22f.google.com with SMTP id x3so128949844pfb.1 for ; Sun, 27 Mar 2016 22:27:36 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node Date: Mon, 28 Mar 2016 14:26:58 +0900 Message-Id: <1459142821-20303-9-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim Currently, cache_grow() assumes that allocated page's nodeid would be same with parameter nodeid which is used for allocation request. If we discard this assumption, we can handle fallback_alloc() case gracefully. So, this patch makes cache_grow() handle the page allocated on arbitrary node and clean-up relevant code. Signed-off-by: Joonsoo Kim --- mm/slab.c | 60 +++++++++++++++++++++--------------------------------------- 1 file changed, 21 insertions(+), 39 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 52fc5e3..ce8ed65 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2518,13 +2518,14 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page, * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. */ -static int cache_grow(struct kmem_cache *cachep, - gfp_t flags, int nodeid, struct page *page) +static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid) { void *freelist; size_t offset; gfp_t local_flags; + int page_node; struct kmem_cache_node *n; + struct page *page; /* * Be lazy and only check for valid flags here, keeping it out of the @@ -2552,12 +2553,12 @@ static int cache_grow(struct kmem_cache *cachep, * Get mem for the objs. Attempt to allocate a physical page from * 'nodeid'. */ - if (!page) - page = kmem_getpages(cachep, local_flags, nodeid); + page = kmem_getpages(cachep, local_flags, nodeid); if (!page) goto failed; - n = get_node(cachep, nodeid); + page_node = page_to_nid(page); + n = get_node(cachep, page_node); /* Get colour for the slab, and cal the next value. */ n->colour_next++; @@ -2572,7 +2573,7 @@ static int cache_grow(struct kmem_cache *cachep, /* Get slab management. */ freelist = alloc_slabmgmt(cachep, page, offset, - local_flags & ~GFP_CONSTRAINT_MASK, nodeid); + local_flags & ~GFP_CONSTRAINT_MASK, page_node); if (OFF_SLAB(cachep) && !freelist) goto opps1; @@ -2591,13 +2592,13 @@ static int cache_grow(struct kmem_cache *cachep, STATS_INC_GROWN(cachep); n->free_objects += cachep->num; spin_unlock(&n->list_lock); - return 1; + return page_node; opps1: kmem_freepages(cachep, page); failed: if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); - return 0; + return -1; } #if DEBUG @@ -2878,14 +2879,14 @@ alloc_done: return obj; } - x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), node); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); node = numa_mem_id(); /* no objects in sight? abort */ - if (!x && ac->avail == 0) + if (x < 0 && ac->avail == 0) return NULL; if (!ac->avail) /* objects refilled by interrupt? */ @@ -3014,7 +3015,6 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) { struct zonelist *zonelist; - gfp_t local_flags; struct zoneref *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); @@ -3025,8 +3025,6 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) if (flags & __GFP_THISNODE) return NULL; - local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); - retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); zonelist = node_zonelist(mempolicy_slab_node(), flags); @@ -3056,33 +3054,17 @@ retry: * We may trigger various forms of reclaim on the allowed * set and go into memory reserves if necessary. */ - struct page *page; + nid = cache_grow(cache, flags, numa_mem_id()); + if (nid >= 0) { + obj = ____cache_alloc_node(cache, + gfp_exact_node(flags), nid); - if (gfpflags_allow_blocking(local_flags)) - local_irq_enable(); - kmem_flagcheck(cache, flags); - page = kmem_getpages(cache, local_flags, numa_mem_id()); - if (gfpflags_allow_blocking(local_flags)) - local_irq_disable(); - if (page) { /* - * Insert into the appropriate per node queues + * Another processor may allocate the objects in + * the slab since we are not holding any locks. */ - nid = page_to_nid(page); - if (cache_grow(cache, flags, nid, page)) { - obj = ____cache_alloc_node(cache, - gfp_exact_node(flags), nid); - if (!obj) - /* - * Another processor may allocate the - * objects in the slab since we are - * not holding any locks. - */ - goto retry; - } else { - /* cache_grow already freed obj */ - obj = NULL; - } + if (!obj) + goto retry; } } @@ -3133,8 +3115,8 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL); - if (x) + x = cache_grow(cachep, gfp_exact_node(flags), nodeid); + if (x >= 0) goto retry; return fallback_alloc(cachep, flags); -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f49.google.com (mail-pa0-f49.google.com [209.85.220.49]) by kanga.kvack.org (Postfix) with ESMTP id E2312828DF for ; Mon, 28 Mar 2016 01:27:40 -0400 (EDT) Received: by mail-pa0-f49.google.com with SMTP id fe3so90069562pab.1 for ; Sun, 27 Mar 2016 22:27:40 -0700 (PDT) Received: from mail-pa0-x236.google.com (mail-pa0-x236.google.com. [2607:f8b0:400e:c03::236]) by mx.google.com with ESMTPS id hj1si25980199pac.235.2016.03.27.22.27.40 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:40 -0700 (PDT) Received: by mail-pa0-x236.google.com with SMTP id fe3so90069428pab.1 for ; Sun, 27 Mar 2016 22:27:40 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 09/11] mm/slab: separate cache_grow() to two parts Date: Mon, 28 Mar 2016 14:26:59 +0900 Message-Id: <1459142821-20303-10-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim This is a preparation step to implement lockless allocation path when there is no free objects in kmem_cache. What we'd like to do here is to refill cpu cache without holding a node lock. To accomplish this purpose, refill should be done after new slab allocation but before attaching the slab to the management list. So, this patch separates cache_grow() to two parts, allocation and attaching to the list in order to add some code inbetween them in the following patch. Signed-off-by: Joonsoo Kim --- mm/slab.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 52 insertions(+), 22 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index ce8ed65..401e60c 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -213,6 +213,11 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list); static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp); static void cache_reap(struct work_struct *unused); +static inline void fixup_objfreelist_debug(struct kmem_cache *cachep, + void **list); +static inline void fixup_slab_list(struct kmem_cache *cachep, + struct kmem_cache_node *n, struct page *page, + void **list); static int slab_early_init = 1; #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node)) @@ -1796,7 +1801,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep, /* * Needed to avoid possible looping condition - * in cache_grow() + * in cache_grow_begin() */ if (OFF_SLAB(freelist_cache)) continue; @@ -2518,7 +2523,8 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page, * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. */ -static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid) +static struct page *cache_grow_begin(struct kmem_cache *cachep, + gfp_t flags, int nodeid) { void *freelist; size_t offset; @@ -2584,21 +2590,40 @@ static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid) if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); - check_irq_off(); - spin_lock(&n->list_lock); - /* Make slab active. */ - list_add_tail(&page->lru, &(n->slabs_free)); - STATS_INC_GROWN(cachep); - n->free_objects += cachep->num; - spin_unlock(&n->list_lock); - return page_node; + return page; + opps1: kmem_freepages(cachep, page); failed: if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); - return -1; + return NULL; +} + +static void cache_grow_end(struct kmem_cache *cachep, struct page *page) +{ + struct kmem_cache_node *n; + void *list = NULL; + + check_irq_off(); + + if (!page) + return; + + INIT_LIST_HEAD(&page->lru); + n = get_node(cachep, page_to_nid(page)); + + spin_lock(&n->list_lock); + if (!page->active) + list_add_tail(&page->lru, &(n->slabs_free)); + else + fixup_slab_list(cachep, n, page, &list); + STATS_INC_GROWN(cachep); + n->free_objects += cachep->num - page->active; + spin_unlock(&n->list_lock); + + fixup_objfreelist_debug(cachep, &list); } #if DEBUG @@ -2809,6 +2834,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) struct array_cache *ac; int node; void *list = NULL; + struct page *page; check_irq_off(); node = numa_mem_id(); @@ -2836,7 +2862,6 @@ retry: } while (batchcount > 0) { - struct page *page; /* Get slab alloc is to come from. */ page = get_first_slab(n, false); if (!page) @@ -2869,8 +2894,6 @@ alloc_done: fixup_objfreelist_debug(cachep, &list); if (unlikely(!ac->avail)) { - int x; - /* Check if we can use obj in pfmemalloc slab */ if (sk_memalloc_socks()) { void *obj = cache_alloc_pfmemalloc(cachep, n, flags); @@ -2879,14 +2902,18 @@ alloc_done: return obj; } - x = cache_grow(cachep, gfp_exact_node(flags), node); + page = cache_grow_begin(cachep, gfp_exact_node(flags), node); + cache_grow_end(cachep, page); - /* cache_grow can reenable interrupts, then ac could change. */ + /* + * cache_grow_begin() can reenable interrupts, + * then ac could change. + */ ac = cpu_cache_get(cachep); node = numa_mem_id(); /* no objects in sight? abort */ - if (x < 0 && ac->avail == 0) + if (!page && ac->avail == 0) return NULL; if (!ac->avail) /* objects refilled by interrupt? */ @@ -3019,6 +3046,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; + struct page *page; int nid; unsigned int cpuset_mems_cookie; @@ -3054,8 +3082,10 @@ retry: * We may trigger various forms of reclaim on the allowed * set and go into memory reserves if necessary. */ - nid = cache_grow(cache, flags, numa_mem_id()); - if (nid >= 0) { + page = cache_grow_begin(cache, flags, numa_mem_id()); + cache_grow_end(cache, page); + if (page) { + nid = page_to_nid(page); obj = ____cache_alloc_node(cache, gfp_exact_node(flags), nid); @@ -3083,7 +3113,6 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, struct kmem_cache_node *n; void *obj; void *list = NULL; - int x; VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES); n = get_node(cachep, nodeid); @@ -3115,8 +3144,9 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, gfp_exact_node(flags), nodeid); - if (x >= 0) + page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid); + cache_grow_end(cachep, page); + if (page) goto retry; return fallback_alloc(cachep, flags); -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f180.google.com (mail-pf0-f180.google.com [209.85.192.180]) by kanga.kvack.org (Postfix) with ESMTP id AB039828DF for ; Mon, 28 Mar 2016 01:27:44 -0400 (EDT) Received: by mail-pf0-f180.google.com with SMTP id x3so128952048pfb.1 for ; Sun, 27 Mar 2016 22:27:44 -0700 (PDT) Received: from mail-pf0-x22d.google.com (mail-pf0-x22d.google.com. [2607:f8b0:400e:c00::22d]) by mx.google.com with ESMTPS id 10si12743000pfk.172.2016.03.27.22.27.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:43 -0700 (PDT) Received: by mail-pf0-x22d.google.com with SMTP id u190so129236393pfb.3 for ; Sun, 27 Mar 2016 22:27:43 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock Date: Mon, 28 Mar 2016 14:27:00 +0900 Message-Id: <1459142821-20303-11-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim Until now, cache growing makes a free slab on node's slab list and then we can allocate free objects from it. This necessarily requires to hold a node lock which is very contended. If we refill cpu cache before attaching it to node's slab list, we can avoid holding a node lock as much as possible because this newly allocated slab is only visible to the current task. This will reduce lock contention. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 * After Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 It shows that contention is reduced for all the object sizes and performance increases by 30 ~ 40%. Signed-off-by: Joonsoo Kim --- mm/slab.c | 68 +++++++++++++++++++++++++++++++++------------------------------ 1 file changed, 36 insertions(+), 32 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 401e60c..029d6b3 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2827,6 +2827,30 @@ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep, return obj; } +/* + * Slab list should be fixed up by fixup_slab_list() for existing slab + * or cache_grow_end() for new slab + */ +static __always_inline int alloc_block(struct kmem_cache *cachep, + struct array_cache *ac, struct page *page, int batchcount) +{ + /* + * There must be at least one object available for + * allocation. + */ + BUG_ON(page->active >= cachep->num); + + while (page->active < cachep->num && batchcount--) { + STATS_INC_ALLOCED(cachep); + STATS_INC_ACTIVE(cachep); + STATS_SET_HIGH(cachep); + + ac->entry[ac->avail++] = slab_get_obj(cachep, page); + } + + return batchcount; +} + static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) { int batchcount; @@ -2839,7 +2863,6 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) check_irq_off(); node = numa_mem_id(); -retry: ac = cpu_cache_get(cachep); batchcount = ac->batchcount; if (!ac->touched && batchcount > BATCHREFILL_LIMIT) { @@ -2869,21 +2892,7 @@ retry: check_spinlock_acquired(cachep); - /* - * The slab was either on partial or free list so - * there must be at least one object available for - * allocation. - */ - BUG_ON(page->active >= cachep->num); - - while (page->active < cachep->num && batchcount--) { - STATS_INC_ALLOCED(cachep); - STATS_INC_ACTIVE(cachep); - STATS_SET_HIGH(cachep); - - ac->entry[ac->avail++] = slab_get_obj(cachep, page); - } - + batchcount = alloc_block(cachep, ac, page, batchcount); fixup_slab_list(cachep, n, page, &list); } @@ -2903,21 +2912,18 @@ alloc_done: } page = cache_grow_begin(cachep, gfp_exact_node(flags), node); - cache_grow_end(cachep, page); /* * cache_grow_begin() can reenable interrupts, * then ac could change. */ ac = cpu_cache_get(cachep); - node = numa_mem_id(); + if (!ac->avail && page) + alloc_block(cachep, ac, page, batchcount); + cache_grow_end(cachep, page); - /* no objects in sight? abort */ - if (!page && ac->avail == 0) + if (!ac->avail) return NULL; - - if (!ac->avail) /* objects refilled by interrupt? */ - goto retry; } ac->touched = 1; @@ -3111,14 +3117,13 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, { struct page *page; struct kmem_cache_node *n; - void *obj; + void *obj = NULL; void *list = NULL; VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES); n = get_node(cachep, nodeid); BUG_ON(!n); -retry: check_irq_off(); spin_lock(&n->list_lock); page = get_first_slab(n, false); @@ -3140,19 +3145,18 @@ retry: spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); - goto done; + return obj; must_grow: spin_unlock(&n->list_lock); page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid); + if (page) { + /* This slab isn't counted yet so don't update free_objects */ + obj = slab_get_obj(cachep, page); + } cache_grow_end(cachep, page); - if (page) - goto retry; - return fallback_alloc(cachep, flags); - -done: - return obj; + return obj ? obj : fallback_alloc(cachep, flags); } static __always_inline void * -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id CFAA8828DF for ; Mon, 28 Mar 2016 01:27:47 -0400 (EDT) Received: by mail-pa0-f50.google.com with SMTP id fe3so90071167pab.1 for ; Sun, 27 Mar 2016 22:27:47 -0700 (PDT) Received: from mail-pa0-x233.google.com (mail-pa0-x233.google.com. [2607:f8b0:400e:c03::233]) by mx.google.com with ESMTPS id n79si18823565pfi.149.2016.03.27.22.27.47 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 27 Mar 2016 22:27:47 -0700 (PDT) Received: by mail-pa0-x233.google.com with SMTP id tt10so90309766pab.3 for ; Sun, 27 Mar 2016 22:27:47 -0700 (PDT) From: js1304@gmail.com Subject: [PATCH 11/11] mm/slab: lockless decision to grow cache Date: Mon, 28 Mar 2016 14:27:01 +0900 Message-Id: <1459142821-20303-12-git-send-email-iamjoonsoo.kim@lge.com> In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Joonsoo Kim To check whther free objects exist or not precisely, we need to grab a lock. But, accuracy isn't that important because race window would be even small and if there is too much free object, cache reaper would reap it. So, this patch makes the check for free object exisistence not to hold a lock. This will reduce lock contention in heavily allocation case. Note that until now, n->shared can be freed during the processing by writing slabinfo, but, with some trick in this patch, we can access it freely within interrupt disabled period. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that allocation performance decreases for the object size up to 128 and it may be due to extra checks in cache_alloc_refill(). But, with considering improvement of free performance, net result looks the same. Result for other size class looks very promising, roughly, 50% performance improvement. Signed-off-by: Joonsoo Kim --- mm/slab.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 029d6b3..b70aabf 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -951,6 +951,15 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep, spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); + /* + * To protect lockless access to n->shared during irq disabled context. + * If n->shared isn't NULL in irq disabled context, accessing to it is + * guaranteed to be valid until irq is re-enabled, because it will be + * freed after kick_all_cpus_sync(). + */ + if (force_change) + kick_all_cpus_sync(); + fail: kfree(old_shared); kfree(new_shared); @@ -2855,7 +2864,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) { int batchcount; struct kmem_cache_node *n; - struct array_cache *ac; + struct array_cache *ac, *shared; int node; void *list = NULL; struct page *page; @@ -2876,11 +2885,16 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) n = get_node(cachep, node); BUG_ON(ac->avail > 0 || !n); + shared = READ_ONCE(n->shared); + if (!n->free_objects && (!shared || !shared->avail)) + goto direct_grow; + spin_lock(&n->list_lock); + shared = READ_ONCE(n->shared); /* See if we can refill from the shared array */ - if (n->shared && transfer_objects(ac, n->shared, batchcount)) { - n->shared->touched = 1; + if (shared && transfer_objects(ac, shared, batchcount)) { + shared->touched = 1; goto alloc_done; } @@ -2902,6 +2916,7 @@ alloc_done: spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); +direct_grow: if (unlikely(!ac->avail)) { /* Check if we can use obj in pfmemalloc slab */ if (sk_memalloc_socks()) { -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f171.google.com (mail-io0-f171.google.com [209.85.223.171]) by kanga.kvack.org (Postfix) with ESMTP id A799C6B007E for ; Mon, 28 Mar 2016 04:58:39 -0400 (EDT) Received: by mail-io0-f171.google.com with SMTP id 124so168890865iov.3 for ; Mon, 28 Mar 2016 01:58:39 -0700 (PDT) Received: from mail-ig0-x241.google.com (mail-ig0-x241.google.com. [2607:f8b0:4001:c05::241]) by mx.google.com with ESMTPS id sa10si6895046igb.8.2016.03.28.01.58.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 01:58:38 -0700 (PDT) Received: by mail-ig0-x241.google.com with SMTP id ww10so8865330igb.2 for ; Mon, 28 Mar 2016 01:58:38 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> Date: Mon, 28 Mar 2016 10:58:38 +0200 Message-ID: Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again From: Geert Uytterhoeven Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: JoonSoo Kim Cc: Andrew Morton , Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , Linux MM , "linux-kernel@vger.kernel.org" , Joonsoo Kim Hi Jonsoo, On Mon, Mar 28, 2016 at 7:26 AM, wrote: > From: Joonsoo Kim > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > because it causes a problem on m68k which has many node > but !CONFIG_NUMA. In this case, although alien cache isn't used > at all but to cope with some initialization path, garbage value > is used and that is BAD_ALIEN_MAGIC. Now, this patch set > use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization > path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. > > Signed-off-by: Joonsoo Kim I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous version that was reverted, so Tested-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f41.google.com (mail-pa0-f41.google.com [209.85.220.41]) by kanga.kvack.org (Postfix) with ESMTP id 27A706B007E for ; Mon, 28 Mar 2016 17:19:13 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fe3so106718940pab.1 for ; Mon, 28 Mar 2016 14:19:13 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id q195si17932726pfq.247.2016.03.28.14.19.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 14:19:12 -0700 (PDT) Date: Mon, 28 Mar 2016 14:19:11 -0700 From: Andrew Morton Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again Message-Id: <20160328141911.3048ab8d406b86a6e5b9f910@linux-foundation.org> In-Reply-To: <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On Mon, 28 Mar 2016 14:26:52 +0900 js1304@gmail.com wrote: > From: Joonsoo Kim > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > because it causes a problem on m68k which has many node > but !CONFIG_NUMA. Whaaa? How is that even possible? I'd have thought that everything would break at compile time (at least) with such a setup. > In this case, although alien cache isn't used > at all but to cope with some initialization path, garbage value > is used and that is BAD_ALIEN_MAGIC. Now, this patch set > use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization > path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. > > ... > > @@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void) > sizeof(struct rcu_head)); > kmem_cache = &kmem_cache_boot; > > - if (num_possible_nodes() == 1) > + if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1) > use_alien_caches = 0; > > for (i = 0; i < NUM_INIT_LISTS; i++) This does look screwy. How can num_possible_nodes() possibly return anything but "1" if CONFIG_NUMA=n. Can we please get a code comment in here to explain things to the poor old reader and to prevent people from trying to "fix" it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f170.google.com (mail-io0-f170.google.com [209.85.223.170]) by kanga.kvack.org (Postfix) with ESMTP id C65116B007E for ; Mon, 28 Mar 2016 20:50:38 -0400 (EDT) Received: by mail-io0-f170.google.com with SMTP id a129so4206716ioe.0 for ; Mon, 28 Mar 2016 17:50:38 -0700 (PDT) Received: from resqmta-ch2-12v.sys.comcast.net (resqmta-ch2-12v.sys.comcast.net. [2001:558:fe21:29:69:252:207:44]) by mx.google.com with ESMTPS id q64si25603379ioe.49.2016.03.28.17.50.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 17:50:38 -0700 (PDT) Date: Mon, 28 Mar 2016 19:50:36 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() In-Reply-To: <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On Mon, 28 Mar 2016, js1304@gmail.com wrote: > Major kmem_cache metadata in slab subsystem is synchronized with > the slab_mutex. In SLAB, if some of them is changed, node's shared > array cache would be freed and re-populated. If __kmem_cache_shrink() > is called at the same time, it will call drain_array() with n->shared > without holding node lock so problem can happen. > > We can fix this small theoretical race condition by holding node lock > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > looks more appropriate solution because stable state would make things > less error-prone and this is not performance critical path. Ummm.. The mutex taking is added to common code. So this will also affect SLUB. The patch needs to consider this. Do we want to force all allocators to run shrinking only when holding the lock? SLUB does not need to hold the mutex. And frankly the mutex is for reconfiguration of metadata which is *not* occurring here. A shrink operation does not do that. Can we figure out a slab specific way of handling synchronization in the strange free/realloc cycle? It seems that taking the node lock is the appropriate level of synchrnonization since the concern is with the contents of a shared cache at that level. There is no change of metadata which would require the mutex. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f176.google.com (mail-ig0-f176.google.com [209.85.213.176]) by kanga.kvack.org (Postfix) with ESMTP id 2F0A76B007E for ; Mon, 28 Mar 2016 20:53:45 -0400 (EDT) Received: by mail-ig0-f176.google.com with SMTP id l20so2813304igf.0 for ; Mon, 28 Mar 2016 17:53:45 -0700 (PDT) Received: from resqmta-po-03v.sys.comcast.net (resqmta-po-03v.sys.comcast.net. [2001:558:fe16:19:96:114:154:162]) by mx.google.com with ESMTPS id l87si889131iod.62.2016.03.28.17.53.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 17:53:43 -0700 (PDT) Date: Mon, 28 Mar 2016 19:53:39 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again In-Reply-To: <20160328141911.3048ab8d406b86a6e5b9f910@linux-foundation.org> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> <20160328141911.3048ab8d406b86a6e5b9f910@linux-foundation.org> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: js1304@gmail.com, Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On Mon, 28 Mar 2016, Andrew Morton wrote: > On Mon, 28 Mar 2016 14:26:52 +0900 js1304@gmail.com wrote: > > > From: Joonsoo Kim > > > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > > because it causes a problem on m68k which has many node > > but !CONFIG_NUMA. > > Whaaa? How is that even possible? I'd have thought that everything > would break at compile time (at least) with such a setup. Yes we have that and the support for this caused numerous issues. Can we stop supporting such a configuration? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f175.google.com (mail-ig0-f175.google.com [209.85.213.175]) by kanga.kvack.org (Postfix) with ESMTP id 479186B025E for ; Mon, 28 Mar 2016 20:54:35 -0400 (EDT) Received: by mail-ig0-f175.google.com with SMTP id m10so2135977igt.1 for ; Mon, 28 Mar 2016 17:54:35 -0700 (PDT) Received: from resqmta-ch2-10v.sys.comcast.net (resqmta-ch2-10v.sys.comcast.net. [2001:558:fe21:29:69:252:207:42]) by mx.google.com with ESMTPS id y82si25625103iod.45.2016.03.28.17.54.34 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 17:54:34 -0700 (PDT) Date: Mon, 28 Mar 2016 19:54:33 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 03/11] mm/slab: drain the free slab as much as possible In-Reply-To: <1459142821-20303-4-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-4-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Acked-by: Christoph Lameter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f171.google.com (mail-io0-f171.google.com [209.85.223.171]) by kanga.kvack.org (Postfix) with ESMTP id 539496B025E for ; Mon, 28 Mar 2016 20:56:18 -0400 (EDT) Received: by mail-io0-f171.google.com with SMTP id a129so4321521ioe.0 for ; Mon, 28 Mar 2016 17:56:18 -0700 (PDT) Received: from resqmta-ch2-09v.sys.comcast.net (resqmta-ch2-09v.sys.comcast.net. [2001:558:fe21:29:69:252:207:41]) by mx.google.com with ESMTPS id w63si25613141iod.140.2016.03.28.17.56.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 17:56:17 -0700 (PDT) Date: Mon, 28 Mar 2016 19:56:15 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code In-Reply-To: <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On Mon, 28 Mar 2016, js1304@gmail.com wrote: > From: Joonsoo Kim > - spin_lock_irq(&n->list_lock); > - n->free_limit = > - (1 + nr_cpus_node(node)) * > - cachep->batchcount + cachep->num; > - spin_unlock_irq(&n->list_lock); > + ret = init_cache_node(cachep, node, GFP_KERNEL); > + if (ret) > + return ret; Drop ret and do a return init_cache_node(...); instead? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f178.google.com (mail-io0-f178.google.com [209.85.223.178]) by kanga.kvack.org (Postfix) with ESMTP id C664E6B025E for ; Mon, 28 Mar 2016 20:58:31 -0400 (EDT) Received: by mail-io0-f178.google.com with SMTP id a129so4367132ioe.0 for ; Mon, 28 Mar 2016 17:58:31 -0700 (PDT) Received: from resqmta-po-12v.sys.comcast.net (resqmta-po-12v.sys.comcast.net. [2001:558:fe16:19:96:114:154:171]) by mx.google.com with ESMTPS id 25si4717822ioj.47.2016.03.28.17.58.31 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 17:58:31 -0700 (PDT) Date: Mon, 28 Mar 2016 19:58:29 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup In-Reply-To: <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On Mon, 28 Mar 2016, js1304@gmail.com wrote: > * This initializes kmem_cache_node or resizes various caches for all nodes. > */ > -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp) > +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp) ... _node_node? Isnt there a better name for it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f170.google.com (mail-io0-f170.google.com [209.85.223.170]) by kanga.kvack.org (Postfix) with ESMTP id 2353D6B025F for ; Mon, 28 Mar 2016 21:03:24 -0400 (EDT) Received: by mail-io0-f170.google.com with SMTP id q128so4325574iof.3 for ; Mon, 28 Mar 2016 18:03:24 -0700 (PDT) Received: from resqmta-po-02v.sys.comcast.net (resqmta-po-02v.sys.comcast.net. [2001:558:fe16:19:96:114:154:161]) by mx.google.com with ESMTPS id o36si25669548ioi.7.2016.03.28.18.03.18 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 18:03:18 -0700 (PDT) Date: Mon, 28 Mar 2016 20:03:16 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit In-Reply-To: <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On Mon, 28 Mar 2016, js1304@gmail.com wrote: > From: Joonsoo Kim > > Currently, determination to free a slab is done whenever free object is > put into the slab. This has a problem that free slabs are not freed > even if we have free slabs and have more free_objects than free_limit There needs to be a better explanation here since I do not get why there is an issue with checking after free if a slab is actually free. > when processed slab isn't a free slab. This would cause to keep > too much memory in the slab subsystem. This patch try to fix it > by checking number of free object after all free work is done. If there > is free slab at that time, we can free it so we keep free slab as minimal > as possible. Ok if we check after free work is done then the number of free slabs may be higher than the limit set and then we free the additional slabs to get down to the limit that was set? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f170.google.com (mail-ig0-f170.google.com [209.85.213.170]) by kanga.kvack.org (Postfix) with ESMTP id F26ED6B025E for ; Mon, 28 Mar 2016 21:05:48 -0400 (EDT) Received: by mail-ig0-f170.google.com with SMTP id m10so2298729igt.1 for ; Mon, 28 Mar 2016 18:05:48 -0700 (PDT) Received: from resqmta-po-10v.sys.comcast.net (resqmta-po-10v.sys.comcast.net. [2001:558:fe16:19:96:114:154:169]) by mx.google.com with ESMTPS id 26si8177007ioi.61.2016.03.28.18.05.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 28 Mar 2016 18:05:43 -0700 (PDT) Date: Mon, 28 Mar 2016 20:05:41 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 07/11] mm/slab: racy access/modify the slab color In-Reply-To: <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On Mon, 28 Mar 2016, js1304@gmail.com wrote: > From: Joonsoo Kim > > Slab color isn't needed to be changed strictly. Because locking > for changing slab color could cause more lock contention so this patch > implements racy access/modify the slab color. This is a preparation step > to implement lockless allocation path when there is no free objects in > the kmem_cache. Acked-by: Christoph Lameter The rest of the description does not relate to this patch and does not actually reflect the improvement of applying this patch. Remove the rest? > Below is the result of concurrent allocation/free in slab allocation > benchmark made by Christoph a long time ago. I make the output simpler. > The number shows cycle count during alloc/free respectively so less > is better. > > * Before > Kmalloc N*alloc N*free(32): Average=365/806 > Kmalloc N*alloc N*free(64): Average=452/690 > Kmalloc N*alloc N*free(128): Average=736/886 > Kmalloc N*alloc N*free(256): Average=1167/985 > Kmalloc N*alloc N*free(512): Average=2088/1125 > Kmalloc N*alloc N*free(1024): Average=4115/1184 > Kmalloc N*alloc N*free(2048): Average=8451/1748 > Kmalloc N*alloc N*free(4096): Average=16024/2048 > > * After > Kmalloc N*alloc N*free(32): Average=355/750 > Kmalloc N*alloc N*free(64): Average=452/812 > Kmalloc N*alloc N*free(128): Average=559/1070 > Kmalloc N*alloc N*free(256): Average=1176/980 > Kmalloc N*alloc N*free(512): Average=1939/1189 > Kmalloc N*alloc N*free(1024): Average=3521/1278 > Kmalloc N*alloc N*free(2048): Average=7152/1838 > Kmalloc N*alloc N*free(4096): Average=13438/2013 > > It shows that contention is reduced for object size >= 1024 > and performance increases by roughly 15%. > > Signed-off-by: Joonsoo Kim > --- > mm/slab.c | 26 +++++++++++++------------- > 1 file changed, 13 insertions(+), 13 deletions(-) > > diff --git a/mm/slab.c b/mm/slab.c > index df11757..52fc5e3 100644 > --- a/mm/slab.c > +++ b/mm/slab.c > @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep, > } > local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); > > - /* Take the node list lock to change the colour_next on this node */ > check_irq_off(); > - n = get_node(cachep, nodeid); > - spin_lock(&n->list_lock); > - > - /* Get colour for the slab, and cal the next value. */ > - offset = n->colour_next; > - n->colour_next++; > - if (n->colour_next >= cachep->colour) > - n->colour_next = 0; > - spin_unlock(&n->list_lock); > - > - offset *= cachep->colour_off; > - > if (gfpflags_allow_blocking(local_flags)) > local_irq_enable(); > > @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep, > if (!page) > goto failed; > > + n = get_node(cachep, nodeid); > + > + /* Get colour for the slab, and cal the next value. */ > + n->colour_next++; > + if (n->colour_next >= cachep->colour) > + n->colour_next = 0; > + > + offset = n->colour_next; > + if (offset >= cachep->colour) > + offset = 0; > + > + offset *= cachep->colour_off; > + > /* Get slab management. */ > freelist = alloc_slabmgmt(cachep, page, offset, > local_flags & ~GFP_CONSTRAINT_MASK, nodeid); > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f181.google.com (mail-ig0-f181.google.com [209.85.213.181]) by kanga.kvack.org (Postfix) with ESMTP id BAE9F6B0005 for ; Wed, 30 Mar 2016 04:09:19 -0400 (EDT) Received: by mail-ig0-f181.google.com with SMTP id cl4so95170796igb.0 for ; Wed, 30 Mar 2016 01:09:19 -0700 (PDT) Received: from lgeamrelo13.lge.com (LGEAMRELO13.lge.com. [156.147.23.53]) by mx.google.com with ESMTP id 197si3691827ion.193.2016.03.30.01.09.18 for ; Wed, 30 Mar 2016 01:09:18 -0700 (PDT) Date: Wed, 30 Mar 2016 17:11:16 +0900 From: Joonsoo Kim Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() Message-ID: <20160330081116.GA1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 07:50:36PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > Major kmem_cache metadata in slab subsystem is synchronized with > > the slab_mutex. In SLAB, if some of them is changed, node's shared > > array cache would be freed and re-populated. If __kmem_cache_shrink() > > is called at the same time, it will call drain_array() with n->shared > > without holding node lock so problem can happen. > > > > We can fix this small theoretical race condition by holding node lock > > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > > looks more appropriate solution because stable state would make things > > less error-prone and this is not performance critical path. > > Ummm.. The mutex taking is added to common code. So this will also affect > SLUB. The patch needs to consider this. Do we want to force all > allocators to run shrinking only when holding the lock? SLUB does not > need to hold the mutex. And frankly the mutex is for reconfiguration of > metadata which is *not* occurring here. A shrink operation does not do > that. Can we figure out a slab specific way of handling synchronization > in the strange free/realloc cycle? > > It seems that taking the node lock is the appropriate level of > synchrnonization since the concern is with the contents of a shared cache > at that level. There is no change of metadata which would require the > mutex. Okay. I will fix it. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f180.google.com (mail-io0-f180.google.com [209.85.223.180]) by kanga.kvack.org (Postfix) with ESMTP id 73D976B007E for ; Wed, 30 Mar 2016 04:09:37 -0400 (EDT) Received: by mail-io0-f180.google.com with SMTP id e3so56231156ioa.1 for ; Wed, 30 Mar 2016 01:09:37 -0700 (PDT) Received: from lgeamrelo11.lge.com (LGEAMRELO11.lge.com. [156.147.23.51]) by mx.google.com with ESMTP id qf2si18177614igb.91.2016.03.30.01.09.36 for ; Wed, 30 Mar 2016 01:09:36 -0700 (PDT) Date: Wed, 30 Mar 2016 17:11:35 +0900 From: Joonsoo Kim Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again Message-ID: <20160330081135.GB1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Geert Uytterhoeven Cc: Andrew Morton , Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , Linux MM , "linux-kernel@vger.kernel.org" On Mon, Mar 28, 2016 at 10:58:38AM +0200, Geert Uytterhoeven wrote: > Hi Jonsoo, > > On Mon, Mar 28, 2016 at 7:26 AM, wrote: > > From: Joonsoo Kim > > > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > > because it causes a problem on m68k which has many node > > but !CONFIG_NUMA. In this case, although alien cache isn't used > > at all but to cope with some initialization path, garbage value > > is used and that is BAD_ALIEN_MAGIC. Now, this patch set > > use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization > > path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. > > > > Signed-off-by: Joonsoo Kim > > I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous > version that was reverted, so > Tested-by: Geert Uytterhoeven Thanks for testing!!! Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f44.google.com (mail-pa0-f44.google.com [209.85.220.44]) by kanga.kvack.org (Postfix) with ESMTP id 9AC916B0005 for ; Wed, 30 Mar 2016 04:10:59 -0400 (EDT) Received: by mail-pa0-f44.google.com with SMTP id zm5so34588209pac.0 for ; Wed, 30 Mar 2016 01:10:59 -0700 (PDT) Received: from lgeamrelo12.lge.com (LGEAMRELO12.lge.com. [156.147.23.52]) by mx.google.com with ESMTP id v71si4779149pfi.22.2016.03.30.01.10.58 for ; Wed, 30 Mar 2016 01:10:59 -0700 (PDT) Date: Wed, 30 Mar 2016 17:12:58 +0900 From: Joonsoo Kim Subject: Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code Message-ID: <20160330081258.GC1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 07:56:15PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > From: Joonsoo Kim > > - spin_lock_irq(&n->list_lock); > > - n->free_limit = > > - (1 + nr_cpus_node(node)) * > > - cachep->batchcount + cachep->num; > > - spin_unlock_irq(&n->list_lock); > > + ret = init_cache_node(cachep, node, GFP_KERNEL); > > + if (ret) > > + return ret; > > Drop ret and do a > > return init_cache_node(...); > > instead? Will do it. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f177.google.com (mail-pf0-f177.google.com [209.85.192.177]) by kanga.kvack.org (Postfix) with ESMTP id 474A06B0005 for ; Wed, 30 Mar 2016 04:13:29 -0400 (EDT) Received: by mail-pf0-f177.google.com with SMTP id x3so36143239pfb.1 for ; Wed, 30 Mar 2016 01:13:29 -0700 (PDT) Received: from lgeamrelo11.lge.com (LGEAMRELO11.lge.com. [156.147.23.51]) by mx.google.com with ESMTP id g4si4779059pat.65.2016.03.30.01.13.27 for ; Wed, 30 Mar 2016 01:13:28 -0700 (PDT) Date: Wed, 30 Mar 2016 17:15:27 +0900 From: Joonsoo Kim Subject: Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup Message-ID: <20160330081526.GD1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 07:58:29PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > * This initializes kmem_cache_node or resizes various caches for all nodes. > > */ > > -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp) > > +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp) > > ... _node_node? Isnt there a better name for it? I will think it more. Reason I use this naming is that there is other site that uses this naming convention. I'm just mimicking it. :) It's very appreaciate if you have a suggestion. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f175.google.com (mail-ig0-f175.google.com [209.85.213.175]) by kanga.kvack.org (Postfix) with ESMTP id 54FD76B0005 for ; Wed, 30 Mar 2016 04:23:17 -0400 (EDT) Received: by mail-ig0-f175.google.com with SMTP id l20so35160775igf.0 for ; Wed, 30 Mar 2016 01:23:17 -0700 (PDT) Received: from lgeamrelo11.lge.com (LGEAMRELO11.lge.com. [156.147.23.51]) by mx.google.com with ESMTP id t37si3769878ioi.205.2016.03.30.01.23.15 for ; Wed, 30 Mar 2016 01:23:16 -0700 (PDT) Date: Wed, 30 Mar 2016 17:25:14 +0900 From: Joonsoo Kim Subject: Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit Message-ID: <20160330082514.GE1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 08:03:16PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > From: Joonsoo Kim > > > > Currently, determination to free a slab is done whenever free object is > > put into the slab. This has a problem that free slabs are not freed > > even if we have free slabs and have more free_objects than free_limit > > There needs to be a better explanation here since I do not get why there > is an issue with checking after free if a slab is actually free. Okay. Consider following 3 objects free situation. free_limt = 10 nr_free = 9 free(free slab) free(free slab) free(not free slab) If we check it one by one, when nr_free > free_limit (at last free), we cannot free the slab because current slab isn't a free slab. But, if we check it lastly, we can free 1 free slab. I will add more explanation on the next version. > > > when processed slab isn't a free slab. This would cause to keep > > too much memory in the slab subsystem. This patch try to fix it > > by checking number of free object after all free work is done. If there > > is free slab at that time, we can free it so we keep free slab as minimal > > as possible. > > Ok if we check after free work is done then the number of free slabs may > be higher than the limit set and then we free the additional slabs to get > down to the limit that was set? Yes. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f180.google.com (mail-io0-f180.google.com [209.85.223.180]) by kanga.kvack.org (Postfix) with ESMTP id F1CC46B007E for ; Wed, 30 Mar 2016 04:24:02 -0400 (EDT) Received: by mail-io0-f180.google.com with SMTP id e3so56650223ioa.1 for ; Wed, 30 Mar 2016 01:24:02 -0700 (PDT) Received: from lgeamrelo11.lge.com (LGEAMRELO11.lge.com. [156.147.23.51]) by mx.google.com with ESMTP id n1si18275515igp.22.2016.03.30.01.24.01 for ; Wed, 30 Mar 2016 01:24:02 -0700 (PDT) Date: Wed, 30 Mar 2016 17:25:57 +0900 From: Joonsoo Kim Subject: Re: [PATCH 07/11] mm/slab: racy access/modify the slab color Message-ID: <20160330082557.GF1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 08:05:41PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > From: Joonsoo Kim > > > > Slab color isn't needed to be changed strictly. Because locking > > for changing slab color could cause more lock contention so this patch > > implements racy access/modify the slab color. This is a preparation step > > to implement lockless allocation path when there is no free objects in > > the kmem_cache. > > Acked-by: Christoph Lameter > > The rest of the description does not relate to this patch and does not > actually reflect the improvement of applying this patch. Remove the rest? No, below improvement is for this individual patch. Thanks. > > > > Below is the result of concurrent allocation/free in slab allocation > > benchmark made by Christoph a long time ago. I make the output simpler. > > The number shows cycle count during alloc/free respectively so less > > is better. > > > > * Before > > Kmalloc N*alloc N*free(32): Average=365/806 > > Kmalloc N*alloc N*free(64): Average=452/690 > > Kmalloc N*alloc N*free(128): Average=736/886 > > Kmalloc N*alloc N*free(256): Average=1167/985 > > Kmalloc N*alloc N*free(512): Average=2088/1125 > > Kmalloc N*alloc N*free(1024): Average=4115/1184 > > Kmalloc N*alloc N*free(2048): Average=8451/1748 > > Kmalloc N*alloc N*free(4096): Average=16024/2048 > > > > * After > > Kmalloc N*alloc N*free(32): Average=355/750 > > Kmalloc N*alloc N*free(64): Average=452/812 > > Kmalloc N*alloc N*free(128): Average=559/1070 > > Kmalloc N*alloc N*free(256): Average=1176/980 > > Kmalloc N*alloc N*free(512): Average=1939/1189 > > Kmalloc N*alloc N*free(1024): Average=3521/1278 > > Kmalloc N*alloc N*free(2048): Average=7152/1838 > > Kmalloc N*alloc N*free(4096): Average=13438/2013 > > > > It shows that contention is reduced for object size >= 1024 > > and performance increases by roughly 15%. > > > > Signed-off-by: Joonsoo Kim > > --- > > mm/slab.c | 26 +++++++++++++------------- > > 1 file changed, 13 insertions(+), 13 deletions(-) > > > > diff --git a/mm/slab.c b/mm/slab.c > > index df11757..52fc5e3 100644 > > --- a/mm/slab.c > > +++ b/mm/slab.c > > @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep, > > } > > local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); > > > > - /* Take the node list lock to change the colour_next on this node */ > > check_irq_off(); > > - n = get_node(cachep, nodeid); > > - spin_lock(&n->list_lock); > > - > > - /* Get colour for the slab, and cal the next value. */ > > - offset = n->colour_next; > > - n->colour_next++; > > - if (n->colour_next >= cachep->colour) > > - n->colour_next = 0; > > - spin_unlock(&n->list_lock); > > - > > - offset *= cachep->colour_off; > > - > > if (gfpflags_allow_blocking(local_flags)) > > local_irq_enable(); > > > > @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep, > > if (!page) > > goto failed; > > > > + n = get_node(cachep, nodeid); > > + > > + /* Get colour for the slab, and cal the next value. */ > > + n->colour_next++; > > + if (n->colour_next >= cachep->colour) > > + n->colour_next = 0; > > + > > + offset = n->colour_next; > > + if (offset >= cachep->colour) > > + offset = 0; > > + > > + offset *= cachep->colour_off; > > + > > /* Get slab management. */ > > freelist = alloc_slabmgmt(cachep, page, offset, > > local_flags & ~GFP_CONSTRAINT_MASK, nodeid); > > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f42.google.com (mail-wm0-f42.google.com [74.125.82.42]) by kanga.kvack.org (Postfix) with ESMTP id 11D496B007E for ; Thu, 31 Mar 2016 06:53:19 -0400 (EDT) Received: by mail-wm0-f42.google.com with SMTP id r72so135637201wmg.0 for ; Thu, 31 Mar 2016 03:53:19 -0700 (PDT) Received: from mail-wm0-f45.google.com (mail-wm0-f45.google.com. [74.125.82.45]) by mx.google.com with ESMTPS id wg3si11096267wjb.162.2016.03.31.03.53.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 31 Mar 2016 03:53:17 -0700 (PDT) Received: by mail-wm0-f45.google.com with SMTP id p65so109214469wmp.0 for ; Thu, 31 Mar 2016 03:53:17 -0700 (PDT) Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> From: Nikolay Borisov Message-ID: <56FD019A.10906@kyup.com> Date: Thu, 31 Mar 2016 13:53:14 +0300 MIME-Version: 1.0 In-Reply-To: <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: js1304@gmail.com, Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim On 03/28/2016 08:26 AM, js1304@gmail.com wrote: > From: Joonsoo Kim > > Major kmem_cache metadata in slab subsystem is synchronized with > the slab_mutex. In SLAB, if some of them is changed, node's shared > array cache would be freed and re-populated. If __kmem_cache_shrink() > is called at the same time, it will call drain_array() with n->shared > without holding node lock so problem can happen. > > We can fix this small theoretical race condition by holding node lock > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > looks more appropriate solution because stable state would make things > less error-prone and this is not performance critical path. > > In addtion, annotate on SLAB functions. Just a nit but would it not be better instead of doing comment-style annotation to use lockdep_assert_held/_once. In both cases for someone to understand what locks have to be held will go and read the source. In my mind it's easier to miss a comment line, rather than the lockdep_assert. Furthermore in case lockdep is enabled a locking violation would spew useful info to dmesg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f170.google.com (mail-pf0-f170.google.com [209.85.192.170]) by kanga.kvack.org (Postfix) with ESMTP id 35C7C6B007E for ; Thu, 31 Mar 2016 22:16:03 -0400 (EDT) Received: by mail-pf0-f170.google.com with SMTP id e128so61632912pfe.3 for ; Thu, 31 Mar 2016 19:16:03 -0700 (PDT) Received: from lgeamrelo12.lge.com (LGEAMRELO12.lge.com. [156.147.23.52]) by mx.google.com with ESMTP id n3si17787268pfb.123.2016.03.31.19.16.01 for ; Thu, 31 Mar 2016 19:16:02 -0700 (PDT) Date: Fri, 1 Apr 2016 11:18:07 +0900 From: Joonsoo Kim Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() Message-ID: <20160401021806.GA13179@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> <56FD019A.10906@kyup.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <56FD019A.10906@kyup.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nikolay Borisov Cc: Andrew Morton , Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Thu, Mar 31, 2016 at 01:53:14PM +0300, Nikolay Borisov wrote: > > > On 03/28/2016 08:26 AM, js1304@gmail.com wrote: > > From: Joonsoo Kim > > > > Major kmem_cache metadata in slab subsystem is synchronized with > > the slab_mutex. In SLAB, if some of them is changed, node's shared > > array cache would be freed and re-populated. If __kmem_cache_shrink() > > is called at the same time, it will call drain_array() with n->shared > > without holding node lock so problem can happen. > > > > We can fix this small theoretical race condition by holding node lock > > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > > looks more appropriate solution because stable state would make things > > less error-prone and this is not performance critical path. > > > > In addtion, annotate on SLAB functions. > > Just a nit but would it not be better instead of doing comment-style > annotation to use lockdep_assert_held/_once. In both cases for someone > to understand what locks have to be held will go and read the source. In > my mind it's easier to miss a comment line, rather than the > lockdep_assert. Furthermore in case lockdep is enabled a locking > violation would spew useful info to dmesg. Good idea. I'm not sure if lockdep_assert is best fit but I will add something to check it rather than just adding the comment. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754323AbcC1F1h (ORCPT ); Mon, 28 Mar 2016 01:27:37 -0400 Received: from mail-pf0-f177.google.com ([209.85.192.177]:34607 "EHLO mail-pf0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753984AbcC1F1W (ORCPT ); Mon, 28 Mar 2016 01:27:22 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 03/11] mm/slab: drain the free slab as much as possible Date: Mon, 28 Mar 2016 14:26:53 +0900 Message-Id: <1459142821-20303-4-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim slabs_tofree() implies freeing all free slab. We can do it with just providing INT_MAX. Signed-off-by: Joonsoo Kim --- mm/slab.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index a5a205b..ba2eacf 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -888,12 +888,6 @@ static int init_cache_node_node(int node) return 0; } -static inline int slabs_tofree(struct kmem_cache *cachep, - struct kmem_cache_node *n) -{ - return (n->free_objects + cachep->num - 1) / cachep->num; -} - static void cpuup_canceled(long cpu) { struct kmem_cache *cachep; @@ -958,7 +952,7 @@ free_slab: n = get_node(cachep, node); if (!n) continue; - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); } } @@ -1110,7 +1104,7 @@ static int __meminit drain_cache_node_node(int node) if (!n) continue; - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); if (!list_empty(&n->slabs_full) || !list_empty(&n->slabs_partial)) { @@ -2280,7 +2274,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate) check_irq_on(); for_each_kmem_cache_node(cachep, node, n) { - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); ret += !list_empty(&n->slabs_full) || !list_empty(&n->slabs_partial); -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753986AbcC1F1X (ORCPT ); Mon, 28 Mar 2016 01:27:23 -0400 Received: from mail-pa0-f47.google.com ([209.85.220.47]:33108 "EHLO mail-pa0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753569AbcC1F1N (ORCPT ); Mon, 28 Mar 2016 01:27:13 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: mm/slab: reduce lock contention in alloc path Date: Mon, 28 Mar 2016 14:26:50 +0900 Message-Id: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim While processing concurrent allocation, SLAB could be contended a lot because it did a lots of work with holding a lock. This patchset try to reduce the number of critical section to reduce lock contention. Major changes are lockless decision to allocate more slab and lockless cpu cache refill from the newly allocated slab. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that performance improves greatly (roughly more than 50%) for the object class whose size is more than 128 bytes. Thanks. Joonsoo Kim (11): mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() mm/slab: remove BAD_ALIEN_MAGIC again mm/slab: drain the free slab as much as possible mm/slab: factor out kmem_cache_node initialization code mm/slab: clean-up kmem_cache_node setup mm/slab: don't keep free slabs if free_objects exceeds free_limit mm/slab: racy access/modify the slab color mm/slab: make cache_grow() handle the page allocated on arbitrary node mm/slab: separate cache_grow() to two parts mm/slab: refill cpu cache through a new slab without holding a node lock mm/slab: lockless decision to grow cache mm/slab.c | 495 ++++++++++++++++++++++++++++--------------------------- mm/slab_common.c | 4 + 2 files changed, 255 insertions(+), 244 deletions(-) -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754492AbcC1F1s (ORCPT ); Mon, 28 Mar 2016 01:27:48 -0400 Received: from mail-pa0-f54.google.com ([209.85.220.54]:36530 "EHLO mail-pa0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754024AbcC1F1Y (ORCPT ); Mon, 28 Mar 2016 01:27:24 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code Date: Mon, 28 Mar 2016 14:26:54 +0900 Message-Id: <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim It can be reused on other place, so factor out it. Following patch will use it. Signed-off-by: Joonsoo Kim --- mm/slab.c | 68 ++++++++++++++++++++++++++++++++++++--------------------------- 1 file changed, 39 insertions(+), 29 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index ba2eacf..569d7db 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -841,6 +841,40 @@ static inline gfp_t gfp_exact_node(gfp_t flags) } #endif +static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp) +{ + struct kmem_cache_node *n; + + /* + * Set up the kmem_cache_node for cpu before we can + * begin anything. Make sure some other cpu on this + * node has not already allocated this + */ + n = get_node(cachep, node); + if (n) + return 0; + + n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node); + if (!n) + return -ENOMEM; + + kmem_cache_node_init(n); + n->next_reap = jiffies + REAPTIMEOUT_NODE + + ((unsigned long)cachep) % REAPTIMEOUT_NODE; + + n->free_limit = + (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; + + /* + * The kmem_cache_nodes don't come and go as CPUs + * come and go. slab_mutex is sufficient + * protection here. + */ + cachep->node[node] = n; + + return 0; +} + /* * Allocates and initializes node for a node on each slab cache, used for * either memory or cpu hotplug. If memory is being hot-added, the kmem_cache_node @@ -852,39 +886,15 @@ static inline gfp_t gfp_exact_node(gfp_t flags) */ static int init_cache_node_node(int node) { + int ret; struct kmem_cache *cachep; - struct kmem_cache_node *n; - const size_t memsize = sizeof(struct kmem_cache_node); list_for_each_entry(cachep, &slab_caches, list) { - /* - * Set up the kmem_cache_node for cpu before we can - * begin anything. Make sure some other cpu on this - * node has not already allocated this - */ - n = get_node(cachep, node); - if (!n) { - n = kmalloc_node(memsize, GFP_KERNEL, node); - if (!n) - return -ENOMEM; - kmem_cache_node_init(n); - n->next_reap = jiffies + REAPTIMEOUT_NODE + - ((unsigned long)cachep) % REAPTIMEOUT_NODE; - - /* - * The kmem_cache_nodes don't come and go as CPUs - * come and go. slab_mutex is sufficient - * protection here. - */ - cachep->node[node] = n; - } - - spin_lock_irq(&n->list_lock); - n->free_limit = - (1 + nr_cpus_node(node)) * - cachep->batchcount + cachep->num; - spin_unlock_irq(&n->list_lock); + ret = init_cache_node(cachep, node, GFP_KERNEL); + if (ret) + return ret; } + return 0; } -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754550AbcC1F15 (ORCPT ); Mon, 28 Mar 2016 01:27:57 -0400 Received: from mail-pa0-f54.google.com ([209.85.220.54]:34782 "EHLO mail-pa0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753840AbcC1F1T (ORCPT ); Mon, 28 Mar 2016 01:27:19 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again Date: Mon, 28 Mar 2016 14:26:52 +0900 Message-Id: <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' because it causes a problem on m68k which has many node but !CONFIG_NUMA. In this case, although alien cache isn't used at all but to cope with some initialization path, garbage value is used and that is BAD_ALIEN_MAGIC. Now, this patch set use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. Signed-off-by: Joonsoo Kim --- mm/slab.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 043606a..a5a205b 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -421,8 +421,6 @@ static struct kmem_cache kmem_cache_boot = { .name = "kmem_cache", }; -#define BAD_ALIEN_MAGIC 0x01020304ul - static DEFINE_PER_CPU(struct delayed_work, slab_reap_work); static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep) @@ -637,7 +635,7 @@ static int transfer_objects(struct array_cache *to, static inline struct alien_cache **alloc_alien_cache(int node, int limit, gfp_t gfp) { - return (struct alien_cache **)BAD_ALIEN_MAGIC; + return NULL; } static inline void free_alien_cache(struct alien_cache **ac_ptr) @@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void) sizeof(struct rcu_head)); kmem_cache = &kmem_cache_boot; - if (num_possible_nodes() == 1) + if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1) use_alien_caches = 0; for (i = 0; i < NUM_INIT_LISTS; i++) -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754593AbcC1F2F (ORCPT ); Mon, 28 Mar 2016 01:28:05 -0400 Received: from mail-pf0-f172.google.com ([209.85.192.172]:34530 "EHLO mail-pf0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753755AbcC1F1Q (ORCPT ); Mon, 28 Mar 2016 01:27:16 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() Date: Mon, 28 Mar 2016 14:26:51 +0900 Message-Id: <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim Major kmem_cache metadata in slab subsystem is synchronized with the slab_mutex. In SLAB, if some of them is changed, node's shared array cache would be freed and re-populated. If __kmem_cache_shrink() is called at the same time, it will call drain_array() with n->shared without holding node lock so problem can happen. We can fix this small theoretical race condition by holding node lock in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() looks more appropriate solution because stable state would make things less error-prone and this is not performance critical path. In addtion, annotate on SLAB functions. Signed-off-by: Joonsoo Kim --- mm/slab.c | 2 ++ mm/slab_common.c | 4 ++++ 2 files changed, 6 insertions(+) diff --git a/mm/slab.c b/mm/slab.c index a53a0f6..043606a 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2218,6 +2218,7 @@ static void do_drain(void *arg) ac->avail = 0; } +/* Should be called with slab_mutex to prevent from freeing shared array */ static void drain_cpu_caches(struct kmem_cache *cachep) { struct kmem_cache_node *n; @@ -3871,6 +3872,7 @@ skip_setup: * Drain an array if it contains any elements taking the node lock only if * necessary. Note that the node listlock also protects the array_cache * if drain_array() is used on the shared array. + * Should be called with slab_mutex to prevent from freeing shared array. */ static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n, struct array_cache *ac, int force, int node) diff --git a/mm/slab_common.c b/mm/slab_common.c index a65dad7..5bed565 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -755,7 +755,11 @@ int kmem_cache_shrink(struct kmem_cache *cachep) get_online_cpus(); get_online_mems(); kasan_cache_shrink(cachep); + + mutex_lock(&slab_mutex); ret = __kmem_cache_shrink(cachep, false); + mutex_unlock(&slab_mutex); + put_online_mems(); put_online_cpus(); return ret; -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754630AbcC1F2Z (ORCPT ); Mon, 28 Mar 2016 01:28:25 -0400 Received: from mail-pf0-f169.google.com ([209.85.192.169]:33870 "EHLO mail-pf0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754461AbcC1F1p (ORCPT ); Mon, 28 Mar 2016 01:27:45 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock Date: Mon, 28 Mar 2016 14:27:00 +0900 Message-Id: <1459142821-20303-11-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim Until now, cache growing makes a free slab on node's slab list and then we can allocate free objects from it. This necessarily requires to hold a node lock which is very contended. If we refill cpu cache before attaching it to node's slab list, we can avoid holding a node lock as much as possible because this newly allocated slab is only visible to the current task. This will reduce lock contention. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 * After Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 It shows that contention is reduced for all the object sizes and performance increases by 30 ~ 40%. Signed-off-by: Joonsoo Kim --- mm/slab.c | 68 +++++++++++++++++++++++++++++++++------------------------------ 1 file changed, 36 insertions(+), 32 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 401e60c..029d6b3 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2827,6 +2827,30 @@ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep, return obj; } +/* + * Slab list should be fixed up by fixup_slab_list() for existing slab + * or cache_grow_end() for new slab + */ +static __always_inline int alloc_block(struct kmem_cache *cachep, + struct array_cache *ac, struct page *page, int batchcount) +{ + /* + * There must be at least one object available for + * allocation. + */ + BUG_ON(page->active >= cachep->num); + + while (page->active < cachep->num && batchcount--) { + STATS_INC_ALLOCED(cachep); + STATS_INC_ACTIVE(cachep); + STATS_SET_HIGH(cachep); + + ac->entry[ac->avail++] = slab_get_obj(cachep, page); + } + + return batchcount; +} + static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) { int batchcount; @@ -2839,7 +2863,6 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) check_irq_off(); node = numa_mem_id(); -retry: ac = cpu_cache_get(cachep); batchcount = ac->batchcount; if (!ac->touched && batchcount > BATCHREFILL_LIMIT) { @@ -2869,21 +2892,7 @@ retry: check_spinlock_acquired(cachep); - /* - * The slab was either on partial or free list so - * there must be at least one object available for - * allocation. - */ - BUG_ON(page->active >= cachep->num); - - while (page->active < cachep->num && batchcount--) { - STATS_INC_ALLOCED(cachep); - STATS_INC_ACTIVE(cachep); - STATS_SET_HIGH(cachep); - - ac->entry[ac->avail++] = slab_get_obj(cachep, page); - } - + batchcount = alloc_block(cachep, ac, page, batchcount); fixup_slab_list(cachep, n, page, &list); } @@ -2903,21 +2912,18 @@ alloc_done: } page = cache_grow_begin(cachep, gfp_exact_node(flags), node); - cache_grow_end(cachep, page); /* * cache_grow_begin() can reenable interrupts, * then ac could change. */ ac = cpu_cache_get(cachep); - node = numa_mem_id(); + if (!ac->avail && page) + alloc_block(cachep, ac, page, batchcount); + cache_grow_end(cachep, page); - /* no objects in sight? abort */ - if (!page && ac->avail == 0) + if (!ac->avail) return NULL; - - if (!ac->avail) /* objects refilled by interrupt? */ - goto retry; } ac->touched = 1; @@ -3111,14 +3117,13 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, { struct page *page; struct kmem_cache_node *n; - void *obj; + void *obj = NULL; void *list = NULL; VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES); n = get_node(cachep, nodeid); BUG_ON(!n); -retry: check_irq_off(); spin_lock(&n->list_lock); page = get_first_slab(n, false); @@ -3140,19 +3145,18 @@ retry: spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); - goto done; + return obj; must_grow: spin_unlock(&n->list_lock); page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid); + if (page) { + /* This slab isn't counted yet so don't update free_objects */ + obj = slab_get_obj(cachep, page); + } cache_grow_end(cachep, page); - if (page) - goto retry; - return fallback_alloc(cachep, flags); - -done: - return obj; + return obj ? obj : fallback_alloc(cachep, flags); } static __always_inline void * -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754667AbcC1F2d (ORCPT ); Mon, 28 Mar 2016 01:28:33 -0400 Received: from mail-pa0-f43.google.com ([209.85.220.43]:34084 "EHLO mail-pa0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754488AbcC1F1r (ORCPT ); Mon, 28 Mar 2016 01:27:47 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 11/11] mm/slab: lockless decision to grow cache Date: Mon, 28 Mar 2016 14:27:01 +0900 Message-Id: <1459142821-20303-12-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim To check whther free objects exist or not precisely, we need to grab a lock. But, accuracy isn't that important because race window would be even small and if there is too much free object, cache reaper would reap it. So, this patch makes the check for free object exisistence not to hold a lock. This will reduce lock contention in heavily allocation case. Note that until now, n->shared can be freed during the processing by writing slabinfo, but, with some trick in this patch, we can access it freely within interrupt disabled period. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that allocation performance decreases for the object size up to 128 and it may be due to extra checks in cache_alloc_refill(). But, with considering improvement of free performance, net result looks the same. Result for other size class looks very promising, roughly, 50% performance improvement. Signed-off-by: Joonsoo Kim --- mm/slab.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 029d6b3..b70aabf 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -951,6 +951,15 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep, spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); + /* + * To protect lockless access to n->shared during irq disabled context. + * If n->shared isn't NULL in irq disabled context, accessing to it is + * guaranteed to be valid until irq is re-enabled, because it will be + * freed after kick_all_cpus_sync(). + */ + if (force_change) + kick_all_cpus_sync(); + fail: kfree(old_shared); kfree(new_shared); @@ -2855,7 +2864,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) { int batchcount; struct kmem_cache_node *n; - struct array_cache *ac; + struct array_cache *ac, *shared; int node; void *list = NULL; struct page *page; @@ -2876,11 +2885,16 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) n = get_node(cachep, node); BUG_ON(ac->avail > 0 || !n); + shared = READ_ONCE(n->shared); + if (!n->free_objects && (!shared || !shared->avail)) + goto direct_grow; + spin_lock(&n->list_lock); + shared = READ_ONCE(n->shared); /* See if we can refill from the shared array */ - if (n->shared && transfer_objects(ac, n->shared, batchcount)) { - n->shared->touched = 1; + if (shared && transfer_objects(ac, shared, batchcount)) { + shared->touched = 1; goto alloc_done; } @@ -2902,6 +2916,7 @@ alloc_done: spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); +direct_grow: if (unlikely(!ac->avail)) { /* Check if we can use obj in pfmemalloc slab */ if (sk_memalloc_socks()) { -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754771AbcC1F2w (ORCPT ); Mon, 28 Mar 2016 01:28:52 -0400 Received: from mail-pa0-f44.google.com ([209.85.220.44]:36639 "EHLO mail-pa0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754299AbcC1F1f (ORCPT ); Mon, 28 Mar 2016 01:27:35 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 07/11] mm/slab: racy access/modify the slab color Date: Mon, 28 Mar 2016 14:26:57 +0900 Message-Id: <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim Slab color isn't needed to be changed strictly. Because locking for changing slab color could cause more lock contention so this patch implements racy access/modify the slab color. This is a preparation step to implement lockless allocation path when there is no free objects in the kmem_cache. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 It shows that contention is reduced for object size >= 1024 and performance increases by roughly 15%. Signed-off-by: Joonsoo Kim --- mm/slab.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index df11757..52fc5e3 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep, } local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); - /* Take the node list lock to change the colour_next on this node */ check_irq_off(); - n = get_node(cachep, nodeid); - spin_lock(&n->list_lock); - - /* Get colour for the slab, and cal the next value. */ - offset = n->colour_next; - n->colour_next++; - if (n->colour_next >= cachep->colour) - n->colour_next = 0; - spin_unlock(&n->list_lock); - - offset *= cachep->colour_off; - if (gfpflags_allow_blocking(local_flags)) local_irq_enable(); @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep, if (!page) goto failed; + n = get_node(cachep, nodeid); + + /* Get colour for the slab, and cal the next value. */ + n->colour_next++; + if (n->colour_next >= cachep->colour) + n->colour_next = 0; + + offset = n->colour_next; + if (offset >= cachep->colour) + offset = 0; + + offset *= cachep->colour_off; + /* Get slab management. */ freelist = alloc_slabmgmt(cachep, page, offset, local_flags & ~GFP_CONSTRAINT_MASK, nodeid); -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754715AbcC1F2h (ORCPT ); Mon, 28 Mar 2016 01:28:37 -0400 Received: from mail-pa0-f48.google.com ([209.85.220.48]:36713 "EHLO mail-pa0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754351AbcC1F1l (ORCPT ); Mon, 28 Mar 2016 01:27:41 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 09/11] mm/slab: separate cache_grow() to two parts Date: Mon, 28 Mar 2016 14:26:59 +0900 Message-Id: <1459142821-20303-10-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim This is a preparation step to implement lockless allocation path when there is no free objects in kmem_cache. What we'd like to do here is to refill cpu cache without holding a node lock. To accomplish this purpose, refill should be done after new slab allocation but before attaching the slab to the management list. So, this patch separates cache_grow() to two parts, allocation and attaching to the list in order to add some code inbetween them in the following patch. Signed-off-by: Joonsoo Kim --- mm/slab.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 52 insertions(+), 22 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index ce8ed65..401e60c 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -213,6 +213,11 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list); static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp); static void cache_reap(struct work_struct *unused); +static inline void fixup_objfreelist_debug(struct kmem_cache *cachep, + void **list); +static inline void fixup_slab_list(struct kmem_cache *cachep, + struct kmem_cache_node *n, struct page *page, + void **list); static int slab_early_init = 1; #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node)) @@ -1796,7 +1801,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep, /* * Needed to avoid possible looping condition - * in cache_grow() + * in cache_grow_begin() */ if (OFF_SLAB(freelist_cache)) continue; @@ -2518,7 +2523,8 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page, * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. */ -static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid) +static struct page *cache_grow_begin(struct kmem_cache *cachep, + gfp_t flags, int nodeid) { void *freelist; size_t offset; @@ -2584,21 +2590,40 @@ static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid) if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); - check_irq_off(); - spin_lock(&n->list_lock); - /* Make slab active. */ - list_add_tail(&page->lru, &(n->slabs_free)); - STATS_INC_GROWN(cachep); - n->free_objects += cachep->num; - spin_unlock(&n->list_lock); - return page_node; + return page; + opps1: kmem_freepages(cachep, page); failed: if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); - return -1; + return NULL; +} + +static void cache_grow_end(struct kmem_cache *cachep, struct page *page) +{ + struct kmem_cache_node *n; + void *list = NULL; + + check_irq_off(); + + if (!page) + return; + + INIT_LIST_HEAD(&page->lru); + n = get_node(cachep, page_to_nid(page)); + + spin_lock(&n->list_lock); + if (!page->active) + list_add_tail(&page->lru, &(n->slabs_free)); + else + fixup_slab_list(cachep, n, page, &list); + STATS_INC_GROWN(cachep); + n->free_objects += cachep->num - page->active; + spin_unlock(&n->list_lock); + + fixup_objfreelist_debug(cachep, &list); } #if DEBUG @@ -2809,6 +2834,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) struct array_cache *ac; int node; void *list = NULL; + struct page *page; check_irq_off(); node = numa_mem_id(); @@ -2836,7 +2862,6 @@ retry: } while (batchcount > 0) { - struct page *page; /* Get slab alloc is to come from. */ page = get_first_slab(n, false); if (!page) @@ -2869,8 +2894,6 @@ alloc_done: fixup_objfreelist_debug(cachep, &list); if (unlikely(!ac->avail)) { - int x; - /* Check if we can use obj in pfmemalloc slab */ if (sk_memalloc_socks()) { void *obj = cache_alloc_pfmemalloc(cachep, n, flags); @@ -2879,14 +2902,18 @@ alloc_done: return obj; } - x = cache_grow(cachep, gfp_exact_node(flags), node); + page = cache_grow_begin(cachep, gfp_exact_node(flags), node); + cache_grow_end(cachep, page); - /* cache_grow can reenable interrupts, then ac could change. */ + /* + * cache_grow_begin() can reenable interrupts, + * then ac could change. + */ ac = cpu_cache_get(cachep); node = numa_mem_id(); /* no objects in sight? abort */ - if (x < 0 && ac->avail == 0) + if (!page && ac->avail == 0) return NULL; if (!ac->avail) /* objects refilled by interrupt? */ @@ -3019,6 +3046,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; + struct page *page; int nid; unsigned int cpuset_mems_cookie; @@ -3054,8 +3082,10 @@ retry: * We may trigger various forms of reclaim on the allowed * set and go into memory reserves if necessary. */ - nid = cache_grow(cache, flags, numa_mem_id()); - if (nid >= 0) { + page = cache_grow_begin(cache, flags, numa_mem_id()); + cache_grow_end(cache, page); + if (page) { + nid = page_to_nid(page); obj = ____cache_alloc_node(cache, gfp_exact_node(flags), nid); @@ -3083,7 +3113,6 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, struct kmem_cache_node *n; void *obj; void *list = NULL; - int x; VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES); n = get_node(cachep, nodeid); @@ -3115,8 +3144,9 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, gfp_exact_node(flags), nodeid); - if (x >= 0) + page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid); + cache_grow_end(cachep, page); + if (page) goto retry; return fallback_alloc(cachep, flags); -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754743AbcC1F2r (ORCPT ); Mon, 28 Mar 2016 01:28:47 -0400 Received: from mail-pf0-f173.google.com ([209.85.192.173]:33232 "EHLO mail-pf0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754321AbcC1F1h (ORCPT ); Mon, 28 Mar 2016 01:27:37 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node Date: Mon, 28 Mar 2016 14:26:58 +0900 Message-Id: <1459142821-20303-9-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim Currently, cache_grow() assumes that allocated page's nodeid would be same with parameter nodeid which is used for allocation request. If we discard this assumption, we can handle fallback_alloc() case gracefully. So, this patch makes cache_grow() handle the page allocated on arbitrary node and clean-up relevant code. Signed-off-by: Joonsoo Kim --- mm/slab.c | 60 +++++++++++++++++++++--------------------------------------- 1 file changed, 21 insertions(+), 39 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 52fc5e3..ce8ed65 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2518,13 +2518,14 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page, * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. */ -static int cache_grow(struct kmem_cache *cachep, - gfp_t flags, int nodeid, struct page *page) +static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid) { void *freelist; size_t offset; gfp_t local_flags; + int page_node; struct kmem_cache_node *n; + struct page *page; /* * Be lazy and only check for valid flags here, keeping it out of the @@ -2552,12 +2553,12 @@ static int cache_grow(struct kmem_cache *cachep, * Get mem for the objs. Attempt to allocate a physical page from * 'nodeid'. */ - if (!page) - page = kmem_getpages(cachep, local_flags, nodeid); + page = kmem_getpages(cachep, local_flags, nodeid); if (!page) goto failed; - n = get_node(cachep, nodeid); + page_node = page_to_nid(page); + n = get_node(cachep, page_node); /* Get colour for the slab, and cal the next value. */ n->colour_next++; @@ -2572,7 +2573,7 @@ static int cache_grow(struct kmem_cache *cachep, /* Get slab management. */ freelist = alloc_slabmgmt(cachep, page, offset, - local_flags & ~GFP_CONSTRAINT_MASK, nodeid); + local_flags & ~GFP_CONSTRAINT_MASK, page_node); if (OFF_SLAB(cachep) && !freelist) goto opps1; @@ -2591,13 +2592,13 @@ static int cache_grow(struct kmem_cache *cachep, STATS_INC_GROWN(cachep); n->free_objects += cachep->num; spin_unlock(&n->list_lock); - return 1; + return page_node; opps1: kmem_freepages(cachep, page); failed: if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); - return 0; + return -1; } #if DEBUG @@ -2878,14 +2879,14 @@ alloc_done: return obj; } - x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), node); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); node = numa_mem_id(); /* no objects in sight? abort */ - if (!x && ac->avail == 0) + if (x < 0 && ac->avail == 0) return NULL; if (!ac->avail) /* objects refilled by interrupt? */ @@ -3014,7 +3015,6 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) { struct zonelist *zonelist; - gfp_t local_flags; struct zoneref *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); @@ -3025,8 +3025,6 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) if (flags & __GFP_THISNODE) return NULL; - local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); - retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); zonelist = node_zonelist(mempolicy_slab_node(), flags); @@ -3056,33 +3054,17 @@ retry: * We may trigger various forms of reclaim on the allowed * set and go into memory reserves if necessary. */ - struct page *page; + nid = cache_grow(cache, flags, numa_mem_id()); + if (nid >= 0) { + obj = ____cache_alloc_node(cache, + gfp_exact_node(flags), nid); - if (gfpflags_allow_blocking(local_flags)) - local_irq_enable(); - kmem_flagcheck(cache, flags); - page = kmem_getpages(cache, local_flags, numa_mem_id()); - if (gfpflags_allow_blocking(local_flags)) - local_irq_disable(); - if (page) { /* - * Insert into the appropriate per node queues + * Another processor may allocate the objects in + * the slab since we are not holding any locks. */ - nid = page_to_nid(page); - if (cache_grow(cache, flags, nid, page)) { - obj = ____cache_alloc_node(cache, - gfp_exact_node(flags), nid); - if (!obj) - /* - * Another processor may allocate the - * objects in the slab since we are - * not holding any locks. - */ - goto retry; - } else { - /* cache_grow already freed obj */ - obj = NULL; - } + if (!obj) + goto retry; } } @@ -3133,8 +3115,8 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL); - if (x) + x = cache_grow(cachep, gfp_exact_node(flags), nodeid); + if (x >= 0) goto retry; return fallback_alloc(cachep, flags); -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754302AbcC1F3F (ORCPT ); Mon, 28 Mar 2016 01:29:05 -0400 Received: from mail-pf0-f179.google.com ([209.85.192.179]:35352 "EHLO mail-pf0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754139AbcC1F1b (ORCPT ); Mon, 28 Mar 2016 01:27:31 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit Date: Mon, 28 Mar 2016 14:26:56 +0900 Message-Id: <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim Currently, determination to free a slab is done whenever free object is put into the slab. This has a problem that free slabs are not freed even if we have free slabs and have more free_objects than free_limit when processed slab isn't a free slab. This would cause to keep too much memory in the slab subsystem. This patch try to fix it by checking number of free object after all free work is done. If there is free slab at that time, we can free it so we keep free slab as minimal as possible. Signed-off-by: Joonsoo Kim --- mm/slab.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index b96f381..df11757 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3258,6 +3258,9 @@ static void free_block(struct kmem_cache *cachep, void **objpp, { int i; struct kmem_cache_node *n = get_node(cachep, node); + struct page *page; + + n->free_objects += nr_objects; for (i = 0; i < nr_objects; i++) { void *objp; @@ -3270,17 +3273,11 @@ static void free_block(struct kmem_cache *cachep, void **objpp, check_spinlock_acquired_node(cachep, node); slab_put_obj(cachep, page, objp); STATS_DEC_ACTIVE(cachep); - n->free_objects++; /* fixup slab chains */ - if (page->active == 0) { - if (n->free_objects > n->free_limit) { - n->free_objects -= cachep->num; - list_add_tail(&page->lru, list); - } else { - list_add(&page->lru, &n->slabs_free); - } - } else { + if (page->active == 0) + list_add(&page->lru, &n->slabs_free); + else { /* Unconditionally move a slab to the end of the * partial list on free - maximum time for the * other objects to be freed, too. @@ -3288,6 +3285,14 @@ static void free_block(struct kmem_cache *cachep, void **objpp, list_add_tail(&page->lru, &n->slabs_partial); } } + + while (n->free_objects > n->free_limit && !list_empty(&n->slabs_free)) { + n->free_objects -= cachep->num; + + page = list_last_entry(&n->slabs_free, struct page, lru); + list_del(&page->lru); + list_add(&page->lru, list); + } } static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac) -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754370AbcC1F3v (ORCPT ); Mon, 28 Mar 2016 01:29:51 -0400 Received: from mail-pa0-f43.google.com ([209.85.220.43]:35505 "EHLO mail-pa0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754079AbcC1F12 (ORCPT ); Mon, 28 Mar 2016 01:27:28 -0400 From: js1304@gmail.com X-Google-Original-From: iamjoonsoo.kim@lge.com To: Andrew Morton Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup Date: Mon, 28 Mar 2016 14:26:55 +0900 Message-Id: <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Joonsoo Kim There are mostly same code for setting up kmem_cache_node either in cpuup_prepare() or alloc_kmem_cache_node(). Factor out and clean-up them. Signed-off-by: Joonsoo Kim --- mm/slab.c | 167 +++++++++++++++++++++++++------------------------------------- 1 file changed, 67 insertions(+), 100 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 569d7db..b96f381 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -898,6 +898,62 @@ static int init_cache_node_node(int node) return 0; } +static int setup_kmem_cache_node(struct kmem_cache *cachep, + int node, gfp_t gfp, bool force_change) +{ + int ret = -ENOMEM; + struct kmem_cache_node *n; + struct array_cache *old_shared = NULL; + struct array_cache *new_shared = NULL; + struct alien_cache **new_alien = NULL; + LIST_HEAD(list); + + if (use_alien_caches) { + new_alien = alloc_alien_cache(node, cachep->limit, gfp); + if (!new_alien) + goto fail; + } + + if (cachep->shared) { + new_shared = alloc_arraycache(node, + cachep->shared * cachep->batchcount, 0xbaadf00d, gfp); + if (!new_shared) + goto fail; + } + + ret = init_cache_node(cachep, node, gfp); + if (ret) + goto fail; + + n = get_node(cachep, node); + spin_lock_irq(&n->list_lock); + if (n->shared) { + free_block(cachep, n->shared->entry, + n->shared->avail, node, &list); + } + + if (!n->shared || force_change) { + old_shared = n->shared; + n->shared = new_shared; + new_shared = NULL; + } + + if (!n->alien) { + n->alien = new_alien; + new_alien = NULL; + } + + spin_unlock_irq(&n->list_lock); + slabs_destroy(cachep, &list); + +fail: + kfree(old_shared); + kfree(new_shared); + free_alien_cache(new_alien); + + return ret; +} + static void cpuup_canceled(long cpu) { struct kmem_cache *cachep; @@ -969,7 +1025,6 @@ free_slab: static int cpuup_prepare(long cpu) { struct kmem_cache *cachep; - struct kmem_cache_node *n = NULL; int node = cpu_to_mem(cpu); int err; @@ -988,44 +1043,9 @@ static int cpuup_prepare(long cpu) * array caches */ list_for_each_entry(cachep, &slab_caches, list) { - struct array_cache *shared = NULL; - struct alien_cache **alien = NULL; - - if (cachep->shared) { - shared = alloc_arraycache(node, - cachep->shared * cachep->batchcount, - 0xbaadf00d, GFP_KERNEL); - if (!shared) - goto bad; - } - if (use_alien_caches) { - alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL); - if (!alien) { - kfree(shared); - goto bad; - } - } - n = get_node(cachep, node); - BUG_ON(!n); - - spin_lock_irq(&n->list_lock); - if (!n->shared) { - /* - * We are serialised from CPU_DEAD or - * CPU_UP_CANCELLED by the cpucontrol lock - */ - n->shared = shared; - shared = NULL; - } -#ifdef CONFIG_NUMA - if (!n->alien) { - n->alien = alien; - alien = NULL; - } -#endif - spin_unlock_irq(&n->list_lock); - kfree(shared); - free_alien_cache(alien); + err = setup_kmem_cache_node(cachep, node, GFP_KERNEL, false); + if (err) + goto bad; } return 0; @@ -3652,72 +3672,19 @@ EXPORT_SYMBOL(kfree); /* * This initializes kmem_cache_node or resizes various caches for all nodes. */ -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp) +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp) { + int ret; int node; struct kmem_cache_node *n; - struct array_cache *new_shared; - struct alien_cache **new_alien = NULL; for_each_online_node(node) { - - if (use_alien_caches) { - new_alien = alloc_alien_cache(node, cachep->limit, gfp); - if (!new_alien) - goto fail; - } - - new_shared = NULL; - if (cachep->shared) { - new_shared = alloc_arraycache(node, - cachep->shared*cachep->batchcount, - 0xbaadf00d, gfp); - if (!new_shared) { - free_alien_cache(new_alien); - goto fail; - } - } - - n = get_node(cachep, node); - if (n) { - struct array_cache *shared = n->shared; - LIST_HEAD(list); - - spin_lock_irq(&n->list_lock); - - if (shared) - free_block(cachep, shared->entry, - shared->avail, node, &list); - - n->shared = new_shared; - if (!n->alien) { - n->alien = new_alien; - new_alien = NULL; - } - n->free_limit = (1 + nr_cpus_node(node)) * - cachep->batchcount + cachep->num; - spin_unlock_irq(&n->list_lock); - slabs_destroy(cachep, &list); - kfree(shared); - free_alien_cache(new_alien); - continue; - } - n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node); - if (!n) { - free_alien_cache(new_alien); - kfree(new_shared); + ret = setup_kmem_cache_node(cachep, node, gfp, true); + if (ret) goto fail; - } - kmem_cache_node_init(n); - n->next_reap = jiffies + REAPTIMEOUT_NODE + - ((unsigned long)cachep) % REAPTIMEOUT_NODE; - n->shared = new_shared; - n->alien = new_alien; - n->free_limit = (1 + nr_cpus_node(node)) * - cachep->batchcount + cachep->num; - cachep->node[node] = n; } + return 0; fail: @@ -3759,7 +3726,7 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, cachep->shared = shared; if (!prev) - goto alloc_node; + goto setup_node; for_each_online_cpu(cpu) { LIST_HEAD(list); @@ -3776,8 +3743,8 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, } free_percpu(prev); -alloc_node: - return alloc_kmem_cache_node(cachep, gfp); +setup_node: + return setup_kmem_cache_node_node(cachep, gfp); } static int do_tune_cpucache(struct kmem_cache *cachep, int limit, -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754519AbcC1I6k (ORCPT ); Mon, 28 Mar 2016 04:58:40 -0400 Received: from mail-ig0-f195.google.com ([209.85.213.195]:33588 "EHLO mail-ig0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751488AbcC1I6j (ORCPT ); Mon, 28 Mar 2016 04:58:39 -0400 MIME-Version: 1.0 In-Reply-To: <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> Date: Mon, 28 Mar 2016 10:58:38 +0200 X-Google-Sender-Auth: 63Gl9_lHznv8q6X8_Z20Be6no_8 Message-ID: Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again From: Geert Uytterhoeven To: JoonSoo Kim Cc: Andrew Morton , Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , Linux MM , "linux-kernel@vger.kernel.org" , Joonsoo Kim Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Jonsoo, On Mon, Mar 28, 2016 at 7:26 AM, wrote: > From: Joonsoo Kim > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > because it causes a problem on m68k which has many node > but !CONFIG_NUMA. In this case, although alien cache isn't used > at all but to cope with some initialization path, garbage value > is used and that is BAD_ALIEN_MAGIC. Now, this patch set > use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization > path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. > > Signed-off-by: Joonsoo Kim I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous version that was reverted, so Tested-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755410AbcC1VTO (ORCPT ); Mon, 28 Mar 2016 17:19:14 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:57263 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754065AbcC1VTM (ORCPT ); Mon, 28 Mar 2016 17:19:12 -0400 Date: Mon, 28 Mar 2016 14:19:11 -0700 From: Andrew Morton To: js1304@gmail.com Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again Message-Id: <20160328141911.3048ab8d406b86a6e5b9f910@linux-foundation.org> In-Reply-To: <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Mar 2016 14:26:52 +0900 js1304@gmail.com wrote: > From: Joonsoo Kim > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > because it causes a problem on m68k which has many node > but !CONFIG_NUMA. Whaaa? How is that even possible? I'd have thought that everything would break at compile time (at least) with such a setup. > In this case, although alien cache isn't used > at all but to cope with some initialization path, garbage value > is used and that is BAD_ALIEN_MAGIC. Now, this patch set > use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization > path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. > > ... > > @@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void) > sizeof(struct rcu_head)); > kmem_cache = &kmem_cache_boot; > > - if (num_possible_nodes() == 1) > + if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1) > use_alien_caches = 0; > > for (i = 0; i < NUM_INIT_LISTS; i++) This does look screwy. How can num_possible_nodes() possibly return anything but "1" if CONFIG_NUMA=n. Can we please get a code comment in here to explain things to the poor old reader and to prevent people from trying to "fix" it? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755610AbcC2Aul (ORCPT ); Mon, 28 Mar 2016 20:50:41 -0400 Received: from resqmta-ch2-10v.sys.comcast.net ([69.252.207.42]:49441 "EHLO resqmta-ch2-10v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752162AbcC2Aui (ORCPT ); Mon, 28 Mar 2016 20:50:38 -0400 Date: Mon, 28 Mar 2016 19:50:36 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@east.gentwo.org To: js1304@gmail.com cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() In-Reply-To: <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Mar 2016, js1304@gmail.com wrote: > Major kmem_cache metadata in slab subsystem is synchronized with > the slab_mutex. In SLAB, if some of them is changed, node's shared > array cache would be freed and re-populated. If __kmem_cache_shrink() > is called at the same time, it will call drain_array() with n->shared > without holding node lock so problem can happen. > > We can fix this small theoretical race condition by holding node lock > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > looks more appropriate solution because stable state would make things > less error-prone and this is not performance critical path. Ummm.. The mutex taking is added to common code. So this will also affect SLUB. The patch needs to consider this. Do we want to force all allocators to run shrinking only when holding the lock? SLUB does not need to hold the mutex. And frankly the mutex is for reconfiguration of metadata which is *not* occurring here. A shrink operation does not do that. Can we figure out a slab specific way of handling synchronization in the strange free/realloc cycle? It seems that taking the node lock is the appropriate level of synchrnonization since the concern is with the contents of a shared cache at that level. There is no change of metadata which would require the mutex. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755622AbcC2Axp (ORCPT ); Mon, 28 Mar 2016 20:53:45 -0400 Received: from resqmta-po-02v.sys.comcast.net ([96.114.154.161]:43538 "EHLO resqmta-po-02v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755057AbcC2Axo (ORCPT ); Mon, 28 Mar 2016 20:53:44 -0400 Date: Mon, 28 Mar 2016 19:53:39 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@east.gentwo.org To: Andrew Morton cc: js1304@gmail.com, Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again In-Reply-To: <20160328141911.3048ab8d406b86a6e5b9f910@linux-foundation.org> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> <20160328141911.3048ab8d406b86a6e5b9f910@linux-foundation.org> Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Mar 2016, Andrew Morton wrote: > On Mon, 28 Mar 2016 14:26:52 +0900 js1304@gmail.com wrote: > > > From: Joonsoo Kim > > > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > > because it causes a problem on m68k which has many node > > but !CONFIG_NUMA. > > Whaaa? How is that even possible? I'd have thought that everything > would break at compile time (at least) with such a setup. Yes we have that and the support for this caused numerous issues. Can we stop supporting such a configuration? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755668AbcC2Ayh (ORCPT ); Mon, 28 Mar 2016 20:54:37 -0400 Received: from resqmta-ch2-09v.sys.comcast.net ([69.252.207.41]:38033 "EHLO resqmta-ch2-09v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752421AbcC2Ayf (ORCPT ); Mon, 28 Mar 2016 20:54:35 -0400 Date: Mon, 28 Mar 2016 19:54:33 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@east.gentwo.org To: js1304@gmail.com cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 03/11] mm/slab: drain the free slab as much as possible In-Reply-To: <1459142821-20303-4-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-4-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Acked-by: Christoph Lameter From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755690AbcC2A4X (ORCPT ); Mon, 28 Mar 2016 20:56:23 -0400 Received: from resqmta-ch2-09v.sys.comcast.net ([69.252.207.41]:38351 "EHLO resqmta-ch2-09v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755625AbcC2A4S (ORCPT ); Mon, 28 Mar 2016 20:56:18 -0400 Date: Mon, 28 Mar 2016 19:56:15 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@east.gentwo.org To: js1304@gmail.com cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code In-Reply-To: <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Mar 2016, js1304@gmail.com wrote: > From: Joonsoo Kim > - spin_lock_irq(&n->list_lock); > - n->free_limit = > - (1 + nr_cpus_node(node)) * > - cachep->batchcount + cachep->num; > - spin_unlock_irq(&n->list_lock); > + ret = init_cache_node(cachep, node, GFP_KERNEL); > + if (ret) > + return ret; Drop ret and do a return init_cache_node(...); instead? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755576AbcC2A6f (ORCPT ); Mon, 28 Mar 2016 20:58:35 -0400 Received: from resqmta-po-04v.sys.comcast.net ([96.114.154.163]:56137 "EHLO resqmta-po-04v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750749AbcC2A6c (ORCPT ); Mon, 28 Mar 2016 20:58:32 -0400 Date: Mon, 28 Mar 2016 19:58:29 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@east.gentwo.org To: js1304@gmail.com cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup In-Reply-To: <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Mar 2016, js1304@gmail.com wrote: > * This initializes kmem_cache_node or resizes various caches for all nodes. > */ > -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp) > +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp) ... _node_node? Isnt there a better name for it? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755500AbcC2BDV (ORCPT ); Mon, 28 Mar 2016 21:03:21 -0400 Received: from resqmta-po-06v.sys.comcast.net ([96.114.154.165]:59741 "EHLO resqmta-po-06v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751448AbcC2BDT (ORCPT ); Mon, 28 Mar 2016 21:03:19 -0400 Date: Mon, 28 Mar 2016 20:03:16 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@east.gentwo.org To: js1304@gmail.com cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit In-Reply-To: <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Mar 2016, js1304@gmail.com wrote: > From: Joonsoo Kim > > Currently, determination to free a slab is done whenever free object is > put into the slab. This has a problem that free slabs are not freed > even if we have free slabs and have more free_objects than free_limit There needs to be a better explanation here since I do not get why there is an issue with checking after free if a slab is actually free. > when processed slab isn't a free slab. This would cause to keep > too much memory in the slab subsystem. This patch try to fix it > by checking number of free object after all free work is done. If there > is free slab at that time, we can free it so we keep free slab as minimal > as possible. Ok if we check after free work is done then the number of free slabs may be higher than the limit set and then we free the additional slabs to get down to the limit that was set? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752473AbcC2BFo (ORCPT ); Mon, 28 Mar 2016 21:05:44 -0400 Received: from resqmta-po-09v.sys.comcast.net ([96.114.154.168]:39467 "EHLO resqmta-po-09v.sys.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750972AbcC2BFo (ORCPT ); Mon, 28 Mar 2016 21:05:44 -0400 Date: Mon, 28 Mar 2016 20:05:41 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@east.gentwo.org To: js1304@gmail.com cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim Subject: Re: [PATCH 07/11] mm/slab: racy access/modify the slab color In-Reply-To: <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> Message-ID: References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 28 Mar 2016, js1304@gmail.com wrote: > From: Joonsoo Kim > > Slab color isn't needed to be changed strictly. Because locking > for changing slab color could cause more lock contention so this patch > implements racy access/modify the slab color. This is a preparation step > to implement lockless allocation path when there is no free objects in > the kmem_cache. Acked-by: Christoph Lameter The rest of the description does not relate to this patch and does not actually reflect the improvement of applying this patch. Remove the rest? > Below is the result of concurrent allocation/free in slab allocation > benchmark made by Christoph a long time ago. I make the output simpler. > The number shows cycle count during alloc/free respectively so less > is better. > > * Before > Kmalloc N*alloc N*free(32): Average=365/806 > Kmalloc N*alloc N*free(64): Average=452/690 > Kmalloc N*alloc N*free(128): Average=736/886 > Kmalloc N*alloc N*free(256): Average=1167/985 > Kmalloc N*alloc N*free(512): Average=2088/1125 > Kmalloc N*alloc N*free(1024): Average=4115/1184 > Kmalloc N*alloc N*free(2048): Average=8451/1748 > Kmalloc N*alloc N*free(4096): Average=16024/2048 > > * After > Kmalloc N*alloc N*free(32): Average=355/750 > Kmalloc N*alloc N*free(64): Average=452/812 > Kmalloc N*alloc N*free(128): Average=559/1070 > Kmalloc N*alloc N*free(256): Average=1176/980 > Kmalloc N*alloc N*free(512): Average=1939/1189 > Kmalloc N*alloc N*free(1024): Average=3521/1278 > Kmalloc N*alloc N*free(2048): Average=7152/1838 > Kmalloc N*alloc N*free(4096): Average=13438/2013 > > It shows that contention is reduced for object size >= 1024 > and performance increases by roughly 15%. > > Signed-off-by: Joonsoo Kim > --- > mm/slab.c | 26 +++++++++++++------------- > 1 file changed, 13 insertions(+), 13 deletions(-) > > diff --git a/mm/slab.c b/mm/slab.c > index df11757..52fc5e3 100644 > --- a/mm/slab.c > +++ b/mm/slab.c > @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep, > } > local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); > > - /* Take the node list lock to change the colour_next on this node */ > check_irq_off(); > - n = get_node(cachep, nodeid); > - spin_lock(&n->list_lock); > - > - /* Get colour for the slab, and cal the next value. */ > - offset = n->colour_next; > - n->colour_next++; > - if (n->colour_next >= cachep->colour) > - n->colour_next = 0; > - spin_unlock(&n->list_lock); > - > - offset *= cachep->colour_off; > - > if (gfpflags_allow_blocking(local_flags)) > local_irq_enable(); > > @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep, > if (!page) > goto failed; > > + n = get_node(cachep, nodeid); > + > + /* Get colour for the slab, and cal the next value. */ > + n->colour_next++; > + if (n->colour_next >= cachep->colour) > + n->colour_next = 0; > + > + offset = n->colour_next; > + if (offset >= cachep->colour) > + offset = 0; > + > + offset *= cachep->colour_off; > + > /* Get slab management. */ > freelist = alloc_slabmgmt(cachep, page, offset, > local_flags & ~GFP_CONSTRAINT_MASK, nodeid); > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932715AbcC3IJk (ORCPT ); Wed, 30 Mar 2016 04:09:40 -0400 Received: from LGEAMRELO11.lge.com ([156.147.23.51]:41287 "EHLO lgeamrelo11.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932610AbcC3IJh (ORCPT ); Wed, 30 Mar 2016 04:09:37 -0400 X-Original-SENDERIP: 156.147.1.125 X-Original-MAILFROM: iamjoonsoo.kim@lge.com X-Original-SENDERIP: 10.177.222.138 X-Original-MAILFROM: iamjoonsoo.kim@lge.com Date: Wed, 30 Mar 2016 17:11:35 +0900 From: Joonsoo Kim To: Geert Uytterhoeven Cc: Andrew Morton , Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , Linux MM , "linux-kernel@vger.kernel.org" Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again Message-ID: <20160330081135.GB1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-3-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 10:58:38AM +0200, Geert Uytterhoeven wrote: > Hi Jonsoo, > > On Mon, Mar 28, 2016 at 7:26 AM, wrote: > > From: Joonsoo Kim > > > > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by > > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' > > because it causes a problem on m68k which has many node > > but !CONFIG_NUMA. In this case, although alien cache isn't used > > at all but to cope with some initialization path, garbage value > > is used and that is BAD_ALIEN_MAGIC. Now, this patch set > > use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization > > path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it. > > > > Signed-off-by: Joonsoo Kim > > I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous > version that was reverted, so > Tested-by: Geert Uytterhoeven Thanks for testing!!! Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932732AbcC3IJ7 (ORCPT ); Wed, 30 Mar 2016 04:09:59 -0400 Received: from LGEAMRELO13.lge.com ([156.147.23.53]:58002 "EHLO lgeamrelo13.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932622AbcC3IJT (ORCPT ); Wed, 30 Mar 2016 04:09:19 -0400 X-Original-SENDERIP: 156.147.1.121 X-Original-MAILFROM: iamjoonsoo.kim@lge.com X-Original-SENDERIP: 10.177.222.138 X-Original-MAILFROM: iamjoonsoo.kim@lge.com Date: Wed, 30 Mar 2016 17:11:16 +0900 From: Joonsoo Kim To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() Message-ID: <20160330081116.GA1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 07:50:36PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > Major kmem_cache metadata in slab subsystem is synchronized with > > the slab_mutex. In SLAB, if some of them is changed, node's shared > > array cache would be freed and re-populated. If __kmem_cache_shrink() > > is called at the same time, it will call drain_array() with n->shared > > without holding node lock so problem can happen. > > > > We can fix this small theoretical race condition by holding node lock > > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > > looks more appropriate solution because stable state would make things > > less error-prone and this is not performance critical path. > > Ummm.. The mutex taking is added to common code. So this will also affect > SLUB. The patch needs to consider this. Do we want to force all > allocators to run shrinking only when holding the lock? SLUB does not > need to hold the mutex. And frankly the mutex is for reconfiguration of > metadata which is *not* occurring here. A shrink operation does not do > that. Can we figure out a slab specific way of handling synchronization > in the strange free/realloc cycle? > > It seems that taking the node lock is the appropriate level of > synchrnonization since the concern is with the contents of a shared cache > at that level. There is no change of metadata which would require the > mutex. Okay. I will fix it. Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758758AbcC3ILD (ORCPT ); Wed, 30 Mar 2016 04:11:03 -0400 Received: from LGEAMRELO12.lge.com ([156.147.23.52]:56064 "EHLO lgeamrelo12.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753336AbcC3IK7 (ORCPT ); Wed, 30 Mar 2016 04:10:59 -0400 X-Original-SENDERIP: 156.147.1.121 X-Original-MAILFROM: iamjoonsoo.kim@lge.com X-Original-SENDERIP: 10.177.222.138 X-Original-MAILFROM: iamjoonsoo.kim@lge.com Date: Wed, 30 Mar 2016 17:12:58 +0900 From: Joonsoo Kim To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code Message-ID: <20160330081258.GC1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-5-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 07:56:15PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > From: Joonsoo Kim > > - spin_lock_irq(&n->list_lock); > > - n->free_limit = > > - (1 + nr_cpus_node(node)) * > > - cachep->batchcount + cachep->num; > > - spin_unlock_irq(&n->list_lock); > > + ret = init_cache_node(cachep, node, GFP_KERNEL); > > + if (ret) > > + return ret; > > Drop ret and do a > > return init_cache_node(...); > > instead? Will do it. Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932703AbcC3INb (ORCPT ); Wed, 30 Mar 2016 04:13:31 -0400 Received: from LGEAMRELO11.lge.com ([156.147.23.51]:42254 "EHLO lgeamrelo11.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932643AbcC3IN3 (ORCPT ); Wed, 30 Mar 2016 04:13:29 -0400 X-Original-SENDERIP: 156.147.1.125 X-Original-MAILFROM: iamjoonsoo.kim@lge.com X-Original-SENDERIP: 10.177.222.138 X-Original-MAILFROM: iamjoonsoo.kim@lge.com Date: Wed, 30 Mar 2016 17:15:27 +0900 From: Joonsoo Kim To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup Message-ID: <20160330081526.GD1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-6-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 07:58:29PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > * This initializes kmem_cache_node or resizes various caches for all nodes. > > */ > > -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp) > > +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp) > > ... _node_node? Isnt there a better name for it? I will think it more. Reason I use this naming is that there is other site that uses this naming convention. I'm just mimicking it. :) It's very appreaciate if you have a suggestion. Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932765AbcC3IXl (ORCPT ); Wed, 30 Mar 2016 04:23:41 -0400 Received: from LGEAMRELO11.lge.com ([156.147.23.51]:43911 "EHLO lgeamrelo11.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932551AbcC3IXS (ORCPT ); Wed, 30 Mar 2016 04:23:18 -0400 X-Original-SENDERIP: 156.147.1.126 X-Original-MAILFROM: iamjoonsoo.kim@lge.com X-Original-SENDERIP: 10.177.222.138 X-Original-MAILFROM: iamjoonsoo.kim@lge.com Date: Wed, 30 Mar 2016 17:25:14 +0900 From: Joonsoo Kim To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit Message-ID: <20160330082514.GE1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-7-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 08:03:16PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > From: Joonsoo Kim > > > > Currently, determination to free a slab is done whenever free object is > > put into the slab. This has a problem that free slabs are not freed > > even if we have free slabs and have more free_objects than free_limit > > There needs to be a better explanation here since I do not get why there > is an issue with checking after free if a slab is actually free. Okay. Consider following 3 objects free situation. free_limt = 10 nr_free = 9 free(free slab) free(free slab) free(not free slab) If we check it one by one, when nr_free > free_limit (at last free), we cannot free the slab because current slab isn't a free slab. But, if we check it lastly, we can free 1 free slab. I will add more explanation on the next version. > > > when processed slab isn't a free slab. This would cause to keep > > too much memory in the slab subsystem. This patch try to fix it > > by checking number of free object after all free work is done. If there > > is free slab at that time, we can free it so we keep free slab as minimal > > as possible. > > Ok if we check after free work is done then the number of free slabs may > be higher than the limit set and then we free the additional slabs to get > down to the limit that was set? Yes. Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932797AbcC3IYJ (ORCPT ); Wed, 30 Mar 2016 04:24:09 -0400 Received: from LGEAMRELO11.lge.com ([156.147.23.51]:44043 "EHLO lgeamrelo11.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932594AbcC3IYC (ORCPT ); Wed, 30 Mar 2016 04:24:02 -0400 X-Original-SENDERIP: 156.147.1.126 X-Original-MAILFROM: iamjoonsoo.kim@lge.com X-Original-SENDERIP: 10.177.222.138 X-Original-MAILFROM: iamjoonsoo.kim@lge.com Date: Wed, 30 Mar 2016 17:25:57 +0900 From: Joonsoo Kim To: Christoph Lameter Cc: Andrew Morton , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 07/11] mm/slab: racy access/modify the slab color Message-ID: <20160330082557.GF1678@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-8-git-send-email-iamjoonsoo.kim@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 28, 2016 at 08:05:41PM -0500, Christoph Lameter wrote: > On Mon, 28 Mar 2016, js1304@gmail.com wrote: > > > From: Joonsoo Kim > > > > Slab color isn't needed to be changed strictly. Because locking > > for changing slab color could cause more lock contention so this patch > > implements racy access/modify the slab color. This is a preparation step > > to implement lockless allocation path when there is no free objects in > > the kmem_cache. > > Acked-by: Christoph Lameter > > The rest of the description does not relate to this patch and does not > actually reflect the improvement of applying this patch. Remove the rest? No, below improvement is for this individual patch. Thanks. > > > > Below is the result of concurrent allocation/free in slab allocation > > benchmark made by Christoph a long time ago. I make the output simpler. > > The number shows cycle count during alloc/free respectively so less > > is better. > > > > * Before > > Kmalloc N*alloc N*free(32): Average=365/806 > > Kmalloc N*alloc N*free(64): Average=452/690 > > Kmalloc N*alloc N*free(128): Average=736/886 > > Kmalloc N*alloc N*free(256): Average=1167/985 > > Kmalloc N*alloc N*free(512): Average=2088/1125 > > Kmalloc N*alloc N*free(1024): Average=4115/1184 > > Kmalloc N*alloc N*free(2048): Average=8451/1748 > > Kmalloc N*alloc N*free(4096): Average=16024/2048 > > > > * After > > Kmalloc N*alloc N*free(32): Average=355/750 > > Kmalloc N*alloc N*free(64): Average=452/812 > > Kmalloc N*alloc N*free(128): Average=559/1070 > > Kmalloc N*alloc N*free(256): Average=1176/980 > > Kmalloc N*alloc N*free(512): Average=1939/1189 > > Kmalloc N*alloc N*free(1024): Average=3521/1278 > > Kmalloc N*alloc N*free(2048): Average=7152/1838 > > Kmalloc N*alloc N*free(4096): Average=13438/2013 > > > > It shows that contention is reduced for object size >= 1024 > > and performance increases by roughly 15%. > > > > Signed-off-by: Joonsoo Kim > > --- > > mm/slab.c | 26 +++++++++++++------------- > > 1 file changed, 13 insertions(+), 13 deletions(-) > > > > diff --git a/mm/slab.c b/mm/slab.c > > index df11757..52fc5e3 100644 > > --- a/mm/slab.c > > +++ b/mm/slab.c > > @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep, > > } > > local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); > > > > - /* Take the node list lock to change the colour_next on this node */ > > check_irq_off(); > > - n = get_node(cachep, nodeid); > > - spin_lock(&n->list_lock); > > - > > - /* Get colour for the slab, and cal the next value. */ > > - offset = n->colour_next; > > - n->colour_next++; > > - if (n->colour_next >= cachep->colour) > > - n->colour_next = 0; > > - spin_unlock(&n->list_lock); > > - > > - offset *= cachep->colour_off; > > - > > if (gfpflags_allow_blocking(local_flags)) > > local_irq_enable(); > > > > @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep, > > if (!page) > > goto failed; > > > > + n = get_node(cachep, nodeid); > > + > > + /* Get colour for the slab, and cal the next value. */ > > + n->colour_next++; > > + if (n->colour_next >= cachep->colour) > > + n->colour_next = 0; > > + > > + offset = n->colour_next; > > + if (offset >= cachep->colour) > > + offset = 0; > > + > > + offset *= cachep->colour_off; > > + > > /* Get slab management. */ > > freelist = alloc_slabmgmt(cachep, page, offset, > > local_flags & ~GFP_CONSTRAINT_MASK, nodeid); > > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755889AbcCaKxU (ORCPT ); Thu, 31 Mar 2016 06:53:20 -0400 Received: from mail-wm0-f48.google.com ([74.125.82.48]:35536 "EHLO mail-wm0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752240AbcCaKxT (ORCPT ); Thu, 31 Mar 2016 06:53:19 -0400 Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() To: js1304@gmail.com, Andrew Morton References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> Cc: Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joonsoo Kim From: Nikolay Borisov Message-ID: <56FD019A.10906@kyup.com> Date: Thu, 31 Mar 2016 13:53:14 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0 MIME-Version: 1.0 In-Reply-To: <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/28/2016 08:26 AM, js1304@gmail.com wrote: > From: Joonsoo Kim > > Major kmem_cache metadata in slab subsystem is synchronized with > the slab_mutex. In SLAB, if some of them is changed, node's shared > array cache would be freed and re-populated. If __kmem_cache_shrink() > is called at the same time, it will call drain_array() with n->shared > without holding node lock so problem can happen. > > We can fix this small theoretical race condition by holding node lock > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > looks more appropriate solution because stable state would make things > less error-prone and this is not performance critical path. > > In addtion, annotate on SLAB functions. Just a nit but would it not be better instead of doing comment-style annotation to use lockdep_assert_held/_once. In both cases for someone to understand what locks have to be held will go and read the source. In my mind it's easier to miss a comment line, rather than the lockdep_assert. Furthermore in case lockdep is enabled a locking violation would spew useful info to dmesg. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757647AbcDACQE (ORCPT ); Thu, 31 Mar 2016 22:16:04 -0400 Received: from LGEAMRELO12.lge.com ([156.147.23.52]:36128 "EHLO lgeamrelo12.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752210AbcDACQD (ORCPT ); Thu, 31 Mar 2016 22:16:03 -0400 X-Original-SENDERIP: 156.147.1.121 X-Original-MAILFROM: iamjoonsoo.kim@lge.com X-Original-SENDERIP: 10.177.222.138 X-Original-MAILFROM: iamjoonsoo.kim@lge.com Date: Fri, 1 Apr 2016 11:18:07 +0900 From: Joonsoo Kim To: Nikolay Borisov Cc: Andrew Morton , Christoph Lameter , Pekka Enberg , David Rientjes , Jesper Dangaard Brouer , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() Message-ID: <20160401021806.GA13179@js1304-P5Q-DELUXE> References: <1459142821-20303-1-git-send-email-iamjoonsoo.kim@lge.com> <1459142821-20303-2-git-send-email-iamjoonsoo.kim@lge.com> <56FD019A.10906@kyup.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <56FD019A.10906@kyup.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 31, 2016 at 01:53:14PM +0300, Nikolay Borisov wrote: > > > On 03/28/2016 08:26 AM, js1304@gmail.com wrote: > > From: Joonsoo Kim > > > > Major kmem_cache metadata in slab subsystem is synchronized with > > the slab_mutex. In SLAB, if some of them is changed, node's shared > > array cache would be freed and re-populated. If __kmem_cache_shrink() > > is called at the same time, it will call drain_array() with n->shared > > without holding node lock so problem can happen. > > > > We can fix this small theoretical race condition by holding node lock > > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink() > > looks more appropriate solution because stable state would make things > > less error-prone and this is not performance critical path. > > > > In addtion, annotate on SLAB functions. > > Just a nit but would it not be better instead of doing comment-style > annotation to use lockdep_assert_held/_once. In both cases for someone > to understand what locks have to be held will go and read the source. In > my mind it's easier to miss a comment line, rather than the > lockdep_assert. Furthermore in case lockdep is enabled a locking > violation would spew useful info to dmesg. Good idea. I'm not sure if lockdep_assert is best fit but I will add something to check it rather than just adding the comment. Thanks.