From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35])
 by kanga.kvack.org (Postfix) with SMTP id C670D8D0040
 for ; Wed, 30 Mar 2011 16:24:17 -0400 (EDT)
Message-Id: <20110330202342.669400887@linux.com>
Date: Wed, 30 Mar 2011 15:23:42 -0500
From: Christoph Lameter
Subject: [slubll1 00/19] SLUB: Implement mostly lockless slowpaths
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: David Rientjes, linux-mm@kvack.org, Eric Dumazet, "H. Peter Anvin", Mathieu Desnoyers

Well, here is another result of my obsession with SLAB allocators. There must
be some way to get an allocator done that is faster without queueing and
I hope that we are now there (maybe only almost...).

This patchset implements wider lockless operations in slub affecting most of
the slowpaths. In particular the patch decreases the overhead in the
performance critical section of __slab_free.

One test that I ran was "hackbench 200 process 200" on 2.6.29-rc1 under KVM

Run	SLAB	SLUB	SLUB LL
1st	35.2	35.9	31.9
2nd	34.6	30.8	27.9
3rd	33.8	29.9	28.8

Note that the SLUB version in 2.6.29-rc1 already has an optimized allocation
and free path using this_cpu_cmpxchg_double(). SLUB LL takes it to new heights
by also using cmpxchg_double() in the slowpaths (especially in the kfree()
case where we cannot queue).

The patch uses a cmpxchg_double (also introduced here) to do an atomic change
on the state of a slab page that includes the following pieces of information:

1. Freelist pointer
2. Number of objects inuse
3. Frozen state of a slab

Disabling of interrupts (which is a significant latency in the
allocator paths) is avoided in the __slab_free case.

There are some concerns with this patch. The use of cmpxchg_double on
fields of the page struct requires alignment of the fields to double
word boundaries. That can only be accomplished by adding some padding
to struct page which blows it up to 64 bytes (on x86_64).
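As a rough userspace illustration of the idea described above (not the kernel
code in this series), the sketch below packs a freelist head, an inuse count,
an object count and a frozen bit into one 64-bit word and updates them with a
single compare-and-swap loop. Assumptions: the freelist is a 32-bit object
index rather than a pointer so that the whole state fits in one word, C11
atomics stand in for cmpxchg_double(), and the names slab_state, pack_state,
unpack_state and toy_slab_free are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_OBJECT UINT32_MAX	/* hypothetical end-of-freelist marker */

/* Hypothetical stand-in for the per-slab state changed in one atomic step. */
struct slab_state {
	uint32_t freelist;	/* index of the first free object */
	uint16_t inuse;		/* number of allocated objects */
	uint16_t objects;	/* total objects, only 15 bits are kept */
	bool frozen;		/* slab is currently owned by a cpu */
};

static uint64_t pack_state(struct slab_state s)
{
	return (uint64_t)s.freelist |
	       ((uint64_t)s.inuse << 32) |
	       ((uint64_t)(s.objects & 0x7fff) << 48) |
	       ((uint64_t)s.frozen << 63);
}

static struct slab_state unpack_state(uint64_t w)
{
	struct slab_state s = {
		.freelist = (uint32_t)w,
		.inuse = (uint16_t)(w >> 32),
		.objects = (uint16_t)((w >> 48) & 0x7fff),
		.frozen = (w >> 63) != 0,
	};
	return s;
}

/*
 * Free object 'obj': link it in front of the current freelist head and
 * decrement inuse, retrying if another thread changed the state in the
 * meantime. next[] holds the per-object "next free" index. Returns the
 * frozen bit seen by the successful update.
 */
static bool toy_slab_free(_Atomic uint64_t *state, uint32_t *next, uint32_t obj)
{
	uint64_t old = atomic_load(state);
	struct slab_state s;

	do {
		s = unpack_state(old);
		next[obj] = s.freelist;
		s.freelist = obj;
		s.inuse--;
	} while (!atomic_compare_exchange_weak(state, &old, pack_state(s)));

	return s.frozen;
}
```

Only the free (push) direction is shown. It needs neither a lock nor disabled
interrupts because a failed compare-and-swap simply reloads the state and
retries.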
Comments in the source describe these things in more detail.

The cmpxchg_double() operation introduced here could also be used to
update other doublewords in the page struct in a lockless fashion. One can
envision page state changes that involve flags and mappings or do
list operations locklessly.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19])
 by kanga.kvack.org (Postfix) with SMTP id 3A7A48D0040
 for ; Wed, 30 Mar 2011 16:24:19 -0400 (EDT)
Message-Id: <20110330202416.864916763@linux.com>
Date: Wed, 30 Mar 2011 15:23:45 -0500
From: Christoph Lameter
Subject: [slubll1 03/19] slub: Eliminate repeated use of c->page through a new page variable
References: <20110330202342.669400887@linux.com>
Content-Disposition: inline; filename=avoid_c_page
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: David Rientjes, linux-mm@kvack.org, Eric Dumazet, "H. Peter Anvin", Mathieu Desnoyers

__slab_alloc is full of "c->page" repeats. Let's just use one local variable
named "page" for this. Also avoids the need to have another variable
called "new".

Signed-off-by: Christoph Lameter --- mm/slub.c | 41 ++++++++++++++++++++++------------------- 1 file changed, 22 insertions(+), 19 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:30:24.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:30:51.000000000 -0500 @@ -1790,7 +1790,7 @@ static void *__slab_alloc(struct kmem_ca unsigned long addr, struct kmem_cache_cpu *c) { void **object; - struct page *new; + struct page *page; #ifdef CONFIG_CMPXCHG_LOCAL unsigned long flags; @@ -1808,28 +1808,30 @@ static void *__slab_alloc(struct kmem_ca /* We handle __GFP_ZERO in the caller */ gfpflags &= ~__GFP_ZERO; - if (!c->page) + page = c->page; + if (!page) goto new_slab; - slab_lock(c->page); + slab_lock(page); if (unlikely(!node_match(c, node))) goto another_slab; stat(s, ALLOC_REFILL); load_freelist: - object = c->page->freelist; + object = page->freelist; if (unlikely(!object)) goto another_slab; if (kmem_cache_debug(s)) goto debug; c->freelist = get_freepointer(s, object); - c->page->inuse = c->page->objects; - c->page->freelist = NULL; - c->node = page_to_nid(c->page); + page->inuse = page->objects; + page->freelist = NULL; + c->node = page_to_nid(page); + unlock_out: - slab_unlock(c->page); + slab_unlock(page); #ifdef CONFIG_CMPXCHG_LOCAL c->tid = next_tid(c->tid); local_irq_restore(flags); @@ -1841,9 +1843,9 @@ another_slab: deactivate_slab(s, c); new_slab: - new = get_partial(s, gfpflags, node); - if (new) { - c->page = new; + page = get_partial(s, gfpflags, node); + if (page) { + c->page = page; stat(s, ALLOC_FROM_PARTIAL); goto load_freelist; } @@ -1852,19 +1854,20 @@ new_slab: if (gfpflags & __GFP_WAIT) local_irq_enable(); - new = new_slab(s, gfpflags, node); + page = new_slab(s, gfpflags, node); if (gfpflags & __GFP_WAIT) local_irq_disable(); - if (new) { + if (page) { c = __this_cpu_ptr(s->cpu_slab); stat(s, ALLOC_SLAB); if (c->page) flush_slab(s, c); - 
slab_lock(new); - __SetPageSlubFrozen(new); - c->page = new; + + slab_lock(page); + __SetPageSlubFrozen(page); + c->page = page; goto load_freelist; } if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) @@ -1874,11 +1877,11 @@ new_slab: #endif return NULL; debug: - if (!alloc_debug_processing(s, c->page, object, addr)) + if (!alloc_debug_processing(s, page, object, addr)) goto another_slab; - c->page->inuse++; - c->page->freelist = get_freepointer(s, object); + page->inuse++; + page->freelist = get_freepointer(s, object); c->node = NUMA_NO_NODE; goto unlock_out; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 56C748D0048 for ; Wed, 30 Mar 2011 16:24:19 -0400 (EDT) Message-Id: <20110330202415.596399815@linux.com> Date: Wed, 30 Mar 2011 15:23:43 -0500 From: Christoph Lameter Subject: [slubll1 01/19] percpu: Fixup __this_cpu_xchg* operations References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=fixup_xchg Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers Somehow we got into a situation where the __this_cpu_xchg() operations were not defined in the same way as this_cpu_xchg() and friends. I had some build failures under 32 bit compiles that were addressed by these fixes. 
Signed-off-by: Christoph Lameter --- arch/x86/include/asm/percpu.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux-2.6/arch/x86/include/asm/percpu.h =================================================================== --- linux-2.6.orig/arch/x86/include/asm/percpu.h 2011-03-30 14:09:28.000000000 -0500 +++ linux-2.6/arch/x86/include/asm/percpu.h 2011-03-30 14:10:16.000000000 -0500 @@ -388,12 +388,9 @@ do { \ #define __this_cpu_xor_1(pcp, val) percpu_to_op("xor", (pcp), val) #define __this_cpu_xor_2(pcp, val) percpu_to_op("xor", (pcp), val) #define __this_cpu_xor_4(pcp, val) percpu_to_op("xor", (pcp), val) -/* - * Generic fallback operations for __this_cpu_xchg_[1-4] are okay and much - * faster than an xchg with forced lock semantics. - */ -#define __this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) -#define __this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) +#define __this_cpu_xchg_1(pcp, val) percpu_xchg_op(pcp, val) +#define __this_cpu_xchg_2(pcp, val) percpu_xchg_op(pcp, val) +#define __this_cpu_xchg_4(pcp, val) percpu_xchg_op(pcp, val) #define this_cpu_read_1(pcp) percpu_from_op("mov", (pcp), "m"(pcp)) #define this_cpu_read_2(pcp) percpu_from_op("mov", (pcp), "m"(pcp)) @@ -471,6 +468,7 @@ do { \ #define __this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2) #define this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2) #define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2) + #endif /* CONFIG_X86_CMPXCHG64 */ /* @@ -485,6 +483,8 @@ do { \ #define __this_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val) #define __this_cpu_xor_8(pcp, val) percpu_to_op("xor", (pcp), val) #define __this_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val) +#define __this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) +#define __this_cpu_cmpxchg_8(pcp, oval, nval) 
percpu_cmpxchg_op(pcp, oval, nval) #define this_cpu_read_8(pcp) percpu_from_op("mov", (pcp), "m"(pcp)) #define this_cpu_write_8(pcp, val) percpu_to_op("mov", (pcp), val) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 25D238D0040 for ; Wed, 30 Mar 2011 16:24:20 -0400 (EDT) Message-Id: <20110330202416.210524653@linux.com> Date: Wed, 30 Mar 2011 15:23:44 -0500 From: Christoph Lameter Subject: [slubll1 02/19] slub: get_map() function to establish map of free objects in a slab References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=slub_slowpath_get_map Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers The bit map of free objects in a slab page is determined in various functions if debugging is enabled. Provide a common function for that purpose. 
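For readers unfamiliar with the freelist layout, here is a minimal userspace
sketch of what such a common helper does: walk a singly linked freelist that
is threaded through the free objects and set one bit per free object. It is
an illustration under assumptions, not the kernel function: the next-free
pointer is assumed to live at offset 0 of each free object, and toy_slab,
toy_index and toy_get_map are hypothetical names.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical simplified slab: equally sized slots starting at addr. */
struct toy_slab {
	void *addr;	/* start of the object area */
	size_t size;	/* size of one slot in bytes */
	void *freelist;	/* first free slot, or NULL */
};

/* Index of the slot that p points to (plays the role of slab_index()). */
static size_t toy_index(const struct toy_slab *s, const void *p)
{
	return (size_t)((const char *)p - (const char *)s->addr) / s->size;
}

/*
 * Set a bit in map for every slot on the freelist. The caller must zero
 * the map first and must keep the slab stable while it is walked.
 */
static void toy_get_map(const struct toy_slab *s, unsigned long *map)
{
	for (void *p = s->freelist; p; p = *(void **)p) {
		size_t i = toy_index(s, p);

		map[i / TOY_BITS_PER_LONG] |= 1UL << (i % TOY_BITS_PER_LONG);
	}
}
```

A caller that wants the allocated objects then iterates over all slots and
picks those whose bit is clear, which is the pattern the debug paths follow.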
Signed-off-by: Christoph Lameter --- mm/slub.c | 34 ++++++++++++++++++++++------------ 1 file changed, 22 insertions(+), 12 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:09:27.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:30:24.000000000 -0500 @@ -271,10 +271,6 @@ static inline void set_freepointer(struc for (__p = (__addr); __p < (__addr) + (__objects) * (__s)->size;\ __p += (__s)->size) -/* Scan freelist */ -#define for_each_free_object(__p, __s, __free) \ - for (__p = (__free); __p; __p = get_freepointer((__s), __p)) - /* Determine object index from a given position */ static inline int slab_index(void *p, struct kmem_cache *s, void *addr) { @@ -330,6 +326,21 @@ static inline int oo_objects(struct kmem return x.x & OO_MASK; } +/* + * Determine a map of objects in use on a page. + * + * Slab lock or node listlock must be held to guarantee that the page does + * not vanish from under us. 
+ */ +static void get_map(struct kmem_cache *s, struct page *page, unsigned long *map) +{ + void *p; + void *addr = page_address(page); + + for (p = page->freelist; p; p = get_freepointer(s, p)) + set_bit(slab_index(p, s, addr), map); +} + #ifdef CONFIG_SLUB_DEBUG /* * Debug settings: @@ -2673,9 +2684,8 @@ static void list_slab_objects(struct kme return; slab_err(s, page, "%s", text); slab_lock(page); - for_each_free_object(p, s, page->freelist) - set_bit(slab_index(p, s, addr), map); + get_map(s, page, map); for_each_object(p, s, addr, page->objects) { if (!test_bit(slab_index(p, s, addr), map)) { @@ -3610,10 +3620,11 @@ static int validate_slab(struct kmem_cac /* Now we know that a valid freelist exists */ bitmap_zero(map, page->objects); - for_each_free_object(p, s, page->freelist) { - set_bit(slab_index(p, s, addr), map); - if (!check_object(s, page, p, SLUB_RED_INACTIVE)) - return 0; + get_map(s, page, map); + for_each_object(p, s, addr, page->objects) { + if (test_bit(slab_index(p, s, addr), map)) + if (!check_object(s, page, p, SLUB_RED_INACTIVE)) + return 0; } for_each_object(p, s, addr, page->objects) @@ -3821,8 +3832,7 @@ static void process_slab(struct loc_trac void *p; bitmap_zero(map, page->objects); - for_each_free_object(p, s, page->freelist) - set_bit(slab_index(p, s, addr), map); + get_map(s, page, map); for_each_object(p, s, addr, page->objects) if (!test_bit(slab_index(p, s, addr), map)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 2DC868D004B for ; Wed, 30 Mar 2011 16:24:20 -0400 (EDT) Message-Id: <20110330202417.488647335@linux.com> Date: Wed, 30 Mar 2011 15:23:46 -0500 From: Christoph Lameter Subject: [slubll1 04/19] slub: Move node determination out of hotpath References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=move_slab_node Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers If the node does not change then there is no need to recalculate the node from the page struct. So move the node determination into the places where we acquire a new slab page. Signed-off-by: Christoph Lameter --- mm/slub.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-25 15:12:43.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-28 08:50:41.000000000 -0500 @@ -1822,7 +1822,6 @@ load_freelist: c->freelist = get_freepointer(s, object); page->inuse = page->objects; page->freelist = NULL; - c->node = page_to_nid(page); unlock_out: slab_unlock(page); @@ -1839,8 +1838,10 @@ another_slab: new_slab: page = get_partial(s, gfpflags, node); if (page) { - c->page = page; stat(s, ALLOC_FROM_PARTIAL); +load_from_page: + c->node = page_to_nid(page); + c->page = page; goto load_freelist; } @@ -1861,8 +1862,8 @@ new_slab: slab_lock(page); __SetPageSlubFrozen(page); - c->page = page; - goto load_freelist; + + goto load_from_page; } if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) slab_out_of_memory(s, gfpflags, node); -- To unsubscribe, send a message with 'unsubscribe 
linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id D58FE8D0049 for ; Wed, 30 Mar 2011 16:24:20 -0400 (EDT) Message-Id: <20110330202418.145015838@linux.com> Date: Wed, 30 Mar 2011 15:23:47 -0500 From: Christoph Lameter Subject: [slubll1 05/19] slub: Move debug handling in __slab_free References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=move_debug Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers It's easier to read if it's with the check for debugging flags. Signed-off-by: Christoph Lameter --- mm/slub.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-28 14:52:13.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-28 14:52:58.000000000 -0500 @@ -2051,10 +2051,9 @@ static void __slab_free(struct kmem_cach slab_lock(page); stat(s, FREE_SLOWPATH); - if (kmem_cache_debug(s)) - goto debug; + if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr)) + goto out_unlock; -checks_ok: prior = page->freelist; set_freepointer(s, object, prior); page->freelist = object; @@ -2098,12 +2097,6 @@ slab_empty: #endif stat(s, FREE_SLAB); discard_slab(s, page); - return; - -debug: - if (!free_debug_processing(s, page, x, addr)) - goto out_unlock; - goto checks_ok; } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 9D9DA8D004D for ; Wed, 30 Mar 2011 16:24:21 -0400 (EDT) Message-Id: <20110330202418.795427129@linux.com> Date: Wed, 30 Mar 2011 15:23:48 -0500 From: Christoph Lameter Subject: [slubll1 06/19] slub: Do not use frozen page flag but a bit in the page counters References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=frozen_field Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers Do not use a page flag for the frozen bit. It needs to be part of the state that is handled with cmpxchg_double(). So use a bit in the counter struct in the page struct for that purpose. Also all pages start out as frozen pages so set the bit when the page is allocated. Signed-off-by: Christoph Lameter --- include/linux/mm_types.h | 5 +++-- include/linux/page-flags.h | 2 -- mm/slub.c | 12 ++++++------ 3 files changed, 9 insertions(+), 10 deletions(-) Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2011-03-30 14:29:23.000000000 -0500 +++ linux-2.6/include/linux/mm_types.h 2011-03-30 14:32:15.000000000 -0500 @@ -41,8 +41,9 @@ struct page { * & limit reverse map searches. 
*/ struct { /* SLUB */ - u16 inuse; - u16 objects; + unsigned inuse:16; + unsigned objects:15; + unsigned frozen:1; }; }; union { Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2011-03-30 14:29:23.000000000 -0500 +++ linux-2.6/include/linux/page-flags.h 2011-03-30 14:32:15.000000000 -0500 @@ -212,8 +212,6 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEAR __PAGEFLAG(SlobFree, slob_free) -__PAGEFLAG(SlubFrozen, slub_frozen) - /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:31:27.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:32:15.000000000 -0500 @@ -166,7 +166,7 @@ static inline int kmem_cache_debug(struc #define OO_SHIFT 16 #define OO_MASK ((1 << OO_SHIFT) - 1) -#define MAX_OBJS_PER_PAGE 65535 /* since page.objects is u16 */ +#define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */ /* Internal SLUB flags */ #define __OBJECT_POISON 0x80000000UL /* Poison object */ @@ -1013,7 +1013,7 @@ static noinline int free_debug_processin } /* Special debug activities for freeing objects */ - if (!PageSlubFrozen(page) && !page->freelist) + if (!page->frozen && !page->freelist) remove_full(s, page); if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_FREE, addr); @@ -1402,7 +1402,7 @@ static inline int lock_and_freeze_slab(s { if (slab_trylock(page)) { __remove_partial(n, page); - __SetPageSlubFrozen(page); + page->frozen = 1; return 1; } return 0; @@ -1516,7 +1516,7 @@ static void unfreeze_slab(struct kmem_ca { struct kmem_cache_node *n = get_node(s, page_to_nid(page)); - __ClearPageSlubFrozen(page); + page->frozen = 0; if (page->inuse) { if (page->freelist) { @@ -1867,7 +1867,7 @@ load_from_page: flush_slab(s, c); slab_lock(page); - __SetPageSlubFrozen(page); + 
page->frozen = 1; goto load_from_page; } @@ -2065,7 +2065,7 @@ static void __slab_free(struct kmem_cach page->freelist = object; page->inuse--; - if (unlikely(PageSlubFrozen(page))) { + if (unlikely(page->frozen)) { stat(s, FREE_FROZEN); goto out_unlock; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 58A3B8D004B for ; Wed, 30 Mar 2011 16:24:22 -0400 (EDT) Message-Id: <20110330202420.088460266@linux.com> Date: Wed, 30 Mar 2011 15:23:50 -0500 From: Christoph Lameter Subject: [slubll1 08/19] x86: Add support for cmpxchg_double References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=cmpxchg_double_x86 Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers A simple implementation that only supports the word size and does not have a fallback mode (would require a spinlock). Add 32 and 64 bit support for cmpxchg_double. cmpxchg_double uses the cmpxchg8b or cmpxchg16b instruction on x86 processors to compare and swap 2 machine words. This allows lockless algorithms to move more context information through critical sections. Set a flag CONFIG_CMPXCHG_DOUBLE to signal the support of that feature during kernel builds. 
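To make the intended semantics concrete, here is a small userspace sketch of
the 32 bit flavour: two adjacent 32 bit words are compared and replaced as one
unit, and the result is a success flag rather than the old value. This is an
illustration only. It assumes C11 atomics on a 64 bit word in place of the
cmpxchg8b instruction, and toy_pair and toy_cmpxchg_double are hypothetical
names that do not appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Combine two adjacent 32 bit words into the 64 bit unit that is swapped. */
static uint64_t toy_pair(uint32_t w1, uint32_t w2)
{
	return (uint64_t)w1 | ((uint64_t)w2 << 32);
}

/*
 * Replace (o1, o2) by (n1, n2) only if both words still hold the old
 * values. Returns true on success and false if either word differed.
 */
static bool toy_cmpxchg_double(_Atomic uint64_t *ptr,
			       uint32_t o1, uint32_t o2,
			       uint32_t n1, uint32_t n2)
{
	uint64_t expected = toy_pair(o1, o2);

	/* Same idea as the VM_BUG_ON() alignment check in the patch. */
	assert((uintptr_t)ptr % 8 == 0);

	return atomic_compare_exchange_strong(ptr, &expected, toy_pair(n1, n2));
}
```

A mismatch in either word makes the whole operation fail, which is what lets
a caller carry a pointer and its associated counters through one atomic step.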
Signed-off-by: Christoph Lameter --- arch/x86/Kconfig.cpu | 3 ++ arch/x86/include/asm/cmpxchg_32.h | 46 ++++++++++++++++++++++++++++++++++++++ arch/x86/include/asm/cmpxchg_64.h | 43 +++++++++++++++++++++++++++++++++++ 3 files changed, 92 insertions(+) Index: linux-2.6/arch/x86/include/asm/cmpxchg_64.h =================================================================== --- linux-2.6.orig/arch/x86/include/asm/cmpxchg_64.h 2011-03-30 11:21:40.000000000 -0500 +++ linux-2.6/arch/x86/include/asm/cmpxchg_64.h 2011-03-30 12:31:35.000000000 -0500 @@ -151,4 +151,47 @@ extern void __cmpxchg_wrong_size(void); cmpxchg_local((ptr), (o), (n)); \ }) +#define cmpxchg16b(ptr, o1, o2, n1, n2) \ +({ \ + char __ret; \ + __typeof__(o2) __junk; \ + __typeof__(*(ptr)) __old1 = (o1); \ + __typeof__(o2) __old2 = (o2); \ + __typeof__(*(ptr)) __new1 = (n1); \ + __typeof__(o2) __new2 = (n2); \ + asm volatile(LOCK_PREFIX_HERE "lock; cmpxchg16b (%%rsi);setz %1" \ + : "=d"(__junk), "=a"(__ret) \ + : "S"(ptr), "b"(__new1), "c"(__new2), \ + "a"(__old1), "d"(__old2)); \ + __ret; }) + + +#define cmpxchg16b_local(ptr, o1, o2, n1, n2) \ +({ \ + char __ret; \ + __typeof__(o2) __junk; \ + __typeof__(*(ptr)) __old1 = (o1); \ + __typeof__(o2) __old2 = (o2); \ + __typeof__(*(ptr)) __new1 = (n1); \ + __typeof__(o2) __new2 = (n2); \ + asm volatile("cmpxchg16b (%%rsi)\n\t\tsetz %1\n\t" \ + : "=d"(__junk), "=a"(__ret) \ + : "S"((ptr)), "b"(__new1), "c"(__new2), \ + "a"(__old1), "d"(__old2)); \ + __ret; }) + +#define cmpxchg_double(ptr, o1, o2, n1, n2) \ +({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 8); \ + VM_BUG_ON((unsigned long)(ptr) % 16); \ + cmpxchg16b((ptr), (o1), (o2), (n1), (n2)); \ +}) + +#define cmpxchg_double_local(ptr, o1, o2, n1, n2) \ +({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 8); \ + VM_BUG_ON((unsigned long)(ptr) % 16); \ + cmpxchg16b_local((ptr), (o1), (o2), (n1), (n2)); \ +}) + #endif /* _ASM_X86_CMPXCHG_64_H */ Index: linux-2.6/arch/x86/include/asm/cmpxchg_32.h 
=================================================================== --- linux-2.6.orig/arch/x86/include/asm/cmpxchg_32.h 2011-03-30 11:21:40.000000000 -0500 +++ linux-2.6/arch/x86/include/asm/cmpxchg_32.h 2011-03-30 12:40:27.000000000 -0500 @@ -280,4 +280,50 @@ static inline unsigned long cmpxchg_386( #endif +#define cmpxchg8b(ptr, o1, o2, n1, n2) \ +({ \ + char __ret; \ + __typeof__(o2) __dummy; \ + __typeof__(*(ptr)) __old1 = (o1); \ + __typeof__(o2) __old2 = (o2); \ + __typeof__(*(ptr)) __new1 = (n1); \ + __typeof__(o2) __new2 = (n2); \ + asm volatile(LOCK_PREFIX_HERE "lock; cmpxchg8b (%%esi); setz %1"\ + : "=d"(__dummy), "=a" (__ret) \ + : "S" ((ptr)), "a" (__old1), "d"(__old2), \ + "b" (__new1), "c" (__new2) \ + : "memory"); \ + __ret; }) + + +#define cmpxchg8b_local(ptr, o1, o2, n1, n2) \ +({ \ + char __ret; \ + __typeof__(o2) __dummy; \ + __typeof__(*(ptr)) __old1 = (o1); \ + __typeof__(o2) __old2 = (o2); \ + __typeof__(*(ptr)) __new1 = (n1); \ + __typeof__(o2) __new2 = (n2); \ + asm volatile("cmpxchg8b (%%esi); setz %1" \ + : "=d"(__dummy), "=a"(__ret) \ + : "S" ((ptr)), "a" (__old1), "d"(__old2), \ + "b" (__new1), "c" (__new2) \ + : "memory"); \ + __ret; }) + + +#define cmpxchg_double(ptr, o1, o2, n1, n2) \ +({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 4); \ + VM_BUG_ON((unsigned long)(ptr) % 8); \ + cmpxchg8b((ptr), (o1), (o2), (n1), (n2)); \ +}) + +#define cmpxchg_double_local(ptr, o1, o2, n1, n2) \ +({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 4); \ + VM_BUG_ON((unsigned long)(ptr) % 8); \ + cmpxchg8b_local((ptr), (o1), (o2), (n1), (n2)); \ +}) + #endif /* _ASM_X86_CMPXCHG_32_H */ Index: linux-2.6/arch/x86/Kconfig.cpu =================================================================== --- linux-2.6.orig/arch/x86/Kconfig.cpu 2011-03-30 11:21:40.000000000 -0500 +++ linux-2.6/arch/x86/Kconfig.cpu 2011-03-30 11:21:44.000000000 -0500 @@ -308,6 +308,9 @@ config X86_CMPXCHG config CMPXCHG_LOCAL def_bool X86_64 || (X86_32 && !M386) +config CMPXCHG_DOUBLE + def_bool X86_64 
|| (X86_32 && !M386) + config X86_L1_CACHE_SHIFT int default "7" if MPENTIUM4 || MPSC -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id F180D8D004E for ; Wed, 30 Mar 2011 16:24:23 -0400 (EDT) Message-Id: <20110330202419.431478190@linux.com> Date: Wed, 30 Mar 2011 15:23:49 -0500 From: Christoph Lameter Subject: [slubll1 07/19] slub: Move page->frozen handling near where the page->freelist handling occurs References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=frozen_move Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers This is necessary because the frozen bit has to be handled in the same cmpxchg_double with the freelist and the counters. 
Signed-off-by: Christoph Lameter --- mm/slub.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:32:15.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:32:29.000000000 -0500 @@ -1264,6 +1264,7 @@ static struct page *new_slab(struct kmem page->freelist = start; page->inuse = 0; + page->frozen = 1; out: return page; } @@ -1402,7 +1403,6 @@ static inline int lock_and_freeze_slab(s { if (slab_trylock(page)) { __remove_partial(n, page); - page->frozen = 1; return 1; } return 0; @@ -1516,7 +1516,6 @@ static void unfreeze_slab(struct kmem_ca { struct kmem_cache_node *n = get_node(s, page_to_nid(page)); - page->frozen = 0; if (page->inuse) { if (page->freelist) { @@ -1657,6 +1656,7 @@ static void deactivate_slab(struct kmem_ #ifdef CONFIG_CMPXCHG_LOCAL c->tid = next_tid(c->tid); #endif + page->frozen = 0; unfreeze_slab(s, page, tail); } @@ -1819,6 +1819,8 @@ static void *__slab_alloc(struct kmem_ca stat(s, ALLOC_REFILL); load_freelist: + VM_BUG_ON(!page->frozen); + object = page->freelist; if (unlikely(!object)) goto another_slab; @@ -1845,6 +1847,7 @@ new_slab: page = get_partial(s, gfpflags, node); if (page) { stat(s, ALLOC_FROM_PARTIAL); + page->frozen = 1; load_from_page: c->node = page_to_nid(page); c->page = page; @@ -1867,7 +1870,6 @@ load_from_page: flush_slab(s, c); slab_lock(page); - page->frozen = 1; goto load_from_page; } @@ -2414,6 +2416,7 @@ static void early_kmem_cache_node_alloc( BUG_ON(!n); page->freelist = get_freepointer(kmem_cache_node, n); page->inuse++; + page->frozen = 0; kmem_cache_node->node[node] = n; #ifdef CONFIG_SLUB_DEBUG init_object(kmem_cache_node, n, SLUB_RED_ACTIVE); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 66BB38D004C for ; Wed, 30 Mar 2011 16:24:24 -0400 (EDT) Message-Id: <20110330202420.741455023@linux.com> Date: Wed, 30 Mar 2011 15:23:51 -0500 From: Christoph Lameter Subject: [slubll1 09/19] mm: Rearrange struct page References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=resort_struct_page Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers We need to be able to use cmpxchg_double on the freelist and object count field in struct page. Rearrange the fields in struct page according to doubleword entities so that the freelist pointer comes before the counters. Do the rearranging with a future in mind where we use more doubleword atomics to avoid locking of updates to flags/mapping or lru pointers. Create another union to allow access to counters in struct page as a single unsigned long value. The doublewords must be properly aligned for cmpxchg_double to work. Sadly this increases the size of page struct by one word but as a result page structs are now cacheline aligned on x86_64. Signed-off-by: Christoph Lameter --- include/linux/mm_types.h | 85 ++++++++++++++++++++++++++++++----------------- 1 file changed, 55 insertions(+), 30 deletions(-) Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2011-03-30 13:54:14.000000000 -0500 +++ linux-2.6/include/linux/mm_types.h 2011-03-30 13:58:07.000000000 -0500 @@ -30,52 +30,68 @@ struct address_space; * moment. 
Note that we have no way to track which tasks are using * a page, though if it is a pagecache page, rmap structures can tell us * who is mapping it. + * + * The objects in struct page are organized in double word blocks in + * order to allow us to use atomic double word operations on portions + * of struct page. That is currently only used by slub but the arrangement + * allows the use of atomic double word operations on the flags/mapping + * and lru list pointers also. */ struct page { + /* First double word block */ unsigned long flags; /* Atomic flags, some possibly * updated asynchronously */ - atomic_t _count; /* Usage count, see below. */ - union { - atomic_t _mapcount; /* Count of ptes mapped in mms, - * to show when page is mapped - * & limit reverse map searches. + struct address_space *mapping; /* If low bit clear, points to + * inode address_space, or NULL. + * If page mapped as anonymous + * memory, low bit is set, and + * it points to anon_vma object: + * see PAGE_MAPPING_ANON below. */ - struct { /* SLUB */ - unsigned inuse:16; - unsigned objects:15; - unsigned frozen:1; + /* Second double word block used by SLUB */ + union { + pgoff_t index; /* Our offset within mapping. */ + void *freelist; /* SLUB: freelist req. slab lock */ + }; + union { + unsigned long counters; + struct { + union { + struct { /* SLUB */ + unsigned inuse:16; + unsigned objects:15; + unsigned frozen:1; + }; + atomic_t _mapcount; /* Count of ptes mapped in mms, + * to show when page is mapped + * & limit reverse map searches. + */ + }; + atomic_t _count; /* Usage count, see below. */ }; }; + + /* Third double word block */ + struct list_head lru; /* Pageout list, eg. active_list + * protected by zone->lru_lock ! 
+ */ + + /* Remainder is not double word aligned */ union { - struct { - unsigned long private; /* Mapping-private opaque data: + unsigned long private; /* Mapping-private opaque data: * usually used for buffer_heads * if PagePrivate set; used for * swp_entry_t if PageSwapCache; * indicates order in the buddy * system if PG_buddy is set. */ - struct address_space *mapping; /* If low bit clear, points to - * inode address_space, or NULL. - * If page mapped as anonymous - * memory, low bit is set, and - * it points to anon_vma object: - * see PAGE_MAPPING_ANON below. - */ - }; #if USE_SPLIT_PTLOCKS - spinlock_t ptl; + spinlock_t ptl; #endif - struct kmem_cache *slab; /* SLUB: Pointer to slab */ - struct page *first_page; /* Compound tail pages */ + struct kmem_cache *slab; /* SLUB: Pointer to slab */ + struct page *first_page; /* Compound tail pages */ }; - union { - pgoff_t index; /* Our offset within mapping. */ - void *freelist; /* SLUB: freelist req. slab lock */ - }; - struct list_head lru; /* Pageout list, eg. active_list - * protected by zone->lru_lock ! - */ + /* * On machines where all RAM is mapped into kernel address space, * we can simply calculate the virtual address. On machines with @@ -101,7 +117,16 @@ struct page { */ void *shadow; #endif -}; +} +/* + * If another subsystem starts using the double word pairing for atomic + * operations on struct page then it must change the #if to ensure + * proper alignment of the page struct. + */ +#if defined(CONFIG_SLUB) && defined(CONFIG_CMPXCHG_LOCAL) + __attribute__((__aligned__(2*sizeof(unsigned long)))) +#endif +; /* * A region containing a mapping of a non-memory backed file under NOMMU -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
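For readers following the layout change in 09/19, here is a minimal user-space sketch of the second double word (freelist plus counters). It is not kernel code. The names struct slab_dw, refcount and counters_set_frozen() are assumptions made for illustration, and the sketch assumes an LP64 target with a GCC- or Clang-style compiler. It shows only why the pair has to be two-word aligned and how the bitfields can be edited through a private copy of the counters word.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space stand-in for the second double word of
 * struct page after this patch. Assumes an LP64 target (8-byte
 * pointers and unsigned long) and a GCC/Clang-style compiler.
 */
struct slab_dw {
	void *freelist;			/* first word */
	union {
		unsigned long counters;	/* second word, viewed as one value */
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
			int refcount;	/* placeholder for _count */
		};
	};
} __attribute__((__aligned__(2 * sizeof(unsigned long))));

/* The pair must be exactly two words and start on a two-word boundary. */
_Static_assert(sizeof(struct slab_dw) == 2 * sizeof(unsigned long),
	       "freelist + counters must form one double word");
_Static_assert(_Alignof(struct slab_dw) == 2 * sizeof(unsigned long),
	       "double word must be naturally aligned");
_Static_assert(offsetof(struct slab_dw, counters) == sizeof(void *),
	       "counters must directly follow freelist");

/* Build a new counters value from an old one by editing a private copy. */
static inline unsigned long counters_set_frozen(unsigned long old,
						unsigned frozen)
{
	struct slab_dw tmp;

	tmp.counters = old;	/* load the whole word */
	tmp.frozen = frozen & 1;	/* change a single field */
	return tmp.counters;	/* hand back the whole word */
}
```

The later patches in the series use the same idea with a temporary struct page named "new": copy the counters word, adjust inuse or frozen in the copy, then pass new.counters to the compare-and-exchange.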
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id D6E108D004E for ; Wed, 30 Mar 2011 16:24:24 -0400 (EDT) Message-Id: <20110330202421.349168617@linux.com> Date: Wed, 30 Mar 2011 15:23:52 -0500 From: Christoph Lameter Subject: [slubll1 10/19] slub: Add cmpxchg_double_slab() References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=cmpxchg_double_slab Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers Add a function that operates on the second doubleword in the page struct and manipulates the object counters, the freelist and the frozen attribute. Signed-off-by: Christoph Lameter --- include/linux/slub_def.h | 1 + mm/slub.c | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:34:59.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:42:52.000000000 -0500 @@ -131,6 +131,9 @@ static inline int kmem_cache_debug(struc /* Enable to test recovery from slab corruption on boot */ #undef SLUB_RESILIENCY_TEST +/* Enable to log cmpxchg failures */ +#undef SLUB_DEBUG_CMPXCHG + /* * Mininum number of partial slabs. These will be left on the partial * lists even if they are empty. kmem_cache_shrink may reclaim them. 
@@ -326,6 +329,37 @@ static inline int oo_objects(struct kmem return x.x & OO_MASK; } +static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, + void *freelist_old, unsigned long counters_old, + void *freelist_new, unsigned long counters_new, + const char *n) +{ +#ifdef CONFIG_CMPXCHG_DOUBLE + if (!kmem_cache_debug(s)) { + if (cmpxchg_double(&page->freelist, + freelist_old, counters_old, + freelist_new, counters_new)) + return 1; + } else +#endif + { + if (page->freelist == freelist_old && page->counters == counters_old) { + page->freelist = freelist_new; + page->counters = counters_new; + return 1; + } + } + + cpu_relax(); + stat(s, CMPXCHG_DOUBLE_FAIL); + +#ifdef SLUB_DEBUG_CMPXCHG + printk(KERN_INFO "%s %s: cmpxchg double redo ", n, s->name); +#endif + + return 0; +} + /* * Determine a map of object in use on a page. * @@ -4535,6 +4569,8 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail); STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees); STAT_ATTR(ORDER_FALLBACK, order_fallback); +STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail); +STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail); #endif static struct attribute *slab_attrs[] = { @@ -4592,6 +4628,8 @@ static struct attribute *slab_attrs[] = &deactivate_to_tail_attr.attr, &deactivate_remote_frees_attr.attr, &order_fallback_attr.attr, + &cmpxchg_double_fail_attr.attr, + &cmpxchg_double_cpu_fail_attr.attr, #endif #ifdef CONFIG_FAILSLAB &failslab_attr.attr, Index: linux-2.6/include/linux/slub_def.h =================================================================== --- linux-2.6.orig/include/linux/slub_def.h 2011-03-30 14:34:59.000000000 -0500 +++ linux-2.6/include/linux/slub_def.h 2011-03-30 14:35:01.000000000 -0500 @@ -33,6 +33,7 @@ enum stat_item { DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */ ORDER_FALLBACK, /* Number of times fallback was necessary */ CMPXCHG_DOUBLE_CPU_FAIL,/* Failure of 
this_cpu_cmpxchg_double */ + CMPXCHG_DOUBLE_FAIL, /* Number of times that cmpxchg double did not match */ NR_SLUB_STAT_ITEMS }; struct kmem_cache_cpu { From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id E6B2D8D0051 for ; Wed, 30 Mar 2011 16:24:24 -0400 (EDT) Message-Id: <20110330202421.954263610@linux.com> Date: Wed, 30 Mar 2011 15:23:53 -0500 From: Christoph Lameter Subject: [slubll1 11/19] slub: explicit list_lock taking References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=unlock_list_ops Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers The allocator fastpath rework changes how the list_lock is used. Remove the list_lock handling from the helper functions that hid it from the critical sections, and take the lock explicitly in those critical sections instead. This in turn simplifies the support functions (no __ variant needed anymore) and simplifies the lock handling on bootstrap. Signed-off-by: Christoph Lameter --- mm/slub.c | 74 ++++++++++++++++++++++++++++++-------------------------------- 1 file changed, 36 insertions(+), 38 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:42:52.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:42:55.000000000 -0500 @@ -904,25 +904,21 @@ static inline void slab_free_hook(struct /* * Tracking of fully allocated slabs for debugging purposes.
*/ -static void add_full(struct kmem_cache_node *n, struct page *page) +static void add_full(struct kmem_cache *s, + struct kmem_cache_node *n, struct page *page) { - spin_lock(&n->list_lock); + if (!(s->flags & SLAB_STORE_USER)) + return; + list_add(&page->lru, &n->full); - spin_unlock(&n->list_lock); } static void remove_full(struct kmem_cache *s, struct page *page) { - struct kmem_cache_node *n; - if (!(s->flags & SLAB_STORE_USER)) return; - n = get_node(s, page_to_nid(page)); - - spin_lock(&n->list_lock); list_del(&page->lru); - spin_unlock(&n->list_lock); } /* Tracking of the number of slabs for debugging purposes */ @@ -1047,8 +1043,13 @@ static noinline int free_debug_processin } /* Special debug activities for freeing objects */ - if (!page->frozen && !page->freelist) + if (!page->frozen && !page->freelist) { + struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + + spin_lock(&n->list_lock); remove_full(s, page); + spin_unlock(&n->list_lock); + } if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_FREE, addr); trace(s, page, object, 0); @@ -1399,36 +1400,26 @@ static __always_inline int slab_trylock( /* * Management of partially allocated slabs */ -static void add_partial(struct kmem_cache_node *n, +static inline void add_partial(struct kmem_cache_node *n, struct page *page, int tail) { - spin_lock(&n->list_lock); n->nr_partial++; if (tail) list_add_tail(&page->lru, &n->partial); else list_add(&page->lru, &n->partial); - spin_unlock(&n->list_lock); } -static inline void __remove_partial(struct kmem_cache_node *n, +static inline void remove_partial(struct kmem_cache_node *n, struct page *page) { list_del(&page->lru); n->nr_partial--; } -static void remove_partial(struct kmem_cache *s, struct page *page) -{ - struct kmem_cache_node *n = get_node(s, page_to_nid(page)); - - spin_lock(&n->list_lock); - __remove_partial(n, page); - spin_unlock(&n->list_lock); -} - /* - * Lock slab and remove from the partial list. 
+ * Lock slab, remove from the partial list and put the object into the + * per cpu freelist. * * Must hold list_lock. */ @@ -1436,7 +1427,7 @@ static inline int lock_and_freeze_slab(s struct page *page) { if (slab_trylock(page)) { - __remove_partial(n, page); + remove_partial(n, page); return 1; } return 0; @@ -1553,12 +1544,17 @@ static void unfreeze_slab(struct kmem_ca if (page->inuse) { if (page->freelist) { + spin_lock(&n->list_lock); add_partial(n, page, tail); + spin_unlock(&n->list_lock); stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD); } else { stat(s, DEACTIVATE_FULL); - if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER)) - add_full(n, page); + if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER)) { + spin_lock(&n->list_lock); + add_full(s, n, page); + spin_unlock(&n->list_lock); + } } slab_unlock(page); } else { @@ -1574,7 +1570,9 @@ static void unfreeze_slab(struct kmem_ca * kmem_cache_shrink can reclaim any empty slabs from * the partial list. */ + spin_lock(&n->list_lock); add_partial(n, page, 1); + spin_unlock(&n->list_lock); slab_unlock(page); } else { slab_unlock(page); @@ -2114,7 +2112,11 @@ static void __slab_free(struct kmem_cach * then add it. */ if (unlikely(!prior)) { + struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + + spin_lock(&n->list_lock); add_partial(get_node(s, page_to_nid(page)), page, 1); + spin_unlock(&n->list_lock); stat(s, FREE_ADD_PARTIAL); } @@ -2130,7 +2132,11 @@ slab_empty: /* * Slab still on the partial list. 
*/ - remove_partial(s, page); + struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + + spin_lock(&n->list_lock); + remove_partial(n, page); + spin_unlock(&n->list_lock); stat(s, FREE_REMOVE_PARTIAL); } slab_unlock(page); @@ -2432,7 +2438,6 @@ static void early_kmem_cache_node_alloc( { struct page *page; struct kmem_cache_node *n; - unsigned long flags; BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node)); @@ -2459,14 +2464,7 @@ static void early_kmem_cache_node_alloc( init_kmem_cache_node(n, kmem_cache_node); inc_slabs_node(kmem_cache_node, node, page->objects); - /* - * lockdep requires consistent irq usage for each lock - * so even though there cannot be a race this early in - * the boot sequence, we still disable irqs. - */ - local_irq_save(flags); add_partial(n, page, 0); - local_irq_restore(flags); } static void free_kmem_cache_nodes(struct kmem_cache *s) @@ -2744,7 +2742,7 @@ static void free_partial(struct kmem_cac spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry_safe(page, h, &n->partial, lru) { if (!page->inuse) { - __remove_partial(n, page); + remove_partial(n, page); discard_slab(s, page); } else { list_slab_objects(s, page, @@ -3082,7 +3080,7 @@ int kmem_cache_shrink(struct kmem_cache * may have freed the last object and be * waiting to release the slab. */ - __remove_partial(n, page); + remove_partial(n, page); slab_unlock(page); discard_slab(s, page); } else { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
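The refactoring in 11/19 can be sketched in user space as follows. This is an illustration under assumptions, not the kernel implementation: struct node_lists, struct list_node, node_lock(), node_unlock() and move_to_tail() are hypothetical names, and a C11 atomic_flag spin loop stands in for the kernel spinlock. The helpers no longer lock, so the caller can cover several list operations with a single lock/unlock pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal intrusive doubly linked list node (stand-in for list_head). */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical stand-in for the per-node partial list and its lock. */
struct node_lists {
	atomic_flag list_lock;
	struct list_node partial;	/* list head */
	unsigned long nr_partial;
};

static inline void node_init(struct node_lists *n)
{
	atomic_flag_clear(&n->list_lock);
	n->partial.prev = n->partial.next = &n->partial;
	n->nr_partial = 0;
}

static inline void node_lock(struct node_lists *n)
{
	while (atomic_flag_test_and_set(&n->list_lock))
		;	/* spin until the flag was clear */
}

static inline void node_unlock(struct node_lists *n)
{
	atomic_flag_clear(&n->list_lock);
}

/* Caller must hold n->list_lock. */
static inline void add_partial(struct node_lists *n,
			       struct list_node *slab, int tail)
{
	struct list_node *prev = tail ? n->partial.prev : &n->partial;
	struct list_node *next = prev->next;

	slab->prev = prev;
	slab->next = next;
	prev->next = slab;
	next->prev = slab;
	n->nr_partial++;
}

/* Caller must hold n->list_lock. */
static inline void remove_partial(struct node_lists *n,
				  struct list_node *slab)
{
	slab->prev->next = slab->next;
	slab->next->prev = slab->prev;
	n->nr_partial--;
}

/* One critical section covering two list operations. */
static inline void move_to_tail(struct node_lists *n,
				struct list_node *slab)
{
	node_lock(n);
	remove_partial(n, slab);
	add_partial(n, slab, 1);
	node_unlock(n);
}
```

With locking helpers, move_to_tail() would have taken and dropped the lock twice and exposed an intermediate state in between. With caller-held locking, the two updates appear as one step to anyone else who takes the lock.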
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id F0E688D0052 for ; Wed, 30 Mar 2011 16:24:25 -0400 (EDT) Message-Id: <20110330202422.592704393@linux.com> Date: Wed, 30 Mar 2011 15:23:54 -0500 From: Christoph Lameter Subject: [slubll1 12/19] slub: Pass kmem_cache struct to lock and freeze slab References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=pass_kmem_cache_to_lock_and_freeze Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers We need more information about the slab for the cmpxchg implementation. Signed-off-by: Christoph Lameter --- mm/slub.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:42:55.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:42:58.000000000 -0500 @@ -1423,8 +1423,8 @@ static inline void remove_partial(struct * * Must hold list_lock. */ -static inline int lock_and_freeze_slab(struct kmem_cache_node *n, - struct page *page) +static inline int lock_and_freeze_slab(struct kmem_cache *s, + struct kmem_cache_node *n, struct page *page) { if (slab_trylock(page)) { remove_partial(n, page); @@ -1436,7 +1436,8 @@ static inline int lock_and_freeze_slab(s /* * Try to allocate a partial slab from a specific node. 
*/ -static struct page *get_partial_node(struct kmem_cache_node *n) +static struct page *get_partial_node(struct kmem_cache *s, + struct kmem_cache_node *n) { struct page *page; @@ -1451,7 +1452,7 @@ static struct page *get_partial_node(str spin_lock(&n->list_lock); list_for_each_entry(page, &n->partial, lru) - if (lock_and_freeze_slab(n, page)) + if (lock_and_freeze_slab(s, n, page)) goto out; page = NULL; out: @@ -1502,7 +1503,7 @@ static struct page *get_any_partial(stru if (n && cpuset_zone_allowed_hardwall(zone, flags) && n->nr_partial > s->min_partial) { - page = get_partial_node(n); + page = get_partial_node(s, n); if (page) { put_mems_allowed(); return page; @@ -1522,7 +1523,7 @@ static struct page *get_partial(struct k struct page *page; int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node; - page = get_partial_node(get_node(s, searchnode)); + page = get_partial_node(s, get_node(s, searchnode)); if (page || node != -1) return page; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id C995B8D004C for ; Wed, 30 Mar 2011 16:24:26 -0400 (EDT) Message-Id: <20110330202423.327243730@linux.com> Date: Wed, 30 Mar 2011 15:23:55 -0500 From: Christoph Lameter Subject: [slubll1 13/19] slub: Rework allocator fastpaths References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=rework_fastpaths Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H.
Peter Anvin" , Mathieu Desnoyers Rework the allocation paths so that updates of the page freelist, frozen state and number of objects use cmpxchg_double_slab(). Signed-off-by: Christoph Lameter --- mm/slub.c | 422 ++++++++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 292 insertions(+), 130 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:42:58.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:43:01.000000000 -0500 @@ -974,11 +974,6 @@ static noinline int alloc_debug_processi if (!check_slab(s, page)) goto bad; - if (!on_freelist(s, page, object)) { - object_err(s, page, object, "Object already allocated"); - goto bad; - } - if (!check_valid_pointer(s, page, object)) { object_err(s, page, object, "Freelist Pointer check fails"); goto bad; @@ -1042,14 +1037,6 @@ static noinline int free_debug_processin goto fail; } - /* Special debug activities for freeing objects */ - if (!page->frozen && !page->freelist) { - struct kmem_cache_node *n = get_node(s, page_to_nid(page)); - - spin_lock(&n->list_lock); - remove_full(s, page); - spin_unlock(&n->list_lock); - } if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_FREE, addr); trace(s, page, object, 0); @@ -1426,11 +1413,52 @@ static inline void remove_partial(struct static inline int lock_and_freeze_slab(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page) { - if (slab_trylock(page)) { - remove_partial(n, page); + void *freelist; + unsigned long counters; + struct page new; + + + if (!slab_trylock(page)) + return 0; + + /* + * Zap the freelist and set the frozen bit. + * The old freelist is the list of objects for the + * per cpu allocation list. 
+ */ + do { + freelist = page->freelist; + counters = page->counters; + new.counters = counters; + new.inuse = page->objects; + + VM_BUG_ON(new.frozen); + new.frozen = 1; + + } while (!cmpxchg_double_slab(s, page, + freelist, counters, + NULL, new.counters, + "lock and freeze")); + + remove_partial(n, page); + + if (freelist) { + /* Populate the per cpu freelist */ + this_cpu_write(s->cpu_slab->freelist, freelist); + this_cpu_write(s->cpu_slab->page, page); + this_cpu_write(s->cpu_slab->node, page_to_nid(page)); return 1; + } else { + /* + * Slab page came from the wrong list. No object to allocate + * from. Put it onto the correct list and continue partial + * scan. + */ + printk(KERN_ERR "SLUB: %s : Page without available objects on" + " partial list\n", s->name); + slab_unlock(page); + return 0; } - return 0; } /* @@ -1530,59 +1558,6 @@ static struct page *get_partial(struct k return get_any_partial(s, flags); } -/* - * Move a page back to the lists. - * - * Must be called with the slab lock held. - * - * On exit the slab lock will have been dropped. - */ -static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) - __releases(bitlock) -{ - struct kmem_cache_node *n = get_node(s, page_to_nid(page)); - - if (page->inuse) { - - if (page->freelist) { - spin_lock(&n->list_lock); - add_partial(n, page, tail); - spin_unlock(&n->list_lock); - stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD); - } else { - stat(s, DEACTIVATE_FULL); - if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER)) { - spin_lock(&n->list_lock); - add_full(s, n, page); - spin_unlock(&n->list_lock); - } - } - slab_unlock(page); - } else { - stat(s, DEACTIVATE_EMPTY); - if (n->nr_partial < s->min_partial) { - /* - * Adding an empty slab to the partial slabs in order - * to avoid page allocator overhead. This slab needs - * to come after the other slabs with objects in - * so that the others get filled first. That way the - * size of the partial list stays small. 
- * - * kmem_cache_shrink can reclaim any empty slabs from - * the partial list. - */ - spin_lock(&n->list_lock); - add_partial(n, page, 1); - spin_unlock(&n->list_lock); - slab_unlock(page); - } else { - slab_unlock(page); - stat(s, FREE_SLAB); - discard_slab(s, page); - } - } -} - #ifdef CONFIG_CMPXCHG_LOCAL #ifdef CONFIG_PREEMPT /* @@ -1658,39 +1633,161 @@ void init_kmem_cache_cpus(struct kmem_ca /* * Remove the cpu slab */ + +/* + * Remove the cpu slab + */ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) - __releases(bitlock) { + enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE }; struct page *page = c->page; - int tail = 1; + struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + int lock = 0; + enum slab_modes l = M_NONE, m; + void *freelist; + void *nextfree; + int tail = 0; + struct page new; + struct page old; - if (page->freelist) + if (page->freelist) { stat(s, DEACTIVATE_REMOTE_FREES); + tail = 1; + } + +#ifdef CONFIG_CMPXCHG_LOCAL + c->tid = next_tid(c->tid); +#endif + c->page = NULL; + freelist = c->freelist; + c->freelist = NULL; + + /* + * Stage one: Free all available per cpu objects back + * to the page freelist while it is still frozen. Leave the + * last one. + * + * There is no need to take the list->lock because the page + * is still frozen. + */ + while (freelist && (nextfree = get_freepointer(s, freelist))) { + void *prior; + unsigned long counters; + + do { + prior = page->freelist; + counters = page->counters; + set_freepointer(s, freelist, prior); + new.counters = counters; + new.inuse--; + VM_BUG_ON(!new.frozen); + + } while (!cmpxchg_double_slab(s, page, + prior, counters, + freelist, new.counters, + "drain percpu freelist")); + + freelist = nextfree; + } + /* - * Merge cpu freelist into slab freelist. Typically we get here - * because both freelists are empty. So this is unlikely - * to occur. 
+ * Stage two: Ensure that the page is unfrozen while the + * list presence reflects the actual number of objects + * during unfreeze. + * + * We setup the list membership and then perform a cmpxchg + * with the count. If there is a mismatch then the page + * is not unfrozen but the page is on the wrong list. + * + * Then we restart the process which may have to remove + * the page from the list that we just put it on again + * because the number of objects in the slab may have + * changed. */ - while (unlikely(c->freelist)) { - void **object; +redo: - tail = 0; /* Hot objects. Put the slab first */ + old.freelist = page->freelist; + old.counters = page->counters; + VM_BUG_ON(!old.frozen); + + /* Determine target state of the slab */ + new.counters = old.counters; + if (freelist) { + new.inuse--; + set_freepointer(s, freelist, old.freelist); + new.freelist = freelist; + } else + new.freelist = old.freelist; - /* Retrieve object from cpu_freelist */ - object = c->freelist; - c->freelist = get_freepointer(s, c->freelist); + new.frozen = 0; + + m = M_NONE; + + if (!new.inuse && n->nr_partial < s->min_partial) + m = M_FREE; + else if (new.freelist) { + m = M_PARTIAL; + if (!lock) { + lock = 1; + /* + * Taking the spinlock removes the possiblity + * that acquire_slab() will see a slab page that + * is frozen + */ + spin_lock(&n->list_lock); + } + } else { + m = M_FULL; + if (kmem_cache_debug(s) && !lock) { + lock = 1; + /* + * This also ensures that the scanning of full + * slabs from diagnostic functions will not see + * any frozen slabs. + */ + spin_lock(&n->list_lock); + } + } + + if (l != m) { + + if (l == M_PARTIAL) + + remove_partial(n, page); + + else if (l == M_FULL) + + remove_full(s, page); + + if (m == M_PARTIAL) { + + add_partial(n, page, tail); + stat(s, tail ? 
DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD); + + } else if (m == M_FULL) { + + stat(s, DEACTIVATE_FULL); + add_full(s, n, page); - /* And put onto the regular freelist */ - set_freepointer(s, object, page->freelist); - page->freelist = object; - page->inuse--; + } + } + + l = m; + if (!cmpxchg_double_slab(s, page, + old.freelist, old.counters, + new.freelist, new.counters, + "unfreezing slab")) + goto redo; + + slab_unlock(page); + + if (lock) + spin_unlock(&n->list_lock); + + if (m == M_FREE) { + discard_slab(s, page); + stat(s, FREE_SLAB); } - c->page = NULL; -#ifdef CONFIG_CMPXCHG_LOCAL - c->tid = next_tid(c->tid); -#endif - page->frozen = 0; - unfreeze_slab(s, page, tail); } static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) @@ -1851,21 +1948,33 @@ static void *__slab_alloc(struct kmem_ca stat(s, ALLOC_REFILL); + { + struct page new; + unsigned long counters; + + do { + object = page->freelist; + counters = page->counters; + new.counters = counters; + new.inuse = page->objects; + VM_BUG_ON(!new.frozen); + + } while (!cmpxchg_double_slab(s, page, + object, counters, + NULL, new.counters, + "__slab_alloc")); + } + load_freelist: VM_BUG_ON(!page->frozen); - object = page->freelist; if (unlikely(!object)) goto another_slab; - if (kmem_cache_debug(s)) - goto debug; + + slab_unlock(page); c->freelist = get_freepointer(s, object); - page->inuse = page->objects; - page->freelist = NULL; -unlock_out: - slab_unlock(page); #ifdef CONFIG_CMPXCHG_LOCAL c->tid = next_tid(c->tid); local_irq_restore(flags); @@ -1880,10 +1989,11 @@ new_slab: page = get_partial(s, gfpflags, node); if (page) { stat(s, ALLOC_FROM_PARTIAL); - page->frozen = 1; load_from_page: - c->node = page_to_nid(page); - c->page = page; + object = c->freelist; + + if (kmem_cache_debug(s)) + goto debug; goto load_freelist; } @@ -1898,10 +2008,21 @@ load_from_page: if (page) { c = __this_cpu_ptr(s->cpu_slab); - stat(s, ALLOC_SLAB); if (c->page) flush_slab(s, c); + /* + * No other reference 
to the page yet so we can + * muck around with it freely without cmpxchg + */ + c->freelist = page->freelist; + page->freelist = NULL; + page->inuse = page->objects; + + c->node = page_to_nid(page); + c->page = page; + + stat(s, ALLOC_SLAB); slab_lock(page); goto load_from_page; @@ -1912,14 +2033,19 @@ load_from_page: local_irq_restore(flags); #endif return NULL; + debug: - if (!alloc_debug_processing(s, page, object, addr)) - goto another_slab; + if (!object || !alloc_debug_processing(s, page, object, addr)) + goto new_slab; - page->inuse++; - page->freelist = get_freepointer(s, object); + c->freelist = get_freepointer(s, object); + deactivate_slab(s, c); c->node = NUMA_NO_NODE; - goto unlock_out; + +#ifdef CONFIG_CMPXCHG_LOCAL + local_irq_restore(flags); +#endif + return object; } /* @@ -2084,6 +2210,11 @@ static void __slab_free(struct kmem_cach { void *prior; void **object = (void *)x; + int was_frozen; + int inuse; + struct page new; + unsigned long counters; + struct kmem_cache_node *n = NULL; #ifdef CONFIG_CMPXCHG_LOCAL unsigned long flags; @@ -2095,32 +2226,65 @@ static void __slab_free(struct kmem_cach if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr)) goto out_unlock; - prior = page->freelist; - set_freepointer(s, object, prior); - page->freelist = object; - page->inuse--; - - if (unlikely(page->frozen)) { - stat(s, FREE_FROZEN); - goto out_unlock; - } + do { + prior = page->freelist; + counters = page->counters; + set_freepointer(s, object, prior); + new.counters = counters; + was_frozen = new.frozen; + new.inuse--; + if ((!new.inuse || !prior) && !was_frozen && !n) { + n = get_node(s, page_to_nid(page)); + /* + * Speculatively acquire the list_lock. + * If the cmpxchg does not succeed then we may + * drop the list_lock without any processing. + * + * Otherwise the list_lock will synchronize with + * other processors updating the list of slabs. 
+ */ + spin_lock(&n->list_lock); + } + inuse = new.inuse; - if (unlikely(!page->inuse)) - goto slab_empty; + } while (!cmpxchg_double_slab(s, page, + prior, counters, + object, new.counters, + "__slab_free")); + + if (likely(!n)) { + /* + * The list lock was not taken therefore no list + * activity can be necessary. + */ + if (was_frozen) + stat(s, FREE_FROZEN); + goto out_unlock; + } /* - * Objects left in the slab. If it was not on the partial list before - * then add it. + * was_frozen may have been set after we acquired the list_lock in + * an earlier loop. So we need to check it here again. */ - if (unlikely(!prior)) { - struct kmem_cache_node *n = get_node(s, page_to_nid(page)); + if (was_frozen) + stat(s, FREE_FROZEN); + else { + if (unlikely(!inuse && n->nr_partial > s->min_partial)) + goto slab_empty; - spin_lock(&n->list_lock); - add_partial(get_node(s, page_to_nid(page)), page, 1); - spin_unlock(&n->list_lock); - stat(s, FREE_ADD_PARTIAL); + /* + * Objects left in the slab. If it was not on the partial list before + * then add it. + */ + if (unlikely(!prior)) { + remove_full(s, page); + add_partial(n, page, 0); + stat(s, FREE_ADD_PARTIAL); + } } + spin_unlock(&n->list_lock); + out_unlock: slab_unlock(page); #ifdef CONFIG_CMPXCHG_LOCAL @@ -2133,13 +2297,11 @@ slab_empty: /* * Slab still on the partial list. */ - struct kmem_cache_node *n = get_node(s, page_to_nid(page)); - - spin_lock(&n->list_lock); remove_partial(n, page); - spin_unlock(&n->list_lock); stat(s, FREE_REMOVE_PARTIAL); } + + spin_unlock(&n->list_lock); slab_unlock(page); #ifdef CONFIG_CMPXCHG_LOCAL local_irq_restore(flags); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
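The retry loops that 13/19 adds in __slab_free, deactivate_slab and lock_and_freeze_slab all follow one pattern: snapshot freelist and counters, compute the new pair, and attempt the double-word exchange until it succeeds. Below is a minimal user-space sketch of that pattern for the free path. Everything in it is an assumption made for illustration: struct slab_state, cmpxchg_double_emul() and free_object() are hypothetical names, counters is reduced to a plain in-use count, and the exchange is emulated with a small lock (comparable in spirit to the non-cmpxchg fallback branch of cmpxchg_double_slab()) rather than a real 16-byte atomic instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct object {
	struct object *next;	/* free pointer stored inside the object */
};

/* Hypothetical stand-in for the freelist/counters pair of a slab page. */
struct slab_state {
	atomic_flag lock;			/* only used by the emulation */
	_Atomic(struct object *) freelist;
	atomic_ulong counters;			/* simplified: objects in use */
};

static inline void slab_state_init(struct slab_state *s, unsigned long inuse)
{
	atomic_flag_clear(&s->lock);
	atomic_init(&s->freelist, NULL);
	atomic_init(&s->counters, inuse);
}

/*
 * Emulated double-word compare-and-exchange: replace both words only
 * if both still hold the expected old values.
 */
static inline bool cmpxchg_double_emul(struct slab_state *s,
		struct object *old_free, unsigned long old_cnt,
		struct object *new_free, unsigned long new_cnt)
{
	bool ok = false;

	while (atomic_flag_test_and_set(&s->lock))
		;	/* spin */
	if (atomic_load(&s->freelist) == old_free &&
	    atomic_load(&s->counters) == old_cnt) {
		atomic_store(&s->freelist, new_free);
		atomic_store(&s->counters, new_cnt);
		ok = true;
	}
	atomic_flag_clear(&s->lock);
	return ok;
}

/* Push obj onto the freelist and drop the in-use count as one update. */
static inline unsigned long free_object(struct slab_state *s,
					struct object *obj)
{
	struct object *prior;
	unsigned long counters;

	do {
		prior = atomic_load(&s->freelist);
		counters = atomic_load(&s->counters);
		obj->next = prior;	/* link before publishing */
	} while (!cmpxchg_double_emul(s, prior, counters, obj, counters - 1));

	return counters - 1;	/* in-use count after this free */
}
```

Because the new freelist head and the new count are published together, a concurrent observer never sees a freelist that disagrees with the count. The patch relies on this when it decides from prior and inuse whether list maintenance is needed.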
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 28EA38D0055 for ; Wed, 30 Mar 2011 16:24:27 -0400 (EDT) Message-Id: <20110330202424.617825534@linux.com> Date: Wed, 30 Mar 2011 15:23:57 -0500 From: Christoph Lameter Subject: [slubll1 15/19] slub: Disable interrupts in free_debug processing References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=irqoff_in_free_debug_processing Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers We will be calling free_debug_processing with interrupts enabled in some cases once the later patches are applied. Some of the functions called by free_debug_processing expect interrupts to be off, so disable them inside the function and restore them on every return path. Signed-off-by: Christoph Lameter --- mm/slub.c | 6 ++++++ 1 file changed, 6 insertions(+) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-23 16:25:00.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-24 08:50:37.000000000 -0500 @@ -1024,6 +1024,10 @@ bad: static noinline int free_debug_processing(struct kmem_cache *s, struct page *page, void *object, unsigned long addr) { + unsigned long flags; + + local_irq_save(flags); + if (!check_slab(s, page)) goto fail; @@ -1059,10 +1063,12 @@ static noinline int free_debug_processin set_track(s, object, TRACK_FREE, addr); trace(s, page, object, 0); init_object(s, object, SLUB_RED_INACTIVE); + local_irq_restore(flags); return 1; fail: slab_fix(s, "Object at 0x%p not freed", object); + local_irq_restore(flags); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 2A80B8D0056 for ; Wed, 30 Mar 2011 16:24:27 -0400 (EDT) Message-Id: <20110330202423.982893430@linux.com> Date: Wed, 30 Mar 2011 15:23:56 -0500 From: Christoph Lameter Subject: [slubll1 14/19] slub: Invert locking and avoid slab lock References: <20110330202342.669400887@linux.com> Content-Disposition: inline; filename=slab_lock_subsume Sender: owner-linux-mm@kvack.org List-ID: To: Pekka Enberg Cc: David Rientjes , linux-mm@kvack.org, Eric Dumazet , "H. Peter Anvin" , Mathieu Desnoyers Locking slabs is no longer necessary if the arch supports cmpxchg operations and if no debugging features are used on a slab. If the arch does not support cmpxchg then we fall back to using the slab lock to do a cmpxchg-like operation. The patch also changes the lock order. Slab locks now nest inside the node lock (the list_lock is taken first). With that approach slab_trylock() is no longer necessary. Signed-off-by: Christoph Lameter --- mm/slub.c | 130 ++++++++++++++++++++++++-------------------------------------- 1 file changed, 52 insertions(+), 78 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-03-30 14:43:01.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-03-30 14:43:34.000000000 -0500 @@ -2,10 +2,11 @@ * SLUB: A slab allocator that limits cache line use instead of queuing * objects in per cpu and per node lists. * - * The allocator synchronizes using per slab locks and only - * uses a centralized lock to manage a pool of partial slabs. + * The allocator synchronizes using per slab locks or atomic operatios + * and only uses a centralized lock to manage a pool of partial slabs.
  *
  * (C) 2007 SGI, Christoph Lameter
+ * (C) 2011 Linux Foundation, Christoph Lameter
  */

 #include
@@ -32,15 +33,27 @@

 /*
  * Lock order:
- *   1. slab_lock(page)
- *   2. slab->list_lock
- *
- *   The slab_lock protects operations on the object of a particular
- *   slab and its metadata in the page struct. If the slab lock
- *   has been taken then no allocations nor frees can be performed
- *   on the objects in the slab nor can the slab be added or removed
- *   from the partial or full lists since this would mean modifying
- *   the page_struct of the slab.
+ *   1. slub_lock (Global Semaphore)
+ *   2. node->list_lock
+ *   3. slab_lock(page) (Only on some arches and for debugging)
+ *
+ *   slub_lock
+ *
+ *   The role of the slub_lock is to protect the list of all the slabs
+ *   and to synchronize major metadata changes to slab cache structures.
+ *
+ *   The slab_lock is only used for debugging and on arches that do not
+ *   have the ability to do a cmpxchg_double. It only protects the second
+ *   double word in the page struct. Meaning
+ *	A. page->freelist	-> List of object free in a page
+ *	B. page->counters	-> Counters of objects
+ *	C. page->frozen		-> frozen state
+ *
+ *   If a slab is frozen then it is exempt from list management. It is not
+ *   on any list. The processor that froze the slab is the one who can
+ *   perform list operations on the page. Other processors may put objects
+ *   onto the freelist but the processor that froze the slab is the only
+ *   one that can retrieve the objects from the page's freelist.
  *
  *   The list_lock protects the partial and full list on each node and
  *   the partial slab counter. If taken then no new slabs may be added or
@@ -53,20 +66,6 @@
  *   slabs, operations can continue without any centralized lock. F.e.
  *   allocating a long series of objects that fill up slabs does not require
  *   the list lock.
- *
- *   The lock order is sometimes inverted when we are trying to get a slab
- *   off a list. We take the list_lock and then look for a page on the list
- *   to use. While we do that objects in the slabs may be freed. We can
- *   only operate on the slab if we have also taken the slab_lock. So we use
- *   a slab_trylock() on the slab. If trylock was successful then no frees
- *   can occur anymore and we can use the slab for allocations etc. If the
- *   slab_trylock() does not succeed then frees are in progress in the slab and
- *   we must stay away from it for a while since we may cause a bouncing
- *   cacheline if we try to acquire the lock. So go onto the next slab.
- *   If all pages are busy then we may allocate a new slab instead of reusing
- *   a partial slab. A new slab has noone operating on it and thus there is
- *   no danger of cacheline contention.
- *
  *   Interrupts are disabled during allocation and deallocation in order to
  *   make the slab allocator safe to use in the context of an irq. In addition
  *   interrupts are disabled to ensure that the processor does not change
@@ -329,6 +328,19 @@ static inline int oo_objects(struct kmem
 	return x.x & OO_MASK;
 }

+/*
+ * Per slab locking using the pagelock
+ */
+static __always_inline void slab_lock(struct page *page)
+{
+	bit_spin_lock(PG_locked, &page->flags);
+}
+
+static __always_inline void slab_unlock(struct page *page)
+{
+	__bit_spin_unlock(PG_locked, &page->flags);
+}
+
 static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
 		void *freelist_old, unsigned long counters_old,
 		void *freelist_new, unsigned long counters_new,
@@ -343,11 +355,14 @@ static inline bool cmpxchg_double_slab(s
 	} else
 #endif
 	{
+		slab_lock(page);
 		if (page->freelist == freelist_old && page->counters == counters_old) {
 			page->freelist = freelist_new;
 			page->counters = counters_new;
+			slab_unlock(page);
 			return 1;
 		}
+		slab_unlock(page);
 	}

 	cpu_relax();
@@ -363,7 +378,7 @@ static inline bool cmpxchg_double_slab(s
 /*
  * Determine a map of object in use on a page.
  *
- * Slab lock or node listlock must be held to guarantee that the page does
+ * Node listlock must be held to guarantee that the page does
  * not vanish from under us.
  */
 static void get_map(struct kmem_cache *s, struct page *page, unsigned long *map)
@@ -795,10 +810,12 @@ static int check_slab(struct kmem_cache
 static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
 {
 	int nr = 0;
-	void *fp = page->freelist;
+	void *fp;
 	void *object = NULL;
 	unsigned long max_objects;

+	slab_lock(page);
+	fp = page->freelist;
 	while (fp && nr <= page->objects) {
 		if (fp == search)
 			return 1;
@@ -812,6 +829,7 @@ static int on_freelist(struct kmem_cache
 				slab_err(s, page, "Freepointer corrupt");
 				page->freelist = NULL;
 				page->inuse = page->objects;
+				slab_unlock(page);
 				slab_fix(s, "Freelist cleared");
 				return 0;
 			}
@@ -838,6 +856,7 @@ static int on_freelist(struct kmem_cache
 		page->inuse = page->objects - nr;
 		slab_fix(s, "Object count adjusted.");
 	}
+	slab_unlock(page);
 	return search == NULL;
 }

@@ -1364,27 +1383,6 @@ static void discard_slab(struct kmem_cac
 }

 /*
- * Per slab locking using the pagelock
- */
-static __always_inline void slab_lock(struct page *page)
-{
-	bit_spin_lock(PG_locked, &page->flags);
-}
-
-static __always_inline void slab_unlock(struct page *page)
-{
-	__bit_spin_unlock(PG_locked, &page->flags);
-}
-
-static __always_inline int slab_trylock(struct page *page)
-{
-	int rc = 1;
-
-	rc = bit_spin_trylock(PG_locked, &page->flags);
-	return rc;
-}
-
-/*
  * Management of partially allocated slabs
  */
 static inline void add_partial(struct kmem_cache_node *n,
@@ -1410,17 +1408,13 @@ static inline void remove_partial(struct
  *
  * Must hold list_lock.
  */
-static inline int lock_and_freeze_slab(struct kmem_cache *s,
+static inline int acquire_slab(struct kmem_cache *s,
 		struct kmem_cache_node *n, struct page *page)
 {
 	void *freelist;
 	unsigned long counters;
 	struct page new;
-
-	if (!slab_trylock(page))
-		return 0;
-
 	/*
 	 * Zap the freelist and set the frozen bit.
 	 * The old freelist is the list of objects for the
@@ -1456,7 +1450,6 @@ static inline int lock_and_freeze_slab(s
 		 */
 		printk(KERN_ERR "SLUB: %s : Page without available objects on"
 			" partial list\n", s->name);
-		slab_unlock(page);
 		return 0;
 	}
 }
@@ -1480,7 +1473,7 @@ static struct page *get_partial_node(str

 	spin_lock(&n->list_lock);
 	list_for_each_entry(page, &n->partial, lru)
-		if (lock_and_freeze_slab(s, n, page))
+		if (acquire_slab(s, n, page))
 			goto out;
 	page = NULL;
 out:
@@ -1779,8 +1772,6 @@ redo:
 				"unfreezing slab"))
 		goto redo;

-	slab_unlock(page);
-
 	if (lock)
 		spin_unlock(&n->list_lock);

@@ -1793,7 +1784,6 @@ redo:
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	stat(s, CPUSLAB_FLUSH);
-	slab_lock(c->page);
 	deactivate_slab(s, c);
 }

@@ -1942,7 +1932,6 @@ static void *__slab_alloc(struct kmem_ca
 	if (!page)
 		goto new_slab;

-	slab_lock(page);
 	if (unlikely(!node_match(c, node)))
 		goto another_slab;

@@ -1971,8 +1960,6 @@ load_freelist:
 	if (unlikely(!object))
 		goto another_slab;

-	slab_unlock(page);
-
 	c->freelist = get_freepointer(s, object);

 #ifdef CONFIG_CMPXCHG_LOCAL
@@ -2023,7 +2010,6 @@ load_from_page:
 		c->page = page;
 		stat(s, ALLOC_SLAB);
-		slab_lock(page);
 		goto load_from_page;
 	}
@@ -2220,7 +2206,6 @@ static void __slab_free(struct kmem_cach

 	local_irq_save(flags);
 #endif
-	slab_lock(page);
 	stat(s, FREE_SLOWPATH);

 	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
@@ -2286,7 +2271,6 @@ static void __slab_free(struct kmem_cach
 	spin_unlock(&n->list_lock);

 out_unlock:
-	slab_unlock(page);
 #ifdef CONFIG_CMPXCHG_LOCAL
 	local_irq_restore(flags);
 #endif
@@ -2302,7 +2286,6 @@ slab_empty:
 	}

 	spin_unlock(&n->list_lock);
-	slab_unlock(page);
 #ifdef CONFIG_CMPXCHG_LOCAL
 	local_irq_restore(flags);
 #endif
@@ -3237,14 +3220,8 @@ int kmem_cache_shrink(struct kmem_cache
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
-				/*
-				 * Must hold slab lock here because slab_free
-				 * may have freed the last object and be
-				 * waiting to release the slab.
-				 */
+			if (!page->inuse) {
 				remove_partial(n, page);
-				slab_unlock(page);
 				discard_slab(s, page);
 			} else {
 				list_move(&page->lru,
@@ -3832,12 +3809,9 @@ static int validate_slab(struct kmem_cac
 static void validate_slab_slab(struct kmem_cache *s, struct page *page,
 						unsigned long *map)
 {
-	if (slab_trylock(page)) {
-		validate_slab(s, page, map);
-		slab_unlock(page);
-	} else
-		printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
-			s->name, page);
+	slab_lock(page);
+	validate_slab(s, page, map);
+	slab_unlock(page);
 }

 static int validate_slab_node(struct kmem_cache *s,

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51])
	by kanga.kvack.org (Postfix) with SMTP id 285718D004C
	for ; Wed, 30 Mar 2011 16:24:28 -0400 (EDT)
Message-Id: <20110330202425.248941090@linux.com>
Date: Wed, 30 Mar 2011 15:23:58 -0500
From: Christoph Lameter
Subject: [slubll1 16/19] slub: Avoid disabling interrupts in free slowpath
References: <20110330202342.669400887@linux.com>
Content-Disposition: inline; filename=slab_free_without_irqoff
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: David Rientjes, linux-mm@kvack.org, Eric Dumazet, "H. Peter Anvin", Mathieu Desnoyers

Disabling interrupts can be avoided now. However, list operations still
require disabling interrupts since allocations can occur from interrupt
contexts and there is no way to perform atomic list operations.
So acquire the list lock opportunistically if there is a chance that
list operations would be needed. This may result in needless
synchronizations but allows the avoidance of synchronization in the
majority of the cases.

Dropping interrupt handling significantly simplifies the slowpath.

Signed-off-by: Christoph Lameter

---
 mm/slub.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-03-30 14:43:44.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-03-30 14:43:51.000000000 -0500
@@ -2209,13 +2209,11 @@ static void __slab_free(struct kmem_cach
 	struct kmem_cache_node *n = NULL;
 #ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long flags;
-
-	local_irq_save(flags);
 #endif
 	stat(s, FREE_SLOWPATH);

 	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
-		goto out_unlock;
+		return;

 	do {
 		prior = page->freelist;
@@ -2234,7 +2232,11 @@ static void __slab_free(struct kmem_cach
 			 * Otherwise the list_lock will synchronize with
 			 * other processors updating the list of slabs.
 			 */
+#ifdef CONFIG_CMPXCHG_LOCAL
+			spin_lock_irqsave(&n->list_lock, flags);
+#else
 			spin_lock(&n->list_lock);
+#endif
 		}
 		inuse = new.inuse;

@@ -2250,7 +2252,7 @@ static void __slab_free(struct kmem_cach
 		 */
 		if (was_frozen)
 			stat(s, FREE_FROZEN);
-		goto out_unlock;
+		return;
 	}

 	/*
@@ -2273,12 +2275,10 @@ static void __slab_free(struct kmem_cach
 			stat(s, FREE_ADD_PARTIAL);
 		}
 	}
-
-	spin_unlock(&n->list_lock);
-
-out_unlock:
 #ifdef CONFIG_CMPXCHG_LOCAL
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&n->list_lock, flags);
+#else
+	spin_unlock(&n->list_lock);
 #endif
 	return;

@@ -2291,9 +2291,10 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}

-	spin_unlock(&n->list_lock);
 #ifdef CONFIG_CMPXCHG_LOCAL
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&n->list_lock, flags);
+#else
+	spin_unlock(&n->list_lock);
 #endif
 	stat(s, FREE_SLAB);
 	discard_slab(s, page);

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3])
	by kanga.kvack.org (Postfix) with SMTP id 6D2EB8D0055
	for ; Wed, 30 Mar 2011 16:24:28 -0400 (EDT)
Message-Id: <20110330202425.905786071@linux.com>
Date: Wed, 30 Mar 2011 15:23:59 -0500
From: Christoph Lameter
Subject: [slubll1 17/19] slub: Get rid of the another_slab label
References: <20110330202342.669400887@linux.com>
Content-Disposition: inline; filename=eliminate_another_slab
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: David Rientjes, linux-mm@kvack.org, Eric Dumazet, "H. Peter Anvin", Mathieu Desnoyers

We can avoid calling deactivate_slab() in special cases if we do the
deactivation of slabs in each code flow that leads to new_slab.

Signed-off-by: Christoph Lameter

---
 mm/slub.c |   11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-03-30 14:43:51.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-03-30 14:43:57.000000000 -0500
@@ -1938,8 +1938,10 @@ static void *__slab_alloc(struct kmem_ca
 	if (!page)
 		goto new_slab;

-	if (unlikely(!node_match(c, node)))
-		goto another_slab;
+	if (unlikely(!node_match(c, node))) {
+		deactivate_slab(s, c);
+		goto new_slab;
+	}

 	stat(s, ALLOC_REFILL);

@@ -1964,7 +1966,7 @@ load_freelist:
 	VM_BUG_ON(!page->frozen);

 	if (unlikely(!object))
-		goto another_slab;
+		goto new_slab;

 	c->freelist = get_freepointer(s, object);

@@ -1975,9 +1977,6 @@ load_freelist:
 	stat(s, ALLOC_SLOWPATH);
 	return object;

-another_slab:
-	deactivate_slab(s, c);
-
 new_slab:
 	page = get_partial(s, gfpflags, node);
 	if (page) {

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35])
	by kanga.kvack.org (Postfix) with SMTP id 414A58D0051
	for ; Wed, 30 Mar 2011 16:24:29 -0400 (EDT)
Message-Id: <20110330202426.533483554@linux.com>
Date: Wed, 30 Mar 2011 15:24:00 -0500
From: Christoph Lameter
Subject: [slubll1 18/19] slub: fast release on full slab
References: <20110330202342.669400887@linux.com>
Content-Disposition: inline; filename=slab_alloc_fast_release
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: David Rientjes, linux-mm@kvack.org, Eric Dumazet, "H. Peter Anvin", Mathieu Desnoyers

Make deactivation occur implicitly while checking out the current freelist.
This avoids one cmpxchg operation on a slab that is now fully in use.

Signed-off-by: Christoph Lameter

---
 mm/slub.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-03-30 14:43:57.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-03-30 14:44:00.000000000 -0500
@@ -1953,9 +1953,21 @@ static void *__slab_alloc(struct kmem_ca
 			object = page->freelist;
 			counters = page->counters;
 			new.counters = counters;
-			new.inuse = page->objects;
 			VM_BUG_ON(!new.frozen);

+			/*
+			 * If there is no object left then we use this loop to
+			 * deactivate the slab which is simple since no objects
+			 * are left in the slab and therefore we do not need to
+			 * put the page back onto the partial list.
+			 *
+			 * If there are objects left then we retrieve them
+			 * and use them to refill the per cpu queue.
+			 */
+
+			new.inuse = page->objects;
+			new.frozen = object != NULL;
+
 		} while (!cmpxchg_double_slab(s, page,
 				object, counters,
 				NULL, new.counters,
@@ -1965,8 +1977,10 @@ static void *__slab_alloc(struct kmem_ca
 load_freelist:
 	VM_BUG_ON(!page->frozen);

-	if (unlikely(!object))
+	if (unlikely(!object)) {
+		c->page = NULL;
 		goto new_slab;
+	}

 	c->freelist = get_freepointer(s, object);

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35])
	by kanga.kvack.org (Postfix) with SMTP id 360A48D005A
	for ; Wed, 30 Mar 2011 16:24:32 -0400 (EDT)
Message-Id: <20110330202427.218832134@linux.com>
Date: Wed, 30 Mar 2011 15:24:01 -0500
From: Christoph Lameter
Subject: [slubll1 19/19] slub: Not necessary to check for empty slab on load_freelist
References: <20110330202342.669400887@linux.com>
Content-Disposition: inline; filename=goto_load_freelist
Sender: owner-linux-mm@kvack.org
List-ID:
To: Pekka Enberg
Cc: David Rientjes, linux-mm@kvack.org, Eric Dumazet, "H. Peter Anvin", Mathieu Desnoyers

load_freelist is now only branched to if there are objects available,
so there is no need to check.

---
 mm/slub.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-03-30 14:44:00.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-03-30 14:44:02.000000000 -0500
@@ -1974,14 +1974,13 @@ static void *__slab_alloc(struct kmem_ca
 				"__slab_alloc"));
 	}

-load_freelist:
-	VM_BUG_ON(!page->frozen);
-
 	if (unlikely(!object)) {
 		c->page = NULL;
 		goto new_slab;
 	}

+load_freelist:
+	VM_BUG_ON(!page->frozen);
 	c->freelist = get_freepointer(s, object);

 #ifdef CONFIG_CMPXCHG_LOCAL
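
Editorial note for readers of this series: the locked fallback inside
cmpxchg_double_slab() (patch 14) and the retry loop built on top of it
can be sketched in userspace C roughly as below. This is only an
illustration under stated assumptions, not the kernel code: struct
toy_slab, toy_cmpxchg_double() and toy_free() are hypothetical names, a
C11 atomic_flag spinlock stands in for the PG_locked bit spinlock, and
"counters" is reduced to a plain in-use count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the freelist/counters double word in struct page. */
struct toy_slab {
	atomic_flag locked;	/* plays the role of the PG_locked bit spinlock */
	void *freelist;		/* first free object, or NULL */
	unsigned long counters;	/* simplified: number of objects in use */
};

static void toy_slab_init(struct toy_slab *slab, unsigned long inuse)
{
	atomic_flag_clear(&slab->locked);
	slab->freelist = NULL;
	slab->counters = inuse;
}

static void toy_slab_lock(struct toy_slab *slab)
{
	while (atomic_flag_test_and_set_explicit(&slab->locked,
						 memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void toy_slab_unlock(struct toy_slab *slab)
{
	atomic_flag_clear_explicit(&slab->locked, memory_order_release);
}

/*
 * Emulated double-word compare-and-exchange: both words are replaced only
 * if both still hold the expected values. Returns true on success.
 */
static bool toy_cmpxchg_double(struct toy_slab *slab,
			       void *freelist_old, unsigned long counters_old,
			       void *freelist_new, unsigned long counters_new)
{
	bool swapped = false;

	toy_slab_lock(slab);
	if (slab->freelist == freelist_old && slab->counters == counters_old) {
		slab->freelist = freelist_new;
		slab->counters = counters_new;
		swapped = true;
	}
	toy_slab_unlock(slab);
	return swapped;
}

/*
 * Free one object: link it in front of the freelist and drop the in-use
 * count, retrying if the snapshot went stale in the meantime. The unlocked
 * snapshot mirrors the pattern in the patches; strictly conforming
 * concurrent C would use atomic loads here (assumption of this sketch).
 */
static void toy_free(struct toy_slab *slab, void **object)
{
	void *prior;
	unsigned long counters;

	assert(slab->counters > 0);
	do {
		prior = slab->freelist;
		counters = slab->counters;
		*object = prior;	/* free pointer lives inside the object */
	} while (!toy_cmpxchg_double(slab, prior, counters,
				     object, counters - 1));
}
```

A caller would initialise a toy_slab with toy_slab_init(), then hand
objects back with toy_free(); each object becomes the new freelist head
and points at the previous head. Where the arch has a real
cmpxchg_double, the lock/compare/store/unlock sequence collapses into a
single instruction, which is the case the series optimises for.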