* [RFC][PATCH 1/7] mm: print more details for bad_page()
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
@ 2013-12-13 23:59 ` Dave Hansen
2013-12-16 16:52 ` Christoph Lameter
2013-12-13 23:59 ` [RFC][PATCH 2/7] mm: page->pfmemalloc only used by slab/skb Dave Hansen
` (6 subsequent siblings)
7 siblings, 1 reply; 20+ messages in thread
From: Dave Hansen @ 2013-12-13 23:59 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Pravin B Shelar, Christoph Lameter, Kirill A. Shutemov,
Andi Kleen, Dave Hansen
bad_page() is cool in that it prints out a bunch of data about
the page. But, I can never remember which page flags are good
and which are bad, or whether ->index or ->mapping is required to
be NULL.
This patch allows bad/dump_page() callers to specify a string about
why they are dumping the page and adds explanation strings to a
number of places. It also adds a 'bad_flags' argument to
bad_page(), which gets dumped separately from the flags that are
actually set on the page.
This way, the messages will show why the page was bad and, if a
page flag combination was the problem, *specifically* which flags
it is complaining about.
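As a rough sketch (not part of this patch) of what a hypothetical
caller looks like after the change, passing along a reason string
and, where relevant, the offending flag mask:

        /* hypothetical debug check, for illustration only */
        if (WARN_ON(PageSlab(page)))
                dump_page(page, "unexpected slab page");

        /* when a specific flag combination is the problem: */
        if (page->flags & PAGE_FLAGS_CHECK_AT_FREE)
                dump_page_badflags(page, "flags set at free time",
                                   PAGE_FLAGS_CHECK_AT_FREE);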
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
linux.git-davehans/include/linux/mm.h | 4 +
linux.git-davehans/mm/balloon_compaction.c | 4 -
linux.git-davehans/mm/memory.c | 2
linux.git-davehans/mm/memory_hotplug.c | 2
linux.git-davehans/mm/page_alloc.c | 73 +++++++++++++++++++++--------
5 files changed, 62 insertions(+), 23 deletions(-)
diff -puN include/linux/mm.h~bad-page-details include/linux/mm.h
--- linux.git/include/linux/mm.h~bad-page-details 2013-12-13 15:51:47.177206143 -0800
+++ linux.git-davehans/include/linux/mm.h 2013-12-13 15:51:47.183206407 -0800
@@ -1977,7 +1977,9 @@ extern void shake_page(struct page *p, i
extern atomic_long_t num_poisoned_pages;
extern int soft_offline_page(struct page *page, int flags);
-extern void dump_page(struct page *page);
+extern void dump_page(struct page *page, char *reason);
+extern void dump_page_badflags(struct page *page, char *reason,
+ unsigned long badflags);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
extern void clear_huge_page(struct page *page,
diff -puN mm/balloon_compaction.c~bad-page-details mm/balloon_compaction.c
--- linux.git/mm/balloon_compaction.c~bad-page-details 2013-12-13 15:51:47.178206187 -0800
+++ linux.git-davehans/mm/balloon_compaction.c 2013-12-13 15:51:47.183206407 -0800
@@ -267,7 +267,7 @@ void balloon_page_putback(struct page *p
put_page(page);
} else {
WARN_ON(1);
- dump_page(page);
+ dump_page(page, "not movable balloon page");
}
unlock_page(page);
}
@@ -287,7 +287,7 @@ int balloon_page_migrate(struct page *ne
BUG_ON(!trylock_page(newpage));
if (WARN_ON(!__is_movable_balloon_page(page))) {
- dump_page(page);
+ dump_page(page, "not movable balloon page");
unlock_page(newpage);
return rc;
}
diff -puN mm/memory.c~bad-page-details mm/memory.c
--- linux.git/mm/memory.c~bad-page-details 2013-12-13 15:51:47.179206231 -0800
+++ linux.git-davehans/mm/memory.c 2013-12-13 15:51:47.184206451 -0800
@@ -670,7 +670,7 @@ static void print_bad_pte(struct vm_area
current->comm,
(long long)pte_val(pte), (long long)pmd_val(*pmd));
if (page)
- dump_page(page);
+ dump_page(page, "bad pte");
printk(KERN_ALERT
"addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
(void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
diff -puN mm/memory_hotplug.c~bad-page-details mm/memory_hotplug.c
--- linux.git/mm/memory_hotplug.c~bad-page-details 2013-12-13 15:51:47.180206275 -0800
+++ linux.git-davehans/mm/memory_hotplug.c 2013-12-13 15:51:47.185206495 -0800
@@ -1310,7 +1310,7 @@ do_migrate_range(unsigned long start_pfn
#ifdef CONFIG_DEBUG_VM
printk(KERN_ALERT "removing pfn %lx from LRU failed\n",
pfn);
- dump_page(page);
+ dump_page(page, "failed to remove from LRU");
#endif
put_page(page);
/* Because we don't have big zone->lock. we should
diff -puN mm/page_alloc.c~bad-page-details mm/page_alloc.c
--- linux.git/mm/page_alloc.c~bad-page-details 2013-12-13 15:51:47.181206319 -0800
+++ linux.git-davehans/mm/page_alloc.c 2013-12-13 15:51:47.186206539 -0800
@@ -295,7 +295,7 @@ static inline int bad_range(struct zone
}
#endif
-static void bad_page(struct page *page)
+static void bad_page(struct page *page, char *reason, unsigned long bad_flags)
{
static unsigned long resume;
static unsigned long nr_shown;
@@ -329,7 +329,7 @@ static void bad_page(struct page *page)
printk(KERN_ALERT "BUG: Bad page state in process %s pfn:%05lx\n",
current->comm, page_to_pfn(page));
- dump_page(page);
+ dump_page_badflags(page, reason, bad_flags);
print_modules();
dump_stack();
@@ -383,7 +383,7 @@ static int destroy_compound_page(struct
int bad = 0;
if (unlikely(compound_order(page) != order)) {
- bad_page(page);
+ bad_page(page, "wrong compound order", 0);
bad++;
}
@@ -392,8 +392,11 @@ static int destroy_compound_page(struct
for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
- if (unlikely(!PageTail(p) || (p->first_page != page))) {
- bad_page(page);
+ if (unlikely(!PageTail(p))) {
+ bad_page(page, "PageTail not set", 0);
+ bad++;
+ } else if (unlikely(p->first_page != page)) {
+ bad_page(page, "first_page not consistent", 0);
bad++;
}
__ClearPageTail(p);
@@ -618,12 +621,23 @@ out:
static inline int free_pages_check(struct page *page)
{
- if (unlikely(page_mapcount(page) |
- (page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
- (page->flags & PAGE_FLAGS_CHECK_AT_FREE) |
- (mem_cgroup_bad_page_check(page)))) {
- bad_page(page);
+ char *bad_reason = NULL;
+ unsigned long bad_flags = 0;
+
+ if (unlikely(page_mapcount(page)))
+ bad_reason = "nonzero mapcount";
+ if (unlikely(page->mapping != NULL))
+ bad_reason = "non-NULL mapping";
+ if (unlikely(atomic_read(&page->_count) != 0))
+ bad_reason = "nonzero _count";
+ if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_FREE)) {
+ bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
+ bad_flags = PAGE_FLAGS_CHECK_AT_FREE;
+ }
+ if (unlikely(mem_cgroup_bad_page_check(page)))
+ bad_reason = "cgroup check failed";
+ if (unlikely(bad_reason)) {
+ bad_page(page, bad_reason, bad_flags);
return 1;
}
page_cpupid_reset_last(page);
@@ -843,12 +857,23 @@ static inline void expand(struct zone *z
*/
static inline int check_new_page(struct page *page)
{
- if (unlikely(page_mapcount(page) |
- (page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
- (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
- (mem_cgroup_bad_page_check(page)))) {
- bad_page(page);
+ char *bad_reason = NULL;
+ unsigned long bad_flags = 0;
+
+ if (unlikely(page_mapcount(page)))
+ bad_reason = "nonzero mapcount";
+ if (unlikely(page->mapping != NULL))
+ bad_reason = "non-NULL mapping";
+ if (unlikely(atomic_read(&page->_count) != 0))
+ bad_reason = "nonzero _count";
+ if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
+ bad_reason = "PAGE_FLAGS_CHECK_AT_PREP flag set";
+ bad_flags = PAGE_FLAGS_CHECK_AT_PREP;
+ }
+ if (unlikely(mem_cgroup_bad_page_check(page)))
+ bad_reason = "cgroup check failed";
+ if (unlikely(bad_reason)) {
+ bad_page(page, bad_reason, bad_flags);
return 1;
}
return 0;
@@ -6458,12 +6483,24 @@ static void dump_page_flags(unsigned lon
printk(")\n");
}
-void dump_page(struct page *page)
+void dump_page_badflags(struct page *page, char *reason, unsigned long badflags)
{
printk(KERN_ALERT
"page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
page, atomic_read(&page->_count), page_mapcount(page),
page->mapping, page->index);
dump_page_flags(page->flags);
+ if (reason)
+ printk(KERN_ALERT "page dumped because: %s\n", reason);
+ if (page->flags & badflags) {
+ printk(KERN_ALERT "bad because of flags:\n");
+ dump_page_flags(page->flags & badflags);
+ }
mem_cgroup_print_bad_page(page);
}
+
+void dump_page(struct page *page, char *reason)
+{
+ dump_page_badflags(page, reason, 0);
+}
+
_
* Re: [RFC][PATCH 1/7] mm: print more details for bad_page()
2013-12-13 23:59 ` [RFC][PATCH 1/7] mm: print more details for bad_page() Dave Hansen
@ 2013-12-16 16:52 ` Christoph Lameter
2013-12-16 17:20 ` Andi Kleen
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2013-12-16 16:52 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, linux-mm, Pravin B Shelar, Kirill A. Shutemov,
Andi Kleen
On Fri, 13 Dec 2013, Dave Hansen wrote:
> This way, the messages will show why the page was bad and, if a
> page flag combination was the problem, *specifically* which flags
> it is complaining about.
Yes this would have been helpful in the past for me.
Reviewed-by: Christoph Lameter <cl@linux.com>
* Re: [RFC][PATCH 1/7] mm: print more details for bad_page()
2013-12-16 16:52 ` Christoph Lameter
@ 2013-12-16 17:20 ` Andi Kleen
0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2013-12-16 17:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dave Hansen, linux-kernel, linux-mm, Pravin B Shelar,
Kirill A. Shutemov
On Mon, Dec 16, 2013 at 04:52:57PM +0000, Christoph Lameter wrote:
> On Fri, 13 Dec 2013, Dave Hansen wrote:
>
> > This way, the messages will show why the page was bad and, if a
> > page flag combination was the problem, *specifically* which flags
> > it is complaining about.
>
> Yes this would have been helpful in the past for me.
Yes, for me too. </AOL>
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* [RFC][PATCH 2/7] mm: page->pfmemalloc only used by slab/skb
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
2013-12-13 23:59 ` [RFC][PATCH 1/7] mm: print more details for bad_page() Dave Hansen
@ 2013-12-13 23:59 ` Dave Hansen
2013-12-13 23:59 ` [RFC][PATCH 3/7] mm: slabs: reset page at free Dave Hansen
` (5 subsequent siblings)
7 siblings, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2013-12-13 23:59 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Pravin B Shelar, Christoph Lameter, Kirill A. Shutemov,
Andi Kleen, Dave Hansen
page->pfmemalloc does not deserve a spot in 'struct page'. It is
only used transiently _just_ after a page leaves the buddy
allocator.
Instead of declaring a union, we move its functionality behind a
few quick accessor functions. This also makes it much easier to
audit, in debugging scenarios, that the field is being used
correctly. For instance, we could store a magic number in there
which could never get reused as a page->index and check that the
magic number exists in page_pfmemalloc().
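A minimal sketch of what that debugging variant could look like
(the magic value, the shifting, and the VM_BUG_ON() are made up
here for illustration; they are not part of this patch):

        #define PFMEMALLOC_MAGIC        0xf00dUL        /* arbitrary */

        static inline void set_page_pfmemalloc(struct page *page, int pfmemalloc)
        {
                page->index = (PFMEMALLOC_MAGIC << 1) | !!pfmemalloc;
        }

        static inline int page_pfmemalloc(struct page *page)
        {
                VM_BUG_ON((page->index >> 1) != PFMEMALLOC_MAGIC);
                return page->index & 1;
        }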
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
linux.git-davehans/include/linux/mm.h | 17 +++++++++++++++++
linux.git-davehans/include/linux/mm_types.h | 9 ---------
linux.git-davehans/include/linux/skbuff.h | 10 +++++-----
linux.git-davehans/mm/page_alloc.c | 2 +-
linux.git-davehans/mm/slab.c | 4 ++--
linux.git-davehans/mm/slub.c | 2 +-
6 files changed, 26 insertions(+), 18 deletions(-)
diff -puN include/linux/mm.h~page_pfmemalloc-only-used-by-slab include/linux/mm.h
--- linux.git/include/linux/mm.h~page_pfmemalloc-only-used-by-slab 2013-12-13 15:51:47.467218911 -0800
+++ linux.git-davehans/include/linux/mm.h 2013-12-13 15:51:47.475219263 -0800
@@ -2013,5 +2013,22 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif
+/*
+ * If set by the page allocator, ALLOC_NO_WATERMARKS was set and the
+ * low watermark was not met implying that the system is under some
+ * pressure. The caller should try ensure this page is only used to
+ * free other pages. Currently only used by sl[au]b. Note that
+ * this is only valid for a short time after the page returns
+ * from the allocator.
+ */
+static inline int page_pfmemalloc(struct page *page)
+{
+ return !!page->index;
+}
+static inline void set_page_pfmemalloc(struct page *page, int pfmemalloc)
+{
+ page->index = pfmemalloc;
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff -puN include/linux/mm_types.h~page_pfmemalloc-only-used-by-slab include/linux/mm_types.h
--- linux.git/include/linux/mm_types.h~page_pfmemalloc-only-used-by-slab 2013-12-13 15:51:47.468218955 -0800
+++ linux.git-davehans/include/linux/mm_types.h 2013-12-13 15:51:47.475219263 -0800
@@ -60,15 +60,6 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* sl[aou]b first free object */
- bool pfmemalloc; /* If set by the page allocator,
- * ALLOC_NO_WATERMARKS was set
- * and the low watermark was not
- * met implying that the system
- * is under some pressure. The
- * caller should try ensure
- * this page is only used to
- * free other pages.
- */
};
union {
diff -puN include/linux/skbuff.h~page_pfmemalloc-only-used-by-slab include/linux/skbuff.h
--- linux.git/include/linux/skbuff.h~page_pfmemalloc-only-used-by-slab 2013-12-13 15:51:47.469218999 -0800
+++ linux.git-davehans/include/linux/skbuff.h 2013-12-13 15:51:47.475219263 -0800
@@ -1322,11 +1322,11 @@ static inline void __skb_fill_page_desc(
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
/*
- * Propagate page->pfmemalloc to the skb if we can. The problem is
+ * Propagate page_pfmemalloc() to the skb if we can. The problem is
* that not all callers have unique ownership of the page. If
* pfmemalloc is set, we check the mapping as a mapping implies
* page->index is set (index and pfmemalloc share space).
- * If it's a valid mapping, we cannot use page->pfmemalloc but we
+ * If it's a valid mapping, we cannot use page_pfmemalloc() but we
* do not lose pfmemalloc information as the pages would not be
* allocated using __GFP_MEMALLOC.
*/
@@ -1335,7 +1335,7 @@ static inline void __skb_fill_page_desc(
skb_frag_size_set(frag, size);
page = compound_head(page);
- if (page->pfmemalloc && !page->mapping)
+ if (page_pfmemalloc(page) && !page->mapping)
skb->pfmemalloc = true;
}
@@ -1917,7 +1917,7 @@ static inline struct page *__skb_alloc_p
gfp_mask |= __GFP_MEMALLOC;
page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
- if (skb && page && page->pfmemalloc)
+ if (skb && page && page_pfmemalloc(page))
skb->pfmemalloc = true;
return page;
@@ -1946,7 +1946,7 @@ static inline struct page *__skb_alloc_p
static inline void skb_propagate_pfmemalloc(struct page *page,
struct sk_buff *skb)
{
- if (page && page->pfmemalloc)
+ if (page && page_pfmemalloc(page))
skb->pfmemalloc = true;
}
diff -puN mm/page_alloc.c~page_pfmemalloc-only-used-by-slab mm/page_alloc.c
--- linux.git/mm/page_alloc.c~page_pfmemalloc-only-used-by-slab 2013-12-13 15:51:47.470219043 -0800
+++ linux.git-davehans/mm/page_alloc.c 2013-12-13 15:51:47.477219351 -0800
@@ -2066,7 +2066,7 @@ this_zone_full:
* memory. The caller should avoid the page being used
* for !PFMEMALLOC purposes.
*/
- page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+ set_page_pfmemalloc(page, alloc_flags & ALLOC_NO_WATERMARKS);
return page;
}
diff -puN mm/slab.c~page_pfmemalloc-only-used-by-slab mm/slab.c
--- linux.git/mm/slab.c~page_pfmemalloc-only-used-by-slab 2013-12-13 15:51:47.471219087 -0800
+++ linux.git-davehans/mm/slab.c 2013-12-13 15:51:47.478219395 -0800
@@ -1672,7 +1672,7 @@ static struct page *kmem_getpages(struct
}
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
- if (unlikely(page->pfmemalloc))
+ if (unlikely(page_pfmemalloc(page)))
pfmemalloc_active = true;
nr_pages = (1 << cachep->gfporder);
@@ -1683,7 +1683,7 @@ static struct page *kmem_getpages(struct
add_zone_page_state(page_zone(page),
NR_SLAB_UNRECLAIMABLE, nr_pages);
__SetPageSlab(page);
- if (page->pfmemalloc)
+ if (page_pfmemalloc(page))
SetPageSlabPfmemalloc(page);
memcg_bind_pages(cachep, cachep->gfporder);
diff -puN mm/slub.c~page_pfmemalloc-only-used-by-slab mm/slub.c
--- linux.git/mm/slub.c~page_pfmemalloc-only-used-by-slab 2013-12-13 15:51:47.472219131 -0800
+++ linux.git-davehans/mm/slub.c 2013-12-13 15:51:47.478219395 -0800
@@ -1403,7 +1403,7 @@ static struct page *new_slab(struct kmem
memcg_bind_pages(s, order);
page->slab_cache = s;
__SetPageSlab(page);
- if (page->pfmemalloc)
+ if (page_pfmemalloc(page))
SetPageSlabPfmemalloc(page);
start = page_address(page);
_
* [RFC][PATCH 3/7] mm: slabs: reset page at free
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
2013-12-13 23:59 ` [RFC][PATCH 1/7] mm: print more details for bad_page() Dave Hansen
2013-12-13 23:59 ` [RFC][PATCH 2/7] mm: page->pfmemalloc only used by slab/skb Dave Hansen
@ 2013-12-13 23:59 ` Dave Hansen
2013-12-13 23:59 ` [RFC][PATCH 4/7] mm: rearrange struct page Dave Hansen
` (4 subsequent siblings)
7 siblings, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2013-12-13 23:59 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Pravin B Shelar, Christoph Lameter, Kirill A. Shutemov,
Andi Kleen, Dave Hansen
In order to simplify 'struct page', we will shortly be moving
some fields around. This causes slub's ->freelist usage to
impinge on page->mapping's storage space. The buddy allocator
wants ->mapping to be NULL when a page is handed back, so we have
to make sure that it is cleared.
Note that slab already does this, so just create a common helper
and have all the slabs do it this way. ->mapping is right next
to ->flags, so it is virtually guaranteed to be in the L1 cache at
this point; this shouldn't cost very much in practice.
Other allocators and users of 'struct page' may also want to call
this if they use parts of 'struct page' for nonstandard purposes.
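For instance, a hypothetical out-of-tree user that borrows
->mapping for private bookkeeping might do something like this
(illustration only, not part of the series):

        static void my_cache_free_page(struct page *page)
        {
                /* we stashed private state in page->mapping / ->_mapcount */
                allocator_reset_page(page);     /* restore what buddy expects */
                __free_page(page);
        }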
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
linux.git-davehans/include/linux/mm.h | 11 +++++++++++
linux.git-davehans/mm/slab.c | 3 +--
linux.git-davehans/mm/slab.h | 1 +
linux.git-davehans/mm/slob.c | 2 +-
linux.git-davehans/mm/slub.c | 2 +-
5 files changed, 15 insertions(+), 4 deletions(-)
diff -puN include/linux/mm.h~slub-reset-page-at-free include/linux/mm.h
--- linux.git/include/linux/mm.h~slub-reset-page-at-free 2013-12-13 15:51:47.771232294 -0800
+++ linux.git-davehans/include/linux/mm.h 2013-12-13 15:51:47.777232559 -0800
@@ -2030,5 +2030,16 @@ static inline void set_page_pfmemalloc(s
page->index = pfmemalloc;
}
+/*
+ * Custom allocators (like the slabs) use 'struct page' fields
+ * for all kinds of things. This resets the page's state so that
+ * the buddy allocator will be happy with it.
+ */
+static inline void allocator_reset_page(struct page *page)
+{
+ page->mapping = NULL;
+ page_mapcount_reset(page);
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff -puN mm/slab.c~slub-reset-page-at-free mm/slab.c
--- linux.git/mm/slab.c~slub-reset-page-at-free 2013-12-13 15:51:47.772232339 -0800
+++ linux.git-davehans/mm/slab.c 2013-12-13 15:51:47.778232603 -0800
@@ -1718,8 +1718,7 @@ static void kmem_freepages(struct kmem_c
BUG_ON(!PageSlab(page));
__ClearPageSlabPfmemalloc(page);
__ClearPageSlab(page);
- page_mapcount_reset(page);
- page->mapping = NULL;
+ allocator_reset_page(page);
memcg_release_pages(cachep, cachep->gfporder);
if (current->reclaim_state)
diff -puN mm/slab.h~slub-reset-page-at-free mm/slab.h
--- linux.git/mm/slab.h~slub-reset-page-at-free 2013-12-13 15:51:47.773232383 -0800
+++ linux.git-davehans/mm/slab.h 2013-12-13 15:51:47.778232603 -0800
@@ -278,3 +278,4 @@ struct kmem_cache_node {
void *slab_next(struct seq_file *m, void *p, loff_t *pos);
void slab_stop(struct seq_file *m, void *p);
+
diff -puN mm/slob.c~slub-reset-page-at-free mm/slob.c
--- linux.git/mm/slob.c~slub-reset-page-at-free 2013-12-13 15:51:47.774232427 -0800
+++ linux.git-davehans/mm/slob.c 2013-12-13 15:51:47.778232603 -0800
@@ -360,7 +360,7 @@ static void slob_free(void *block, int s
clear_slob_page_free(sp);
spin_unlock_irqrestore(&slob_lock, flags);
__ClearPageSlab(sp);
- page_mapcount_reset(sp);
+ allocator_reset_page(sp);
slob_free_pages(b, 0);
return;
}
diff -puN mm/slub.c~slub-reset-page-at-free mm/slub.c
--- linux.git/mm/slub.c~slub-reset-page-at-free 2013-12-13 15:51:47.775232471 -0800
+++ linux.git-davehans/mm/slub.c 2013-12-13 15:51:47.779232647 -0800
@@ -1452,7 +1452,7 @@ static void __free_slab(struct kmem_cach
__ClearPageSlab(page);
memcg_release_pages(s, order);
- page_mapcount_reset(page);
+ allocator_reset_page(page);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
__free_memcg_kmem_pages(page, order);
_
* [RFC][PATCH 4/7] mm: rearrange struct page
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
` (2 preceding siblings ...)
2013-12-13 23:59 ` [RFC][PATCH 3/7] mm: slabs: reset page at free Dave Hansen
@ 2013-12-13 23:59 ` Dave Hansen
2013-12-13 23:59 ` [RFC][PATCH 5/7] mm: slub: rearrange 'struct page' fields Dave Hansen
` (3 subsequent siblings)
7 siblings, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2013-12-13 23:59 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Pravin B Shelar, Christoph Lameter, Kirill A. Shutemov,
Andi Kleen, Dave Hansen
To make the layout of 'struct page' look nicer, I broke
up a few of the unions. But, this has a cost: things that
were guaranteed to line up before might not any more. To make up
for that, some BUILD_BUG_ON()s are added to manually check for
the alignment dependencies.
This makes it *MUCH* more clear how the first few fields of
'struct page' get used by the slab allocators.
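Each check is just an offsetof() comparison; for example, the
SLAB_PAGE_CHECK() macro added below expands to roughly:

        BUILD_BUG_ON(offsetof(struct page, _count) !=
                     offsetof(struct page, dontuse_slub_count));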
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
linux.git-davehans/include/linux/mm_types.h | 101 ++++++++++++++--------------
linux.git-davehans/mm/slab.c | 6 -
linux.git-davehans/mm/slab_common.c | 17 ++++
linux.git-davehans/mm/slob.c | 24 +++---
linux.git-davehans/mm/slub.c | 76 ++++++++++-----------
5 files changed, 121 insertions(+), 103 deletions(-)
diff -puN include/linux/mm_types.h~rearrange-struct-page include/linux/mm_types.h
--- linux.git/include/linux/mm_types.h~rearrange-struct-page 2013-12-13 15:51:48.055244798 -0800
+++ linux.git-davehans/include/linux/mm_types.h 2013-12-13 15:51:48.061245062 -0800
@@ -45,27 +45,60 @@ struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
union {
- struct address_space *mapping; /* If low bit clear, points to
- * inode address_space, or NULL.
- * If page mapped as anonymous
- * memory, low bit is set, and
- * it points to anon_vma object:
- * see PAGE_MAPPING_ANON below.
- */
- void *s_mem; /* slab first object */
- };
-
- /* Second double word */
- struct {
- union {
+ struct /* the normal uses */ {
pgoff_t index; /* Our offset within mapping. */
- void *freelist; /* sl[aou]b first free object */
+ /*
+ * mapping: If low bit clear, points to
+ * inode address_space, or NULL. If page
+ * mapped as anonymous memory, low bit is
+ * set, and it points to anon_vma object:
+ * see PAGE_MAPPING_ANON below.
+ */
+ struct address_space *mapping;
+ /*
+ * Count of ptes mapped in mms, to show when page
+ * is mapped & limit reverse map searches.
+ *
+ * Used also for tail pages refcounting instead
+ * of _count. Tail pages cannot be mapped and
+ * keeping the tail page _count zero at all times
+ * guarantees get_page_unless_zero() will never
+ * succeed on tail pages.
+ */
+ atomic_t _mapcount;
+ atomic_t _count;
+ }; /* end of the "normal" use */
+
+ struct { /* SLUB */
+ void *unused;
+ void *slub_freelist;
+ unsigned inuse:16;
+ unsigned objects:15;
+ unsigned frozen:1;
+ atomic_t dontuse_slub_count;
};
-
- union {
+ struct { /* SLAB */
+ void *s_mem;
+ void *slab_freelist;
+ unsigned int active;
+ atomic_t dontuse_slab_count;
+ };
+ struct { /* SLOB */
+ void *slob_unused;
+ void *slob_freelist;
+ unsigned int units;
+ atomic_t dontuse_slob_count;
+ };
+ /*
+ * This is here to help the slub code deal with
+ * its inuse/objects/frozen bitfields as a single
+ * blob.
+ */
+ struct { /* slub helpers */
+ void *slubhelp_unused;
+ void *slubhelp_freelist;
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
- defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
- /* Used for cmpxchg_double in slub */
+ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
unsigned long counters;
#else
/*
@@ -75,38 +108,6 @@ struct page {
*/
unsigned counters;
#endif
-
- struct {
-
- union {
- /*
- * Count of ptes mapped in
- * mms, to show when page is
- * mapped & limit reverse map
- * searches.
- *
- * Used also for tail pages
- * refcounting instead of
- * _count. Tail pages cannot
- * be mapped and keeping the
- * tail page _count zero at
- * all times guarantees
- * get_page_unless_zero() will
- * never succeed on tail
- * pages.
- */
- atomic_t _mapcount;
-
- struct { /* SLUB */
- unsigned inuse:16;
- unsigned objects:15;
- unsigned frozen:1;
- };
- int units; /* SLOB */
- };
- atomic_t _count; /* Usage count, see below. */
- };
- unsigned int active; /* SLAB */
};
};
diff -puN mm/slab.c~rearrange-struct-page mm/slab.c
--- linux.git/mm/slab.c~rearrange-struct-page 2013-12-13 15:51:48.056244842 -0800
+++ linux.git-davehans/mm/slab.c 2013-12-13 15:51:48.062245106 -0800
@@ -1955,7 +1955,7 @@ static void slab_destroy(struct kmem_cac
{
void *freelist;
- freelist = page->freelist;
+ freelist = page->slab_freelist;
slab_destroy_debugcheck(cachep, page);
if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU)) {
struct rcu_head *head;
@@ -2543,7 +2543,7 @@ static void *alloc_slabmgmt(struct kmem_
static inline unsigned int *slab_freelist(struct page *page)
{
- return (unsigned int *)(page->freelist);
+ return (unsigned int *)(page->slab_freelist);
}
static void cache_init_objs(struct kmem_cache *cachep,
@@ -2648,7 +2648,7 @@ static void slab_map_pages(struct kmem_c
void *freelist)
{
page->slab_cache = cache;
- page->freelist = freelist;
+ page->slab_freelist = freelist;
}
/*
diff -puN mm/slab_common.c~rearrange-struct-page mm/slab_common.c
--- linux.git/mm/slab_common.c~rearrange-struct-page 2013-12-13 15:51:48.057244886 -0800
+++ linux.git-davehans/mm/slab_common.c 2013-12-13 15:51:48.062245106 -0800
@@ -658,3 +658,20 @@ static int __init slab_proc_init(void)
}
module_init(slab_proc_init);
#endif /* CONFIG_SLABINFO */
+#define SLAB_PAGE_CHECK(field1, field2) \
+ BUILD_BUG_ON(offsetof(struct page, field1) != \
+ offsetof(struct page, field2))
+/*
+ * To make the layout of 'struct page' look nicer, we've broken
+ * up a few of the unions. Folks declaring their own use of the
+ * first few fields need to make sure that their use does not
+ * interfere with page->_count. This ensures that the individual
+ * users' use actually lines up with the real ->_count.
+ */
+void slab_build_checks(void)
+{
+ SLAB_PAGE_CHECK(_count, dontuse_slab_count);
+ SLAB_PAGE_CHECK(_count, dontuse_slub_count);
+ SLAB_PAGE_CHECK(_count, dontuse_slob_count);
+}
+
diff -puN mm/slob.c~rearrange-struct-page mm/slob.c
--- linux.git/mm/slob.c~rearrange-struct-page 2013-12-13 15:51:48.058244930 -0800
+++ linux.git-davehans/mm/slob.c 2013-12-13 15:51:48.062245106 -0800
@@ -219,7 +219,7 @@ static void *slob_page_alloc(struct page
slob_t *prev, *cur, *aligned = NULL;
int delta = 0, units = SLOB_UNITS(size);
- for (prev = NULL, cur = sp->freelist; ; prev = cur, cur = slob_next(cur)) {
+ for (prev = NULL, cur = sp->slob_freelist; ; prev = cur, cur = slob_next(cur)) {
slobidx_t avail = slob_units(cur);
if (align) {
@@ -243,12 +243,12 @@ static void *slob_page_alloc(struct page
if (prev)
set_slob(prev, slob_units(prev), next);
else
- sp->freelist = next;
+ sp->slob_freelist = next;
} else { /* fragment */
if (prev)
set_slob(prev, slob_units(prev), cur + units);
else
- sp->freelist = cur + units;
+ sp->slob_freelist = cur + units;
set_slob(cur + units, avail - units, next);
}
@@ -321,7 +321,7 @@ static void *slob_alloc(size_t size, gfp
spin_lock_irqsave(&slob_lock, flags);
sp->units = SLOB_UNITS(PAGE_SIZE);
- sp->freelist = b;
+ sp->slob_freelist = b;
INIT_LIST_HEAD(&sp->list);
set_slob(b, SLOB_UNITS(PAGE_SIZE), b + SLOB_UNITS(PAGE_SIZE));
set_slob_page_free(sp, slob_list);
@@ -368,7 +368,7 @@ static void slob_free(void *block, int s
if (!slob_page_free(sp)) {
/* This slob page is about to become partially free. Easy! */
sp->units = units;
- sp->freelist = b;
+ sp->slob_freelist = b;
set_slob(b, units,
(void *)((unsigned long)(b +
SLOB_UNITS(PAGE_SIZE)) & PAGE_MASK));
@@ -388,15 +388,15 @@ static void slob_free(void *block, int s
*/
sp->units += units;
- if (b < (slob_t *)sp->freelist) {
- if (b + units == sp->freelist) {
- units += slob_units(sp->freelist);
- sp->freelist = slob_next(sp->freelist);
+ if (b < (slob_t *)sp->slob_freelist) {
+ if (b + units == sp->slob_freelist) {
+ units += slob_units(sp->slob_freelist);
+ sp->slob_freelist = slob_next(sp->slob_freelist);
}
- set_slob(b, units, sp->freelist);
- sp->freelist = b;
+ set_slob(b, units, sp->slob_freelist);
+ sp->slob_freelist = b;
} else {
- prev = sp->freelist;
+ prev = sp->slob_freelist;
next = slob_next(prev);
while (b > next) {
prev = next;
diff -puN mm/slub.c~rearrange-struct-page mm/slub.c
--- linux.git/mm/slub.c~rearrange-struct-page 2013-12-13 15:51:48.059244974 -0800
+++ linux.git-davehans/mm/slub.c 2013-12-13 15:51:48.063245150 -0800
@@ -52,7 +52,7 @@
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double. It only protects the second
* double word in the page struct. Meaning
- * A. page->freelist -> List of object free in a page
+ * A. page->slub_freelist -> List of object free in a page
* B. page->counters -> Counters of objects
* C. page->frozen -> frozen state
*
@@ -365,7 +365,7 @@ static inline bool __cmpxchg_double_slab
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->freelist, &page->counters,
+ if (cmpxchg_double(&page->slub_freelist, &page->counters,
freelist_old, counters_old,
freelist_new, counters_new))
return 1;
@@ -373,9 +373,9 @@ static inline bool __cmpxchg_double_slab
#endif
{
slab_lock(page);
- if (page->freelist == freelist_old &&
+ if (page->slub_freelist == freelist_old &&
page->counters == counters_old) {
- page->freelist = freelist_new;
+ page->slub_freelist = freelist_new;
page->counters = counters_new;
slab_unlock(page);
return 1;
@@ -401,7 +401,7 @@ static inline bool cmpxchg_double_slab(s
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->freelist, &page->counters,
+ if (cmpxchg_double(&page->slub_freelist, &page->counters,
freelist_old, counters_old,
freelist_new, counters_new))
return 1;
@@ -412,9 +412,9 @@ static inline bool cmpxchg_double_slab(s
local_irq_save(flags);
slab_lock(page);
- if (page->freelist == freelist_old &&
+ if (page->slub_freelist == freelist_old &&
page->counters == counters_old) {
- page->freelist = freelist_new;
+ page->slub_freelist = freelist_new;
page->counters = counters_new;
slab_unlock(page);
local_irq_restore(flags);
@@ -446,7 +446,7 @@ static void get_map(struct kmem_cache *s
void *p;
void *addr = page_address(page);
- for (p = page->freelist; p; p = get_freepointer(s, p))
+ for (p = page->slub_freelist; p; p = get_freepointer(s, p))
set_bit(slab_index(p, s, addr), map);
}
@@ -557,7 +557,7 @@ static void print_page_info(struct page
{
printk(KERN_ERR
"INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx\n",
- page, page->objects, page->inuse, page->freelist, page->flags);
+ page, page->objects, page->inuse, page->slub_freelist, page->flags);
}
@@ -869,7 +869,7 @@ static int on_freelist(struct kmem_cache
void *object = NULL;
unsigned long max_objects;
- fp = page->freelist;
+ fp = page->slub_freelist;
while (fp && nr <= page->objects) {
if (fp == search)
return 1;
@@ -880,7 +880,7 @@ static int on_freelist(struct kmem_cache
set_freepointer(s, object, NULL);
} else {
slab_err(s, page, "Freepointer corrupt");
- page->freelist = NULL;
+ page->slub_freelist = NULL;
page->inuse = page->objects;
slab_fix(s, "Freelist cleared");
return 0;
@@ -919,7 +919,7 @@ static void trace(struct kmem_cache *s,
s->name,
alloc ? "alloc" : "free",
object, page->inuse,
- page->freelist);
+ page->slub_freelist);
if (!alloc)
print_section("Object ", (void *)object,
@@ -1086,7 +1086,7 @@ bad:
*/
slab_fix(s, "Marking all objects used");
page->inuse = page->objects;
- page->freelist = NULL;
+ page->slub_freelist = NULL;
}
return 0;
}
@@ -1420,7 +1420,7 @@ static struct page *new_slab(struct kmem
setup_object(s, page, last);
set_freepointer(s, last, NULL);
- page->freelist = start;
+ page->slub_freelist = start;
page->inuse = page->objects;
page->frozen = 1;
out:
@@ -1548,15 +1548,15 @@ static inline void *acquire_slab(struct
* The old freelist is the list of objects for the
* per cpu allocation list.
*/
- freelist = page->freelist;
+ freelist = page->slub_freelist;
counters = page->counters;
new.counters = counters;
*objects = new.objects - new.inuse;
if (mode) {
new.inuse = page->objects;
- new.freelist = NULL;
+ new.slub_freelist = NULL;
} else {
- new.freelist = freelist;
+ new.slub_freelist = freelist;
}
VM_BUG_ON(new.frozen);
@@ -1564,7 +1564,7 @@ static inline void *acquire_slab(struct
if (!__cmpxchg_double_slab(s, page,
freelist, counters,
- new.freelist, new.counters,
+ new.slub_freelist, new.counters,
"acquire_slab"))
return NULL;
@@ -1789,7 +1789,7 @@ static void deactivate_slab(struct kmem_
struct page new;
struct page old;
- if (page->freelist) {
+ if (page->slub_freelist) {
stat(s, DEACTIVATE_REMOTE_FREES);
tail = DEACTIVATE_TO_TAIL;
}
@@ -1807,7 +1807,7 @@ static void deactivate_slab(struct kmem_
unsigned long counters;
do {
- prior = page->freelist;
+ prior = page->slub_freelist;
counters = page->counters;
set_freepointer(s, freelist, prior);
new.counters = counters;
@@ -1838,7 +1838,7 @@ static void deactivate_slab(struct kmem_
*/
redo:
- old.freelist = page->freelist;
+ old.slub_freelist = page->slub_freelist;
old.counters = page->counters;
VM_BUG_ON(!old.frozen);
@@ -1846,16 +1846,16 @@ redo:
new.counters = old.counters;
if (freelist) {
new.inuse--;
- set_freepointer(s, freelist, old.freelist);
- new.freelist = freelist;
+ set_freepointer(s, freelist, old.slub_freelist);
+ new.slub_freelist = freelist;
} else
- new.freelist = old.freelist;
+ new.slub_freelist = old.slub_freelist;
new.frozen = 0;
if (!new.inuse && n->nr_partial > s->min_partial)
m = M_FREE;
- else if (new.freelist) {
+ else if (new.slub_freelist) {
m = M_PARTIAL;
if (!lock) {
lock = 1;
@@ -1904,8 +1904,8 @@ redo:
l = m;
if (!__cmpxchg_double_slab(s, page,
- old.freelist, old.counters,
- new.freelist, new.counters,
+ old.slub_freelist, old.counters,
+ new.slub_freelist, new.counters,
"unfreezing slab"))
goto redo;
@@ -1950,18 +1950,18 @@ static void unfreeze_partials(struct kme
do {
- old.freelist = page->freelist;
+ old.slub_freelist = page->slub_freelist;
old.counters = page->counters;
VM_BUG_ON(!old.frozen);
new.counters = old.counters;
- new.freelist = old.freelist;
+ new.slub_freelist = old.slub_freelist;
new.frozen = 0;
} while (!__cmpxchg_double_slab(s, page,
- old.freelist, old.counters,
- new.freelist, new.counters,
+ old.slub_freelist, old.counters,
+ new.slub_freelist, new.counters,
"unfreezing slab"));
if (unlikely(!new.inuse && n->nr_partial > s->min_partial)) {
@@ -2184,8 +2184,8 @@ static inline void *new_slab_objects(str
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
*/
- freelist = page->freelist;
- page->freelist = NULL;
+ freelist = page->slub_freelist;
+ page->slub_freelist = NULL;
stat(s, ALLOC_SLAB);
c->page = page;
@@ -2205,7 +2205,7 @@ static inline bool pfmemalloc_match(stru
}
/*
- * Check the page->freelist of a page and either transfer the freelist to the
+ * Check the page->slub_freelist of a page and either transfer the freelist to the
* per cpu freelist or deactivate the page.
*
* The page is still frozen if the return value is not NULL.
@@ -2221,7 +2221,7 @@ static inline void *get_freelist(struct
void *freelist;
do {
- freelist = page->freelist;
+ freelist = page->slub_freelist;
counters = page->counters;
new.counters = counters;
@@ -2533,7 +2533,7 @@ static void __slab_free(struct kmem_cach
spin_unlock_irqrestore(&n->list_lock, flags);
n = NULL;
}
- prior = page->freelist;
+ prior = page->slub_freelist;
counters = page->counters;
set_freepointer(s, object, prior);
new.counters = counters;
@@ -2877,9 +2877,9 @@ static void early_kmem_cache_node_alloc(
"in order to be able to continue\n");
}
- n = page->freelist;
+ n = page->slub_freelist;
BUG_ON(!n);
- page->freelist = get_freepointer(kmem_cache_node, n);
+ page->slub_freelist = get_freepointer(kmem_cache_node, n);
page->inuse = 1;
page->frozen = 0;
kmem_cache_node->node[node] = n;
_
* [RFC][PATCH 5/7] mm: slub: rearrange 'struct page' fields
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
` (3 preceding siblings ...)
2013-12-13 23:59 ` [RFC][PATCH 4/7] mm: rearrange struct page Dave Hansen
@ 2013-12-13 23:59 ` Dave Hansen
2013-12-13 23:59 ` [RFC][PATCH 6/7] mm: slub: remove 'struct page' alignment restrictions Dave Hansen
` (2 subsequent siblings)
7 siblings, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2013-12-13 23:59 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Pravin B Shelar, Christoph Lameter, Kirill A. Shutemov,
Andi Kleen, Dave Hansen
SLUB has some unique alignment constraints it places
on 'struct page'. Break those out into a separate structure
which will not pollute 'struct page'.
This structure will be moved around inside 'struct page' at
runtime in the next patch, so it is necessary to break it out for
those uses as well.
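The net effect on the slub code is that these fields are reached
through a small helper instead of directly off 'struct page';
roughly (a sketch of the access pattern this patch converts to,
using the slub_data() helper added below):

        /* before this patch (after patch 4): */
        page->slub_freelist = NULL;
        page->inuse = page->objects;

        /* after this patch: */
        slub_data(page)->freelist = NULL;
        slub_data(page)->inuse = slub_data(page)->objects;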
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
linux.git-davehans/include/linux/mm_types.h | 66 ++++---
linux.git-davehans/mm/slab_common.c | 29 ++-
linux.git-davehans/mm/slub.c | 255 ++++++++++++++--------------
3 files changed, 195 insertions(+), 155 deletions(-)
diff -puN include/linux/mm_types.h~slub-rearrange include/linux/mm_types.h
--- linux.git/include/linux/mm_types.h~slub-rearrange 2013-12-13 15:51:48.338257258 -0800
+++ linux.git-davehans/include/linux/mm_types.h 2013-12-13 15:51:48.342257434 -0800
@@ -23,6 +23,43 @@
struct address_space;
+struct slub_data {
+ void *unused;
+ void *freelist;
+ union {
+ struct {
+ unsigned inuse:16;
+ unsigned objects:15;
+ unsigned frozen:1;
+ atomic_t dontuse_slub_count;
+ };
+ /*
+ * ->counters is used to make it easier to copy
+ * all of the above counters in one chunk.
+ * The actual counts are never accessed via this.
+ */
+#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
+ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+ unsigned long counters;
+#else
+ /*
+ * Keep _count separate from slub cmpxchg_double data.
+ * As the rest of the double word is protected by
+ * slab_lock but _count is not.
+ */
+ struct {
+ unsigned counters;
+ /*
+ * This isn't used directly, but declare it here
+ * for clarity since it must line up with _count
+ * from 'struct page'
+ */
+ atomic_t separate_count;
+ };
+#endif
+ };
+};
+
#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
@@ -69,14 +106,7 @@ struct page {
atomic_t _count;
}; /* end of the "normal" use */
- struct { /* SLUB */
- void *unused;
- void *slub_freelist;
- unsigned inuse:16;
- unsigned objects:15;
- unsigned frozen:1;
- atomic_t dontuse_slub_count;
- };
+ struct slub_data slub_data;
struct { /* SLAB */
void *s_mem;
void *slab_freelist;
@@ -89,26 +119,6 @@ struct page {
unsigned int units;
atomic_t dontuse_slob_count;
};
- /*
- * This is here to help the slub code deal with
- * its inuse/objects/frozen bitfields as a single
- * blob.
- */
- struct { /* slub helpers */
- void *slubhelp_unused;
- void *slubhelp_freelist;
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
- defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
- unsigned long counters;
-#else
- /*
- * Keep _count separate from slub cmpxchg_double data.
- * As the rest of the double word is protected by
- * slab_lock but _count is not.
- */
- unsigned counters;
-#endif
- };
};
/* Third double word block */
diff -puN mm/slab_common.c~slub-rearrange mm/slab_common.c
--- linux.git/mm/slab_common.c~slub-rearrange 2013-12-13 15:51:48.339257301 -0800
+++ linux.git-davehans/mm/slab_common.c 2013-12-13 15:51:48.342257434 -0800
@@ -658,20 +658,39 @@ static int __init slab_proc_init(void)
}
module_init(slab_proc_init);
#endif /* CONFIG_SLABINFO */
+
#define SLAB_PAGE_CHECK(field1, field2) \
BUILD_BUG_ON(offsetof(struct page, field1) != \
offsetof(struct page, field2))
/*
* To make the layout of 'struct page' look nicer, we've broken
- * up a few of the unions. Folks declaring their own use of the
- * first few fields need to make sure that their use does not
- * interfere with page->_count. This ensures that the individual
- * users' use actually lines up with the real ->_count.
+ * up a few of the unions. But, this has made it hard to see if
+ * any given use will interfere with page->_count.
+ *
+ * To work around this, each user declares their own _count field
+ * and we check them at build time to ensure that the independent
+ * definitions actually line up with the real ->_count.
*/
void slab_build_checks(void)
{
SLAB_PAGE_CHECK(_count, dontuse_slab_count);
- SLAB_PAGE_CHECK(_count, dontuse_slub_count);
+ SLAB_PAGE_CHECK(_count, slub_data.dontuse_slub_count);
SLAB_PAGE_CHECK(_count, dontuse_slob_count);
+
+ /*
+ * When doing a double-cmpxchg, the slub code sucks in
+ * _count. But, this is harmless since if _count is
+ * modified, the cmpxchg will fail. When not using a
+ * real cmpxchg, the slub code uses a lock. But, _count
+ * is not modified under that lock and updates can be
+ * lost if they race with one of the "faked" cmpxchg
+ * under that lock. This makes sure that the space we
+ * carve out for _count in that case actually lines up
+ * with the real _count.
+ */
+#if ! (defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
+ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE))
+ SLAB_PAGE_CHECK(_count, slub_data.separate_count);
+#endif
}
diff -puN mm/slub.c~slub-rearrange mm/slub.c
--- linux.git/mm/slub.c~slub-rearrange 2013-12-13 15:51:48.340257345 -0800
+++ linux.git-davehans/mm/slub.c 2013-12-13 15:51:48.344257522 -0800
@@ -52,7 +52,7 @@
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double. It only protects the second
* double word in the page struct. Meaning
- * A. page->slub_freelist -> List of object free in a page
+ * A. page->freelist -> List of object free in a page
* B. page->counters -> Counters of objects
* C. page->frozen -> frozen state
*
@@ -237,6 +237,12 @@ static inline struct kmem_cache_node *ge
return s->node[node];
}
+static inline struct slub_data *slub_data(struct page *page)
+{
+ void *ptr = &page->slub_data;
+ return ptr;
+}
+
/* Verify that a pointer has an address that is valid within a slab page */
static inline int check_valid_pointer(struct kmem_cache *s,
struct page *page, const void *object)
@@ -247,7 +253,7 @@ static inline int check_valid_pointer(st
return 1;
base = page_address(page);
- if (object < base || object >= base + page->objects * s->size ||
+ if (object < base || object >= base + slub_data(page)->objects * s->size ||
(object - base) % s->size) {
return 0;
}
@@ -365,7 +371,7 @@ static inline bool __cmpxchg_double_slab
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->slub_freelist, &page->counters,
+ if (cmpxchg_double(&slub_data(page)->freelist, &slub_data(page)->counters,
freelist_old, counters_old,
freelist_new, counters_new))
return 1;
@@ -373,10 +379,10 @@ static inline bool __cmpxchg_double_slab
#endif
{
slab_lock(page);
- if (page->slub_freelist == freelist_old &&
- page->counters == counters_old) {
- page->slub_freelist = freelist_new;
- page->counters = counters_new;
+ if (slub_data(page)->freelist == freelist_old &&
+ slub_data(page)->counters == counters_old) {
+ slub_data(page)->freelist = freelist_new;
+ slub_data(page)->counters = counters_new;
slab_unlock(page);
return 1;
}
@@ -401,7 +407,8 @@ static inline bool cmpxchg_double_slab(s
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->slub_freelist, &page->counters,
+ if (cmpxchg_double(&slub_data(page)->freelist,
+ &slub_data(page)->counters,
freelist_old, counters_old,
freelist_new, counters_new))
return 1;
@@ -412,10 +419,10 @@ static inline bool cmpxchg_double_slab(s
local_irq_save(flags);
slab_lock(page);
- if (page->slub_freelist == freelist_old &&
- page->counters == counters_old) {
- page->slub_freelist = freelist_new;
- page->counters = counters_new;
+ if (slub_data(page)->freelist == freelist_old &&
+ slub_data(page)->counters == counters_old) {
+ slub_data(page)->freelist = freelist_new;
+ slub_data(page)->counters = counters_new;
slab_unlock(page);
local_irq_restore(flags);
return 1;
@@ -446,7 +453,7 @@ static void get_map(struct kmem_cache *s
void *p;
void *addr = page_address(page);
- for (p = page->slub_freelist; p; p = get_freepointer(s, p))
+ for (p = slub_data(page)->freelist; p; p = get_freepointer(s, p))
set_bit(slab_index(p, s, addr), map);
}
@@ -557,7 +564,8 @@ static void print_page_info(struct page
{
printk(KERN_ERR
"INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx\n",
- page, page->objects, page->inuse, page->slub_freelist, page->flags);
+ page, slub_data(page)->objects, slub_data(page)->inuse,
+ slub_data(page)->freelist, page->flags);
}
@@ -843,14 +851,14 @@ static int check_slab(struct kmem_cache
}
maxobj = order_objects(compound_order(page), s->size, s->reserved);
- if (page->objects > maxobj) {
+ if (slub_data(page)->objects > maxobj) {
slab_err(s, page, "objects %u > max %u",
- s->name, page->objects, maxobj);
+ s->name, slub_data(page)->objects, maxobj);
return 0;
}
- if (page->inuse > page->objects) {
+ if (slub_data(page)->inuse > slub_data(page)->objects) {
slab_err(s, page, "inuse %u > max %u",
- s->name, page->inuse, page->objects);
+ s->name, slub_data(page)->inuse, slub_data(page)->objects);
return 0;
}
/* Slab_pad_check fixes things up after itself */
@@ -869,8 +877,8 @@ static int on_freelist(struct kmem_cache
void *object = NULL;
unsigned long max_objects;
- fp = page->slub_freelist;
- while (fp && nr <= page->objects) {
+ fp = slub_data(page)->freelist;
+ while (fp && nr <= slub_data(page)->objects) {
if (fp == search)
return 1;
if (!check_valid_pointer(s, page, fp)) {
@@ -880,8 +888,8 @@ static int on_freelist(struct kmem_cache
set_freepointer(s, object, NULL);
} else {
slab_err(s, page, "Freepointer corrupt");
- page->slub_freelist = NULL;
- page->inuse = page->objects;
+ slub_data(page)->freelist = NULL;
+ slub_data(page)->inuse = slub_data(page)->objects;
slab_fix(s, "Freelist cleared");
return 0;
}
@@ -896,16 +904,16 @@ static int on_freelist(struct kmem_cache
if (max_objects > MAX_OBJS_PER_PAGE)
max_objects = MAX_OBJS_PER_PAGE;
- if (page->objects != max_objects) {
+ if (slub_data(page)->objects != max_objects) {
slab_err(s, page, "Wrong number of objects. Found %d but "
- "should be %d", page->objects, max_objects);
- page->objects = max_objects;
+ "should be %d", slub_data(page)->objects, max_objects);
+ slub_data(page)->objects = max_objects;
slab_fix(s, "Number of objects adjusted.");
}
- if (page->inuse != page->objects - nr) {
+ if (slub_data(page)->inuse != slub_data(page)->objects - nr) {
slab_err(s, page, "Wrong object count. Counter is %d but "
- "counted were %d", page->inuse, page->objects - nr);
- page->inuse = page->objects - nr;
+ "counted were %d", slub_data(page)->inuse, slub_data(page)->objects - nr);
+ slub_data(page)->inuse = slub_data(page)->objects - nr;
slab_fix(s, "Object count adjusted.");
}
return search == NULL;
@@ -918,8 +926,8 @@ static void trace(struct kmem_cache *s,
printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
s->name,
alloc ? "alloc" : "free",
- object, page->inuse,
- page->slub_freelist);
+ object, slub_data(page)->inuse,
+ slub_data(page)->freelist);
if (!alloc)
print_section("Object ", (void *)object,
@@ -1085,8 +1093,8 @@ bad:
* as used avoids touching the remaining objects.
*/
slab_fix(s, "Marking all objects used");
- page->inuse = page->objects;
- page->slub_freelist = NULL;
+ slub_data(page)->inuse = slub_data(page)->objects;
+ slub_data(page)->freelist = NULL;
}
return 0;
}
@@ -1366,7 +1374,7 @@ static struct page *allocate_slab(struct
if (!page)
return NULL;
- page->objects = oo_objects(oo);
+ slub_data(page)->objects = oo_objects(oo);
mod_zone_page_state(page_zone(page),
(s->flags & SLAB_RECLAIM_ACCOUNT) ?
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
@@ -1399,7 +1407,7 @@ static struct page *new_slab(struct kmem
goto out;
order = compound_order(page);
- inc_slabs_node(s, page_to_nid(page), page->objects);
+ inc_slabs_node(s, page_to_nid(page), slub_data(page)->objects);
memcg_bind_pages(s, order);
page->slab_cache = s;
__SetPageSlab(page);
@@ -1412,7 +1420,7 @@ static struct page *new_slab(struct kmem
memset(start, POISON_INUSE, PAGE_SIZE << order);
last = start;
- for_each_object(p, s, start, page->objects) {
+ for_each_object(p, s, start, slub_data(page)->objects) {
setup_object(s, page, last);
set_freepointer(s, last, p);
last = p;
@@ -1420,9 +1428,9 @@ static struct page *new_slab(struct kmem
setup_object(s, page, last);
set_freepointer(s, last, NULL);
- page->slub_freelist = start;
- page->inuse = page->objects;
- page->frozen = 1;
+ slub_data(page)->freelist = start;
+ slub_data(page)->inuse = slub_data(page)->objects;
+ slub_data(page)->frozen = 1;
out:
return page;
}
@@ -1437,7 +1445,7 @@ static void __free_slab(struct kmem_cach
slab_pad_check(s, page);
for_each_object(p, s, page_address(page),
- page->objects)
+ slub_data(page)->objects)
check_object(s, page, p, SLUB_RED_INACTIVE);
}
@@ -1498,7 +1506,7 @@ static void free_slab(struct kmem_cache
static void discard_slab(struct kmem_cache *s, struct page *page)
{
- dec_slabs_node(s, page_to_nid(page), page->objects);
+ dec_slabs_node(s, page_to_nid(page), slub_data(page)->objects);
free_slab(s, page);
}
@@ -1548,23 +1556,24 @@ static inline void *acquire_slab(struct
* The old freelist is the list of objects for the
* per cpu allocation list.
*/
- freelist = page->slub_freelist;
- counters = page->counters;
- new.counters = counters;
- *objects = new.objects - new.inuse;
+ freelist = slub_data(page)->freelist;
+ counters = slub_data(page)->counters;
+ slub_data(&new)->counters = counters;
+ *objects = slub_data(&new)->objects - slub_data(&new)->inuse;
if (mode) {
- new.inuse = page->objects;
- new.slub_freelist = NULL;
+ slub_data(&new)->inuse = slub_data(page)->objects;
+ slub_data(&new)->freelist = NULL;
} else {
- new.slub_freelist = freelist;
+ slub_data(&new)->freelist = freelist;
}
- VM_BUG_ON(new.frozen);
- new.frozen = 1;
+ VM_BUG_ON(slub_data(&new)->frozen);
+ slub_data(&new)->frozen = 1;
if (!__cmpxchg_double_slab(s, page,
freelist, counters,
- new.slub_freelist, new.counters,
+ slub_data(&new)->freelist,
+ slub_data(&new)->counters,
"acquire_slab"))
return NULL;
@@ -1789,7 +1798,7 @@ static void deactivate_slab(struct kmem_
struct page new;
struct page old;
- if (page->slub_freelist) {
+ if (slub_data(page)->freelist) {
stat(s, DEACTIVATE_REMOTE_FREES);
tail = DEACTIVATE_TO_TAIL;
}
@@ -1807,16 +1816,16 @@ static void deactivate_slab(struct kmem_
unsigned long counters;
do {
- prior = page->slub_freelist;
- counters = page->counters;
+ prior = slub_data(page)->freelist;
+ counters = slub_data(page)->counters;
set_freepointer(s, freelist, prior);
- new.counters = counters;
- new.inuse--;
- VM_BUG_ON(!new.frozen);
+ slub_data(&new)->counters = counters;
+ slub_data(&new)->inuse--;
+ VM_BUG_ON(!slub_data(&new)->frozen);
} while (!__cmpxchg_double_slab(s, page,
prior, counters,
- freelist, new.counters,
+ freelist, slub_data(&new)->counters,
"drain percpu freelist"));
freelist = nextfree;
@@ -1838,24 +1847,24 @@ static void deactivate_slab(struct kmem_
*/
redo:
- old.slub_freelist = page->slub_freelist;
- old.counters = page->counters;
- VM_BUG_ON(!old.frozen);
+ slub_data(&old)->freelist = slub_data(page)->freelist;
+ slub_data(&old)->counters = slub_data(page)->counters;
+ VM_BUG_ON(!slub_data(&old)->frozen);
/* Determine target state of the slab */
- new.counters = old.counters;
+ slub_data(&new)->counters = slub_data(&old)->counters;
if (freelist) {
- new.inuse--;
- set_freepointer(s, freelist, old.slub_freelist);
- new.slub_freelist = freelist;
+ slub_data(&new)->inuse--;
+ set_freepointer(s, freelist, slub_data(&old)->freelist);
+ slub_data(&new)->freelist = freelist;
} else
- new.slub_freelist = old.slub_freelist;
+ slub_data(&new)->freelist = slub_data(&old)->freelist;
- new.frozen = 0;
+ slub_data(&new)->frozen = 0;
- if (!new.inuse && n->nr_partial > s->min_partial)
+ if (!slub_data(&new)->inuse && n->nr_partial > s->min_partial)
m = M_FREE;
- else if (new.slub_freelist) {
+ else if (slub_data(&new)->freelist) {
m = M_PARTIAL;
if (!lock) {
lock = 1;
@@ -1904,8 +1913,8 @@ redo:
l = m;
if (!__cmpxchg_double_slab(s, page,
- old.slub_freelist, old.counters,
- new.slub_freelist, new.counters,
+ slub_data(&old)->freelist, slub_data(&old)->counters,
+ slub_data(&new)->freelist, slub_data(&new)->counters,
"unfreezing slab"))
goto redo;
@@ -1950,21 +1959,23 @@ static void unfreeze_partials(struct kme
do {
- old.slub_freelist = page->slub_freelist;
- old.counters = page->counters;
- VM_BUG_ON(!old.frozen);
+ slub_data(&old)->freelist = slub_data(page)->freelist;
+ slub_data(&old)->counters = slub_data(page)->counters;
+ VM_BUG_ON(!slub_data(&old)->frozen);
- new.counters = old.counters;
- new.slub_freelist = old.slub_freelist;
+ slub_data(&new)->counters = slub_data(&old)->counters;
+ slub_data(&new)->freelist = slub_data(&old)->freelist;
- new.frozen = 0;
+ slub_data(&new)->frozen = 0;
} while (!__cmpxchg_double_slab(s, page,
- old.slub_freelist, old.counters,
- new.slub_freelist, new.counters,
+ slub_data(&old)->freelist,
+ slub_data(&old)->counters,
+ slub_data(&new)->freelist,
+ slub_data(&new)->counters,
"unfreezing slab"));
- if (unlikely(!new.inuse && n->nr_partial > s->min_partial)) {
+ if (unlikely(!slub_data(&new)->inuse && n->nr_partial > s->min_partial)) {
page->next = discard_page;
discard_page = page;
} else {
@@ -2028,7 +2039,7 @@ static void put_cpu_partial(struct kmem_
}
pages++;
- pobjects += page->objects - page->inuse;
+ pobjects += slub_data(page)->objects - slub_data(page)->inuse;
page->pages = pages;
page->pobjects = pobjects;
@@ -2101,7 +2112,7 @@ static inline int node_match(struct page
static int count_free(struct page *page)
{
- return page->objects - page->inuse;
+ return slub_data(page)->objects - slub_data(page)->inuse;
}
static unsigned long count_partial(struct kmem_cache_node *n,
@@ -2184,8 +2195,8 @@ static inline void *new_slab_objects(str
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
*/
- freelist = page->slub_freelist;
- page->slub_freelist = NULL;
+ freelist = slub_data(page)->freelist;
+ slub_data(page)->freelist = NULL;
stat(s, ALLOC_SLAB);
c->page = page;
@@ -2205,7 +2216,7 @@ static inline bool pfmemalloc_match(stru
}
/*
- * Check the page->slub_freelist of a page and either transfer the freelist to the
+ * Check the ->freelist of a page and either transfer the freelist to the
* per cpu freelist or deactivate the page.
*
* The page is still frozen if the return value is not NULL.
@@ -2221,18 +2232,18 @@ static inline void *get_freelist(struct
void *freelist;
do {
- freelist = page->slub_freelist;
- counters = page->counters;
+ freelist = slub_data(page)->freelist;
+ counters = slub_data(page)->counters;
- new.counters = counters;
- VM_BUG_ON(!new.frozen);
+ slub_data(&new)->counters = counters;
+ VM_BUG_ON(!slub_data(&new)->frozen);
- new.inuse = page->objects;
- new.frozen = freelist != NULL;
+ slub_data(&new)->inuse = slub_data(page)->objects;
+ slub_data(&new)->frozen = freelist != NULL;
} while (!__cmpxchg_double_slab(s, page,
freelist, counters,
- NULL, new.counters,
+ NULL, slub_data(&new)->counters,
"get_freelist"));
return freelist;
@@ -2319,7 +2330,7 @@ load_freelist:
* page is pointing to the page from which the objects are obtained.
* That page must be frozen for per cpu allocations to work.
*/
- VM_BUG_ON(!c->page->frozen);
+ VM_BUG_ON(!slub_data(c->page)->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
local_irq_restore(flags);
@@ -2533,13 +2544,13 @@ static void __slab_free(struct kmem_cach
spin_unlock_irqrestore(&n->list_lock, flags);
n = NULL;
}
- prior = page->slub_freelist;
- counters = page->counters;
+ prior = slub_data(page)->freelist;
+ counters = slub_data(page)->counters;
set_freepointer(s, object, prior);
- new.counters = counters;
- was_frozen = new.frozen;
- new.inuse--;
- if ((!new.inuse || !prior) && !was_frozen) {
+ slub_data(&new)->counters = counters;
+ was_frozen = slub_data(&new)->frozen;
+ slub_data(&new)->inuse--;
+ if ((!slub_data(&new)->inuse || !prior) && !was_frozen) {
if (kmem_cache_has_cpu_partial(s) && !prior)
@@ -2549,7 +2560,7 @@ static void __slab_free(struct kmem_cach
* We can defer the list move and instead
* freeze it.
*/
- new.frozen = 1;
+ slub_data(&new)->frozen = 1;
else { /* Needs to be taken off a list */
@@ -2569,7 +2580,7 @@ static void __slab_free(struct kmem_cach
} while (!cmpxchg_double_slab(s, page,
prior, counters,
- object, new.counters,
+ object, slub_data(&new)->counters,
"__slab_free"));
if (likely(!n)) {
@@ -2578,7 +2589,7 @@ static void __slab_free(struct kmem_cach
* If we just froze the page then put it onto the
* per cpu partial list.
*/
- if (new.frozen && !was_frozen) {
+ if (slub_data(&new)->frozen && !was_frozen) {
put_cpu_partial(s, page, 1);
stat(s, CPU_PARTIAL_FREE);
}
@@ -2591,7 +2602,7 @@ static void __slab_free(struct kmem_cach
return;
}
- if (unlikely(!new.inuse && n->nr_partial > s->min_partial))
+ if (unlikely(!slub_data(&new)->inuse && n->nr_partial > s->min_partial))
goto slab_empty;
/*
@@ -2877,18 +2888,18 @@ static void early_kmem_cache_node_alloc(
"in order to be able to continue\n");
}
- n = page->slub_freelist;
+ n = slub_data(page)->freelist;
BUG_ON(!n);
- page->slub_freelist = get_freepointer(kmem_cache_node, n);
- page->inuse = 1;
- page->frozen = 0;
+ slub_data(page)->freelist = get_freepointer(kmem_cache_node, n);
+ slub_data(page)->inuse = 1;
+ slub_data(page)->frozen = 0;
kmem_cache_node->node[node] = n;
#ifdef CONFIG_SLUB_DEBUG
init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
init_tracking(kmem_cache_node, n);
#endif
init_kmem_cache_node(n);
- inc_slabs_node(kmem_cache_node, node, page->objects);
+ inc_slabs_node(kmem_cache_node, node, slub_data(page)->objects);
add_partial(n, page, DEACTIVATE_TO_HEAD);
}
@@ -3144,7 +3155,7 @@ static void list_slab_objects(struct kme
#ifdef CONFIG_SLUB_DEBUG
void *addr = page_address(page);
void *p;
- unsigned long *map = kzalloc(BITS_TO_LONGS(page->objects) *
+ unsigned long *map = kzalloc(BITS_TO_LONGS(slub_data(page)->objects) *
sizeof(long), GFP_ATOMIC);
if (!map)
return;
@@ -3152,7 +3163,7 @@ static void list_slab_objects(struct kme
slab_lock(page);
get_map(s, page, map);
- for_each_object(p, s, addr, page->objects) {
+ for_each_object(p, s, addr, slub_data(page)->objects) {
if (!test_bit(slab_index(p, s, addr), map)) {
printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n",
@@ -3175,7 +3186,7 @@ static void free_partial(struct kmem_cac
struct page *page, *h;
list_for_each_entry_safe(page, h, &n->partial, lru) {
- if (!page->inuse) {
+ if (!slub_data(page)->inuse) {
remove_partial(n, page);
discard_slab(s, page);
} else {
@@ -3412,11 +3423,11 @@ int kmem_cache_shrink(struct kmem_cache
* Build lists indexed by the items in use in each slab.
*
* Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
+ * list_lock. ->inuse here is the upper limit.
*/
list_for_each_entry_safe(page, t, &n->partial, lru) {
- list_move(&page->lru, slabs_by_inuse + page->inuse);
- if (!page->inuse)
+ list_move(&page->lru, slabs_by_inuse + slub_data(page)->inuse);
+ if (!slub_data(page)->inuse)
n->nr_partial--;
}
@@ -3855,12 +3866,12 @@ void *__kmalloc_node_track_caller(size_t
#ifdef CONFIG_SYSFS
static int count_inuse(struct page *page)
{
- return page->inuse;
+ return slub_data(page)->inuse;
}
static int count_total(struct page *page)
{
- return page->objects;
+ return slub_data(page)->objects;
}
#endif
@@ -3876,16 +3887,16 @@ static int validate_slab(struct kmem_cac
return 0;
/* Now we know that a valid freelist exists */
- bitmap_zero(map, page->objects);
+ bitmap_zero(map, slub_data(page)->objects);
get_map(s, page, map);
- for_each_object(p, s, addr, page->objects) {
+ for_each_object(p, s, addr, slub_data(page)->objects) {
if (test_bit(slab_index(p, s, addr), map))
if (!check_object(s, page, p, SLUB_RED_INACTIVE))
return 0;
}
- for_each_object(p, s, addr, page->objects)
+ for_each_object(p, s, addr, slub_data(page)->objects)
if (!test_bit(slab_index(p, s, addr), map))
if (!check_object(s, page, p, SLUB_RED_ACTIVE))
return 0;
@@ -4086,10 +4097,10 @@ static void process_slab(struct loc_trac
void *addr = page_address(page);
void *p;
- bitmap_zero(map, page->objects);
+ bitmap_zero(map, slub_data(page)->objects);
get_map(s, page, map);
- for_each_object(p, s, addr, page->objects)
+ for_each_object(p, s, addr, slub_data(page)->objects)
if (!test_bit(slab_index(p, s, addr), map))
add_location(t, s, get_track(s, p, alloc));
}
@@ -4288,9 +4299,9 @@ static ssize_t show_slab_objects(struct
node = page_to_nid(page);
if (flags & SO_TOTAL)
- x = page->objects;
+ x = slub_data(page)->objects;
else if (flags & SO_OBJECTS)
- x = page->inuse;
+ x = slub_data(page)->inuse;
else
x = 1;
_
* [RFC][PATCH 6/7] mm: slub: remove 'struct page' alignment restrictions
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
` (4 preceding siblings ...)
2013-12-13 23:59 ` [RFC][PATCH 5/7] mm: slub: rearrange 'struct page' fields Dave Hansen
@ 2013-12-13 23:59 ` Dave Hansen
2013-12-14 3:13 ` Andi Kleen
2013-12-13 23:59 ` [RFC][PATCH 7/7] mm: slub: cleanups after code churn Dave Hansen
2013-12-17 0:01 ` [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Andrew Morton
7 siblings, 1 reply; 20+ messages in thread
From: Dave Hansen @ 2013-12-13 23:59 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Pravin B Shelar, Christoph Lameter, Kirill A. Shutemov,
Andi Kleen, Dave Hansen
SLUB depends on a 16-byte cmpxchg for an optimization. In order
to get guaranteed 16-byte alignment (required by the hardware on
x86), 'struct page' is padded out from 56 to 64 bytes.
Those 8 bytes matter. We've gone to great lengths to keep
'struct page' small in the past. It's a shame that we bloat it
now just for alignment reasons when we have *extra* space. Also,
increasing the size of 'struct page' by 14% makes it 14% more
likely that we will miss a cacheline when fetching it.
This patch takes an unused 8-byte area of slub's 'struct page'
and reuses it to internally align to the 16 bytes that we need.
Note that this also gets rid of the ugly slub #ifdef that we use
to segregate ->counters and ->_count for cases where we need to
manipulate ->counters without the benefit of a hardware cmpxchg.
This patch takes me from 16909584K of reserved memory at boot
down to 14814472K, so almost *exactly* 2GB of savings! It also
helps performance, presumably because of the 14% smaller
cacheline footprint. A 30GB dd to a ramfs file:
dd if=/dev/zero of=bigfile bs=$((1<<30)) count=30
is sped up by about 4.4% in my testing.
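To illustrate the trick outside the kernel, here is a minimal,
compilable userspace sketch. The names (page_like, slub_like_data,
ALIGN_PTR) are invented for the example; only the idea of reserving a
word of slack and aligning the accessor's return pointer, instead of
padding the whole container, mirrors the slub_data() helper in the
patch below.

#include <stdint.h>
#include <stdio.h>

#define DOUBLEWORD      (2 * sizeof(unsigned long))
/* same rounding that the kernel's PTR_ALIGN() performs */
#define ALIGN_PTR(p, a) \
        ((void *)(((uintptr_t)(p) + ((a) - 1)) & ~((uintptr_t)(a) - 1)))

struct slub_like_data {         /* the 16 bytes a cmpxchg16b would cover */
        void *freelist;
        unsigned long counters;
};

struct page_like {              /* container deliberately not padded or aligned */
        unsigned long flags;
        /* 16 bytes of payload plus 8 bytes of slack = 24 bytes */
        char slub_space[sizeof(struct slub_like_data) + sizeof(unsigned long)];
};

static struct slub_like_data *slub_like_data(struct page_like *page)
{
        /* align inside the reserved area, not the whole structure */
        return ALIGN_PTR(page->slub_space, DOUBLEWORD);
}

int main(void)
{
        struct page_like p;

        printf("sizeof(struct page_like): %zu\n", sizeof(p));
        printf("slub area double-word aligned: %d\n",
               (uintptr_t)slub_like_data(&p) % DOUBLEWORD == 0);
        return 0;
}

The container stays unpadded here, yet the accessor always hands back
a 16-byte-aligned area, which is all the cmpxchg16b fast path needs.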
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
linux.git-davehans/include/linux/mm_types.h | 56 +++++++---------------------
linux.git-davehans/mm/slab_common.c | 10 +++--
linux.git-davehans/mm/slub.c | 5 ++
3 files changed, 26 insertions(+), 45 deletions(-)
diff -puN include/linux/mm_types.h~remove-struct-page-alignment-restrictions include/linux/mm_types.h
--- linux.git/include/linux/mm_types.h~remove-struct-page-alignment-restrictions 2013-12-13 15:51:48.591268396 -0800
+++ linux.git-davehans/include/linux/mm_types.h 2013-12-13 15:51:48.595268572 -0800
@@ -24,39 +24,30 @@
struct address_space;
struct slub_data {
- void *unused;
void *freelist;
union {
struct {
unsigned inuse:16;
unsigned objects:15;
unsigned frozen:1;
- atomic_t dontuse_slub_count;
};
- /*
- * ->counters is used to make it easier to copy
- * all of the above counters in one chunk.
- * The actual counts are never accessed via this.
- */
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
- defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
- unsigned long counters;
-#else
- /*
- * Keep _count separate from slub cmpxchg_double data.
- * As the rest of the double word is protected by
- * slab_lock but _count is not.
- */
struct {
- unsigned counters;
- /*
- * This isn't used directly, but declare it here
- * for clarity since it must line up with _count
- * from 'struct page'
- */
+ /* Note: counters is just a helper for the above bitfield */
+ unsigned long counters;
+ atomic_t padding;
atomic_t separate_count;
};
-#endif
+ /*
+ * the double-cmpxchg case:
+ * counters and _count overlap:
+ */
+ union {
+ unsigned long counters2;
+ struct {
+ atomic_t padding2;
+ atomic_t _count;
+ };
+ };
};
};
@@ -70,15 +61,8 @@ struct slub_data {
* moment. Note that we have no way to track which tasks are using
* a page, though if it is a pagecache page, rmap structures can tell us
* who is mapping it.
- *
- * The objects in struct page are organized in double word blocks in
- * order to allows us to use atomic double word operations on portions
- * of struct page. That is currently only used by slub but the arrangement
- * allows the use of atomic double word operations on the flags/mapping
- * and lru list pointers also.
*/
struct page {
- /* First double word block */
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
union {
@@ -121,7 +105,6 @@ struct page {
};
};
- /* Third double word block */
union {
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
@@ -147,7 +130,6 @@ struct page {
#endif
};
- /* Remainder is not double word aligned */
union {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
@@ -196,15 +178,7 @@ struct page {
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif
-}
-/*
- * The struct page can be forced to be double word aligned so that atomic ops
- * on double words work. The SLUB allocator can make use of such a feature.
- */
-#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
- __aligned(2 * sizeof(unsigned long))
-#endif
-;
+};
struct page_frag {
struct page *page;
diff -puN mm/slab_common.c~remove-struct-page-alignment-restrictions mm/slab_common.c
--- linux.git/mm/slab_common.c~remove-struct-page-alignment-restrictions 2013-12-13 15:51:48.592268440 -0800
+++ linux.git-davehans/mm/slab_common.c 2013-12-13 15:51:48.596268616 -0800
@@ -674,7 +674,6 @@ module_init(slab_proc_init);
void slab_build_checks(void)
{
SLAB_PAGE_CHECK(_count, dontuse_slab_count);
- SLAB_PAGE_CHECK(_count, slub_data.dontuse_slub_count);
SLAB_PAGE_CHECK(_count, dontuse_slob_count);
/*
@@ -688,9 +687,12 @@ void slab_build_checks(void)
* carve out for _count in that case actually lines up
* with the real _count.
*/
-#if ! (defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
- defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE))
SLAB_PAGE_CHECK(_count, slub_data.separate_count);
-#endif
+
+ /*
+ * We need at least three words' worth of space to
+ * ensure that we can align a double-word internally.
+ */
+ BUILD_BUG_ON(sizeof(struct slub_data) != sizeof(unsigned long) * 3);
}
diff -puN mm/slub.c~remove-struct-page-alignment-restrictions mm/slub.c
--- linux.git/mm/slub.c~remove-struct-page-alignment-restrictions 2013-12-13 15:51:48.593268484 -0800
+++ linux.git-davehans/mm/slub.c 2013-12-13 15:51:48.596268616 -0800
@@ -239,7 +239,12 @@ static inline struct kmem_cache_node *ge
static inline struct slub_data *slub_data(struct page *page)
{
+ int doubleword_bytes = BITS_PER_LONG * 2 / 8;
void *ptr = &page->slub_data;
+#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
+ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+ ptr = PTR_ALIGN(ptr, doubleword_bytes);
+#endif
return ptr;
}
_
* [RFC][PATCH 7/7] mm: slub: cleanups after code churn
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
` (5 preceding siblings ...)
2013-12-13 23:59 ` [RFC][PATCH 6/7] mm: slub: remove 'struct page' alignment restrictions Dave Hansen
@ 2013-12-13 23:59 ` Dave Hansen
2013-12-17 0:01 ` [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Andrew Morton
7 siblings, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2013-12-13 23:59 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Pravin B Shelar, Christoph Lameter, Kirill A. Shutemov,
Andi Kleen, Dave Hansen
I added a bunch of longer-than-80-column lines and various other
messes in the previous patches. But doing line-for-line code
replacements makes those patches much easier to audit, so I stuck
the cleanups in here instead.
The slub code also declares a bunch of 'struct page's on the
stack. Now that 'struct slub_data' is separate, we can declare
those smaller structures instead. This ends up saving us a
couple hundred bytes in object size.
In the end, the series takes slub.o's code size from 26672 to 27168,
so it grows by about 500 bytes. But, on an 8GB system, we save about
256k in 'struct page' overhead. That's a pretty good tradeoff.
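The stack savings are easy to see with a toy layout. This is
illustrative only; the structures below merely approximate the ones in
this series (field names and sizes are assumptions), but the sizeof()
comparison shows why switching the on-stack temporaries from
'struct page' to 'struct slub_data' shrinks things.

#include <stdio.h>

struct slub_data_like {                 /* all the temporaries ever touch */
        void *freelist;
        unsigned long counters;
        unsigned long count_word;       /* stands in for the overlapping _count */
};

struct page_like {                      /* rough stand-in for a full struct page */
        unsigned long flags;
        void *mapping;
        struct slub_data_like slub;
        struct { void *next, *prev; } lru;
        unsigned long private_data;
};

int main(void)
{
        printf("old stack temporary: %zu bytes\n",
               sizeof(struct page_like));       /* 64 on x86_64 with this layout */
        printf("new stack temporary: %zu bytes\n",
               sizeof(struct slub_data_like));  /* 24 on x86_64 */
        return 0;
}

Each of the deactivate/unfreeze/free paths declares one or two such
temporaries, which is presumably where the couple hundred bytes
mentioned above come from.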
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
linux.git-davehans/mm/slub.c | 147 ++++++++++++++++++++++---------------------
1 file changed, 78 insertions(+), 69 deletions(-)
diff -puN mm/slub.c~slub-cleanups mm/slub.c
--- linux.git/mm/slub.c~slub-cleanups 2013-12-13 15:51:48.843279491 -0800
+++ linux.git-davehans/mm/slub.c 2013-12-13 15:51:48.846279622 -0800
@@ -258,7 +258,8 @@ static inline int check_valid_pointer(st
return 1;
base = page_address(page);
- if (object < base || object >= base + slub_data(page)->objects * s->size ||
+ if (object < base ||
+ object >= base + slub_data(page)->objects * s->size ||
(object - base) % s->size) {
return 0;
}
@@ -376,10 +377,11 @@ static inline bool __cmpxchg_double_slab
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&slub_data(page)->freelist, &slub_data(page)->counters,
- freelist_old, counters_old,
- freelist_new, counters_new))
- return 1;
+ if (cmpxchg_double(&slub_data(page)->freelist,
+ &slub_data(page)->counters,
+ freelist_old, counters_old,
+ freelist_new, counters_new))
+ return 1;
} else
#endif
{
@@ -414,9 +416,9 @@ static inline bool cmpxchg_double_slab(s
if (s->flags & __CMPXCHG_DOUBLE) {
if (cmpxchg_double(&slub_data(page)->freelist,
&slub_data(page)->counters,
- freelist_old, counters_old,
- freelist_new, counters_new))
- return 1;
+ freelist_old, counters_old,
+ freelist_new, counters_new))
+ return 1;
} else
#endif
{
@@ -863,7 +865,8 @@ static int check_slab(struct kmem_cache
}
if (slub_data(page)->inuse > slub_data(page)->objects) {
slab_err(s, page, "inuse %u > max %u",
- s->name, slub_data(page)->inuse, slub_data(page)->objects);
+ s->name, slub_data(page)->inuse,
+ slub_data(page)->objects);
return 0;
}
/* Slab_pad_check fixes things up after itself */
@@ -894,7 +897,8 @@ static int on_freelist(struct kmem_cache
} else {
slab_err(s, page, "Freepointer corrupt");
slub_data(page)->freelist = NULL;
- slub_data(page)->inuse = slub_data(page)->objects;
+ slub_data(page)->inuse =
+ slub_data(page)->objects;
slab_fix(s, "Freelist cleared");
return 0;
}
@@ -917,7 +921,8 @@ static int on_freelist(struct kmem_cache
}
if (slub_data(page)->inuse != slub_data(page)->objects - nr) {
slab_err(s, page, "Wrong object count. Counter is %d but "
- "counted were %d", slub_data(page)->inuse, slub_data(page)->objects - nr);
+ "counted were %d", slub_data(page)->inuse,
+ slub_data(page)->objects - nr);
slub_data(page)->inuse = slub_data(page)->objects - nr;
slab_fix(s, "Object count adjusted.");
}
@@ -1554,7 +1559,7 @@ static inline void *acquire_slab(struct
{
void *freelist;
unsigned long counters;
- struct page new;
+ struct slub_data new;
/*
* Zap the freelist and set the frozen bit.
@@ -1563,22 +1568,22 @@ static inline void *acquire_slab(struct
*/
freelist = slub_data(page)->freelist;
counters = slub_data(page)->counters;
- slub_data(&new)->counters = counters;
- *objects = slub_data(&new)->objects - slub_data(&new)->inuse;
+ new.counters = counters;
+ *objects = new.objects - new.inuse;
if (mode) {
- slub_data(&new)->inuse = slub_data(page)->objects;
- slub_data(&new)->freelist = NULL;
+ new.inuse = slub_data(page)->objects;
+ new.freelist = NULL;
} else {
- slub_data(&new)->freelist = freelist;
+ new.freelist = freelist;
}
- VM_BUG_ON(slub_data(&new)->frozen);
- slub_data(&new)->frozen = 1;
+ VM_BUG_ON(new.frozen);
+ new.frozen = 1;
if (!__cmpxchg_double_slab(s, page,
freelist, counters,
- slub_data(&new)->freelist,
- slub_data(&new)->counters,
+ new.freelist,
+ new.counters,
"acquire_slab"))
return NULL;
@@ -1800,8 +1805,8 @@ static void deactivate_slab(struct kmem_
enum slab_modes l = M_NONE, m = M_NONE;
void *nextfree;
int tail = DEACTIVATE_TO_HEAD;
- struct page new;
- struct page old;
+ struct slub_data new;
+ struct slub_data old;
if (slub_data(page)->freelist) {
stat(s, DEACTIVATE_REMOTE_FREES);
@@ -1824,13 +1829,13 @@ static void deactivate_slab(struct kmem_
prior = slub_data(page)->freelist;
counters = slub_data(page)->counters;
set_freepointer(s, freelist, prior);
- slub_data(&new)->counters = counters;
- slub_data(&new)->inuse--;
- VM_BUG_ON(!slub_data(&new)->frozen);
+ new.counters = counters;
+ new.inuse--;
+ VM_BUG_ON(!new.frozen);
} while (!__cmpxchg_double_slab(s, page,
prior, counters,
- freelist, slub_data(&new)->counters,
+ freelist, new.counters,
"drain percpu freelist"));
freelist = nextfree;
@@ -1852,24 +1857,24 @@ static void deactivate_slab(struct kmem_
*/
redo:
- slub_data(&old)->freelist = slub_data(page)->freelist;
- slub_data(&old)->counters = slub_data(page)->counters;
- VM_BUG_ON(!slub_data(&old)->frozen);
+ old.freelist = slub_data(page)->freelist;
+ old.counters = slub_data(page)->counters;
+ VM_BUG_ON(!old.frozen);
/* Determine target state of the slab */
- slub_data(&new)->counters = slub_data(&old)->counters;
+ new.counters = old.counters;
if (freelist) {
- slub_data(&new)->inuse--;
- set_freepointer(s, freelist, slub_data(&old)->freelist);
- slub_data(&new)->freelist = freelist;
+ new.inuse--;
+ set_freepointer(s, freelist, old.freelist);
+ new.freelist = freelist;
} else
- slub_data(&new)->freelist = slub_data(&old)->freelist;
+ new.freelist = old.freelist;
- slub_data(&new)->frozen = 0;
+ new.frozen = 0;
- if (!slub_data(&new)->inuse && n->nr_partial > s->min_partial)
+ if (!new.inuse && n->nr_partial > s->min_partial)
m = M_FREE;
- else if (slub_data(&new)->freelist) {
+ else if (new.freelist) {
m = M_PARTIAL;
if (!lock) {
lock = 1;
@@ -1918,8 +1923,10 @@ redo:
l = m;
if (!__cmpxchg_double_slab(s, page,
- slub_data(&old)->freelist, slub_data(&old)->counters,
- slub_data(&new)->freelist, slub_data(&new)->counters,
+ old.freelist,
+ old.counters,
+ new.freelist,
+ new.counters,
"unfreezing slab"))
goto redo;
@@ -1948,8 +1955,8 @@ static void unfreeze_partials(struct kme
struct page *page, *discard_page = NULL;
while ((page = c->partial)) {
- struct page new;
- struct page old;
+ struct slub_data new;
+ struct slub_data old;
c->partial = page->next;
@@ -1964,23 +1971,24 @@ static void unfreeze_partials(struct kme
do {
- slub_data(&old)->freelist = slub_data(page)->freelist;
- slub_data(&old)->counters = slub_data(page)->counters;
- VM_BUG_ON(!slub_data(&old)->frozen);
+ old.freelist = slub_data(page)->freelist;
+ old.counters = slub_data(page)->counters;
+ VM_BUG_ON(!old.frozen);
- slub_data(&new)->counters = slub_data(&old)->counters;
- slub_data(&new)->freelist = slub_data(&old)->freelist;
+ new.counters = old.counters;
+ new.freelist = old.freelist;
- slub_data(&new)->frozen = 0;
+ new.frozen = 0;
} while (!__cmpxchg_double_slab(s, page,
- slub_data(&old)->freelist,
- slub_data(&old)->counters,
- slub_data(&new)->freelist,
- slub_data(&new)->counters,
+ old.freelist,
+ old.counters,
+ new.freelist,
+ new.counters,
"unfreezing slab"));
- if (unlikely(!slub_data(&new)->inuse && n->nr_partial > s->min_partial)) {
+ if (unlikely(!new.inuse &&
+ n->nr_partial > s->min_partial)) {
page->next = discard_page;
discard_page = page;
} else {
@@ -2232,7 +2240,7 @@ static inline bool pfmemalloc_match(stru
*/
static inline void *get_freelist(struct kmem_cache *s, struct page *page)
{
- struct page new;
+ struct slub_data new;
unsigned long counters;
void *freelist;
@@ -2240,15 +2248,15 @@ static inline void *get_freelist(struct
freelist = slub_data(page)->freelist;
counters = slub_data(page)->counters;
- slub_data(&new)->counters = counters;
- VM_BUG_ON(!slub_data(&new)->frozen);
+ new.counters = counters;
+ VM_BUG_ON(!new.frozen);
- slub_data(&new)->inuse = slub_data(page)->objects;
- slub_data(&new)->frozen = freelist != NULL;
+ new.inuse = slub_data(page)->objects;
+ new.frozen = freelist != NULL;
} while (!__cmpxchg_double_slab(s, page,
freelist, counters,
- NULL, slub_data(&new)->counters,
+ NULL, new.counters,
"get_freelist"));
return freelist;
@@ -2533,7 +2541,7 @@ static void __slab_free(struct kmem_cach
void *prior;
void **object = (void *)x;
int was_frozen;
- struct page new;
+ struct slub_data new;
unsigned long counters;
struct kmem_cache_node *n = NULL;
unsigned long uninitialized_var(flags);
@@ -2552,10 +2560,10 @@ static void __slab_free(struct kmem_cach
prior = slub_data(page)->freelist;
counters = slub_data(page)->counters;
set_freepointer(s, object, prior);
- slub_data(&new)->counters = counters;
- was_frozen = slub_data(&new)->frozen;
- slub_data(&new)->inuse--;
- if ((!slub_data(&new)->inuse || !prior) && !was_frozen) {
+ new.counters = counters;
+ was_frozen = new.frozen;
+ new.inuse--;
+ if ((!new.inuse || !prior) && !was_frozen) {
if (kmem_cache_has_cpu_partial(s) && !prior)
@@ -2565,7 +2573,7 @@ static void __slab_free(struct kmem_cach
* We can defer the list move and instead
* freeze it.
*/
- slub_data(&new)->frozen = 1;
+ new.frozen = 1;
else { /* Needs to be taken off a list */
@@ -2585,7 +2593,7 @@ static void __slab_free(struct kmem_cach
} while (!cmpxchg_double_slab(s, page,
prior, counters,
- object, slub_data(&new)->counters,
+ object, new.counters,
"__slab_free"));
if (likely(!n)) {
@@ -2594,7 +2602,7 @@ static void __slab_free(struct kmem_cach
* If we just froze the page then put it onto the
* per cpu partial list.
*/
- if (slub_data(&new)->frozen && !was_frozen) {
+ if (new.frozen && !was_frozen) {
put_cpu_partial(s, page, 1);
stat(s, CPU_PARTIAL_FREE);
}
@@ -2607,7 +2615,7 @@ static void __slab_free(struct kmem_cach
return;
}
- if (unlikely(!slub_data(&new)->inuse && n->nr_partial > s->min_partial))
+ if (unlikely(!new.inuse && n->nr_partial > s->min_partial))
goto slab_empty;
/*
@@ -3431,7 +3439,8 @@ int kmem_cache_shrink(struct kmem_cache
* list_lock. ->inuse here is the upper limit.
*/
list_for_each_entry_safe(page, t, &n->partial, lru) {
- list_move(&page->lru, slabs_by_inuse + slub_data(page)->inuse);
+ list_move(&page->lru, slabs_by_inuse +
+ slub_data(page)->inuse);
if (!slub_data(page)->inuse)
n->nr_partial--;
}
_
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-13 23:59 [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Dave Hansen
` (6 preceding siblings ...)
2013-12-13 23:59 ` [RFC][PATCH 7/7] mm: slub: cleanups after code churn Dave Hansen
@ 2013-12-17 0:01 ` Andrew Morton
2013-12-17 0:45 ` Dave Hansen
7 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2013-12-17 0:01 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, linux-mm, Pravin B Shelar, Christoph Lameter,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On Fri, 13 Dec 2013 15:59:03 -0800 Dave Hansen <dave@sr71.net> wrote:
> SLUB depends on a 16-byte cmpxchg for an optimization. For the
> purposes of this series, I'm assuming that it is a very important
> optimization that we desperately need to keep around.
What if we don't do that.
> In order to get guaranteed 16-byte alignment (required by the
> hardware on x86), 'struct page' is padded out from 56 to 64
> bytes.
>
> Those 8-bytes matter. We've gone to great lengths to keep
> 'struct page' small in the past. It's a shame that we bloat it
> now just for alignment reasons when we have extra space. Plus,
> bloating such a commonly-touched structure *HAS* to have cache
> footprint implications.
>
> These patches attempt _internal_ alignment instead of external
> alignment for slub.
>
> I also got a bug report from some folks running a large database
> benchmark. Their old kernel uses slab and their new one uses
> slub. They were swapping and couldn't figure out why. It turned
> out to be the 2GB of RAM that the slub padding wastes on their
> system.
>
> On my box, that 2GB cost about $200 to populate back when we
> bought it. I want my $200 back.
>
> This set takes me from 16909584K of reserved memory at boot
> down to 14814472K, so almost *exactly* 2GB of savings! It also
> helps performance, presumably because it touches 14% fewer
> struct page cachelines. A 30GB dd to a ramfs file:
>
> dd if=/dev/zero of=bigfile bs=$((1<<30)) count=30
>
> is sped up by about 4.4% in my testing.
This is a gruesome and horrible tale of inefficiency and regression.
From 5-10 minutes of gitting I couldn't see any performance testing
results for slub's cmpxchg_double stuff. I am thinking we should just
tip it all overboard unless someone can demonstrate sufficiently
serious losses from so doing.
--- a/arch/x86/Kconfig~a
+++ a/arch/x86/Kconfig
@@ -78,7 +78,6 @@ config X86
select ANON_INODES
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
select HAVE_CMPXCHG_LOCAL
- select HAVE_CMPXCHG_DOUBLE
select HAVE_ARCH_KMEMCHECK
select HAVE_USER_RETURN_NOTIFIER
select ARCH_BINFMT_ELF_RANDOMIZE_PIE
_
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-17 0:01 ` [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on Andrew Morton
@ 2013-12-17 0:45 ` Dave Hansen
2013-12-17 15:17 ` Christoph Lameter
2013-12-18 8:51 ` Pekka Enberg
0 siblings, 2 replies; 20+ messages in thread
From: Dave Hansen @ 2013-12-17 0:45 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Pravin B Shelar, Christoph Lameter,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On 12/16/2013 04:01 PM, Andrew Morton wrote:
> On Fri, 13 Dec 2013 15:59:03 -0800 Dave Hansen <dave@sr71.net> wrote:
>> SLUB depends on a 16-byte cmpxchg for an optimization. For the
>> purposes of this series, I'm assuming that it is a very important
>> optimization that we desperately need to keep around.
>
> What if we don't do that.
I'll do some testing and see if I can coax out any delta from the
optimization myself. Christoph went to a lot of trouble to put this
together, so I assumed that he had a really good reason, although the
changelogs don't really mention any.
I honestly can't imagine that a cmpxchg16 is going to be *THAT* much
cheaper than a per-page spinlock. The contended case of the cmpxchg is
way more expensive than spinlock contention for sure.
fc9bb8c768's commit message says:
> The doublewords must be properly aligned for cmpxchg_double to work.
> Sadly this increases the size of page struct by one word on some architectures.
> But as a result page structs are now cacheline aligned on x86_64.
I'm not sure what aligning them buys us though. I think I just
demonstrated that cache footprint is *way* more important than alignment.
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-17 0:45 ` Dave Hansen
@ 2013-12-17 15:17 ` Christoph Lameter
2013-12-19 0:24 ` Dave Hansen
2013-12-18 8:51 ` Pekka Enberg
1 sibling, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2013-12-17 15:17 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, linux-kernel, linux-mm, Pravin B Shelar,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On Mon, 16 Dec 2013, Dave Hansen wrote:
> I'll do some testing and see if I can coax out any delta from the
> optimization myself. Christoph went to a lot of trouble to put this
> together, so I assumed that he had a really good reason, although the
> changelogs don't really mention any.
The cmpxchg on the struct page avoids disabling interrupts etc and
therefore simplifies the code significantly.
> I honestly can't imagine that a cmpxchg16 is going to be *THAT* much
> cheaper than a per-page spinlock. The contended case of the cmpxchg is
> way more expensive than spinlock contention for sure.
Make sure slub does not set __CMPXCHG_DOUBLE in the kmem_cache flags
and it will fall back to spinlocks if you want to do a comparison. Most
non x86 arches will use that fallback code.
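For a feel for the two code shapes being compared, here is a
scaled-down, runnable userspace stand-in (plain C11 plus pthreads,
nothing kernel-specific): a snapshot/modify/compare-exchange loop in
place of the 16-byte cmpxchg, and a mutex in place of the
slab_lock-with-interrupts-off fallback. It only mimics the structure
of the two paths, not the kernel's actual costs.

/* build: cc -O2 -pthread cas_vs_lock.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t counters_cas;   /* stand-in for the cmpxchg path */
static uint64_t counters_locked;        /* stand-in for the locked path  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void update_cas(void)
{
        uint64_t old, new;

        do {                            /* snapshot, modify a copy, publish */
                old = atomic_load(&counters_cas);
                new = old + 1;
        } while (!atomic_compare_exchange_weak(&counters_cas, &old, new));
}

static void update_locked(void)
{
        pthread_mutex_lock(&lock);      /* the kernel fallback also masks irqs */
        counters_locked++;
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        for (int i = 0; i < 1000000; i++) {
                update_cas();
                update_locked();
        }
        printf("%llu %llu\n",
               (unsigned long long)atomic_load(&counters_cas),
               (unsigned long long)counters_locked);
        return 0;
}

Timing each helper separately is only a crude analogue of the
uncontended comparison; the kernel numbers later in the thread are the
ones that matter.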
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-17 15:17 ` Christoph Lameter
@ 2013-12-19 0:24 ` Dave Hansen
2013-12-19 0:41 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: Dave Hansen @ 2013-12-19 0:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-kernel, linux-mm, Pravin B Shelar,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On 12/17/2013 07:17 AM, Christoph Lameter wrote:
> On Mon, 16 Dec 2013, Dave Hansen wrote:
>
>> I'll do some testing and see if I can coax out any delta from the
>> optimization myself. Christoph went to a lot of trouble to put this
>> together, so I assumed that he had a really good reason, although the
>> changelogs don't really mention any.
>
> The cmpxchg on the struct page avoids disabling interrupts etc and
> therefore simplifies the code significantly.
>
>> I honestly can't imagine that a cmpxchg16 is going to be *THAT* much
>> cheaper than a per-page spinlock. The contended case of the cmpxchg is
>> way more expensive than spinlock contention for sure.
>
> Make sure slub does not set __CMPXCHG_DOUBLE in the kmem_cache flags
> and it will fall back to spinlocks if you want to do a comparison. Most
> non x86 arches will use that fallback code.
I did four tests. The first workload allocs a bunch of stuff, then
frees it all; I ran it with both the cmpxchg-enabled 64-byte 'struct
page' and the 48-byte one that is supposed to use a spinlock. I
confirmed the 'struct
page' size in both cases by looking at dmesg.
Essentially, I see no worthwhile benefit from using the double-cmpxchg
over the spinlock. In fact, the increased cache footprint makes it
*substantially* worse when doing a tight loop.
Unless somebody can find some holes in this, I think we have no choice
but to unset the HAVE_ALIGNED_STRUCT_PAGE config option and revert using
the cmpxchg, at least for now.
Kernel config:
https://www.sr71.net/~dave/intel/config-20131218-structpagesize
System was an 80-core "Westmere" Xeon
I suspect that the original data:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a5ec0b
are invalid because those measurements were not taken with the increased
'struct page' padding.
---------------------------
First test:
for (i = 0; i < kmalloc_iterations; i++)
        gunk[i] = kmalloc(kmalloc_size, GFP_KERNEL);
for (i = 0; i < kmalloc_iterations; i++)
        kfree(gunk[i]);
All units are in nanoseconds; lower is better.
                  size of 'struct page':
kmalloc size      64-byte      48-byte
           8         98.2        105.7
          32        123.7        125.8
         128        293.9        289.9
         256        572.4        577.9
        1024        621.0        639.3
        4096        733.3        746.7
        8192        968.3        948.6
As you can see, it's mostly a wash. The 64-byte one looks to have a
~8ns advantage, but any advantage disappears into the noise on the
other sizes.
---------------------------
The second test used the same 'struct page' sizes, but instead did a
kmalloc() immediately followed by a kfree():
for (i = 0; i < kmalloc_iterations; i++) {
        gunk[i] = kmalloc(kmalloc_size, GFP_KERNEL);
        kfree(gunk[i]);
}
                  size of 'struct page':
kmalloc size      64-byte      48-byte
           8         58.6         43.0
          32         59.3         43.0
         128         59.4         43.2
         256         57.4         42.8
        1024         80.4         43.0
        4096         76.0         43.8
        8192         79.9         43.0
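The harness behind these numbers isn't shown in the thread; a
throwaway module along the following lines is one way to reproduce
loops like the above and get per-iteration nanoseconds out of dmesg.
The module and parameter names are invented, and only stock
kmalloc/ktime interfaces are used; error handling is minimal.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static int kmalloc_iterations = 1000000;
static int kmalloc_size = 256;
module_param(kmalloc_iterations, int, 0444);
module_param(kmalloc_size, int, 0444);

static int __init kmalloc_bench_init(void)
{
        void **gunk;
        ktime_t start;
        u64 ns;
        int i;

        gunk = vmalloc(kmalloc_iterations * sizeof(void *));
        if (!gunk)
                return -ENOMEM;

        /* first test: alloc everything, then free everything (cache-cold frees) */
        start = ktime_get();
        for (i = 0; i < kmalloc_iterations; i++)
                gunk[i] = kmalloc(kmalloc_size, GFP_KERNEL);
        for (i = 0; i < kmalloc_iterations; i++)
                kfree(gunk[i]);
        ns = ktime_to_ns(ktime_sub(ktime_get(), start));
        pr_info("size %d, alloc-all/free-all: %llu ns/iter\n", kmalloc_size,
                (unsigned long long)div_u64(ns, kmalloc_iterations));

        /* second test: free immediately, while the object is still cache-warm */
        start = ktime_get();
        for (i = 0; i < kmalloc_iterations; i++) {
                gunk[i] = kmalloc(kmalloc_size, GFP_KERNEL);
                kfree(gunk[i]);
        }
        ns = ktime_to_ns(ktime_sub(ktime_get(), start));
        pr_info("size %d, alloc+free: %llu ns/iter\n", kmalloc_size,
                (unsigned long long)div_u64(ns, kmalloc_iterations));

        vfree(gunk);
        return -EAGAIN;         /* fail the load on purpose; we only wanted the printout */
}
module_init(kmalloc_bench_init);
MODULE_LICENSE("GPL");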
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-19 0:24 ` Dave Hansen
@ 2013-12-19 0:41 ` Andrew Morton
2013-12-19 0:48 ` Dave Hansen
2013-12-19 19:14 ` Dave Hansen
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2013-12-19 0:41 UTC (permalink / raw)
To: Dave Hansen
Cc: Christoph Lameter, linux-kernel, linux-mm, Pravin B Shelar,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On Wed, 18 Dec 2013 16:24:15 -0800 Dave Hansen <dave@sr71.net> wrote:
> On 12/17/2013 07:17 AM, Christoph Lameter wrote:
> > On Mon, 16 Dec 2013, Dave Hansen wrote:
> >
> >> I'll do some testing and see if I can coax out any delta from the
> >> optimization myself. Christoph went to a lot of trouble to put this
> >> together, so I assumed that he had a really good reason, although the
> >> changelogs don't really mention any.
> >
> > The cmpxchg on the struct page avoids disabling interrupts etc and
> > therefore simplifies the code significantly.
> >
> >> I honestly can't imagine that a cmpxchg16 is going to be *THAT* much
> >> cheaper than a per-page spinlock. The contended case of the cmpxchg is
> >> way more expensive than spinlock contention for sure.
> >
> > Make sure slub does not set __CMPXCHG_DOUBLE in the kmem_cache flags
> > and it will fall back to spinlocks if you want to do a comparison. Most
> > non x86 arches will use that fallback code.
>
>
> I did four tests. The first workload allocs a bunch of stuff, then
> frees it all with both the cmpxchg-enabled 64-byte struct page and the
> 48-byte one that is supposed to use a spinlock. I confirmed the 'struct
> page' size in both cases by looking at dmesg.
>
> Essentially, I see no worthwhile benefit from using the double-cmpxchg
> over the spinlock. In fact, the increased cache footprint makes it
> *substantially* worse when doing a tight loop.
>
> Unless somebody can find some holes in this, I think we have no choice
> but to unset the HAVE_ALIGNED_STRUCT_PAGE config option and revert using
> the cmpxchg, at least for now.
>
So your scary patch series which shrinks struct page while retaining
the cmpxchg_double() might reclaim most of this loss?
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-19 0:41 ` Andrew Morton
@ 2013-12-19 0:48 ` Dave Hansen
2013-12-19 15:21 ` Christoph Lameter
2013-12-19 19:14 ` Dave Hansen
1 sibling, 1 reply; 20+ messages in thread
From: Dave Hansen @ 2013-12-19 0:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, linux-kernel, linux-mm, Pravin B Shelar,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On 12/18/2013 04:41 PM, Andrew Morton wrote:
>> > Unless somebody can find some holes in this, I think we have no choice
>> > but to unset the HAVE_ALIGNED_STRUCT_PAGE config option and revert using
>> > the cmpxchg, at least for now.
>
> So your scary patch series which shrinks struct page while retaining
> the cmpxchg_double() might reclaim most of this loss?
That's what I'll test next, but I hope so.
The config tweak is important because it shows a low-risk way to get a
small 'struct page', plus get back some performance that we lost and
evidently never noticed. A distro that was nearing a release might want
to go with this, for instance.
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-19 0:48 ` Dave Hansen
@ 2013-12-19 15:21 ` Christoph Lameter
0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2013-12-19 15:21 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, linux-kernel, linux-mm, Pravin B Shelar,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On Wed, 18 Dec 2013, Dave Hansen wrote:
> On 12/18/2013 04:41 PM, Andrew Morton wrote:
> >> > Unless somebody can find some holes in this, I think we have no choice
> >> > but to unset the HAVE_ALIGNED_STRUCT_PAGE config option and revert using
> >> > the cmpxchg, at least for now.
> >
> > So your scary patch series which shrinks struct page while retaining
> > the cmpxchg_double() might reclaim most of this loss?
>
> That's what I'll test next, but I hope so.
>
> The config tweak is important because it shows a low-risk way to get a
> small 'struct page', plus get back some performance that we lost and
> evidently never noticed. A distro that was nearing a release might want
> to go with this, for instance.
Ok, then let's just drop the cmpxchg updates to the page struct. The
spinlock code is already in there, so just removing the __CMPXCHG
flag-related processing should do the trick.
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-19 0:41 ` Andrew Morton
2013-12-19 0:48 ` Dave Hansen
@ 2013-12-19 19:14 ` Dave Hansen
1 sibling, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2013-12-19 19:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, linux-kernel, linux-mm, Pravin B Shelar,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On 12/18/2013 04:41 PM, Andrew Morton wrote:
> So your scary patch series which shrinks struct page while retaining
> the cmpxchg_double() might reclaim most of this loss?
Well, this is cool. Except for 1 case out of 14 (1024 bytes with the
alloc all / free all loops), my patched kernel either outperforms or
matches both of the existing cases.
To recap, we have two workloads, essentially the time to free an "old"
kmalloc which is not cache-warm (mode=0) and the time to free one which
is warm since it was just allocated (mode=1).
This is tried for 3 different kernel configurations:
1. The default today, SLUB with a 64-byte 'struct page' using cmpxchg16
2. Same kernel source as (1), but with SLUB's compile-time options
changed to disable CMPXCHG16 and not align 'struct page'
3. Patched kernel to internally align the SLUB data so that we can both
have an unaligned 56-byte 'struct page' and use the CMPXCHG16
optimization.
> https://docs.google.com/spreadsheet/ccc?key=0AgUCVXtr5IwedDNXb1FLNEFqVHdSNDF6YktYZTBndEE&usp=sharing
I'll respin the patches a bit and send out another version with some
small updates.
* Re: [RFC][PATCH 0/7] re-shrink 'struct page' when SLUB is on.
2013-12-17 0:45 ` Dave Hansen
2013-12-17 15:17 ` Christoph Lameter
@ 2013-12-18 8:51 ` Pekka Enberg
1 sibling, 0 replies; 20+ messages in thread
From: Pekka Enberg @ 2013-12-18 8:51 UTC (permalink / raw)
To: Dave Hansen, Andrew Morton
Cc: linux-kernel, linux-mm, Pravin B Shelar, Christoph Lameter,
Kirill A. Shutemov, Andi Kleen, Pekka Enberg
On 12/17/2013 02:45 AM, Dave Hansen wrote:
> I'll do some testing and see if I can coax out any delta from the
> optimization myself. Christoph went to a lot of trouble to put this
> together, so I assumed that he had a really good reason, although the
> changelogs don't really mention any.
IIRC it's commit 8a5ec0b ("Lockless (and preemptless) fastpaths for
slub") that documents the performance gains.
The page alignment patches came later once we discovered that we broke
the world...
Pekka