public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
@ 2011-02-23 15:10 Michal Hocko
  2011-02-23 18:19 ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2011-02-23 15:10 UTC (permalink / raw)
  To: linux-mm; +Cc: KAMEZAWA Hiroyuki, linux-kernel

Hi,

I have just noticed that memory cgroups consume a lot of 2MB slab
objects (8 objects per 1GB of RAM on x86_64 with SPARSEMEM). It turned
out that this memory is allocated for the per-memory-section page_cgroup
arrays.

If we consider that the array itself consumes something above 1MB (but
still doesn't fit into the 1MB kmalloc cache), this is a rather big
waste of (contiguous) memory (6MB per 1GB of RAM).


The patch below tries to fix this up. Any thoughts?
---
>From 76ad25f5d2e31dd2af99b77b1f709748247a5fc3 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 23 Feb 2011 14:06:32 +0100
Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM

Currently we are allocating a single page_cgroup array per memory
section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
This is a correct but memory-inefficient solution because the allocated
memory (unless we fall back to vmalloc) is not kmalloc friendly:
	- 32b - 16384 entries (20B per entry) require 327680B so the
	  524288B slab cache is used
	- 32b with PAE - 131072 entries requiring 2621440B fit into the
	  4194304B cache
	- 64b - 32768 entries (40B per entry) require 1310720B so the
	  2097152B cache is used

This is ~37% wasted space per memory section and it sums up over the
whole memory. On an x86_64 machine it is something like 6MB per 1GB of
RAM.

We can reduce this internal fragmentation by splitting the single
page_cgroup array into more arrays where each one is well kmalloc
aligned. This patch implements this idea.

mem_section now contains a two-level array (base). Each entry of the
array is a slot which consists of a header (the first entry, holding the
number of entries in the slot) followed by an array of page_cgroups.

When we initialize page_cgroups (in init_section_page_cgroup) we use a
greedy algorithm to allocate slots. Each slot uses the kmalloc general
purpose cache sizes (the kmalloc_aligned_sizes array) and we pick the
largest size which is smaller than or equal to the size requested for
the number of entries. This is repeated until we are able to store all
entries.

lookup_page_cgroup gets a little more complicated, but not too much. We
have to walk over the slots to reach an entry stored in the last slot.
Currently we fit into 3 slots, so this means at most 3 more dereferences
per lookup (one to check the number of entries in each slot).

I haven't done any performance testing yet so if anybody has an
interesting test case to try I will happily do it.

The total internal fragmentation drops under 4kB per section (32kB
per 1GB) on an x86_64 machine. We also no longer waste that much memory
in large contiguous chunks.

We could also tune the kmalloc_aligned_sizes array to contain smaller
allocation sizes for the remaining entries which do not fit into larger
allocations (this implementation wastes one page for the last 3
entries).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/mmzone.h |    8 +-
 mm/page_cgroup.c       |  266 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 226 insertions(+), 48 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..f635ac6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -979,9 +979,13 @@ struct mem_section {
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
+	 * page_cgroups are not stored in one array but rather in an array
+	 * of slots where each slot is an array (kmalloc size aligned)
+	 * of pcg_slot_entry (see mm/page_cgroup.c for more details).
 	 */
-	struct page_cgroup *page_cgroup;
-	unsigned long pad;
+	struct page_cgroup **page_cgroup;
+	/* first pfn in this section */
+	unsigned long start_pfn;
 #endif
 };
 
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5bffada..c6cfff8 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -95,54 +95,230 @@ fail:
 
 #else /* CONFIG_FLAT_NODE_MEM_MAP */
 
+/* TODO a more precise, calculated value would be nicer, but 4 slots fit
+ * into 32b and we do not use more than 3 slots for any of 32b,
+ * 32b with PAE and 64b anyway
+ */
+#define MAX_SLOTS_COUNT 4
+
+/* Index of the slot header */
+#define SLOT_HEADER	0
+
+/* Index of the first page_cgroup stored in a slot */
+#define SLOT_FIRST_PFN	(SLOT_HEADER+1)
+
+/* TODO we can do this by including <linux/kmalloc_sizes.h> to
+ * prevent from sizes duplication. Should we?
+ */
+static size_t kmalloc_aligned_sizes[] = {
+	4194304, 2097152, 1048576,
+	524288, 262144, 131072, 65536,
+	32768, 16384, 8192, 4096};
+
+/* Minimum allocation size for a slot to keep fragmentation under a
+ * limit
+ */
+#define MIN_SLOT_ALOC	4096
+
+/* Header of the page cgroup slot. It contains the number of entries
+ * (without header) stored in the slot.
+ */
+struct pcg_slot_header {
+	unsigned long entries;
+};
+
+/* Slot entry for the section page_cgroup array.
+ * First entry (SLOT_HEADER) of the slot defines the number
+ * of entries in the slot. All others (>=SLOT_FIRST_PFN) define
+ * struct page_cgroup.
+ */
+struct pcg_slot_entry {
+	union {
+		struct pcg_slot_header header;
+		struct page_cgroup pcg;
+	};
+};
+
+#define pcg_slots_from_section(section) \
+	((struct pcg_slot_entry **)section->page_cgroup)
+
+#define pcg_from_slot(slot) 			\
+	({					\
+	 struct pcg_slot_entry * ____slot = slot;		\
+	 struct pcg_slot_entry * ____first_pfn = ____slot + SLOT_FIRST_PFN; \
+         (struct page_cgroup *)____first_pfn;	\
+	 })
+
+#define header_from_slot(slot) 			\
+	({					\
+	 struct pcg_slot_entry * ____slot = slot;	\
+	 struct pcg_slot_entry * ____header = ____slot + SLOT_HEADER; \
+         (struct pcg_slot_header *)____header;		\
+	 })
+
+struct page_cgroup *lookup_pfn_page_cgroup(unsigned long pfn)
+{
+	struct mem_section *section = __pfn_to_section(pfn);
+	struct pcg_slot_entry **slots = pcg_slots_from_section(section);
+	unsigned long index = pfn - section->start_pfn,
+		      count = 0, slot;
+
+	if (!slots)
+		return NULL;
+
+	for (slot = 0; slot < MAX_SLOTS_COUNT; slot++) {
+		unsigned entries = header_from_slot(slots[slot])->entries;
+
+		if (index < count + entries)
+			return &pcg_from_slot(slots[slot])[index-count];
+		count += entries;
+	}
+
+	/* TODO really panic or rather return NULL */
+	printk(KERN_CRIT "%lu pfn not found in the section\n", pfn);
+	BUG();
+}
+
 struct page_cgroup *lookup_page_cgroup(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
-	struct mem_section *section = __pfn_to_section(pfn);
 
-	if (!section->page_cgroup)
+	return lookup_pfn_page_cgroup(pfn);
+}
+
+static void * __init_refok ____alloc_size(size_t size, int nid)
+{
+	void *addr = NULL;
+
+	if (node_state(nid, N_HIGH_MEMORY)) {
+		addr = kmalloc_node(size,
+			GFP_KERNEL | __GFP_NOWARN, nid);
+		if (!addr)
+			addr = vmalloc_node(size, nid);
+	} else {
+		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!addr)
+			addr = vmalloc(size);
+	}
+
+	return addr;
+}
+
+static void __free_addr(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		vfree(addr);
+	else {
+		struct page *page = virt_to_page(addr);
+
+		if (!PageReserved(page)) { /* Is bootmem ? */
+			kfree(addr);
+		}
+	}
+}
+
+#define array_size(array) (sizeof(array)/sizeof(*array))
+
+/* Allocates one kmalloc aligned slot for as many page_cgroups as possible
+ * (from the given count). We are trying hard to fill the allocated slot as
+ * full as possible by selecting the biggest kmalloc gen.  purpose cache that
+ * can be completely filled up).
+ * Initializes slot's header with the number of entries that fit in it.
+ */
+static struct pcg_slot_entry *__init_refok greedy_slot_alloc(unsigned count, int nid)
+{
+	/* allocate also space for header */
+	size_t size = sizeof(struct pcg_slot_entry)*(count + SLOT_FIRST_PFN);
+	struct pcg_slot_entry *pcg_slot;
+	unsigned entries;
+	int i = 0, blocks = array_size(kmalloc_aligned_sizes);
+
+	/* MIN_SLOT_ALOC keeps allocations reasonably large so that we
+	 * do not split up into too many chunks
+	 */
+	while (i < blocks && kmalloc_aligned_sizes[i] > MIN_SLOT_ALOC) {
+		if (size >= kmalloc_aligned_sizes[i])
+			break;
+		i++;
+	};
+
+	if (!(pcg_slot = ____alloc_size(kmalloc_aligned_sizes[i], nid)))
+		return NULL;
+
+	/* final allocation can be bigger than we want */
+	entries = min(count,
+		       	(kmalloc_aligned_sizes[i]/sizeof(struct pcg_slot_entry) - SLOT_FIRST_PFN));
+	header_from_slot(pcg_slot)->entries = entries;
+	total_usage += kmalloc_aligned_sizes[i];
+
+	return pcg_slot;
+}
+
+static void free_pcg_slots(struct pcg_slot_entry **slots, size_t size)
+{
+	size_t i;
+	for (i = 0; i < size; i++)
+		__free_addr(slots[i]);
+	__free_addr(slots);
+}
+
+/* Allocates array of slots for at least the given count page_cgroup
+ * entries. Uses greedy_slot_alloc to minimize internal fragmentation.
+ */
+static struct pcg_slot_entry **__init_refok alloc_page_cgroup_slots(
+		unsigned count, int nid)
+{
+	struct pcg_slot_entry **base;
+	size_t slot = 0;
+	unsigned done;
+
+	VM_BUG_ON(!slab_is_available());
+	if (!(base = ____alloc_size(sizeof(struct pcg_slot_entry*)*MAX_SLOTS_COUNT, nid)))
 		return NULL;
-	return section->page_cgroup + pfn;
+
+	for (done = 0; done < count; ) {
+		if (slot >= MAX_SLOTS_COUNT) {
+			/* TODO PANIC? realloc? */
+			printk("Not enough slots\n");
+			goto out_free;
+		}
+
+		base[slot] = greedy_slot_alloc(count-done, nid);
+		if (!base[slot])
+			goto out_free;
+
+		done += header_from_slot(base[slot])->entries;
+		slot++;
+	}
+
+	return base;
+out_free:
+	free_pcg_slots(base, slot);
+	return NULL;	
 }
 
 /* __alloc_bootmem...() is protected by !slab_available() */
 static int __init_refok init_section_page_cgroup(unsigned long pfn)
 {
 	struct mem_section *section = __pfn_to_section(pfn);
-	struct page_cgroup *base, *pc;
-	unsigned long table_size;
-	int nid, index;
+	struct pcg_slot_entry **base = pcg_slots_from_section(section);
+	unsigned long start_pfn = pfn;
+	int nid, slot;
+	unsigned count = PAGES_PER_SECTION;
 
-	if (!section->page_cgroup) {
+	if (!base) {
 		nid = page_to_nid(pfn_to_page(pfn));
-		table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-		VM_BUG_ON(!slab_is_available());
-		if (node_state(nid, N_HIGH_MEMORY)) {
-			base = kmalloc_node(table_size,
-				GFP_KERNEL | __GFP_NOWARN, nid);
-			if (!base)
-				base = vmalloc_node(table_size, nid);
-		} else {
-			base = kmalloc(table_size, GFP_KERNEL | __GFP_NOWARN);
-			if (!base)
-				base = vmalloc(table_size);
-		}
-		/*
-		 * The value stored in section->page_cgroup is (base - pfn)
-		 * and it does not point to the memory block allocated above,
-		 * causing kmemleak false positives.
-		 */
-		kmemleak_not_leak(base);
+		base = alloc_page_cgroup_slots(count, nid);
 	} else {
 		/*
  		 * We don't have to allocate page_cgroup again, but
 		 * address of memmap may be changed. So, we have to initialize
 		 * again.
 		 */
-		base = section->page_cgroup + pfn;
-		table_size = 0;
+		struct page_cgroup *page_cgroup = lookup_pfn_page_cgroup(pfn);
+
 		/* check address of memmap is changed or not. */
-		if (base->page == pfn_to_page(pfn))
+		if (page_cgroup->page == pfn_to_page(pfn))
 			return 0;
 	}
 
@@ -151,35 +327,33 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 		return -ENOMEM;
 	}
 
-	for (index = 0; index < PAGES_PER_SECTION; index++) {
-		pc = base + index;
-		__init_page_cgroup(pc, pfn + index);
-	}
+	for (slot = 0; slot < MAX_SLOTS_COUNT; slot++) {
+		unsigned index, entries = header_from_slot(base[slot])->entries;

-	section->page_cgroup = base - pfn;
-	total_usage += table_size;
+		for (index = 0; index < entries && count > 0;
+				index++, pfn++, count--) {
+			struct page_cgroup *pcg = &pcg_from_slot(base[slot])[index];
+
+			__init_page_cgroup(pcg, pfn);
+		}
+		if (!count)
+			break;
+	}
+
+	section->page_cgroup = (struct page_cgroup **)base;
+	section->start_pfn = start_pfn;
 	return 0;
 }
 #ifdef CONFIG_MEMORY_HOTPLUG
 void __free_page_cgroup(unsigned long pfn)
 {
 	struct mem_section *ms;
-	struct page_cgroup *base;
 
 	ms = __pfn_to_section(pfn);
 	if (!ms || !ms->page_cgroup)
 		return;
-	base = ms->page_cgroup + pfn;
-	if (is_vmalloc_addr(base)) {
-		vfree(base);
-		ms->page_cgroup = NULL;
-	} else {
-		struct page *page = virt_to_page(base);
-		if (!PageReserved(page)) { /* Is bootmem ? */
-			kfree(base);
-			ms->page_cgroup = NULL;
-		}
-	}
+	free_pcg_slots(pcg_slots_from_section(ms), MAX_SLOTS_COUNT);
+	ms->page_cgroup = NULL;
 }
 
 int __meminit online_page_cgroup(unsigned long start_pfn,
-- 
1.7.2.3


-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
  2011-02-23 15:10 [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM Michal Hocko
@ 2011-02-23 18:19 ` Dave Hansen
  2011-02-23 23:52   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Dave Hansen @ 2011-02-23 18:19 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, KAMEZAWA Hiroyuki, linux-kernel

On Wed, 2011-02-23 at 16:10 +0100, Michal Hocko wrote:
> We can reduce this internal fragmentation by splitting the single
> page_cgroup array into more arrays where each one is well kmalloc
> aligned. This patch implements this idea. 

How about using alloc_pages_exact()?  These things aren't allocated
often enough to really get most of the benefits of being in a slab.
That'll at least get you down to a maximum of about PAGE_SIZE wasted.  

-- Dave



* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
  2011-02-23 18:19 ` Dave Hansen
@ 2011-02-23 23:52   ` KAMEZAWA Hiroyuki
  2011-02-24  9:35     ` Michal Hocko
  2011-02-24  9:33   ` Michal Hocko
  2011-02-24 13:40   ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v2 Michal Hocko
  2 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-23 23:52 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Michal Hocko, linux-mm, linux-kernel

On Wed, 23 Feb 2011 10:19:22 -0800
Dave Hansen <dave@linux.vnet.ibm.com> wrote:

> On Wed, 2011-02-23 at 16:10 +0100, Michal Hocko wrote:
> > We can reduce this internal fragmentation by splitting the single
> > page_cgroup array into more arrays where each one is well kmalloc
> > aligned. This patch implements this idea. 
> 
> How about using alloc_pages_exact()?  These things aren't allocated
> often enough to really get most of the benefits of being in a slab.
> That'll at least get you down to a maximum of about PAGE_SIZE wasted.  
> 

yes, alloc_pages_exact() is much better.

packing page_cgroups for multiple sections causes breakage in memory hotplug logic.

Thanks,
-Kame



* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
  2011-02-23 18:19 ` Dave Hansen
  2011-02-23 23:52   ` KAMEZAWA Hiroyuki
@ 2011-02-24  9:33   ` Michal Hocko
  2011-02-24 13:40   ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v2 Michal Hocko
  2 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2011-02-24  9:33 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, KAMEZAWA Hiroyuki, linux-kernel

On Wed 23-02-11 10:19:22, Dave Hansen wrote:
> On Wed, 2011-02-23 at 16:10 +0100, Michal Hocko wrote:
> > We can reduce this internal fragmentation by splitting the single
> > page_cgroup array into more arrays where each one is well kmalloc
> > aligned. This patch implements this idea. 
> 
> How about using alloc_pages_exact()?  These things aren't allocated
> often enough to really get most of the benefits of being in a slab.

Yes, you are right. alloc_pages_exact (which I wasn't aware of) is much
simpler solution and it gets comparable results. I will prepare the
patch for review.

Thanks for pointing this out.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
  2011-02-23 23:52   ` KAMEZAWA Hiroyuki
@ 2011-02-24  9:35     ` Michal Hocko
  2011-02-24 10:02       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2011-02-24  9:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Hansen, linux-mm, linux-kernel

On Thu 24-02-11 08:52:27, KAMEZAWA Hiroyuki wrote:
> On Wed, 23 Feb 2011 10:19:22 -0800
> Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> 
> > On Wed, 2011-02-23 at 16:10 +0100, Michal Hocko wrote:
> > > We can reduce this internal fragmentation by splitting the single
> > > page_cgroup array into more arrays where each one is well kmalloc
> > > aligned. This patch implements this idea. 
> > 
> > How about using alloc_pages_exact()?  These things aren't allocated
> > often enough to really get most of the benefits of being in a slab.
> > That'll at least get you down to a maximum of about PAGE_SIZE wasted.  
> > 
> 
> yes, alloc_pages_exact() is much better.
> 
> packing page_cgroups for multiple sections causes breakage in memory hotplug logic.

I am not sure I understand this. What do you mean by packing
page_cgroups for multiple sections? The patch I have posted doesn't do
any packing. Or do you mean that using a double array can break hotplug?
Not that this would change anything, alloc_pages_exact is really a
better solution, I am just curious ;) 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
  2011-02-24  9:35     ` Michal Hocko
@ 2011-02-24 10:02       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-24 10:02 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Dave Hansen, linux-mm, linux-kernel

On Thu, 24 Feb 2011 10:35:19 +0100
Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 24-02-11 08:52:27, KAMEZAWA Hiroyuki wrote:
> > On Wed, 23 Feb 2011 10:19:22 -0800
> > Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> > 
> > > On Wed, 2011-02-23 at 16:10 +0100, Michal Hocko wrote:
> > > > We can reduce this internal fragmentation by splitting the single
> > > > page_cgroup array into more arrays where each one is well kmalloc
> > > > aligned. This patch implements this idea. 
> > > 
> > > How about using alloc_pages_exact()?  These things aren't allocated
> > > often enough to really get most of the benefits of being in a slab.
> > > That'll at least get you down to a maximum of about PAGE_SIZE wasted.  
> > > 
> > 
> > yes, alloc_pages_exact() is much better.
> > 
> > packing page_cgroups for multiple sections causes breakage in memory hotplug logic.
> 
> I am not sure I understand this. What do you mean by packing
> page_cgroups for multiple sections? The patch I have posted doesn't do
> any packing. Or do you mean that using a double array can break hotplug?
> Not that this would change anything, alloc_pages_exact is really a
> better solution, I am just curious ;) 
> 

Sorry, it seems I failed to read the code correctly.
You just implemented a 2-level table.


Thanks,
-Kame




* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v2
  2011-02-23 18:19 ` Dave Hansen
  2011-02-23 23:52   ` KAMEZAWA Hiroyuki
  2011-02-24  9:33   ` Michal Hocko
@ 2011-02-24 13:40   ` Michal Hocko
  2011-02-25  3:25     ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2011-02-24 13:40 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, KAMEZAWA Hiroyuki, linux-kernel

Here is the second version of the patch. I have used alloc_pages_exact
instead of the complex double array approach.

I still fall back to kmalloc/vmalloc because hotplug can happen quite
some time after boot and we can end up not having enough contiguous
pages at that time.

I am also wondering whether it would make sense to introduce an
alloc_pages_exact_node function which would allocate pages from the
given node.

Any thoughts?
---
>From e8909bbd1d759de274a6ed7812530e576ad8bc44 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 24 Feb 2011 11:25:44 +0100
Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM

Currently we are allocating a single page_cgroup array per memory
section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
This is a correct but memory-inefficient solution because the allocated
memory (unless we fall back to vmalloc) is not kmalloc friendly:
        - 32b - 16384 entries (20B per entry) require 327680B so the
          524288B slab cache is used
        - 32b with PAE - 131072 entries requiring 2621440B fit into the
          4194304B cache
        - 64b - 32768 entries (40B per entry) require 1310720B so the
          2097152B cache is used

This is ~37% wasted space per memory section and it sums up over the
whole memory. On an x86_64 machine it is something like 6MB per 1GB of
RAM.

We can reduce the internal fragmentation either by implementing a 2
dimensional array and allocating kmalloc-aligned sizes for each entry (as
suggested in https://lkml.org/lkml/2011/2/23/232) or we can get rid of
kmalloc altogether and allocate directly from the buddy allocator (using
alloc_pages_exact) as suggested by Dave Hansen.

The latter solution is much simpler and the internal fragmentation is
comparable (~1 page per section).

We still need a fallback to kmalloc/vmalloc because we have no
guarantees that we will have contiguous memory of that size (order-10)
later, at memory hotplug time.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/page_cgroup.c |   62 ++++++++++++++++++++++++++++++++++--------------------
 1 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5bffada..eaae7de 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -105,7 +105,41 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return section->page_cgroup + pfn;
 }
 
-/* __alloc_bootmem...() is protected by !slab_available() */
+static void *__init_refok alloc_mcg_table(size_t size, int nid)
+{
+	void *addr = NULL;
+	if ((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
+		return addr;
+
+	if (node_state(nid, N_HIGH_MEMORY)) {
+		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
+		if (!addr)
+			addr = vmalloc_node(size, nid);
+	} else {
+		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!addr)
+			addr = vmalloc(size);
+	}
+
+	return addr;
+}
+
+static void free__mcg_table(void *addr)
+{
+	if (is_vmalloc_addr(addr)) {
+		vfree(addr);
+	} else {
+		struct page *page = virt_to_page(addr);
+		if (!PageReserved(page)) { /* Is bootmem ? */
+			if (!PageSlab(page)) {
+				size_t table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+				free_pages_exact(addr, table_size);
+			} else
+				kfree(addr);
+		}
+	}
+}
+
 static int __init_refok init_section_page_cgroup(unsigned long pfn)
 {
 	struct mem_section *section = __pfn_to_section(pfn);
@@ -114,19 +148,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 	int nid, index;
 
 	if (!section->page_cgroup) {
-		nid = page_to_nid(pfn_to_page(pfn));
 		table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-		VM_BUG_ON(!slab_is_available());
-		if (node_state(nid, N_HIGH_MEMORY)) {
-			base = kmalloc_node(table_size,
-				GFP_KERNEL | __GFP_NOWARN, nid);
-			if (!base)
-				base = vmalloc_node(table_size, nid);
-		} else {
-			base = kmalloc(table_size, GFP_KERNEL | __GFP_NOWARN);
-			if (!base)
-				base = vmalloc(table_size);
-		}
+		nid = page_to_nid(pfn_to_page(pfn));
+		base = alloc_mcg_table(table_size, nid);
 		/*
 		 * The value stored in section->page_cgroup is (base - pfn)
 		 * and it does not point to the memory block allocated above,
@@ -170,16 +194,8 @@ void __free_page_cgroup(unsigned long pfn)
 	if (!ms || !ms->page_cgroup)
 		return;
 	base = ms->page_cgroup + pfn;
-	if (is_vmalloc_addr(base)) {
-		vfree(base);
-		ms->page_cgroup = NULL;
-	} else {
-		struct page *page = virt_to_page(base);
-		if (!PageReserved(page)) { /* Is bootmem ? */
-			kfree(base);
-			ms->page_cgroup = NULL;
-		}
-	}
+	free__mcg_table(base);
+	ms->page_cgroup = NULL;
 }
 
 int __meminit online_page_cgroup(unsigned long start_pfn,
-- 
1.7.2.3

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v2
  2011-02-24 13:40   ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v2 Michal Hocko
@ 2011-02-25  3:25     ` KAMEZAWA Hiroyuki
  2011-02-25  9:53       ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v3 Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-25  3:25 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Dave Hansen, linux-mm, linux-kernel

On Thu, 24 Feb 2011 14:40:45 +0100
Michal Hocko <mhocko@suse.cz> wrote:

> Here is the second version of the patch. I have used alloc_pages_exact
> instead of the complex double array approach.
> 
> I still fallback to kmalloc/vmalloc because hotplug can happen quite
> some time after boot and we can end up not having enough continuous
> pages at that time. 
> 
> I am also thinking whether it would make sense to introduce
> alloc_pages_exact_node function which would allocate pages from the
> given node.
> 
> Any thoughts?

The patch itself is fine but please update the description.

But have some comments, below.

> ---
> From e8909bbd1d759de274a6ed7812530e576ad8bc44 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Thu, 24 Feb 2011 11:25:44 +0100
> Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
> 
> Currently we are allocating a single page_cgroup array per memory
> section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
> This is correct but memory inefficient solution because the allocated
> memory (unless we fall back to vmalloc) is not kmalloc friendly:
>         - 32b - 16384 entries (20B per entry) fit into 327680B so the
>           524288B slab cache is used
>         - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
>         - 64b - 32768 entries (40B per entry) fit into 2097152 cache
> 
> This is ~37% wasted space per memory section and it sums up over the
> whole memory. On an x86_64 machine it is something like 6MB per 1GB of
> RAM.
> 
> We can reduce the internal fragmentation either by implementing a 2
> dimensional array and allocating kmalloc-aligned sizes for each entry (as
> suggested in https://lkml.org/lkml/2011/2/23/232) or we can get rid of
> kmalloc altogether and allocate directly from the buddy allocator (using
> alloc_pages_exact) as suggested by Dave Hansen.
> 
> The latter solution is much simpler and the internal fragmentation is
> comparable (~1 page per section).
> 
> We still need a fallback to kmalloc/vmalloc because we have no
> guarantees that we will have contiguous memory of that size (order-10)
> later, at memory hotplug time.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/page_cgroup.c |   62 ++++++++++++++++++++++++++++++++++--------------------
>  1 files changed, 39 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 5bffada..eaae7de 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -105,7 +105,41 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>  	return section->page_cgroup + pfn;
>  }
>  
> -/* __alloc_bootmem...() is protected by !slab_available() */
> +static void *__init_refok alloc_mcg_table(size_t size, int nid)
> +{
> +	void *addr = NULL;
> +	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
> +		return addr;
> +
> +	if (node_state(nid, N_HIGH_MEMORY)) {
> +		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
> +		if (!addr)
> +			addr = vmalloc_node(size, nid);
> +	} else {
> +		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> +		if (!addr)
> +			addr = vmalloc(size);
> +	}
> +
> +	return addr;
> +}

What is the case where we need to call kmalloc_node() even when alloc_pages_exact() fails?
vmalloc() may need to be called when the size of the chunk is larger than
MAX_ORDER or there is fragmentation...

And I don't like the function name, alloc_mcg_table(), because this is an
allocation for page_cgroup.

How about simply alloc_page_cgroup()?


Thanks,
-Kame



* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v3
  2011-02-25  3:25     ` KAMEZAWA Hiroyuki
@ 2011-02-25  9:53       ` Michal Hocko
  2011-02-28  0:53         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2011-02-25  9:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Hansen, linux-mm, linux-kernel

On Fri 25-02-11 12:25:22, KAMEZAWA Hiroyuki wrote:
> On Thu, 24 Feb 2011 14:40:45 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Here is the second version of the patch. I have used alloc_pages_exact
> > instead of the complex double array approach.
> > 
> > I still fallback to kmalloc/vmalloc because hotplug can happen quite
> > some time after boot and we can end up not having enough continuous
> > pages at that time. 
> > 
> > I am also thinking whether it would make sense to introduce
> > alloc_pages_exact_node function which would allocate pages from the
> > given node.
> > 
> > Any thoughts?
> 
> The patch itself is fine but please update the description.

I have updated the description but kept those parts which describe how
the memory is wasted for different configurations. Do you have any tips
on how it could be improved?

> 
> But have some comments, below.
[...]
> > -/* __alloc_bootmem...() is protected by !slab_available() */
> > +static void *__init_refok alloc_mcg_table(size_t size, int nid)
> > +{
> > +	void *addr = NULL;
> > +	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
> > +		return addr;
> > +
> > +	if (node_state(nid, N_HIGH_MEMORY)) {
> > +		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
> > +		if (!addr)
> > +			addr = vmalloc_node(size, nid);
> > +	} else {
> > +		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > +		if (!addr)
> > +			addr = vmalloc(size);
> > +	}
> > +
> > +	return addr;
> > +}
> 
> What is the case we need to call kmalloc_node() even when alloc_pages_exact() fails ?
> vmalloc() may need to be called when the size of chunk is larger than
> MAX_ORDER or there is fragmentation.....

I kept the original kmalloc with a fallback to vmalloc because vmalloc is
a scarcer resource (especially on i386 where we can have memory
hotplug configured as well).

> 
> And the function name, alloc_mcg_table(), I don't like it because this is an
> allocation for page_cgroup.
> 
> How about alloc_page_cgroup() simply ?

OK, I have no preferences for the name. alloc_page_cgroup sounds good as
well.

I have also added VM_BUG_ON(!slab_is_available()) back to the allocation
path.

Thanks for the review. The updated patch is below:

Changes since v2
- rename alloc_mcg_table to alloc_page_cgroup
- rename free_mcg_table to free_page_cgroup
- get VM_BUG_ON(!slab_is_available()) back into the allocation path
--- 
>From 9b34baf57f410a628bf45f2b35b23fdf38feae79 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 24 Feb 2011 11:25:44 +0100
Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM

Currently we are allocating a single page_cgroup array per memory
section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
This is correct but a memory-inefficient solution because the allocated
memory (unless we fall back to vmalloc) is not kmalloc friendly:
        - 32b - 16384 entries (20B per entry) fit into 327680B so the
          524288B slab cache is used
        - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
        - 64b - 32768 entries (40B per entry) fit into 2097152 cache

This is ~37% wasted space per memory section, and it sums up over the
whole memory. On an x86_64 machine it is something like 6MB per 1GB of
RAM.

We can reduce the internal fragmentation either by implementing a
2-dimensional array and allocating kmalloc-aligned sizes for each entry
(as suggested in https://lkml.org/lkml/2011/2/23/232) or by allocating
directly from the buddy allocator (using alloc_pages_exact) as suggested
by Dave Hansen.

The latter solution is much simpler and the internal fragmentation is
comparable (~1 page per section). The initialization is done during the
boot (unless we are doing memory hotplug) so we shouldn't have any
issues from memory fragmentation and alloc_pages_exact should succeed.

We still need a fallback to kmalloc/vmalloc because we have no
guarantee that we will have contiguous memory of that size (order-10)
later on during the hotplug events.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_cgroup.c |   62 ++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5bffada..ae322dc 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -106,6 +106,42 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 }
 
 /* __alloc_bootmem...() is protected by !slab_available() */
+static void *__init_refok alloc_page_cgroup(size_t size, int nid)
+{
+	void *addr = NULL;
+	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
+		return addr;
+
+	VM_BUG_ON(!slab_is_available());
+	if (node_state(nid, N_HIGH_MEMORY)) {
+		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
+		if (!addr)
+			addr = vmalloc_node(size, nid);
+	} else {
+		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!addr)
+			addr = vmalloc(size);
+	}
+
+	return addr;
+}
+
+static void *free_page_cgroup(void *addr)
+{
+	if (is_vmalloc_addr(addr)) {
+		vfree(addr);
+	} else {
+		struct page *page = virt_to_page(addr);
+		if (!PageReserved(page)) { /* Is bootmem ? */
+			if (!PageSlab(page)) {
+				size_t table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+				free_pages_exact(addr, table_size);
+			} else
+				kfree(addr);
+		}
+	}
+}
+
 static int __init_refok init_section_page_cgroup(unsigned long pfn)
 {
 	struct mem_section *section = __pfn_to_section(pfn);
@@ -114,19 +150,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 	int nid, index;
 
 	if (!section->page_cgroup) {
-		nid = page_to_nid(pfn_to_page(pfn));
 		table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-		VM_BUG_ON(!slab_is_available());
-		if (node_state(nid, N_HIGH_MEMORY)) {
-			base = kmalloc_node(table_size,
-				GFP_KERNEL | __GFP_NOWARN, nid);
-			if (!base)
-				base = vmalloc_node(table_size, nid);
-		} else {
-			base = kmalloc(table_size, GFP_KERNEL | __GFP_NOWARN);
-			if (!base)
-				base = vmalloc(table_size);
-		}
+		nid = page_to_nid(pfn_to_page(pfn));
+		base = alloc_page_cgroup(table_size, nid);
 		/*
 		 * The value stored in section->page_cgroup is (base - pfn)
 		 * and it does not point to the memory block allocated above,
@@ -170,16 +196,8 @@ void __free_page_cgroup(unsigned long pfn)
 	if (!ms || !ms->page_cgroup)
 		return;
 	base = ms->page_cgroup + pfn;
-	if (is_vmalloc_addr(base)) {
-		vfree(base);
-		ms->page_cgroup = NULL;
-	} else {
-		struct page *page = virt_to_page(base);
-		if (!PageReserved(page)) { /* Is bootmem ? */
-			kfree(base);
-			ms->page_cgroup = NULL;
-		}
-	}
+	free_page_cgroup(base);
+	ms->page_cgroup = NULL;
 }
 
 int __meminit online_page_cgroup(unsigned long start_pfn,
-- 
1.7.2.3

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v3
  2011-02-25  9:53       ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v3 Michal Hocko
@ 2011-02-28  0:53         ` KAMEZAWA Hiroyuki
  2011-02-28  9:12           ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4 Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-28  0:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Dave Hansen, linux-mm, linux-kernel

On Fri, 25 Feb 2011 10:53:57 +0100
Michal Hocko <mhocko@suse.cz> wrote:

> On Fri 25-02-11 12:25:22, KAMEZAWA Hiroyuki wrote:
> > On Thu, 24 Feb 2011 14:40:45 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > Here is the second version of the patch. I have used alloc_pages_exact
> > > instead of the complex double array approach.
> > > 
> > > I still fallback to kmalloc/vmalloc because hotplug can happen quite
> > > some time after boot and we can end up not having enough continuous
> > > pages at that time. 
> > > 
> > > I am also thinking whether it would make sense to introduce
> > > alloc_pages_exact_node function which would allocate pages from the
> > > given node.
> > > 
> > > Any thoughts?
> > 
> > The patch itself is fine but please update the description.
> 
> I have updated the description but kept those parts which describe how
> the memory is wasted for different configurations. Do you have any tips
> how it can be improved?
> 

This part was in your description.
==
> We can reduce the internal fragmentation either by implementing a
> 2-dimensional array and allocating kmalloc-aligned sizes for each entry (as
suggested in https://lkml.org/lkml/2011/2/23/232) or we can get rid of
kmalloc altogether and allocate directly from the buddy allocator (use
alloc_pages_exact) as suggested by Dave Hansen.
==

please remove the 2-dimensional part, etc. That's just history.




> > 
> > But have some comments, below.
> [...]
> > > -/* __alloc_bootmem...() is protected by !slab_available() */
> > > +static void *__init_refok alloc_mcg_table(size_t size, int nid)
> > > +{
> > > +	void *addr = NULL;
> > > +	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
> > > +		return addr;
> > > +
> > > +	if (node_state(nid, N_HIGH_MEMORY)) {
> > > +		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
> > > +		if (!addr)
> > > +			addr = vmalloc_node(size, nid);
> > > +	} else {
> > > +		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > > +		if (!addr)
> > > +			addr = vmalloc(size);
> > > +	}
> > > +
> > > +	return addr;
> > > +}
> > 
> > What is the case we need to call kmalloc_node() even when alloc_pages_exact() fails ?
> > vmalloc() may need to be called when the size of chunk is larger than
> > MAX_ORDER or there is fragmentation.....
> 
> I kept the original kmalloc with fallback to vmalloc because vmalloc is
> more scarce resource (especially on i386 where we can have memory
> hotplug configured as well).
> 

My point is, if alloc_pages_exact() fails because of the order of the
allocation, kmalloc() will always fail as well. Please remove kmalloc().

> > 
> > And the function name, alloc_mcg_table(), I don't like it because this is an
> > allocation for page_cgroup.
> > 
> > How about alloc_page_cgroup() simply ?
> 
> OK, I have no preferences for the name. alloc_page_cgroup sounds good as
> well.
> 
> I have also added VM_BUG_ON(!slab_is_available()) back to the allocation
> path.
> 
> Thanks for the review. The updated patch is bellow:
> 

kmalloc() is unnecessary, again.


> Changes since v2
> - rename alloc_mcg_table to alloc_page_cgroup
> - free__mcg_table renamed to free_page_cgroup
> - get VM_BUG_ON(!slab_is_available()) back into the allocation path
> --- 
> From 9b34baf57f410a628bf45f2b35b23fdf38feae79 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Thu, 24 Feb 2011 11:25:44 +0100
> Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
> 
> Currently we are allocating a single page_cgroup array per memory
> section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
> This is correct but a memory-inefficient solution because the allocated
> memory (unless we fall back to vmalloc) is not kmalloc friendly:
>         - 32b - 16384 entries (20B per entry) fit into 327680B so the
>           524288B slab cache is used
>         - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
>         - 64b - 32768 entries (40B per entry) fit into 2097152 cache
> 
> This is ~37% wasted space per memory section, and it sums up over the
> whole memory. On an x86_64 machine it is something like 6MB per 1GB of
> RAM.
> 
> We can reduce the internal fragmentation either by implementing a
> 2-dimensional array and allocating kmalloc-aligned sizes for each entry
> (as suggested in https://lkml.org/lkml/2011/2/23/232) or by allocating
> directly from the buddy allocator (using alloc_pages_exact) as suggested
> by Dave Hansen.
> 
> The latter solution is much simpler and the internal fragmentation is
> comparable (~1 page per section). The initialization is done during the
> boot (unless we are doing memory hotplug) so we shouldn't have any
> issues from memory fragmentation and alloc_pages_exact should succeed.
> 
> We still need a fallback to kmalloc/vmalloc because we have no
> guarantee that we will have contiguous memory of that size (order-10)
> later on during the hotplug events.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> CC: Dave Hansen <dave@linux.vnet.ibm.com>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_cgroup.c |   62 ++++++++++++++++++++++++++++++++++-------------------
>  1 files changed, 40 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 5bffada..ae322dc 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -106,6 +106,42 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>  }
>  
>  /* __alloc_bootmem...() is protected by !slab_available() */
> +static void *__init_refok alloc_page_cgroup(size_t size, int nid)
> +{
> +	void *addr = NULL;
> +	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
> +		return addr;
> +
> +	VM_BUG_ON(!slab_is_available());
> +	if (node_state(nid, N_HIGH_MEMORY)) {
> +		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
> +		if (!addr)
> +			addr = vmalloc_node(size, nid);
> +	} else {
> +		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> +		if (!addr)
> +			addr = vmalloc(size);
> +	}
> +
> +	return addr;
> +}
> +
> +static void *free_page_cgroup(void *addr)
> +{
> +	if (is_vmalloc_addr(addr)) {
> +		vfree(addr);
> +	} else {
> +		struct page *page = virt_to_page(addr);
> +		if (!PageReserved(page)) { /* Is bootmem ? */
> +			if (!PageSlab(page)) {
> +				size_t table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> +				free_pages_exact(addr, table_size);
> +			} else
> +				kfree(addr);
> +		}
> +	}
> +}
> +
>  static int __init_refok init_section_page_cgroup(unsigned long pfn)
>  {
>  	struct mem_section *section = __pfn_to_section(pfn);
> @@ -114,19 +150,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
>  	int nid, index;
>  
>  	if (!section->page_cgroup) {
> -		nid = page_to_nid(pfn_to_page(pfn));
>  		table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> -		VM_BUG_ON(!slab_is_available());
> -		if (node_state(nid, N_HIGH_MEMORY)) {
> -			base = kmalloc_node(table_size,
> -				GFP_KERNEL | __GFP_NOWARN, nid);
> -			if (!base)
> -				base = vmalloc_node(table_size, nid);
> -		} else {
> -			base = kmalloc(table_size, GFP_KERNEL | __GFP_NOWARN);
> -			if (!base)
> -				base = vmalloc(table_size);
> -		}
> +		nid = page_to_nid(pfn_to_page(pfn));
> +		base = alloc_page_cgroup(table_size, nid);
>  		/*
>  		 * The value stored in section->page_cgroup is (base - pfn)
>  		 * and it does not point to the memory block allocated above,
> @@ -170,16 +196,8 @@ void __free_page_cgroup(unsigned long pfn)
>  	if (!ms || !ms->page_cgroup)
>  		return;
>  	base = ms->page_cgroup + pfn;
> -	if (is_vmalloc_addr(base)) {
> -		vfree(base);
> -		ms->page_cgroup = NULL;
> -	} else {
> -		struct page *page = virt_to_page(base);
> -		if (!PageReserved(page)) { /* Is bootmem ? */
> -			kfree(base);
> -			ms->page_cgroup = NULL;
> -		}
> -	}
> +	free_page_cgroup(base);
> +	ms->page_cgroup = NULL;
>  }
>  
>  int __meminit online_page_cgroup(unsigned long start_pfn,
> -- 
> 1.7.2.3
> 
> -- 
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9    
> Czech Republic
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4
  2011-02-28  0:53         ` KAMEZAWA Hiroyuki
@ 2011-02-28  9:12           ` Michal Hocko
  2011-02-28  9:23             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2011-02-28  9:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Hansen, linux-mm, linux-kernel

On Mon 28-02-11 09:53:47, KAMEZAWA Hiroyuki wrote:
> On Fri, 25 Feb 2011 10:53:57 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Fri 25-02-11 12:25:22, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 24 Feb 2011 14:40:45 +0100
[...]
> > > The patch itself is fine but please update the description.
> > 
> > I have updated the description but kept those parts which describe how
> > the memory is wasted for different configurations. Do you have any tips
> > how it can be improved?
> > 
> 
> This part was in your description.
> ==
> We can reduce the internal fragmentation either by imeplementing 2
> dimensional array and allocate kmalloc aligned sizes for each entry (as
> suggested in https://lkml.org/lkml/2011/2/23/232) or we can get rid of
> kmalloc altogether and allocate directly from the buddy allocator (use
> alloc_pages_exact) as suggested by Dave Hansen.
> ==
> 
> please remove 2 dimentional..... etc. That's just a history.

I just wanted to mention both approaches. OK, I can remove that, of
course.

> > > 
> > > But have some comments, below.
> > [...]
> > > > -/* __alloc_bootmem...() is protected by !slab_available() */
> > > > +static void *__init_refok alloc_mcg_table(size_t size, int nid)
> > > > +{
> > > > +	void *addr = NULL;
> > > > +	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
> > > > +		return addr;
> > > > +
> > > > +	if (node_state(nid, N_HIGH_MEMORY)) {
> > > > +		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
> > > > +		if (!addr)
> > > > +			addr = vmalloc_node(size, nid);
> > > > +	} else {
> > > > +		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > > > +		if (!addr)
> > > > +			addr = vmalloc(size);
> > > > +	}
> > > > +
> > > > +	return addr;
> > > > +}
> > > 
> > > What is the case we need to call kmalloc_node() even when alloc_pages_exact() fails ?
> > > vmalloc() may need to be called when the size of chunk is larger than
> > > MAX_ORDER or there is fragmentation.....
> > 
> > I kept the original kmalloc with fallback to vmalloc because vmalloc is
> > more scarce resource (especially on i386 where we can have memory
> > hotplug configured as well).
> > 
> 
> My point is, if alloc_pages_exact() failes because of order of the page,
> kmalloc() will always fail. 

You are right. I thought that kmalloc could make a difference due to
reclaim, but reclaim is already triggered by alloc_pages_exact, and if
that doesn't succeed there is little chance those pages will be ready
for kmalloc.

> Please remove kmalloc().

OK.

Thanks for the review again. The updated patch is below:

Changes since v3
- updated the changelog to no longer mention the 2-dimensional solution
- got rid of the kmalloc fallback based on Kame's suggestion
- free_page_cgroup accidentally returned void * (we do not need any
  return value there)

Changes since v2
- rename alloc_mcg_table to alloc_page_cgroup
- rename free_mcg_table to free_page_cgroup
- get VM_BUG_ON(!slab_is_available()) back into the allocation path

---
>From 84a9555741b59cb2a0a67b023e4bd0f92c670ca1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 24 Feb 2011 11:25:44 +0100
Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM

Currently we are allocating a single page_cgroup array per memory
section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
This is correct but a memory-inefficient solution because the allocated
memory (unless we fall back to vmalloc) is not kmalloc friendly:
        - 32b - 16384 entries (20B per entry) fit into 327680B so the
          524288B slab cache is used
        - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
        - 64b - 32768 entries (40B per entry) fit into 2097152 cache

This is ~37% wasted space per memory section, and it sums up over the
whole memory. On an x86_64 machine it is something like 6MB per 1GB of
RAM.

We can reduce the internal fragmentation by using alloc_pages_exact,
which allocates PAGE_SIZE-aligned blocks, so we get down to <4kB of
wasted memory per section, which is much better.

We still need a fallback to vmalloc because we have no guarantee that
we will have contiguous memory of that size (order-10) later on during
the hotplug events.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_cgroup.c |   54 +++++++++++++++++++++++++++++++-----------------------
 1 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5bffada..eae3cd2 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -105,7 +105,33 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return section->page_cgroup + pfn;
 }
 
-/* __alloc_bootmem...() is protected by !slab_available() */
+static void *__init_refok alloc_page_cgroup(size_t size, int nid)
+{
+	void *addr = NULL;
+	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
+		return addr;
+
+	if (node_state(nid, N_HIGH_MEMORY))
+		addr = vmalloc_node(size, nid);
+	else
+		addr = vmalloc(size);
+
+	return addr;
+}
+
+static void free_page_cgroup(void *addr)
+{
+	if (is_vmalloc_addr(addr)) {
+		vfree(addr);
+	} else {
+		struct page *page = virt_to_page(addr);
+		if (!PageReserved(page)) { /* Is bootmem ? */
+			size_t table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+			free_pages_exact(addr, table_size);
+		}
+	}
+}
+
 static int __init_refok init_section_page_cgroup(unsigned long pfn)
 {
 	struct mem_section *section = __pfn_to_section(pfn);
@@ -114,19 +140,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 	int nid, index;
 
 	if (!section->page_cgroup) {
-		nid = page_to_nid(pfn_to_page(pfn));
 		table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-		VM_BUG_ON(!slab_is_available());
-		if (node_state(nid, N_HIGH_MEMORY)) {
-			base = kmalloc_node(table_size,
-				GFP_KERNEL | __GFP_NOWARN, nid);
-			if (!base)
-				base = vmalloc_node(table_size, nid);
-		} else {
-			base = kmalloc(table_size, GFP_KERNEL | __GFP_NOWARN);
-			if (!base)
-				base = vmalloc(table_size);
-		}
+		nid = page_to_nid(pfn_to_page(pfn));
+		base = alloc_page_cgroup(table_size, nid);
 		/*
 		 * The value stored in section->page_cgroup is (base - pfn)
 		 * and it does not point to the memory block allocated above,
@@ -170,16 +186,8 @@ void __free_page_cgroup(unsigned long pfn)
 	if (!ms || !ms->page_cgroup)
 		return;
 	base = ms->page_cgroup + pfn;
-	if (is_vmalloc_addr(base)) {
-		vfree(base);
-		ms->page_cgroup = NULL;
-	} else {
-		struct page *page = virt_to_page(base);
-		if (!PageReserved(page)) { /* Is bootmem ? */
-			kfree(base);
-			ms->page_cgroup = NULL;
-		}
-	}
+	free_page_cgroup(base);
+	ms->page_cgroup = NULL;
 }
 
 int __meminit online_page_cgroup(unsigned long start_pfn,
-- 
1.7.2.3


-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4
  2011-02-28  9:12           ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4 Michal Hocko
@ 2011-02-28  9:23             ` KAMEZAWA Hiroyuki
  2011-02-28  9:53               ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-28  9:23 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Dave Hansen, linux-mm, linux-kernel

On Mon, 28 Feb 2011 10:12:56 +0100
Michal Hocko <mhocko@suse.cz> wrote:

> On Mon 28-02-11 09:53:47, KAMEZAWA Hiroyuki wrote:
> > On Fri, 25 Feb 2011 10:53:57 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Fri 25-02-11 12:25:22, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 24 Feb 2011 14:40:45 +0100
> [...]
> > > > The patch itself is fine but please update the description.
> > > 
> > > I have updated the description but kept those parts which describe how
> > > the memory is wasted for different configurations. Do you have any tips
> > > how it can be improved?
> > > 
> > 
> > This part was in your description.
> > ==
> > We can reduce the internal fragmentation either by imeplementing 2
> > dimensional array and allocate kmalloc aligned sizes for each entry (as
> > suggested in https://lkml.org/lkml/2011/2/23/232) or we can get rid of
> > kmalloc altogether and allocate directly from the buddy allocator (use
> > alloc_pages_exact) as suggested by Dave Hansen.
> > ==
> > 
> > please remove 2 dimentional..... etc. That's just a history.
> 
> I just wanted to mention both approaches. OK, I can remove that, of
> course.
> 
> > > > 
> > > > But have some comments, below.
> > > [...]
> > > > > -/* __alloc_bootmem...() is protected by !slab_available() */
> > > > > +static void *__init_refok alloc_mcg_table(size_t size, int nid)
> > > > > +{
> > > > > +	void *addr = NULL;
> > > > > +	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
> > > > > +		return addr;
> > > > > +
> > > > > +	if (node_state(nid, N_HIGH_MEMORY)) {
> > > > > +		addr = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);
> > > > > +		if (!addr)
> > > > > +			addr = vmalloc_node(size, nid);
> > > > > +	} else {
> > > > > +		addr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > > > > +		if (!addr)
> > > > > +			addr = vmalloc(size);
> > > > > +	}
> > > > > +
> > > > > +	return addr;
> > > > > +}
> > > > 
> > > > What is the case we need to call kmalloc_node() even when alloc_pages_exact() fails ?
> > > > vmalloc() may need to be called when the size of chunk is larger than
> > > > MAX_ORDER or there is fragmentation.....
> > > 
> > > I kept the original kmalloc with fallback to vmalloc because vmalloc is
> > > more scarce resource (especially on i386 where we can have memory
> > > hotplug configured as well).
> > > 
> > 
> > My point is, if alloc_pages_exact() failes because of order of the page,
> > kmalloc() will always fail. 
> 
> You are right. I thought that kmalloc can make a difference due to reclaim
> but the reclaim is already triggered by alloc_pages_exact and if it doesn't
> succeed there are not big chances to have those pages ready for kmalloc.
> 
> > Please remove kmalloc().
> 
> OK.
> 
> Thanks for the review again and the updated patch is bellow:
> 
> Change since v3
> - updated changelog - to not mentioned 2dim. solution
> - get rid of kmalloc fallback based on Kame's suggestion.
> - free_page_cgroup accidentally returned void* (we do not need any return value
>   there)
> 
> Changes since v2
> - rename alloc_mcg_table to alloc_page_cgroup
> - free__mcg_table renamed to free_page_cgroup
> - get VM_BUG_ON(!slab_is_available()) back into the allocation path
> 
> ---
> From 84a9555741b59cb2a0a67b023e4bd0f92c670ca1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Thu, 24 Feb 2011 11:25:44 +0100
> Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
> 
> Currently we are allocating a single page_cgroup array per memory
> section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
> This is correct but a memory-inefficient solution because the allocated
> memory (unless we fall back to vmalloc) is not kmalloc friendly:
>         - 32b - 16384 entries (20B per entry) fit into 327680B so the
>           524288B slab cache is used
>         - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
>         - 64b - 32768 entries (40B per entry) fit into 2097152 cache
> 
> This is ~37% wasted space per memory section, and it sums up over the
> whole memory. On an x86_64 machine it is something like 6MB per 1GB of
> RAM.
> 
> We can reduce the internal fragmentation by using alloc_pages_exact,
> which allocates PAGE_SIZE-aligned blocks, so we get down to <4kB of
> wasted memory per section, which is much better.
> 
> We still need a fallback to vmalloc because we have no guarantee that
> we will have contiguous memory of that size (order-10) later on during
> the hotplug events.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> CC: Dave Hansen <dave@linux.vnet.ibm.com>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

But... a nitpick; it may be my fault.



> ---
>  mm/page_cgroup.c |   54 +++++++++++++++++++++++++++++++-----------------------
>  1 files changed, 31 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 5bffada..eae3cd2 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -105,7 +105,33 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>  	return section->page_cgroup + pfn;
>  }
>  
> -/* __alloc_bootmem...() is protected by !slab_available() */
> +static void *__init_refok alloc_page_cgroup(size_t size, int nid)
> +{
> +	void *addr = NULL;
> +	if((addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN)))
> +		return addr;
> +
> +	if (node_state(nid, N_HIGH_MEMORY))
> +		addr = vmalloc_node(size, nid);
> +	else
> +		addr = vmalloc(size);
> +
> +	return addr;
> +}
> +
> +static void free_page_cgroup(void *addr)
> +{
> +	if (is_vmalloc_addr(addr)) {
> +		vfree(addr);
> +	} else {
> +		struct page *page = virt_to_page(addr);
> +		if (!PageReserved(page)) { /* Is bootmem ? */

I think we never see PageReserved if we just use alloc_pages_exact()/vmalloc().
Maybe my old patch was not thorough enough and leftovers like this remain
in the original code.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4
  2011-02-28  9:53               ` Michal Hocko
@ 2011-02-28  9:48                 ` KAMEZAWA Hiroyuki
  2011-02-28 10:12                   ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-28  9:48 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Dave Hansen, linux-mm, linux-kernel

On Mon, 28 Feb 2011 10:53:16 +0100
Michal Hocko <mhocko@suse.cz> wrote:

> On Mon 28-02-11 18:23:22, KAMEZAWA Hiroyuki wrote:
> [...]
> > > From 84a9555741b59cb2a0a67b023e4bd0f92c670ca1 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.cz>
> > > Date: Thu, 24 Feb 2011 11:25:44 +0100
> > > Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
> > > 
> > > Currently we are allocating a single page_cgroup array per memory
> > > section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
> > > This is correct but memory inefficient solution because the allocated
> > > memory (unless we fall back to vmalloc) is not kmalloc friendly:
> > >         - 32b - 16384 entries (20B per entry) fit into 327680B so the
> > >           524288B slab cache is used
> > >         - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
> > >         - 64b - 32768 entries (40B per entry) fit into 2097152 cache
> > > 
> > > This is ~37% wasted space per memory section and it sumps up for the
> > > whole memory. On a x86_64 machine it is something like 6MB per 1GB of
> > > RAM.
> > > 
> > > We can reduce the internal fragmentation by using alloc_pages_exact
> > > which allocates PAGE_SIZE aligned blocks so we will get down to <4kB
> > > wasted memory per section which is much better.
> > > 
> > > We still need a fallback to vmalloc because we have no guarantees that
> > > we will have a continuous memory of that size (order-10) later on during
> > > the hotplug events.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > > CC: Dave Hansen <dave@linux.vnet.ibm.com>
> > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Thanks. I will repost it with Andrew in the CC.
> 
> > 
> > But...nitpick, it may be from my fault..
> [...]
> > > +static void free_page_cgroup(void *addr)
> > > +{
> > > +	if (is_vmalloc_addr(addr)) {
> > > +		vfree(addr);
> > > +	} else {
> > > +		struct page *page = virt_to_page(addr);
> > > +		if (!PageReserved(page)) { /* Is bootmem ? */
> > 
> > I think we never see PageReserved if we just use alloc_pages_exact()/vmalloc().
> 
> I have checked that and we really do not (unless I am missing some
> subtle side effects). Anyway, I think we should still at least BUG_ON
> that.
> 
> > Maybe my old patch was not enough and this kind of junks are remaining in
> > the original code.
> 
> Should I incorporate it into the patch? I think that a separate one
> would be better for readability.
> 
> ---
> From e7a897a42b526620eb4afada2d036e1c9ff9e62a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 28 Feb 2011 10:43:12 +0100
> Subject: [PATCH] page_cgroup array is never stored on reserved pages
> 
> KAMEZAWA Hiroyuki noted that free_page_cgroup doesn't have to check for
> PageReserved because we never store the array on reserved pages
> (neither alloc_pages_exact nor vmalloc uses those pages).
> 
> So we can replace the check with a BUG_ON.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thank you.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4
  2011-02-28  9:23             ` KAMEZAWA Hiroyuki
@ 2011-02-28  9:53               ` Michal Hocko
  2011-02-28  9:48                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2011-02-28  9:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Hansen, linux-mm, linux-kernel

On Mon 28-02-11 18:23:22, KAMEZAWA Hiroyuki wrote:
[...]
> > From 84a9555741b59cb2a0a67b023e4bd0f92c670ca1 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Thu, 24 Feb 2011 11:25:44 +0100
> > Subject: [PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM
> > 
> > Currently we are allocating a single page_cgroup array per memory
> > section (stored in mem_section->base) when CONFIG_SPARSEMEM is selected.
> > This is a correct but memory-inefficient solution because the allocated
> > memory (unless we fall back to vmalloc) is not kmalloc friendly:
> >         - 32b - 16384 entries (20B per entry) need 327680B so the
> >           524288B slab cache is used
> >         - 32b with PAE - 131072 entries need 2621440B so the 4194304B
> >           slab cache is used
> >         - 64b - 32768 entries (40B per entry) need 1310720B so the
> >           2097152B slab cache is used
> > 
> > This is ~37% wasted space per memory section and it sums up over the
> > whole memory. On an x86_64 machine it is something like 6MB per 1GB of
> > RAM.
> > 
> > We can reduce the internal fragmentation by using alloc_pages_exact,
> > which allocates PAGE_SIZE-aligned blocks, so we get down to <4kB of
> > wasted memory per section, which is much better.
> > 
> > We still need a fallback to vmalloc because we have no guarantees that
> > we will have contiguous memory of that size (order-10) later on during
> > the hotplug events.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > CC: Dave Hansen <dave@linux.vnet.ibm.com>
> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks. I will repost it with Andrew in the CC.

> 
> But... a nitpick; it may be my fault..
[...]
> > +static void free_page_cgroup(void *addr)
> > +{
> > +	if (is_vmalloc_addr(addr)) {
> > +		vfree(addr);
> > +	} else {
> > +		struct page *page = virt_to_page(addr);
> > +		if (!PageReserved(page)) { /* Is bootmem ? */
> 
> I think we never see PageReserved if we just use alloc_pages_exact()/vmalloc().

I have checked that and we really do not (unless I am missing some
subtle side effects). Anyway, I think we should still at least BUG_ON
in that case.

> Maybe my old patch was not enough and this kind of junk is still
> remaining in the original code.

Should I incorporate it into the patch? I think a separate one
would be better for readability.

---
>From e7a897a42b526620eb4afada2d036e1c9ff9e62a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 28 Feb 2011 10:43:12 +0100
Subject: [PATCH] page_cgroup array is never stored on reserved pages

KAMEZAWA Hiroyuki noted that free_page_cgroup doesn't have to check for
PageReserved because we never store the array on reserved pages
(neither alloc_pages_exact nor vmalloc uses those pages).

So we can replace the check with a BUG_ON.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_cgroup.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index eae3cd2..dd5f789 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -125,10 +125,10 @@ static void free_page_cgroup(void *addr)
 		vfree(addr);
 	} else {
 		struct page *page = virt_to_page(addr);
-		if (!PageReserved(page)) { /* Is bootmem ? */
-			size_t table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-			free_pages_exact(addr, table_size);
-		}
+		size_t table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+
+		BUG_ON(PageReserved(page));
+		free_pages_exact(addr, table_size);
 	}
 }
 
-- 
1.7.2.3

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4
  2011-02-28  9:48                 ` KAMEZAWA Hiroyuki
@ 2011-02-28 10:12                   ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2011-02-28 10:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Hansen, linux-mm, linux-kernel

On Mon 28-02-11 18:48:21, KAMEZAWA Hiroyuki wrote:
> On Mon, 28 Feb 2011 10:53:16 +0100
> > From e7a897a42b526620eb4afada2d036e1c9ff9e62a Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Mon, 28 Feb 2011 10:43:12 +0100
> > Subject: [PATCH] page_cgroup array is never stored on reserved pages
> > 
> > KAMEZAWA Hiroyuki noted that free_page_cgroup doesn't have to check for
> > PageReserved because we never store the array on reserved pages
> > (neither alloc_pages_exact nor vmalloc uses those pages).
> > 
> > So we can replace the check with a BUG_ON.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Thank you.
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks!

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-02-28 10:12 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-23 15:10 [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM Michal Hocko
2011-02-23 18:19 ` Dave Hansen
2011-02-23 23:52   ` KAMEZAWA Hiroyuki
2011-02-24  9:35     ` Michal Hocko
2011-02-24 10:02       ` KAMEZAWA Hiroyuki
2011-02-24  9:33   ` Michal Hocko
2011-02-24 13:40   ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v2 Michal Hocko
2011-02-25  3:25     ` KAMEZAWA Hiroyuki
2011-02-25  9:53       ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v3 Michal Hocko
2011-02-28  0:53         ` KAMEZAWA Hiroyuki
2011-02-28  9:12           ` [RFC PATCH] page_cgroup: Reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM v4 Michal Hocko
2011-02-28  9:23             ` KAMEZAWA Hiroyuki
2011-02-28  9:53               ` Michal Hocko
2011-02-28  9:48                 ` KAMEZAWA Hiroyuki
2011-02-28 10:12                   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox