* [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
@ 2009-12-14 22:03 Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion Christoph Lameter
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Leftovers from the earlier patchset. Mostly applications of per cpu counters
to core components.

After this patchset there will be only one user of local_t left: Mathieu's
trace ringbuffer. Does it really need these ops?
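
For context, a minimal sketch of the kind of conversion the series performs
(names are made up, not code from the patches): a dynamically allocated per
cpu counter updated with this_cpu operations instead of a local_t reached
through get_cpu()/put_cpu().

	struct widget {
		int *events;			/* from alloc_percpu(int) */
	};

	static void widget_count(struct widget *w)
	{
		preempt_disable();
		/* was: local_inc() on this cpu's local_t */
		__this_cpu_inc(*w->events);
		preempt_enable();
	}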

V6->V7
- Drop patches merged in 2.6.33 merge cycle
- Drop risky slub patches

V5->V6:
- Drop patches merged by Tejun.
- Drop irqless slub fastpath for now.
- Patches against Tejun percpu for-next branch.

V4->V5:
- Avoid setup_per_cpu_area() modifications and fold the remainder of the
  patch into the page allocator patch.
- Irq disable / per cpu ptr fixes for page allocator patch.

V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
  interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
	 git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
  for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
  can fallback to the use of preempt or interrupt disable. 64 bit
  platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today




* [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-15  3:53   ` Tejun Heo
  2009-12-14 22:03 ` [this_cpu_xx V7 2/8] this_cpu ops: Remove pageset_notifier Christoph Lameter
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_page_allocator --]
[-- Type: text/plain, Size: 14178 bytes --]

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
numbers of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
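
To illustrate the resulting access pattern, here is a hedged sketch (the
function name is made up; the body mirrors the drain_pages() hunk below):

	static void sketch_drain_local_pageset(struct zone *zone)
	{
		unsigned long flags;
		struct per_cpu_pages *pcp;

		/*
		 * zone->pageset is now a single percpu allocation, so the
		 * running cpu's pageset is reached with this_cpu_ptr() while
		 * interrupts are off. No get_cpu()/put_cpu() and no NR_CPUS
		 * sized array in struct zone.
		 */
		local_irq_save(flags);
		pcp = &this_cpu_ptr(zone->pageset)->pcp;
		free_pcppages_bulk(zone, pcp->count, pcp);
		pcp->count = 0;
		local_irq_restore(flags);
	}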

V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.

Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  187 ++++++++++++++++++-------------------------------
 mm/vmstat.c            |   14 ++-
 4 files changed, 81 insertions(+), 136 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-12-14 15:13:43.000000000 -0600
+++ linux-2.6/include/linux/mm.h	2009-12-14 15:20:45.000000000 -0600
@@ -1061,11 +1061,7 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 extern void zone_pcp_update(struct zone *zone);
 
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2009-12-14 15:13:43.000000000 -0600
+++ linux-2.6/include/linux/mmzone.h	2009-12-14 15:20:45.000000000 -0600
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -306,10 +300,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-12-14 15:13:43.000000000 -0600
+++ linux-2.6/mm/page_alloc.c	2009-12-14 15:21:17.000000000 -0600
@@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		local_irq_save(flags);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
-		local_irq_save(flags);
 		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
@@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
@@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
 
 out:
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
-		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		list = &pcp->lists[migratetype];
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
@@ -1234,7 +1231,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1243,7 +1239,6 @@ again:
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -2172,7 +2167,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = per_cpu_ptr(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -2734,10 +2729,29 @@ static void build_zonelist_cache(pg_data
 
 #endif	/* CONFIG_NUMA */
 
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
+
 /* return values int ....just for stop_machine() */
 static int __build_all_zonelists(void *dummy)
 {
 	int nid;
+	int cpu;
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -2748,6 +2762,14 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 	}
+
+	/*
+	 * Initialize the boot_pagesets that are going to be used
+	 * for bootstrapping processors.
+	 */
+	for_each_possible_cpu(cpu)
+		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+
 	return 0;
 }
 
@@ -3086,120 +3108,60 @@ static void setup_pagelist_highmark(stru
 }
 
 
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-	int node = cpu_to_node(cpu);
-
-	node_set_state(node, N_CPU);	/* this node has a cpu */
-
-	for_each_populated_zone(zone) {
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-	}
-}
-
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
 	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	default:
 		break;
 	}
-	return ret;
+	return NOTIFY_OK;
 }
 
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
 
+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.
+ * Boot pagesets will no longer be used by this processor
+ * after setup_per_cpu_pageset().
+ */
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
+	struct zone *zone;
+	int cpu;
+
+	for_each_populated_zone(zone) {
+		zone->pageset = alloc_percpu(struct per_cpu_pageset);
+
+		for_each_possible_cpu(cpu) {
+			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+			setup_pageset(pcp, zone_batchsize(zone));
+
+			if (percpu_pagelist_fraction)
+				setup_pagelist_highmark(pcp,
+					(zone->present_pages /
+						percpu_pagelist_fraction));
+		}
+	}
 
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+	/*
+	 * The boot cpu is always the first active.
+	 * The boot node has a processor
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
 	register_cpu_notifier(&pageset_notifier);
 }
 
-#endif
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -3253,7 +3215,7 @@ static int __zone_pcp_update(void *data)
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;
 
 		local_irq_save(flags);
@@ -3271,21 +3233,13 @@ void zone_pcp_update(struct zone *zone)
 
 static __meminit void zone_pcp_init(struct zone *zone)
 {
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
+	/* Use boot pagesets until we have the per cpu allocator up */
+	zone->pageset = &per_cpu_var(boot_pageset);
 
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
 	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
+		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
+			zone->name, zone->present_pages,
+					 zone_batchsize(zone));
 }
 
 __meminit int init_currently_empty_zone(struct zone *zone,
@@ -4799,10 +4753,11 @@ int percpu_pagelist_fraction_sysctl_hand
 	if (!write || (ret == -EINVAL))
 		return ret;
 	for_each_populated_zone(zone) {
-		for_each_online_cpu(cpu) {
+		for_each_possible_cpu(cpu) {
 			unsigned long  high;
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(
+				per_cpu_ptr(zone->pageset, cpu), high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-12-14 15:13:43.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2009-12-14 15:20:45.000000000 -0600
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
 
-		p = zone_pcp(zone, cpu);
+		p = per_cpu_ptr(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"

-- 


* [this_cpu_xx V7 2/8] this_cpu ops: Remove pageset_notifier
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 3/8] Use this_cpu operations in slub Christoph Lameter
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_page_alloc_remove_pageset_notifier --]
[-- Type: text/plain, Size: 2015 bytes --]

Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/page_alloc.c |   28 ----------------------------
 mm/vmstat.c     |    1 +
 2 files changed, 1 insertion(+), 28 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-29 15:37:50.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-29 15:37:54.000000000 -0500
@@ -905,6 +905,7 @@ static int __cpuinit vmstat_cpuup_callba
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		start_cpu_timer(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-29 15:37:50.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-29 15:37:54.000000000 -0500
@@ -3107,27 +3107,6 @@ static void setup_pagelist_highmark(stru
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		node_set_state(cpu_to_node(cpu), N_CPU);
-		break;
-	default:
-		break;
-	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block __cpuinitdata pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
@@ -3153,13 +3132,6 @@ void __init setup_per_cpu_pageset(void)
 						percpu_pagelist_fraction));
 		}
 	}
-
-	/*
-	 * The boot cpu is always the first active.
-	 * The boot node has a processor
-	 */
-	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
-	register_cpu_notifier(&pageset_notifier);
 }
 
 static noinline __init_refok

-- 


* [this_cpu_xx V7 3/8] Use this_cpu operations in slub
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 2/8] this_cpu ops: Remove pageset_notifier Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 4/8] SLUB: Get rid of dynamic DMA kmalloc cache allocation Christoph Lameter
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_conversion --]
[-- Type: text/plain, Size: 12323 bytes --]

Using per cpu allocations removes the need for the per cpu arrays in the
kmem_cache struct. These could get quite big if we have to support systems
with thousands of cpus. The use of this_cpu_xx operations results in:

1. The size of kmem_cache for SMP configuration shrinks since we will only
   need 1 pointer instead of NR_CPUS. The same pointer can be used by all
   processors. Reduces cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the
   system, meaning less memory overhead for configurations that may potentially
   support up to 1k NUMA nodes / 4k cpus.

3. We can get rid of the fiddly allocation and freeing of kmem_cache_cpu
   structures when bringing up and shutting down cpus. The cpu alloc logic
   will do it all for us. Removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases since per cpu pointer lookups and
   address calculations are avoided.
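
A hedged sketch of the resulting layout and fastpath access (helper names are
made up; the logic mirrors the hunks below):

	/* One percpu allocation replaces the NR_CPUS array of pointers. */
	static int sketch_alloc_cpu_slabs(struct kmem_cache *s)
	{
		s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
		return s->cpu_slab != NULL;
	}

	/*
	 * Fastpath access: interrupts are already disabled in slab_alloc()
	 * and slab_free(), so the cheaper __this_cpu_ptr() form can be used.
	 */
	static struct kmem_cache_cpu *sketch_cpu_slab(struct kmem_cache *s)
	{
		return __this_cpu_ptr(s->cpu_slab);
	}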

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  200 +++++++++++------------------------------------
 2 files changed, 48 insertions(+), 158 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-12-14 15:11:06.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2009-12-14 15:44:11.000000000 -0600
@@ -69,6 +69,7 @@ struct kmem_cache_order_objects {
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -104,11 +105,6 @@ struct kmem_cache {
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-12-14 15:11:06.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-12-14 15:44:11.000000000 -0600
@@ -242,15 +242,6 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1124,7 +1115,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1422,7 +1413,7 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
@@ -1454,7 +1445,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1507,7 +1498,7 @@ static inline void flush_slab(struct kme
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1673,7 +1664,7 @@ new_slab:
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __this_cpu_ptr(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1711,7 +1702,7 @@ static __always_inline void *slab_alloc(
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned int objsize;
+	unsigned long objsize;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1722,14 +1713,14 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
+	object = c->freelist;
 	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
@@ -1800,7 +1791,7 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1872,7 +1863,7 @@ static __always_inline void slab_free(st
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, c->objsize);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
@@ -2095,130 +2086,28 @@ init_kmem_cache_node(struct kmem_cache_n
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu [NR_KMEM_CACHE_CPU],
-		      kmem_cache_cpu);
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static DECLARE_BITMAP(kmem_cach_cpu_free_init_once, CONFIG_NR_CPUS);
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c >= per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
-static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[SLUB_PAGE_SHIFT]);
 
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpumask_test_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once)))
-		return;
-
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
-	cpumask_set_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once));
-}
-
-static void __init init_alloc_cpu(void)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
+	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+		/*
+		 * Boot time creation of the kmalloc array. Use static per cpu data
+		 * since the per cpu allocator is not available yet.
+		 */
+		s->cpu_slab = per_cpu_var(kmalloc_percpu) + (s - kmalloc_caches);
+	else
+		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
 
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
+	if (!s->cpu_slab)
+		return 0;
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+	for_each_possible_cpu(cpu)
+		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2609,9 +2498,8 @@ static inline int kmem_cache_close(struc
 	int node;
 
 	flush_all(s);
-
+	free_percpu(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2760,7 +2648,19 @@ static noinline struct kmem_cache *dma_k
 	realsize = kmalloc_caches[index].objsize;
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
-	s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+
+	if (flags & __GFP_WAIT)
+		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+	else {
+		int i;
+
+		s = NULL;
+		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
+			if (kmalloc_caches[i].size) {
+				s = kmalloc_caches + i;
+				break;
+			}
+	}
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -3176,8 +3076,6 @@ void __init kmem_cache_init(void)
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3261,8 +3159,10 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3365,7 +3265,7 @@ struct kmem_cache *kmem_cache_create(con
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3422,11 +3322,9 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3436,13 +3334,9 @@ static int __cpuinit slab_cpuup_callback
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3928,7 +3822,7 @@ static ssize_t show_slab_objects(struct 
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4353,7 +4247,7 @@ static int show_stat(struct kmem_cache *
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;

-- 


* [this_cpu_xx V7 4/8] SLUB: Get rid of dynamic DMA kmalloc cache allocation
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (2 preceding siblings ...)
  2009-12-14 22:03 ` [this_cpu_xx V7 3/8] Use this_cpu operations in slub Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 5/8] this_cpu: Remove slub kmem_cache fields Christoph Lameter
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_static_dma_kmalloc --]
[-- Type: text/plain, Size: 3688 bytes --]

Dynamic DMA kmalloc cache allocation is troublesome since the
new percpu allocator does not support allocations in atomic contexts.
Reserve some statically allocated kmalloc_cpu structures instead.
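
A hedged sketch of the reservation scheme (the function name is made up; the
loop mirrors the dma_kmalloc_cache() hunk below):

	static struct kmem_cache *sketch_claim_dma_cache_slot(void)
	{
		int i;

		/*
		 * kmalloc_caches[] is enlarged to KMALLOC_CACHES entries and
		 * an unused entry is recognized by a zero size, so no
		 * allocation is needed in atomic context.
		 */
		for (i = 0; i < KMALLOC_CACHES; i++)
			if (!kmalloc_caches[i].size)
				return kmalloc_caches + i;

		return NULL;	/* the patch BUG()s here instead */
	}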

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   19 +++++++++++--------
 mm/slub.c                |   24 ++++++++++--------------
 2 files changed, 21 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-12-14 15:22:57.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2009-12-14 15:23:27.000000000 -0600
@@ -131,11 +131,21 @@ struct kmem_cache {
 
 #define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
 
+#ifdef CONFIG_ZONE_DMA
+#define SLUB_DMA __GFP_DMA
+/* Reserve extra caches for potential DMA use */
+#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
+#else
+/* Disable DMA functionality */
+#define SLUB_DMA (__force gfp_t)0
+#define KMALLOC_CACHES SLUB_PAGE_SHIFT
+#endif
+
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -203,13 +213,6 @@ static __always_inline struct kmem_cache
 	return &kmalloc_caches[index];
 }
 
-#ifdef CONFIG_ZONE_DMA
-#define SLUB_DMA __GFP_DMA
-#else
-/* Disable DMA functionality */
-#define SLUB_DMA (__force gfp_t)0
-#endif
-
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-12-14 15:23:19.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-12-14 15:23:27.000000000 -0600
@@ -2092,7 +2092,7 @@ static inline int alloc_kmem_cache_cpus(
 {
 	int cpu;
 
-	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
 		 * since the per cpu allocator is not available yet.
@@ -2539,7 +2539,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2629,6 +2629,7 @@ static noinline struct kmem_cache *dma_k
 	char *text;
 	size_t realsize;
 	unsigned long slabflags;
+	int i;
 
 	s = kmalloc_caches_dma[index];
 	if (s)
@@ -2649,18 +2650,13 @@ static noinline struct kmem_cache *dma_k
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
 
-	if (flags & __GFP_WAIT)
-		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
-	else {
-		int i;
+	s = NULL;
+	for (i = 0; i < KMALLOC_CACHES; i++)
+		if (!kmalloc_caches[i].size)
+			break;
 
-		s = NULL;
-		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
-			if (kmalloc_caches[i].size) {
-				s = kmalloc_caches + i;
-				break;
-			}
-	}
+	BUG_ON(i >= KMALLOC_CACHES);
+	s = kmalloc_caches + i;
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -2674,7 +2670,7 @@ static noinline struct kmem_cache *dma_k
 
 	if (!s || !text || !kmem_cache_open(s, flags, text,
 			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
-		kfree(s);
+		s->size = 0;
 		kfree(text);
 		goto unlock_out;
 	}

-- 


* [this_cpu_xx V7 5/8] this_cpu: Remove slub kmem_cache fields
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (3 preceding siblings ...)
  2009-12-14 22:03 ` [this_cpu_xx V7 4/8] SLUB: Get rid of dynamic DMA kmalloc cache allocation Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-14 22:03 ` [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc Christoph Lameter
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_remove_fields --]
[-- Type: text/plain, Size: 7862 bytes --]

Remove the fields in struct kmem_cache_cpu that were used to cache data from
struct kmem_cache when they were in different cachelines. The cacheline that
holds the per cpu array pointer now also holds these values. We can cut down
the struct kmem_cache_cpu size to almost half.

The get_freepointer() and set_freepointer() functions, which used to be
intended only for the slow path, are now also useful on the hot path since
accessing the size field no longer requires touching an additional cacheline.
This results in consistent use of these functions for setting the freepointer
of objects throughout SLUB.
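
For reference, the helpers in question (a sketch of the existing mm/slub.c
definitions, which are unchanged apart from the comment removed below):

	static inline void *get_freepointer(struct kmem_cache *s, void *object)
	{
		/* s->offset now shares a cacheline with s->cpu_slab */
		return *(void **)(object + s->offset);
	}

	static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
	{
		*(void **)(object + s->offset) = fp;
	}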

Also we initialize all possible kmem_cache_cpu structures when a slab is
created. No need to initialize them when a processor or node comes online.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/slub_def.h |    2 -
 mm/slub.c                |   76 ++++++++++-------------------------------------
 2 files changed, 17 insertions(+), 61 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-12-14 15:23:27.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2009-12-14 15:23:59.000000000 -0600
@@ -38,8 +38,6 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-12-14 15:23:27.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-12-14 15:24:55.000000000 -0600
@@ -260,13 +260,6 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1473,10 +1466,10 @@ static void deactivate_slab(struct kmem_
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1635,7 +1628,7 @@ load_freelist:
 	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1681,7 +1674,7 @@ debug:
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1702,7 +1695,6 @@ static __always_inline void *slab_alloc(
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned long objsize;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1715,22 +1707,21 @@ static __always_inline void *slab_alloc(
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
 	object = c->freelist;
-	objsize = c->objsize;
 	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely(gfpflags & __GFP_ZERO) && object)
-		memset(object, 0, objsize);
+		memset(object, 0, s->objsize);
 
-	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, objsize, 1, s->flags, gfpflags);
+	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
+	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1785,7 +1776,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_notr
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr, unsigned int offset)
+			void *x, unsigned long addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1799,7 +1790,8 @@ static void __slab_free(struct kmem_cach
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1864,16 +1856,16 @@ static __always_inline void slab_free(st
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	kmemcheck_slab_free(s, object, c->objsize);
-	debug_check_no_locks_freed(object, c->objsize);
+	kmemcheck_slab_free(s, object, s->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, c->objsize);
+		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -2060,19 +2052,6 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
-{
-	c->page = NULL;
-	c->freelist = NULL;
-	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
-#ifdef CONFIG_SLUB_STATS
-	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
-#endif
-}
-
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
@@ -2090,8 +2069,6 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
-	int cpu;
-
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2104,8 +2081,6 @@ static inline int alloc_kmem_cache_cpus(
 	if (!s->cpu_slab)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
 
@@ -2391,6 +2366,7 @@ static int kmem_cache_open(struct kmem_c
 
 	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
 		return 1;
+
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
@@ -3247,22 +3223,12 @@ struct kmem_cache *kmem_cache_create(con
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
 		 * the complete object on kzalloc.
 		 */
 		s->objsize = max(s->objsize, (int)size);
-
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 
@@ -3316,14 +3282,6 @@ static int __cpuinit slab_cpuup_callback
 	unsigned long flags;
 
 	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slub_lock);
-		list_for_each_entry(s, &slab_caches, list)
-			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
-		up_read(&slub_lock);
-		break;
-
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:

-- 


* [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (4 preceding siblings ...)
  2009-12-14 22:03 ` [this_cpu_xx V7 5/8] this_cpu: Remove slub kmem_cache fields Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-15  6:24   ` Eric Dumazet
  2009-12-14 22:03 ` [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_cleanup_stat --]
[-- Type: text/plain, Size: 5136 bytes --]

this_cpu_inc() translates into a single instruction on x86 and does not
need any extra register, so use it in stat(). We also want to avoid
calculating the per cpu kmem_cache_cpu structure pointer, so pass a
kmem_cache pointer instead of a kmem_cache_cpu pointer.
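
The resulting helper, for reference (it also appears in the first hunk below):

	static inline void stat(struct kmem_cache *s, enum stat_item si)
	{
	#ifdef CONFIG_SLUB_STATS
		/* one per cpu increment; no kmem_cache_cpu pointer needed */
		__this_cpu_inc(s->cpu_slab->stat[si]);
	#endif
	}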

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:49.000000000 -0500
@@ -217,10 +217,10 @@ static inline void sysfs_slab_remove(str
 
 #endif
 
-static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	c->stat[si]++;
+	__this_cpu_inc(s->cpu_slab->stat[si]);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
+		stat(s, ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1406,23 +1406,22 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
 
 		if (page->freelist) {
 			add_partial(n, page, tail);
-			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
+			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
-			stat(c, DEACTIVATE_FULL);
+			stat(s, DEACTIVATE_FULL);
 			if (SLABDEBUG && PageSlubDebug(page) &&
 						(s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
 	} else {
-		stat(c, DEACTIVATE_EMPTY);
+		stat(s, DEACTIVATE_EMPTY);
 		if (n->nr_partial < s->min_partial) {
 			/*
 			 * Adding an empty slab to the partial slabs in order
@@ -1438,7 +1437,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
+			stat(s, FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1453,7 +1452,7 @@ static void deactivate_slab(struct kmem_
 	int tail = 1;
 
 	if (page->freelist)
-		stat(c, DEACTIVATE_REMOTE_FREES);
+		stat(s, DEACTIVATE_REMOTE_FREES);
 	/*
 	 * Merge cpu freelist into slab freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
@@ -1479,7 +1478,7 @@ static void deactivate_slab(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(c, CPUSLAB_FLUSH);
+	stat(s, CPUSLAB_FLUSH);
 	slab_lock(c->page);
 	deactivate_slab(s, c);
 }
@@ -1619,7 +1618,7 @@ static void *__slab_alloc(struct kmem_ca
 	if (unlikely(!node_match(c, node)))
 		goto another_slab;
 
-	stat(c, ALLOC_REFILL);
+	stat(s, ALLOC_REFILL);
 
 load_freelist:
 	object = c->page->freelist;
@@ -1634,7 +1633,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-	stat(c, ALLOC_SLOWPATH);
+	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
@@ -1644,7 +1643,7 @@ new_slab:
 	new = get_partial(s, gfpflags, node);
 	if (new) {
 		c->page = new;
-		stat(c, ALLOC_FROM_PARTIAL);
+		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
 
@@ -1658,7 +1657,7 @@ new_slab:
 
 	if (new) {
 		c = __this_cpu_ptr(s->cpu_slab);
-		stat(c, ALLOC_SLAB);
+		stat(s, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
 		slab_lock(new);
@@ -1713,7 +1712,7 @@ static __always_inline void *slab_alloc(
 
 	else {
 		c->freelist = get_freepointer(s, object);
-		stat(c, ALLOC_FASTPATH);
+		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
@@ -1780,10 +1779,8 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 
-	c = __this_cpu_ptr(s->cpu_slab);
-	stat(c, FREE_SLOWPATH);
+	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
@@ -1796,7 +1793,7 @@ checks_ok:
 	page->inuse--;
 
 	if (unlikely(PageSlubFrozen(page))) {
-		stat(c, FREE_FROZEN);
+		stat(s, FREE_FROZEN);
 		goto out_unlock;
 	}
 
@@ -1809,7 +1806,7 @@ checks_ok:
 	 */
 	if (unlikely(!prior)) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(c, FREE_ADD_PARTIAL);
+		stat(s, FREE_ADD_PARTIAL);
 	}
 
 out_unlock:
@@ -1822,10 +1819,10 @@ slab_empty:
 		 * Slab still on the partial list.
 		 */
 		remove_partial(s, page);
-		stat(c, FREE_REMOVE_PARTIAL);
+		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-	stat(c, FREE_SLAB);
+	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
 
@@ -1863,7 +1860,7 @@ static __always_inline void slab_free(st
 	if (likely(page == c->page && c->node >= 0)) {
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
-		stat(c, FREE_FASTPATH);
+		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
 

-- 


* [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (5 preceding siblings ...)
  2009-12-14 22:03 ` [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-15  4:03   ` Tejun Heo
  2009-12-14 22:03 ` [this_cpu_xx V7 8/8] Remove cpu_local_xx macros Christoph Lameter
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_local_t_remove_modules --]
[-- Type: text/plain, Size: 6374 bytes --]

Use this_cpu operations to deal with the per cpu data instead of a local_t.
This reduces memory requirements and cache footprint and decreases cycle counts.

The this_cpu_xx operations are also used for !SMP mode; they provide their
own fallback there. Otherwise we could not drop __module_ref_addr(), which
would keep per cpu data handling complicated.

After this patch is applied, the last remaining user of local_t is the
tracing ring buffer.
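
A hedged sketch of the new reference counting idiom (condensed from the hunks
below; function names are made up and error handling is omitted):

	static void sketch_module_get(struct module *module)
	{
		preempt_disable();			/* replaces get_cpu()/put_cpu() */
		__this_cpu_inc(module->refptr->count);	/* replaces local_inc() */
		preempt_enable();
	}

	static unsigned int sketch_module_refcount(struct module *mod)
	{
		unsigned int total = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			total += per_cpu_ptr(mod->refptr, cpu)->count;
		return total;
	}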

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/module.h     |   36 ++++++++++++------------------------
 kernel/module.c            |   30 ++++++++++++++++--------------
 kernel/trace/ring_buffer.c |    1 +
 3 files changed, 29 insertions(+), 38 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2009-11-13 09:34:39.000000000 -0600
+++ linux-2.6/include/linux/module.h	2009-12-14 15:26:58.000000000 -0600
@@ -16,8 +16,7 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/tracepoint.h>
-
-#include <asm/local.h>
+#include <linux/percpu.h>
 #include <asm/module.h>
 
 #include <trace/events/module.h>
@@ -361,11 +360,9 @@ struct module
 	/* Destruction function. */
 	void (*exit)(void);
 
-#ifdef CONFIG_SMP
-	char *refptr;
-#else
-	local_t ref;
-#endif
+	struct module_ref {
+		int count;
+	} *refptr;
 #endif
 
 #ifdef CONFIG_CONSTRUCTORS
@@ -452,25 +449,16 @@ void __symbol_put(const char *symbol);
 #define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x)
 void symbol_put_addr(void *addr);
 
-static inline local_t *__module_ref_addr(struct module *mod, int cpu)
-{
-#ifdef CONFIG_SMP
-	return (local_t *) (mod->refptr + per_cpu_offset(cpu));
-#else
-	return &mod->ref;
-#endif
-}
-
 /* Sometimes we know we already have a refcount, and it's easier not
    to handle the error case (which only happens with rmmod --wait). */
 static inline void __module_get(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_inc(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_inc(module->refptr->count);
 		trace_module_get(module, _THIS_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
-		put_cpu();
+				 __this_cpu_read(module->refptr->count));
+		preempt_enable();
 	}
 }
 
@@ -479,15 +467,15 @@ static inline int try_module_get(struct 
 	int ret = 1;
 
 	if (module) {
-		unsigned int cpu = get_cpu();
 		if (likely(module_is_live(module))) {
-			local_inc(__module_ref_addr(module, cpu));
+			preempt_disable();
+			__this_cpu_inc(module->refptr->count);
 			trace_module_get(module, _THIS_IP_,
-				local_read(__module_ref_addr(module, cpu)));
+				__this_cpu_read(module->refptr->count));
+			preempt_enable();
 		}
 		else
 			ret = 0;
-		put_cpu();
 	}
 	return ret;
 }
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2009-12-14 14:55:49.000000000 -0600
+++ linux-2.6/kernel/module.c	2009-12-14 15:26:58.000000000 -0600
@@ -474,9 +474,10 @@ static void module_unload_init(struct mo
 
 	INIT_LIST_HEAD(&mod->modules_which_use_me);
 	for_each_possible_cpu(cpu)
-		local_set(__module_ref_addr(mod, cpu), 0);
+		per_cpu_ptr(mod->refptr, cpu)->count = 0;
+
 	/* Hold reference count during initialization. */
-	local_set(__module_ref_addr(mod, raw_smp_processor_id()), 1);
+	__this_cpu_write(mod->refptr->count, 1);
 	/* Backwards compatibility macros put refcount during init. */
 	mod->waiter = current;
 }
@@ -555,6 +556,7 @@ static void module_unload_free(struct mo
 				kfree(use);
 				sysfs_remove_link(i->holders_dir, mod->name);
 				/* There can be at most one match. */
+				free_percpu(i->refptr);
 				break;
 			}
 		}
@@ -619,7 +621,7 @@ unsigned int module_refcount(struct modu
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		total += local_read(__module_ref_addr(mod, cpu));
+		total += per_cpu_ptr(mod->refptr, cpu)->count;
 	return total;
 }
 EXPORT_SYMBOL(module_refcount);
@@ -796,14 +798,15 @@ static struct module_attribute refcnt = 
 void module_put(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_dec(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_dec(module->refptr->count);
+
 		trace_module_put(module, _RET_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
+				 __this_cpu_read(module->refptr->count));
 		/* Maybe they're waiting for us to drop reference? */
 		if (unlikely(!module_is_live(module)))
 			wake_up_process(module->waiter);
-		put_cpu();
+		preempt_enable();
 	}
 }
 EXPORT_SYMBOL(module_put);
@@ -1380,9 +1383,9 @@ static void free_module(struct module *m
 	kfree(mod->args);
 	if (mod->percpu)
 		percpu_modfree(mod->percpu);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
+#if defined(CONFIG_MODULE_UNLOAD)
 	if (mod->refptr)
-		percpu_modfree(mod->refptr);
+		free_percpu(mod->refptr);
 #endif
 	/* Free lock-classes: */
 	lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -2148,9 +2151,8 @@ static noinline struct module *load_modu
 	mod = (void *)sechdrs[modindex].sh_addr;
 	kmemleak_load_module(mod, hdr, sechdrs, secstrings);
 
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	mod->refptr = percpu_modalloc(sizeof(local_t), __alignof__(local_t),
-				      mod->name);
+#if defined(CONFIG_MODULE_UNLOAD)
+	mod->refptr = alloc_percpu(struct module_ref);
 	if (!mod->refptr) {
 		err = -ENOMEM;
 		goto free_init;
@@ -2376,8 +2378,8 @@ static noinline struct module *load_modu
 	kobject_put(&mod->mkobj.kobj);
  free_unload:
 	module_unload_free(mod);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	percpu_modfree(mod->refptr);
+#if defined(CONFIG_MODULE_UNLOAD)
+	free_percpu(mod->refptr);
  free_init:
 #endif
 	module_free(mod, mod->module_init);
Index: linux-2.6/kernel/trace/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-12-07 08:59:46.000000000 -0600
+++ linux-2.6/kernel/trace/ring_buffer.c	2009-12-14 15:26:58.000000000 -0600
@@ -12,6 +12,7 @@
 #include <linux/hardirq.h>
 #include <linux/kmemcheck.h>
 #include <linux/module.h>
+#include <asm/local.h>
 #include <linux/percpu.h>
 #include <linux/mutex.h>
 #include <linux/init.h>

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [this_cpu_xx V7 8/8] Remove cpu_local_xx macros
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (6 preceding siblings ...)
  2009-12-14 22:03 ` [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
@ 2009-12-14 22:03 ` Christoph Lameter
  2009-12-15  4:04   ` Tejun Heo
  2009-12-15  6:37 ` [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Pekka Enberg
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-14 22:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_local_t_remove_cpu_alloc --]
[-- Type: text/plain, Size: 9281 bytes --]

These macros have not been used for a while now and are not used
by the last local_t user in the tree (trace ringbuffer logic).

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: linux-2.6/arch/alpha/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/alpha/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -98,21 +98,4 @@ static __inline__ long local_sub_return(
 #define __local_add(i,l)	((l)->a.counter+=(i))
 #define __local_sub(i,l)	((l)->a.counter-=(i))
 
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
 #endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/arch/m32r/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/m32r/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/m32r/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -338,29 +338,4 @@ static inline void local_set_mask(unsign
  * a variable, not an address.
  */
 
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non local way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* __M32R_LOCAL_H */
Index: linux-2.6/arch/mips/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/mips/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/mips/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -193,29 +193,4 @@ static __inline__ long local_sub_return(
 #define __local_add(i, l)	((l)->a.counter+=(i))
 #define __local_sub(i, l)	((l)->a.counter-=(i))
 
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/arch/powerpc/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/powerpc/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -172,29 +172,4 @@ static __inline__ long local_dec_if_posi
 #define __local_add(i,l)	((l)->a.counter+=(i))
 #define __local_sub(i,l)	((l)->a.counter-=(i))
 
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/arch/x86/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -195,41 +195,4 @@ static inline long local_sub_return(long
 #define __local_add(i, l)	local_add((i), (l))
 #define __local_sub(i, l)	local_sub((i), (l))
 
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- *
- * X86_64: This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)		\
-({					\
-	local_t res__;			\
-	preempt_disable(); 		\
-	res__ = (l);			\
-	preempt_enable();		\
-	res__;				\
-})
-#define cpu_local_wrap(l)		\
-({					\
-	preempt_disable();		\
-	(l);				\
-	preempt_enable();		\
-})					\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
-
-#define __cpu_local_inc(l)	cpu_local_inc((l))
-#define __cpu_local_dec(l)	cpu_local_dec((l))
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ASM_X86_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h	2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/include/asm-generic/local.h	2009-10-08 12:57:40.000000000 -0500
@@ -52,23 +52,4 @@ typedef struct
 #define __local_add(i,l)	local_set((l), local_read(l) + (i))
 #define __local_sub(i,l)	local_set((l), local_read(l) - (i))
 
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc.  Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
 #endif /* _ASM_GENERIC_LOCAL_H */

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion
  2009-12-14 22:03 ` [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion Christoph Lameter
@ 2009-12-15  3:53   ` Tejun Heo
  2009-12-15 15:04     ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2009-12-15  3:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Hello,

On 12/15/2009 07:03 AM, Christoph Lameter wrote:
>  static __meminit void zone_pcp_init(struct zone *zone)
>  {
> -	int cpu;
> -	unsigned long batch = zone_batchsize(zone);
> +	/* Use boot pagesets until we have the per cpu allocator up */
> +	zone->pageset = &per_cpu_var(boot_pageset);

Ummm... this scares me a little bit.  Before it was a pointer to
statically allocated area which can be used from basically anywhere.
Now, it's being initialized to a percpu pointer which won't be
available before setup_per_cpu_areas() is complete and the above
initialization takes place from setup_arch() which is before percpu
initialization.  I don't think there's anything which would access
page allocator between the two places, but it still seems a bit risky.
Maybe it's better to keep the boot_pageset a static array?  Or am I
misunderstanding something?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-14 22:03 ` [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
@ 2009-12-15  4:03   ` Tejun Heo
  2009-12-15 22:41     ` Rusty Russell
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2009-12-15  4:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers,
	Rusty Russell

(Rusty Russell cc'd.)

On 12/15/2009 07:03 AM, Christoph Lameter wrote:
> @@ -479,15 +467,15 @@ static inline int try_module_get(struct 
>  	int ret = 1;
>  
>  	if (module) {
> -		unsigned int cpu = get_cpu();
>  		if (likely(module_is_live(module))) {
> -			local_inc(__module_ref_addr(module, cpu));
> +			preempt_disable();
> +			__this_cpu_inc(module->refptr->count);
>  			trace_module_get(module, _THIS_IP_,
> -				local_read(__module_ref_addr(module, cpu)));
> +				__this_cpu_read(module->refptr->count));
> +			preempt_enable();
>  		}
>  		else
>  			ret = 0;
> -		put_cpu();

I think you need preemption disabled while checking whether
module_is_live().  The state is protected by stop_machine or
synchronize_sched().

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 8/8] Remove cpu_local_xx macros
  2009-12-14 22:03 ` [this_cpu_xx V7 8/8] Remove cpu_local_xx macros Christoph Lameter
@ 2009-12-15  4:04   ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2009-12-15  4:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

On 12/15/2009 07:03 AM, Christoph Lameter wrote:
> These macros have not been used for a while now and are not used
> by the last local_t user in the tree (trace ringbuffer logic).
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Cool.  Other than the things I commented before, the series looks
great to me.

Thanks for doing this.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc
  2009-12-14 22:03 ` [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc Christoph Lameter
@ 2009-12-15  6:24   ` Eric Dumazet
  2009-12-15 14:46     ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2009-12-15  6:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mel Gorman,
	Mathieu Desnoyers

On 14/12/2009 23:03, Christoph Lameter wrote:
> this_cpu_inc() translates into a single instruction on x86 and does not
> need any register. So use it in stat(). We also want to avoid the
> calculation of the per cpu kmem_cache_cpu structure pointer. So pass
> a kmem_cache pointer instead of a kmem_cache_cpu pointer.
> 

Hmm, last time I checked, [__]this_cpu_inc were not using "inc" on x86,
but the default [__]this_cpu_add((pcp), 1)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (7 preceding siblings ...)
  2009-12-14 22:03 ` [this_cpu_xx V7 8/8] Remove cpu_local_xx macros Christoph Lameter
@ 2009-12-15  6:37 ` Pekka Enberg
  2009-12-15  6:47   ` Tejun Heo
  2009-12-15 17:06 ` Mel Gorman
  2009-12-15 17:43 ` Mathieu Desnoyers
  10 siblings, 1 reply; 39+ messages in thread
From: Pekka Enberg @ 2009-12-15  6:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Mathieu Desnoyers

On Mon, 2009-12-14 at 16:03 -0600, Christoph Lameter wrote:
> Leftovers from the earlier patchset. Mostly applications of per cpu counters
> to core components.

I can pick up the SLUB patches. Is that OK with you Tejun?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-15  6:37 ` [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Pekka Enberg
@ 2009-12-15  6:47   ` Tejun Heo
  2009-12-15 14:50     ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2009-12-15  6:47 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Mathieu Desnoyers

On 12/15/2009 03:37 PM, Pekka Enberg wrote:
> On Mon, 2009-12-14 at 16:03 -0600, Christoph Lameter wrote:
>> Leftovers from the earlier patchset. Mostly applications of per cpu counters
>> to core components.
> 
> I can pick up the SLUB patches. Is that OK with you Tejun?

Yeap, now that this_cpu stuff is upstream, no reason to route them
through percpu tree.  I'll be happy to pick up whatever is left.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc
  2009-12-15  6:24   ` Eric Dumazet
@ 2009-12-15 14:46     ` Christoph Lameter
  2009-12-15 14:59       ` Eric Dumazet
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-15 14:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mel Gorman,
	Mathieu Desnoyers

[-- Attachment #1: Type: TEXT/PLAIN, Size: 644 bytes --]

On Tue, 15 Dec 2009, Eric Dumazet wrote:

> On 14/12/2009 23:03, Christoph Lameter wrote:
> > this_cpu_inc() translates into a single instruction on x86 and does not
> > need any register. So use it in stat(). We also want to avoid the
> > calculation of the per cpu kmem_cache_cpu structure pointer. So pass
> > a kmem_cache pointer instead of a kmem_cache_cpu pointer.
> >
>
> Hmm, last time I checked, [__]this_cpu_inc were not using "inc" on x86,
> but the default [__]this_cpu_add((pcp), 1)

Well, the add 1 is still a single instruction so it's okay. We can easily
add support for inc on x86 if it makes a difference?





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-15  6:47   ` Tejun Heo
@ 2009-12-15 14:50     ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-15 14:50 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Pekka Enberg, linux-kernel, Mel Gorman, Mathieu Desnoyers

Small fixup patch to the slub patches:

Subject: Update remaining reference to get_cpu_slab

If CONFIG_SLUB_STATS is set, some additional code is compiled that was
not updated, since it is a recent addition to slub.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-12-15 08:47:44.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-12-15 08:48:03.000000000 -0600
@@ -4221,7 +4221,7 @@ static void clear_stat(struct kmem_cache
 	int cpu;

 	for_each_online_cpu(cpu)
-		get_cpu_slab(s, cpu)->stat[si] = 0;
+		per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
 }

 #define STAT_ATTR(si, text) 					\

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc
  2009-12-15 14:46     ` Christoph Lameter
@ 2009-12-15 14:59       ` Eric Dumazet
  0 siblings, 0 replies; 39+ messages in thread
From: Eric Dumazet @ 2009-12-15 14:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mel Gorman,
	Mathieu Desnoyers

On 15/12/2009 15:46, Christoph Lameter wrote:
> On Tue, 15 Dec 2009, Eric Dumazet wrote:
> 
>> On 14/12/2009 23:03, Christoph Lameter wrote:
>>> this_cpu_inc() translates into a single instruction on x86 and does not
>>> need any register. So use it in stat(). We also want to avoid the
>>> calculation of the per cpu kmem_cache_cpu structure pointer. So pass
>>> a kmem_cache pointer instead of a kmem_cache_cpu pointer.
>>>
>>
>> Hmm, last time I checked, [__]this_cpu_inc were not using "inc" on x86,
>> but the default [__]this_cpu_add((pcp), 1)
> 
> Well the add 1 is still a single instruction so its okay. We can easily
> add support for inc on x86 if makes a difference?
> 

Probably not, only one byte of text per use can be saved.

Sorry, rereading your changelog I don't know why I thought about the "inc" instruction.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion
  2009-12-15  3:53   ` Tejun Heo
@ 2009-12-15 15:04     ` Christoph Lameter
  2009-12-16  0:53       ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-15 15:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

On Tue, 15 Dec 2009, Tejun Heo wrote:

> Hello,
>
> On 12/15/2009 07:03 AM, Christoph Lameter wrote:
> >  static __meminit void zone_pcp_init(struct zone *zone)
> >  {
> > -	int cpu;
> > -	unsigned long batch = zone_batchsize(zone);
> > +	/* Use boot pagesets until we have the per cpu allocator up */
> > +	zone->pageset = &per_cpu_var(boot_pageset);
>
> Ummm... this scares me a little bit.  Before it was a pointer to
> statically allocated area which can be used from basically anywhere.
> Now, it's being initialized to a percpu pointer which won't be
> available before setup_per_cpu_areas() is complete and the above
> initialization takes place from setup_arch() which is before percpu
> initialization.  I don't think there's anything which would access
> page allocator between the two places, but it still seems a bit risky.
> Maybe it's better to keep the boot_pageset a static array?  Or am I
> misunderstanding something?

A static array would have to be dimensioned to NR_CPUS. That is one thing
we are trying to avoid.

The assignment of the pageset "percpu" pointer does not mean that the pcp
is usable. It must first be properly initialized through setup_pageset().

setup_pageset() is run for each cpu. zone->pageset is the same for all
cpus, which is why it is in zone_pcp_init() and not in setup_pageset().

The boot pageset initialization was moved into __build_all_zonelists(). We
could move the zone->pageset initialization there too?




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (8 preceding siblings ...)
  2009-12-15  6:37 ` [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Pekka Enberg
@ 2009-12-15 17:06 ` Mel Gorman
  2009-12-16 14:44   ` Christoph Lameter
  2009-12-15 17:43 ` Mathieu Desnoyers
  10 siblings, 1 reply; 39+ messages in thread
From: Mel Gorman @ 2009-12-15 17:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

On Mon, Dec 14, 2009 at 04:03:20PM -0600, Christoph Lameter wrote:
> Leftovers from the earlier patchset. Mostly applications of per cpu counters
> to core components.
> 
> After this patchset there will be only one user of local_t left: Mathieu's
> trace ringbuffer. Does it really need these ops?
> 

What kernel are these patches based on? They do not cleanly apply and
when fixed up, they do not build against 2.6.32.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
                   ` (9 preceding siblings ...)
  2009-12-15 17:06 ` Mel Gorman
@ 2009-12-15 17:43 ` Mathieu Desnoyers
  2009-12-16  0:58   ` Tejun Heo
  10 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-12-15 17:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg, Steven Rostedt

* Christoph Lameter (cl@linux-foundation.org) wrote:
> Leftovers from the earlier patchset. Mostly applications of per cpu counters
> to core components.
> 
> After this patchset there will be only one user of local_t left: Mathieu's
> trace ringbuffer. Does it really need these ops?
> 

Besides my own ring buffer implementation in LTTng, at least Steven's
kernel/trace/ring_buffer.c (in mainline) uses this too. We would need a
way to directly map to the same resulting behavior with per-cpu
variables.

In LTTng, I use local_cmpxchg, local_read, local_add and, in some
setups, local_add_return to manage the write counter and commit
counters. These per-cpu counters are kept in per-cpu buffer management
data allocated for each data collection "channel".

The current way I allocate this structure for all cpus is:

  chan->buf = alloc_percpu(struct ltt_chanbuf);

But note that each struct ltt_chanbuf contains a pointer to an array
containing each sub-buffer commit counters for the given buffer:

struct commit_counters {
        local_t cc;
        local_t cc_sb;                  /* Incremented _once_ at sb switch */
        local_t events;                 /* Event count */
};

struct ltt_chanbuf {
        struct ltt_chanbuf_alloc a;     /* Parent. First field. */
        /* First 32 bytes cache-hot cacheline */
        local_t offset;                 /* Current offset in the buffer */
        struct commit_counters *commit_count;
                                        /* Commit count per sub-buffer */
        atomic_long_t consumed;         /*
                                         * Current offset in the buffer
                                         * standard atomic access (shared)
                                         */
....

So I think accessing the "local_t offset" through percpu pointers should
be fine if I allocate struct ltt_chanbuf through the per cpu API.
However, I wonder how to deal with the commit_count counters, because
there is an indirection level.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-15  4:03   ` Tejun Heo
@ 2009-12-15 22:41     ` Rusty Russell
  2009-12-16 16:10       ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Rusty Russell @ 2009-12-15 22:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Mathieu Desnoyers

On Tue, 15 Dec 2009 02:33:50 pm Tejun Heo wrote:
> (Rusty Russell cc'd.)
> 
> On 12/15/2009 07:03 AM, Christoph Lameter wrote:
> > @@ -479,15 +467,15 @@ static inline int try_module_get(struct 
> >  	int ret = 1;
> >  
> >  	if (module) {
> > -		unsigned int cpu = get_cpu();
> >  		if (likely(module_is_live(module))) {
> > -			local_inc(__module_ref_addr(module, cpu));
> > +			preempt_disable();
> > +			__this_cpu_inc(module->refptr->count);
> >  			trace_module_get(module, _THIS_IP_,
> > -				local_read(__module_ref_addr(module, cpu)));
> > +				__this_cpu_read(module->refptr->count));
> > +			preempt_enable();
> >  		}
> >  		else
> >  			ret = 0;
> > -		put_cpu();
> 
> I think you need preemption disabled while checking whether
> module_is_live().  The state is protected by stop_machine or
> synchronize_sched().

Yes.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion
  2009-12-15 15:04     ` Christoph Lameter
@ 2009-12-16  0:53       ` Tejun Heo
  2009-12-16 14:55         ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2009-12-16  0:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Hello, Christoph.

On 12/16/2009 12:04 AM, Christoph Lameter wrote:
> A static array would have to be dimensioned to NR_CPUS. That is one thing
> we are trying to avoid.
> 
> The assignment of the pageset "percpu" pointer does not mean that the pcp
> is usable. It must first be properly initialized through setup_pageset().
>
> setup_pageset() is run for each cpu. zone->pageset is the same for all
> cpus that is why it is in zone_pcp_init() and not in setup_pageset().
> 
> The boot pageset initialization was moved into __build_all_zonelists(). We
> could move the zone->pageset initialization there too?

Maybe that is a bit less scary. (at least for me :-) The reason why
I'm a bit worried is that different architectures handle percpu
pointers differently before setup_per_cpu_areas().  x86 sets up the
offsets and stuff such that cpu0 can access the original percpu
section in the kernel image.  ia64 sets up everything properly way
before setup_per_cpu_areas() and in some archs percpu pointers are
completely invalid before setup_per_cpu_areas().  So, a percpu pointer
being handled in generic code that runs before percpu setup is a bit
worrying.

Another thing is that there were attempts to simplify memory
initialization stages such that bootmem is removed and page / k*
allocators can be used earlier which kind of puts percpu allocator in
the dependency loop but I don't think it's something we need to worry
about at this point.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-15 17:43 ` Mathieu Desnoyers
@ 2009-12-16  0:58   ` Tejun Heo
  2009-12-16  1:40     ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2009-12-16  0:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Steven Rostedt

Hello, Mathieu.

On 12/16/2009 02:43 AM, Mathieu Desnoyers wrote:
> So I think accessing the "local_t offset" through percpu pointers should
> be fine if I allocate struct ltt_chanbuf through the per cpu API.
> However, I wonder how to deal with the commit_count counters, because
> there is an indirection level.

Are they different in numbers for different cpus?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-16  0:58   ` Tejun Heo
@ 2009-12-16  1:40     ` Mathieu Desnoyers
  2009-12-16  1:46       ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-12-16  1:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Steven Rostedt

* Tejun Heo (tj@kernel.org) wrote:
> Hello, Mathieu.
> 
> On 12/16/2009 02:43 AM, Mathieu Desnoyers wrote:
> > So I think accessing the "local_t offset" through percpu pointers should
> > be fine if I allocate struct ltt_chanbuf through the per cpu API.
> > However, I wonder how to deal with the commit_count counters, because
> > there is an indirection level.
> 
> Are they different in numbers for different cpus?

Nope, there is the same number of sub-buffers for each per-cpu buffer.
I just want to see if supplementary indirections are allowed after
dereferencing the per-cpu pointer?

Thanks,

Mathieu

> 
> Thanks.
> 
> -- 
> tejun

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-16  1:40     ` Mathieu Desnoyers
@ 2009-12-16  1:46       ` Tejun Heo
  2009-12-17 13:39         ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2009-12-16  1:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Steven Rostedt

Hello,

On 12/16/2009 10:40 AM, Mathieu Desnoyers wrote:
> Nope, there is the same number of sub-buffers for each per-cpu buffer.
> I just want to see if supplementary indirections are allowed after
> dereferencing the per-cpu pointer ?

Hmmm... you can store a percpu pointer in a variable.  If there is the
same number of commit_count entries for each cpu, they can be allocated
using the percpu allocator and their pointers can be stored, offset and
dereferenced.  Would that be enough?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-15 17:06 ` Mel Gorman
@ 2009-12-16 14:44   ` Christoph Lameter
  2009-12-16 21:36     ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-16 14:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

On Tue, 15 Dec 2009, Mel Gorman wrote:

> What kernel are these patches based on? They do not cleanly apply and
> when fixed up, they do not build against 2.6.32.

Upstream. Linus tree.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion
  2009-12-16  0:53       ` Tejun Heo
@ 2009-12-16 14:55         ` Christoph Lameter
  2009-12-17  0:23           ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-16 14:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

On Wed, 16 Dec 2009, Tejun Heo wrote:

> > The boot pageset initialization was moved into __build_all_zonelists(). We
> > could move the zone->pageset initialization there too?
>
> Maybe that is a bit less scary. (at least for me :-) The reason why

Well, sorry, moving the ->pageset assignment to __build_all_zonelists does not
work since __build_all_zonelists does not scan through
zones. zone_pcp_init is reliably called first for all zones. I think we should
just leave it as is.

> I'm a bit worried is that different architectures handle percpu
> pointers differently before setup_per_cpu_areas().  x86 sets up the
> offsets and stuff such that cpu0 can access the original percpu
> section in the kernel image.  ia64 sets up everything properly way
> before setup_per_cpu_areas() and in some archs percpu pointers are
> completely invalid before setup_per_cpu_areas().  So, percpu pointer
> being handled in generic code which is being called before percpu
> setup is a bit worrying.

True, but we are not dereferencing a per cpu pointer here. It is simply the
assignment of the raw native per cpu address generated by the
linker. This address is unaffected by allocator bootstrap.

> Another thing is that there were attempts to simplify memory
> initialization stages such that bootmem is removed and page / k*
> allocators can be used earlier which kind of puts percpu allocator in
> the dependency loop but I don't think it's something we need to worry
> about at this point.

We already worried about this earlier. The initialization we are talking
about does not require the per cpu allocator to be up. It just requires
the static per cpu areas to function with a single pcp while the zonelists
are being built.

The assignment of the final per cpu areas (dynamically allocated) occurs
after all the other memory allocators have been bootstrapped.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-15 22:41     ` Rusty Russell
@ 2009-12-16 16:10       ` Christoph Lameter
  2009-12-17  0:25         ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-16 16:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg,
	Mathieu Desnoyers

Fixup patch:

Subject: Extend preempt_disable section around module_is_live()

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/module.h |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2009-12-16 08:57:37.000000000 -0600
+++ linux-2.6/include/linux/module.h	2009-12-16 08:58:09.000000000 -0600
@@ -467,15 +467,17 @@ static inline int try_module_get(struct
 	int ret = 1;

 	if (module) {
+		preempt_disable();
+
 		if (likely(module_is_live(module))) {
-			preempt_disable();
 			__this_cpu_inc(module->refptr->count);
 			trace_module_get(module, _THIS_IP_,
 				__this_cpu_read(module->refptr->count));
-			preempt_enable();
 		}
 		else
 			ret = 0;
+
+		preempt_enable();
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-16 14:44   ` Christoph Lameter
@ 2009-12-16 21:36     ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-16 21:36 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

With today's git tree I get a reject. Will rediff when rc1 is out.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion
  2009-12-16 14:55         ` Christoph Lameter
@ 2009-12-17  0:23           ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2009-12-17  0:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Hello,

On 12/16/2009 11:55 PM, Christoph Lameter wrote:
>>> The boot pageset initialization was moved into __build_all_zonelists(). We
>>> could move the zone->pageset initialization there too?
>>
>> Maybe that is a bit less scary. (at least for me :-) The reason why
> 
> Well, sorry, moving the ->pageset assignment to __build_all_zonelists does not
> work since __build_all_zonelists does not scan through
> zones. zone_pcp_init is reliably called first for all zones. I think we should
> just leave it as is.

Ah, well, alright.

>> I'm a bit worried is that different architectures handle percpu
>> pointers differently before setup_per_cpu_areas().  x86 sets up the
>> offsets and stuff such that cpu0 can access the original percpu
>> section in the kernel image.  ia64 sets up everything properly way
>> before setup_per_cpu_areas() and in some archs percpu pointers are
>> completely invalid before setup_per_cpu_areas().  So, percpu pointer
>> being handled in generic code which is being called before percpu
>> setup is a bit worrying.
> 
> True but we are not dereferencing a per cpu pointer here. It is simply the
> assignment of the unreferenced native per cpu address generated by the
> linker. This address is unaffected by allocator bootstrap.

Yeap, the code is fine.  Probably I'm just being a bit paranoid.  If
you don't mind, can you please add a comment pointing out that the
percpu pointer isn't dereferenced before the later initialization is done,
which happens after the percpu setup?
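
For illustration, one possible wording for such a comment (hypothetical
text, placed next to the assignment quoted earlier):

	/*
	 * Only the address of the static boot_pageset is stored here.
	 * The percpu pointer is not dereferenced until setup_pageset()
	 * runs for each cpu, after the percpu areas have been set up.
	 */
	zone->pageset = &per_cpu_var(boot_pageset);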

> The assignment of the final per cpu areas (dynamically allocated) occurs
> after all the other memory allocators have been bootstrapped.

Great, thanks for the explanation.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-16 16:10       ` Christoph Lameter
@ 2009-12-17  0:25         ` Tejun Heo
  2009-12-17  5:42           ` Rusty Russell
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2009-12-17  0:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rusty Russell, linux-kernel, Mel Gorman, Pekka Enberg,
	Mathieu Desnoyers

Hello,

On 12/17/2009 01:10 AM, Christoph Lameter wrote:
> Fixup patch:
> 
> Subject: Extend preempt_disable section around module_is_live()
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Looks good to me but as the patch currently isn't in any tree, it'll
probably be better to merge it into the original patch.  Rusty, do you
want to take this one?  If not, I can route it through percpu tree.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-17  0:25         ` Tejun Heo
@ 2009-12-17  5:42           ` Rusty Russell
  0 siblings, 0 replies; 39+ messages in thread
From: Rusty Russell @ 2009-12-17  5:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Mathieu Desnoyers

On Thu, 17 Dec 2009 10:55:13 am Tejun Heo wrote:
> Hello,
> 
> On 12/17/2009 01:10 AM, Christoph Lameter wrote:
> > Fixup patch:
> > 
> > Subject: Extend preempt_disable section around module_is_live()
> > 
> > Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> Looks good to me but as the patch currently isn't in any tree, it'll
> probably be better to merge it into the original patch.  Rusty, do you
> want to take this one?  If not, I can route it through percpu tree.

Easiest to go with your stuff I think:

	Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-16  1:46       ` Tejun Heo
@ 2009-12-17 13:39         ` Mathieu Desnoyers
  2009-12-17 19:28           ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-12-17 13:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Steven Rostedt

* Tejun Heo (tj@kernel.org) wrote:
> Hello,
> 
> On 12/16/2009 10:40 AM, Mathieu Desnoyers wrote:
> > Nope, there is the same number of sub-buffers for each per-cpu buffer.
> > I just want to see if supplementary indirections are allowed after
> > dereferencing the per-cpu pointer ?
> 
> Hmmm... you can store percpu pointer to a variable.  If there are the
> same number of commit_count for each cpu, they can be allocated using
> percpu allocator and their pointers can be stored, offset and
> dereferenced.  Would that be enough?

Yes, I think I could allocate, from the channel structure perspective:

- A percpu pointer to the per-cpu buffer structures
- A percpu pointer to the per-cpu commit counters.

This should fix my problem. The main change here is that the pointer to
the commit counters would not be located in the per-cpu buffer
structures anymore.
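
As a rough sketch (the names here are illustrative only, not the actual
LTTng code), the allocation would then look something like:

	struct ltt_chan_sketch {
		struct ltt_chanbuf *buf;              /* one struct per cpu */
		struct commit_counters *commit_count; /* n_subbufs counters per cpu */
		size_t n_subbufs;
	};

	static int chan_sketch_alloc(struct ltt_chan_sketch *chan, size_t n_subbufs)
	{
		chan->n_subbufs = n_subbufs;
		chan->buf = alloc_percpu(struct ltt_chanbuf);
		if (!chan->buf)
			return -ENOMEM;
		chan->commit_count =
			__alloc_percpu(n_subbufs * sizeof(struct commit_counters),
				       __alignof__(struct commit_counters));
		if (!chan->commit_count) {
			free_percpu(chan->buf);
			return -ENOMEM;
		}
		return 0;
	}

with a given counter reached as per_cpu_ptr(chan->commit_count, cpu)[sb_idx].cc,
or through this_cpu operations once the local_t members become plain counters.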

However, I would need:

this_cpu_cmpxchg(scalar, oldv, newv)
  (maps to x86 cmpxchg)

this_cpu_add_return(scalar, value)
  (maps to x86 xadd)

too. Is that a planned addition ?

(while we are at it, we might as well add the xchg instruction,
although it has an implied LOCK prefix on x86).
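
To make the intended semantics concrete, here is a minimal sketch built on
the existing generic __this_cpu_* helpers (illustration only; a real
implementation would map these to cmpxchg and xadd on x86):

	#define sketch_this_cpu_add_return(pcp, val)			\
	({								\
		typeof(pcp) ret__;					\
		preempt_disable();					\
		__this_cpu_add((pcp), (val));				\
		ret__ = __this_cpu_read(pcp);				\
		preempt_enable();					\
		ret__;							\
	})

	#define sketch_this_cpu_cmpxchg(pcp, oldv, newv)		\
	({								\
		typeof(pcp) ret__;					\
		preempt_disable();					\
		ret__ = __this_cpu_read(pcp);				\
		if (ret__ == (oldv))					\
			__this_cpu_write((pcp), (newv));		\
		preempt_enable();					\
		ret__;							\
	})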

Thanks,

Mathieu

> 
> Thanks.
> 
> -- 
> tejun

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-17 13:39         ` Mathieu Desnoyers
@ 2009-12-17 19:28           ` Christoph Lameter
  2009-12-17 20:25             ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-17 19:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg, Steven Rostedt

> However, I would need:
>
> this_cpu_cmpxchg(scalar, oldv, newv)
>   (maps to x86 cmpxchg)
>
> this_cpu_add_return(scalar, value)
>   (maps to x86 xadd)
>
> too. Is that a planned addition ?

It was not necessary. It's easy to add though.

> (while we are at it, we might as well add the xchg instruction,
> although it has an implied LOCK prefix on x86).

Well yeah, that's a thorny one. One could use the cmpxchg instead?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-17 19:28           ` Christoph Lameter
@ 2009-12-17 20:25             ` Mathieu Desnoyers
  2009-12-17 20:43               ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-12-17 20:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg, Steven Rostedt

* Christoph Lameter (cl@linux-foundation.org) wrote:
> > However, I would need:
> >
> > this_cpu_cmpxchg(scalar, oldv, newv)
> >   (maps to x86 cmpxchg)
> >
> > this_cpu_add_return(scalar, value)
> >   (maps to x86 xadd)
> >
> > too. Is that a planned addition ?
> 
> It was not necessary. Its easy to add though.
> 
> > (while we are at it, we might as well add the xchg instruction,
> > although it has an implied LOCK prefix on x86).
> 
> Well yeah, that's a thorny one. One could use the cmpxchg instead?

Yes, although maybe it would make sense to encapsulate it in an xchg
primitive anyway, in case some architecture has a better xchg than x86.
For instance, powerpc, with its linked load/store conditional, can skip
a comparison for xchg that's otherwise required for cmpxchg.

Some quick test on my Intel Xeon E5405:

local cmpxchg:  14 cycles
xchg:           18 cycles

So yes, indeed, the non-LOCK prefixed local cmpxchg seems a bit faster
than the xchg, given the latter has an implied LOCK prefix.

Code used for local cmpxchg:
                old = var;
                do {
                        ret = cmpxchg_local(&var, old, 4);
                        if (likely(ret == old))
                                break;
                        old = ret;
                } while (1);

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-17 20:25             ` Mathieu Desnoyers
@ 2009-12-17 20:43               ` Christoph Lameter
  2009-12-18  0:13                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2009-12-17 20:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg, Steven Rostedt

On Thu, 17 Dec 2009, Mathieu Desnoyers wrote:

> Some quick test on my Intel Xeon E5405:
>
> local cmpxchg:  14 cycles
> xchg:           18 cycles
>
> So yes, indeed, the non-LOCK prefixed local cmpxchg seems a bit faster
> than the xchg, given the latter has an implied LOCK prefix.
>
> Code used for local cmpxchg:
>                 old = var;
>                 do {
>                         ret = cmpxchg_local(&var, old, 4);
>                         if (likely(ret == old))
>                                 break;
>                         old = ret;
>                 } while (1);
>

Great. Could you also put that into "patch-format"?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-17 20:43               ` Christoph Lameter
@ 2009-12-18  0:13                 ` Mathieu Desnoyers
  2009-12-18  0:27                   ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-12-18  0:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg, Steven Rostedt

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Thu, 17 Dec 2009, Mathieu Desnoyers wrote:
> 
> > Some quick test on my Intel Xeon E5405:
> >
> > local cmpxchg:  14 cycles
> > xchg:           18 cycles
> >
> > So yes, indeed, the non-LOCK prefixed local cmpxchg seems a bit faster
> > than the xchg, given the latter has an implied LOCK prefix.
> >
> > Code used for local cmpxchg:
> >                 old = var;
> >                 do {
> >                         ret = cmpxchg_local(&var, old, 4);
> >                         if (likely(ret == old))
> >                                 break;
> >                         old = ret;
> >                 } while (1);
> >
> 
> Great. Could you also put that into "patch-format"?
> 

Sure, can you point me to a git tree I should work on top of which
includes the per cpu infrastructure to extend ?

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup
  2009-12-18  0:13                 ` Mathieu Desnoyers
@ 2009-12-18  0:27                   ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2009-12-18  0:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg, Steven Rostedt

On Thu, 17 Dec 2009, Mathieu Desnoyers wrote:

> Sure, can you point me to a git tree I should work on top of which
> includes the per cpu infrastructure to extend ?

Linus' git tree contains what you need. I have an early draft here of a
patch to implement the generic portions. Unfinished. I hope I have time to
complete this. Feel free to complete it but keep me posted so that I won't
repeat anything you do.

The modifications to asm-generic/cmpxchg-local won't work since we need to
do this_cpu_ptr() pointer calculations within the protected sections. I
was in the middle of getting rid of it when I found it was time to go
home...
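
For reference, the shape that should work is to compute the per cpu address
only inside the protected section, roughly like this (a sketch only, reusing
the ____cmpxchg_local_generic helper introduced in the draft below):

	#define sketch_irqsafe_cpu_cmpxchg(pcp, oldv, newv)			\
	({									\
		unsigned long flags__;						\
		typeof(pcp) ret__;						\
		local_irq_save(flags__);					\
		/* per cpu address taken only after irqs are off */		\
		ret__ = (typeof(pcp))____cmpxchg_local_generic(			\
				__this_cpu_ptr(&(pcp)),				\
				(unsigned long)(oldv), (unsigned long)(newv),	\
				sizeof(pcp));					\
		local_irq_restore(flags__);					\
		ret__;								\
	})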


---
 include/asm-generic/cmpxchg-local.h |   24 ++++-
 include/linux/percpu.h              |  151 ++++++++++++++++++++++++++++++++++++
 2 files changed, 169 insertions(+), 6 deletions(-)

Index: linux-2.6/include/asm-generic/cmpxchg-local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/cmpxchg-local.h	2009-12-17 17:44:01.000000000 -0600
+++ linux-2.6/include/asm-generic/cmpxchg-local.h	2009-12-17 17:46:31.000000000 -0600
@@ -6,13 +6,12 @@
 extern unsigned long wrong_size_cmpxchg(volatile void *ptr);

 /*
- * Generic version of __cmpxchg_local (disables interrupts). Takes an unsigned
- * long parameter, supporting various types of architectures.
+ * Generic version of __cmpxchg_local.
  */
-static inline unsigned long __cmpxchg_local_generic(volatile void *ptr,
+static inline unsigned long ____cmpxchg_local_generic(volatile void *ptr,
 		unsigned long old, unsigned long new, int size)
 {
-	unsigned long flags, prev;
+	unsigned long prev;

 	/*
 	 * Sanity checking, compile-time.
@@ -20,7 +19,6 @@ static inline unsigned long __cmpxchg_lo
 	if (size == 8 && sizeof(unsigned long) != 8)
 		wrong_size_cmpxchg(ptr);

-	local_irq_save(flags);
 	switch (size) {
 	case 1: prev = *(u8 *)ptr;
 		if (prev == old)
@@ -41,11 +39,25 @@ static inline unsigned long __cmpxchg_lo
 	default:
 		wrong_size_cmpxchg(ptr);
 	}
-	local_irq_restore(flags);
 	return prev;
 }

 /*
+ * Generic version of __cmpxchg_local (disables interrupts). Takes an unsigned
+ * long parameter, supporting various types of architectures.
+ */
+static inline unsigned long __cmpxchg_local_generic(volatile void *ptr,
+		unsigned long old, unsigned long new, int size)
+{
+	unsigned long flags, r;
+
+	local_irq_save(flags);
+	r = ____cmpxchg_local_generic(ptr, old, new, size);
+	local_irq_restore(flags);
+	return r;
+}
+
+/*
  * Generic version of __cmpxchg64_local. Takes an u64 parameter.
  */
 static inline u64 __cmpxchg64_local_generic(volatile void *ptr,
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2009-12-17 17:31:10.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2009-12-17 18:23:02.000000000 -0600
@@ -443,6 +443,48 @@ do {									\
 # define this_cpu_xor(pcp, val)		__pcpu_size_call(this_cpu_or_, (pcp), (val))
 #endif

+#ifndef this_cpu_cmpxchg
+# ifndef this_cpu_cmpxchg_1
+#  define this_cpu_cmpxchg_1(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 1)
+# endif
+# ifndef this_cpu_cmpxchg_2
+#  define this_cpu_cmpxchg_2(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 2)
+# endif
+# ifndef this_cpu_cmpxchg_4
+#  define this_cpu_cmpxchg_4(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 4)
+# endif
+# ifndef this_cpu_cmpxchg_8
+#  define this_cpu_cmpxchg_8(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 8)
+# endif
+# define this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(this_cpu_cmpxchg_, (old), (new))
+#endif
+
+#define _this_cpu_generic_xchg_op(pcp, val)				\
+  ({									\
+	typeof(pcp) __tmp_var__;					\
+	preempt_disable();						\
+	__tmp_var__ = __this_cpu_read(pcp);				\
+	__this_cpu_write((pcp), (val));					\
+	preempt_enable();						\
+	__tmp_var__;							\
+  })
+
+#ifndef this_cpu_xchg
+# ifndef this_cpu_xchg_1
+#  define this_cpu_xchg_1(pcp, val) _this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# ifndef this_cpu_xchg_2
+#  define this_cpu_xchg_2(pcp, val) _this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# ifndef this_cpu_xchg_4
+#  define this_cpu_xchg_4(pcp, val) _this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# ifndef this_cpu_xchg_8
+#  define this_cpu_xchg_8(pcp, val) _this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# define this_cpu_xchg(pcp, val) __pcpu_size_call_return(this_cpu_xchg_, (val))
+#endif
+
 /*
  * Generic percpu operations that do not require preemption handling.
  * Either we do not care about races or the caller has the
@@ -594,6 +636,46 @@ do {									\
 # define __this_cpu_xor(pcp, val)	__pcpu_size_call(__this_cpu_xor_, (pcp), (val))
 #endif

+#ifndef __this_cpu_cmpxchg
+# ifndef __this_cpu_cmpxchg_1
+#  define __this_cpu_cmpxchg_1(pcp, old, new) ____cmpxchg_local_generic(__this_cpu_ptr(pcp), (old), (new), 1)
+# endif
+# ifndef __this_cpu_cmpxchg_2
+#  define __this_cpu_cmpxchg_2(pcp, old, new) ____cmpxchg_local_generic(__this_cpu_ptr(pcp), (old), (new), 2)
+# endif
+# ifndef __this_cpu_cmpxchg_4
+#  define __this_cpu_cmpxchg_4(pcp, old, new) ____cmpxchg_local_generic(__this_cpu_ptr(pcp), (old), (new), 4)
+# endif
+# ifndef __this_cpu_cmpxchg_8
+#  define __this_cpu_cmpxchg_8(pcp, old, new) ____cmpxchg_local_generic(__this_cpu_ptr(pcp), (old), (new), 8)
+# endif
+# define __this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
+#endif
+
+#define __this_cpu_generic_xchg_op(pcp, val)				\
+  ({									\
+	typeof(pcp) __tmp_var__;					\
+	__tmp_var__ = __this_cpu_read(pcp);				\
+	__this_cpu_write((pcp), (val));					\
+	__tmp_var__;							\
+  })
+
+#ifndef __this_cpu_xchg
+# ifndef __this_cpu_xchg_1
+#  define __this_cpu_xchg_1(pcp, val) __this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# ifndef __this_cpu_xchg_2
+#  define __this_cpu_xchg_2(pcp, val) __this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# ifndef __this_cpu_xchg_4
+#  define __this_cpu_xchg_4(pcp, val) __this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# ifndef __this_cpu_xchg_8
+#  define __this_cpu_xchg_8(pcp, val) __this_cpu_generic_xchg_op((pcp), (val))
+# endif
+# define __this_cpu_xchg(pcp, val) __pcpu_size_call_return(__this_cpu_xchg_, (pcp), (val))
+#endif
+
 /*
  * IRQ safe versions of the per cpu RMW operations. Note that these operations
  * are *not* safe against modification of the same variable from another
@@ -709,4 +791,76 @@ do {									\
 # define irqsafe_cpu_xor(pcp, val) __pcpu_size_call(irqsafe_cpu_xor_, (val))
 #endif

+#ifndef irqsafe_cpu_cmpxchg
+# ifndef irqsafe_cpu_cmpxchg_1
+#  define irqsafe_cpu_cmpxchg_1(pcp, old, new) __cmpxchg_local_generic(__this_cpu_ptr(&(pcp)), (old), (new), 1)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_2
+#  define irqsafe_cpu_cmpxchg_2(pcp, old, new) __cmpxchg_local_generic(__this_cpu_ptr(&(pcp)), (old), (new), 2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_4
+#  define irqsafe_cpu_cmpxchg_4(pcp, old, new) __cmpxchg_local_generic(__this_cpu_ptr(&(pcp)), (old), (new), 4)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_8
+#  define irqsafe_cpu_cmpxchg_8(pcp, old, new) __cmpxchg_local_generic(__this_cpu_ptr(&(pcp)), (old), (new), 8)
+# endif
+# define irqsafe_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(irqsafe_cpu_cmpxchg_, (pcp), (old), (new))
+#endif
+
+#define irqsafe_generic_xchg_op(pcp, val)				\
+  ({									\
+	typeof(pcp) __tmp_var__;					\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
+	__tmp_var__ = __this_cpu_read(pcp);				\
+	__this_cpu_write((pcp), (val));					\
+	local_irq_restore(flags);					\
+	__tmp_var__;							\
+  })
+
+#ifndef irqsafe_cpu_xchg
+# ifndef irqsafe_cpu_xchg_1
+#  define irqsafe_cpu_xchg_1(pcp, val) irqsafe_generic_xchg_op((pcp), (val))
+# endif
+# ifndef irqsafe_cpu_xchg_2
+#  define irqsafe_cpu_xchg_2(pcp, val) irqsafe_generic_xchg_op((pcp), (val))
+# endif
+# ifndef irqsafe_cpu_xchg_4
+#  define irqsafe_cpu_xchg_4(pcp, val) irqsafe_generic_xchg_op((pcp), (val))
+# endif
+# ifndef irqsafe_cpu_xchg_8
+#  define irqsafe_cpu_xchg_8(pcp, val) irqsafe_generic_xchg_op((pcp), (val))
+# endif
+# define irqsafe_cpu_xchg(pcp, val) __pcpu_size_call_return(irqsafe_cpu_xchg_, (pcp), (val))
+#endif
+
+#define _this_cpu_generic_add_return_op(pcp, val)			\
+  ({									\
+	typeof(pcp) __ret;						\
+	preempt_disable();						\
+	__ret = (*__this_cpu_ptr(&(pcp)) += (val));			\
+	preempt_enable();						\
+	__ret;								\
+  })
+
+#ifndef this_cpu_add_return
+# ifndef this_cpu_add_return_1
+#  define this_cpu_add_return_1(pcp, val) _this_cpu_generic_add_return_op((pcp), (val))
+# endif
+# ifndef this_cpu_add_return_2
+#  define this_cpu_add_return_2(pcp, val) _this_cpu_generic_add_return_op((pcp), (val))
+# endif
+# ifndef this_cpu_add_return_4
+#  define this_cpu_add_return_4(pcp, val) _this_cpu_generic_add_return_op((pcp), (val))
+# endif
+# ifndef this_cpu_add_return_8
+#  define this_cpu_add_return_8(pcp, val) _this_cpu_generic_add_return_op((pcp), (val))
+# endif
+# define this_cpu_add_return(pcp, val) __pcpu_size_call_return(this_cpu_add_return_, (pcp), (val))
+#endif
+
+#ifndef irqsafe_cpu_add_return
+/* No generic fallback for irqsafe_cpu_add_return yet. */
+#endif
+
 #endif /* __LINUX_PERCPU_H */
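
For illustration only, a minimal sketch of how the new xchg/add_return
operations might be used once the fallbacks above are in place. The per cpu
variable and the two helpers below (demo_events, demo_count_event,
demo_drain_events) are invented for this example and are not part of the
patch:

#include <linux/percpu.h>

/* Hypothetical per cpu event counter, used only for this example. */
static DEFINE_PER_CPU(unsigned long, demo_events);

/* Count one event and return the new per cpu total.  With the generic
 * fallback this becomes a preempt-safe read-modify-write. */
static unsigned long demo_count_event(void)
{
	return this_cpu_add_return(demo_events, 1);
}

/* Return and reset this cpu's count in one step, e.g. from a periodic
 * drain routine.  this_cpu_xchg returns the old value. */
static unsigned long demo_drain_events(void)
{
	return this_cpu_xchg(demo_events, 0);
}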

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-12-18  0:28 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-14 22:03 [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Christoph Lameter
2009-12-14 22:03 ` [this_cpu_xx V7 1/8] this_cpu_ops: page allocator conversion Christoph Lameter
2009-12-15  3:53   ` Tejun Heo
2009-12-15 15:04     ` Christoph Lameter
2009-12-16  0:53       ` Tejun Heo
2009-12-16 14:55         ` Christoph Lameter
2009-12-17  0:23           ` Tejun Heo
2009-12-14 22:03 ` [this_cpu_xx V7 2/8] this_cpu ops: Remove pageset_notifier Christoph Lameter
2009-12-14 22:03 ` [this_cpu_xx V7 3/8] Use this_cpu operations in slub Christoph Lameter
2009-12-14 22:03 ` [this_cpu_xx V7 4/8] SLUB: Get rid of dynamic DMA kmalloc cache allocation Christoph Lameter
2009-12-14 22:03 ` [this_cpu_xx V7 5/8] this_cpu: Remove slub kmem_cache fields Christoph Lameter
2009-12-14 22:03 ` [this_cpu_xx V7 6/8] Make slub statistics use this_cpu_inc Christoph Lameter
2009-12-15  6:24   ` Eric Dumazet
2009-12-15 14:46     ` Christoph Lameter
2009-12-15 14:59       ` Eric Dumazet
2009-12-14 22:03 ` [this_cpu_xx V7 7/8] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
2009-12-15  4:03   ` Tejun Heo
2009-12-15 22:41     ` Rusty Russell
2009-12-16 16:10       ` Christoph Lameter
2009-12-17  0:25         ` Tejun Heo
2009-12-17  5:42           ` Rusty Russell
2009-12-14 22:03 ` [this_cpu_xx V7 8/8] Remove cpu_local_xx macros Christoph Lameter
2009-12-15  4:04   ` Tejun Heo
2009-12-15  6:37 ` [this_cpu_xx V7 0/8] Per cpu atomics in core allocators and cleanup Pekka Enberg
2009-12-15  6:47   ` Tejun Heo
2009-12-15 14:50     ` Christoph Lameter
2009-12-15 17:06 ` Mel Gorman
2009-12-16 14:44   ` Christoph Lameter
2009-12-16 21:36     ` Christoph Lameter
2009-12-15 17:43 ` Mathieu Desnoyers
2009-12-16  0:58   ` Tejun Heo
2009-12-16  1:40     ` Mathieu Desnoyers
2009-12-16  1:46       ` Tejun Heo
2009-12-17 13:39         ` Mathieu Desnoyers
2009-12-17 19:28           ` Christoph Lameter
2009-12-17 20:25             ` Mathieu Desnoyers
2009-12-17 20:43               ` Christoph Lameter
2009-12-18  0:13                 ` Mathieu Desnoyers
2009-12-18  0:27                   ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).