public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Purpose of the mm/slab.c changes
@ 2001-09-09 11:05 Manfred Spraul
  2001-09-09 14:26 ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Manfred Spraul @ 2001-09-09 11:05 UTC (permalink / raw)
  To: linux-kernel, Alan Cox, torvalds, andrea

What's the purpose of the mm/slab.c changes?

Linus, Alan, could you please drop them. Switching from 1 list to 3
lists is just a slowdown.

		2.4.9	2.4.9-ac9
km(1)		0x87	0x89
kf(1)		0xa9	0xa8
kf(cachep, 1)	0xad	0xad
km(4096)	0x7b	0x87	(+24% once loop overhead is subtracted!)
kf(4096)	0xcb	0xd1
kf(cachep,4096)	0xcc	0xd3

(cpu ticks on a Duron 700, UP kernel, 100 calls in a tight loop, loop
overhead (0x49) not removed)

Benchmarking with an SMP kernel is meaningless, since the actual list
changes are outside of the hot path - I did it, and the differences are
negligible (+-1 cpu tick).

And this is a benchmark with 100 calls in a tight loop - Andrea's patch
adds several additional branches; in the real world there would be an
even larger slowdown.

If there are real performance problems with the slab, then the not-yet
optimized parts should get changed:

* kmalloc(): the loop to find the correct slab is too slow. (The numbers
in the table are without the loop, i.e.
kmem_cache_alloc(kmem_find_general_cache(4096, SLAB_KERNEL),
SLAB_KERNEL);)

* finetune the size of the per-cpu caches
* finetune the amount of pages freed during kmem_cache_reap()
* replace the division in kmem_cache_free_one() with a multiplication.
(UP only, on SMP outside of the hot path)
* enable OPTIMIZE - could help on platforms without a fast
virt_addr->struct page lookup.
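The division-to-multiplication idea from the list above can be sketched in userspace like this; it is a minimal illustration of the reciprocal trick, not the actual slab.c code, and the names (`recip_init`, `obj_index`) are invented for the example. It assumes obj_size >= 2 and offsets well below 2^30:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: replace `offset / obj_size` (the division in
 * the object-freeing path) with a multiply by a precomputed
 * reciprocal. Valid here because slab offsets are small. */
struct recip {
	uint32_t mul;	/* ceil(2^32 / obj_size), needs obj_size >= 2 */
};

static inline void recip_init(struct recip *r, uint32_t obj_size)
{
	/* round up so truncation never undershoots for small offsets */
	r->mul = (uint32_t)((((uint64_t)1 << 32) + obj_size - 1) / obj_size);
}

static inline uint32_t obj_index(const struct recip *r, uint32_t offset)
{
	/* one multiply + shift instead of a hardware division */
	return (uint32_t)(((uint64_t)offset * r->mul) >> 32);
}
```

The multiply-shift is exact as long as the offset stays far below 2^32 divided by the rounding error, which any in-slab offset does.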

I have a patch with some optimizations, but I thought that would be 2.5
stuff.

--
	Manfred


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 11:05 Purpose of the mm/slab.c changes Manfred Spraul
@ 2001-09-09 14:26 ` Andrea Arcangeli
  2001-09-09 15:18   ` Manfred Spraul
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 14:26 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Alan Cox, torvalds

On Sun, Sep 09, 2001 at 01:05:34PM +0200, Manfred Spraul wrote:
> What's the purpose of the mm/slab.c changes?

it provides lifo allocations from both partial and unused slabs.  You
should benchmark it with a real load, not with dummy
allocations/freeing.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 14:26 ` Andrea Arcangeli
@ 2001-09-09 15:18   ` Manfred Spraul
  2001-09-09 15:33     ` Andrea Arcangeli
                       ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Manfred Spraul @ 2001-09-09 15:18 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, Alan Cox, torvalds

>
> it provides lifo allocations from both partial and unused slabs.
>

lifo/fifo for unused slabs is obviously superfluous - free is free, it
doesn't matter which free page is used first/last.
The partial slabs are already lifo: if a slab goes from full to partial,
it's added to the head of the partial list.

> You
> should benchmark it with a real load, not with dummy
> allocations/freeing.
>
Did you run any benchmarks? If yes, could you post them?

What I did was check the number of branches in the hot path
(increased), the code length (increased), and the execution time with
dummy loads (slower).

--
    Manfred




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 15:18   ` Manfred Spraul
@ 2001-09-09 15:33     ` Andrea Arcangeli
  2001-09-11 18:41       ` Manfred Spraul
  2001-09-09 16:12     ` Alan Cox
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 15:33 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Alan Cox, torvalds

On Sun, Sep 09, 2001 at 05:18:00PM +0200, Manfred Spraul wrote:
> >
> > it provides lifo allocations from both partial and unused slabs.
> >
> 
> lifo/fifo for unused slabs is obviously superfluous - free is free, it
> doesn't matter which free page is used first/last.

then why don't you also make the buddy allocator fifo, and the partial
list as well, if free is free and fifo/lifo doesn't matter?

> Did you run any benchmarks? If yes, could you post them?

I didn't run any specific benchmark for this change, but please let me
know if you can find any real benchmark that is hurt by it. I think
the cleanup and the potential for lifo in the free slabs are much more
significant than the other factors you mentioned. Of course there's less
probability of having to fall back to the free slabs rather than the
partial ones during allocations, but that doesn't mean it cannot
happen very often; I will gladly remove it if you prove me
wrong. All I'm saying here is that dummy allocations with no access
to the returned ram are not interesting numbers.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 15:18   ` Manfred Spraul
  2001-09-09 15:33     ` Andrea Arcangeli
@ 2001-09-09 16:12     ` Alan Cox
  2001-09-09 16:25     ` Linus Torvalds
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Alan Cox @ 2001-09-09 16:12 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Andrea Arcangeli, linux-kernel, Alan Cox, torvalds

> lifo/fifo for unused slabs is obviously superfluous - free is free, it
> doesn't matter which free page is used first/last.

Which pages are likely to be in the CPU caches?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 15:18   ` Manfred Spraul
  2001-09-09 15:33     ` Andrea Arcangeli
  2001-09-09 16:12     ` Alan Cox
@ 2001-09-09 16:25     ` Linus Torvalds
  2001-09-09 16:47       ` Alan Cox
  2001-09-15  0:29       ` Albert D. Cahalan
  2001-09-09 20:23     ` Rik van Riel
  2001-09-10  2:28     ` Daniel Phillips
  4 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2001-09-09 16:25 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Andrea Arcangeli, linux-kernel, Alan Cox


On Sun, 9 Sep 2001, Manfred Spraul wrote:
> >
> > it provides lifo allocations from both partial and unused slabs.
> >
>
> lifo/fifo for unused slabs is obviously superfluous - free is free, it
> doesn't matter which free page is used first/last.

You're full of crap.

LIFO is obviously superior due to cache re-use.

		Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 16:25     ` Linus Torvalds
@ 2001-09-09 16:47       ` Alan Cox
  2001-09-09 16:55         ` Manfred Spraul
                           ` (4 more replies)
  2001-09-15  0:29       ` Albert D. Cahalan
  1 sibling, 5 replies; 28+ messages in thread
From: Alan Cox @ 2001-09-09 16:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Alan Cox

> > doesn't matter which free page is used first/last.
> 
> You're full of crap.
> LIFO is obviously superior due to cache re-use.

Interesting question, however: on SMP without sufficient per-CPU slab
caches, is that still the case?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 16:47       ` Alan Cox
@ 2001-09-09 16:55         ` Manfred Spraul
  2001-09-09 17:01         ` Linus Torvalds
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Manfred Spraul @ 2001-09-09 16:55 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linus Torvalds, Andrea Arcangeli, linux-kernel

Alan Cox wrote:
> 
> > > doesn't matter which free page is used first/last.
> >
> > You're full of crap.
> > LIFO is obviously superior due to cache re-use.
> 
> Interesting question, however: on SMP without sufficient per-CPU slab
> caches, is that still the case?

Correct. SMP was perfectly LIFO even without Andrea's changes.

I thought Andrea tried to reduce fragmentation, which is why I wrote
"free is free".

But even for cache re-use his changes are not a big change: the main
fifo/lifo ordering on UP is mandated by the defragmentation property of
the slab allocator.

Afaics there is exactly one case where my code is not lifo and Andrea's
is: kmem_cache_free frees the last object in a slab, each slab contains
more than one object, and there are no further partial slabs.
In all other cases Andrea just adds list_del(); list_add() instead of
changes to the firstnotfull pointer.

full->partial is/was lifo,
partial->partial doesn't change the lists at all
partial->empty was fifo, is now lifo_if_no_partial_slab_exists
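The transitions above can be sketched with a userspace mock of the kernel's list_head; the structures and the `mock_free_one` helper below are invented stand-ins for illustration, not the real slab.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace mock of the kernel's list_head, to illustrate the
 * full -> partial -> empty slab transitions. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static void list_add(struct list_head *n, struct list_head *h)
{	/* insert at head: LIFO */
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}
static void list_del(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

struct mock_slab  { struct list_head list; int inuse, num; };
struct mock_cache { struct list_head full, partial, empty; };

/* Freeing one object moves the slab between lists, head-first. */
static void mock_free_one(struct mock_cache *c, struct mock_slab *s)
{
	int was_full = (s->inuse == s->num);
	s->inuse--;
	if (was_full) {			/* full -> partial: LIFO */
		list_del(&s->list);
		list_add(&s->list, &c->partial);
	} else if (s->inuse == 0) {	/* partial -> empty */
		list_del(&s->list);
		list_add(&s->list, &c->empty);
	}				/* partial -> partial: untouched */
}
```

Since list_add always inserts at the head, a slab that just went full->partial is the first one found on the next allocation, which is the LIFO property under discussion.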

--
	Manfred

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 16:47       ` Alan Cox
  2001-09-09 16:55         ` Manfred Spraul
@ 2001-09-09 17:01         ` Linus Torvalds
  2001-09-09 17:22           ` Manfred Spraul
                             ` (2 more replies)
  2001-09-09 17:30         ` Andrea Arcangeli
                           ` (2 subsequent siblings)
  4 siblings, 3 replies; 28+ messages in thread
From: Linus Torvalds @ 2001-09-09 17:01 UTC (permalink / raw)
  To: Alan Cox; +Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel


On Sun, 9 Sep 2001, Alan Cox wrote:
>
> > LIFO is obviously superior due to cache re-use.
>
> Interesting question, however: on SMP without sufficient per-CPU slab
> caches, is that still the case?

That _should_ be fairly easily testable by just changing the cpucache
tuning, which you can do through /proc. The default parameters look
reasonable, though.

Note, however, that as far as the slab is concerned, this issue never
exists, if only because all the lists are per-CPU. So the only way you can
get cross-CPU cache behaviour is (a) when the actual _use_ of the slab is
moved across CPU's and (b) when you have a page that ends up moving from
one CPU to the other through page allocation when the slab caches run out.

Case (a) is clearly rather independent of the actual slab allocation
logic, and depends on the _user_ of the allocation, not on the allocator
itself.

Case (b) implies that the page allocator might also find per-CPU LIFO
queues in front of the actual allocator useful. That might certainly be
worthwhile looking into - although it would also increase the risk of
fragmentation.

(Doing per-CPU LIFO queues for the actual page allocator would potentially
make page alloc/de-alloc much faster due to lower locking requirements
too. So you might have a double performance win if anybody wants to try
this out).

		Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 17:01         ` Linus Torvalds
@ 2001-09-09 17:22           ` Manfred Spraul
  2001-09-09 17:27           ` arjan
  2001-09-09 17:35           ` Andrea Arcangeli
  2 siblings, 0 replies; 28+ messages in thread
From: Manfred Spraul @ 2001-09-09 17:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan Cox, Andrea Arcangeli, linux-kernel

Linus Torvalds wrote:
> 
> (Doing per-CPU LIFO queues for the actual page allocator would potentially
> make page alloc/de-alloc much faster due to lower locking requirements
> too. So you might have a double performance win if anybody wants to try
> this out).
>
That patch already exists - Ingo wrote it, I think it's included in some
RedHat kernels.

But Andrea's changes mainly affect UP, not SMP.
--
	Manfred

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 17:01         ` Linus Torvalds
  2001-09-09 17:22           ` Manfred Spraul
@ 2001-09-09 17:27           ` arjan
  2001-09-09 17:35           ` Andrea Arcangeli
  2 siblings, 0 replies; 28+ messages in thread
From: arjan @ 2001-09-09 17:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

In article <Pine.LNX.4.33.0109090952380.14365-100000@penguin.transmeta.com> you wrote:

> Case (b) implies that the page allocator might also should also find
> per-CPU LIFO queues in front of the actual allocator useful. That might
> certainly be worth-while looking into - although it would also increase
> the risk of fragementation.

Hi,

Ingo Molnar did such a patch already; attached below.
This yields a 1.5% improvement on a 4-way machine running Oracle, and
more for other applications (as the Oracle test is also IO-bound).

Greetings,
   Arjan van de Ven


diff -urN linux.org/include/linux/mmzone.h linux/include/linux/mmzone.h
--- linux.org/include/linux/mmzone.h	Wed Aug 15 17:21:11 2001
+++ linux/include/linux/mmzone.h	Tue Sep  4 08:12:37 2001
@@ -7,12 +7,13 @@
 #include <linux/config.h>
 #include <linux/spinlock.h>
 #include <linux/list.h>
+#include <linux/smp.h>
 
 /*
  * Free memory management - zoned buddy allocator.
  */
 
-#define MAX_ORDER 10
+#define MAX_ORDER 11
 
 typedef struct free_area_struct {
 	struct list_head	free_list;
@@ -21,6 +22,14 @@
 
 struct pglist_data;
 
+#define MAX_PER_CPU_PAGES 64
+
+typedef struct per_cpu_pages_s {
+	int nr_pages;
+	int max_nr_pages;
+	struct list_head head;
+} __attribute__((aligned(L1_CACHE_BYTES))) per_cpu_t;
+
 /*
  * On machines where it is needed (eg PCs) we divide physical memory
  * into multiple physical zones. On a PC we have 3 zones:
@@ -33,6 +42,7 @@
 	/*
 	 * Commonly accessed fields:
 	 */
+	per_cpu_t		cpu_pages [NR_CPUS];
 	spinlock_t		lock;
 	unsigned long		free_pages;
 	unsigned long		inactive_clean_pages;
diff -urN linux.org/mm/page_alloc.c linux/mm/page_alloc.c
--- linux.org/mm/page_alloc.c	Thu Aug 16 12:43:02 2001
+++ linux/mm/page_alloc.c	Tue Sep  4 08:12:37 2001
@@ -7,6 +7,7 @@
  *  Reshaped it to be a zoned allocator, Ingo Molnar, Red Hat, 1999
  *  Discontiguous memory support, Kanoj Sarcar, SGI, Nov 1999
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
+ *  Per-CPU page pool, Ingo Molnar, Red Hat, 2001
  */
 
 #include <linux/config.h>
@@ -67,8 +68,16 @@
 	unsigned long index, page_idx, mask, flags;
 	free_area_t *area;
 	struct page *base;
+	per_cpu_t *per_cpu;
 	zone_t *zone;
 
+	/*
+	 * This late check is safe because reserved pages do not
+	 * have a valid page->count. This trick avoids overhead
+	 * in __free_pages().
+	 */
+	if (PageReserved(page))
+		return;
 	if (page->buffers)
 		BUG();
 	if (page->mapping)
@@ -102,7 +111,18 @@
 
 	area = zone->free_area + order;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	per_cpu = zone->cpu_pages + smp_processor_id();
+
+	__save_flags(flags);
+	__cli();
+	if (!order && (per_cpu->nr_pages < per_cpu->max_nr_pages) && (!free_shortage())) {
+		list_add(&page->list, &per_cpu->head);
+		per_cpu->nr_pages++;
+		__restore_flags(flags);
+		return;
+	}
+
+	spin_lock(&zone->lock);
 
 	zone->free_pages -= mask;
 
@@ -169,16 +189,35 @@
 	return page;
 }
 
+
 static FASTCALL(struct page * rmqueue(zone_t *zone, unsigned long order));
 static struct page * rmqueue(zone_t *zone, unsigned long order)
 {
+	per_cpu_t *per_cpu = zone->cpu_pages + smp_processor_id();
 	free_area_t * area = zone->free_area + order;
 	unsigned long curr_order = order;
 	struct list_head *head, *curr;
 	unsigned long flags;
 	struct page *page;
+	int threshold=0;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!(current->flags & PF_MEMALLOC)) threshold = per_cpu->max_nr_pages/8;
+	__save_flags(flags);
+	__cli();
+
+	if (!order && (per_cpu->nr_pages>threshold)) {
+		if (list_empty(&per_cpu->head))
+			BUG();
+		page = list_entry(per_cpu->head.next, struct page, list);
+		list_del(&page->list);
+		per_cpu->nr_pages--;
+		__restore_flags(flags);
+
+		set_page_count(page, 1);
+		return page;
+	}
+
+	spin_lock(&zone->lock);
 	do {
 		head = &area->free_list;
 		curr = memlist_next(head);
@@ -534,7 +573,7 @@
 
 void __free_pages(struct page *page, unsigned long order)
 {
-	if (!PageReserved(page) && put_page_testzero(page))
+	if (put_page_testzero(page))
 		__free_pages_ok(page, order);
 }
 
@@ -796,6 +835,7 @@
 
 	offset = lmem_map - mem_map;	
 	for (j = 0; j < MAX_NR_ZONES; j++) {
+		int k;
 		zone_t *zone = pgdat->node_zones + j;
 		unsigned long mask;
 		unsigned long size, realsize;
@@ -807,6 +847,18 @@
 		printk("zone(%lu): %lu pages.\n", j, size);
 		zone->size = size;
 		zone->name = zone_names[j];
+
+		for (k = 0; k < NR_CPUS; k++) {
+			per_cpu_t *per_cpu = zone->cpu_pages + k;
+
+			INIT_LIST_HEAD(&per_cpu->head);
+			per_cpu->max_nr_pages = realsize / smp_num_cpus / 128;
+			if (per_cpu->max_nr_pages > MAX_PER_CPU_PAGES)
+				per_cpu->max_nr_pages = MAX_PER_CPU_PAGES;
+			else
+				if (!per_cpu->max_nr_pages)
+					per_cpu->max_nr_pages = 1;
+		}
 		zone->lock = SPIN_LOCK_UNLOCKED;
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 16:47       ` Alan Cox
  2001-09-09 16:55         ` Manfred Spraul
  2001-09-09 17:01         ` Linus Torvalds
@ 2001-09-09 17:30         ` Andrea Arcangeli
  2001-09-09 17:59         ` Fwd: 2.4.10-pre6 ramdisk driver broken? won't compile Stephan Gutschke
  2001-09-09 20:26         ` Purpose of the mm/slab.c changes Rik van Riel
  4 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 17:30 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linus Torvalds, Manfred Spraul, linux-kernel

On Sun, Sep 09, 2001 at 05:47:12PM +0100, Alan Cox wrote:
> > > doesn't matter which free page is used first/last.
> > 
> > You're full of crap.
> > LIFO is obviously superior due to cache re-use.
> 
> Interesting question, however: on SMP without sufficient per-CPU slab
> caches, is that still the case?

if we stay in the per-cpu caches then the other paths cannot run at all,
so it cannot make any difference in that case.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 17:01         ` Linus Torvalds
  2001-09-09 17:22           ` Manfred Spraul
  2001-09-09 17:27           ` arjan
@ 2001-09-09 17:35           ` Andrea Arcangeli
  2 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2001-09-09 17:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan Cox, Manfred Spraul, linux-kernel

On Sun, Sep 09, 2001 at 10:01:32AM -0700, Linus Torvalds wrote:
> (Doing per-CPU LIFO queues for the actual page allocator would potentially
> make page alloc/de-alloc much faster due to lower locking requirements
> too. So you might have a double performance win if anybody wants to try
> this out).

I recall seeing this one implemented during some auditing recently; I
think it's in the tux patch, but I may remember wrong.

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Fwd: 2.4.10-pre6 ramdisk driver broken? won't compile
  2001-09-09 16:47       ` Alan Cox
                           ` (2 preceding siblings ...)
  2001-09-09 17:30         ` Andrea Arcangeli
@ 2001-09-09 17:59         ` Stephan Gutschke
  2001-09-09 20:26         ` Purpose of the mm/slab.c changes Rik van Riel
  4 siblings, 0 replies; 28+ messages in thread
From: Stephan Gutschke @ 2001-09-09 17:59 UTC (permalink / raw)
  To: linux-kernel


From: Stephan Gutschke <stephan@kernel.gutschke.com>
To: Majordomo@vger.kernel.org
Date: Sun, 09 Sep 2001 09:55:14 -800

Hi there,

my kernel stops compiling with this:

gcc -D__KERNEL__ -I/usr/src/linux-2.4.10-pre6/include -Wall -Wstrict-prototypes 
-Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -
mpreferred-stack-boundary=2 -march=i686    -c -o rd.o rd.c
rd.c: In function `rd_ioctl':
rd.c:262: invalid type argument of `->'
make[3]: *** [rd.o] Error 1
make[3]: Leaving directory `/usr/src/linux-2.4.10-pre6/drivers/block'
make[2]: *** [first_rule] Error 2
make[2]: Leaving directory `/usr/src/linux-2.4.10-pre6/drivers/block'
make[1]: *** [_subdir_block] Error 2
make[1]: Leaving directory `/usr/src/linux-2.4.10-pre6/drivers'
make: *** [_dir_drivers] Error 2

I checked the patch (patch-2.4.10-pre6) and thought maybe there is 
a "&" missing, where someone changed &inode->... to rd_bdev[... ?
Anyway, I changed it and it seems to compile ;)

@@ -259,7 +259,7 @@
                        /* special: we want to release the ramdisk memory,
                           it's not like with the other blockdevices where
                           this ioctl only flushes away the buffer cache. */
-                       if ((atomic_read(&inode->i_bdev->bd_openers) > 2))
+                       if ((atomic_read(rd_bdev[minor]->bd_openers) > 2))
                                return -EBUSY;
                        destroy_buffers(inode->i_rdev);
                        rd_blocksizes[minor] = 0;



Bye
Stephan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 15:18   ` Manfred Spraul
                       ` (2 preceding siblings ...)
  2001-09-09 16:25     ` Linus Torvalds
@ 2001-09-09 20:23     ` Rik van Riel
  2001-09-09 20:44       ` Davide Libenzi
  2001-09-10  2:28     ` Daniel Phillips
  4 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2001-09-09 20:23 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Andrea Arcangeli, linux-kernel, Alan Cox, torvalds

On Sun, 9 Sep 2001, Manfred Spraul wrote:

> > it provides lifo allocations from both partial and unused slabs.
>
> lifo/fifo for unused slabs is obviously superfluous - free is
> free, it doesn't matter which free page is used first/last.

Mind that the L2 cache is often as much as 10 times faster
than RAM, so it would be nice if we had a good chance that
the slab we just allocated would be in L2 cache.

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 16:47       ` Alan Cox
                           ` (3 preceding siblings ...)
  2001-09-09 17:59         ` Fwd: 2.4.10-pre6 ramdisk driver broken? won't compile Stephan Gutschke
@ 2001-09-09 20:26         ` Rik van Riel
  4 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2001-09-09 20:26 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linus Torvalds, Manfred Spraul, Andrea Arcangeli, linux-kernel

On Sun, 9 Sep 2001, Alan Cox wrote:

> > > doesn't matter which free page is used first/last.
> >
> > You're full of crap.
> > LIFO is obviously superior due to cache re-use.
>
> Interesting question, however: on SMP without sufficient per-CPU
> slab caches, is that still the case?

By definition, no.

If we're allocating and freeing slabs at such a fast
rate that the slabs which are NOT in the per-CPU caches
are still in the CPU cache, chances are our per-CPU caches
are too small.

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 20:23     ` Rik van Riel
@ 2001-09-09 20:44       ` Davide Libenzi
  2001-09-09 20:45         ` Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Davide Libenzi @ 2001-09-09 20:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, Alan Cox, linux-kernel, Andrea Arcangeli,
	Manfred Spraul


On 09-Sep-2001 Rik van Riel wrote:
> On Sun, 9 Sep 2001, Manfred Spraul wrote:
> 
>> > it provides lifo allocations from both partial and unused slabs.
>>
>> lifo/fifo for unused slabs is obviously superfluous - free is
>> free, it doesn't matter which free page is used first/last.
> 
> Mind that the L2 cache is often as much as 10 times faster
> than RAM, so it would be nice if we had a good chance that
> the slab we just allocated would be in L2 cache.

Do you see it as a plus?
The newly allocated slab will very likely be written (without regard to
the old content), and an L2 mapping will generate invalidate traffic.



- Davide


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 20:44       ` Davide Libenzi
@ 2001-09-09 20:45         ` Rik van Riel
  2001-09-09 20:58           ` Davide Libenzi
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2001-09-09 20:45 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: torvalds, Alan Cox, linux-kernel, Andrea Arcangeli,
	Manfred Spraul

On Sun, 9 Sep 2001, Davide Libenzi wrote:

> Do you see it as a plus?
> The newly allocated slab will very likely be written (without regard
> to the old content), and an L2 mapping will generate
> invalidate traffic.

If your invalidates are slower than your RAM, you should
consider getting another computer.

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 20:45         ` Rik van Riel
@ 2001-09-09 20:58           ` Davide Libenzi
  2001-09-22 12:28             ` Ralf Baechle
  0 siblings, 1 reply; 28+ messages in thread
From: Davide Libenzi @ 2001-09-09 20:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Alan Cox,
	torvalds


On 09-Sep-2001 Rik van Riel wrote:
> On Sun, 9 Sep 2001, Davide Libenzi wrote:
> 
>> Do you see it as a plus?
>> The newly allocated slab will very likely be written (without regard
>> to the old content), and an L2 mapping will generate
>> invalidate traffic.
> 
> If your invalidates are slower than your RAM, you should
> consider getting another computer.

You mean a Sun, which uses a separate bus for snooping? :)
Besides, do not underestimate the cache-coherency traffic (which on many
CPUs uses the main memory bus); there's also the fact that any old data
present in L2 won't be used by the new slab user.




- Davide


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 15:18   ` Manfred Spraul
                       ` (3 preceding siblings ...)
  2001-09-09 20:23     ` Rik van Riel
@ 2001-09-10  2:28     ` Daniel Phillips
  4 siblings, 0 replies; 28+ messages in thread
From: Daniel Phillips @ 2001-09-10  2:28 UTC (permalink / raw)
  To: Manfred Spraul, Andrea Arcangeli; +Cc: linux-kernel, Alan Cox, torvalds

On September 9, 2001 05:18 pm, Manfred Spraul wrote:
> lifo/fifo for unused slabs is obviously superfluous - free is free, it
> doesn't matter which free page is used first/last.

Well... there is some difference with respect to the buddy allocator
and fragmentation.  <--- nit

--
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-09 15:33     ` Andrea Arcangeli
@ 2001-09-11 18:41       ` Manfred Spraul
  2001-09-11 19:36         ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Manfred Spraul @ 2001-09-11 18:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, Alan Cox, torvalds

> I think the cleanup
I'm sure you read this comment in page_alloc.c:
* Buddy system. Hairy. You really aren't expected to understand this
*
* Hint: -mask = 1+~mask

and the slab allocator must sustain more than 10 times more allocations/sec:
from lse netbench on sourceforge, 4-cpu, ext2, one minute:
    4 million kmallocs,
    5 million kmem_cache_alloc,
    721,000 rmqueue.
slab.c doesn't need to be simple, it must be fast.

> and the potential for lifo in the free slabs is much more
> sensible than the other factors you mentioned, of course there's less
> probability of having to fall into the free slabs rather than in the
> partial ones during allocations, but that doesn't mean that cannot
> happen very often, but I will glady suggest to remove it if you prove
> me wrong.

Ok, so you agree that your changes are only beneficial in one case:

kmem_cache_free(), uniprocessor or SMP not-per-cpu cached.
* frees one object
* after that free-operation no further slabs with allocated objects are
left - only full and free slabs.

Your code ensures that the next object returned will be the previously
freed object, my code doesn't guarantee that.

If I can modify my slab allocator to guarantee it, you'd drop your
patch?

--
    Manfred




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-11 18:41       ` Manfred Spraul
@ 2001-09-11 19:36         ` Andrea Arcangeli
  2001-09-11 20:43           ` Manfred Spraul
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2001-09-11 19:36 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Alan Cox, torvalds

On Tue, Sep 11, 2001 at 08:41:32PM +0200, Manfred Spraul wrote:
> > I think the cleanup
> I'm sure you read of this comment in page_alloc.c:
> * Buddy system. Hairy. You really aren't expected to understand this
> *
> * Hint: -mask = 1+~mask
> 
> and the slab allocator must sustain more than 10 times more allocations/sec:
> from lse netbench on sourceforge, 4-cpu, ext2, one minute:
>     4 million kmallocs,
>     5 million kmem_cache_alloc
>     721 000 rmqueue
> slab.c doesn't need to be simple, it must be fast.
> 
> > and the potential for lifo in the free slabs is much more
> > sensible than the other factors you mentioned, of course there's less
> > probability of having to fall into the free slabs rather than in the
> > partial ones during allocations, but that doesn't mean that cannot
> happen very often, but I will gladly suggest to remove it if you prove
> > me wrong.
> 
> Ok, so you agree that your changes are only beneficial in one case:
> 
> kmem_cache_free(), uniprocessor or SMP not-per-cpu cached.

I wouldn't say "only in such a case" but "mainly in such a case"
(there's not infinite ram in the per-cpu caches).

> If I can modify my slab allocator to guarantee it, you'd drop your
> patch?

I'm open to an alternate, more optimized approach; however, I have to
say that using struct list_head and maintaining the three cleanly
separated lists was really nice IMHO.

Also I believe there are more interesting parts to optimize on the
design side, rather than making the slab.c implementation more complex
for the sake of microoptimizing microbenchmarks (as you told me, you
couldn't measure any decrease in real-life performance in a sendfile
benchmark; in fact the first run with the patch was a little faster and
the second a little slower).

Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Purpose of the mm/slab.c changes
  2001-09-11 19:36         ` Andrea Arcangeli
@ 2001-09-11 20:43           ` Manfred Spraul
  2001-09-12 14:18             ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Manfred Spraul @ 2001-09-11 20:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, Alan Cox, torvalds

[-- Attachment #1: Type: text/plain, Size: 2347 bytes --]

Andrea Arcangeli wrote:
> 
> >
> > Ok, so you agree that your changes are only beneficial in one case:
> >
> > kmem_cache_free(), uniprocessor or SMP not-per-cpu cached.
> 
> I wouldn't say "not only in such case" but "mainly in such case"
> (there's not infinite ram in the per-cpu caches).
>
Large objects were LIFO even without your patch.

Small objects on UP with partial slabs were LIFO without your patch.

Small objects on UP without any partial slabs were not LIFO, but it was
simple to fix that (patch attached, not yet fully tested).
Small objects on SMP are managed in the per-cpu arrays - and transferred
in batches (by default 30-120 objects, depending on the size) to the
backend slab lists. The backend was not LIFO (also fixed in the patch).

> 
> Also, I believe there are more interesting parts to optimize on the
> design side, rather than making the slab.c implementation more complex
> for the sake of micro-optimizing for microbenchmarks (as you told me,
> you couldn't measure any real-life decrease in performance in a
> sendfile benchmark; in fact, the first run with the patch was a little
> faster and the second a little slower).
>
Interesting: you lose all microbenchmarks, your patch doesn't improve
LIFO ordering, and you still think your patch is better? Could you
explain why?

Btw, I didn't merge the current slab.c directly: Ingo Molnar compared it
with his own version and sent it to Linus because it was faster than his
version (8-way, tux optimization).

I agree that a redesign is probably a good idea, but that's 2.5 stuff:

* microoptimization: try to avoid the division in kmem_cache_free_one.
Might give a 20% boost on UP i386, even more on CPUs without an
efficient hardware division. I have an (ugly) patch. Mark Hemment's
version avoided the division in some cases, and that really helps.
* redesign: the code relies on a fast virt_to_page(). Is that realistic
with NUMA/CONFIG_DISCONTIGMEM?
* redesign: inode and dcache currently drain the per-cpu caches very
often to reduce the fragmentation - perhaps another, stronger
defragmenting allocator for inode & dcache? CPU crosscalls can't be the
perfect solution.
* redesign: NUMA.

But that redesign - virt_to_page is the first function called in both
kmem_cache_free and kfree - wouldn't be a backend simplification, more
a rewrite.

--
	Manfred

[-- Attachment #2: patch-slab-lifo --]
[-- Type: text/plain, Size: 3478 bytes --]

--- 2.4/mm/slab.c	Tue Sep 11 21:32:23 2001
+++ build-2.4/mm/slab.c	Tue Sep 11 22:39:54 2001
@@ -85,9 +85,9 @@
  * FORCED_DEBUG	- 1 enables SLAB_RED_ZONE and SLAB_POISON (if possible)
  */
 
-#define	DEBUG		0
-#define	STATS		0
-#define	FORCED_DEBUG	0
+#define	DEBUG		1
+#define	STATS		1
+#define	FORCED_DEBUG	1
 
 /*
  * Parameters for kmem_cache_reap
@@ -448,7 +448,7 @@
 		/* Inc off-slab bufctl limit until the ceiling is hit. */
 		if (!(OFF_SLAB(sizes->cs_cachep))) {
 			offslab_limit = sizes->cs_size-sizeof(slab_t);
-			offslab_limit /= 2;
+			offslab_limit /= sizeof(kmem_bufctl_t);
 		}
 		sprintf(name, "size-%Zd(DMA)",sizes->cs_size);
 		sizes->cs_dmacachep = kmem_cache_create(name, sizes->cs_size, 0,
@@ -1411,16 +1411,31 @@
 moveslab_free:
 	/*
 	 * was partial, now empty.
-	 * c_firstnotfull might point to slabp
-	 * FIXME: optimize
+	 * c_firstnotfull might point to slabp.
+	 * The code ensures LIFO ordering if there are no partial slabs.
+	 * (allocation from partial slabs has higher priority than LIFO
+	 *  - we just found a freeable page!)
 	 */
 	{
-		struct list_head *t = cachep->firstnotfull->prev;
-
-		list_del(&slabp->list);
-		list_add_tail(&slabp->list, &cachep->slabs);
-		if (cachep->firstnotfull == &slabp->list)
-			cachep->firstnotfull = t->next;
+		slab_t* next = list_entry(slabp->list.next, slab_t, list);
+		if (&next->list != &cachep->slabs) {
+			if (next->inuse != cachep->num) {
+				if (&slabp->list == cachep->firstnotfull)
+					cachep->firstnotfull = &next->list;
+				list_del(&slabp->list);
+				list_add_tail(&slabp->list, &cachep->slabs);
+			} /* else {
+			     The next slab is a free slab. That means
+			     the slab with the freed object is in its
+			     correct position: behind all partial slabs,
+			     in front of all other free slabs to ensure
+			     LIFO.
+			} */
+		}/* else {
+		    the slab the freed object was in was the last slab in
+		    the cache. That means it's already in the correct 
+	 	    position: behind all partial slabs (if any).
+		} */
 		return;
 	}
 }
@@ -1473,6 +1488,46 @@
 #endif
 }
 
+#if DEBUG
+static void kmem_slabchain_test(kmem_cache_t *cachep)
+{
+	int pos = 0;
+	slab_t* walk;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cachep->spinlock, flags);
+
+	walk = list_entry(cachep->slabs.next, slab_t, list);
+	while(&walk->list != &cachep->slabs) {
+		if (walk->inuse == cachep->num) {
+			if (pos >  0)
+				BUG();
+		} else if (walk->inuse > 0) {
+			if (pos == 0) {
+				if (cachep->firstnotfull != &walk->list)
+					BUG();
+			}
+			if (pos >  1)
+				BUG();
+			pos = 1; /* found partial slabp */
+		} else {
+			if (pos == 0) {
+				if (cachep->firstnotfull != &walk->list)
+					BUG();
+			}
+			pos = 2; /* found free slabp */
+		}
+		walk = list_entry(walk->list.next, slab_t, list);
+	}
+	if (pos == 0) {
+		if (cachep->firstnotfull != &cachep->slabs)
+			BUG();
+	}
+	spin_unlock_irqrestore(&cachep->spinlock, flags);
+}
+#else
+#define kmem_slabchain_test(cachep)	do { } while(0)
+#endif
 /**
  * kmem_cache_alloc - Allocate an object
  * @cachep: The cache to allocate from.
@@ -1540,6 +1595,7 @@
 	local_irq_save(flags);
 	__kmem_cache_free(cachep, objp);
 	local_irq_restore(flags);
+	kmem_slabchain_test(cachep);
 }
 
 /**
@@ -1561,6 +1617,7 @@
 	c = GET_PAGE_CACHE(virt_to_page(objp));
 	__kmem_cache_free(c, (void*)objp);
 	local_irq_restore(flags);
+	kmem_slabchain_test(c);
 }
 
 kmem_cache_t * kmem_find_general_cachep (size_t size, int gfpflags)


* Re: Purpose of the mm/slab.c changes
  2001-09-11 20:43           ` Manfred Spraul
@ 2001-09-12 14:18             ` Andrea Arcangeli
  0 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2001-09-12 14:18 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Alan Cox, torvalds

[-- Attachment #1: Type: text/plain, Size: 1719 bytes --]

On Tue, Sep 11, 2001 at 10:43:40PM +0200, Manfred Spraul wrote:
> Interesting: you lose all microbenchmarks, your patch doesn't improve
> LIFO ordering, and you still think your patch is better? Could you
> explain why?

I believe it's much cleaner to have proper lists. It also shrinks the
caches in FIFO order (shrinking is FIFO and allocation is LIFO).

The main reason it was significantly slower in the microbenchmarks is
that it wasn't micro-optimized; the design of the three lists wasn't
the real problem. I wrote this microbenchmark:

#include <linux/module.h>
#include <linux/slab.h>
#include <asm/msr.h>

#define NR_MALLOC 1000
#define SIZE 10

int init_module(void)
{
	unsigned long start, stop;
	int i;
	void * malloc[NR_MALLOC];

	rdtscl(start);
	for (i = 0; i < NR_MALLOC; i++)
		malloc[i] = kmalloc(SIZE, GFP_KERNEL);
	rdtscl(stop);

	for (i = 0; i < NR_MALLOC; i++)
		kfree(malloc[i]);

	printk("cycles %lu\n", stop - start);

	return 1;
}

these are the figures on UP PII 450mhz:

mainline pre10 (slabs-list patch applied)

cycles 90236
cycles 89812
cycles 90106

mainline but with slabs-list backed out

cycles 85187
cycles 85386
cycles 85169

mainline but with slabs-list backed out and your latest lifo patch
applied, but with debugging turned off (also I didn't check the
offslab_limit change, but it was applied)

cycles 85578
cycles 85856
cycles 85596

mainline with this new attached patch applied

cycles 85775
cycles 85963
cycles 85901

So, in short, I'd prefer to merge into mainline the attached patch,
which optimizes the slab lists while taking full advantage of the three lists.

Feel free to complain if you still have problems with those
micro-optimizations applied. (BTW, compiled with gcc 3.0.2)

Andrea

[-- Attachment #2: 00_slab-microoptimize-1 --]
[-- Type: text/plain, Size: 4524 bytes --]

--- 2.4.10pre8aa1/mm/slab.c.~1~	Wed Sep 12 03:23:43 2001
+++ 2.4.10pre8aa1/mm/slab.c	Wed Sep 12 05:02:40 2001
@@ -926,8 +926,10 @@
 			break;
 
 		slabp = list_entry(cachep->slabs_free.prev, slab_t, list);
+#if DEBUG
 		if (slabp->inuse)
 			BUG();
+#endif
 		list_del(&slabp->list);
 
 		spin_unlock_irq(&cachep->spinlock);
@@ -1215,7 +1217,7 @@
 }
 
 static inline void * kmem_cache_alloc_one_tail (kmem_cache_t *cachep,
-						slab_t *slabp, int partial)
+						slab_t *slabp)
 {
 	void *objp;
 
@@ -1228,14 +1230,9 @@
 	objp = slabp->s_mem + slabp->free*cachep->objsize;
 	slabp->free=slab_bufctl(slabp)[slabp->free];
 
-	if (slabp->free == BUFCTL_END) {
+	if (__builtin_expect(slabp->free == BUFCTL_END, 0)) {
 		list_del(&slabp->list);
 		list_add(&slabp->list, &cachep->slabs_full);
-	} else {
-		if (!partial) {
-			list_del(&slabp->list);
-			list_add(&slabp->list, &cachep->slabs_partial);
-		}
 	}
 #if DEBUG
 	if (cachep->flags & SLAB_POISON)
@@ -1262,20 +1259,23 @@
  */
 #define kmem_cache_alloc_one(cachep)				\
 ({								\
-	slab_t	*slabp;						\
-	struct list_head * slab_freelist;			\
-	int partial = 1;					\
+	struct list_head * slabs_partial, * entry;		\
+	slab_t *slabp;						\
 								\
-	slab_freelist = &(cachep)->slabs_partial;		\
-	if (list_empty(slab_freelist)) {			\
-		partial = 0;					\
-		slab_freelist = &(cachep)->slabs_free;		\
-		if (list_empty(slab_freelist))			\
+	slabs_partial = &(cachep)->slabs_partial;		\
+	entry = slabs_partial->next;				\
+	if (__builtin_expect(entry == slabs_partial, 0)) {	\
+		struct list_head * slabs_free;			\
+		slabs_free = &(cachep)->slabs_free;		\
+		entry = slabs_free->next;			\
+		if (__builtin_expect(entry == slabs_free, 0))	\
 			goto alloc_new_slab;			\
+		list_del(entry);				\
+		list_add(entry, slabs_partial);			\
 	}							\
 								\
-	slabp = list_entry(slab_freelist->next, slab_t, list);	\
-	kmem_cache_alloc_one_tail(cachep, slabp, partial);	\
+	slabp = list_entry(entry, slab_t, list);		\
+	kmem_cache_alloc_one_tail(cachep, slabp);		\
 })
 
 #ifdef CONFIG_SMP
@@ -1283,25 +1283,27 @@
 {
 	int batchcount = cachep->batchcount;
 	cpucache_t* cc = cc_data(cachep);
-	struct list_head * slab_freelist;
-	int partial;
-	slab_t *slabp;
 
 	spin_lock(&cachep->spinlock);
 	while (batchcount--) {
+		struct list_head * slabs_partial, * entry;
+		slab_t *slabp;
 		/* Get slab alloc is to come from. */
-		slab_freelist = &(cachep)->slabs_partial;
-		partial = 1;
-		if (list_empty(slab_freelist)) {
-			partial = 0;
-			slab_freelist = &(cachep)->slabs_free;
-			if (list_empty(slab_freelist))
+		slabs_partial = &(cachep)->slabs_partial;
+		entry = slabs_partial->next;
+		if (__builtin_expect(entry == slabs_partial, 0)) {
+			struct list_head * slabs_free;
+			slabs_free = &(cachep)->slabs_free;
+			entry = slabs_free->next;
+			if (__builtin_expect(entry == slabs_free, 0))
 				break;
+			list_del(entry);
+			list_add(entry, slabs_partial);
 		}
 
-		slabp = list_entry(slab_freelist->next, slab_t, list);
+		slabp = list_entry(entry, slab_t, list);
 		cc_entry(cc)[cc->avail++] =
-				kmem_cache_alloc_one_tail(cachep, slabp, partial);
+				kmem_cache_alloc_one_tail(cachep, slabp);
 	}
 	spin_unlock(&cachep->spinlock);
 
@@ -1432,23 +1434,18 @@
 	STATS_DEC_ACTIVE(cachep);
 	
 	/* fixup slab chains */
-	if (!--slabp->inuse)
-		goto moveslab_free;
-	if (slabp->inuse + 1 == cachep->num)
-		goto moveslab_partial;
-	return;
-
-moveslab_partial:
-    	/* Was full. */
-	list_del(&slabp->list);
-	list_add(&slabp->list, &cachep->slabs_partial);
-	return;
-
-moveslab_free:
-	/* Was partial, now empty. */
-	list_del(&slabp->list);
-	list_add(&slabp->list, &cachep->slabs_free);
-	return;
+	{
+		int inuse = slabp->inuse;
+		if (__builtin_expect(!--slabp->inuse, 0)) {
+			/* Was partial or full, now empty. */
+			list_del(&slabp->list);
+			list_add(&slabp->list, &cachep->slabs_free);
+		} else if (__builtin_expect(inuse == cachep->num, 0)) {
+			/* Was full. */
+			list_del(&slabp->list);
+			list_add(&slabp->list, &cachep->slabs_partial);
+		}
+	}
 }
 
 #ifdef CONFIG_SMP
@@ -1756,8 +1753,10 @@
 		p = searchp->slabs_free.next;
 		while (p != &searchp->slabs_free) {
 			slabp = list_entry(p, slab_t, list);
+#if DEBUG
 			if (slabp->inuse)
 				BUG();
+#endif
 			full_free++;
 			p = p->next;
 		}
@@ -1807,8 +1806,10 @@
 		if (p == &best_cachep->slabs_free)
 			break;
 		slabp = list_entry(p,slab_t,list);
+#if DEBUG
 		if (slabp->inuse)
 			BUG();
+#endif
 		list_del(&slabp->list);
 		STATS_INC_REAPED(best_cachep);
 


* Re: Purpose of the mm/slab.c changes
  2001-09-09 16:25     ` Linus Torvalds
  2001-09-09 16:47       ` Alan Cox
@ 2001-09-15  0:29       ` Albert D. Cahalan
  1 sibling, 0 replies; 28+ messages in thread
From: Albert D. Cahalan @ 2001-09-15  0:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Manfred Spraul, Andrea Arcangeli, linux-kernel, Alan Cox

Linus Torvalds writes:
> On Sun, 9 Sep 2001, Manfred Spraul wrote:

>>> it provides lifo allocations from both partial and unused slabs.
>>
>> lifo/fifo for unused slabs is obviously superfluous - free is free, it
>> doesn't matter which free page is used first/last.
>
> You're full of crap.
>
> LIFO is obviously superior due to cache re-use.

I think a bit of arch-specific code could do better.

Since I happen to have the MPC7400 manual at hand, I'll use that
for my example. This is the PowerPC "G4" chip.

The L1 cache is 8-way, 32 bytes/block, and has 128 sets. The L1
cache block replacement algorithm will select an invalid block
whenever possible. If an invalid block isn't available, then a
dirty block must go out to L2. The L2 cache is 2-way, as large
as 2 MB, and has FIFO replacement.

Now, considering that allocator:

The LIFO way: likely to suck up memory/L2 bandwidth writing
out dirty data that we don't care about, plus you risk stalling
the CPU to do this while in need of a free cache block.

The alternative: invalidate the cache line. This takes just
one instruction per cache line, and causes an address-only
bus operation at worst. Completely unrelated code may run
faster due to the availability of invalid cache lines.





* Re: Purpose of the mm/slab.c changes
  2001-09-09 20:58           ` Davide Libenzi
@ 2001-09-22 12:28             ` Ralf Baechle
  2001-09-22 21:03               ` Davide Libenzi
  0 siblings, 1 reply; 28+ messages in thread
From: Ralf Baechle @ 2001-09-22 12:28 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Rik van Riel, Manfred Spraul, Andrea Arcangeli, linux-kernel,
	Alan Cox, torvalds

On Sun, Sep 09, 2001 at 01:58:44PM -0700, Davide Libenzi wrote:

> >> Do You see it as a plus ?
> >> The new allocated slab will be very likely written ( w/o regard
> >> about the old content ) and an L2 mapping will generate
> >> invalidate traffic.
> > 
> > If your invalidates are slower than your RAM, you should
> > consider getting another computer.
> 
> You mean a Sun, that uses a separate bus for snooping ? :)
> Besides to not under estimate the cache coherency traffic ( that on many CPUs
> uses the main memory bus ) there's the fact that the old data eventually
> present in L2 won't be used by the new slab user.

That's actually what having a slab cache of pre-initialized elements tries
to achieve.

On anything that uses a MESI-like cache coherence protocol, writing to
an already-dirty cached line will not cause any coherency traffic and
will thus be faster.

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.


* Re: Purpose of the mm/slab.c changes
  2001-09-22 12:28             ` Ralf Baechle
@ 2001-09-22 21:03               ` Davide Libenzi
  2001-09-22 21:36                 ` David S. Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Davide Libenzi @ 2001-09-22 21:03 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: torvalds, Alan Cox, linux-kernel, Andrea Arcangeli,
	Manfred Spraul, Rik van Riel


On 22-Sep-2001 Ralf Baechle wrote:
> On Sun, Sep 09, 2001 at 01:58:44PM -0700, Davide Libenzi wrote:
> 
>> >> Do You see it as a plus ?
>> >> The new allocated slab will be very likely written ( w/o regard
>> >> about the old content ) and an L2 mapping will generate
>> >> invalidate traffic.
>> > 
>> > If your invalidates are slower than your RAM, you should
>> > consider getting another computer.
>> 
>> You mean a Sun, that uses a separate bus for snooping ? :)
>> Besides to not under estimate the cache coherency traffic ( that on many CPUs
>> uses the main memory bus ) there's the fact that the old data eventually
>> present in L2 won't be used by the new slab user.
> 
> That's actually what having a slab cache of pre-initialized elements tries
> to achieve.
> 
> On anything that uses a MESI-like cache coherence protocol a cached dirty
> cache line that is written to will not cause any coherency traffic and
> thus be faster.

MESI is a bit more complicated than clean/dirty status.
This is a very good state machine graph for MESI :

http://odin.ee.uwa.edu.au/~morris/CA406/cache_coh.html

Besides this, I don't get how a LIFO could help you.
From a cache point of view, if the slab-free code runs on processor A
and the slab-alloc code runs on processor B, and these two ops are
executed very close in time (due to LIFO), there's a good probability
of a modified->shared migration that will result in pushback ops.
A FIFO will result in more time between free and alloc, with a good
probability that the interlock will be relaxed.




- Davide



* Re: Purpose of the mm/slab.c changes
  2001-09-22 21:03               ` Davide Libenzi
@ 2001-09-22 21:36                 ` David S. Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David S. Miller @ 2001-09-22 21:36 UTC (permalink / raw)
  To: davidel; +Cc: ralf, torvalds, alan, linux-kernel, andrea, manfred, riel

   From: Davide Libenzi <davidel@xmailserver.org>
   Date: Sat, 22 Sep 2001 14:03:02 -0700 (PDT)

   Besides this, i don't get how a LIFO could help you.

Actually, there is a school of thought which says that if you make the
time between the free and re-alloc of a piece of memory as long as
possible, you increase the likelihood that any dirty cache lines of
that memory can be sent back to memory "quietly" during natural L2
cache line replacement.

I don't necessarily subscribe to these ideas, but I do see the
potential benefits.  For one thing, it does have the potential to lead
to more repeatable timings, at least in theory.

Later,
David S. Miller
davem@redhat.com



Thread overview: 28+ messages
2001-09-09 11:05 Purpose of the mm/slab.c changes Manfred Spraul
2001-09-09 14:26 ` Andrea Arcangeli
2001-09-09 15:18   ` Manfred Spraul
2001-09-09 15:33     ` Andrea Arcangeli
2001-09-11 18:41       ` Manfred Spraul
2001-09-11 19:36         ` Andrea Arcangeli
2001-09-11 20:43           ` Manfred Spraul
2001-09-12 14:18             ` Andrea Arcangeli
2001-09-09 16:12     ` Alan Cox
2001-09-09 16:25     ` Linus Torvalds
2001-09-09 16:47       ` Alan Cox
2001-09-09 16:55         ` Manfred Spraul
2001-09-09 17:01         ` Linus Torvalds
2001-09-09 17:22           ` Manfred Spraul
2001-09-09 17:27           ` arjan
2001-09-09 17:35           ` Andrea Arcangeli
2001-09-09 17:30         ` Andrea Arcangeli
2001-09-09 17:59         ` Fwd: 2.4.10-pre6 ramdisk driver broken? won't compile Stephan Gutschke
2001-09-09 20:26         ` Purpose of the mm/slab.c changes Rik van Riel
2001-09-15  0:29       ` Albert D. Cahalan
2001-09-09 20:23     ` Rik van Riel
2001-09-09 20:44       ` Davide Libenzi
2001-09-09 20:45         ` Rik van Riel
2001-09-09 20:58           ` Davide Libenzi
2001-09-22 12:28             ` Ralf Baechle
2001-09-22 21:03               ` Davide Libenzi
2001-09-22 21:36                 ` David S. Miller
2001-09-10  2:28     ` Daniel Phillips
