* [RFC] Simple Slab: A slab allocator with minimal meta information
@ 2006-08-10 0:52 Christoph Lameter
2006-08-10 2:07 ` Matt Mackall
2006-08-10 5:01 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-08-10 0:52 UTC (permalink / raw)
To: Matt Mackall; +Cc: npiggin, manfred, linux-kernel
Lately I have started tinkering with the slab, in particular after Matt
Mackall mentioned at the KS that the slab should be more modular.
One particular design issue with the current slab is that it is built on the
basic notion of shifting object references from list to list. Without NUMA this
is wild enough with the per cpu caches and the shared cache, but with NUMA we
now have per node shared arrays, per node lists and, for each node, alien
caches for every other node. Somehow this all works, but one wonders: does it
have to be that way? On very large systems the number of these entities grows
to unbelievable numbers.
After working through a lot of patches to the existing slab I finally got
fed up with this. Maybe it is best to try to develop another basic slab
layer that does not have all the object queues and that does not have to
carry so much state information. I have also had concerns about the way
locking is handled for a while. We could increase parallelism by finer
grained locking. This in turn may avoid the need for object queues (what was
it... we had ~7-10 million cpu array allocations on our 1k processor system
at bootup...). Ironically, not having object queues may benefit both
embedded and very large systems at this point.
After toying around for a while I came to the realization that the page struct
contains all the information necessary to manage a slab block. One can put
all the management information there, which is also advantageous
for performance since we constantly have to use the page struct anyway for
reverse object lookups and during slab creation. So this also reduces the
cache footprint of the slab. Alignment is naturally the best possible since
the first object starts right at the page boundary, which reduces the
complexity of alignment calculations.
We use two locks:
1. The per sslab list_lock to protect the lists. This lock does not protect
the slab itself and is only taken during list manipulation. List manipulation
is reduced to the necessary moves when the state of a page changes. Allocating
or freeing an object in a slab does not require taking the list_lock as long
as the slab does not become empty or full.
2. The page lock in struct page is used to protect the slab during
allocation and freeing. This lock is conveniently placed in a cacheline
that is already available. Other key information is also placed there.
struct page overloading:
- _mapcount => Used to count the objects in use in a slab
- mapping => Reference to the sslab structure
- index => Pointer to the first free element in a slab
- lru => Used for list management.
Flag overloading:
PageReferenced => Used to control per cpu slab freeing.
PageActive => slab is under active allocation.
PageLocked => slab locking
The freelist of objects in each page is managed as a chained list.
The struct page contains a pointer to the first free element. The first word
of each free element contains a pointer to the next free element, and so on
until the chain ends with NULL.
There is no freelist for slabs. Slabs are immediately returned to the page
allocator. The page allocator has its own per cpu page queues, which should
provide enough caching.
Per cpu caches exist in the sense that each processor has a per processor
"cpuslab". Objects in this slab will only be allocated from this processor.
The page state is likely going to stay in the cache. Allocation will be
very fast since we only need the page struct reference for all our needs
which is likely not contended at all. Fetching the next free pointer from
the location of the object nicely prefetches the object.
This is by no means complete and probably full of bugs. Feedback and help
wanted! I have tried to switch over two minor system caches (memory
policies) to use simple slab and it seems to work. We probably have some
way to go before we could do performance tests.
sslab does not disable interrupts by default. I am not sure whether that
would be desirable. sslab does not have the per cpu object cache addiction
of the regular slab that requires disabling irqs. Maybe there is a way to
work around it.
The Simple Slab is not NUMA capable at this point and I think NUMA awareness
may be better implemented in a different way. Maybe we could understand the
Simple Slab as a lower layer and then add all the bells and whistles,
including NUMA awareness, the proc API, kmalloc caches etc., on top as a
management layer for this lower level functionality.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.18-rc4/include/linux/sslab.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/include/linux/sslab.h 2006-08-09 16:54:19.506233021 -0700
@@ -0,0 +1,79 @@
+/*
+ * Slab allocator with minimal management overhead.
+ *
+ * (C) 2006 Silicon Graphics Inc., Christoph Lameter <clameter@sgi.com>
+ */
+#include <linux/mm.h>
+
+/*
+ * In a NUMA configuration we plan to have an sslab per node and we can
+ * then allocate based on the node. The NR_CPUS should reflect the
+ * maximum number of processors per node. Remote processors will share in
+ * the use of cpuslabs.
+ */
+struct sslab {
+ struct list_head partial; /* List of partially allocated slabs */
+ struct list_head full; /* Fully allocated slabs */
+ struct list_head active; /* Slabs from which we currently allocate */
+ spinlock_t list_lock; /* Lock for partial slabs */
+ int order; /* Allocation order */
+ gfp_t gfpflags; /* Allocation flags */
+ unsigned int size; /* Size of objects */
+ unsigned long slabs; /* Slabs in use for this one */
+ unsigned long objects; /* Objects per slab */
+ struct page *cpuslab[NR_CPUS]; /* Per CPU slabs list */
+ struct work_struct flush[NR_CPUS];/* Cache flushers */
+};
+
+/*
+ * sslab_create() produces objects aligned at their size, and the first object
+ * is placed at offset 0 in the slab (sslab has no meta information on the
+ * slab; all slabs are in essence off slab).
+ *
+ * In order to get the desired alignment one just needs to align the
+ * size. F.e.
+ *
+ * sslab_create(&my_cache, ALIGN(sizeof(struct mystruct), CACHE_L1_SIZE), 2, GFP_KERNEL);
+ *
+ * Notice that the allocation order determines the sizes of the per cpu caches.
+ * Each processor has always one slab available for allocations. Increasing the
+ * allocation order reduces the number of times that slabs must be moved
+ * on and off lists and therefore influences sslab overhead.
+ */
+
+extern int sslab_create(struct sslab *s, size_t size, int order, gfp_t flags);
+
+extern void *sslab_alloc(struct sslab *, gfp_t);
+extern void *sslab_zalloc(struct sslab *, gfp_t);
+
+/*
+ * If the struct sslab pointer is NULL then sslab will determine the
+ * proper cache on its own. Otherwise the sslab pointer will be verified
+ * before object removal.
+ */
+extern void sslab_free(struct sslab *, void *);
+
+/*
+ * Preloading allows a guarantee that the slab will not call the page
+ * allocator for the given number of allocations. This will add new slabs
+ * to the partial list until the required number of objects can be allocated
+ * without additional calls into the page allocator.
+ */
+extern void sslab_preload(struct sslab *, unsigned long);
+
+extern void sslab_destroy(struct sslab *);
+
+/*
+ * Shrinking drops all the per cpu slabs and also reaps all empty slabs
+ * off the partial list. (preloading may have added those).
+ *
+ * Note that there is currently no slab reaping. The per cpu slabs
+ * will stay allocated if this function is not called.
+ */
+extern void sslab_shrink(struct sslab *);
+
+extern int sslab_pointer_valid(struct sslab *, void *);
+
+extern unsigned long sslab_objects(struct sslab *);
+
+
Index: linux-2.6.18-rc4/mm/sslab.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/sslab.c 2006-08-09 16:54:19.505256518 -0700
@@ -0,0 +1,528 @@
+/*
+ * Simple slab allocator with minimal management overhead.
+ *
+ * (C) 2006 Silicon Graphics Inc., Christoph Lameter <clameter@sgi.com>
+ */
+
+#include <linux/sslab.h>
+#include <linux/bit_spinlock.h>
+
+/*
+ * The page struct is used to keep necessary information about a slab.
+ * For a compound page the first page keeps the slab state.
+ *
+ * Overloaded fields in struct page:
+ *
+ * lru -> used to chain a slab on the lists
+ * mapping -> pointer to struct sslab
+ * index -> pointer to next free object
+ * _mapcount -> count number of elements in use
+ *
+ * Lock order:
+ * 1. slab_lock(page)
+ * 2. sslab->list_lock
+ *
+ * The sslab assigns one slab to each processor
+ * for allocations. This slab is on the active list and allocations
+ * occur only on the active slabs. If a cpuslab is active then
+ * a workqueue thread checks every 10 seconds if the cpuslab is
+ * still in use. It is dropped back to the inactive lists if not.
+ *
+ * Leftover slabs with free elements are kept on the partial list
+ * and full slabs on the full list.
+ *
+ * Slabs are freed when they become empty. Teardown and setup is
+ * minimal so we rely on the page allocator's per cpu caches for
+ * performance on frees/allocs.
+ */
+
+#define SSLAB_DEBUG
+
+#define lru_to_page(_head) (list_entry((_head)->next, struct page, lru))
+
+/* Some definitions to overload certain fields in struct page */
+static void *get_object_pointer(struct page *page)
+{
+ return (void *)page->index;
+}
+
+static void set_object_pointer(struct page *page, void *object)
+{
+ page->index = (unsigned long)object;
+}
+
+static void *get_sslab(struct page *page)
+{
+ return (struct sslab *)page->mapping;
+}
+
+static void set_sslab(struct page *page, struct sslab *s)
+{
+ page->mapping = (void *)s;
+}
+
+int *object_counter(struct page *page)
+{
+ return (int *)&page->_mapcount;
+}
+
+static void inc_object_counter(struct page *page)
+{
+ (*object_counter(page))++;
+}
+
+static void dec_object_counter(struct page *page)
+{
+ (*object_counter(page))--;
+}
+
+static void set_object_counter(struct page *page, int counter)
+{
+ (*object_counter(page))= counter;
+}
+
+static int get_object_counter(struct page *page)
+{
+ return (*object_counter(page));
+}
+
+/*
+ * Locking for each individual slab using the pagelock
+ */
+static void slab_lock(struct page *page)
+{
+ if (!TestSetPageLocked(page))
+ return;
+ bit_spin_lock(PG_locked, &page->flags);
+}
+
+static void slab_unlock(struct page *page)
+{
+ smp_mb__before_clear_bit();
+ if (!TestClearPageLocked(page))
+ BUG();
+}
+
+/*
+ * Discard an unused slab page
+ * List lock is held.
+ */
+static void discard_slab(struct sslab *s, struct page *page)
+{
+ list_del(&page->lru);
+ s->slabs--;
+ set_sslab(page, NULL);
+ set_object_pointer(page, NULL);
+ set_object_counter(page, -1); /* -1 is convention for mapcount */
+ __ClearPageSlab(page);
+ sub_zone_page_state(page_zone(page), NR_SLAB, 1 << s->order);
+ __free_pages(page, s->order);
+}
+
+static struct page *new_slab(struct sslab *s)
+{
+ void **p, **last;
+ int i;
+ struct page *page;
+
+ page = alloc_pages(s->gfpflags, s->order);
+ if (!page)
+ return NULL;
+
+ set_sslab(page, s);
+ last = page_address(page);
+ set_object_pointer(page, last);
+
+ for (i = 0; i < s->objects - 1; i++) {
+ p = last + s->size / sizeof(void *);
+ *last = p;
+ last = p;
+ }
+ *last = NULL;
+
+ __SetPageSlab(page);
+ set_object_counter(page, 0);
+ add_zone_page_state(page_zone(page), NR_SLAB, 1 << s->order);
+ return page;
+}
+/*
+ * Return a per cpu slab by taking it off the active list.
+ */
+void flush_cpuslab(struct sslab *s, int cpu)
+{
+ struct page *page;
+
+ /* Terminate any active flusher */
+ cancel_delayed_work(&s->flush[cpu]);
+
+ /* Avoid lock inversion here. */
+redo:
+ page = s->cpuslab[cpu];
+ if (!page)
+ return;
+ slab_lock(page);
+ if (s->cpuslab[cpu] != page) {
+ slab_unlock(page);
+ goto redo;
+ }
+
+ spin_lock(&s->list_lock);
+ page = s->cpuslab[cpu];
+ s->cpuslab[cpu] = NULL;
+ ClearPageActive(page);
+ if (get_object_counter(page)) {
+ if (get_object_counter(page) < s->objects)
+ list_move(&page->lru, &s->partial);
+ else
+ list_move(&page->lru, &s->full);
+ spin_unlock(&s->list_lock);
+ slab_unlock(page);
+ } else {
+ slab_unlock(page);
+ discard_slab(s, page);
+ spin_unlock(&s->list_lock);
+ }
+}
+
+/*
+ * Flush the per cpu slab if it is not in use.
+ */
+void flusher(void *d)
+{
+ struct sslab *s = d;
+ int cpu = smp_processor_id();
+ struct page *page;
+
+ page = s->cpuslab[cpu];
+
+ if (!page)
+ return;
+
+ if (PageReferenced(page)) {
+ ClearPageReferenced(page);
+ schedule_delayed_work(&s->flush[cpu], 10 * HZ);
+ } else
+ flush_cpuslab(s, cpu);
+}
+
+/*
+ * Reload the per cpu slab
+ */
+static int reload(struct sslab *s, int cpu, gfp_t flags)
+{
+ struct page *page;
+
+ cancel_delayed_work(&s->flush[cpu]);
+ spin_lock(&s->list_lock);
+ if (!list_empty(&s->partial)) {
+ page = lru_to_page(&s->partial);
+ list_move(&page->lru, &s->active);
+
+ } else {
+
+ /* No slabs with free objets */
+ spin_unlock(&s->list_lock);
+ page = new_slab(s);
+ if (!page)
+ return 0;
+
+ spin_lock(&s->list_lock);
+ s->slabs++;
+ list_add(&page->lru, &s->active);
+ }
+ SetPageActive(page);
+ ClearPageReferenced(page);
+ s->cpuslab[cpu] = page;
+ spin_unlock(&s->list_lock);
+ if (keventd_up())
+ schedule_delayed_work_on(cpu, &s->flush[cpu], 10 * HZ);
+ return 1;
+}
+
+
+void check_valid_pointer(struct sslab *s, struct page *page, void *object)
+{
+#ifdef SSLAB_DEBUG
+ BUG_ON(!PageSlab(page));
+ BUG_ON(object < page_address(page));
+ BUG_ON(object >= page_address(page) + s->objects * s->size);
+ BUG_ON((object - page_address(page)) % s->size);
+#endif
+}
+
+void check_free_chain(struct sslab *s, struct page *page)
+{
+#ifdef SSLAB_DEBUG
+ int nr = 0;
+ void **object = get_object_pointer(page);
+
+ BUG_ON(!PageSlab(page));
+
+ while (object) {
+ check_valid_pointer(s, page, object);
+ object = *object;
+ nr++;
+ }
+ BUG_ON(nr > s->objects);
+#endif
+}
+
+/*
+ * Create a new slab.
+ */
+int sslab_create(struct sslab *s, size_t size, int order, gfp_t flags)
+{
+ int cpu;
+
+ s->size = ALIGN(size, sizeof(void *));
+ s->order = order;
+ s->gfpflags = flags;
+ if (order > 0)
+ s->gfpflags |= __GFP_COMP;
+ s->objects = (PAGE_SIZE << order) / s->size;
+ if (!s->objects)
+ return -EINVAL;
+
+ INIT_LIST_HEAD(&s->partial);
+ INIT_LIST_HEAD(&s->full);
+ INIT_LIST_HEAD(&s->active);
+ spin_lock_init(&s->list_lock);
+ s->slabs = 0;
+ for_each_possible_cpu(cpu) {
+ s->cpuslab[cpu] = NULL;
+ INIT_WORK(&s->flush[cpu], &flusher, s);
+ }
+ return 0;
+}
+
+void *sslab_alloc(struct sslab *s, gfp_t flags)
+{
+ int cpu = smp_processor_id();
+ struct page *page;
+ void **x;
+
+ do {
+ page = s->cpuslab[cpu];
+ if (unlikely(!page))
+ continue;
+
+ check_free_chain(s, page);
+ slab_lock(page);
+ if (!PageActive(page)) {
+ slab_unlock(page);
+ continue;
+ }
+ x = get_object_pointer(page);
+ if (likely(x)) {
+ set_object_pointer(page, *x);
+ inc_object_counter(page);
+ SetPageReferenced(page);
+ slab_unlock(page);
+ check_free_chain(s, page);
+ return x;
+ }
+
+ /* Slab is full */
+ spin_lock(&s->list_lock);
+ ClearPageActive(page);
+ list_move(&page->lru, &s->full);
+ s->cpuslab[cpu] = NULL;
+ spin_unlock(&s->list_lock);
+ slab_unlock(page);
+ } while (reload(s, cpu, flags));
+ return NULL;
+}
+
+/*
+ * Allocate and zero an object
+ */
+void *sslab_zalloc(struct sslab *s, gfp_t flags)
+{
+ void *x = sslab_alloc(s, flags);
+
+ if (x)
+ memset(x, 0, s->size);
+ return x;
+}
+
+void sslab_free(struct sslab *s, void *object)
+{
+ struct page *page;
+ int leftover;
+ void *prior;
+
+ /* Figure out on which slab the object resides */
+ page = virt_to_page(object);
+ if (unlikely(PageCompound(page)))
+ page = (struct page *)page_private(page);
+
+ if (!s)
+ s = get_sslab(page);
+ else
+ BUG_ON(s != get_sslab(page));
+
+ check_free_chain(s, page);
+ check_valid_pointer(s, page, object);
+ slab_lock(page);
+ prior = get_object_pointer(page);
+ * (void **) object = prior;
+ set_object_pointer(page, object);
+ dec_object_counter(page);
+ leftover = get_object_counter(page);
+ slab_unlock(page);
+
+ if (unlikely(PageActive(page)))
+ return;
+
+ if (unlikely(leftover == 0)) {
+ spin_lock(&s->list_lock);
+ if (PageActive(page)) {
+ /* No discarding of slabs under active allocation */
+ spin_unlock(&s->list_lock);
+ return;
+ }
+ discard_slab(s, page);
+ spin_unlock(&s->list_lock);
+ /* The page has been freed; do not touch it below. */
+ return;
+ }
+
+ if (unlikely(!prior)) {
+ /*
+ * Page was fully used before. It will only have one free
+ * object now. So move to the front of the partial list.
+ */
+ spin_lock(&s->list_lock);
+ if (!PageActive(page))
+ list_move(&page->lru, &s->partial);
+ spin_unlock(&s->list_lock);
+ }
+ check_free_chain(s, page);
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int sslab_pointer_valid(struct sslab *s, void *object)
+{
+ struct page *page;
+ void *slab_addr;
+
+ /* Figure out on which slab the object resides */
+ page = virt_to_page(object);
+ if (unlikely(PageCompound(page)))
+ page = (struct page *)page_private(page);
+
+ if (!PageSlab(page) || s != get_sslab(page))
+ return 0;
+
+ slab_addr = page_address(page);
+ if (object < slab_addr || object >= slab_addr + s->objects * s->size)
+ return 0;
+
+ if ((object - slab_addr) % s->size)
+ return 0;
+
+ return 1;
+}
+
+/*
+ * Preload the cache with enough slabs so that we do not need to
+ * allocate for the specified number of objects.
+ */
+void sslab_preload(struct sslab *s, unsigned long nr)
+{
+ int nr_local = 0;
+ int cpu = smp_processor_id();
+ struct page * page = s->cpuslab[cpu];
+
+ if (page)
+ nr_local = get_object_counter(page);
+
+ nr_local += sslab_objects(s);
+
+ while (nr_local < nr) {
+ struct page *page = new_slab(s);
+
+ if (!page)
+ return;
+
+ spin_lock(&s->list_lock);
+ list_add_tail(&page->lru, &s->partial);
+ s->slabs++;
+ spin_unlock(&s->list_lock);
+ nr_local += s->objects;
+ }
+}
+
+
+static void drain_cpu(void *v)
+{
+ struct sslab *s = v;
+
+ flush_cpuslab(s, smp_processor_id());
+}
+
+/*
+ * Try to remove as many slabs as possible. In particular try to undo the
+ * effect of sslab_preload which may have added empty pages to the
+ * partial list.
+ */
+void sslab_shrink(struct sslab *s)
+{
+ struct list_head *h1, *h2;
+
+ spin_lock(&s->list_lock);
+ list_for_each_safe(h1, h2, &s->partial) {
+ struct page *page = lru_to_page(h1);
+
+ if (get_object_counter(page) == 0)
+ discard_slab(s, page);
+ }
+ spin_unlock(&s->list_lock);
+
+ /*
+ * Free each of the per cpu slabs
+ */
+ on_each_cpu(drain_cpu, s, 0, 0);
+}
+
+static void free_list(struct sslab *s, struct list_head *list)
+{
+ while (!list_empty(list))
+ discard_slab(s, lru_to_page(list));
+
+}
+
+/*
+ * Release all leftover slabs. If there are any leftover pointers dangling
+ * to these objects then we will get into a lot of trouble later.
+ */
+void sslab_destroy(struct sslab *s)
+{
+ sslab_shrink(s);
+
+ spin_lock(&s->list_lock);
+ free_list(s, &s->full);
+ free_list(s, &s->partial);
+ spin_unlock(&s->list_lock);
+
+ /* Just to make sure that no one uses this again */
+ s->size = 0;
+ BUG_ON(s->slabs);
+}
+
+static unsigned long count_objects(struct sslab *s, struct list_head *list)
+{
+ unsigned long count = 0;
+ struct list_head *h1;
+
+ spin_lock(&s->list_lock);
+ list_for_each(h1, list) {
+ struct page *page = lru_to_page(h1);
+
+ count += get_object_counter(page);
+ }
+ spin_unlock(&s->list_lock);
+ return count;
+}
+
+unsigned long sslab_objects(struct sslab *s)
+{
+ return count_objects(s, &s->partial) +
+ count_objects(s, &s->full) +
+ count_objects(s, &s->active);
+}
+
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-09 15:14:42.694933996 -0700
@@ -19,8 +19,9 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SHMEM) += shmem.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
obj-$(CONFIG_SLOB) += slob.o
-obj-$(CONFIG_SLAB) += slab.o
+obj-$(CONFIG_SLAB) += slab.o sslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
+
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 0:52 [RFC] Simple Slab: A slab allocator with minimal meta information Christoph Lameter
@ 2006-08-10 2:07 ` Matt Mackall
2006-08-10 5:01 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 16+ messages in thread
From: Matt Mackall @ 2006-08-10 2:07 UTC (permalink / raw)
To: Christoph Lameter; +Cc: npiggin, manfred, linux-kernel
On Wed, Aug 09, 2006 at 05:52:00PM -0700, Christoph Lameter wrote:
> This is by no means complete and probably full of bugs. Feedback and help
> wanted! I have tried to switch over two minor system caches (memory
> policies) to use simple slab and it seems to work. We probably have some
> way to go before we could do performance tests.
There's probably enough here to shim in a regular slab and kmalloc
interface layer to run a desktop machine.
> The Simple Slab is not NUMA capable at this point and I think the
> NUMAness may better be implemented in a different way. Maybe we
> could understand the Simple Slab as a lower layer and then add all
> the bells and whistles including NUMAness, proc API, kmalloc caches
> etc. on top as a management layer for this lower level
> functionality.
I think a layered approach to handling NUMA and the like makes an
awful lot of sense here. And probably greatly simplifies locking, etc.
Also, I like that you've gone to off-slab accounting. Not only does
this simplify things overall, it's good for memory footprint and
possibly better on cache footprint.
It's gonna need a better name though..
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 0:52 [RFC] Simple Slab: A slab allocator with minimal meta information Christoph Lameter
2006-08-10 2:07 ` Matt Mackall
@ 2006-08-10 5:01 ` KAMEZAWA Hiroyuki
2006-08-10 5:13 ` Christoph Lameter
1 sibling, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-08-10 5:01 UTC (permalink / raw)
To: Christoph Lameter; +Cc: mpm, npiggin, manfred, linux-kernel
Hi,
On Wed, 9 Aug 2006 17:52:00 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> struct page overloading:
>
> - _mapcount => Used to count the objects in use in a slab
> - mapping => Reference to the slab structure
> - index => Pointer to the first free element in a slab
> - lru => Used for list management.
>
it seems it's time that the page struct should have more unions ;)
> There is no freelist for slabs. slabs are immediately returned to the page
> allocator. The page allocator has its own per cpu page queues that should provide
> enough caching.
>
I think that the advantage of the slab allocator is
- an object is already initialized at setup, so you don't have to initialize
it again at allocation.
- an object is initialized only once, when the slab is created.
If a slab page is returned to the page allocator ASAP, the number of object
initializations may increase.
-Kame
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 5:01 ` KAMEZAWA Hiroyuki
@ 2006-08-10 5:13 ` Christoph Lameter
2006-08-10 5:44 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2006-08-10 5:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: mpm, npiggin, manfred, linux-kernel
On Thu, 10 Aug 2006, KAMEZAWA Hiroyuki wrote:
> > There is no freelist for slabs. slabs are immediately returned to the page
> > allocator. The page allocator has its own per cpu page queues that should provide
> > enough caching.
> >
>
> I think that the advantage of the slab allocator is
> - an object is already initialized at setup, so you don't have to initialize
> it again at allocation.
> - an object is initialized only once, when the slab is created.
If you do that then you lose the cache hot advantage. It is advantageous
to initialize the object and then immediately use it. If you initialize it
earlier, the cacheline will be evicted and then brought back.
> If a slab page is returned to the page allocator ASAP, the number of object
> initializations may increase.
The initializers in the existing slab are only used in rare cases in the
kernel.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 5:13 ` Christoph Lameter
@ 2006-08-10 5:44 ` KAMEZAWA Hiroyuki
2006-08-10 5:44 ` Christoph Lameter
2006-08-10 6:13 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-08-10 5:44 UTC (permalink / raw)
To: Christoph Lameter; +Cc: mpm, npiggin, manfred, linux-kernel
On Wed, 9 Aug 2006 22:13:27 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 10 Aug 2006, KAMEZAWA Hiroyuki wrote:
>
> > > There is no freelist for slabs. slabs are immediately returned to the page
> > > allocator. The page allocator has its own per cpu page queues that should provide
> > > enough caching.
> > >
> >
> > I think that the advantage of the slab allocator is
> > - an object is already initialized at setup, so you don't have to initialize
> > it again at allocation.
> > - an object is initialized only once, when the slab is created.
>
> If you do that then you lose the cache hot advantage. It is advantageous
> to initialize the object and then immediately use it. If you initialize it
> earlier, the cacheline will be evicted and then brought back.
hmm, I don't have a precise analysis of the performance benefit of the slab on
current (Linux + fast CPU/bus + large cache) systems. I would be glad if you
could show the performance of the new "Simple Slab" next time.
>
> > If a slab page is returned to the page allocator ASAP, the number of object
> > initializations may increase.
>
> The initializers in the existing slab are only used in rare cases in the
> kernel.
>
>
Because of inode_init_once, much of the code that uses inodes relies on
initialization code. And the inode is one of the heavy users of the slab.
[kamezawa@aworks linux-2.6.18-rc4]$ grep init_once /proc/kallsyms
c0155cdf t init_once
c0163e00 t init_once
c016ef02 t init_once
c0172a01 T inode_init_once
c0172b77 t init_once
c0187704 t init_once
c019a19f t init_once
c01aa643 t init_once
c01abd3f t init_once
c01acdab t init_once
c01b2500 t init_once
c01d3139 t init_once
c01db8b3 t init_once
c02f9a62 t init_once
c0358ab5 t init_once
c03b9f6c r __ksymtab_inode_init_once
c03c28fe r __kstrtab_inode_init_once
Performance measurements related to inodes should show useful data, maybe.
Thanks,
- Kame
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 5:44 ` KAMEZAWA Hiroyuki
@ 2006-08-10 5:44 ` Christoph Lameter
2006-08-10 5:56 ` KAMEZAWA Hiroyuki
2006-08-10 6:13 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2006-08-10 5:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: mpm, npiggin, manfred, linux-kernel
On Thu, 10 Aug 2006, KAMEZAWA Hiroyuki wrote:
> Because of inode_init_once, much of the code that uses inodes relies on
> initialization code. And the inode is one of the heavy users of the slab.
Probably just code copied from the same location. It has the same name.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 5:44 ` Christoph Lameter
@ 2006-08-10 5:56 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-08-10 5:56 UTC (permalink / raw)
To: Christoph Lameter; +Cc: mpm, npiggin, manfred, linux-kernel
On Wed, 9 Aug 2006 22:44:59 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 10 Aug 2006, KAMEZAWA Hiroyuki wrote:
>
> > Because of inode_init_once, much of the code that uses inodes relies on
> > initialization code. And the inode is one of the heavy users of the slab.
>
> Probably just code copied from the same location. It has the same name.
>
for example, ext3's copy of init_once() is
--
static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
{
struct ext3_inode_info *ei = (struct ext3_inode_info *) foo;
if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
SLAB_CTOR_CONSTRUCTOR) {
INIT_LIST_HEAD(&ei->i_orphan); ---------------(A)
#ifdef CONFIG_EXT3_FS_XATTR
init_rwsem(&ei->xattr_sem); ---------------(B)
#endif
mutex_init(&ei->truncate_mutex); ---------------(C)
inode_init_once(&ei->vfs_inode);
}
}
--
(A), (B) and (C) are only for ext3.
NFS's
--
static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
{
struct nfs_inode *nfsi = (struct nfs_inode *) foo;
if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
SLAB_CTOR_CONSTRUCTOR) {
inode_init_once(&nfsi->vfs_inode);
spin_lock_init(&nfsi->req_lock);
INIT_LIST_HEAD(&nfsi->dirty);
INIT_LIST_HEAD(&nfsi->commit);
INIT_LIST_HEAD(&nfsi->open_files);
INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC);
atomic_set(&nfsi->data_updates, 0);
nfsi->ndirty = 0;
nfsi->ncommit = 0;
nfsi->npages = 0;
nfs4_init_once(nfsi);
}
}
--
Of course, many init_once() implementations just call inode_init_once(). But
some filesystems do something special.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 5:44 ` KAMEZAWA Hiroyuki
2006-08-10 5:44 ` Christoph Lameter
@ 2006-08-10 6:13 ` KAMEZAWA Hiroyuki
2006-08-10 15:25 ` Christoph Lameter
1 sibling, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-08-10 6:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: clameter, mpm, npiggin, manfred, linux-kernel
On Thu, 10 Aug 2006 14:44:51 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > If you do that then you lose the cache hot advantage. It is advantageous
> > to initialize the object and then immediately use it. If you initialize it
> > earlier, the cacheline will be evicted and then brought back.
>
> hmm, I don't have a precise analysis of the performance benefit of the slab on
> current (Linux + fast CPU/bus + large cache) systems. I would be glad if you
> could show the performance of the new "Simple Slab" next time.
>
BTW, in recent Linux, many objects are freed by call_rcu(foo, delayed_free_foo).
Objects are freed long after they were last touched.
I think caching this kind of freeing will not boost performance via cache hits
when the freed page (object) is reused.
-Kame
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 6:13 ` KAMEZAWA Hiroyuki
@ 2006-08-10 15:25 ` Christoph Lameter
2006-08-10 18:47 ` Manfred Spraul
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2006-08-10 15:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: mpm, npiggin, manfred, linux-kernel
On Thu, 10 Aug 2006, KAMEZAWA Hiroyuki wrote:
> BTW, in recent Linux, many objects are freed by call_rcu(foo, delayed_free_foo).
> Objects are freed long after they were last touched.
> I think caching this kind of freeing will not boost performance via cache hits
> when the freed page (object) is reused.
Yes, that is a general problem with RCU freeing. One can use the
SLAB_DESTROY_BY_RCU option to have RCU applied to the whole slab. In that
case one can use the cache hot effect but has the additional problem in RCU
of dealing with the issue that the object can be replaced at any time.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 15:25 ` Christoph Lameter
@ 2006-08-10 18:47 ` Manfred Spraul
2006-08-10 18:52 ` Christoph Lameter
2006-08-11 17:21 ` Christoph Lameter
0 siblings, 2 replies; 16+ messages in thread
From: Manfred Spraul @ 2006-08-10 18:47 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, mpm, npiggin, linux-kernel
Christoph Lameter wrote:
>On Thu, 10 Aug 2006, KAMEZAWA Hiroyuki wrote:
>
>
>
>>BTW, in recent Linux many objects are freed via call_rcu(foo, delayed_free_foo).
>>Such objects are freed long after they were last touched.
>>I think reusing freed pages (objects) of this kind will not boost
>>performance through cache hits.
>>
>>
>
>Yes, that is a general problem with RCU freeing. One can use the
>SLAB_DESTROY_BY_RCU option to have RCU applied to the whole slab. In that
>case one can use the cache hot effect, but then RCU users have the
>additional problem of dealing with the fact that the object can be
>replaced at any time.
>
>
No, SLAB_DESTROY_BY_RCU is not equivalent to delayed_free_foo().
SLAB_DESTROY_BY_RCU just means that the slab allocator uses
delayed_free_pages() instead of free_pages().
kmem_cache_free() does not delay the reuse; an object will be returned
by the next kmem_cache_alloc(), without any grace period in between.
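To make the distinction concrete, here is a minimal userspace sketch (all names invented for illustration; this is a model, not kernel code): a freed object is handed out again immediately, and only the release of the backing page is what gets deferred for a grace period.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of SLAB_DESTROY_BY_RCU semantics (invented names, not
 * kernel API): kmem_cache_free()-style frees reuse objects at once;
 * only the backing page's return to the page allocator is deferred. */

enum { NOBJ = 4 };

struct toy_slab {
	void *freelist[NOBJ];	/* LIFO stack of free objects */
	int nfree;
	int page_free_deferred;	/* set when the page itself would be freed */
};

static void toy_free(struct toy_slab *s, void *obj)
{
	s->freelist[s->nfree++] = obj;	/* immediate: no grace period here */
	if (s->nfree == NOBJ)
		s->page_free_deferred = 1;	/* only this step is RCU-delayed */
}

static void *toy_alloc(struct toy_slab *s)
{
	return s->nfree ? s->freelist[--s->nfree] : NULL;
}
```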
Independently from that point, we need some benchmarks to test the
allocator.
The last benchmarks of the slab allocator (that I'm aware of) were done
with packet routing - packet routing was the reason why the shared_array
layer was added:
The shared_array layer is used to perform inter-cpu bulk object
transfers. Without that cache, i.e. if a list_add() / list_del() was
required to transfer one object from one cpu to another cpu, a
significant amount of time was spent in the allocator.
Christoph, could you run a test? GigE routing with tiny packets should
be sufficient. Two cpu bound nics, one does rx, the other one tx. Move
the skb_head_cache to sslab.
--
Manfred
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 18:47 ` Manfred Spraul
@ 2006-08-10 18:52 ` Christoph Lameter
2006-08-11 17:21 ` Christoph Lameter
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-08-10 18:52 UTC (permalink / raw)
To: Manfred Spraul; +Cc: KAMEZAWA Hiroyuki, mpm, npiggin, linux-kernel
On Thu, 10 Aug 2006, Manfred Spraul wrote:
> > Yes, that is a general problem with RCU freeing. One can use the
> > SLAB_DESTROY_BY_RCU option to have RCU applied to the whole slab. In that
> > case one can use the cache hot effect, but then RCU users have the
> > additional problem of dealing with the fact that the object can be
> > replaced at any time.
> >
> No, SLAB_DESTROY_BY_RCU is not equivalent to delayed_free_foo().
> SLAB_DESTROY_BY_RCU just means that the slab allocator uses
> delayed_free_pages() instead of free_pages().
> kmem_cache_free() does not delay the reuse; an object will be returned by the
> next kmem_cache_alloc(), without any grace period in between.
Yes that is what I said. SLAB_DESTROY_BY_RCU is RCU applied to the "whole
slab".
> Independently from that point, we need some benchmarks to test the allocator.
Right. This is pretty early for tests though. It's barely functional.
> The last benchmarks of the slab allocator (that I'm aware of) were done with
> packet routing - packet routing was the reason why the shared_array layer was
> added:
> The shared_array layer is used to perform inter-cpu bulk object transfers.
> Without that cache, i.e. if a list_add() / list_del() was required to transfer
> one object from one cpu to another cpu, a significant amount of time was spent
> in the allocator.
If the overhead of general allocation/free from a slab is reduced then
this effect should not occur. IMHO it may turn out that the need for
the shared array is an artifact of the per cpu caches.
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-10 18:47 ` Manfred Spraul
2006-08-10 18:52 ` Christoph Lameter
@ 2006-08-11 17:21 ` Christoph Lameter
2006-08-11 20:33 ` Manfred Spraul
1 sibling, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2006-08-11 17:21 UTC (permalink / raw)
To: Manfred Spraul; +Cc: KAMEZAWA Hiroyuki, mpm, npiggin, linux-kernel
On Thu, 10 Aug 2006, Manfred Spraul wrote:
> Christoph, could you run a test? GigE routing with tiny packets should be
> sufficient. Two cpu bound nics, one does rx, the other one tx. Move the
> skb_head_cache to sslab.
You are right. The problem is that the simple slab does not do the LIFO
thing that the regular slab does. Instead it services objects from the same
page until it's empty, and then it gets a new page (which is very unlikely
to be cache hot). The cpu caches are necessary in order to effectively
handle objects that still have cachelines in the various processor
memory caches. So at a minimum we would need a cpucache layer on top of
the simple slab.
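A toy model of that allocation order (invented names, purely to illustrate the point): objects come from the current page until it is exhausted, so a just-freed, cache hot object is not preferentially handed back the way the regular slab's LIFO cpu cache would hand it back.

```c
#include <assert.h>

/* Toy model of the simple slab's allocation order (invented names):
 * hand out objects from the current page until it is exhausted, then
 * move on to a fresh page -- no LIFO reuse of hot objects. */

enum { OBJS_PER_PAGE = 3, NPAGES = 8 };

struct toy_cache {
	int next_obj[NPAGES];	/* next unused object in each page */
	int cur;		/* page we currently allocate from */
};

/* returns an object id encoded as page * 100 + object index */
static int sslab_alloc(struct toy_cache *c)
{
	if (c->next_obj[c->cur] == OBJS_PER_PAGE)
		c->cur++;	/* page exhausted: take a new, cache cold page */
	return c->cur * 100 + c->next_obj[c->cur]++;
}
```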
I still do not get the role of the shared cache though. It is from the
early days of SMP, and at that time I thought that all processors had
separate caches? Thus crossfeeding objects could not have had much of a
benefit. On the other hand, NUMA nodes cache series of cachelines when
going onto the interconnect. The shared cache is useful to track
the state of objects that are potentially in that cache. Reuse of those
objects would avoid additional cacheline acquisitions. The same is true
for multi core: the shared arrays provide a kind of picture of the state
of the hardware shared caches. Maybe the shared caches need to be
configured depending on the underlying cache architecture?
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-11 17:21 ` Christoph Lameter
@ 2006-08-11 20:33 ` Manfred Spraul
2006-08-11 21:02 ` Christoph Lameter
0 siblings, 1 reply; 16+ messages in thread
From: Manfred Spraul @ 2006-08-11 20:33 UTC (permalink / raw)
To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, mpm, npiggin, linux-kernel
Christoph Lameter wrote:
>I still do not get the role of the shared cache though.
>
The shared cache is just for efficient object transfers:
Think about two nics, both cpu bound, one does rx, the other does tx.
Result: a few 100k kmalloc, kmem_cache_alloc(skb_head_cache) calls each
second on cpu1.
the same number of kfree, kmem_cache_free(skb_head_cache) calls each
second on cpu 2.
What is needed is an efficient algorithm for transfering all objects
from cpu 2 to cpu 1.
Initially, the slab allocator just had the cpu cache. Thus an object
transfer was a free_block call: add the freed object to the bufctl
linked list, move the slab to the tail of the partial list. Probably the
list_del()/list_add() calls caused cache line thrashing, but I don't
remember the details. IIRC Robert Olsson did the test. Pentium III Xeon
system?
Anyway: The solution was the shared array. It allows objects to be moved
around with a simple memmove of the pointers, without the
list_del()/list_add() calls.
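Sketched in userspace (simplified, invented names), the point is that the bulk transfer is a single copy of object pointers rather than one list operation per object:

```c
#include <assert.h>
#include <string.h>

/* Toy shared-array transfer (invented names): drain up to n object
 * pointers from one array into another with a single memcpy, instead
 * of n list_del()/list_add() pairs touching slab management lists. */

enum { ARR_MAX = 16 };

struct ptr_array {
	void *objs[ARR_MAX];
	int count;
};

static int bulk_transfer(struct ptr_array *dst, struct ptr_array *src, int n)
{
	if (n > src->count)
		n = src->count;
	if (n > ARR_MAX - dst->count)
		n = ARR_MAX - dst->count;
	/* one copy moves all n pointers at once */
	memcpy(&dst->objs[dst->count], &src->objs[src->count - n],
	       n * sizeof(void *));
	dst->count += n;
	src->count -= n;
	return n;
}
```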
--
Manfred
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-11 20:33 ` Manfred Spraul
@ 2006-08-11 21:02 ` Christoph Lameter
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2006-08-11 21:02 UTC (permalink / raw)
To: Manfred Spraul; +Cc: KAMEZAWA Hiroyuki, mpm, npiggin, linux-kernel
On Fri, 11 Aug 2006, Manfred Spraul wrote:
> Christoph Lameter wrote:
>
> > I still do not get the role of the shared cache though.
> >
> The shared cache is just for efficient object transfers:
> Think about two nics, both cpu bound, one does rx, the other does tx.
> Result: a few 100k kmalloc, kmem_cache_alloc(skb_head_cache) calls each second
> on cpu1.
> the same number of kfree, kmem_cache_free(skb_head_cache) calls each second on
> cpu 2.
Hmmm. In that case a faster free/alloc would help since cachelines are not
shared.
> Initially, the slab allocator just had the cpu cache. Thus an object transfer
> was a free_block call: add the freed object to the bufctl linked list. Move
> the slab to the tail of the partial list. Probably the list_del()/list_add()
> calls caused cache line thrashing, but I don't remember the details. IIRC
> Robert Olsson did the test. Pentium III Xeon system?
> Anyway: The solution was the shared array. It allows objects to be moved
> around with a simple memmove of the pointers, without the list_del()/list_add()
> calls.
But the shared array still needs the list_lock for access. So this is
avoiding the list operations?
In the case of the simple slab we would have one task continually
allocating from its slab until it's full. If the other task is freeing
objects in the same slab at the same time then there is lock contention on
the per cpu slab.
If however there would be sufficient distance between the free and the
alloc then one task would complete a slab put it back to the full list and
get it back from the partial list.
The other task would be removing objects from the full slab and move it
back to the partial list.
In an ideal case we would have two slabs continually trading roles, with
separate slab locks. The list lock would only be taken if they switch
roles. If we can get that going then it has the same effect as the shared
cache.
Hmmm... Hmmm...
Essentially, if we discover that a process frees into the cpuslab of
another, then that cpuslab must be detached and the other process must be
forced to pick another slab for allocation.
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
@ 2006-08-14 1:10 linux
2006-08-14 11:47 ` Andi Kleen
0 siblings, 1 reply; 16+ messages in thread
From: linux @ 2006-08-14 1:10 UTC (permalink / raw)
To: linux-kernel
Um, with all this discussion of keeping caches hot, people do remember
that FIFO handling of free blocks *greatly* reduces fragmentation, right?
That's an observation from malloc implementations that support merging
of any two adjacent blocks, but at least some of it should apply to slab
pages that require multiple adjacent free objects to be returned to the
free-page pool.
With steady-state allocations and a LIFO free list, your "hot" end
of the list is never free long enough to be combined, and the "cold"
end, which shared pages with long-lived objects that have no hope of
ever being freed, is rarely used and just wastes memory.
Managing the free list FIFO gives every chunk an equal opportunity to
have its neighbor chunks freed.
The first idea that comes to mind for adapting this to a slab cache is
to put the cache pages on a free list. Whenever a chunk is freed on
a page, that page is moved to the "recent" end. Objects are allocated
from the page at the "old" end until it is full, then the next-oldest
page taken, and so on.
Completely free pages are either returned to the system, or put on a
lowest-priority list that is only used when the other pages are all
full.
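A toy version of the scheme described above (invented names): pages sit in a queue ordered by the last free into them, and allocations drain the page at the "old" end, so every page gets a chance to empty out completely.

```c
#include <assert.h>

/* Toy FIFO page list (invented names): a free into a page moves that
 * page to the "recent" end of the queue; allocations always come from
 * the oldest page that still has a free object. */

enum { NPAGES = 4 };

struct fifo_cache {
	int order[NPAGES];	/* page indices, oldest first */
	int free_cnt[NPAGES];	/* free objects per page */
	int n;			/* pages currently in the queue */
};

/* a free in page p moves it to the "recent" end */
static void page_freed_into(struct fifo_cache *c, int p)
{
	int i, j;

	c->free_cnt[p]++;
	for (i = 0; i < c->n; i++)
		if (c->order[i] == p) {
			for (j = i; j < c->n - 1; j++)
				c->order[j] = c->order[j + 1];
			c->order[c->n - 1] = p;
			return;
		}
	c->order[c->n++] = p;	/* page enters the queue on first free */
}

/* allocate from the oldest page that still has free objects */
static int fifo_alloc(struct fifo_cache *c)
{
	int i, p;

	for (i = 0; i < c->n; i++) {
		p = c->order[i];
		if (c->free_cnt[p] > 0) {
			c->free_cnt[p]--;
			return p;
		}
	}
	return -1;	/* everything full: caller takes a fresh page */
}
```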
Especially in a memory-constrained embedded environment, I'd think
space-efficiency would be at least as important as time.
* Re: [RFC] Simple Slab: A slab allocator with minimal meta information
2006-08-14 1:10 linux
@ 2006-08-14 11:47 ` Andi Kleen
0 siblings, 0 replies; 16+ messages in thread
From: Andi Kleen @ 2006-08-14 11:47 UTC (permalink / raw)
To: linux; +Cc: linux-kernel, clameter
linux@horizon.com writes:
[this time without typo in cc, sorry if you get it twice]
> Um, with all this discussion of keeping caches hot, people do remember
> that FIFO handling of free blocks *greatly* reduces fragmentation, right?
>
> That's an observation from malloc implementations that support merging
> of any two adjacent blocks, but at least some of it should apply to slab
> pages that require multiple adjacent free objects to be returned to the
> free-page pool.
Interesting observation.
The slab is a zone allocator and stores objects of the same type
into zones. The theory behind that is that objects of the same
type normally have similar lifetimes, which should in theory avoid
many fragmentation problems.
However some caches like dcache/inode cache don't seem to follow
that, and kmalloc which mixes all kinds of objects into the same
caches especially doesn't.
> Especially in a memory-constrained embedded environment, I'd think
> space-efficiency would be at least as important as time.
For memory-constrained environments there is already the optional
specialized "slob" allocator.
That said even big systems have problems with fragmentation.
It is good that the basic assumptions behind slabs are being
revisited now. I suspect some of them might be obsolete.
-Andi
Thread overview: 16+ messages
2006-08-10 0:52 [RFC] Simple Slab: A slab allocator with minimal meta information Christoph Lameter
2006-08-10 2:07 ` Matt Mackall
2006-08-10 5:01 ` KAMEZAWA Hiroyuki
2006-08-10 5:13 ` Christoph Lameter
2006-08-10 5:44 ` KAMEZAWA Hiroyuki
2006-08-10 5:44 ` Christoph Lameter
2006-08-10 5:56 ` KAMEZAWA Hiroyuki
2006-08-10 6:13 ` KAMEZAWA Hiroyuki
2006-08-10 15:25 ` Christoph Lameter
2006-08-10 18:47 ` Manfred Spraul
2006-08-10 18:52 ` Christoph Lameter
2006-08-11 17:21 ` Christoph Lameter
2006-08-11 20:33 ` Manfred Spraul
2006-08-11 21:02 ` Christoph Lameter
-- strict thread matches above, loose matches on Subject: below --
2006-08-14 1:10 linux
2006-08-14 11:47 ` Andi Kleen