From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Thomas Graf <tgraf@suug.ch>, David Miller <davem@davemloft.net>,
Andrew Morton <akpm@linux-foundation.org>,
Daniel Phillips <phillips@google.com>,
Pekka Enberg <penberg@cs.helsinki.fi>, Paul Jackson <pj@sgi.com>
Subject: Re: [PATCH 0/5] make slab gfp fair
Date: Sun, 20 May 2007 10:39:44 +0200 [thread overview]
Message-ID: <1179650384.7019.33.camel@twins> (raw)
In-Reply-To: <Pine.LNX.4.64.0705181002400.9372@schroedinger.engr.sgi.com>
Ok, full reset.
I care about kernel allocations only. In particular about those that
have PF_MEMALLOC semantics.
The thing I need is that any memory allocated below
ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
is only ever used by processes that have ALLOC_NO_WATERMARKS rights;
for the duration of the distress.
What this patch does:
- change the page allocator to try ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
if ALLOC_NO_WATERMARKS, before the actual ALLOC_NO_WATERMARKS alloc
- set page->reserve nonzero for each page allocated with
ALLOC_NO_WATERMARKS; which by the previous point implies that all
available zones are below ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
- when a page->reserve slab is allocated store it in s->reserve_slab
and do not update the ->cpu_slab[] (this forces subsequent allocs to
retry the allocation).
All ALLOC_NO_WATERMARKS enabled slab allocations are served from
->reserve_slab, up until the point where a !page->reserve slab alloc
succeeds, at which point the ->reserve_slab is pushed into the partial
lists and ->reserve_slab set to NULL.
Since only the allocation of a new slab uses the gfp zone flags, and
other allocations placement hints they have to be uniform over all slab
allocs for a given kmem_cache. Thus the s->reserve_slab/page->reserve
status is kmem_cache wide.
Any holes left?
---
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h
+++ linux-2.6-git/mm/internal.h
@@ -12,6 +12,7 @@
#define __MM_INTERNAL_H
#include <linux/mm.h>
+#include <linux/hardirq.h>
static inline void set_page_count(struct page *page, int v)
{
@@ -37,4 +38,50 @@ static inline void __put_page(struct pag
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
+#define ALLOC_HARDER 0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN 0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
#endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c
+++ linux-2.6-git/mm/page_alloc.c
@@ -1175,14 +1175,6 @@ failed:
return NULL;
}
-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
-#define ALLOC_HARDER 0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-
#ifdef CONFIG_FAIL_PAGE_ALLOC
static struct fail_page_alloc_attr {
@@ -1494,6 +1486,7 @@ zonelist_scan:
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
+ page->reserve = (alloc_flags & ALLOC_NO_WATERMARKS);
break;
this_zone_full:
if (NUMA_BUILD)
@@ -1619,48 +1612,36 @@ restart:
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ /* This is the last chance, in general, before the goto nopage. */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ alloc_flags & ~ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
/* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
+ /*
+ * Before going bare metal, try to get a page above the
+ * critical threshold - ignoring CPU sets.
+ */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER);
+ if (page)
+ goto got_pg;
+
+ /* go through the zonelist yet again, ignoring mins */
+ page = get_page_from_freelist(gfp_mask, order,
zonelist, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
+ if (page)
+ goto got_pg;
+ if (wait && (gfp_mask & __GFP_NOFAIL)) {
+ congestion_wait(WRITE, HZ/50);
+ goto nofail_alloc;
}
goto nopage;
}
@@ -1669,6 +1650,10 @@ nofail_alloc:
if (!wait)
goto nopage;
+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
cond_resched();
/* We now go into synchronous reclaim */
Index: linux-2.6-git/include/linux/mm_types.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_types.h
+++ linux-2.6-git/include/linux/mm_types.h
@@ -60,6 +60,7 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
+ int reserve; /* page_alloc: page is a reserve page */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -46,6 +46,8 @@ struct kmem_cache {
struct list_head list; /* List of slab caches */
struct kobject kobj; /* For sysfs */
+ struct page *reserve_slab;
+
#ifdef CONFIG_NUMA
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,11 +20,13 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"
/*
* Lock order:
- * 1. slab_lock(page)
- * 2. slab->list_lock
+ * 1. reserve_lock
+ * 2. slab_lock(page)
+ * 3. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
static void sysfs_slab_remove(struct kmem_cache *s) {}
#endif
+static DEFINE_SPINLOCK(reserve_lock);
+
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
s->ctor(object, s, 0);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
+ *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1395,6 +1400,7 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int reserve = 0;
if (!page)
goto new_slab;
@@ -1424,10 +1430,25 @@ new_slab:
if (page) {
s->cpu_slab[cpu] = page;
goto load_freelist;
- }
+ } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto try_reserve;
- page = new_slab(s, gfpflags, node);
- if (page) {
+alloc_slab:
+ page = new_slab(s, gfpflags, node, &reserve);
+ if (page && !reserve) {
+ if (unlikely(s->reserve_slab)) {
+ struct page *reserve;
+
+ spin_lock(&reserve_lock);
+ reserve = s->reserve_slab;
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+
+ if (reserve) {
+ slab_lock(reserve);
+ unfreeze_slab(s, reserve);
+ }
+ }
cpu = smp_processor_id();
if (s->cpu_slab[cpu]) {
/*
@@ -1455,6 +1476,18 @@ new_slab:
SetSlabFrozen(page);
s->cpu_slab[cpu] = page;
goto load_freelist;
+ } else if (page) {
+ spin_lock(&reserve_lock);
+ if (s->reserve_slab) {
+ discard_slab(s, page);
+ page = s->reserve_slab;
+ goto got_reserve;
+ }
+ slab_lock(page);
+ SetSlabFrozen(page);
+ s->reserve_slab = page;
+ spin_unlock(&reserve_lock);
+ goto use_reserve;
}
return NULL;
debug:
@@ -1470,6 +1503,31 @@ debug:
page->freelist = object[page->offset];
slab_unlock(page);
return object;
+
+try_reserve:
+ spin_lock(&reserve_lock);
+ page = s->reserve_slab;
+ if (!page) {
+ spin_unlock(&reserve_lock);
+ goto alloc_slab;
+ }
+
+got_reserve:
+ slab_lock(page);
+ if (!page->freelist) {
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+ unfreeze_slab(s, page);
+ goto alloc_slab;
+ }
+ spin_unlock(&reserve_lock);
+
+use_reserve:
+ object = page->freelist;
+ page->inuse++;
+ page->freelist = object[page->offset];
+ slab_unlock(page);
+ return object;
}
/*
@@ -1807,10 +1865,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int reserve;
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);
/* new_slab() disables interupts */
local_irq_enable();
@@ -2018,6 +2077,8 @@ static int kmem_cache_open(struct kmem_c
#ifdef CONFIG_NUMA
s->defrag_ratio = 100;
#endif
+ s->reserve_slab = NULL;
+
if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
return 1;
error:
WARNING: multiple messages have this Message-ID (diff)
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Thomas Graf <tgraf@suug.ch>, David Miller <davem@davemloft.net>,
Andrew Morton <akpm@linux-foundation.org>,
Daniel Phillips <phillips@google.com>,
Pekka Enberg <penberg@cs.helsinki.fi>, Paul Jackson <pj@sgi.com>
Subject: Re: [PATCH 0/5] make slab gfp fair
Date: Sun, 20 May 2007 10:39:44 +0200 [thread overview]
Message-ID: <1179650384.7019.33.camel@twins> (raw)
In-Reply-To: <Pine.LNX.4.64.0705181002400.9372@schroedinger.engr.sgi.com>
Ok, full reset.
I care about kernel allocations only. In particular about those that
have PF_MEMALLOC semantics.
The thing I need is that any memory allocated below
ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
is only ever used by processes that have ALLOC_NO_WATERMARKS rights;
for the duration of the distress.
What this patch does:
- change the page allocator to try ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
if ALLOC_NO_WATERMARKS, before the actual ALLOC_NO_WATERMARKS alloc
- set page->reserve nonzero for each page allocated with
ALLOC_NO_WATERMARKS; which by the previous point implies that all
available zones are below ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
- when a page->reserve slab is allocated store it in s->reserve_slab
and do not update the ->cpu_slab[] (this forces subsequent allocs to
retry the allocation).
All ALLOC_NO_WATERMARKS enabled slab allocations are served from
->reserve_slab, up until the point where a !page->reserve slab alloc
succeeds, at which point the ->reserve_slab is pushed into the partial
lists and ->reserve_slab set to NULL.
Since only the allocation of a new slab uses the gfp zone flags, and
other allocations placement hints they have to be uniform over all slab
allocs for a given kmem_cache. Thus the s->reserve_slab/page->reserve
status is kmem_cache wide.
Any holes left?
---
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h
+++ linux-2.6-git/mm/internal.h
@@ -12,6 +12,7 @@
#define __MM_INTERNAL_H
#include <linux/mm.h>
+#include <linux/hardirq.h>
static inline void set_page_count(struct page *page, int v)
{
@@ -37,4 +38,50 @@ static inline void __put_page(struct pag
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
+#define ALLOC_HARDER 0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN 0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
#endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c
+++ linux-2.6-git/mm/page_alloc.c
@@ -1175,14 +1175,6 @@ failed:
return NULL;
}
-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
-#define ALLOC_HARDER 0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-
#ifdef CONFIG_FAIL_PAGE_ALLOC
static struct fail_page_alloc_attr {
@@ -1494,6 +1486,7 @@ zonelist_scan:
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
+ page->reserve = (alloc_flags & ALLOC_NO_WATERMARKS);
break;
this_zone_full:
if (NUMA_BUILD)
@@ -1619,48 +1612,36 @@ restart:
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ /* This is the last chance, in general, before the goto nopage. */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ alloc_flags & ~ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
/* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
+ /*
+ * Before going bare metal, try to get a page above the
+ * critical threshold - ignoring CPU sets.
+ */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER);
+ if (page)
+ goto got_pg;
+
+ /* go through the zonelist yet again, ignoring mins */
+ page = get_page_from_freelist(gfp_mask, order,
zonelist, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
+ if (page)
+ goto got_pg;
+ if (wait && (gfp_mask & __GFP_NOFAIL)) {
+ congestion_wait(WRITE, HZ/50);
+ goto nofail_alloc;
}
goto nopage;
}
@@ -1669,6 +1650,10 @@ nofail_alloc:
if (!wait)
goto nopage;
+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
cond_resched();
/* We now go into synchronous reclaim */
Index: linux-2.6-git/include/linux/mm_types.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_types.h
+++ linux-2.6-git/include/linux/mm_types.h
@@ -60,6 +60,7 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
+ int reserve; /* page_alloc: page is a reserve page */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -46,6 +46,8 @@ struct kmem_cache {
struct list_head list; /* List of slab caches */
struct kobject kobj; /* For sysfs */
+ struct page *reserve_slab;
+
#ifdef CONFIG_NUMA
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,11 +20,13 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"
/*
* Lock order:
- * 1. slab_lock(page)
- * 2. slab->list_lock
+ * 1. reserve_lock
+ * 2. slab_lock(page)
+ * 3. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
static void sysfs_slab_remove(struct kmem_cache *s) {}
#endif
+static DEFINE_SPINLOCK(reserve_lock);
+
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
s->ctor(object, s, 0);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
+ *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1395,6 +1400,7 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int reserve = 0;
if (!page)
goto new_slab;
@@ -1424,10 +1430,25 @@ new_slab:
if (page) {
s->cpu_slab[cpu] = page;
goto load_freelist;
- }
+ } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto try_reserve;
- page = new_slab(s, gfpflags, node);
- if (page) {
+alloc_slab:
+ page = new_slab(s, gfpflags, node, &reserve);
+ if (page && !reserve) {
+ if (unlikely(s->reserve_slab)) {
+ struct page *reserve;
+
+ spin_lock(&reserve_lock);
+ reserve = s->reserve_slab;
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+
+ if (reserve) {
+ slab_lock(reserve);
+ unfreeze_slab(s, reserve);
+ }
+ }
cpu = smp_processor_id();
if (s->cpu_slab[cpu]) {
/*
@@ -1455,6 +1476,18 @@ new_slab:
SetSlabFrozen(page);
s->cpu_slab[cpu] = page;
goto load_freelist;
+ } else if (page) {
+ spin_lock(&reserve_lock);
+ if (s->reserve_slab) {
+ discard_slab(s, page);
+ page = s->reserve_slab;
+ goto got_reserve;
+ }
+ slab_lock(page);
+ SetSlabFrozen(page);
+ s->reserve_slab = page;
+ spin_unlock(&reserve_lock);
+ goto use_reserve;
}
return NULL;
debug:
@@ -1470,6 +1503,31 @@ debug:
page->freelist = object[page->offset];
slab_unlock(page);
return object;
+
+try_reserve:
+ spin_lock(&reserve_lock);
+ page = s->reserve_slab;
+ if (!page) {
+ spin_unlock(&reserve_lock);
+ goto alloc_slab;
+ }
+
+got_reserve:
+ slab_lock(page);
+ if (!page->freelist) {
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+ unfreeze_slab(s, page);
+ goto alloc_slab;
+ }
+ spin_unlock(&reserve_lock);
+
+use_reserve:
+ object = page->freelist;
+ page->inuse++;
+ page->freelist = object[page->offset];
+ slab_unlock(page);
+ return object;
}
/*
@@ -1807,10 +1865,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int reserve;
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);
/* new_slab() disables interupts */
local_irq_enable();
@@ -2018,6 +2077,8 @@ static int kmem_cache_open(struct kmem_c
#ifdef CONFIG_NUMA
s->defrag_ratio = 100;
#endif
+ s->reserve_slab = NULL;
+
if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
return 1;
error:
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-05-20 8:39 UTC|newest]
Thread overview: 138+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-05-14 13:19 [PATCH 0/5] make slab gfp fair Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 13:19 ` [PATCH 1/5] mm: page allocation rank Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 13:19 ` [PATCH 2/5] mm: slab allocation fairness Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 15:51 ` Christoph Lameter
2007-05-14 15:51 ` Christoph Lameter
2007-05-14 13:19 ` [PATCH 3/5] mm: slub " Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 15:49 ` Christoph Lameter
2007-05-14 15:49 ` Christoph Lameter
2007-05-14 16:14 ` Peter Zijlstra
2007-05-14 16:14 ` Peter Zijlstra
2007-05-14 16:35 ` Christoph Lameter
2007-05-14 16:35 ` Christoph Lameter
2007-05-14 13:19 ` [PATCH 4/5] mm: slob " Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 13:19 ` [PATCH 5/5] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 15:53 ` [PATCH 0/5] make slab gfp fair Christoph Lameter
2007-05-14 15:53 ` Christoph Lameter
2007-05-14 16:10 ` Peter Zijlstra
2007-05-14 16:10 ` Peter Zijlstra
2007-05-14 16:37 ` Christoph Lameter
2007-05-14 16:37 ` Christoph Lameter
2007-05-14 16:12 ` Matt Mackall
2007-05-14 16:12 ` Matt Mackall
2007-05-14 16:29 ` Christoph Lameter
2007-05-14 16:29 ` Christoph Lameter
2007-05-14 17:40 ` Peter Zijlstra
2007-05-14 17:40 ` Peter Zijlstra
2007-05-14 17:57 ` Christoph Lameter
2007-05-14 17:57 ` Christoph Lameter
2007-05-14 19:28 ` Peter Zijlstra
2007-05-14 19:28 ` Peter Zijlstra
2007-05-14 19:56 ` Christoph Lameter
2007-05-14 19:56 ` Christoph Lameter
2007-05-14 20:03 ` Peter Zijlstra
2007-05-14 20:03 ` Peter Zijlstra
2007-05-14 20:06 ` Christoph Lameter
2007-05-14 20:06 ` Christoph Lameter
2007-05-14 20:12 ` Peter Zijlstra
2007-05-14 20:12 ` Peter Zijlstra
2007-05-14 20:25 ` Christoph Lameter
2007-05-14 20:25 ` Christoph Lameter
2007-05-15 17:27 ` Peter Zijlstra
2007-05-15 17:27 ` Peter Zijlstra
2007-05-15 22:02 ` Christoph Lameter
2007-05-15 22:02 ` Christoph Lameter
2007-05-16 6:59 ` Peter Zijlstra
2007-05-16 6:59 ` Peter Zijlstra
2007-05-16 18:43 ` Christoph Lameter
2007-05-16 18:43 ` Christoph Lameter
2007-05-16 19:25 ` Peter Zijlstra
2007-05-16 19:25 ` Peter Zijlstra
2007-05-16 19:53 ` Christoph Lameter
2007-05-16 19:53 ` Christoph Lameter
2007-05-16 20:18 ` Peter Zijlstra
2007-05-16 20:18 ` Peter Zijlstra
2007-05-16 20:27 ` Christoph Lameter
2007-05-16 20:27 ` Christoph Lameter
2007-05-16 20:40 ` Peter Zijlstra
2007-05-16 20:40 ` Peter Zijlstra
2007-05-16 20:44 ` Christoph Lameter
2007-05-16 20:44 ` Christoph Lameter
2007-05-16 20:54 ` Peter Zijlstra
2007-05-16 20:54 ` Peter Zijlstra
2007-05-16 20:59 ` Christoph Lameter
2007-05-16 20:59 ` Christoph Lameter
2007-05-16 21:04 ` Peter Zijlstra
2007-05-16 21:04 ` Peter Zijlstra
2007-05-16 21:13 ` Christoph Lameter
2007-05-16 21:13 ` Christoph Lameter
2007-05-16 21:20 ` Peter Zijlstra
2007-05-16 21:20 ` Peter Zijlstra
2007-05-16 21:42 ` Christoph Lameter
2007-05-16 21:42 ` Christoph Lameter
2007-05-17 7:28 ` Peter Zijlstra
2007-05-17 7:28 ` Peter Zijlstra
2007-05-17 17:30 ` Christoph Lameter
2007-05-17 17:30 ` Christoph Lameter
2007-05-17 17:53 ` Peter Zijlstra
2007-05-17 17:53 ` Peter Zijlstra
2007-05-17 18:01 ` Christoph Lameter
2007-05-17 18:01 ` Christoph Lameter
2007-05-14 19:44 ` Andrew Morton
2007-05-14 19:44 ` Andrew Morton
2007-05-14 20:01 ` Matt Mackall
2007-05-14 20:01 ` Matt Mackall
2007-05-14 20:05 ` Peter Zijlstra
2007-05-14 20:05 ` Peter Zijlstra
2007-05-17 3:02 ` Christoph Lameter
2007-05-17 3:02 ` Christoph Lameter
2007-05-17 7:08 ` Peter Zijlstra
2007-05-17 7:08 ` Peter Zijlstra
2007-05-17 17:29 ` Christoph Lameter
2007-05-17 17:29 ` Christoph Lameter
2007-05-17 17:52 ` Peter Zijlstra
2007-05-17 17:52 ` Peter Zijlstra
2007-05-17 17:59 ` Christoph Lameter
2007-05-17 17:59 ` Christoph Lameter
2007-05-17 17:53 ` Matt Mackall
2007-05-17 17:53 ` Matt Mackall
2007-05-17 18:02 ` Christoph Lameter
2007-05-17 18:02 ` Christoph Lameter
2007-05-17 19:18 ` Peter Zijlstra
2007-05-17 19:18 ` Peter Zijlstra
2007-05-17 19:24 ` Christoph Lameter
2007-05-17 19:24 ` Christoph Lameter
2007-05-17 21:26 ` Peter Zijlstra
2007-05-17 21:26 ` Peter Zijlstra
2007-05-17 21:44 ` Paul Jackson
2007-05-17 21:44 ` Paul Jackson
2007-05-17 22:27 ` Christoph Lameter
2007-05-17 22:27 ` Christoph Lameter
2007-05-18 9:54 ` Peter Zijlstra
2007-05-18 9:54 ` Peter Zijlstra
2007-05-18 17:11 ` Paul Jackson
2007-05-18 17:11 ` Paul Jackson
2007-05-18 17:11 ` Christoph Lameter
2007-05-18 17:11 ` Christoph Lameter
2007-05-20 8:39 ` Peter Zijlstra [this message]
2007-05-20 8:39 ` Peter Zijlstra
2007-05-21 16:45 ` Christoph Lameter
2007-05-21 16:45 ` Christoph Lameter
2007-05-21 19:33 ` Peter Zijlstra
2007-05-21 19:33 ` Peter Zijlstra
2007-05-21 19:43 ` Christoph Lameter
2007-05-21 19:43 ` Christoph Lameter
2007-05-21 20:08 ` Peter Zijlstra
2007-05-21 20:08 ` Peter Zijlstra
2007-05-21 20:32 ` Christoph Lameter
2007-05-21 20:32 ` Christoph Lameter
2007-05-21 20:54 ` Peter Zijlstra
2007-05-21 20:54 ` Peter Zijlstra
2007-05-21 21:04 ` Christoph Lameter
2007-05-21 21:04 ` Christoph Lameter
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1179650384.7019.33.camel@twins \
--to=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=clameter@sgi.com \
--cc=davem@davemloft.net \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mpm@selenic.com \
--cc=penberg@cs.helsinki.fi \
--cc=phillips@google.com \
--cc=pj@sgi.com \
--cc=tgraf@suug.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.