public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH 0/8] Critical Page Pool
@ 2005-11-18 19:32 Matthew Dobson
  2005-11-18 19:36 ` [RFC][PATCH 1/8] Create " Matthew Dobson
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

We have a clustering product that needs to be able to guarantee that the
networking system won't stop functioning in the case of OOM/low memory
condition.  The current mempool system is inadequate because to keep the
whole networking stack functioning, we need more than 1 or 2 slab caches to
be guaranteed.  We need to guarantee that any request made with a specific
flag will succeed, assuming of course that you've made your "critical page
pool" big enough.

The following patch series implements such a critical page pool.  It
creates 2 userspace triggers:

/proc/sys/vm/critical_pages: write the number of pages you want to reserve
for the critical pool into this file

/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
the system is in an emergency state and authorize the kernel to dip into
the critical pool to satisfy critical allocations.

We mark critical allocations with the __GFP_CRITICAL flag, and when the
system is in an emergency state, we are allowed to delve into this pool to
satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
normal means.

Feedback on our approach would be appreciated.

Thanks!

-Matt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH 1/8] Create Critical Page Pool
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
@ 2005-11-18 19:36 ` Matthew Dobson
  2005-11-19  0:08   ` Paul Jackson
  2005-11-18 19:36 ` [RFC][PATCH 2/8] Create emergency trigger Matthew Dobson
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 77 bytes --]

Create the critical page pool and its associated /proc/sys/vm/ file.

-Matt

[-- Attachment #2: critical_pool.patch --]
[-- Type: text/x-patch, Size: 10222 bytes --]

Implement a Critical Page Pool:

* Write a number of pages into /proc/sys/vm/critical_pages
* These pages will be reserved for 'critical' allocations, signified
     by the __GFP_CRITICAL flag passed to an allocation request.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc1+critical_pool/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/Documentation/sysctl/vm.txt	2005-11-15 13:47:14.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/Documentation/sysctl/vm.txt	2005-11-17 16:27:33.108545112 -0800
@@ -26,6 +26,7 @@ Currently, these files are in /proc/sys/
 - min_free_kbytes
 - laptop_mode
 - block_dump
+- critical_pages
 
 ==============================================================
 
@@ -102,3 +103,12 @@ This is used to force the Linux VM to ke
 of kilobytes free.  The VM uses this number to compute a pages_min
 value for each lowmem zone in the system.  Each lowmem zone gets 
 a number of reserved free pages based proportionally on its size.
+
+==============================================================
+
+critical_pages:
+
+This is used to force the Linux VM to reserve a pool of pages for
+emergency (__GFP_CRITICAL) allocations.  Allocations with this flag
+MUST succeed.
+The number written into this file is the number of pages to reserve.
Index: linux-2.6.15-rc1+critical_pool/include/linux/gfp.h
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/include/linux/gfp.h	2005-11-15 13:46:37.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/include/linux/gfp.h	2005-11-17 16:28:28.375143312 -0800
@@ -41,6 +41,7 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_NORECLAIM  ((__force gfp_t)0x20000u) /* No realy zone reclaim during allocation */
 #define __GFP_HARDWALL   ((__force gfp_t)0x40000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_CRITICAL	((__force gfp_t)0x80000u) /* Critical allocation. MUST succeed! */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -49,7 +50,8 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
+			__GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL| \
+			__GFP_CRITICAL)
 
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_NOIO	(__GFP_WAIT)
@@ -58,6 +60,8 @@ struct vm_area_struct;
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
 			 __GFP_HIGHMEM)
+#define GFP_ATOMIC_CRIT	(GFP_ATOMIC | __GFP_CRITICAL)
+#define GFP_KERNEL_CRIT	(GFP_KERNEL | __GFP_CRITICAL)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.15-rc1+critical_pool/include/linux/sysctl.h
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/include/linux/sysctl.h	2005-11-15 13:46:38.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/include/linux/sysctl.h	2005-11-17 16:27:33.111544656 -0800
@@ -181,6 +181,7 @@ enum
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_CRITICAL_PAGES=30,	/* # of pages to reserve for __GFP_CRITICAL allocs */
 };
 
 
Index: linux-2.6.15-rc1+critical_pool/kernel/sysctl.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/kernel/sysctl.c	2005-11-15 13:47:25.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/kernel/sysctl.c	2005-11-17 16:27:33.116543896 -0800
@@ -849,6 +849,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_CRITICAL_PAGES,
+		.procname	= "critical_pages",
+		.data		= &critical_pages,
+		.maxlen		= sizeof(critical_pages),
+		.mode		= 0644,
+		.proc_handler	= &critical_pages_sysctl_handler,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.15-rc1+critical_pool/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/mm/page_alloc.c	2005-11-15 13:47:26.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/mm/page_alloc.c	2005-11-17 16:27:33.120543288 -0800
@@ -53,6 +53,65 @@ unsigned long totalram_pages __read_most
 unsigned long totalhigh_pages __read_mostly;
 long nr_swap_pages;
 
+/* The number of pages to maintain in the critical page pool */
+int critical_pages = 0;
+
+/* Critical Page Pool control structure */
+static struct critical_pool {
+	unsigned int free;
+	spinlock_t lock;
+	struct list_head pages;
+} critical_pool = {
+	.free	= 0,
+	.lock	= SPIN_LOCK_UNLOCKED,
+	.pages	= LIST_HEAD_INIT(critical_pool.pages),
+};
+
+/* MCD - This needs to be arch specific */
+#define CRITICAL_POOL_GFPMASK	(GFP_KERNEL)
+
+static inline int is_critical_pool_full(void)
+{
+	return critical_pool.free >= critical_pages;
+}
+
+static inline struct page *get_critical_page(gfp_t gfpmask)
+{
+	struct page *page = NULL;
+
+	spin_lock(&critical_pool.lock);
+	if (!list_empty(&critical_pool.pages)) {
+		/* Grab the next free critical pool page */
+		page = list_entry(critical_pool.pages.next, struct page, lru);
+		list_del(&page->lru);
+		critical_pool.free--;
+	}
+	spin_unlock(&critical_pool.lock);
+
+	return page;
+}
+
+static inline int put_critical_page(struct page *page)
+{
+	int ret = 0;
+
+	spin_lock(&critical_pool.lock);
+	if (!is_critical_pool_full()) {
+		/*
+	 * We snatched this page away in the process of being freed, so
+	 * we must re-increment its count so we don't cause problems
+		 * when we hand it back out to a future __GFP_CRITICAL alloc.
+		 */
+		BUG_ON(!get_page_testone(page));
+		list_add(&page->lru, &critical_pool.pages);
+		critical_pool.free++;
+		ret = 1;
+	}
+	spin_unlock(&critical_pool.lock);
+
+	return ret;
+}
+
 /*
  * results with 256, 32 in the lowmem_reserve sysctl:
  *	1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
@@ -305,6 +364,10 @@ static inline void __free_pages_bulk (st
 
 	if (unlikely(order))
 		destroy_compound_page(page, order);
+	else if (!is_critical_pool_full())
+		/* If the critical pool isn't full, add this page to it */
+		if (put_critical_page(page))
+			return;
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
@@ -984,6 +1047,20 @@ rebalance:
 	}
 
 nopage:
+	/*
+	 * This is the LAST DITCH effort.  We maintain a pool of 'critical'
+	 * pages specifically for allocation requests marked __GFP_CRITICAL.
+	 * Rather than fail one of these allocations, take a page (if any)
+	 * from the critical pool.
+	 */
+	if (gfp_mask & __GFP_CRITICAL) {
+		page = get_critical_page(gfp_mask);
+		if (page) {
+			z = page_zone(page);
+			goto got_pg;
+		}
+	}
+
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
 			" order:%d, mode:0x%x\n",
@@ -2522,6 +2599,68 @@ int lowmem_reserve_ratio_sysctl_handler(
 	return 0;
 }
 
+static inline void fill_critical_pool(int num)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		page = alloc_page(CRITICAL_POOL_GFPMASK);
+		if (!page)
+			return;
+		spin_lock(&critical_pool.lock);
+		list_add(&page->lru, &critical_pool.pages);
+		critical_pool.free++;
+		spin_unlock(&critical_pool.lock);
+	}
+}
+
+static inline void drain_critical_pool(int num)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		spin_lock(&critical_pool.lock);
+		BUG_ON(critical_pool.free < 0);
+		if (list_empty(&critical_pool.pages) || !critical_pool.free) {
+			spin_unlock(&critical_pool.lock);
+			break;
+		}
+			
+		page = list_entry(critical_pool.pages.next, struct page, lru);
+		list_del(&page->lru);
+		critical_pool.free--;
+		spin_unlock(&critical_pool.lock);
+			
+		__free_pages(page, 0);
+	}
+}
+
+/*
+ * critical_pages_sysctl_handler - handle writes to /proc/sys/vm/critical_pages
+ *     Whenever this file changes, add/remove pages to/from the critical page
+ *     pool.
+ */
+int critical_pages_sysctl_handler(ctl_table *table, int write,
+				   struct file *file, void __user *buffer,
+				   size_t *length, loff_t *ppos)
+{
+	int num;
+
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (!write)
+		return 0;
+
+	num = critical_pages - critical_pool.free;
+	if (num > 0)
+		fill_critical_pool(num);
+	else if (num < 0)
+		drain_critical_pool(-num);
+
+	return 0;
+}
+
 __initdata int hashdist = HASHDIST_DEFAULT;
 
 #ifdef CONFIG_NUMA
Index: linux-2.6.15-rc1+critical_pool/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/include/linux/mmzone.h	2005-11-15 13:46:38.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/include/linux/mmzone.h	2005-11-17 16:27:33.121543136 -0800
@@ -430,6 +430,8 @@ int min_free_kbytes_sysctl_handler(struc
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
+int critical_pages_sysctl_handler(struct ctl_table *, int, struct file *,
+				  void __user *, size_t *, loff_t *);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6.15-rc1+critical_pool/include/linux/mm.h
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/include/linux/mm.h	2005-11-15 13:46:36.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/include/linux/mm.h	2005-11-17 16:27:33.125542528 -0800
@@ -32,6 +32,8 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+extern int critical_pages;
+
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>


* [RFC][PATCH 2/8] Create emergency trigger
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
  2005-11-18 19:36 ` [RFC][PATCH 1/8] Create " Matthew Dobson
@ 2005-11-18 19:36 ` Matthew Dobson
  2005-11-19  0:21   ` Paul Jackson
  2005-11-18 19:41 ` [RFC][PATCH 4/8] Fix a bug in scsi_get_command Matthew Dobson
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 40 bytes --]

Create the in_emergency trigger.

-Matt

[-- Attachment #2: emergency_trigger.patch --]
[-- Type: text/x-patch, Size: 4892 bytes --]

Create a userspace trigger: /proc/sys/vm/in_emergency that notifies the kernel
that the system is in an emergency state, and allows the kernel to delve into
the 'critical pool' to satisfy __GFP_CRITICAL allocations.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc1+critical_pool/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/Documentation/sysctl/vm.txt	2005-11-17 16:51:19.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/Documentation/sysctl/vm.txt	2005-11-17 16:51:20.000000000 -0800
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/
 - laptop_mode
 - block_dump
 - critical_pages
+- in_emergency
 
 ==============================================================
 
@@ -112,3 +113,12 @@ This is used to force the Linux VM to re
 emergency (__GFP_CRITICAL) allocations.  Allocations with this flag
 MUST succeed.
 The number written into this file is the number of pages to reserve.
+
+==============================================================
+
+in_emergency:
+
+This is used to let the Linux VM know that userspace thinks that the system is
+in an emergency situation.
+Writing a non-zero value into this file tells the VM we *are* in an emergency
+situation and writing zero tells the VM we *are not* in an emergency situation.
Index: linux-2.6.15-rc1+critical_pool/include/linux/sysctl.h
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/include/linux/sysctl.h	2005-11-17 16:51:19.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/include/linux/sysctl.h	2005-11-17 16:51:20.000000000 -0800
@@ -182,6 +182,7 @@ enum
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
 	VM_CRITICAL_PAGES=30,	/* # of pages to reserve for __GFP_CRITICAL allocs */
+	VM_IN_EMERGENCY=31,	/* tell the VM if we are/aren't in an emergency */
 };
 
 
Index: linux-2.6.15-rc1+critical_pool/kernel/sysctl.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/kernel/sysctl.c	2005-11-17 16:51:19.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/kernel/sysctl.c	2005-11-17 16:51:20.000000000 -0800
@@ -859,6 +859,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 	},
+	{
+		.ctl_name	= VM_IN_EMERGENCY,
+		.procname	= "in_emergency",
+		.data		= &system_in_emergency,
+		.maxlen		= sizeof(system_in_emergency),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.15-rc1+critical_pool/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/mm/page_alloc.c	2005-11-17 16:51:19.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/mm/page_alloc.c	2005-11-18 11:24:02.024254248 -0800
@@ -53,6 +53,9 @@ unsigned long totalram_pages __read_most
 unsigned long totalhigh_pages __read_mostly;
 long nr_swap_pages;
 
+/* Is the system in an emergency situation? */
+int system_in_emergency = 0;
+
 /* The number of pages to maintain in the critical page pool */
 int critical_pages = 0;
 
@@ -865,7 +868,7 @@ struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct zone **zones, *z;
 	struct page *page;
 	struct reclaim_state reclaim_state;
@@ -876,6 +879,16 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 	int can_try_harder;
 	int did_some_progress;
 
+	if (is_emergency_alloc(gfp_mask)) {
+		/*
+		 * If the system is 'in emergency' and this is a critical
+		 * allocation, then make sure we don't sleep
+		 */
+		gfp_mask &= ~__GFP_WAIT;
+		gfp_mask |= __GFP_NORECLAIM | __GFP_HIGH;
+		wait = 0;
+	}
+
 	might_sleep_if(wait);
 
 	/*
@@ -1053,7 +1066,7 @@ nopage:
 	 * Rather than fail one of these allocations, take a page (if any)
 	 * from the critical pool.
 	 */
-	if (gfp_mask & __GFP_CRITICAL) {
+	if (is_emergency_alloc(gfp_mask)) {
 		page = get_critical_page(gfp_mask);
 		if (page) {
 			z = page_zone(page);
Index: linux-2.6.15-rc1+critical_pool/include/linux/mm.h
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/include/linux/mm.h	2005-11-17 16:51:19.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/include/linux/mm.h	2005-11-17 16:51:20.000000000 -0800
@@ -33,6 +33,12 @@ extern int sysctl_legacy_va_layout;
 #endif
 
 extern int critical_pages;
+extern int system_in_emergency;
+
+static inline int is_emergency_alloc(gfp_t gfpmask)
+{
+	return system_in_emergency && (gfpmask & __GFP_CRITICAL);
+}
 
 #include <asm/page.h>
 #include <asm/pgtable.h>


* [RFC][PATCH 4/8] Fix a bug in scsi_get_command
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
  2005-11-18 19:36 ` [RFC][PATCH 1/8] Create " Matthew Dobson
  2005-11-18 19:36 ` [RFC][PATCH 2/8] Create emergency trigger Matthew Dobson
@ 2005-11-18 19:41 ` Matthew Dobson
  2005-11-18 19:43 ` [RFC][PATCH 5/8] get_object/return_object Matthew Dobson
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 104 bytes --]

Testing this patch series uncovered a small bug in scsi_get_command.  This
patch fixes that bug.

-Matt

[-- Attachment #2: scsi_get_command-fix.patch --]
[-- Type: text/x-patch, Size: 958 bytes --]

scsi_get_command() attempts to write into a structure that may not have been
successfully allocated.  Move this write inside the if statement that checks
the allocation succeeded, so we can't panic the kernel with a NULL pointer
dereference.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc1+critical_pool/drivers/scsi/scsi.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/drivers/scsi/scsi.c	2005-11-15 13:45:38.000000000 -0800
+++ linux-2.6.15-rc1+critical_pool/drivers/scsi/scsi.c	2005-11-17 16:49:54.279656112 -0800
@@ -265,10 +265,10 @@ struct scsi_cmnd *scsi_get_command(struc
 		spin_lock_irqsave(&dev->list_lock, flags);
 		list_add_tail(&cmd->list, &dev->cmd_list);
 		spin_unlock_irqrestore(&dev->list_lock, flags);
+		cmd->jiffies_at_alloc = jiffies;
 	} else
 		put_device(&dev->sdev_gendev);
 
-	cmd->jiffies_at_alloc = jiffies;
 	return cmd;
 }				
 EXPORT_SYMBOL(scsi_get_command);


* [RFC][PATCH 5/8] get_object/return_object
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (2 preceding siblings ...)
  2005-11-18 19:41 ` [RFC][PATCH 4/8] Fix a bug in scsi_get_command Matthew Dobson
@ 2005-11-18 19:43 ` Matthew Dobson
  2005-11-18 19:44 ` [RFC][PATCH 6/8] slab_destruct Matthew Dobson
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 88 bytes --]

Move the code that gets an object from, and returns an object to, its slab
into its own functions.

-Matt

[-- Attachment #2: slab_prep-get_return_object.patch --]
[-- Type: text/x-patch, Size: 3855 bytes --]

Create two helper functions: get_object_from_slab() & return_object_to_slab().
Use these two helper function to replace duplicated code in mm/slab.c

These functions will also be reused by a later patch in this series.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc1+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/mm/slab.c	2005-11-17 16:39:24.401412160 -0800
+++ linux-2.6.15-rc1+critical_pool/mm/slab.c	2005-11-17 16:45:07.337277984 -0800
@@ -2148,6 +2148,42 @@ static void kmem_flagcheck(kmem_cache_t 
 	}
 }
 
+static void *get_object(kmem_cache_t *cachep, struct slab *slabp, int nid)
+{
+	void *obj = slabp->s_mem + (slabp->free * cachep->objsize);
+	kmem_bufctl_t next;
+
+	slabp->inuse++;
+	next = slab_bufctl(slabp)[slabp->free];
+#if DEBUG
+	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
+	WARN_ON(slabp->nid != nid);
+#endif
+	slabp->free = next;
+
+	return obj;
+}
+
+static void return_object(kmem_cache_t *cachep, struct slab *slabp, void *objp,
+			  int nid)
+{
+	unsigned int objnr = (objp - slabp->s_mem) / cachep->objsize;
+
+#if DEBUG
+	/* Verify that the slab belongs to the intended node */
+	WARN_ON(slabp->nid != nid);
+
+	if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
+		printk(KERN_ERR "slab: double free detected in cache "
+		       "'%s', objp %p\n", cachep->name, objp);
+		BUG();
+	}
+#endif
+	slab_bufctl(slabp)[objnr] = slabp->free;
+	slabp->free = objnr;
+	slabp->inuse--;
+}
+
 static void set_slab_attr(kmem_cache_t *cachep, struct slab *slabp, void *objp)
 {
 	int i;
@@ -2436,22 +2472,12 @@ retry:
 		check_slabp(cachep, slabp);
 		check_spinlock_acquired(cachep);
 		while (slabp->inuse < cachep->num && batchcount--) {
-			kmem_bufctl_t next;
 			STATS_INC_ALLOCED(cachep);
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			/* get obj pointer */
-			ac->entry[ac->avail++] = slabp->s_mem +
-				slabp->free*cachep->objsize;
-
-			slabp->inuse++;
-			next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
-			slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
-			WARN_ON(numa_node_id() != slabp->nid);
-#endif
-			slabp->free = next;
+			ac->entry[ac->avail++] = get_object(cachep, slabp,
+							    numa_node_id());
 		}
 		check_slabp(cachep, slabp);
 
@@ -2586,7 +2612,6 @@ static void *__cache_alloc_node(kmem_cac
 	struct slab *slabp;
 	struct kmem_list3 *l3;
 	void *obj;
-	kmem_bufctl_t next;
 	int x;
 
 	l3 = cachep->nodelists[nid];
@@ -2612,14 +2637,7 @@ retry:
 
 	BUG_ON(slabp->inuse == cachep->num);
 
-	/* get obj pointer */
-	obj =  slabp->s_mem + slabp->free * cachep->objsize;
-	slabp->inuse++;
-	next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
-	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
-#endif
-	slabp->free = next;
+	obj = get_object(cachep, slabp, nid);
 	check_slabp(cachep, slabp);
 	l3->free_objects--;
 	/* move slabp to correct slabp list: */
@@ -2659,29 +2677,14 @@ static void free_block(kmem_cache_t *cac
 	for (i = 0; i < nr_objects; i++) {
 		void *objp = objpp[i];
 		struct slab *slabp;
-		unsigned int objnr;
 
 		slabp = GET_PAGE_SLAB(virt_to_page(objp));
 		l3 = cachep->nodelists[nid];
 		list_del(&slabp->list);
-		objnr = (objp - slabp->s_mem) / cachep->objsize;
 		check_spinlock_acquired_node(cachep, nid);
 		check_slabp(cachep, slabp);
-
-#if DEBUG
-		/* Verify that the slab belongs to the intended node */
-		WARN_ON(slabp->nid != nid);
-
-		if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
-			printk(KERN_ERR "slab: double free detected in cache "
-					"'%s', objp %p\n", cachep->name, objp);
-			BUG();
-		}
-#endif
-		slab_bufctl(slabp)[objnr] = slabp->free;
-		slabp->free = objnr;
+		return_object(cachep, slabp, objp, nid);
 		STATS_DEC_ACTIVE(cachep);
-		slabp->inuse--;
 		l3->free_objects++;
 		check_slabp(cachep, slabp);
 


* [RFC][PATCH 6/8] slab_destruct
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (3 preceding siblings ...)
  2005-11-18 19:43 ` [RFC][PATCH 5/8] get_object/return_object Matthew Dobson
@ 2005-11-18 19:44 ` Matthew Dobson
  2005-11-18 19:44 ` [RFC][PATCH 0/8] Critical Page Pool Avi Kivity
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 161 bytes --]

Break the current slab_destroy() into 2 functions: slab_destroy and
slab_destruct.  slab_destruct calls the destructor code and any necessary
debug code.

-Matt

[-- Attachment #2: slab_prep-slab_destruct.patch --]
[-- Type: text/x-patch, Size: 2082 bytes --]

Create a helper function, slab_destruct(), called from slab_destroy().  This
makes slab_destroy() smaller and more readable, and moves ifdefs outside the
function body.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.14+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.14+critical_pool.orig/mm/slab.c	2005-11-14 10:52:22.427207392 -0800
+++ linux-2.6.14+critical_pool/mm/slab.c	2005-11-14 10:52:27.514434016 -0800
@@ -1388,16 +1388,13 @@ static void check_poison_obj(kmem_cache_
 }
 #endif
 
-/*
- * Destroy all the objs in a slab, and release the mem back to the system.
- * Before calling the slab must have been unlinked from the cache.
- * The cache-lock is not held/needed.
+#if DEBUG
+/**
+ * slab_destruct - call the registered destructor for each object in
+ *      a slab that is to be destroyed.
  */
-static void slab_destroy(kmem_cache_t *cachep, struct slab *slabp)
+static void slab_destruct(kmem_cache_t *cachep, struct slab *slabp)
 {
-	void *addr = slabp->s_mem - slabp->colouroff;
-
-#if DEBUG
 	int i;
 	for (i = 0; i < cachep->num; i++) {
 		void *objp = slabp->s_mem + cachep->objsize * i;
@@ -1425,7 +1422,10 @@ static void slab_destroy(kmem_cache_t *c
 		if (cachep->dtor && !(cachep->flags & SLAB_POISON))
 			(cachep->dtor)(objp + obj_dbghead(cachep), cachep, 0);
 	}
+}
 #else
+static void slab_destruct(kmem_cache_t *cachep, struct slab *slabp)
+{
 	if (cachep->dtor) {
 		int i;
 		for (i = 0; i < cachep->num; i++) {
@@ -1433,8 +1433,19 @@ static void slab_destroy(kmem_cache_t *c
 			(cachep->dtor)(objp, cachep, 0);
 		}
 	}
+}
 #endif
 
+/**
+ * Destroy all the objs in a slab, and release the mem back to the system.
+ * Before calling the slab must have been unlinked from the cache.
+ * The cache-lock is not held/needed.
+ */
+static void slab_destroy(kmem_cache_t *cachep, struct slab *slabp)
+{
+	void *addr = slabp->s_mem - slabp->colouroff;
+
+	slab_destruct(cachep, slabp);
 	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU)) {
 		struct slab_rcu *slab_rcu;
 


* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (4 preceding siblings ...)
  2005-11-18 19:44 ` [RFC][PATCH 6/8] slab_destruct Matthew Dobson
@ 2005-11-18 19:44 ` Avi Kivity
  2005-11-18 19:51   ` Matthew Dobson
  2005-11-18 19:45 ` [RFC][PATCH 7/8] __cache_grow() Matthew Dobson
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2005-11-18 19:44 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, Linux Memory Management

Matthew Dobson wrote:

>We have a clustering product that needs to be able to guarantee that the
>networking system won't stop functioning in the case of OOM/low memory
>condition.  The current mempool system is inadequate because to keep the
>whole networking stack functioning, we need more than 1 or 2 slab caches to
>be guaranteed.  We need to guarantee that any request made with a specific
>flag will succeed, assuming of course that you've made your "critical page
>pool" big enough.
>
>The following patch series implements such a critical page pool.  It
>creates 2 userspace triggers:
>
>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>for the critical pool into this file
>
>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>the system is in an emergency state and authorize the kernel to dip into
>the critical pool to satisfy critical allocations.
>
>We mark critical allocations with the __GFP_CRITICAL flag, and when the
>system is in an emergency state, we are allowed to delve into this pool to
>satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
>normal means.
>
>  
>
1. If you have two subsystems which allocate critical pages, how do you 
protect against the condition where one subsystem allocates all the 
critical memory, causing the second to oom?

2. There already exists a critical pool: ordinary allocations fail if 
free memory is below some limit, but special processes (kswapd) can 
allocate that memory by setting PF_MEMALLOC. Perhaps this should be 
extended, possibly with a per-process threshold.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* [RFC][PATCH 7/8] __cache_grow()
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (5 preceding siblings ...)
  2005-11-18 19:44 ` [RFC][PATCH 0/8] Critical Page Pool Avi Kivity
@ 2005-11-18 19:45 ` Matthew Dobson
  2005-11-18 19:47 ` [RFC][PATCH 8/8] Add support critical pool support to the slab allocator Matthew Dobson
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 125 bytes --]

Create a helper for cache_grow() that handles the cache coloring and the
allocation and initialization of the struct slab.

-Matt

[-- Attachment #2: slab_prep-__cache_grow.patch --]
[-- Type: text/x-patch, Size: 6001 bytes --]

Create a helper function, __cache_grow(), called by cache_grow().  This allows
us to move the cache coloring and struct slab allocation & initialization to
its own discrete function.

Also, move both functions below some debugging function definitions, so that
the next patch can use them there without needing forward declarations.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc1+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/mm/slab.c	2005-11-17 16:45:09.979876248 -0800
+++ linux-2.6.15-rc1+critical_pool/mm/slab.c	2005-11-17 16:49:45.118048888 -0800
@@ -2209,95 +2209,6 @@ static void set_slab_attr(kmem_cache_t *
 	}
 }
 
-/*
- * Grow (by 1) the number of slabs within a cache.  This is called by
- * kmem_cache_alloc() when there are no active objs left in a cache.
- */
-static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nid)
-{
-	struct slab *slabp;
-	void *objp;
-	size_t offset;
-	gfp_t local_flags;
-	unsigned long ctor_flags;
-	struct kmem_list3 *l3;
-
-	/*
-	 * Be lazy and only check for valid flags here,
-	 * keeping it out of the critical path in kmem_cache_alloc().
-	 */
-	if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
-		BUG();
-	if (flags & SLAB_NO_GROW)
-		return 0;
-
-	ctor_flags = SLAB_CTOR_CONSTRUCTOR;
-	local_flags = (flags & SLAB_LEVEL_MASK);
-	if (!(local_flags & __GFP_WAIT))
-		/*
-		 * Not allowed to sleep.  Need to tell a constructor about
-		 * this - it might need to know...
-		 */
-		ctor_flags |= SLAB_CTOR_ATOMIC;
-
-	/* About to mess with non-constant members - lock. */
-	check_irq_off();
-	spin_lock(&cachep->spinlock);
-
-	/* Get colour for the slab, and cal the next value. */
-	offset = cachep->colour_next;
-	cachep->colour_next++;
-	if (cachep->colour_next >= cachep->colour)
-		cachep->colour_next = 0;
-	offset *= cachep->colour_off;
-
-	spin_unlock(&cachep->spinlock);
-
-	check_irq_off();
-	if (local_flags & __GFP_WAIT)
-		local_irq_enable();
-
-	/*
-	 * Ensure caller isn't asking for DMA memory if the slab wasn't created
-	 * with the SLAB_DMA flag.
-	 * Also ensure the caller *is* asking for DMA memory if the slab was
-	 * created with the SLAB_DMA flag.
-	 */
-	kmem_flagcheck(cachep, flags);
-
-	/* Get mem for the objects by allocating a physical page from 'nid' */
-	if (!(objp = kmem_getpages(cachep, flags, nid)))
-		goto out_nomem;
-
-	/* Get slab management. */
-	if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
-		goto out_freepages;
-
-	slabp->nid = nid;
-	set_slab_attr(cachep, slabp, objp);
-
-	cache_init_objs(cachep, slabp, ctor_flags);
-
-	if (local_flags & __GFP_WAIT)
-		local_irq_disable();
-	check_irq_off();
-	l3 = cachep->nodelists[nid];
-	spin_lock(&l3->list_lock);
-
-	/* Make slab active. */
-	list_add_tail(&slabp->list, &(l3->slabs_free));
-	STATS_INC_GROWN(cachep);
-	l3->free_objects += cachep->num;
-	spin_unlock(&l3->list_lock);
-	return 1;
-out_freepages:
-	kmem_freepages(cachep, objp);
-out_nomem:
-	if (local_flags & __GFP_WAIT)
-		local_irq_disable();
-	return 0;
-}
-
 #if DEBUG
 /*
  * Perform extra freeing checks:
@@ -2430,6 +2341,105 @@ bad:
 #define check_slabp(x,y)			do { } while(0)
 #endif
 
+/**
+ * Helper function for cache_grow().  Handle cache coloring, allocating a
+ * struct slab and initializing the slab.
+ */
+static struct slab *__cache_grow(kmem_cache_t *cachep, void *objp, gfp_t flags)
+{
+	struct slab *slabp;
+	size_t offset;
+	unsigned int local_flags;
+	unsigned long ctor_flags;
+
+	ctor_flags = SLAB_CTOR_CONSTRUCTOR;
+	local_flags = (flags & SLAB_LEVEL_MASK);
+	if (!(local_flags & __GFP_WAIT))
+		/*
+		 * Not allowed to sleep.  Need to tell a constructor about
+		 * this - it might need to know...
+		 */
+		ctor_flags |= SLAB_CTOR_ATOMIC;
+
+	/* About to mess with non-constant members - lock. */
+	check_irq_off();
+	spin_lock(&cachep->spinlock);
+
+	/* Get colour for the slab, and cal the next value. */
+	offset = cachep->colour_next;
+	cachep->colour_next++;
+	if (cachep->colour_next >= cachep->colour)
+		cachep->colour_next = 0;
+	offset *= cachep->colour_off;
+
+	spin_unlock(&cachep->spinlock);
+
+	check_irq_off();
+	if (local_flags & __GFP_WAIT)
+		local_irq_enable();
+
+	/* Get slab management. */
+	if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
+		goto out;
+
+	set_slab_attr(cachep, slabp, objp);
+	cache_init_objs(cachep, slabp, ctor_flags);
+
+out:
+	if (local_flags & __GFP_WAIT)
+		local_irq_disable();
+	check_irq_off();
+	return slabp;
+}
+
+/**
+ * Grow (by 1) the number of slabs within a cache.  This is called by
+ * kmem_cache_alloc() when there are no active objs left in a cache.
+ */
+static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nid)
+{
+	struct slab *slabp = NULL;
+	void *objp = NULL;
+
+	/*
+	 * Be lazy and only check for valid flags here,
+ 	 * keeping it out of the critical path in kmem_cache_alloc().
+	 */
+	if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
+		BUG();
+	if (flags & SLAB_NO_GROW)
+		goto out;
+
+	/*
+	 * Ensure caller isn't asking for DMA memory if the slab wasn't created
+	 * with the SLAB_DMA flag.
+	 * Also ensure the caller *is* asking for DMA memory if the slab was
+	 * created with the SLAB_DMA flag.
+	 */
+	kmem_flagcheck(cachep, flags);
+
+	/* Get mem for the objects by allocating a physical page from 'nid' */
+	if ((objp = kmem_getpages(cachep, flags, nid))) {
+		struct kmem_list3 *l3 = cachep->nodelists[nid];
+
+		if (!(slabp = __cache_grow(cachep, objp, flags))) {
+			kmem_freepages(cachep, objp);
+			objp = NULL;
+			goto out;
+		}
+		slabp->nid = nid;
+
+		STATS_INC_GROWN(cachep);
+		/* Make slab active. */
+		spin_lock(&l3->list_lock);
+		list_add_tail(&slabp->list, &l3->slabs_free);
+		l3->free_objects += cachep->num;
+		spin_unlock(&l3->list_lock);
+	}
+out:
+	return objp != NULL;
+}
+
 static void *cache_alloc_refill(kmem_cache_t *cachep, gfp_t flags)
 {
 	int batchcount;

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH 8/8] Add critical pool support to the slab allocator
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (6 preceding siblings ...)
  2005-11-18 19:45 ` [RFC][PATCH 7/8] __cache_grow() Matthew Dobson
@ 2005-11-18 19:47 ` Matthew Dobson
  2005-11-18 19:56 ` [RFC][PATCH 0/8] Critical Page Pool Chris Wright
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 145 bytes --]

Finally, teach the slab allocator how to deal with critical pages and how
to keep them for use exclusively by __GFP_CRITICAL allocations.

-Matt

[-- Attachment #2: slab_support.patch --]
[-- Type: text/x-patch, Size: 6638 bytes --]

Modify the Slab Allocator to support the addition of a Critical Pool to the VM.
We want to ensure that if a cache is allocated a new slab page from the
Critical Pool during an emergency situation, only other __GFP_CRITICAL
allocations are satisfied from that slab.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc1+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.15-rc1+critical_pool.orig/mm/slab.c	2005-11-17 16:51:22.965173864 -0800
+++ linux-2.6.15-rc1+critical_pool/mm/slab.c	2005-11-17 17:22:03.056437472 -0800
@@ -220,6 +220,7 @@ struct slab {
 	unsigned long		colouroff;
 	void			*s_mem;		/* including colour offset */
 	unsigned int		inuse;		/* # of objs active in slab */
+	unsigned short		critical;	/* is this an critical slab? */
 	kmem_bufctl_t		free;
 	unsigned short          nid;		/* node number slab is on */
 };
@@ -395,6 +396,9 @@ struct kmem_cache {
 	unsigned int		slab_size;
 	unsigned int		dflags;		/* dynamic flags */
 
+	/* list of critical slabs for this cache */
+	struct list_head	slabs_crit;
+
 	/* constructor func */
 	void (*ctor)(void *, kmem_cache_t *, unsigned long);
 
@@ -1770,6 +1774,7 @@ kmem_cache_t *kmem_cache_create(const ch
 		cachep->gfpflags |= GFP_DMA;
 	spin_lock_init(&cachep->spinlock);
 	cachep->objsize = size;
+	INIT_LIST_HEAD(&cachep->slabs_crit);
 
 	if (flags & CFLGS_OFF_SLAB)
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
@@ -2090,6 +2095,7 @@ static struct slab *alloc_slabmgmt(kmem_
 	slabp->inuse = 0;
 	slabp->colouroff = colour_off;
 	slabp->s_mem = objp + colour_off;
+	slabp->critical = 0;
 
 	return slabp;
 }
@@ -2182,7 +2188,8 @@ static void return_object(kmem_cache_t *
 
 #if DEBUG
 	/* Verify that the slab belongs to the intended node */
-	WARN_ON(slabp->nid != nid);
+	if (nid >= 0)
+		WARN_ON(slabp->nid != nid);
 
 	if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
 		printk(KERN_ERR "slab: double free detected in cache "
@@ -2341,6 +2348,24 @@ bad:
 #define check_slabp(x,y)			do { } while(0)
 #endif
 
+static inline struct slab *get_critical_slab(kmem_cache_t *cachep, gfp_t flags)
+{
+	struct slab *slabp = NULL;
+
+	spin_lock(&cachep->spinlock);
+	/* search for any partially free critical slabs */
+	if (!list_empty(&cachep->slabs_crit)) {
+		list_for_each_entry(slabp, &cachep->slabs_crit, list)
+			if (slabp->free != BUFCTL_END)
+				goto found;
+		slabp = NULL;
+	}
+found:
+	spin_unlock(&cachep->spinlock);
+
+	return slabp;
+}
+
 /**
  * Helper function for cache_grow().  Handle cache coloring, allocating a
  * struct slab and initializing the slab.
@@ -2396,10 +2421,11 @@ out:
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
-static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nid)
+static void *cache_grow(kmem_cache_t *cachep, gfp_t flags, int nid)
 {
 	struct slab *slabp = NULL;
 	void *objp = NULL;
+	int critical = is_emergency_alloc(flags);
 
 	/*
 	 * Be lazy and only check for valid flags here,
@@ -2411,6 +2437,13 @@ static int cache_grow(kmem_cache_t *cach
 		goto out;
 
 	/*
+	 * We are in an emergency situation and this is a 'critical' alloc,
+	 * so check if we've got an existing critical slab first
+	 */
+	if (critical && (slabp = get_critical_slab(cachep, flags)))
+		goto got_critical_slab;
+
+	/*
 	 * Ensure caller isn't asking for DMA memory if the slab wasn't created
 	 * with the SLAB_DMA flag.
 	 * Also ensure the caller *is* asking for DMA memory if the slab was
@@ -2431,13 +2464,34 @@ static int cache_grow(kmem_cache_t *cach
 
 		STATS_INC_GROWN(cachep);
 		/* Make slab active. */
-		spin_lock(&l3->list_lock);
-		list_add_tail(&slabp->list, &l3->slabs_free);
-		l3->free_objects += cachep->num;
-		spin_unlock(&l3->list_lock);
+		if (!critical) {
+			spin_lock(&l3->list_lock);
+			list_add_tail(&slabp->list, &l3->slabs_free);
+			l3->free_objects += cachep->num;
+			spin_unlock(&l3->list_lock);
+		} else {
+			spin_lock(&cachep->spinlock);
+			list_add_tail(&slabp->list, &cachep->slabs_crit);
+			slabp->critical = 1;
+			spin_unlock(&cachep->spinlock);
+got_critical_slab:
+			objp = get_object(cachep, slabp, nid);
+			check_slabp(cachep, slabp);
+		}
 	}
 out:
-	return objp != NULL;
+	return objp;
+}
+
+static inline int is_critical_object(void *obj)
+{
+	struct slab *slabp;
+
+	if (!obj)
+		return 0;
+
+	slabp = GET_PAGE_SLAB(virt_to_page(obj));
+	return slabp->critical;
 }
 
 static void *cache_alloc_refill(kmem_cache_t *cachep, gfp_t flags)
@@ -2516,12 +2570,15 @@ alloc_done:
 	spin_unlock(&l3->list_lock);
 
 	if (unlikely(!ac->avail)) {
-		int x;
-		x = cache_grow(cachep, flags, numa_node_id());
+		void *obj = cache_grow(cachep, flags, numa_node_id());
+
+		/* critical objects don't "grow" the slab, just return 'obj' */
+		if (is_critical_object(obj))
+			return obj;
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = ac_data(cachep);
-		if (!x && ac->avail == 0) /* no objects in sight? abort      */
+		if (!obj && ac->avail == 0) /* No objects in sight?  Abort.  */
 			return NULL;
 
 		if (!ac->avail)		  /* objects refilled by interrupt?  */
@@ -2633,7 +2690,6 @@ static void *__cache_alloc_node(kmem_cac
 	struct slab *slabp;
 	struct kmem_list3 *l3;
 	void *obj;
-	int x;
 
 	l3 = cachep->nodelists[nid];
 	BUG_ON(!l3);
@@ -2675,11 +2731,15 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
-	x = cache_grow(cachep, flags, nid);
+	obj = cache_grow(cachep, flags, nid);
 
-	if (!x)
+	if (!obj)
 		return NULL;
 
+	/* critical objects don't "grow" the slab, just return 'obj' */
+	if (is_critical_object(obj))
+		goto done;
+
 	goto retry;
 done:
 	return obj;
@@ -2780,6 +2840,22 @@ free_done:
 		sizeof(void *) * ac->avail);
 }
 
+static inline void free_critical_object(kmem_cache_t *cachep, void *objp)
+{
+	struct slab *slabp = GET_PAGE_SLAB(virt_to_page(objp));
+
+	check_slabp(cachep, slabp);
+	return_object(cachep, slabp, objp, -1);
+	check_slabp(cachep, slabp);
+
+	if (slabp->inuse == 0) {
+		BUG_ON(cachep->flags & SLAB_DESTROY_BY_RCU);
+		BUG_ON(cachep->gfporder);
+
+		list_del(&slabp->list);
+		slab_destroy(cachep, slabp);
+	}
+}
 
 /**
  * __cache_free
@@ -2795,6 +2871,11 @@ static inline void __cache_free(kmem_cac
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
+	if (is_critical_object(objp)) {
+		free_critical_object(cachep, objp);
+		return;
+	}
+
 	/*
 	 * Make sure we are not freeing a object from another
 	 * node to the array cache on this cpu.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 19:44 ` [RFC][PATCH 0/8] Critical Page Pool Avi Kivity
@ 2005-11-18 19:51   ` Matthew Dobson
  2005-11-18 20:42     ` Avi Kivity
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Dobson @ 2005-11-18 19:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, Linux Memory Management

Avi Kivity wrote:
> Matthew Dobson wrote:
> 
>> We have a clustering product that needs to be able to guarantee that the
>> networking system won't stop functioning in the case of OOM/low memory
>> condition.  The current mempool system is inadequate because to keep the
>> whole networking stack functioning, we need more than 1 or 2 slab
>> caches to
>> be guaranteed.  We need to guarantee that any request made with a
>> specific
>> flag will succeed, assuming of course that you've made your "critical
>> page
>> pool" big enough.
>>
>> The following patch series implements such a critical page pool.  It
>> creates 2 userspace triggers:
>>
>> /proc/sys/vm/critical_pages: write the number of pages you want to
>> reserve
>> for the critical pool into this file
>>
>> /proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>> the system is in an emergency state and authorize the kernel to dip into
>> the critical pool to satisfy critical allocations.
>>
>> We mark critical allocations with the __GFP_CRITICAL flag, and when the
>> system is in an emergency state, we are allowed to delve into this
>> pool to
>> satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
>> normal means.
>>
>>  
>>
> 1. If you have two subsystems which allocate critical pages, how do you
> protect against the condition where one subsystem allocates all the
> critical memory, causing the second to oom?

You don't.  You make sure that you size the critical pool appropriately for
your workload.


> 2. There already exists a critical pool: ordinary allocations fail if
> free memory is below some limit, but special processes (kswapd) can
> allocate that memory by setting PF_MEMALLOC. Perhaps this should be
> extended, possibly with a per-process threshold.

The exception for threads with PF_MEMALLOC set is there because those
threads are essentially promising that if the kernel gives them memory,
they will use that memory to free up MORE memory.  If we ignore that
promise, and (ab)use the PF_MEMALLOC flag to simply bypass the
zone_watermarks, we'll simply OOM faster, and potentially in situations
that could be avoided (ie: we steal memory that kswapd could have used to
free up more memory).

Thanks for your feedback!

-Matt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (7 preceding siblings ...)
  2005-11-18 19:47 ` [RFC][PATCH 8/8] Add critical pool support to the slab allocator Matthew Dobson
@ 2005-11-18 19:56 ` Chris Wright
  2005-11-21  5:47   ` Matthew Dobson
  2005-11-20  7:45 ` Keith Owens
  2005-11-20 23:04 ` Pavel Machek
  10 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2005-11-18 19:56 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, Linux Memory Management

* Matthew Dobson (colpatch@us.ibm.com) wrote:
> /proc/sys/vm/critical_pages: write the number of pages you want to reserve
> for the critical pool into this file

How do you size this pool?  Allocations are interrupt driven, so how do you
ensure you're allocating for the cluster network traffic you care about?

> /proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
> the system is in an emergency state and authorize the kernel to dip into
> the critical pool to satisfy critical allocations.

Seems odd to me.  Why make this another knob?  How do you get to run and set
this flag if you're in an emergency and kswapd is going nuts?

thanks,
-chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 19:51   ` Matthew Dobson
@ 2005-11-18 20:42     ` Avi Kivity
  2005-11-19  0:10       ` Paul Jackson
  2005-11-21  5:36       ` Matthew Dobson
  0 siblings, 2 replies; 27+ messages in thread
From: Avi Kivity @ 2005-11-18 20:42 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, Linux Memory Management

Matthew Dobson wrote:

>Avi Kivity wrote:
>  
>
>>1. If you have two subsystems which allocate critical pages, how do you
>>protect against the condition where one subsystem allocates all the
>>critical memory, causing the second to oom?
>>    
>>
>
>You don't.  You make sure that you size the critical pool appropriately for
>your workload.
>
>  
>
This may not be possible. What if subsystem A depends on subsystem B to 
do its work, both are critical, and subsystem A allocated all the memory 
reserve?
If A and B have different allocation thresholds, the deadlock is avoided.

At the very least you need a critical pool per subsystem.

>  
>
>>2. There already exists a critical pool: ordinary allocations fail if
>>free memory is below some limit, but special processes (kswapd) can
>>allocate that memory by setting PF_MEMALLOC. Perhaps this should be
>>extended, possibly with a per-process threshold.
>>    
>>
>
>The exception for threads with PF_MEMALLOC set is there because those
>threads are essentially promising that if the kernel gives them memory,
>they will use that memory to free up MORE memory.  If we ignore that
>promise, and (ab)use the PF_MEMALLOC flag to simply bypass the
>zone_watermarks, we'll simply OOM faster, and potentially in situations
>that could be avoided (ie: we steal memory that kswapd could have used to
>free up more memory).
>  
>
Sure, but that's just an example of a critical subsystem.

If we introduce yet another mechanism for critical memory allocation, 
we'll have a hard time making different subsystems, which use different 
critical allocation mechanisms, play well together.

I propose that instead of a single watermark, there should be a 
watermark per critical subsystem. The watermarks would be arranged 
according to the dependency graph, with the depended-on services allowed 
to go the deepest into the reserves.

(instead of PF_MEMALLOC have a tsk->memory_allocation_threshold, or 
similar. set it to 0 for kswapd, and for other systems according to taste)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/8] Create Critical Page Pool
  2005-11-18 19:36 ` [RFC][PATCH 1/8] Create " Matthew Dobson
@ 2005-11-19  0:08   ` Paul Jackson
  2005-11-21  5:50     ` Matthew Dobson
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Jackson @ 2005-11-19  0:08 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, linux-mm

Total nit:

 #define __GFP_HARDWALL   ((__force gfp_t)0x40000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_CRITICAL	((__force gfp_t)0x80000u) /* Critical allocation. MUST succeed! */

Looks like you used a space instead of a tab.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 20:42     ` Avi Kivity
@ 2005-11-19  0:10       ` Paul Jackson
  2005-11-21  5:36       ` Matthew Dobson
  1 sibling, 0 replies; 27+ messages in thread
From: Paul Jackson @ 2005-11-19  0:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: colpatch, linux-kernel, linux-mm

Avi wrote:
> This may not be possible. What if subsystem A depends on subsystem B to 
> do its work, both are critical, and subsystem A allocated all the memory 
> reserve?

Apparently Matthew's subsystems have some knowable upper limits on
their critical memory needs, so that your scenario can be avoided.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 2/8] Create emergency trigger
  2005-11-18 19:36 ` [RFC][PATCH 2/8] Create emergency trigger Matthew Dobson
@ 2005-11-19  0:21   ` Paul Jackson
  2005-11-21  5:51     ` Matthew Dobson
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Jackson @ 2005-11-19  0:21 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, linux-mm

> @@ -876,6 +879,16 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
>  	int can_try_harder;
>  	int did_some_progress;
>  
> +	if (is_emergency_alloc(gfp_mask)) {

Can this check for is_emergency_alloc be moved lower in __alloc_pages?

I don't see any reason why most __alloc_pages() calls, that succeed
easily in the first loop over the zonelist, have to make this check.
This would save one conditional test and jump on the most heavily
used code path in __alloc_pages().

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (8 preceding siblings ...)
  2005-11-18 19:56 ` [RFC][PATCH 0/8] Critical Page Pool Chris Wright
@ 2005-11-20  7:45 ` Keith Owens
  2005-11-21  5:53   ` Matthew Dobson
  2005-11-20 23:04 ` Pavel Machek
  10 siblings, 1 reply; 27+ messages in thread
From: Keith Owens @ 2005-11-20  7:45 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, Linux Memory Management

On Fri, 18 Nov 2005 11:32:57 -0800, 
Matthew Dobson <colpatch@us.ibm.com> wrote:
>We have a clustering product that needs to be able to guarantee that the
>networking system won't stop functioning in the case of OOM/low memory
>condition.  The current mempool system is inadequate because to keep the
>whole networking stack functioning, we need more than 1 or 2 slab caches to
>be guaranteed.  We need to guarantee that any request made with a specific
>flag will succeed, assuming of course that you've made your "critical page
>pool" big enough.
>
>The following patch series implements such a critical page pool.  It
>creates 2 userspace triggers:
>
>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>for the critical pool into this file
>
>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>the system is in an emergency state and authorize the kernel to dip into
>the critical pool to satisfy critical allocations.

FWIW, the Kernel Debugger (KDB) has similar problems where the system
is dying because of lack of memory but KDB must call some functions
that use kmalloc.  A related problem is that sometimes KDB is invoked
from a non maskable interrupt, so I could not even trust the state of
the spinlocks and the chains in the slab code.

I worked around the problem by adding a last ditch allocator.  Extract
from the kdb patch.

---

/* Last ditch allocator for debugging, so we can still debug even when the
 * GFP_ATOMIC pool has been exhausted.  The algorithms are tuned for space
 * usage, not for speed.  One smallish memory pool, the free chain is always in
 * ascending address order to allow coalescing, allocations are done in brute
 * force best fit.
 */

struct debug_alloc_header {
	u32 next;	/* offset of next header from start of pool */
	u32 size;
};
#define dah_align 8

static u64 debug_alloc_pool_aligned[64*1024/dah_align];	/* 64K pool */
static char *debug_alloc_pool = (char *)debug_alloc_pool_aligned;
static u32 dah_first;

/* Locking is awkward.  The debug code is called from all contexts, including
 * non maskable interrupts.  A normal spinlock is not safe in NMI context.  Try
 * to get the debug allocator lock, if it cannot be obtained after a second
 * then give up.  If the lock could not be previously obtained on this cpu then
 * only try once.
 */
static DEFINE_SPINLOCK(dap_lock);
static inline
int get_dap_lock(void)
{
	static int dap_locked = -1;
	int count;
	if (dap_locked == smp_processor_id())
		count = 1;
	else
		count = 1000;
	while (1) {
		if (spin_trylock(&dap_lock)) {
			dap_locked = -1;
			return 1;
		}
		if (!count--)
			break;
		udelay(1000);
	}
	dap_locked = smp_processor_id();
	return 0;
}

void *debug_kmalloc(size_t size, int flags)
{
	unsigned int rem, h_offset;
	struct debug_alloc_header *best, *bestprev, *prev, *h;
	void *p = NULL;
	if ((p = kmalloc(size, flags)))
		return p;
	if (!get_dap_lock())
		return NULL;
	h = (struct debug_alloc_header *)(debug_alloc_pool + dah_first);
	prev = best = bestprev = NULL;
	while (1) {
		if (h->size >= size && (!best || h->size < best->size)) {
			best = h;
			bestprev = prev;
		}
		if (!h->next)
			break;
		prev = h;
		h = (struct debug_alloc_header *)(debug_alloc_pool + h->next);
	}
	if (!best)
		goto out;
	rem = (best->size - size) & -dah_align;
	/* The pool must always contain at least one header */
	if (best->next == 0 && bestprev == NULL && rem < sizeof(*h))
		goto out;
	if (rem >= sizeof(*h)) {
		best->size = (size + dah_align - 1) & -dah_align;
		h_offset = (char *)best - debug_alloc_pool + sizeof(*best) + best->size;
		h = (struct debug_alloc_header *)(debug_alloc_pool + h_offset);
		h->size = rem - sizeof(*h);
		h->next = best->next;
	} else
		h_offset = best->next;
	if (bestprev)
		bestprev->next = h_offset;
	else
		dah_first = h_offset;
	p = best+1;
out:
	spin_unlock(&dap_lock);
	return p;
}

void debug_kfree(const void *p)
{
	struct debug_alloc_header *h;
	unsigned int h_offset;
	if (!p)
		return;
	if ((char *)p < debug_alloc_pool ||
	    (char *)p >= debug_alloc_pool + sizeof(debug_alloc_pool_aligned)) {
		kfree(p);
		return;
	}
	if (!get_dap_lock())
		return;		/* memory leak, cannot be helped */
	h = (struct debug_alloc_header *)p - 1;
	h_offset = (char *)h - debug_alloc_pool;
	if (h_offset < dah_first) {
		h->next = dah_first;
		dah_first = h_offset;
	} else {
		struct debug_alloc_header *prev;
		prev = (struct debug_alloc_header *)(debug_alloc_pool + dah_first);
		while (1) {
			if (!prev->next || prev->next > h_offset)
				break;
			prev = (struct debug_alloc_header *)(debug_alloc_pool + prev->next);
		}
		if (sizeof(*prev) + prev->size == h_offset) {
			prev->size += sizeof(*h) + h->size;
			h = prev;
			h_offset = (char *)h - debug_alloc_pool;
		} else {
			h->next = prev->next;
			prev->next = h_offset;
		}
	}
	if (h_offset + sizeof(*h) + h->size == h->next) {
		struct debug_alloc_header *next;
		next = (struct debug_alloc_header *)(debug_alloc_pool + h->next);
		h->size += sizeof(*next) + next->size;
		h->next = next->next;
	}
	spin_unlock(&dap_lock);
}

void kdb_initsupport()
{
	struct debug_alloc_header *h;
	h = (struct debug_alloc_header *)debug_alloc_pool;
	h->next = 0;
	h->size = sizeof(debug_alloc_pool_aligned) - sizeof(*h);
	dah_first = 0;
}


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
                   ` (9 preceding siblings ...)
  2005-11-20  7:45 ` Keith Owens
@ 2005-11-20 23:04 ` Pavel Machek
  2005-11-21  5:58   ` Matthew Dobson
  10 siblings, 1 reply; 27+ messages in thread
From: Pavel Machek @ 2005-11-20 23:04 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, Linux Memory Management

On Fri 18-11-05 11:32:57, Matthew Dobson wrote:
> We have a clustering product that needs to be able to guarantee that the
> networking system won't stop functioning in the case of OOM/low memory
> condition.  The current mempool system is inadequate because to keep the
> whole networking stack functioning, we need more than 1 or 2 slab caches to
> be guaranteed.  We need to guarantee that any request made with a specific
> flag will succeed, assuming of course that you've made your "critical page
> pool" big enough.
> 
> The following patch series implements such a critical page pool.  It
> creates 2 userspace triggers:
> 
> /proc/sys/vm/critical_pages: write the number of pages you want to reserve
> for the critical pool into this file
> 
> /proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
> the system is in an emergency state and authorize the kernel to dip into
> the critical pool to satisfy critical allocations.
> 
> We mark critical allocations with the __GFP_CRITICAL flag, and when the
> system is in an emergency state, we are allowed to delve into this pool to
> satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
> normal means.

Ugh, relying on userspace to tell you that you need to dip into the emergency
pool seems racy and unreliable.  How can you guarantee that userspace
is scheduled soon enough in case of OOM?
							Pavel
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 20:42     ` Avi Kivity
  2005-11-19  0:10       ` Paul Jackson
@ 2005-11-21  5:36       ` Matthew Dobson
  1 sibling, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-21  5:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, Linux Memory Management

Apologies for the delay in responding to comments, but I have been en
route to the East Coast of the US to visit family.

Avi Kivity wrote:
> Matthew Dobson wrote:
> 
>> Avi Kivity wrote:
>>  
>>
>>> 1. If you have two subsystems which allocate critical pages, how do you
>>> protect against the condition where one subsystem allocates all the
>>> critical memory, causing the second to oom?
>>>   
>>
>>
>> You don't.  You make sure that you size the critical pool
>> appropriately for
>> your workload.
>>
>>  
>>
> This may not be possible. What if subsystem A depends on subsystem B to
> do its work, both are critical, and subsystem A allocated all the memory
> reserve?
> If A and B have different allocation thresholds, the deadlock is avoided.
> 
> At the very least you need a critical pool per subsystem.

As Paul suggested in his follow up to your mail, to even attempt a
"guarantee" that you won't still run out of memory, your subsystem does
need an upper bound on how much memory it could possibly need.  If there is
NO upper limit, then the possibility of exhausting your critical pool is
very real.


>>> 2. There already exists a critical pool: ordinary allocations fail if
>>> free memory is below some limit, but special processes (kswapd) can
>>> allocate that memory by setting PF_MEMALLOC. Perhaps this should be
>>> extended, possibly with a per-process threshold.
>>>   
>>
>>
>> The exception for threads with PF_MEMALLOC set is there because those
>> threads are essentially promising that if the kernel gives them memory,
>> they will use that memory to free up MORE memory.  If we ignore that
>> promise, and (ab)use the PF_MEMALLOC flag to simply bypass the
>> zone_watermarks, we'll simply OOM faster, and potentially in situations
>> that could be avoided (ie: we steal memory that kswapd could have used to
>> free up more memory).
>>  
>>
> Sure, but that's just an example of a critical subsystem.
> 
> If we introduce yet another mechanism for critical memory allocation,
> we'll have a hard time making different subsystems, which use different
> critical allocation mechanisms, play well together.
> 
> I propose that instead of a single watermark, there should be a
> watermark per critical subsystem. The watermarks would be arranged
> according to the dependency graph, with the depended-on services allowed
> to go the deepest into the reserves.
> 
> (instead of PF_MEMALLOC have a tsk->memory_allocation_threshold, or
> similar. set it to 0 for kswapd, and for other systems according to taste)

Your idea is certainly an interesting approach to solving the problem.  I'm
not sure it quite does what I'm looking for, but I'll have to think about it
some more to be sure.  One problem is that networking doesn't have a
specific "task" associated with it, where we could set a
memory_allocation_threshold.

Thanks!

-Matt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-18 19:56 ` [RFC][PATCH 0/8] Critical Page Pool Chris Wright
@ 2005-11-21  5:47   ` Matthew Dobson
  2005-11-21 13:29     ` Pavel Machek
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Dobson @ 2005-11-21  5:47 UTC (permalink / raw)
  To: Chris Wright; +Cc: linux-kernel, Linux Memory Management

Chris Wright wrote:
> * Matthew Dobson (colpatch@us.ibm.com) wrote:
> 
>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>for the critical pool into this file
> 
> 
> How do you size this pool?

Trial and error.  If you want networking to survive with no memory other
than the critical pool for 2 minutes, for example, you pick a random value,
block all other allocations (I have a test patch to do this), and send a
boatload of packets at the box.  If it OOMs, you need a bigger pool.
Lather, rinse, repeat.


> Allocations are interrupt driven, so how to you
> ensure you're allocating for the cluster network traffic you care about?

On the receive side, you can't. :(  You *have* to allocate an skbuff for
the packet, and only a couple levels up the networking 7-layer burrito can
you tell if you can toss the packet as non-critical or keep it.  On the
send side, you can create a simple socket flag that tags all that socket's
SEND requests as critical.


>>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>>the system is in an emergency state and authorize the kernel to dip into
>>the critical pool to satisfy critical allocations.
> 
> 
> Seems odd to me.  Why make this another knob?  How did you run to set this
> flag if you're in emergency and kswapd is going nuts?

We did this because we didn't want __GFP_CRITICAL allocations dipping into
the pool in the case of a transient low memory situation.  In those cases we
want to force the task to do writeback to get a page (as usual), so that
the critical pool will still be full when the system REALLY goes critical.  We
also open the in_emergency file when the app starts, so that we can just
write to it and don't need to try to open it while kswapd is going nuts.
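
Concretely, the proposed knobs would be driven from userspace something
like this (the paths are the ones proposed in patch 0/8; the pool size and
the idea of a monitoring daemon are just examples, and none of this exists
without the patch series applied):

```shell
# Reserve a 1024-page critical pool at boot:
echo 1024 > /proc/sys/vm/critical_pages

# A monitoring daemon opens the trigger early; on detecting a cluster
# failover it authorizes use of the pool:
echo 1 > /proc/sys/vm/in_emergency

# Once the emergency has passed, close the pool again:
echo 0 > /proc/sys/vm/in_emergency
```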

-Matt


* Re: [RFC][PATCH 1/8] Create Critical Page Pool
  2005-11-19  0:08   ` Paul Jackson
@ 2005-11-21  5:50     ` Matthew Dobson
  2005-11-21  5:54       ` Paul Jackson
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Dobson @ 2005-11-21  5:50 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-kernel, linux-mm

Paul Jackson wrote:
> Total nit:
> 
>  #define __GFP_HARDWALL   ((__force gfp_t)0x40000u) /* Enforce hardwall cpuset memory allocs */
> +#define __GFP_CRITICAL	((__force gfp_t)0x80000u) /* Critical allocation. MUST succeed! */
> 
> Looks like you used a space instead of a tab.

It's a tab on my side...  Maybe some whitespace munging by Thunderbird?
Will make sure it's definitely a tab on the next iteration.

-Matt


* Re: [RFC][PATCH 2/8] Create emergency trigger
  2005-11-19  0:21   ` Paul Jackson
@ 2005-11-21  5:51     ` Matthew Dobson
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-21  5:51 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-kernel, linux-mm

Paul Jackson wrote:
>>@@ -876,6 +879,16 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
>> 	int can_try_harder;
>> 	int did_some_progress;
>> 
>>+	if (is_emergency_alloc(gfp_mask)) {
> 
> 
> Can this check for is_emergency_alloc be moved lower in __alloc_pages?
> 
> I don't see any reason why most __alloc_pages() calls, that succeed
> easily in the first loop over the zonelist, have to make this check.
> This would save one conditional test and jump on the most heavily
> used code path in __alloc_pages().

Good point, Paul.  Will make sure that gets moved.

Thanks!

-Matt


* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-20  7:45 ` Keith Owens
@ 2005-11-21  5:53   ` Matthew Dobson
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-21  5:53 UTC (permalink / raw)
  To: Keith Owens; +Cc: linux-kernel, Linux Memory Management

Keith Owens wrote:
> On Fri, 18 Nov 2005 11:32:57 -0800, 
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> 
>>We have a clustering product that needs to be able to guarantee that the
>>networking system won't stop functioning in the case of OOM/low memory
>>condition.  The current mempool system is inadequate because to keep the
>>whole networking stack functioning, we need more than 1 or 2 slab caches to
>>be guaranteed.  We need to guarantee that any request made with a specific
>>flag will succeed, assuming of course that you've made your "critical page
>>pool" big enough.
>>
>>The following patch series implements such a critical page pool.  It
>>creates 2 userspace triggers:
>>
>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>for the critical pool into this file
>>
>>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>>the system is in an emergency state and authorize the kernel to dip into
>>the critical pool to satisfy critical allocations.
> 
> 
> FWIW, the Kernel Debugger (KDB) has similar problems where the system
> is dying because of lack of memory but KDB must call some functions
> that use kmalloc.  A related problem is that sometimes KDB is invoked
> from a non maskable interrupt, so I could not even trust the state of
> the spinlocks and the chains in the slab code.
> 
> I worked around the problem by adding a last ditch allocator.  Extract
> from the kdb patch.

Ahh... very interesting.  And disappointingly much smaller than mine. :(

Thanks for the patch and the feedback!

-Matt


* Re: [RFC][PATCH 1/8] Create Critical Page Pool
  2005-11-21  5:50     ` Matthew Dobson
@ 2005-11-21  5:54       ` Paul Jackson
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Jackson @ 2005-11-21  5:54 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: linux-kernel, linux-mm

Matthew wrote:
> It's a tab on my side.

Oh - ok.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-20 23:04 ` Pavel Machek
@ 2005-11-21  5:58   ` Matthew Dobson
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Dobson @ 2005-11-21  5:58 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel, Linux Memory Management

Pavel Machek wrote:
> On Fri 18-11-05 11:32:57, Matthew Dobson wrote:
> 
>>We have a clustering product that needs to be able to guarantee that the
>>networking system won't stop functioning in the case of OOM/low memory
>>condition.  The current mempool system is inadequate because to keep the
>>whole networking stack functioning, we need more than 1 or 2 slab caches to
>>be guaranteed.  We need to guarantee that any request made with a specific
>>flag will succeed, assuming of course that you've made your "critical page
>>pool" big enough.
>>
>>The following patch series implements such a critical page pool.  It
>>creates 2 userspace triggers:
>>
>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>for the critical pool into this file
>>
>>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>>the system is in an emergency state and authorize the kernel to dip into
>>the critical pool to satisfy critical allocations.
>>
>>We mark critical allocations with the __GFP_CRITICAL flag, and when the
>>system is in an emergency state, we are allowed to delve into this pool to
>>satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
>>normal means.
> 
> 
> Ugh, relying on userspace to tell you that you need to dip into emergency
> pool seems to be racy and unreliable. How can you guarantee that userspace
> is scheduled soon enough in case of OOM?
> 							Pavel

It's not really for userspace to tell us that we're about to OOM, as the
kernel is in a far better position to determine that.  It is to let the
kernel know that *something* has gone wrong, and we've got to keep
networking (or any other user of __GFP_CRITICAL) up for a few minutes, *no
matter what*.  We may not ever OOM, or even run terribly low on memory, but
the trigger allows the use of the pool IF that happens.

-Matt


* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-21  5:47   ` Matthew Dobson
@ 2005-11-21 13:29     ` Pavel Machek
  2005-12-06 22:54       ` Matthew Dobson
  0 siblings, 1 reply; 27+ messages in thread
From: Pavel Machek @ 2005-11-21 13:29 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: Chris Wright, linux-kernel, Linux Memory Management

Hi!

> > * Matthew Dobson (colpatch@us.ibm.com) wrote:
> > 
> >>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
> >>for the critical pool into this file
> > 
> > 
> > How do you size this pool?
> 
> Trial and error.  If you want networking to survive with no memory other
> than the critical pool for 2 minutes, for example, you pick a random value,
> block all other allocations (I have a test patch to do this), and send a
> boatload of packets at the box.  If it OOMs, you need a bigger pool.
> Lather, rinse, repeat.

...and then you find out that your test was not "bad enough", or that
it needs more memory on different machines.  It may be a good enough hack
for your usage, but I do not think it belongs in mainline.
								Pavel
-- 
Thanks, Sharp!


* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-11-21 13:29     ` Pavel Machek
@ 2005-12-06 22:54       ` Matthew Dobson
  2005-12-10  8:39         ` Pavel Machek
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Dobson @ 2005-12-06 22:54 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Chris Wright, linux-kernel, Linux Memory Management

Pavel Machek wrote:
> Hi!
> 
> 
>>>* Matthew Dobson (colpatch@us.ibm.com) wrote:
>>>
>>>
>>>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>>>for the critical pool into this file
>>>
>>>
>>>How do you size this pool?
>>
>>Trial and error.  If you want networking to survive with no memory other
>>than the critical pool for 2 minutes, for example, you pick a random value,
>>block all other allocations (I have a test patch to do this), and send a
>>boatload of packets at the box.  If it OOMs, you need a bigger pool.
>>Lather, rinse, repeat.
> 
> 
> ...and then you find out that your test was not "bad enough" or that
> it needs more memory on different machines. It may be good enough hack
> for your usage, but I do not think it belongs in mainline.
> 								Pavel

Way late in responding to this, but...

Appropriate sizing of this pool is a known issue.  For example, we want to
use it to keep the networking stack alive during extreme memory pressure
situations.  The only way to size the pool so as to *guarantee* that it
will not be exhausted during the 2-minute window we need would be to ensure
that the pool has at least (TOTAL_BANDWIDTH_OF_ALL_NICS * 120 seconds) bytes
available.  In the case of a simple system with a single GigE adapter we'd
need (1 gigabit/sec * 120 sec) = 120 gigabits = 15 gigabytes of reserve
pool.  That is obviously completely impractical, considering many boxes
have multiple GigE adapters or even 10 GigE adapters.  It is also
incredibly unlikely that the NIC will be hit with a continuous stream of
packets at a level that would completely saturate the link.  Starting with
an educated guess and some test runs with a reasonable workload should give
you a good idea of how much space you'd *realistically* need to reserve.
Given any reserve size less than the theoretical maximum you obviously
can't *guarantee* the pool won't be exhausted, but you can be pretty
confident.

-Matt


* Re: [RFC][PATCH 0/8] Critical Page Pool
  2005-12-06 22:54       ` Matthew Dobson
@ 2005-12-10  8:39         ` Pavel Machek
  0 siblings, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2005-12-10  8:39 UTC (permalink / raw)
  To: Matthew Dobson; +Cc: Chris Wright, linux-kernel, Linux Memory Management

Hi!

> > ...and then you find out that your test was not "bad enough" or that
> > it needs more memory on different machines. It may be good enough hack
> > for your usage, but I do not think it belongs in mainline.
> > 								Pavel
> 
> Way late in responding to this, but...
> 
> Apropriate sizing of this pool is a known issue.  For example, we want to
> use it to keep the networking stack alive during extreme memory pressure
> situations.  The only way to size the pool so as to *guarantee* that it
> will not be exhausted during the 2 minute window we need would be to ensure
> that the pool has at least (TOTAL_BANDWITH_OF_ALL_NICS * 120 seconds) bytes
> available.  In the case of a simple system with a single GigE adapter we'd
> need (1 gigbit/sec * 120 sec) = 120 gigabits = 15 gigabytes of reserve
> pool.  That is obviously completely impractical, considering many boxes

And it is not enough... If someone hits you with small packets,
allocation overhead is going to be high.
							Pavel
-- 
Thanks, Sharp!


end of thread, other threads:[~2005-12-10  8:39 UTC | newest]

Thread overview: 27+ messages
2005-11-18 19:32 [RFC][PATCH 0/8] Critical Page Pool Matthew Dobson
2005-11-18 19:36 ` [RFC][PATCH 1/8] Create " Matthew Dobson
2005-11-19  0:08   ` Paul Jackson
2005-11-21  5:50     ` Matthew Dobson
2005-11-21  5:54       ` Paul Jackson
2005-11-18 19:36 ` [RFC][PATCH 2/8] Create emergency trigger Matthew Dobson
2005-11-19  0:21   ` Paul Jackson
2005-11-21  5:51     ` Matthew Dobson
2005-11-18 19:41 ` [RFC][PATCH 4/8] Fix a bug in scsi_get_command Matthew Dobson
2005-11-18 19:43 ` [RFC][PATCH 5/8] get_object/return_object Matthew Dobson
2005-11-18 19:44 ` [RFC][PATCH 6/8] slab_destruct Matthew Dobson
2005-11-18 19:44 ` [RFC][PATCH 0/8] Critical Page Pool Avi Kivity
2005-11-18 19:51   ` Matthew Dobson
2005-11-18 20:42     ` Avi Kivity
2005-11-19  0:10       ` Paul Jackson
2005-11-21  5:36       ` Matthew Dobson
2005-11-18 19:45 ` [RFC][PATCH 7/8] __cache_grow() Matthew Dobson
2005-11-18 19:47 ` [RFC][PATCH 8/8] Add support critical pool support to the slab allocator Matthew Dobson
2005-11-18 19:56 ` [RFC][PATCH 0/8] Critical Page Pool Chris Wright
2005-11-21  5:47   ` Matthew Dobson
2005-11-21 13:29     ` Pavel Machek
2005-12-06 22:54       ` Matthew Dobson
2005-12-10  8:39         ` Pavel Machek
2005-11-20  7:45 ` Keith Owens
2005-11-21  5:53   ` Matthew Dobson
2005-11-20 23:04 ` Pavel Machek
2005-11-21  5:58   ` Matthew Dobson
