Netdev List
* [PATCH -mmotm 10/30] mm: __GFP_MEMALLOC
From: Xiaotian Feng @ 2010-07-13 10:18 UTC
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

From 78618cc83e6a21b876c30a8fb3940ccc1f5b99e1 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:49:16 +0800
Subject: [PATCH 10/30] mm: __GFP_MEMALLOC

__GFP_MEMALLOC allows the allocation to disregard the watermarks, much like
PF_MEMALLOC.

It allows the memalloc state to be passed along in object-related allocation
flags, such as sk->sk_allocation, rather than in task-related flags.
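
The resulting flag mapping can be sketched in userspace as follows. This is
only a simplified model of the gfp_to_alloc_flags() hunk in this patch: the
SK_-prefixed names and struct sketch_ctx are hypothetical, the bit values
other than 0x2000 and 0x10000 are assumed, and the TIF_MEMDIE branch is left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 0x2000 and 0x10000 mirror the patch; the other values are assumptions. */
#define SK_GFP_MEMALLOC		0x2000u
#define SK_GFP_NOMEMALLOC	0x10000u
#define SK_PF_MEMALLOC		0x800u	/* assumed task-flag bit */
#define SK_ALLOC_NO_WATERMARKS	0x04u	/* assumed alloc-flag bit */

/* Hypothetical stand-in for "current" plus the in_irq() state. */
struct sketch_ctx {
	unsigned int task_flags;
	bool in_hardirq;
};

/*
 * The object-related flag (__GFP_MEMALLOC) is checked first and does not
 * depend on the task; the task-related flag (PF_MEMALLOC) is only honoured
 * outside hard interrupt context. __GFP_NOMEMALLOC vetoes both.
 */
static unsigned int sketch_gfp_to_alloc_flags(unsigned int gfp_mask,
					      const struct sketch_ctx *ctx)
{
	unsigned int alloc_flags = 0;

	assert(ctx != NULL);

	if (!(gfp_mask & SK_GFP_NOMEMALLOC)) {
		if (gfp_mask & SK_GFP_MEMALLOC)
			alloc_flags |= SK_ALLOC_NO_WATERMARKS;
		else if (!ctx->in_hardirq &&
			 (ctx->task_flags & SK_PF_MEMALLOC))
			alloc_flags |= SK_ALLOC_NO_WATERMARKS;
	}

	return alloc_flags;
}
```

In this model a socket whose allocation mask carries the flag gets access to
the reserves even from hard interrupt context, where the task flags are
ignored.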

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/gfp.h |    3 ++-
 mm/page_alloc.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 975609c..c608e26 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -47,6 +47,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* See above */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -98,7 +99,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 21d64f7..f7d3060 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1930,7 +1930,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))
-- 
1.7.1.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH -mmotm 09/30] mm: system wide ALLOC_NO_WATERMARK
From: Xiaotian Feng @ 2010-07-13 10:18 UTC
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

From 6d9ae36dfe34cdbc6e1fc52e6db0b27286eb4b58 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:46:50 +0800
Subject: [PATCH 09/30] mm: system wide ALLOC_NO_WATERMARK

The reserve is proportionally distributed over all (!highmem) zones in the
system. So we need to allow an emergency allocation access to all zones. In
order to do that we need to break out of any mempolicy boundaries we might
have.

In my opinion that does not break mempolicies as those are user oriented
and not system oriented. That is, system allocations are not guaranteed to be
within mempolicy boundaries. For instance IRQs don't even have a mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency allocations,
which are always system allocations (as opposed to user) is ok.
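
The proportional distribution mentioned above can be sketched as follows. It
is a userspace model of the arithmetic in __setup_per_zone_wmarks() from the
emergency pool patch of this series; struct sketch_zone and
sketch_distribute_reserve() are hypothetical names, and the model ignores
locking and the min watermark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for struct zone. */
struct sketch_zone {
	unsigned long present_pages;
	bool highmem;
	unsigned long pages_emerg;	/* this zone's share of the reserve */
};

/*
 * Give every lowmem zone a share of reserve_pages proportional to its
 * size; highmem zones get nothing.
 */
static void sketch_distribute_reserve(struct sketch_zone *zones, size_t nr,
				      unsigned long reserve_pages)
{
	uint64_t lowmem_pages = 0;
	size_t i;

	assert(zones != NULL);

	for (i = 0; i < nr; i++)
		if (!zones[i].highmem)
			lowmem_pages += zones[i].present_pages;

	for (i = 0; i < nr; i++) {
		if (zones[i].highmem || lowmem_pages == 0) {
			zones[i].pages_emerg = 0;
			continue;
		}
		zones[i].pages_emerg = (unsigned long)
			((uint64_t)reserve_pages * zones[i].present_pages /
			 lowmem_pages);
	}
}
```

Because each zone holds only a slice of the reserve, an allocation confined
to the zones of a single mempolicy could see less than the amount that was
reserved system wide.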

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 mm/page_alloc.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6c85e9e..21d64f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1995,6 +1995,11 @@ restart:
 rebalance:
 	/* Allocate without watermarks if the context allows */
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
+		/*
+		 * break out of mempolicy boundaries
+		 */
+		zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-- 
1.7.1.1


* [PATCH -mmotm 08/30] mm: emergency pool
From: Xiaotian Feng @ 2010-07-13 10:18 UTC
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

From 973ce2ee797025f055ccb61359180e53c3ecc681 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:45:48 +0800
Subject: [PATCH 08/30] mm: emergency pool

Provide a means to reserve a specific number of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
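
A minimal sketch of that argument, modelled on the zone_watermark_ok() hunk
in this patch: the function name, the flat parameter list and the flag values
are assumptions, and the per-order loop of the real function is omitted.

```c
#include <assert.h>

#define SK_ALLOC_HARDER	0x10u	/* assumed bit values */
#define SK_ALLOC_HIGH	0x20u

/*
 * ALLOC_HIGH and ALLOC_HARDER scale the min watermark down by a fraction,
 * so how far an allocation may dip depends on the watermark itself.
 * pages_emerg is added after the scaling and therefore stays a fixed floor.
 */
static int sketch_watermark_ok(long free_pages, long mark,
			       long lowmem_reserve, long pages_emerg,
			       unsigned int alloc_flags)
{
	long min = mark;

	assert(mark >= 0 && pages_emerg >= 0);

	if (alloc_flags & SK_ALLOC_HIGH)
		min -= min / 2;
	if (alloc_flags & SK_ALLOC_HARDER)
		min -= min / 4;

	if (free_pages <= min + lowmem_reserve + pages_emerg)
		return 0;

	return 1;
}
```

With a mark of 128 pages and an emergency pool of 64, an allocation carrying
both flags sees min shrink from 128 to 48, yet it still fails once free pages
drop to 112, so the 64 reserved pages are not touched by it.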

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/mmzone.h |    3 ++
 mm/page_alloc.c        |   84 ++++++++++++++++++++++++++++++++++++++++++------
 mm/vmstat.c            |    6 ++--
 3 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9ed9c45..75a0871 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -279,6 +279,7 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
+	unsigned long           pages_emerg;    /* emergency pool */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
@@ -770,6 +771,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 
+int adjust_memalloc_reserve(int pages);
+
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 extern char numa_zonelist_order[];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b1fa10..6c85e9e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -167,6 +167,8 @@ static char * const zone_names[MAX_NR_ZONES] = {
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 static unsigned long __meminitdata nr_kernel_pages;
 static unsigned long __meminitdata nr_all_pages;
@@ -1457,7 +1459,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min+z->lowmem_reserve[classzone_idx]+z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1946,7 +1948,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
-	int alloc_flags;
+	int alloc_flags = 0;
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	struct task_struct *p = current;
@@ -2078,8 +2080,8 @@ rebalance:
 nopage:
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
 		dump_stack();
 		show_mem();
 	}
@@ -2414,9 +2416,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(min_wmark_pages(zone)),
-			K(low_wmark_pages(zone)),
-			K(high_wmark_pages(zone)),
+			K(zone->pages_emerg + min_wmark_pages(zone)),
+			K(zone->pages_emerg + low_wmark_pages(zone)),
+			K(zone->pages_emerg + high_wmark_pages(zone)),
 			K(zone_page_state(zone, NR_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -4779,7 +4781,7 @@ static void calculate_totalreserve_pages(void)
 			}
 
 			/* we treat the high watermark as reserved pages. */
-			max += high_wmark_pages(zone);
+			max += high_wmark_pages(zone) + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4837,7 +4839,8 @@ static void setup_per_zone_lowmem_reserve(void)
  */
 static void __setup_per_zone_wmarks(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -4849,11 +4852,13 @@ static void __setup_per_zone_wmarks(void)
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -4872,12 +4877,14 @@ static void __setup_per_zone_wmarks(void)
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->watermark[WMARK_MIN] = min_pages;
+			zone->pages_emerg = 0;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->watermark[WMARK_MIN] = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
@@ -4942,6 +4949,63 @@ void setup_per_zone_wmarks(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+static void __adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_wmarks();
+}
+
+static int test_reserve_limits(void)
+{
+	struct zone *zone;
+	int node;
+
+	for_each_zone(zone)
+		wakeup_kswapd(zone, 0);
+
+	for_each_online_node(node) {
+		struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
+		if (!page)
+			return -ENOMEM;
+
+		__free_page(page);
+	}
+
+	return 0;
+}
+
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks reclaim into action to
+ *	satisfy the higher watermarks.
+ *
+ *	returns -ENOMEM when it failed to satisfy the watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+	int err = 0;
+
+	mutex_lock(&var_free_mutex);
+	__adjust_memalloc_reserve(pages);
+	if (pages > 0) {
+		err = test_reserve_limits();
+		if (err) {
+			__adjust_memalloc_reserve(-pages);
+			goto unlock;
+		}
+	}
+	printk(KERN_DEBUG "Emergency reserve: %d\n", var_free_kbytes);
+
+unlock:
+	mutex_unlock(&var_free_mutex);
+	return err;
+}
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 15a14b1..af91d5c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,9 +814,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   min_wmark_pages(zone),
-		   low_wmark_pages(zone),
-		   high_wmark_pages(zone),
+		   zone->pages_emerg + min_wmark_pages(zone),
+		   zone->pages_emerg + low_wmark_pages(zone),
+		   zone->pages_emerg + high_wmark_pages(zone),
 		   zone->pages_scanned,
 		   zone->spanned_pages,
 		   zone->present_pages);
-- 
1.7.1.1


* [PATCH -mmotm 07/30] mm: allow PF_MEMALLOC from softirq context
From: Xiaotian Feng @ 2010-07-13 10:18 UTC
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

From 32f474cffc76551ddb792454845bd473634219b5 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:41:53 +0800
Subject: [PATCH 07/30] mm: allow PF_MEMALLOC from softirq context

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC because it is not associated
with a task and therefore has no task flags to fiddle with. Thus the gfp to
alloc flag mapping ignores the task flags when in interrupt (hard or soft)
context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some trickery.
We basically borrow the task flags from whatever process happens to be
preempted by the softirq.

So we modify the gfp to alloc flags mapping to not exclude task flags in
softirq context, and modify the softirq code to save, clear and restore the
PF_MEMALLOC flag.

The save and clear ensure that the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq. The restore ensures that a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process.
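
The save, clear and restore sequence can be sketched in userspace as follows.
sketch_restore_flags() mirrors tsk_restore_flags() from this patch; struct
sketch_task, the handler callback and the flag value are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PF_MEMALLOC	0x800ul	/* assumed bit value */

/* Hypothetical stand-in for task_struct. */
struct sketch_task {
	unsigned long flags;
};

typedef void (*sketch_handler)(struct sketch_task *);

/* Restore only the bits in mask from the saved copy. */
static void sketch_restore_flags(struct sketch_task *p,
				 unsigned long pflags, unsigned long mask)
{
	p->flags &= ~mask;
	p->flags |= pflags & mask;
}

/* Model of the borrowed-task-flags dance around softirq processing. */
static void sketch_do_softirq(struct sketch_task *cur, sketch_handler handler)
{
	unsigned long pflags;

	assert(cur != NULL && handler != NULL);

	pflags = cur->flags;			/* save */
	cur->flags &= ~SK_PF_MEMALLOC;		/* clear */
	handler(cur);				/* may set PF_MEMALLOC */
	sketch_restore_flags(cur, pflags, SK_PF_MEMALLOC);	/* restore */
}

/* Example handler: records what it saw on entry, then sets the flag. */
static int sketch_seen_memalloc_on_entry;

static void sketch_handler_sets_memalloc(struct sketch_task *t)
{
	sketch_seen_memalloc_on_entry = (t->flags & SK_PF_MEMALLOC) != 0;
	t->flags |= SK_PF_MEMALLOC;
}
```

Whatever the handler does to the PF_MEMALLOC bit, the interrupted task ends
up with the value it had before; other flag bits are not touched by the
restore.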

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/sched.h |    7 +++++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    7 ++++---
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f25798..85b74b0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1765,6 +1765,13 @@ static inline void rcu_copy_process(struct task_struct *p)
 
 #endif
 
+static inline void tsk_restore_flags(struct task_struct *p,
+				     unsigned long pflags, unsigned long mask)
+{
+	p->flags &= ~mask;
+	p->flags |= pflags & mask;
+}
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed_ptr(struct task_struct *p,
 				const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 07b4f1b..0770e78 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -194,6 +194,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -246,6 +248,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b9989c5..4b1fa10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1928,9 +1928,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
-- 
1.7.1.1


* [PATCH -mmotm 06/30] mm: kmem_alloc_estimate()
From: Xiaotian Feng @ 2010-07-13 10:17 UTC
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

From 4a2dff5cf02e9d7f6ee9345c337697c4ab66c6dc Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:41:22 +0800
Subject: [PATCH 06/30] mm: kmem_alloc_estimate()

Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

This lays the foundation for a generic reserve framework as presented in
a later patch in this series. This framework needs to convert object demand
(kmalloc() bytes, kmem_cache_alloc() objects) to pages.
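
The core of the conversion can be sketched as follows. This is a simplified
userspace model of the arithmetic shared by the three allocators in this
patch; the function names and the 4096-byte page size are assumptions, and
the per-cache and per-cpu overhead terms that the real functions add are left
out.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE		4096u	/* assumed page size */
#define SK_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Objects to pages: round the object count up to whole slabs, then
 * multiply by the pages per slab (1 << order).
 */
static unsigned sketch_objs_to_pages(unsigned objects, unsigned objs_per_slab,
				     unsigned order)
{
	assert(objs_per_slab > 0);

	return SK_DIV_ROUND_UP(objects, objs_per_slab) << order;
}

/*
 * Bytes to pages: double the byte count to cover the worst-case slack of
 * power-of-two kmalloc sizes, then round up to whole pages.
 */
static unsigned long sketch_bytes_to_pages(size_t bytes)
{
	return SK_DIV_ROUND_UP(2 * bytes, SK_PAGE_SIZE);
}
```

For example, 100 objects at 30 objects per order-1 slab need 4 slabs, which
is 8 pages, and 3000 bytes of kmalloc() demand are estimated at 2 pages.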

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/slab.h |    4 ++
 mm/slab.c            |   75 +++++++++++++++++++++++++++++++++++++++++++
 mm/slob.c            |   67 ++++++++++++++++++++++++++++++++++++++
 mm/slub.c            |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 233 insertions(+), 0 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 49d1247..b57b9ca 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -108,6 +108,8 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kern_ptr_validate(const void *ptr, unsigned long size);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
+			gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -144,6 +146,8 @@ void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 void kzfree(const void *);
 size_t ksize(const void *);
+unsigned kmalloc_estimate_objs(size_t, gfp_t, int);
+unsigned kmalloc_estimate_bytes(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
diff --git a/mm/slab.c b/mm/slab.c
index d8cd757..2a0dd0d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3913,6 +3913,81 @@ const char *kmem_cache_name(struct kmem_cache *cachep)
 EXPORT_SYMBOL_GPL(kmem_cache_name);
 
 /*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @cachep.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
+		gfp_t flags, int objects)
+{
+	/*
+	 * (1) memory for objects,
+	 */
+	unsigned nr_slabs = DIV_ROUND_UP(objects, cachep->num);
+	unsigned nr_pages = nr_slabs << cachep->gfporder;
+
+	/*
+	 * (2) memory for each per-cpu queue (nr_cpu_ids),
+	 * (3) memory for each per-node alien queues (nr_cpu_ids), and
+	 * (4) some amount of memory for the slab management structures
+	 *
+	 * XXX: truly account these
+	 */
+	nr_pages += 1 + ilog2(nr_pages);
+
+	return nr_pages;
+}
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_objs(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = kmem_find_general_cachep(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_alloc_estimate(s, flags, count);
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_objs);
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_bytes(gfp_t flags, size_t bytes)
+{
+	unsigned long pages;
+	struct cache_sizes *csizep = malloc_sizes;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (csizep = malloc_sizes; csizep->cs_cachep; csizep++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & __GFP_DMA))
+			s = csizep->cs_dmacachep;
+		else
+#endif
+			s = csizep->cs_cachep;
+
+		if (s)
+			pages += kmem_alloc_estimate(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_bytes);
+
+/*
  * This initializes kmem_list3 or resizes various caches for all nodes.
  */
 static int alloc_kmemlist(struct kmem_cache *cachep, gfp_t gfp)
diff --git a/mm/slob.c b/mm/slob.c
index b84b611..0caf938 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -695,6 +695,73 @@ int slab_is_available(void)
 	return slob_ready;
 }
 
+static unsigned __slob_estimate(unsigned size, unsigned align, unsigned objects)
+{
+	unsigned nr_pages;
+
+	size = SLOB_UNIT * SLOB_UNITS(size + align - 1);
+
+	if (size <= PAGE_SIZE) {
+		nr_pages = DIV_ROUND_UP(objects, PAGE_SIZE / size);
+	} else {
+		nr_pages = objects << get_order(size);
+	}
+
+	return nr_pages;
+}
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @c.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *c, gfp_t flags, int objects)
+{
+	unsigned size = c->size;
+
+	if (c->flags & SLAB_DESTROY_BY_RCU)
+		size += sizeof(struct slob_rcu);
+
+	return __slob_estimate(size, c->align, objects);
+}
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_objs(size_t size, gfp_t flags, int count)
+{
+	unsigned align = max(ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
+
+	return __slob_estimate(size, align, count);
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_objs);
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_bytes(gfp_t flags, size_t bytes)
+{
+	unsigned long pages;
+
+	/*
+	 * Multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 *
+	 * While not true for slob, it cannot do worse than that for sequential
+	 * allocations.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * Our power of two series starts at PAGE_SIZE, so add one page.
+	 */
+	pages++;
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_bytes);
+
 void __init kmem_cache_init(void)
 {
 	slob_ready = 1;
diff --git a/mm/slub.c b/mm/slub.c
index 7a5d6dc..056545e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2435,6 +2435,42 @@ const char *kmem_cache_name(struct kmem_cache *s)
 }
 EXPORT_SYMBOL(kmem_cache_name);
 
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @s.
+ *
+ * We should use s->min_objects because those are the least efficient.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *s, gfp_t flags, int objects)
+{
+	unsigned long pages;
+	struct kmem_cache_order_objects x;
+
+	if (WARN_ON(!s) || WARN_ON(!oo_objects(s->min)))
+		return 0;
+
+	x = s->min;
+	pages = DIV_ROUND_UP(objects, oo_objects(x)) << oo_order(x);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more than
+	 * one object. Use s->max_objects because that's the worst case.
+	 */
+	x = s->oo;
+	if (oo_objects(x) > 1) {
+		/*
+		 * Account the possible additional overhead if per cpu slabs
+		 * are currently empty and have to be allocated. This is very
+		 * unlikely but a possible scenario immediately after
+		 * kmem_cache_shrink.
+		 */
+		pages += num_possible_cpus() << oo_order(x);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmem_alloc_estimate);
+
 static void list_slab_objects(struct kmem_cache *s, struct page *page,
 							const char *text)
 {
@@ -2868,6 +2904,57 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_objs(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = get_slab(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_alloc_estimate(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_objs);
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_bytes(gfp_t flags, size_t bytes)
+{
+	int i;
+	unsigned long pages;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (i = 1; i < PAGE_SHIFT; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+			s = dma_kmalloc_cache(i, flags);
+		else
+#endif
+			s = &kmalloc_caches[i];
+
+		if (s)
+			pages += kmem_alloc_estimate(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_bytes);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up
-- 
1.7.1.1


* [PATCH -mmotm 05/30] mm: sl[au]b: add knowledge of reserve pages
From: Xiaotian Feng @ 2010-07-13 10:17 UTC
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

From fba0bdebc34d3db41a2c975eb38e9548ea5c2ed1 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Tue, 13 Jul 2010 10:40:05 +0800
Subject: [PATCH 05/30] mm: sl[au]b: add knowledge of reserve pages

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to them. This is done to ensure reserve pages don't
leak out and get consumed.

The basic pattern used for all three allocators is the following: for each
active slab page we store whether it came from an emergency allocation. When
it did, make sure the current allocation context would have been able to
allocate a page from the emergency reserves as well. In that case allow the
allocation. If not, force a new slab allocation. When that works, the memory
pressure has lifted enough to allow this context to get an object; otherwise
fail the allocation.
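
The pattern can be sketched as a small userspace state machine. This is a
deliberately simplified model, not the code of any of the three allocators:
struct sketch_cpu_cache, sketch_alloc() and the fixed number of objects per
slab are assumptions, and the two boolean parameters stand in for the
allocation context and the state of the page allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_OBJS_PER_SLAB	4u	/* assumed slab capacity */

enum sketch_result { SK_FROM_CACHE, SK_FROM_NEW_SLAB, SK_FAIL };

/* Hypothetical per-cpu cache state. */
struct sketch_cpu_cache {
	unsigned int avail;	/* cached free objects */
	bool reserve;		/* did the active slab come from the reserve? */
};

/*
 * ctx_may_use_reserve: the context would get ALLOC_NO_WATERMARKS.
 * above_watermarks: a normal page allocation would currently succeed.
 */
static enum sketch_result sketch_alloc(struct sketch_cpu_cache *ac,
				       bool ctx_may_use_reserve,
				       bool above_watermarks)
{
	bool must_refill;

	assert(ac != NULL);

	/* A reserve slab may only serve contexts entitled to the reserve. */
	must_refill = ac->reserve && !ctx_may_use_reserve;

	if (ac->avail && !must_refill) {
		ac->avail--;
		return SK_FROM_CACHE;
	}

	/* Force a new slab; this tests the watermarks for this context. */
	if (!above_watermarks && !ctx_may_use_reserve)
		return SK_FAIL;

	/* Remember where the new slab page came from. */
	ac->reserve = !above_watermarks;
	ac->avail += SK_OBJS_PER_SLAB - 1;	/* one object goes to the caller */
	return SK_FROM_NEW_SLAB;
}
```

An ordinary context that hits a reserve slab thus either fails while the
pressure persists or, once a normal page allocation succeeds again, gets its
object from a fresh slab that is no longer marked as reserve.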

[mszeredi@suse.cz: Fix use of uninitialized variable in cache_grow]
[dfeng@redhat.com: Minor fix related with SLABDEBUG]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/slub_def.h |    1 +
 mm/slab.c                |   62 +++++++++++++++++++++++++++++++++++++++------
 mm/slob.c                |   16 +++++++++++-
 mm/slub.c                |   42 ++++++++++++++++++++++++++-----
 4 files changed, 104 insertions(+), 17 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 6447a72..9ef61f4 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -39,6 +39,7 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+	int reserve;		/* Did the current page come from the reserve */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff --git a/mm/slab.c b/mm/slab.c
index 4e9c46f..d8cd757 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -120,6 +120,8 @@
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
 
+#include 	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -244,7 +246,8 @@ struct array_cache {
 	unsigned int avail;
 	unsigned int limit;
 	unsigned int batchcount;
-	unsigned int touched;
+	unsigned int touched:1,
+		     reserve:1;
 	spinlock_t lock;
 	void *entry[];	/*
 			 * Must have this definition in here for the proper
@@ -680,6 +683,27 @@ static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
 	return cachep->array[smp_processor_id()];
 }
 
+/*
+ * If the last page came from the reserves, and the current allocation context
+ * does not have access to them, force an allocation to test the watermarks.
+ */
+static inline int slab_force_alloc(struct kmem_cache *cachep, gfp_t flags)
+{
+	if (unlikely(cpu_cache_get(cachep)->reserve) &&
+			!(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		return 1;
+
+	return 0;
+}
+
+static inline void slab_set_reserve(struct kmem_cache *cachep, int reserve)
+{
+	struct array_cache *ac = cpu_cache_get(cachep);
+
+	if (unlikely(ac->reserve != reserve))
+		ac->reserve = reserve;
+}
+
 static inline struct kmem_cache *__find_general_cachep(size_t size,
 							gfp_t gfpflags)
 {
@@ -886,6 +910,7 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 		nc->limit = entries;
 		nc->batchcount = batchcount;
 		nc->touched = 0;
+		nc->reserve = 0;
 		spin_lock_init(&nc->lock);
 	}
 	return nc;
@@ -1674,7 +1699,8 @@ __initcall(cpucache_init);
  * did not request dmaable memory, we might get it, but that
  * would be relatively rare and ignorable.
  */
-static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+		int *reserve)
 {
 	struct page *page;
 	int nr_pages;
@@ -1696,6 +1722,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (!page)
 		return NULL;
 
+	*reserve = page->reserve;
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -2128,6 +2155,7 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
 	cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
 	cpu_cache_get(cachep)->batchcount = 1;
 	cpu_cache_get(cachep)->touched = 0;
+	cpu_cache_get(cachep)->reserve = 0;
 	cachep->batchcount = 1;
 	cachep->limit = BOOT_CPUCACHE_ENTRIES;
 	return 0;
@@ -2813,6 +2841,7 @@ static int cache_grow(struct kmem_cache *cachep,
 	size_t offset;
 	gfp_t local_flags;
 	struct kmem_list3 *l3;
+	int reserve = -1;
 
 	/*
 	 * Be lazy and only check for valid flags here,  keeping it out of the
@@ -2851,7 +2880,7 @@ static int cache_grow(struct kmem_cache *cachep,
 	 * 'nodeid'.
 	 */
 	if (!objp)
-		objp = kmem_getpages(cachep, local_flags, nodeid);
+		objp = kmem_getpages(cachep, local_flags, nodeid, &reserve);
 	if (!objp)
 		goto failed;
 
@@ -2868,6 +2897,8 @@ static int cache_grow(struct kmem_cache *cachep,
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	check_irq_off();
+	if (reserve != -1)
+		slab_set_reserve(cachep, reserve);
 	spin_lock(&l3->list_lock);
 
 	/* Make slab active. */
@@ -3002,7 +3033,8 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep,
+		gfp_t flags, int must_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
@@ -3012,6 +3044,8 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 retry:
 	check_irq_off();
 	node = numa_mem_id();
+	if (unlikely(must_refill))
+		goto force_grow;
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3081,11 +3115,14 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || must_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3175,17 +3212,18 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
+	int must_refill = slab_force_alloc(cachep, flags);
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
+	if (likely(ac->avail && !must_refill)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
 		objp = ac->entry[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, must_refill);
 		/*
 		 * the 'ac' may be updated by cache_alloc_refill(),
 		 * and kmemleak_erase() requires its correct value.
@@ -3243,7 +3281,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
-	int nid;
+	int nid, reserve;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
@@ -3280,10 +3318,12 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, numa_mem_id());
+		obj = kmem_getpages(cache, local_flags, numa_mem_id(),
+				    &reserve);
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
+			slab_set_reserve(cache, reserve);
 			/*
 			 * Insert into the appropriate per node queues
 			 */
@@ -3323,6 +3363,9 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 	l3 = cachep->nodelists[nodeid];
 	BUG_ON(!l3);
 
+	if (unlikely(slab_force_alloc(cachep, flags)))
+		goto force_grow;
+
 retry:
 	check_irq_off();
 	spin_lock(&l3->list_lock);
@@ -3360,6 +3403,7 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
+force_grow:
 	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
 	if (x)
 		goto retry;
diff --git a/mm/slob.c b/mm/slob.c
index 3f19a34..b84b611 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -71,6 +71,7 @@
 #include <trace/events/kmem.h>
 
 #include <asm/atomic.h>
+#include "internal.h"
 
 /*
  * slob_block has a field 'units', which indicates size of block if +ve,
@@ -193,6 +194,11 @@ struct slob_rcu {
 static DEFINE_SPINLOCK(slob_lock);
 
 /*
+ * tracks the reserve state for the allocator.
+ */
+static int slob_reserve;
+
+/*
  * Encode the given size and next info into a free slob block s.
  */
 static void set_slob(slob_t *s, slobidx_t size, slob_t *next)
@@ -242,7 +248,7 @@ static int slob_last(slob_t *s)
 
 static void *slob_new_pages(gfp_t gfp, int order, int node)
 {
-	void *page;
+	struct page *page;
 
 #ifdef CONFIG_NUMA
 	if (node != -1)
@@ -254,6 +260,8 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 	if (!page)
 		return NULL;
 
+	slob_reserve = page->reserve;
+
 	return page_address(page);
 }
 
@@ -326,6 +334,11 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	slob_t *b = NULL;
 	unsigned long flags;
 
+	if (unlikely(slob_reserve)) {
+		if (!(gfp_to_alloc_flags(gfp) & ALLOC_NO_WATERMARKS))
+			goto grow;
+	}
+
 	if (size < SLOB_BREAK1)
 		slob_list = &free_slob_small;
 	else if (size < SLOB_BREAK2)
@@ -364,6 +377,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	}
 	spin_unlock_irqrestore(&slob_lock, flags);
 
+grow:
 	/* Not enough space: must allocate a new page */
 	if (!b) {
 		b = slob_new_pages(gfp & ~__GFP_ZERO, 0, node);
diff --git a/mm/slub.c b/mm/slub.c
index 7bb7940..7a5d6dc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -27,6 +27,8 @@
 #include <linux/memory.h>
 #include <linux/math64.h>
 #include <linux/fault-inject.h>
+#include "internal.h"
+
 
 /*
  * Lock order:
@@ -1139,7 +1141,8 @@ static void setup_object(struct kmem_cache *s, struct page *page,
 		s->ctor(object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static
+struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
 {
 	struct page *page;
 	void *start;
@@ -1153,6 +1156,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if (!page)
 		goto out;
 
+	*reserve = page->reserve;
+
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
@@ -1606,10 +1611,20 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 {
 	void **object;
 	struct page *new;
+	int reserve;
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
 
+	if (unlikely(c->reserve)) {
+		/*
+		 * If the current slab is a reserve slab and the current
+		 * allocation context does not allow access to the reserves we
+		 * must force an allocation to test the current levels.
+		 */
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+			goto grow_slab;
+	}
 	if (!c->page)
 		goto new_slab;
 
@@ -1623,8 +1638,8 @@ load_freelist:
 	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
-		goto debug;
+	if (unlikely(SLABDEBUG && PageSlubDebug(c->page) || c->reserve))
+		goto slow_path;
 
 	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
@@ -1646,16 +1661,18 @@ new_slab:
 		goto load_freelist;
 	}
 
+grow_slab:
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	new = new_slab(s, gfpflags, node, &reserve);
 
 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
 
 	if (new) {
 		c = __this_cpu_ptr(s->cpu_slab);
+		c->reserve = reserve;
 		stat(s, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1667,10 +1684,20 @@ new_slab:
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
-debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
+
+slow_path:
+	if (!c->reserve && !alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;
 
+	/*
+	 * Avoid the slub fast path in slab_alloc() by not setting
+	 * c->freelist and the fast path in slab_free() by making
+	 * node_match() fail by setting c->node to -1.
+	 *
+	 * We use this for debug and reserve checks which need
+	 * to be done for each allocation.
+	 */
+
 	c->page->inuse++;
 	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
@@ -2095,10 +2122,11 @@ static void early_kmem_cache_node_alloc(gfp_t gfpflags, int node)
 	struct page *page;
 	struct kmem_cache_node *n;
 	unsigned long flags;
+	int reserve;
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
-- 
1.7.1.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH -mmotm 04/30] mm: tag reseve pages
From: Xiaotian Feng @ 2010-07-13 10:17 UTC (permalink / raw)
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

>From 14f7823986429b8374d18e3f648cd84575296e03 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 18:00:37 +0800
Subject: [PATCH 04/30] mm: tag reseve pages

Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Since low-memory situations are transient, and unrelated to the actual
page (any page can be on the freelist when we run low), don't mark the
page in any permanent way - just pass along the information to the
allocatee.
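
As a rough illustration of the idea (a user-space sketch, not the kernel
code): the allocator records on the page it hands out whether the
watermarks were ignored, and the caller reads that tag straight away.
struct fake_page, struct fake_zone and fake_alloc_page() are hypothetical
names; only ALLOC_NO_WATERMARKS and the `!!` expression come from the
series.

```c
#include <stddef.h>

#define ALLOC_NO_WATERMARKS 0x04	/* value as defined in mm/internal.h */

/* Hypothetical stand-in for struct page: only the tag matters here. */
struct fake_page {
	int reserve;	/* non-zero: handed out while ignoring watermarks */
};

/* Hypothetical zone state: free pages and the minimum watermark. */
struct fake_zone {
	long free_pages;
	long wmark_min;
};

/*
 * Hand out one page from @zone into @page.  Normal callers fail once the
 * zone would drop below its watermark; callers with ALLOC_NO_WATERMARKS
 * may dip into the reserve.  Returns NULL on failure.
 */
static struct fake_page *fake_alloc_page(struct fake_zone *zone,
					 struct fake_page *page,
					 int alloc_flags)
{
	int no_wmark = !!(alloc_flags & ALLOC_NO_WATERMARKS);

	if (zone->free_pages <= 0)
		return NULL;
	if (!no_wmark && zone->free_pages - 1 < zone->wmark_min)
		return NULL;

	zone->free_pages--;
	/* Same expression as in the patch: tag the page, nothing permanent. */
	page->reserve = no_wmark;
	return page;
}
```

The tag is only meaningful right after the allocation returns; once the
caller starts using the page for something else, the field is overwritten.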

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 include/linux/mm_types.h |    1 +
 mm/page_alloc.c          |    4 +++-
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a95a202 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -71,6 +71,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
+		int reserve;		/* page_alloc: page is a reserve page */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 72a6be5..b9989c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1656,8 +1656,10 @@ zonelist_scan:
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
 						gfp_mask, migratetype);
-		if (page)
+		if (page) {
+			page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 			break;
+		}
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
-- 
1.7.1.1

^ permalink raw reply related

* [PATCH -mmotm 03/30] mm: expose gfp_to_alloc_flags()
From: Xiaotian Feng @ 2010-07-13 10:17 UTC (permalink / raw)
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

>From 8a554f86ab2fd4d8a1daecaef4cc5bb3901c4423 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 17:59:52 +0800
Subject: [PATCH 03/30] mm: expose gfp_to_alloc_flags()

Expose the gfp to alloc_flags mapping, so we can use it in other parts
of the vm.
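
One such use, sketched in user space under stated assumptions: a slab-like
cache that last grew from a reserve page asks the gfp-to-alloc_flags
mapping whether the current request is entitled to the reserves before
serving it from cached objects.  The flag values are taken from this
series; model_gfp_to_alloc_flags(), struct model_cache and
model_may_use_cached() are hypothetical, and the real mapping also looks at
task state such as PF_MEMALLOC, which is modelled here by a plain
parameter.

```c
#include <stdbool.h>

#define MODEL_GFP_MEMALLOC	0x2000u		/* __GFP_MEMALLOC in patch 10/30 */
#define MODEL_GFP_NOMEMALLOC	0x10000u	/* __GFP_NOMEMALLOC */
#define ALLOC_NO_WATERMARKS	0x04		/* from mm/internal.h */

/*
 * Hypothetical, heavily simplified model of the gfp -> alloc_flags mapping:
 * access to the reserves is granted by the MEMALLOC gfp bit or by a
 * task-level flag, unless the NOMEMALLOC bit forbids it.
 */
static int model_gfp_to_alloc_flags(unsigned int gfp_mask, bool task_memalloc)
{
	int alloc_flags = 0;

	if (!(gfp_mask & MODEL_GFP_NOMEMALLOC) &&
	    ((gfp_mask & MODEL_GFP_MEMALLOC) || task_memalloc))
		alloc_flags |= ALLOC_NO_WATERMARKS;
	return alloc_flags;
}

/* Hypothetical cache state: did the last page come from the reserves? */
struct model_cache {
	int reserve;
};

/*
 * Decide whether a request may be served from already cached objects.
 * If the cache holds reserve memory and the caller is not entitled to the
 * reserves, the caller has to go back to the page allocator instead,
 * which re-tests the current watermarks.
 */
static bool model_may_use_cached(const struct model_cache *c,
				 unsigned int gfp_mask, bool task_memalloc)
{
	if (!c->reserve)
		return true;
	return model_gfp_to_alloc_flags(gfp_mask, task_memalloc) &
	       ALLOC_NO_WATERMARKS;
}
```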

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 mm/internal.h   |   15 +++++++++++++++
 mm/page_alloc.c |   16 +---------------
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 6a697bb..3e2cc3a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -185,6 +185,21 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN		WMARK_MIN
+#define ALLOC_WMARK_LOW		WMARK_LOW
+#define ALLOC_WMARK_HIGH	WMARK_HIGH
+#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
+
+/* Mask to get the watermark bits */
+#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
+
+#define ALLOC_HARDER		0x10 /* try to alloc harder */
+#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebf0af7..72a6be5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1345,19 +1345,6 @@ failed:
 	return NULL;
 }
 
-/* The ALLOC_WMARK bits are used as an index to zone->watermark */
-#define ALLOC_WMARK_MIN		WMARK_MIN
-#define ALLOC_WMARK_LOW		WMARK_LOW
-#define ALLOC_WMARK_HIGH	WMARK_HIGH
-#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
-
-/* Mask to get the watermark bits */
-#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
-
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1911,8 +1898,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 		wakeup_kswapd(zone, order);
 }
 
-static inline int
-gfp_to_alloc_flags(gfp_t gfp_mask)
+int gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	struct task_struct *p = current;
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-- 
1.7.1.1

^ permalink raw reply related

* [PATCH -mmotm 02/30] Swap over network documentation
From: Xiaotian Feng @ 2010-07-13 10:17 UTC (permalink / raw)
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

>From 8c68e4dc644be32cd82ba9711ba3ef89cb687cdf Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 17:59:16 +0800
Subject: [PATCH 02/30] Swap over network documentation

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 Documentation/network-swap.txt |  268 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 268 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/network-swap.txt

diff --git a/Documentation/network-swap.txt b/Documentation/network-swap.txt
new file mode 100644
index 0000000..760af2e
--- /dev/null
+++ b/Documentation/network-swap.txt
@@ -0,0 +1,268 @@
+
+Problem:
+   When Linux needs to allocate memory it may find that there is
+   insufficient free memory so it needs to reclaim space that is in
+   use but not needed at the moment.  There are several options:
+
+   1/ Shrink a kernel cache such as the inode or dentry cache.  This
+      is fairly easy but provides limited returns.
+   2/ Discard 'clean' pages from the page cache.  This is easy, and
+      works well as long as there are clean pages in the page cache.
+      Similarly clean 'anonymous' pages can be discarded - if there
+      are any.
+   3/ Write out some dirty page-cache pages so that they become clean.
+      The VM limits the number of dirty page-cache pages to e.g. 40%
+      of available memory so that (among other reasons) a "sync" will
+      not take excessively long.  So there should never be excessive
+      amounts of dirty pagecache.
+      Writing out dirty page-cache pages involves work by the
+      filesystem which may need to allocate memory itself.  To avoid
+      deadlock, filesystems use GFP_NOFS when allocating memory on the
+      write-out path.  When this is used, cleaning dirty page-cache
+      pages is not an option so if the filesystem finds that memory
+      is tight, another option must be found.
+   4/ Write out dirty anonymous pages to the "Swap" partition/file.
+      This is the most interesting for a couple of reasons.
+      a/ Unlike dirty page-cache pages, there is no need to write anon
+         pages out unless we are actually short of memory.  Thus they
+         tend to be left to last.
+      b/ Anon pages tend to be updated randomly and unpredictably, and
+         flushing them out of memory can have a very significant
+         performance impact on the process using them.  This contrasts
+         with page-cache pages which are often written sequentially
+         and often treated as "write-once, read-many".
+      So anon pages tend to be left until last to be cleaned, and may
+      be the only cleanable pages while there are still some dirty
+      page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying.  There seems to be too much
+ hand-waving.  If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure.  It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device).  Block devices
+are (required to be) written to pre-allocate any memory that might be
+needed during write-out, and to block when the pre-allocated memory is
+exhausted and no other memory is available.  They can be sure not to
+block forever as the pre-allocated memory will be returned as soon as
+the data it is being used for has been written out.  The primary
+mechanism for pre-allocating memory is called "mempools".
+
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS, NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe.  Receiving that reply will consume memory and,
+significantly, we need to allocate memory to an incoming packet before
+we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them.  Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, and would waste memory, and probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+  To handle low memory conditions we need to know when those
+  conditions exist.  Having a global "low on memory" flag seems easy,
+  but its implementation is problematic.  Instead we make it possible
+  to tell if a recent memory allocation required use of the emergency
+  memory pool.
+  For pages returned by alloc_page, the new page->reserve flag
+  can be tested.  If this is set, then a low memory condition was
+  current when the page was allocated, so the memory should be used
+  carefully. (Because low memory conditions are transient, this
+  state is kept in an overloaded member instead of in page flags, which
+  would suggest a more permanent state.)
+
+  For memory allocated using slab/slub: If a page that is added to a
+  kmem_cache is found to have page->reserve set, then a s->reserve
+  flag is set for the whole kmem_cache.  Further allocations will only
+  be returned from that page (or any other page in the cache) if they
+  are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+  Non-emergency allocations will block in alloc_page until a
+  non-reserve page is available.  Once a non-reserve page has been
+  added to the cache, the s->reserve flag on the cache is removed.
+
+  Because slab objects have no individual state, it is hard to pass
+  the reserve state along, so the current code relies on a regular
+  allocation failing.  Various allocation wrappers help here.
+
+  This allows us to
+   a/ request use of the emergency pool when allocating memory
+     (GFP_MEMALLOC), and
+   b/ to find out if the emergency pool was used.
+
+2/ SK_MEMALLOC, sk_buff->emergency.
+
+  When memory from the reserve is used to store incoming network
+  packets, the memory must be freed (and the packet dropped) as soon
+  as we find out that the packet is not for a socket that is used for
+  swap-out.
+  To achieve this we have an ->emergency flag for skbs, and an
+  SK_MEMALLOC flag for sockets.
+  When memory is allocated for an skb, it is allocated with
+  GFP_MEMALLOC (if we are currently swapping over the network at
+  all).  If a subsequent test shows that the emergency pool was used,
+  ->emergency is set.
+  When the skb is finally attached to its destination socket, the
+  SK_MEMALLOC flag on the socket is tested.  If the skb has
+  ->emergency set, but the socket does not have SK_MEMALLOC set, then
+  the skb is immediately freed and the packet is dropped.
+  This ensures that reserve memory is never queued on a socket that is
+  not used for swapout.
+
+  Similarly, if an skb is ever queued for delivery to user-space for
+  example by netfilter, the ->emergency flag is tested and the skb is
+  released if ->emergency is set. (so obviously the storage route may
+  not pass through a userspace helper, otherwise the packets will never
+  arrive and we'll deadlock)
+
+  This ensures that memory from the emergency reserve can be used to
+  allow swapout to proceed, but will not get caught up in any other
+  network queue.
+
+
+3/ pages_emergency
+
+  The above would be sufficient if the total memory below the lowest
+  memory watermark (i.e. the size of the emergency reserve) were known
+  to be enough to hold all transient allocations needed for writeout.
+  I'm a little blurry on how big the current emergency pool is, but it
+  isn't big and certainly hasn't been sized to allow network traffic
+  to consume any.
+
+  We could simply make the size of the reserve bigger. However in the
+  common case that we are not swapping over the network, that would be
+  a waste of memory.
+
+  So a new "watermark" is defined: pages_emergency.  This is
+  effectively added to the current low water marks, so that pages from
+  this emergency pool can only be allocated if one of PF_MEMALLOC or
+  GFP_MEMALLOC is set.
+
+  pages_emergency can be changed dynamically based on need.  When
+  swapout over the network is required, pages_emergency is increased
+  to cover the maximum expected load.  When network swapout is
+  disabled, pages_emergency is decreased.
+
+  To determine how much to increase it by, we introduce reservation
+  groups....
+
+3a/ reservation groups
+
+  The memory used transiently for swapout can be in a number of
+  different places.  e.g. the network route cache, the network
+  fragment cache, in transit between network card and socket, or (in
+  the case of NFS) in sunrpc data structures awaiting a reply.
+  We need to ensure each of these is limited in the amount of memory
+  they use, and that the maximum is included in the reserve.
+
+  The memory required by the network layer only needs to be reserved
+  once, even if there are multiple swapout paths using the network
+  (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+  the same time would be unusual).
+
+  So we create a tree of reservation groups.  The network might
+  register a collection of reservations, but not mark them as being in
+  use.  NFS and sunrpc might similarly register a collection of
+  reservations, and attach it to the network reservations as it
+  depends on them.
+  When swapout over NFS is requested, the NFS/sunrpc reservations are
+  activated which implicitly activates the network reservations.
+
+  The total new reservation is added to pages_emergency.
+
+  Provided each memory usage stays beneath the registered limit (at
+  least when allocating memory from reserves), the system will never
+  run out of emergency memory, and swapout will not deadlock.
+
+  It is worth noting here that it is not critical that each usage
+  stays beneath the limit 100% of the time.  Occasional excess is
+  acceptable provided that the memory will be freed again within a
+  short amount of time that does *not* require waiting for any event
+  that itself might require memory.
+  This is because, at all stages of transmit and receive, it is
+  acceptable to discard all transient memory associated with a
+  particular writeout and try again later.  On transmit, the page can
+  be re-queued for later transmission.  On receive, the packet can be
+  dropped assuming that the peer will resend after a timeout.
+
+  Thus allocations that are truly transient and will be freed without
+  blocking do not strictly need to be reserved for.  Doing so might
+  still be a good idea to ensure forward progress doesn't take too
+  long.
+
+4/ low-mem accounting
+
+  Most places that might hold on to emergency memory (e.g. route
+  cache, fragment cache etc) already place a limit on the amount of
+  memory that they can use.  This limit can simply be reserved using
+  the above mechanism and no more needs to be done.
+
+  However some memory usage might not be accounted with sufficient
+  firmness to allow an appropriate emergency reservation.  The
+  in-flight skbs for incoming packets is one such example.
+
+  To support this, a low-overhead mechanism for accounting memory
+  usage against the reserves is provided.  This mechanism uses the
+  same data structure that is used to store the emergency memory
+  reservations through the addition of a 'usage' field.
+
+  Before we attempt allocation from the memory reserves, we must check
+  if the resulting 'usage' is below the reservation. If so, we increase
+  the usage and attempt the allocation (which should succeed). If
+  the projected 'usage' exceeds the reservation we'll either fail the
+  allocation, or wait for 'usage' to decrease enough so that it would
+  succeed, depending on __GFP_WAIT.
+
+  When memory that was allocated for that purpose is freed, the
+  'usage' field is checked again.  If it is non-zero, then the size of
+  the freed memory is subtracted from the usage, making sure the usage
+  never becomes less than zero.
+
+  This provides adequate accounting with minimal overheads when not in
+  a low memory condition.  When a low memory condition is encountered
+  it does add the cost of a spin lock necessary to serialise updates
+  to 'usage'.
+
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+  any network socket that it uses, and can know when to account
+  reserve memory carefully, new address_space_operations are
+  available.
+  "swapon" requests that an address space (i.e. a file) be made ready
+  for swapout.  swap_out and swap_in request the actual IO.  They
+  together must ensure that each swap_out request can succeed without
+  allocating more emergency memory than was reserved by swapon. swapoff
+  is used to reverse the state changes caused by swapon when we disable
+  the swap file.
+
+
+Thanks for reading this far.  I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
-- 
1.7.1.1
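
Section 4/ of the document above describes a charge/uncharge scheme for
memory taken from the reserves.  A minimal user-space sketch of that
scheme follows.  struct mem_reserve_model, reserve_charge() and
reserve_uncharge() are hypothetical names, the spin lock that serialises
updates is left out for brevity, and the __GFP_WAIT case (sleeping until
usage drops) is reduced to a plain failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reservation record: a limit plus the 'usage' field. */
struct mem_reserve_model {
	size_t limit;	/* bytes set aside for this user */
	size_t usage;	/* bytes currently charged against the limit */
};

/*
 * Charge @bytes before allocating from the reserves.  Succeeds only if
 * the projected usage stays within the reservation; usage never exceeds
 * limit, so the subtraction below cannot wrap.
 */
static bool reserve_charge(struct mem_reserve_model *res, size_t bytes)
{
	if (bytes > res->limit - res->usage)
		return false;
	res->usage += bytes;
	return true;
}

/* Uncharge on free, making sure usage never drops below zero. */
static void reserve_uncharge(struct mem_reserve_model *res, size_t bytes)
{
	if (bytes > res->usage)
		bytes = res->usage;
	res->usage -= bytes;
}
```

A caller would charge first, attempt the allocation only on success, and
uncharge the same size when the memory is freed.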

^ permalink raw reply related

* [PATCH -mmotm 01/30] mm: serialize access to min_free_kbytes
From: Xiaotian Feng @ 2010-07-13 10:17 UTC (permalink / raw)
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem
In-Reply-To: <20100713101650.2835.15245.sendpatchset@danny.redhat>

>From bc55bacd6bcc0f8a69c0d7e0d554c78237233e07 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Mon, 12 Jul 2010 17:58:34 +0800
Subject: [PATCH 01/30] mm: serialize access to min_free_kbytes

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_wmarks(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.
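
The shape of the fix is the usual split between a locked wrapper and an
unlocked helper.  A user-space sketch, with a pthread mutex standing in for
the spinlock; recompute_wmarks_unlocked(), recompute_wmarks(),
init_wmarks() and the single derived value are illustrative assumptions,
not the kernel's code:

```c
#include <pthread.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

static pthread_mutex_t min_free_lock = PTHREAD_MUTEX_INITIALIZER;
static int min_free_kbytes = 1024;
static long wmark_min_pages;	/* derived value that must stay consistent */

/* Unlocked helper: caller holds min_free_lock or runs single-threaded. */
static void recompute_wmarks_unlocked(void)
{
	wmark_min_pages = min_free_kbytes >> (MODEL_PAGE_SHIFT - 10);
}

/* Public entry point: serialises all runtime callers. */
static void recompute_wmarks(void)
{
	pthread_mutex_lock(&min_free_lock);
	recompute_wmarks_unlocked();
	pthread_mutex_unlock(&min_free_lock);
}

/* Early init runs before any concurrency, so it skips the lock. */
static void init_wmarks(void)
{
	if (min_free_kbytes < 128)
		min_free_kbytes = 128;
	if (min_free_kbytes > 65536)
		min_free_kbytes = 65536;
	recompute_wmarks_unlocked();
}
```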

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 50a6d10..ebf0af7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,6 +165,7 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 static unsigned long __meminitdata nr_kernel_pages;
@@ -4839,13 +4840,13 @@ static void setup_per_zone_lowmem_reserve(void)
 }
 
 /**
- * setup_per_zone_wmarks - called when min_free_kbytes changes
+ * __setup_per_zone_wmarks - called when min_free_kbytes changes
  * or when memory is hot-{added|removed}
  *
  * Ensures that the watermark[min,low,high] values for each zone are set
  * correctly with respect to min_free_kbytes.
  */
-void setup_per_zone_wmarks(void)
+static void __setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4943,6 +4944,15 @@ static void __init setup_per_zone_inactive_ratio(void)
 		calculate_zone_inactive_ratio(zone);
 }
 
+void setup_per_zone_wmarks(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_wmarks();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4978,7 +4988,7 @@ static int __init init_per_zone_wmark_min(void)
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_wmarks();
+	__setup_per_zone_wmarks();
 	setup_per_zone_lowmem_reserve();
 	setup_per_zone_inactive_ratio();
 	return 0;
-- 
1.7.1.1

^ permalink raw reply related

* [PATCH -mmotm 00/30] [RFC] swap over nfs -v21
From: Xiaotian Feng @ 2010-07-13 10:16 UTC (permalink / raw)
  To: linux-mm, linux-nfs, netdev
  Cc: riel, cl, a.p.zijlstra, Xiaotian Feng, linux-kernel, lwang,
	penberg, akpm, davem

Hi,

Here's the latest version of the swap over NFS series, the first since -v20 last October. We decided
to push this feature because it is useful for NAS and virtualization environments.

The patches are against mmotm-2010-07-01. The patchset can be split into the following parts:

Patch 1 - 12: Provide a generic reserve framework. This framework
could also be used to get rid of some of the __GFP_NOFAIL users.

Patch 13 - 15: Provide some generic network infrastructure needed later on.

Patch 16 - 21: Reserve a small pool to act as a receive buffer; this allows us to
inspect packets before tossing them.

Patch 22 - 23: Generic vm infrastructure to handle swapping to a filesystem instead of a block
device.

Patch 24 - 27: Convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

Patch 28 - 30: Minor bug fixes for the latest -mmotm.
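
The receive-side rule behind patches 16 - 21 can be sketched as follows: a
packet buffer that came out of the reserve is only kept if the socket it
turns out to belong to is used for swap-out.  This is a user-space
illustration; struct pkt_model, struct sock_model and deliver_or_drop() are
hypothetical names, with the emergency and memalloc fields mirroring the
skb ->emergency flag and the SK_MEMALLOC socket flag described in the
documentation patch (02/30).

```c
#include <stdbool.h>

/* Hypothetical packet: was its memory taken from the emergency reserve? */
struct pkt_model {
	bool emergency;
};

/* Hypothetical socket: is it used for swap-out, and how much is queued? */
struct sock_model {
	bool memalloc;
	int queued;
};

/*
 * Called once the destination socket is known.  Reserve-backed packets
 * are dropped unless the socket is a swap-out socket, so reserve memory
 * never sits on an unrelated queue.  Returns true if the packet was queued.
 */
static bool deliver_or_drop(struct sock_model *sk, const struct pkt_model *pkt)
{
	if (pkt->emergency && !sk->memalloc)
		return false;	/* drop: the peer is expected to resend */
	sk->queued++;
	return true;
}
```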

[some history]
v19: http://lwn.net/Articles/301915/
v20: http://lwn.net/Articles/355350/

Changes since v20:
	- rebased to mmotm-2010-07-01
	- dropped the null pointer deref patch, since the root cause was a wrong SWP_FILE enum
	- some minor build fixes
	- fix a null pointer deref with mmotm-2010-07-01
	- fix a bug when swapping to multiple files on the same NFS server

Regards
Xiaotian

^ permalink raw reply

* [PATCH NEXT 4/5] qlcnic: aer support
From: amit.salecha @ 2010-07-13  9:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, ameen.rahman, Sucheta Chakraborty, Amit Kumar Salecha
In-Reply-To: <1279014420-8829-1-git-send-email-amit.salecha@qlogic.com>

From: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>

Add PCI error recovery (AER) support.

Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
 drivers/net/qlcnic/qlcnic.h      |    1 +
 drivers/net/qlcnic/qlcnic_main.c |  137 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 137 insertions(+), 1 deletions(-)

diff --git a/drivers/net/qlcnic/qlcnic.h b/drivers/net/qlcnic/qlcnic.h
index 02a50e6..0644642 100644
--- a/drivers/net/qlcnic/qlcnic.h
+++ b/drivers/net/qlcnic/qlcnic.h
@@ -911,6 +911,7 @@ struct qlcnic_mac_req {
 #define __QLCNIC_DEV_UP 		1
 #define __QLCNIC_RESETTING		2
 #define __QLCNIC_START_FW 		4
+#define __QLCNIC_AER			5
 
 #define QLCNIC_INTERRUPT_TEST		1
 #define QLCNIC_LOOPBACK_TEST		2
diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index c8275f0..462cb6b 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -34,6 +34,7 @@
 #include <linux/ipv6.h>
 #include <linux/inetdevice.h>
 #include <linux/sysfs.h>
+#include <linux/aer.h>
 
 MODULE_DESCRIPTION("QLogic 1/10 GbE Converged/Intelligent Ethernet Driver");
 MODULE_LICENSE("GPL");
@@ -1306,6 +1307,7 @@ qlcnic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto err_out_disable_pdev;
 
 	pci_set_master(pdev);
+	pci_enable_pcie_error_reporting(pdev);
 
 	netdev = alloc_etherdev(sizeof(struct qlcnic_adapter));
 	if (!netdev) {
@@ -1437,6 +1439,7 @@ static void __devexit qlcnic_remove(struct pci_dev *pdev)
 
 	qlcnic_release_firmware(adapter);
 
+	pci_disable_pcie_error_reporting(pdev);
 	pci_release_regions(pdev);
 	pci_disable_device(pdev);
 	pci_set_drvdata(pdev, NULL);
@@ -2521,6 +2524,9 @@ static void
 qlcnic_schedule_work(struct qlcnic_adapter *adapter,
 		work_func_t func, int delay)
 {
+	if (test_bit(__QLCNIC_AER, &adapter->state))
+		return;
+
 	INIT_DELAYED_WORK(&adapter->fw_work, func);
 	schedule_delayed_work(&adapter->fw_work, round_jiffies_relative(delay));
 }
@@ -2631,6 +2637,128 @@ reschedule:
 	qlcnic_schedule_work(adapter, qlcnic_fw_poll_work, FW_POLL_DELAY);
 }
 
+static int qlcnic_is_first_func(struct pci_dev *pdev)
+{
+	struct pci_dev *oth_pdev;
+	int val = pdev->devfn;
+
+	while (val-- > 0) {
+		oth_pdev = pci_get_domain_bus_and_slot(pci_domain_nr
+			(pdev->bus), pdev->bus->number,
+			PCI_DEVFN(PCI_SLOT(pdev->devfn), val));
+
+		if (oth_pdev && (oth_pdev->current_state != PCI_D3cold))
+			return 0;
+	}
+	return 1;
+}
+
+static int qlcnic_attach_func(struct pci_dev *pdev)
+{
+	int err, first_func;
+	struct qlcnic_adapter *adapter = pci_get_drvdata(pdev);
+	struct net_device *netdev = adapter->netdev;
+
+	pdev->error_state = pci_channel_io_normal;
+
+	err = pci_enable_device(pdev);
+	if (err)
+		return err;
+
+	pci_set_power_state(pdev, PCI_D0);
+	pci_set_master(pdev);
+	pci_restore_state(pdev);
+
+	first_func = qlcnic_is_first_func(pdev);
+
+	if (qlcnic_api_lock(adapter))
+		return -EINVAL;
+
+	if (first_func) {
+		adapter->need_fw_reset = 1;
+		set_bit(__QLCNIC_START_FW, &adapter->state);
+		QLCWR32(adapter, QLCNIC_CRB_DEV_STATE, QLCNIC_DEV_INITIALIZING);
+		QLCDB(adapter, DRV, "Restarting fw\n");
+	}
+	qlcnic_api_unlock(adapter);
+
+	err = adapter->nic_ops->start_firmware(adapter);
+	if (err)
+		return err;
+
+	qlcnic_clr_drv_state(adapter);
+	qlcnic_setup_intr(adapter);
+
+	if (netif_running(netdev)) {
+		err = qlcnic_attach(adapter);
+		if (err) {
+			qlcnic_clr_all_drv_state(adapter);
+			clear_bit(__QLCNIC_AER, &adapter->state);
+			netif_device_attach(netdev);
+			return err;
+		}
+
+		err = qlcnic_up(adapter, netdev);
+		if (err)
+			goto done;
+
+		qlcnic_config_indev_addr(netdev, NETDEV_UP);
+	}
+ done:
+	netif_device_attach(netdev);
+	return err;
+}
+
+static pci_ers_result_t qlcnic_io_error_detected(struct pci_dev *pdev,
+						pci_channel_state_t state)
+{
+	struct qlcnic_adapter *adapter = pci_get_drvdata(pdev);
+	struct net_device *netdev = adapter->netdev;
+
+	if (state == pci_channel_io_perm_failure)
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	if (state == pci_channel_io_normal)
+		return PCI_ERS_RESULT_RECOVERED;
+
+	set_bit(__QLCNIC_AER, &adapter->state);
+	netif_device_detach(netdev);
+
+	cancel_delayed_work_sync(&adapter->fw_work);
+
+	if (netif_running(netdev))
+		qlcnic_down(adapter, netdev);
+
+	qlcnic_detach(adapter);
+	qlcnic_teardown_intr(adapter);
+
+	clear_bit(__QLCNIC_RESETTING, &adapter->state);
+
+	pci_save_state(pdev);
+	pci_disable_device(pdev);
+
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+static pci_ers_result_t qlcnic_io_slot_reset(struct pci_dev *pdev)
+{
+	return qlcnic_attach_func(pdev) ? PCI_ERS_RESULT_DISCONNECT :
+				PCI_ERS_RESULT_RECOVERED;
+}
+
+static void qlcnic_io_resume(struct pci_dev *pdev)
+{
+	struct qlcnic_adapter *adapter = pci_get_drvdata(pdev);
+
+	pci_cleanup_aer_uncorrect_error_status(pdev);
+
+	if ((QLCRD32(adapter, QLCNIC_CRB_DEV_STATE) == QLCNIC_DEV_READY) &&
+			   (test_and_clear_bit(__QLCNIC_AER, &adapter->state)))
+		qlcnic_schedule_work(adapter, qlcnic_fw_poll_work,
+						FW_POLL_DELAY);
+}
+
+
 static int
 qlcnicvf_start_firmware(struct qlcnic_adapter *adapter)
 {
@@ -3436,6 +3564,11 @@ static void
 qlcnic_config_indev_addr(struct net_device *dev, unsigned long event)
 { }
 #endif
+static struct pci_error_handlers qlcnic_err_handler = {
+	.error_detected = qlcnic_io_error_detected,
+	.slot_reset = qlcnic_io_slot_reset,
+	.resume = qlcnic_io_resume,
+};
 
 static struct pci_driver qlcnic_driver = {
 	.name = qlcnic_driver_name,
@@ -3446,7 +3579,9 @@ static struct pci_driver qlcnic_driver = {
 	.suspend = qlcnic_suspend,
 	.resume = qlcnic_resume,
 #endif
-	.shutdown = qlcnic_shutdown
+	.shutdown = qlcnic_shutdown,
+	.err_handler = &qlcnic_err_handler
+
 };
 
 static int __init qlcnic_init_module(void)
-- 
1.6.0.2
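
Editor's sketch, not part of the patch: the three callbacks above form a small
state machine, and the guard added to qlcnic_schedule_work() keeps firmware
polling quiet while recovery runs. The userspace model below only mirrors that
control flow under simplifying assumptions. The enums, struct fake_adapter and
every *_sketch name are hypothetical stand-ins, not kernel or driver symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for pci_channel_state_t and pci_ers_result_t. */
enum chan_state { CHAN_IO_NORMAL, CHAN_IO_FROZEN, CHAN_IO_PERM_FAILURE };
enum ers_result { ERS_RECOVERED, ERS_NEED_RESET, ERS_DISCONNECT };

/* Hypothetical reduced adapter: one flag plays the role of __QLCNIC_AER. */
struct fake_adapter {
	bool aer_in_progress;
	int work_scheduled;	/* counts accepted schedule requests */
};

/* Like the guard in qlcnic_schedule_work(): queue nothing during recovery. */
static void schedule_work_sketch(struct fake_adapter *a)
{
	if (a->aer_in_progress)
		return;
	a->work_scheduled++;
}

/* Same decision order as qlcnic_io_error_detected(). */
static enum ers_result error_detected_sketch(struct fake_adapter *a,
					      enum chan_state state)
{
	if (state == CHAN_IO_PERM_FAILURE)
		return ERS_DISCONNECT;
	if (state == CHAN_IO_NORMAL)
		return ERS_RECOVERED;
	a->aer_in_progress = true;	/* the driver then quiesces the device */
	return ERS_NEED_RESET;
}

/* Like qlcnic_io_slot_reset(): a non-zero re-attach error means disconnect. */
static enum ers_result slot_reset_sketch(int attach_err)
{
	return attach_err ? ERS_DISCONNECT : ERS_RECOVERED;
}

/* Like qlcnic_io_resume(): restart polling only if the device reports ready
 * and the recovery flag was set; the flag is cleared in that case only. */
static void resume_sketch(struct fake_adapter *a, bool dev_ready)
{
	if (dev_ready && a->aer_in_progress) {
		a->aer_in_progress = false;
		schedule_work_sketch(a);
	}
}
```

In this model a frozen channel yields ERS_NEED_RESET, any schedule request
made before resume is dropped, and resume re-arms polling once.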


^ permalink raw reply related

* [PATCH NEXT 1/5] qlcnic: fix pause params setting
From: amit.salecha @ 2010-07-13  9:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, ameen.rahman, Rajesh Borundia, Amit Kumar Salecha
In-Reply-To: <1279014420-8829-1-git-send-email-amit.salecha@qlogic.com>

From: Rajesh Borundia <rajesh.borundia@qlogic.com>

Turning off the rx pause param or turning on the autoneg param is not
supported, so return an error in those cases.

Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
 drivers/net/qlcnic/qlcnic_ethtool.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/qlcnic/qlcnic_ethtool.c b/drivers/net/qlcnic/qlcnic_ethtool.c
index baf5a52..8599993 100644
--- a/drivers/net/qlcnic/qlcnic_ethtool.c
+++ b/drivers/net/qlcnic/qlcnic_ethtool.c
@@ -578,8 +578,13 @@ qlcnic_set_pauseparam(struct net_device *netdev,
 		}
 		QLCWR32(adapter, QLCNIC_NIU_GB_PAUSE_CTL, val);
 	} else if (adapter->ahw.port_type == QLCNIC_XGBE) {
+
+		if (!pause->rx_pause || pause->autoneg)
+			return -EOPNOTSUPP;
+
 		if ((port < 0) || (port > QLCNIC_NIU_MAX_XG_PORTS))
 			return -EIO;
+
 		val = QLCRD32(adapter, QLCNIC_NIU_XG_PAUSE_CTL);
 		if (port == 0) {
 			if (pause->tx_pause)
-- 
1.6.0.2
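
Editor's sketch, not part of the patch: the added check can be read as a
small validation step that runs before any register is touched. The
standalone version below assumes a reduced, hypothetical struct pause_req in
place of ethtool's pause parameters; only the condition itself is taken from
the hunk above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reduced view of the requested pause settings. */
struct pause_req {
	bool autoneg;
	bool rx_pause;
	bool tx_pause;
};

/* Returns 0 if the request is acceptable for an XGBE port, otherwise
 * -EOPNOTSUPP. Same condition as the patch: rx pause must stay on and
 * pause autonegotiation must stay off. */
static int xgbe_pause_check(const struct pause_req *p)
{
	if (!p->rx_pause || p->autoneg)
		return -EOPNOTSUPP;
	return 0;
}
```

Note that tx_pause does not take part in the check; it remains configurable.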


^ permalink raw reply related

* [PATCH NEXT 5/5] qlcnic: restore NPAR config data after recovery
From: amit.salecha @ 2010-07-13  9:47 UTC (permalink / raw)
  To: davem
  Cc: netdev, ameen.rahman, Anirban Chakraborty, Rajesh Borundia,
	Amit Kumar Salecha
In-Reply-To: <1279014420-8829-1-git-send-email-amit.salecha@qlogic.com>

From: Anirban Chakraborty <anirban.chakraborty@qlogic.com>

The NPAR configuration, which is programmed in fw, needs to be
restored after fw recovery.

Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
 drivers/net/qlcnic/qlcnic.h      |    3 +-
 drivers/net/qlcnic/qlcnic_ctx.c  |    2 +
 drivers/net/qlcnic/qlcnic_main.c |   80 ++++++++++++++++++++++++++++++--------
 3 files changed, 67 insertions(+), 18 deletions(-)

diff --git a/drivers/net/qlcnic/qlcnic.h b/drivers/net/qlcnic/qlcnic.h
index 0644642..9f96298 100644
--- a/drivers/net/qlcnic/qlcnic.h
+++ b/drivers/net/qlcnic/qlcnic.h
@@ -949,7 +949,6 @@ struct qlcnic_adapter {
 	u8 has_link_events;
 	u8 fw_type;
 	u16 tx_context_id;
-	u16 mtu;
 	u16 is_up;
 
 	u16 link_speed;
@@ -1044,6 +1043,8 @@ struct qlcnic_pci_info {
 
 struct qlcnic_npar_info {
 	u16	vlan_id;
+	u16	min_bw;
+	u16	max_bw;
 	u8	phy_port;
 	u8	type;
 	u8	active;
diff --git a/drivers/net/qlcnic/qlcnic_ctx.c b/drivers/net/qlcnic/qlcnic_ctx.c
index cdd44b4..cc5d861 100644
--- a/drivers/net/qlcnic/qlcnic_ctx.c
+++ b/drivers/net/qlcnic/qlcnic_ctx.c
@@ -636,6 +636,8 @@ int qlcnic_get_nic_info(struct qlcnic_adapter *adapter,
 			QLCNIC_CDRP_CMD_GET_NIC_INFO);
 
 	if (err == QLCNIC_RCODE_SUCCESS) {
+		npar_info->pci_func = le16_to_cpu(nic_info->pci_func);
+		npar_info->op_mode = le16_to_cpu(nic_info->op_mode);
 		npar_info->phys_port = le16_to_cpu(nic_info->phys_port);
 		npar_info->switch_mode = le16_to_cpu(nic_info->switch_mode);
 		npar_info->max_tx_ques = le16_to_cpu(nic_info->max_tx_ques);
diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index 462cb6b..38c8d9c 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -507,6 +507,8 @@ qlcnic_init_pci_info(struct qlcnic_adapter *adapter)
 			adapter->npars[pfn].type = pci_info[i].type;
 			adapter->npars[pfn].phy_port = pci_info[i].default_port;
 			adapter->npars[pfn].mac_learning = DEFAULT_MAC_LEARN;
+			adapter->npars[pfn].min_bw = pci_info[i].tx_min_bw;
+			adapter->npars[pfn].max_bw = pci_info[i].tx_max_bw;
 		}
 
 		for (i = 0; i < QLCNIC_NIU_MAX_XG_PORTS; i++)
@@ -771,6 +773,50 @@ qlcnic_check_options(struct qlcnic_adapter *adapter)
 }
 
 static int
+qlcnic_reset_npar_config(struct qlcnic_adapter *adapter)
+{
+	int i, err = 0;
+	struct qlcnic_npar_info *npar;
+	struct qlcnic_info nic_info;
+
+	if (!(adapter->flags & QLCNIC_ESWITCH_ENABLED)
+				|| !adapter->need_fw_reset)
+		return 0;
+
+	if (adapter->op_mode == QLCNIC_MGMT_FUNC) {
+		/* Set the NPAR config data after FW reset */
+		for (i = 0; i < QLCNIC_MAX_PCI_FUNC; i++) {
+			npar = &adapter->npars[i];
+			if (npar->type != QLCNIC_TYPE_NIC)
+				continue;
+			err = qlcnic_get_nic_info(adapter, &nic_info, i);
+			if (err)
+				goto err_out;
+			nic_info.min_tx_bw = npar->min_bw;
+			nic_info.max_tx_bw = npar->max_bw;
+			err = qlcnic_set_nic_info(adapter, &nic_info);
+			if (err)
+				goto err_out;
+
+			if (npar->enable_pm) {
+				err = qlcnic_config_port_mirroring(adapter,
+						npar->dest_npar, 1, i);
+				if (err)
+					goto err_out;
+
+			}
+			npar->mac_learning = DEFAULT_MAC_LEARN;
+			npar->host_vlan_tag = 0;
+			npar->promisc_mode = 0;
+			npar->discard_tagged = 0;
+			npar->vlan_id = 0;
+		}
+	}
+err_out:
+	return err;
+}
+
+static int
 qlcnic_start_firmware(struct qlcnic_adapter *adapter)
 {
 	int val, err, first_boot;
@@ -834,10 +880,9 @@ wait_init:
 	qlcnic_idc_debug_info(adapter, 1);
 
 	qlcnic_check_options(adapter);
-
-	if (adapter->flags & QLCNIC_ESWITCH_ENABLED &&
-		adapter->op_mode != QLCNIC_NON_PRIV_FUNC)
-		qlcnic_dev_set_npar_ready(adapter);
+	if (qlcnic_reset_npar_config(adapter))
+		goto err_out;
+	qlcnic_dev_set_npar_ready(adapter);
 
 	adapter->need_fw_reset = 0;
 
@@ -2486,6 +2531,7 @@ qlcnic_dev_request_reset(struct qlcnic_adapter *adapter)
 {
 	u32 state;
 
+	adapter->need_fw_reset = 1;
 	if (qlcnic_api_lock(adapter))
 		return;
 
@@ -2506,6 +2552,9 @@ qlcnic_dev_set_npar_ready(struct qlcnic_adapter *adapter)
 {
 	u32 state;
 
+	if (!(adapter->flags & QLCNIC_ESWITCH_ENABLED) ||
+		adapter->op_mode == QLCNIC_NON_PRIV_FUNC)
+		return;
 	if (qlcnic_api_lock(adapter))
 		return;
 
@@ -3154,9 +3203,8 @@ qlcnic_sysfs_write_esw_config(struct file *file, struct kobject *kobj,
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct qlcnic_adapter *adapter = dev_get_drvdata(dev);
 	struct qlcnic_esw_func_cfg *esw_cfg;
-	u8 id, discard_tagged, promsc_mode, mac_learn;
-	u8 vlan_tagging, pci_func, vlan_id;
 	int count, rem, i, ret;
+	u8 id, pci_func;
 
 	count	= size / sizeof(struct qlcnic_esw_func_cfg);
 	rem	= size % sizeof(struct qlcnic_esw_func_cfg);
@@ -3171,17 +3219,13 @@ qlcnic_sysfs_write_esw_config(struct file *file, struct kobject *kobj,
 	for (i = 0; i < count; i++) {
 		pci_func = esw_cfg[i].pci_func;
 		id = adapter->npars[pci_func].phy_port;
-		vlan_tagging = esw_cfg[i].host_vlan_tag;
-		promsc_mode = esw_cfg[i].promisc_mode;
-		mac_learn = esw_cfg[i].mac_learning;
-		vlan_id	= esw_cfg[i].vlan_id;
-		discard_tagged = esw_cfg[i].discard_tagged;
-		ret = qlcnic_config_switch_port(adapter, id, vlan_tagging,
-						discard_tagged,
-						promsc_mode,
-						mac_learn,
-						pci_func,
-						vlan_id);
+		ret = qlcnic_config_switch_port(adapter, id,
+						esw_cfg[i].host_vlan_tag,
+						esw_cfg[i].discard_tagged,
+						esw_cfg[i].promisc_mode,
+						esw_cfg[i].mac_learning,
+						esw_cfg[i].pci_func,
+						esw_cfg[i].vlan_id);
 		if (ret)
 			return ret;
 	}
@@ -3282,6 +3326,8 @@ qlcnic_sysfs_write_npar_config(struct file *file, struct kobject *kobj,
 		ret = qlcnic_set_nic_info(adapter, &nic_info);
 		if (ret)
 			return ret;
+		adapter->npars[i].min_bw = nic_info.min_tx_bw;
+		adapter->npars[i].max_bw = nic_info.max_tx_bw;
 	}
 
 	return size;
-- 
1.6.0.2
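
Editor's sketch, not part of the patch: the idea is to keep a driver-side
copy of whatever was programmed into firmware, and to replay it after a
firmware reset. The userspace model below illustrates only that
cache-and-replay pattern for the bandwidth fields. struct fake_fw, struct
npar_cache, SKETCH_MAX_FUNCS and the function names are hypothetical and
much simpler than the driver's real structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_FUNCS 8	/* illustrative size, not the driver's constant */

/* Driver-side copy of the per-function settings (hypothetical layout). */
struct npar_cache {
	int is_nic;		/* only NIC-type functions are restored */
	uint16_t min_bw;
	uint16_t max_bw;
};

/* Stand-in for state held by firmware, which a reset wipes. */
struct fake_fw {
	uint16_t min_bw[SKETCH_MAX_FUNCS];
	uint16_t max_bw[SKETCH_MAX_FUNCS];
};

static void fw_reset(struct fake_fw *fw)
{
	memset(fw, 0, sizeof(*fw));
}

/* Configuration path: program the firmware, then remember what was set. */
static void npar_set_bw(struct fake_fw *fw, struct npar_cache *cache,
			int func, uint16_t min_bw, uint16_t max_bw)
{
	assert(func >= 0 && func < SKETCH_MAX_FUNCS);
	fw->min_bw[func] = min_bw;
	fw->max_bw[func] = max_bw;
	cache[func].min_bw = min_bw;
	cache[func].max_bw = max_bw;
}

/* Recovery path: replay the cached values into the freshly reset firmware. */
static void npar_restore(struct fake_fw *fw, const struct npar_cache *cache)
{
	for (int i = 0; i < SKETCH_MAX_FUNCS; i++) {
		if (!cache[i].is_nic)
			continue;
		fw->min_bw[i] = cache[i].min_bw;
		fw->max_bw[i] = cache[i].max_bw;
	}
}
```

The patch follows the same split: the sysfs write path updates the cached
min_bw/max_bw, and qlcnic_reset_npar_config() replays them after a reset.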


^ permalink raw reply related

* [PATCH NEXT 0/5]qlcnic: aer support
From: amit.salecha @ 2010-07-13  9:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, ameen.rahman


Hi,
   A series of 5 patches to add AER support and some minor fixes.
   Please apply them to net-next.
-Amit

^ permalink raw reply

* [PATCH NEXT 3/5] qlcnic: fix netdev notifier in error path
From: amit.salecha @ 2010-07-13  9:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, ameen.rahman, Amit Kumar Salecha
In-Reply-To: <1279014420-8829-1-git-send-email-amit.salecha@qlogic.com>

From: Amit Kumar Salecha <amit.salecha@qlogic.com>

The netdev notifiers are not unregistered if pci_register_driver() fails.

Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
 drivers/net/qlcnic/qlcnic_main.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index d511fd1..c8275f0 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -3451,6 +3451,7 @@ static struct pci_driver qlcnic_driver = {
 
 static int __init qlcnic_init_module(void)
 {
+	int ret;
 
 	printk(KERN_INFO "%s\n", qlcnic_driver_string);
 
@@ -3459,8 +3460,15 @@ static int __init qlcnic_init_module(void)
 	register_inetaddr_notifier(&qlcnic_inetaddr_cb);
 #endif
 
+	ret = pci_register_driver(&qlcnic_driver);
+	if (ret) {
+#ifdef CONFIG_INET
+		unregister_inetaddr_notifier(&qlcnic_inetaddr_cb);
+		unregister_netdevice_notifier(&qlcnic_netdev_cb);
+#endif
+	}
 
-	return pci_register_driver(&qlcnic_driver);
+	return ret;
 }
 
 module_init(qlcnic_init_module);
-- 
1.6.0.2
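
Editor's sketch, not part of the patch: the fix is an instance of the usual
"undo what you registered if a later step fails" pattern. The userspace model
below replaces the kernel's notifier chains with two counters so the unwind
can be observed. All names are hypothetical, and the PCI registration result
is passed in as a parameter rather than produced by a real call.

```c
#include <assert.h>

/* Counters standing in for the kernel's notifier chains. */
static int netdev_notifiers;
static int inetaddr_notifiers;

static void register_notifiers(void)
{
	netdev_notifiers++;
	inetaddr_notifiers++;
}

/* Undo in reverse order of registration, as the patch does. */
static void unregister_notifiers(void)
{
	inetaddr_notifiers--;
	netdev_notifiers--;
}

/* pci_register_result models the return value of the driver registration
 * step: 0 on success, a negative errno on failure. */
static int module_init_sketch(int pci_register_result)
{
	int ret;

	register_notifiers();

	ret = pci_register_result;
	if (ret)
		unregister_notifiers();	/* do not leak the notifiers */

	return ret;
}
```

With a failing registration both counters return to their previous values;
with a successful one they stay registered until module exit.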


^ permalink raw reply related

* [PATCH NEXT 2/5] qlcnic: disable tx timeout recovery
From: amit.salecha @ 2010-07-13  9:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, ameen.rahman, Amit Kumar Salecha
In-Reply-To: <1279014420-8829-1-git-send-email-amit.salecha@qlogic.com>

From: Amit Kumar Salecha <amit.salecha@qlogic.com>

Disable tx timeout recovery if auto_fw_reset is disabled.

Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
 drivers/net/qlcnic/qlcnic_main.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index 4d18313..d511fd1 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -2581,7 +2581,8 @@ qlcnic_check_health(struct qlcnic_adapter *adapter)
 		if (adapter->need_fw_reset)
 			goto detach;
 
-		if (adapter->reset_context) {
+		if (adapter->reset_context &&
+				auto_fw_reset == AUTO_FW_RESET_ENABLED) {
 			qlcnic_reset_hw_context(adapter);
 			adapter->netdev->trans_start = jiffies;
 		}
@@ -2594,7 +2595,8 @@ qlcnic_check_health(struct qlcnic_adapter *adapter)
 
 	qlcnic_dev_request_reset(adapter);
 
-	clear_bit(__QLCNIC_FW_ATTACHED, &adapter->state);
+	if (auto_fw_reset == AUTO_FW_RESET_ENABLED)
+		clear_bit(__QLCNIC_FW_ATTACHED, &adapter->state);
 
 	dev_info(&netdev->dev, "firmware hang detected\n");
 
-- 
1.6.0.2


^ permalink raw reply related

* Re: [PATCH] netfilter: xtables: userspace notification target
From: Pablo Neira Ayuso @ 2010-07-13  8:50 UTC (permalink / raw)
  To: Changli Gao
  Cc: Samuel Ortiz, Patrick McHardy, David S. Miller, netdev,
	netfilter-devel, Luciano Coelho
In-Reply-To: <AANLkTil2EgQbzUqYNHAYpIWJvyyE6AWq1TpvxrqVsD7k@mail.gmail.com>

On 13/07/10 08:18, Changli Gao wrote:
> On Tue, Jul 13, 2010 at 8:11 AM, Samuel Ortiz <sameo@linux.intel.com> wrote:
>>
>> The userspace notification Xtables target sends a netlink notification
>> whenever a packet hits the target. Notifications have a label attribute
>> for userspace to match it against a previously set rule. The rules also
>> take a --all option to switch between sending a notification for all
>> packets or for the first one only.
>> Userspace can also send a netlink message to toggle this switch while the
>> target is in place. This target uses the netfilter netlink framework.
>>
>> This target combined with various matches (quota, rateest, etc.) allows
>> userspace to make decisions on interface handling. One could for example
>> decide to switch between power saving modes depending on estimated rate
>> thresholds.
>>
> 
> It is much like the following iptables rules.
> 
> iptables -N log_and_drop
> iptables -A log_and_drop -j NFLOG --nflog-group 1 --nflog-prefix "log_and_drop"
> iptables -A log_and_drop -j DROP
> 
> ...
> iptables ... -m quota --quota-bytes 20000 -j log_and_drop
> ...

Indeed, this looks to me like something that you can do with NFLOG and
some combination of matches.
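
Editor's sketch, not part of the thread: the one behaviour described for the
proposed target that the NFLOG rules do not obviously cover is the
"first packet only, until userspace re-arms" switch. The model below shows
that switch in isolation. It assumes the target consults a send_notif flag
before notifying, which is inferred from the description and the quoted work
function, not confirmed by the patch text shown here. All names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-rule state. */
struct notif_state {
	bool all_packets;	/* the rule was installed with --all */
	bool send_notif;	/* armed: the next hit produces a notification */
	unsigned int sent;	/* notifications emitted so far */
};

/* Called for every packet that hits the rule. */
static void packet_hit(struct notif_state *n)
{
	if (!n->send_notif)
		return;
	n->sent++;
	if (!n->all_packets)
		n->send_notif = false;	/* one-shot until re-armed */
}

/* Models the netlink message userspace may send to re-arm the rule. */
static void userspace_rearm(struct notif_state *n)
{
	n->send_notif = true;
}
```

In one-shot mode, three hits followed by a re-arm and one more hit produce
two notifications; with all_packets set, every hit produces one.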

^ permalink raw reply

* [PATCH] xfrm: do not assume that template resolving always returns xfrms
From: Timo Teräs @ 2010-07-13  7:29 UTC (permalink / raw)
  To: netdev, David Miller, linux; +Cc: Timo Teräs
In-Reply-To: <20100712.212041.236240543.davem@davemloft.net>

xfrm_resolve_and_create_bundle() assumed that, if the policies indicated
the presence of xfrms, bundle template resolution would always return
some xfrms. This is not true for 'use' level policies, which can
result in no xfrms being applied if there are no suitable xfrm states.
This fixes a crash caused by this incorrect assumption.

Reported-by: George Spelvin <linux@horizon.com>
Bisected-by: George Spelvin <linux@horizon.com>
Tested-by: George Spelvin <linux@horizon.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 net/xfrm/xfrm_policy.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index af1c173..a7ec5a8 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1594,8 +1594,8 @@ xfrm_resolve_and_create_bundle(struct xfrm_policy **pols, int num_pols,
 
 	/* Try to instantiate a bundle */
 	err = xfrm_tmpl_resolve(pols, num_pols, fl, xfrm, family);
-	if (err < 0) {
-		if (err != -EAGAIN)
+	if (err <= 0) {
+		if (err != 0 && err != -EAGAIN)
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
 		return ERR_PTR(err);
 	}
@@ -1678,6 +1678,13 @@ xfrm_bundle_lookup(struct net *net, struct flowi *fl, u16 family, u8 dir,
 			goto make_dummy_bundle;
 		dst_hold(&xdst->u.dst);
 		return oldflo;
+	} else if (new_xdst == NULL) {
+		num_xfrms = 0;
+		if (oldflo == NULL)
+			goto make_dummy_bundle;
+		xdst->num_xfrms = 0;
+		dst_hold(&xdst->u.dst);
+		return oldflo;
 	}
 
 	/* Kill the previous bundle */
@@ -1760,6 +1767,10 @@ restart:
 				xfrm_pols_put(pols, num_pols);
 				err = PTR_ERR(xdst);
 				goto dropdst;
+			} else if (xdst == NULL) {
+				num_xfrms = 0;
+				drop_pols = num_pols;
+				goto no_transform;
 			}
 
 			spin_lock_bh(&xfrm_policy_sk_bundle_lock);
-- 
1.7.0.4
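
Editor's sketch, not part of the patch: after this change the resolve step
has three outcomes instead of two, because ERR_PTR(0) is simply NULL. Callers
therefore have to tell apart an error pointer, NULL ("no transforms apply")
and a real bundle. The userspace model below re-implements the error-pointer
convention locally to show the three-way split. err_ptr(), is_err(),
ptr_err(), resolve_sketch() and classify() are hypothetical helpers, not the
kernel's functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Local equivalents of the kernel's error-pointer helpers. */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

enum outcome { OUT_ERROR, OUT_NO_TRANSFORM, OUT_BUNDLE };

static int dummy_bundle;	/* stands in for a created bundle */

/* resolve_ret models the template-resolution result: negative errno on
 * failure, 0 when no transforms apply, otherwise the number of transforms. */
static void *resolve_sketch(int resolve_ret)
{
	if (resolve_ret <= 0)
		return err_ptr(resolve_ret);	/* 0 becomes NULL */
	return &dummy_bundle;
}

/* Only real failures are counted; 0 and -EAGAIN are not, as in the patch. */
static int counts_as_policy_error(int err)
{
	return err != 0 && err != -EAGAIN;
}

static enum outcome classify(const void *xdst)
{
	if (is_err(xdst))
		return OUT_ERROR;
	if (xdst == NULL)
		return OUT_NO_TRANSFORM;
	return OUT_BUNDLE;
}
```

The two new "else if (... == NULL)" branches in the patch correspond to the
OUT_NO_TRANSFORM case here.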


^ permalink raw reply related

* [PATCH] rfs: call sock_rps_record_flow() in tcp_splice_read()
From: Changli Gao @ 2010-07-13  7:00 UTC (permalink / raw)
  To: David S. Miller
  Cc: David S. Miller, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Tom Herbert,
	netdev, Changli Gao

rfs: call sock_rps_record_flow() in tcp_splice_read()

Call sock_rps_record_flow() in tcp_splice_read(), so that applications using
splice(2) or sendfile(2) can utilize RFS.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 net/ipv4/tcp.c |    1 +
 1 file changed, 1 insertion(+)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9fce8a8..86b9f67 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -608,6 +608,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	ssize_t spliced;
 	int ret;
 
+	sock_rps_record_flow(sk);
 	/*
 	 * We can't seek on a socket input
 	 */
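
Editor's sketch, not part of the patch: RFS steers a flow's packets towards
the CPU on which the consuming application last ran, and
sock_rps_record_flow() is the hook that records "this flow was just consumed
on this CPU". The userspace model below shows only that record/lookup idea
with a small hash-indexed table. The table size, the NO_CPU marker and the
function names are hypothetical and do not reflect the kernel's actual data
layout.

```c
#include <assert.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 256u	/* must be a power of two */
#define NO_CPU 0xffffu		/* marks a slot with no recorded CPU */

struct flow_table {
	uint16_t cpu[FLOW_TABLE_SIZE];
};

static void flow_table_init(struct flow_table *t)
{
	for (unsigned int i = 0; i < FLOW_TABLE_SIZE; i++)
		t->cpu[i] = NO_CPU;
}

/* What a read path such as recvmsg or splice would do: remember which CPU
 * the application is currently running on for this flow. */
static void record_flow(struct flow_table *t, uint32_t flow_hash, uint16_t cpu)
{
	t->cpu[flow_hash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/* What the receive path would consult when choosing where to process the
 * flow's next packet. */
static uint16_t desired_cpu(const struct flow_table *t, uint32_t flow_hash)
{
	return t->cpu[flow_hash & (FLOW_TABLE_SIZE - 1)];
}
```

A read path that never records the flow leaves its slot stale or empty, which
is the gap the one-line patch closes for splice-based readers.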

^ permalink raw reply related

* Re: [PATCH repost] sched: export sched_set/getaffinity to modules
From: Sridhar Samudrala @ 2010-07-13  6:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Peter Zijlstra, Tejun Heo, Ingo Molnar, netdev,
	lkml, kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev,
	Jiri Kosina, Thomas Gleixner, Andi Kleen
In-Reply-To: <20100704090005.GA8078@redhat.com>

On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote:
> On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
>    
>> On 07/02, Peter Zijlstra wrote:
>>      
>>> On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
>>>        
>>>>   Does  it (Tejun's kthread_clone() patch) also  inherit the
>>>> cgroup of the caller?
>>>>          
>>> Of course, it's a simple do_fork() which inherits everything just as you
>>> would expect from a similar sys_clone()/sys_fork() call.
>>>        
>> Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
>> from ioctl(), right?
>>
>> Then the new thread becomes the natural child of the caller, and it shares
>> ->mm with the parent. And files, dup_fd() without CLONE_FS.
>>
>> Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
>> TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
>> just because the parent gets SIGQUIT or another coredumpable signal.
>> Or the new thread can receive SIGSTOP via ^Z.
>>
>> Perhaps this is OK, I do not know. Just to remind that kernel_thread()
>> is merely clone(CLONE_VM).
>>
>> Oleg.
>>      
>
> Right. Doing this might break things like flush.  The signal and exit
> behaviour needs to be examined carefully. I am also unsure whether
> using such threads might be more expensive than inheriting kthreadd.
>
>    
Should we just leave it to userspace to set the cgroup/cpumask after
qemu starts the guest and the vhost threads?

Thanks
Sridhar
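
Editor's sketch, not part of the thread: the cpumask half of the userspace
approach suggested above can be done with the existing
sched_setaffinity(2) call, given the thread ID of the task to pin. The helper
names below are hypothetical; how a management tool would discover the vhost
thread IDs is outside this sketch, and the cgroup half is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Pin task `pid` (0 means the calling thread) to a single CPU.
 * Returns 0 on success, -1 with errno set on failure. */
static int pin_task_to_cpu(pid_t pid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(pid, sizeof(set), &set);
}

/* Return the lowest-numbered CPU the task may currently run on,
 * or -1 if the affinity mask cannot be read. */
static int first_allowed_cpu(pid_t pid)
{
	cpu_set_t set;

	if (sched_getaffinity(pid, sizeof(set), &set) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

Setting the affinity of another user's task requires the appropriate
privilege, so a management tool would normally run this with the same
credentials as qemu or as root.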

^ permalink raw reply

* Re: [PATCH] netfilter: xtables: userspace notification target
From: Changli Gao @ 2010-07-13  6:18 UTC (permalink / raw)
  To: Samuel Ortiz
  Cc: Patrick McHardy, David S. Miller, netdev, netfilter-devel,
	Luciano Coelho
In-Reply-To: <20100713001115.GA3751@sortiz-mobl>

On Tue, Jul 13, 2010 at 8:11 AM, Samuel Ortiz <sameo@linux.intel.com> wrote:
>
> The userspace notification Xtables target sends a netlink notification
> whenever a packet hits the target. Notifications have a label attribute
> for userspace to match it against a previously set rule. The rules also
> take a --all option to switch between sending a notification for all
> packets or for the first one only.
> Userspace can also send a netlink message to toggle this switch while the
> target is in place. This target uses the netfilter netlink framework.
>
> This target combined with various matches (quota, rateest, etc.) allows
> userspace to make decisions on interface handling. One could for example
> decide to switch between power saving modes depending on estimated rate
> thresholds.
>

It is much like the following iptables rules.

iptables -N log_and_drop
iptables -A log_and_drop -j NFLOG --nflog-group 1 --nflog-prefix "log_and_drop"
iptables -A log_and_drop -j DROP

...
iptables ... -m quota --quota-bytes 20000 -j log_and_drop
...

>  include/linux/netfilter/Kbuild             |    1 +
>  include/linux/netfilter/nfnetlink.h        |    5 +-
>  include/linux/netfilter/nfnetlink_compat.h |    1 +
>  include/linux/netfilter/xt_NFNOTIF.h       |   55 +++++
>  net/netfilter/Kconfig                      |   17 ++
>  net/netfilter/Makefile                     |    1 +
>  net/netfilter/xt_NFNOTIF.c                 |  300 ++++++++++++++++++++++++++++
>  7 files changed, 379 insertions(+), 1 deletions(-)
>  create mode 100644 include/linux/netfilter/xt_NFNOTIF.h
>  create mode 100644 net/netfilter/xt_NFNOTIF.c
>
> diff --git a/include/linux/netfilter/Kbuild b/include/linux/netfilter/Kbuild
> index bb103f4..1b80b27 100644
> --- a/include/linux/netfilter/Kbuild
> +++ b/include/linux/netfilter/Kbuild
> @@ -12,6 +12,7 @@ header-y += xt_IDLETIMER.h
>  header-y += xt_LED.h
>  header-y += xt_MARK.h
>  header-y += xt_NFLOG.h
> +header-y += xt_NFNOTIF.h
>  header-y += xt_NFQUEUE.h
>  header-y += xt_RATEEST.h
>  header-y += xt_SECMARK.h
> diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
> index 361d6b5..e336f03 100644
> --- a/include/linux/netfilter/nfnetlink.h
> +++ b/include/linux/netfilter/nfnetlink.h
> @@ -18,6 +18,8 @@ enum nfnetlink_groups {
>  #define NFNLGRP_CONNTRACK_EXP_UPDATE   NFNLGRP_CONNTRACK_EXP_UPDATE
>        NFNLGRP_CONNTRACK_EXP_DESTROY,
>  #define NFNLGRP_CONNTRACK_EXP_DESTROY  NFNLGRP_CONNTRACK_EXP_DESTROY
> +       NFNLGRP_NFNOTIF,
> +#define NFNLGRP_NFNOTIF                        NFNLGRP_NFNOTIF
>        __NFNLGRP_MAX,
>  };
>  #define NFNLGRP_MAX    (__NFNLGRP_MAX - 1)
> @@ -47,7 +49,8 @@ struct nfgenmsg {
>  #define NFNL_SUBSYS_QUEUE              3
>  #define NFNL_SUBSYS_ULOG               4
>  #define NFNL_SUBSYS_OSF                        5
> -#define NFNL_SUBSYS_COUNT              6
> +#define NFNL_SUBSYS_NFNOTIF            6
> +#define NFNL_SUBSYS_COUNT              7
>
>  #ifdef __KERNEL__
>
> diff --git a/include/linux/netfilter/nfnetlink_compat.h b/include/linux/netfilter/nfnetlink_compat.h
> index ffb9503..dca8ab2 100644
> --- a/include/linux/netfilter/nfnetlink_compat.h
> +++ b/include/linux/netfilter/nfnetlink_compat.h
> @@ -13,6 +13,7 @@
>  #define NF_NETLINK_CONNTRACK_EXP_NEW           0x00000008
>  #define NF_NETLINK_CONNTRACK_EXP_UPDATE                0x00000010
>  #define NF_NETLINK_CONNTRACK_EXP_DESTROY       0x00000020
> +#define NF_NETLINK_NFNOTIF                     0x00000040
>
>  /* Generic structure for encapsulation optional netfilter information.
>  * It is reminiscent of sockaddr, but with sa_family replaced
> diff --git a/include/linux/netfilter/xt_NFNOTIF.h b/include/linux/netfilter/xt_NFNOTIF.h
> new file mode 100644
> index 0000000..8fae827
> --- /dev/null
> +++ b/include/linux/netfilter/xt_NFNOTIF.h
> @@ -0,0 +1,55 @@
> +/*
> + * linux/include/linux/netfilter/xt_NFNOTIF.h
> + *
> + * Header file for Xtables notification target module.
> + *
> + * Copyright (C) 2010 Intel Corporation
> + * Samuel Ortiz <samuel.ortiz@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> + * 02110-1301, USA.
> + */
> +
> +#ifndef _XT_NFNOTIF_H
> +#define _XT_NFNOTIF_H
> +
> +#include <linux/types.h>
> +
> +enum nfnotif_msg_type {
> +       NFNOTIF_TG_MSG_PACKETS,
> +
> +       NFNOTIF_TG_MSG_MAX
> +};
> +
> +enum nfnotif_attr_type {
> +       NFNOTIF_TG_ATTR_UNSPEC,
> +       NFNOTIF_TG_ATTR_LABEL,
> +       NFNOTIF_TG_ATTR_SEND_NOTIF,
> +
> +       __NFNOTIF_TG_ATTR_AFTER_LAST
> +};
> +#define NFNOTIF_TG_ATTR_MAX (__NFNOTIF_TG_ATTR_AFTER_LAST - 1)
> +
> +#define MAX_NFNOTIF_LABEL_SIZE 31
> +
> +struct nfnotif_tg_info {
> +       __u8 all_packets;
> +
> +       char label[MAX_NFNOTIF_LABEL_SIZE];
> +
> +       /* for kernel module internal use only */
> +       struct nfnotif_tg *notif __attribute((aligned(8)));
> +};
> +
> +#endif
> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
> index aa2f106..0e2de36 100644
> --- a/net/netfilter/Kconfig
> +++ b/net/netfilter/Kconfig
> @@ -469,6 +469,23 @@ config NETFILTER_XT_TARGET_NFQUEUE
>
>          To compile it as a module, choose M here.  If unsure, say N.
>
> +config NETFILTER_XT_TARGET_NFNOTIF
> +       tristate '"NFNOTIF" target Support'
> +       depends on NETFILTER_ADVANCED
> +       select NETFILTER_NETLINK
> +       help
> +
> +         This option adds the `NFNOTIF' target, which allows sending
> +         netfilter netlink messages when packets hit the target.
> +
> +         This target comes with an option to specify if one wants all
> +         packets hitting the target to trigger the netlink message
> +         transmission, or only the first one.
> +         It also listens on its netfilter netlink subsystem for messages
> +         that allow resetting the above option.
> +
> +         To compile it as a module, choose M here.  If unsure, say N.
> +
>  config NETFILTER_XT_TARGET_NOTRACK
>        tristate  '"NOTRACK" target support'
>        depends on IP_NF_RAW || IP6_NF_RAW
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index e28420a..5d9c9e9 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP) += xt_TCPOPTSTRIP.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_TEE) += xt_TEE.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_TRACE) += xt_TRACE.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
> +obj-$(CONFIG_NETFILTER_XT_TARGET_NFNOTIF) += xt_NFNOTIF.o
>
>  # matches
>  obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
> diff --git a/net/netfilter/xt_NFNOTIF.c b/net/netfilter/xt_NFNOTIF.c
> new file mode 100644
> index 0000000..e6e906b
> --- /dev/null
> +++ b/net/netfilter/xt_NFNOTIF.c
> @@ -0,0 +1,300 @@
> +/*
> + * linux/net/netfilter/xt_NFNOTIF.c
> + *
> + * Copyright (C) 2010 Intel Corporation
> + * Samuel Ortiz <samuel.ortiz@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> + * 02110-1301, USA.
> + *
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/netfilter.h>
> +#include <linux/netfilter/x_tables.h>
> +#include <linux/netfilter/nfnetlink.h>
> +#include <linux/netfilter/xt_NFNOTIF.h>
> +
> +struct nfnotif_tg {
> +       struct list_head entry;
> +       struct work_struct work;
> +
> +       char *label;
> +       __u8 all_packets;
> +       struct net *net;
> +
> +       __u8 send_notif;
> +
> +       unsigned int refcnt;
> +};
> +
> +static LIST_HEAD(nfnotif_tg_list);
> +static DEFINE_MUTEX(list_mutex);
> +
> +static int __nfnotif_tg_netlink_send(struct nfnotif_tg *nfnotif)
> +{
> +       struct nlmsghdr *nlh;
> +       struct nfgenmsg *nfmsg;
> +       struct sk_buff *skb;
> +       struct net *net = nfnotif->net;
> +       unsigned int type;
> +       int flags;
> +
> +       type = NFNL_SUBSYS_NFNOTIF << 8;
> +       flags = NLM_F_CREATE;
> +
> +       skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +       if (skb == NULL)
> +               goto error_out;
> +
> +       nlh = nlmsg_put(skb, 0, 0, type, sizeof(*nfmsg), flags);
> +       if (nlh == NULL)
> +               goto nlmsg_put_failure;
> +
> +       nfmsg = nlmsg_data(nlh);
> +       nfmsg->version      = NFNETLINK_V0;
> +       nfmsg->res_id       = 0;
> +
> +       NLA_PUT_STRING(skb, NFNOTIF_TG_ATTR_LABEL, nfnotif->label);
> +
> +       nlmsg_end(skb, nlh);
> +
> +       return nfnetlink_send(skb, net, 0, NFNLGRP_NFNOTIF, 0, GFP_KERNEL);
> +
> +nla_put_failure:
> +       nlmsg_cancel(skb, nlh);
> +
> +nlmsg_put_failure:
> +       kfree_skb(skb);
> +
> +error_out:
> +       return nfnetlink_set_err(net, 0, 0, -ENOBUFS);
> +}
> +
> +static void nfnotif_tg_work(struct work_struct *work)
> +{
> +       struct nfnotif_tg *notif = container_of(work, struct nfnotif_tg, work);
> +
> +
> +       if (__nfnotif_tg_netlink_send(notif) < 0)
> +               pr_debug("Could not send notification");
> +
> +       if (!notif->all_packets)
> +               notif->send_notif = 0;
> +}
> +
> +static struct nfnotif_tg *__nfnotif_tg_find_by_label(const char *label)
> +{
> +       struct nfnotif_tg *entry;
> +
> +       BUG_ON(!label);
> +
> +       list_for_each_entry(entry, &nfnotif_tg_list, entry) {
> +               if (!strcmp(label, entry->label))
> +                       return entry;
> +       }
> +
> +       return NULL;
> +}
> +
> +static int nfnotif_tg_create(struct nfnotif_tg_info *info)
> +{
> +       info->notif = kmalloc(sizeof(*info->notif), GFP_KERNEL);
> +       if (!info->notif) {
> +               pr_debug("Couldn't allocate notification\n");
> +               return -ENOMEM;
> +       }
> +
> +       info->notif->label = kstrdup(info->label, GFP_KERNEL);
> +       if (!info->notif->label) {
> +               pr_debug("Couldn't allocate label\n");
> +               kfree(info->notif);
> +               return -ENOMEM;
> +       }
> +
> +       info->notif->all_packets = info->all_packets;
> +       info->notif->send_notif = 1;
> +
> +       list_add(&info->notif->entry, &nfnotif_tg_list);
> +
> +       info->notif->refcnt = 1;
> +
> +       INIT_WORK(&info->notif->work, nfnotif_tg_work);
> +
> +       return 0;
> +}
> +
> +static unsigned int nfnotif_tg_target(struct sk_buff *skb,
> +                                     const struct xt_action_param *par)
> +{
> +       const struct nfnotif_tg_info *info = par->targinfo;
> +
> +       BUG_ON(!info->notif);
> +
> +       if (!info->notif->send_notif)
> +               return XT_CONTINUE;
> +
> +       pr_debug("Sending notification for %s\n", info->label);
> +
> +       schedule_work(&info->notif->work);
> +

Why do you defer this to another kernel context (a workqueue thread)?
Netlink messages can be sent from atomic context.

> +       return XT_CONTINUE;
> +}
> +
> +static int nfnotif_tg_checkentry(const struct xt_tgchk_param *par)
> +{
> +       struct nfnotif_tg_info *info = par->targinfo;
> +       int ret;
> +
> +       pr_debug("Checkentry targinfo %s\n", info->label);
> +
> +       if (info->label[0] == '\0' ||
> +           strnlen(info->label,
> +                   MAX_NFNOTIF_LABEL_SIZE) == MAX_NFNOTIF_LABEL_SIZE) {
> +               pr_debug("Label is empty or not nul-terminated\n");
> +               return -EINVAL;
> +       }
> +
> +       mutex_lock(&list_mutex);
> +
> +       info->notif = __nfnotif_tg_find_by_label(info->label);
> +       if (info->notif) {
> +               info->notif->refcnt++;
> +
> +               pr_debug("Increased refcnt for %s to %u\n",
> +                        info->label, info->notif->refcnt);
> +       } else {
> +               ret = nfnotif_tg_create(info);
> +               if (ret < 0) {
> +                       pr_debug("Failed to create notification\n");
> +                       mutex_unlock(&list_mutex);
> +                       return ret;
> +               }
> +       }
> +
> +       info->notif->net = par->net;
> +
> +       mutex_unlock(&list_mutex);
> +       return 0;
> +}
> +
> +static void nfnotif_tg_destroy(const struct xt_tgdtor_param *par)
> +{
> +       const struct nfnotif_tg_info *info = par->targinfo;
> +
> +       pr_debug("Destroy targinfo %s\n", info->label);
> +
> +       mutex_lock(&list_mutex);
> +
> +       if (--info->notif->refcnt == 0) {
> +               pr_debug("Deleting notification %s\n", info->label);
> +
> +               list_del(&info->notif->entry);
> +               kfree(info->notif->label);
> +               kfree(info->notif);
> +       }
> +
> +       mutex_unlock(&list_mutex);
> +}
> +
> +static struct xt_target nfnotif_tg __read_mostly = {
> +       .name           = "NFNOTIF",
> +       .family         = NFPROTO_UNSPEC,
> +       .target         = nfnotif_tg_target,
> +       .targetsize     = sizeof(struct nfnotif_tg_info),
> +       .checkentry     = nfnotif_tg_checkentry,
> +       .destroy        = nfnotif_tg_destroy,
> +       .me             = THIS_MODULE,
> +};
> +
> +static int nfnotif_msg_send_notif(struct sock *nfnl, struct sk_buff *skb,
> +                                 const struct nlmsghdr *nlh,
> +                                 const struct nlattr * const attrs[])
> +{
> +       struct nfnotif_tg *notif;
> +       char *label;
> +       u8 send_notif;
> +
> +       if (attrs[NFNOTIF_TG_ATTR_LABEL] == NULL ||
> +           attrs[NFNOTIF_TG_ATTR_SEND_NOTIF] == NULL)
> +               return -EINVAL;
> +
> +       label = nla_data(attrs[NFNOTIF_TG_ATTR_LABEL]);
> +       send_notif = nla_get_u8(attrs[NFNOTIF_TG_ATTR_SEND_NOTIF]);
> +
> +       pr_debug("Label %s send %d\n", label, send_notif);
> +
> +       notif = __nfnotif_tg_find_by_label(label);
> +       if (notif == NULL)
> +               return -EINVAL;
> +
> +       notif->send_notif = send_notif;
> +
> +       return 0;
> +}
> +
> +
> +static const struct nla_policy nfnotif_nla_policy[NFNOTIF_TG_ATTR_MAX + 1] = {
> +       [NFNOTIF_TG_ATTR_LABEL]            = { .type = NLA_NUL_STRING },
> +       [NFNOTIF_TG_ATTR_SEND_NOTIF]       = { .type = NLA_U8 },
> +};
> +
> +static const struct nfnl_callback nfnotif_cb[NFNOTIF_TG_MSG_MAX] = {
> +       [NFNOTIF_TG_MSG_PACKETS]   = { .call = nfnotif_msg_send_notif,
> +                                      .attr_count = NFNOTIF_TG_ATTR_MAX,
> +                                      .policy = nfnotif_nla_policy },
> +};
> +
> +static const struct nfnetlink_subsystem nfnotif_subsys = {
> +       .name                           = "nfnotif",
> +       .subsys_id                      = NFNL_SUBSYS_NFNOTIF,
> +       .cb_count                       = NFNOTIF_TG_MSG_MAX,
> +       .cb                             = nfnotif_cb,
> +};
> +
> +static int __init nfnotif_tg_init(void)
> +{
> +       int ret;
> +
> +       ret = nfnetlink_subsys_register(&nfnotif_subsys);
> +       if (ret < 0) {
> +               pr_err("%s: Cannot register with nfnetlink\n", __func__);
> +               return ret;
> +       }
> +
> +       ret = xt_register_target(&nfnotif_tg);
> +       if (ret < 0) {
> +               pr_err("%s: Cannot register target\n", __func__);
> +               nfnetlink_subsys_unregister(&nfnotif_subsys);
> +       }
> +
> +       return ret;
> +}
> +
> +static void __exit nfnotif_tg_exit(void)
> +{
> +       nfnetlink_subsys_unregister(&nfnotif_subsys);
> +       xt_unregister_target(&nfnotif_tg);
> +}
> +
> +module_init(nfnotif_tg_init);
> +module_exit(nfnotif_tg_exit);
> +
> +MODULE_AUTHOR("Samuel Ortiz <samuel.ortiz@intel.com>");
> +MODULE_DESCRIPTION("Xtables: userspace notification");
> +MODULE_LICENSE("GPL v2");
> --
> 1.7.1
>
> --
> Intel Open Source Technology Centre
> http://oss.intel.com/
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [PATCH] netfilter: xtables: userspace notification target
From: Jan Engelhardt @ 2010-07-13  5:56 UTC (permalink / raw)
  To: Samuel Ortiz
  Cc: Patrick McHardy, David S. Miller, netdev, netfilter-devel,
	Luciano Coelho
In-Reply-To: <20100713001115.GA3751@sortiz-mobl>


On Tuesday 2010-07-13 02:11, Samuel Ortiz wrote:
>
>The userspace notification Xtables target sends a netlink notification
>whenever a packet hits the target. Notifications have a label attribute
>for userspace to match it against a previously set rule. The rules also
>take a --all option to switch between sending a notification for all
>packets or for the first one only.
>Userspace can also send a netlink message to toggle this switch while the
>target is in place. This target uses the netfilter netlink framework.

This sounds an awful lot like NFQUEUE without passing the payload :)
Would it not make sense to extend that module instead?

>+++ b/net/netfilter/xt_NFNOTIF.c
>+struct nfnotif_tg {
>+	struct list_head entry;
>+	struct work_struct work;
>+
>+	char *label;
>+	__u8 all_packets;
>+	struct net *net;
>+
>+	__u8 send_notif;
>+
>+	unsigned int refcnt;
>+};

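This struct has unnecessary padding holes: each __u8 member is followed
by a wider member, so grouping the two __u8 fields together would
shrink it.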
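
To illustrate the padding remark, here is a minimal userspace sketch.
The stub types (list_head_stub, work_stub) and both struct names are
hypothetical stand-ins for the kernel definitions, so the exact sizes
will differ from the real struct; only the relative effect of the field
order is the point. The reordering shown is one possible layout, not
necessarily the one the reviewer had in mind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct list_head and struct work_struct. */
struct list_head_stub { void *next, *prev; };
struct work_stub {
	void *data;
	struct list_head_stub entry;
	void (*func)(void *);
};

/* Field order as posted in the patch. */
struct nfnotif_as_posted {
	struct list_head_stub entry;
	struct work_stub work;
	char *label;
	uint8_t all_packets;	/* hole: padding up to pointer alignment */
	void *net;		/* stands in for struct net * */
	uint8_t send_notif;	/* hole: padding up to int alignment */
	unsigned int refcnt;
};

/* Same members, widest first, the two bytes adjacent at the end. */
struct nfnotif_reordered {
	struct list_head_stub entry;
	struct work_stub work;
	char *label;
	void *net;
	unsigned int refcnt;
	uint8_t all_packets;
	uint8_t send_notif;
};

/* Bytes saved per object by the reordering on the current ABI. */
static size_t bytes_saved(void)
{
	return sizeof(struct nfnotif_as_posted) -
	       sizeof(struct nfnotif_reordered);
}
```

On common 32-bit and 64-bit ABIs bytes_saved() should be non-zero,
because the two one-byte members then share a single slot instead of
each forcing its own alignment gap. On a kernel build with debug info,
a tool such as pahole can report the real holes.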


* Re: iproute, batch-cmds, and mac-vlans.
From: Ben Greear @ 2010-07-13  5:32 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: NetDev
In-Reply-To: <20100712221958.2a87b0e3@nehalam>

On 07/12/2010 10:19 PM, Stephen Hemminger wrote:
> On Mon, 12 Jul 2010 21:49:20 -0700
> Ben Greear <greearb@candelatech.com> wrote:
>
>> After too much time debugging, I finally realized that the ip
>> tool was truncating my command because the mac-vlan device name
>> had a '#' in it.
>>
>> ]# cat /tmp/foo.txt
>> ru add to 10.99.21.1 iif eth0#0 lookup local pref 11
>>
>>
>> # IP tool has some hacked up debugging code
>> ]# ip -batch /tmp/foo.txt
>>    argc: 4
>>    arg -:to:-
>>    arg -:iif:-
>> WARNING:  Using TABLE_MAIN in iprule_modify, table_ok: 0  cmd: 32
>>
>>
>> So, it acts on eth0 instead of eth0#0, and silently ignores the 'lookup local pref 11'.
>>
>> I understand that it is trying to parse # as comments, but would you
>> all be interested in a patch that allowed ignoring '#' except
>> when it is the first non-whitespace character on a line, and maybe
>> when preceded by whitespace?  This would of course have the possibility
>> of breaking someone's script somewhere, so it could be enabled with
>> a new command line arg, perhaps.
>
> Putting # in device name just sounds like a bad idea.

It's been the standard naming for mac-vlans since we started supporting them.

In case you change your mind, this patch seems to work, though I can't figure
out how to trigger the second bit of code (in the while loop), so it may not
be right.

I'll move my iproute2 tree to github in case someone else wants to give
it a try.

diff --git a/lib/utils.c b/lib/utils.c
index a60d884..ad8e1ac 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -25,7 +25,7 @@
  #include <linux/pkt_sched.h>
  #include <time.h>
  #include <sys/time.h>
-
+#include <ctype.h>

  #include "utils.h"

@@ -708,8 +708,23 @@ ssize_t getcmdline(char **linep, size_t *lenp, FILE *in)
         ++cmdlineno;

         cp = strchr(*linep, '#');
-       if (cp)
-               *cp = '\0';
+
+       /* We don't want to treat the # in the middle of a word as
+        * a comment..makes batch commands dealing with mac-vlans: eth0#1
+        * silently do the wrong thing.  So, tighten up the # syntax a bit.
+        *
+        * # at start of line comments rest of line
+        * # preceded by a whitespace character comments rest of line.
+        */
+       while (cp) {
+               if (cp &&
+                   ((cp == *linep) /* starts line */
+                    || ((cp > *linep) && isspace(*(cp - 1))))) { /* follows space */
+                       *cp = '\0';
+                       break;
+               }
+               cp = strchr(cp+1, '#');
+       }

         while ((cp = strstr(*linep, "\\\n")) != NULL) {
                 char *line1 = NULL;
@@ -725,9 +740,16 @@ ssize_t getcmdline(char **linep, size_t *lenp, FILE *in)
                 *cp = 0;

                 cp = strchr(line1, '#');
-               if (cp)
-                       *cp = '\0';
-
+               while (cp) {
+                       if (cp &&
+                           ((cp == line1) /* starts line */
+                            || ((cp > line1) && isspace(*(cp - 1))))) { /* follows space */
+                               *cp = '\0';
+                               break;
+                       }
+                       cp = strchr(cp+1, '#');
+               }
+
                 *lenp = strlen(*linep) + strlen(line1) + 1;
                 *linep = realloc(*linep, *lenp);
                 if (!*linep) {
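
For illustration, the comment rule from the patch can be exercised
outside iproute2 with a minimal standalone sketch. The helper names
strip_comment_old() and strip_comment_new() are hypothetical and do not
exist in lib/utils.c; the first mirrors the existing "cut at the first
'#'" behaviour, the second mirrors the proposed rule (a '#' starts a
comment only at the start of the line or right after whitespace).

```c
#include <ctype.h>
#include <string.h>

/* Existing behaviour: truncate at the first '#', wherever it is. */
static void strip_comment_old(char *line)
{
	char *cp = strchr(line, '#');

	if (cp)
		*cp = '\0';
}

/*
 * Proposed behaviour: walk every '#' in the line and truncate only at
 * one that starts the line or directly follows a whitespace character.
 * A '#' inside a word, as in "eth0#0", is left alone.
 */
static void strip_comment_new(char *line)
{
	char *cp;

	for (cp = strchr(line, '#'); cp; cp = strchr(cp + 1, '#')) {
		if (cp == line || isspace((unsigned char)cp[-1])) {
			*cp = '\0';
			break;
		}
	}
}
```

With the batch line from the report, the old rule leaves
"ru add to 10.99.21.1 iif eth0" while the new rule keeps the whole
line. Inside a loop that already tests cp, the extra "cp &&" and
"cp > *linep" checks in the patch are redundant, which is why the
sketch omits them. The second hunk appears to apply the same rule to
continuation lines (the loop that runs when a line ends in a backslash
followed by a newline), so a batch file containing such a line should
reach that code.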



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: iproute, batch-cmds, and mac-vlans.
From: Stephen Hemminger @ 2010-07-13  5:19 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev
In-Reply-To: <4C3BF050.7040206@candelatech.com>

On Mon, 12 Jul 2010 21:49:20 -0700
Ben Greear <greearb@candelatech.com> wrote:

> After too much time debugging, I finally realized that the ip
> tool was truncating my command because the mac-vlan device name
> had a '#' in it.
> 
> ]# cat /tmp/foo.txt
> ru add to 10.99.21.1 iif eth0#0 lookup local pref 11
> 
> 
> # IP tool has some hacked up debugging code
> ]# ip -batch /tmp/foo.txt
>   argc: 4
>   arg -:to:-
>   arg -:iif:-
> WARNING:  Using TABLE_MAIN in iprule_modify, table_ok: 0  cmd: 32
> 
> 
> So, it acts on eth0 instead of eth0#0, and silently ignores the 'lookup local pref 11'.
> 
> I understand that it is trying to parse # as comments, but would you
> all be interested in a patch that allowed ignoring '#' except
> when it is the first non-whitespace character on a line, and maybe
> when preceded by whitespace?  This would of course have the possibility
> of breaking someone's script somewhere, so it could be enabled with
> a new command line arg, perhaps.

Putting # in device name just sounds like a bad idea.


