Netdev List


* [PATCH 04/14] mm: allow PF_MEMALLOC from softirq context
From: Mel Gorman @ 2011-10-06 12:41 UTC
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>

This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC because it is not
associated with a task and therefore has no task flags to fiddle
with. Thus the gfp to alloc flag mapping ignores the task flags when
in interrupt (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery.  We basically borrow the task flags from whatever process
happens to be preempted by the softirq.

So we modify the gfp to alloc flags mapping to not exclude task flags
in softirq context, and modify the softirq code to save, clear and
restore the PF_MEMALLOC flag.

The save and clear ensure the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |    7 +++++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    5 ++++-
 3 files changed, 14 insertions(+), 1 deletions(-)
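
The save, clear and restore sequence described in the changelog can be
modelled in user space. This is only a sketch: struct mock_task,
mock_restore_flags() and mock_do_softirq() are hypothetical stand-ins
for task_struct, tsk_restore_flags() and __do_softirq(), and the
PF_MEMALLOC value is assumed for illustration.

```c
#include <stdbool.h>

/* Assumed bit value, standing in for the kernel's PF_MEMALLOC. */
#define PF_MEMALLOC 0x00000800UL

/* Hypothetical stand-in for struct task_struct; only flags matter here. */
struct mock_task {
	unsigned long flags;
};

/* Same logic as tsk_restore_flags() in the patch: only bits in mask change. */
static inline void mock_restore_flags(struct mock_task *p,
				      unsigned long pflags, unsigned long mask)
{
	p->flags &= ~mask;
	p->flags |= pflags & mask;
}

/*
 * Models the flag handling added to __do_softirq(). The handler body is
 * reduced to optionally setting PF_MEMALLOC. Returns true if the handler
 * would have seen PF_MEMALLOC set on entry.
 */
static bool mock_do_softirq(struct mock_task *task, bool handler_sets_memalloc)
{
	unsigned long pflags = task->flags;	/* save the borrowed flags */
	bool seen_on_entry;

	task->flags &= ~PF_MEMALLOC;		/* clear: nothing leaks in */
	seen_on_entry = (task->flags & PF_MEMALLOC) != 0;

	if (handler_sets_memalloc)
		task->flags |= PF_MEMALLOC;	/* softirq uses the reserves */

	/* restore: nothing leaks back into the preempted task */
	mock_restore_flags(task, pflags, PF_MEMALLOC);
	return seen_on_entry;
}
```

Whatever the handler does to PF_MEMALLOC, the preempted task ends up
with the value it had on entry, and all other flag bits are untouched.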

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ac2c05..791536c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1869,6 +1869,13 @@ static inline void rcu_copy_process(struct task_struct *p)
 
 #endif
 
+static inline void tsk_restore_flags(struct task_struct *p,
+				     unsigned long pflags, unsigned long mask)
+{
+	p->flags &= ~mask;
+	p->flags |= pflags & mask;
+}
+
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p,
 			       const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index fca82c3..f773afe 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -265,6 +267,7 @@ restart:
 
 	account_system_vtime(current);
 	__local_bh_enable(SOFTIRQ_OFFSET);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03fd18c..31e0eb2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2060,7 +2060,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
 		if (gfp_mask & __GFP_MEMALLOC)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+		else if (!in_irq() && (current->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
-- 
1.7.3.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 03/14] mm: Introduce __GFP_MEMALLOC to allow access to emergency reserves
From: Mel Gorman @ 2011-10-06 12:41 UTC
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>

__GFP_MEMALLOC will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC. It allows one to pass along the memalloc state
in object related allocation flags, such as sk->sk_allocation, as
opposed to task related flags. This removes the need for ALLOC_PFMEMALLOC
as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARKS flag
which is now enough to identify allocations related to page reclaim.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h             |   10 ++++++++--
 include/linux/mm_types.h        |    2 +-
 include/trace/events/gfpflags.h |    1 +
 mm/page_alloc.c                 |   14 ++++++--------
 mm/slab.c                       |    2 +-
 5 files changed, 17 insertions(+), 12 deletions(-)
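
The precedence rule from the changelog and the gfp.h comment can be
written down as a small user-space model. This is a simplified sketch,
not the kernel's gfp_to_alloc_flags(): mock_may_ignore_watermarks() is
a hypothetical name, the task-based path is reduced to a boolean
argument, and interrupt context is ignored. The bit values are the ones
shown in the gfp.h hunk.

```c
#include <stdbool.h>

/* Bit values as they appear in the gfp.h hunk of this patch. */
#define MOCK_GFP_MEMALLOC	0x2000u
#define MOCK_GFP_NOMEMALLOC	0x10000u

/*
 * Returns true if an allocation may disregard the watermarks.
 * task_pf_memalloc stands in for (current->flags & PF_MEMALLOC).
 */
static bool mock_may_ignore_watermarks(unsigned int gfp_mask,
				       bool task_pf_memalloc)
{
	/* __GFP_NOMEMALLOC takes precedence over everything else. */
	if (gfp_mask & MOCK_GFP_NOMEMALLOC)
		return false;

	/* Object-related state carried in the gfp mask itself. */
	if (gfp_mask & MOCK_GFP_MEMALLOC)
		return true;

	/* Otherwise fall back to the task-related state. */
	return task_pf_memalloc;
}
```

A caller that stores __GFP_MEMALLOC in a per-object mask therefore gets
reserve access regardless of which task happens to be running, unless
the same mask also carries __GFP_NOMEMALLOC.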

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..38acdc7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -23,6 +23,7 @@ struct vm_area_struct;
 #define ___GFP_REPEAT		0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
+#define ___GFP_MEMALLOC		0x2000u
 #define ___GFP_COMP		0x4000u
 #define ___GFP_ZERO		0x8000u
 #define ___GFP_NOMEMALLOC	0x10000u
@@ -75,9 +76,14 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)	/* See above */
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY) /* See above */
+#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves */
+#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
+							 * This takes precedence over the
+							 * __GFP_MEMALLOC flag if both are
+							 * set
+							 */
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
@@ -127,7 +133,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3716e9f..0be3d43 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -54,7 +54,7 @@ struct page {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub first free object */
 			bool pfmemalloc;	/* If set by the page allocator,
-						 * ALLOC_PFMEMALLOC was set
+						 * ALLOC_NO_WATERMARKS was set
 						 * and the low watermark was not
 						 * met implying that the system
 						 * is under some pressure. The
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9fe3a366..d6fd8e5 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -30,6 +30,7 @@
 	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
 	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
 	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_MEMALLOC,		"GFP_MEMALLOC"},	\
 	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
 	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 561cb61..03fd18c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1369,7 +1369,6 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2058,11 +2057,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if ((current->flags & PF_MEMALLOC) ||
-			unlikely(test_thread_flag(TIF_MEMDIE))) {
-		alloc_flags |= ALLOC_PFMEMALLOC;
-
-		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
@@ -2071,7 +2069,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
 static inline struct page *
@@ -2253,7 +2251,7 @@ got_pg:
 	 * steps that will free more memory. The caller should avoid the
 	 * page being used for !PFMEMALLOC purposes.
 	 */
-	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+	page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 
 	return page;
 }
diff --git a/mm/slab.c b/mm/slab.c
index 1dd03e0..25f69ec 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3031,7 +3031,7 @@ static int cache_grow(struct kmem_cache *cachep,
 	if (!slabp)
 		goto opps1;
 
-	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (pfmemalloc) {
 		struct array_cache *ac = cpu_cache_get(cachep);
 		slabp->pfmemalloc = true;
-- 
1.7.3.4


* [PATCH 02/14] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2011-10-06 12:41 UTC
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>

Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory.  To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
such pages are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.

Pages allocated from the reserve are returned with page->pfmemalloc
set and it is up to the caller to determine how the page should be
protected.  SLAB restricts access to any page with page->pfmemalloc set
to callers which are known to be able to access the PFMEMALLOC reserve. If
one is not available, an attempt is made to allocate a new page rather
than use a reserve. SLUB is a bit more relaxed in that it only records
if the current per-CPU page was allocated from PFMEMALLOC reserve and
uses another partial slab if the caller does not have the necessary
GFP or process flags. This was found to be sufficient in tests to
avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.

[a.p.zijlstra@chello.nl: Original implementation]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h |    9 ++
 include/linux/slub_def.h |    1 +
 mm/internal.h            |    3 +
 mm/page_alloc.c          |   27 +++++-
 mm/slab.c                |  216 +++++++++++++++++++++++++++++++++++++++-------
 mm/slub.c                |   35 +++++++-
 6 files changed, 248 insertions(+), 43 deletions(-)
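
For SLAB, the patch marks array_cache entries that belong to pfmemalloc
slabs by setting the low bit of the stored pointer. A minimal user-space
sketch of that tagging idea follows. The obj_* names are hypothetical,
and it assumes objects are at least 2-byte aligned and that pointers
survive a round trip through uintptr_t, which holds on common platforms
but is implementation-defined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low pointer bit used as the marker, like SLAB_OBJ_PFMEMALLOC. */
#define OBJ_PFMEMALLOC ((uintptr_t)1)

/* True if the stored pointer carries the marker bit. */
static inline bool obj_is_tagged(const void *objp)
{
	return ((uintptr_t)objp & OBJ_PFMEMALLOC) != 0;
}

/* Returns the pointer with the marker bit set; do not dereference it. */
static inline void *obj_tag(void *objp)
{
	return (void *)((uintptr_t)objp | OBJ_PFMEMALLOC);
}

/* Returns the original, dereferenceable pointer. */
static inline void *obj_untag(void *objp)
{
	return (void *)((uintptr_t)objp & ~OBJ_PFMEMALLOC);
}
```

This is why the array_cache comment warns against dereferencing entries
directly: a tagged entry has to pass through the equivalent of
obj_untag() before it is handed to a caller.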

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 774b895..3716e9f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -53,6 +53,15 @@ struct page {
 		union {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub first free object */
+			bool pfmemalloc;	/* If set by the page allocator,
+						 * ALLOC_PFMEMALLOC was set
+						 * and the low watermark was not
+						 * met implying that the system
+						 * is under some pressure. The
+						 * caller should try to ensure
+						 * this page is only used to
+						 * free other pages.
+						 */
 		};
 
 		union {
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index f58d641..d41a9a4 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -43,6 +43,7 @@ struct kmem_cache_cpu {
 	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+	bool pfmemalloc;	/* Slab page had pfmemalloc set */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff --git a/mm/internal.h b/mm/internal.h
index d071d380..a520f3b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -193,6 +193,9 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARKS */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d8bd0e..561cb61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -656,6 +656,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 	trace_mm_page_free_direct(page, order);
 	kmemcheck_free_shadow(page, order);
 
+	page->pfmemalloc = false;
 	if (PageAnon(page))
 		page->mapping = NULL;
 	for (i = 0; i < (1 << order); i++)
@@ -1174,6 +1175,7 @@ void free_hot_cold_page(struct page *page, int cold)
 
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
+	page->pfmemalloc = false;
 	local_irq_save(flags);
 	if (unlikely(wasMlocked))
 		free_page_mlock(page);
@@ -1367,6 +1369,7 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2055,16 +2058,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+	if ((current->flags & PF_MEMALLOC) ||
+			unlikely(test_thread_flag(TIF_MEMDIE))) {
+		alloc_flags |= ALLOC_PFMEMALLOC;
+
+		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
 	return alloc_flags;
 }
 
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2237,8 +2246,16 @@ nopage:
 got_pg:
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
-	return page;
 
+	/*
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or has
+	 * been OOM killed. The expectation is that the caller is taking
+	 * steps that will free more memory. The caller should avoid the
+	 * page being used for !PFMEMALLOC purposes.
+	 */
+	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
+	return page;
 }
 
 /*
diff --git a/mm/slab.c b/mm/slab.c
index 6d90a09..1dd03e0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -121,6 +121,8 @@
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
 
+#include	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -227,6 +229,7 @@ struct slab {
 			unsigned int inuse;	/* num of objs active in slab */
 			kmem_bufctl_t free;
 			unsigned short nodeid;
+			bool pfmemalloc;	/* Slab had pfmemalloc set */
 		};
 		struct slab_rcu __slab_cover_slab_rcu;
 	};
@@ -248,15 +251,37 @@ struct array_cache {
 	unsigned int avail;
 	unsigned int limit;
 	unsigned int batchcount;
-	unsigned int touched;
+	bool touched;
+	bool pfmemalloc;
 	spinlock_t lock;
 	void *entry[];	/*
 			 * Must have this definition in here for the proper
 			 * alignment of array_cache. Also simplifies accessing
 			 * the entries.
+			 *
+			 * Entries should not be directly dereferenced as
+			 * entries belonging to slabs marked pfmemalloc will
+			 * have the lower bits set SLAB_OBJ_PFMEMALLOC
 			 */
 };
 
+#define SLAB_OBJ_PFMEMALLOC	1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+	return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+	return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
 /*
  * bootstrap: The caches do not work without cpuarrays anymore, but the
  * cpuarrays are allocated from the generic caches...
@@ -929,12 +954,100 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 		nc->avail = 0;
 		nc->limit = entries;
 		nc->batchcount = batchcount;
-		nc->touched = 0;
+		nc->touched = false;
 		spin_lock_init(&nc->lock);
 	}
 	return nc;
 }
 
+/* Clears ac->pfmemalloc if no slabs have pfmemalloc set */
+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
+						struct array_cache *ac)
+{
+	struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+	struct slab *slabp;
+
+	if (!ac->pfmemalloc)
+		return;
+
+	list_for_each_entry(slabp, &l3->slabs_full, list)
+		if (slabp->pfmemalloc)
+			return;
+
+	list_for_each_entry(slabp, &l3->slabs_partial, list)
+		if (slabp->pfmemalloc)
+			return;
+
+	list_for_each_entry(slabp, &l3->slabs_free, list)
+		if (slabp->pfmemalloc)
+			return;
+
+	ac->pfmemalloc = false;
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+						gfp_t flags, bool force_refill)
+{
+	int i;
+	void *objp = ac->entry[--ac->avail];
+
+	/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+	if (unlikely(is_obj_pfmemalloc(objp))) {
+		struct kmem_list3 *l3;
+
+		if (gfp_pfmemalloc_allowed(flags)) {
+			clear_obj_pfmemalloc(&objp);
+			return objp;
+		}
+
+		/* The caller cannot use PFMEMALLOC objects, find another one */
+		for (i = 1; i < ac->avail; i++) {
+			/* If a !PFMEMALLOC object is found, swap them */
+			if (!is_obj_pfmemalloc(ac->entry[i])) {
+				objp = ac->entry[i];
+				ac->entry[i] = ac->entry[ac->avail];
+				ac->entry[ac->avail] = objp;
+				return objp;
+			}
+		}
+
+		/*
+		 * If there are empty slabs on the slabs_free list and we are
+		 * being forced to refill the cache, mark this one !pfmemalloc.
+		 */
+		l3 = cachep->nodelists[numa_mem_id()];
+		if (!list_empty(&l3->slabs_free) && force_refill) {
+			struct slab *slabp = virt_to_slab(objp);
+			slabp->pfmemalloc = false;
+			clear_obj_pfmemalloc(&objp);
+			check_ac_pfmemalloc(cachep, ac);
+			return objp;
+		}
+
+		/* No !PFMEMALLOC objects available */
+		ac->avail++;
+		objp = NULL;
+	}
+
+	return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	struct slab *slabp;
+
+	/* If there are pfmemalloc slabs, check if the object is part of one */
+	if (unlikely(ac->pfmemalloc)) {
+		slabp = virt_to_slab(objp);
+
+		if (slabp->pfmemalloc)
+			set_obj_pfmemalloc(&objp);
+	}
+
+	ac->entry[ac->avail++] = objp;
+}
+
 /*
  * Transfer objects in one arraycache to another.
  * Locking must be handled by the caller.
@@ -1111,7 +1224,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 			STATS_INC_ACOVERFLOW(cachep);
 			__drain_alien_cache(cachep, alien, nodeid);
 		}
-		alien->entry[alien->avail++] = objp;
+		ac_put_obj(cachep, alien, objp);
 		spin_unlock(&alien->lock);
 	} else {
 		spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1719,7 +1832,8 @@ __initcall(cpucache_init);
  * did not request dmaable memory, we might get it, but that
  * would be relatively rare and ignorable.
  */
-static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+		bool *pfmemalloc)
 {
 	struct page *page;
 	int nr_pages;
@@ -1740,6 +1854,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
 	if (!page)
 		return NULL;
+	*pfmemalloc = page->pfmemalloc;
 
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
@@ -2172,7 +2287,7 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
 	cpu_cache_get(cachep)->avail = 0;
 	cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
 	cpu_cache_get(cachep)->batchcount = 1;
-	cpu_cache_get(cachep)->touched = 0;
+	cpu_cache_get(cachep)->touched = false;
 	cachep->batchcount = 1;
 	cachep->limit = BOOT_CPUCACHE_ENTRIES;
 	return 0;
@@ -2730,6 +2845,7 @@ static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
 	slabp->s_mem = objp + colour_off;
 	slabp->nodeid = nodeid;
 	slabp->free = 0;
+	slabp->pfmemalloc = false;
 	return slabp;
 }
 
@@ -2861,7 +2977,7 @@ static void slab_map_pages(struct kmem_cache *cache, struct slab *slab,
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
 static int cache_grow(struct kmem_cache *cachep,
-		gfp_t flags, int nodeid, void *objp)
+		gfp_t flags, int nodeid, void *objp, bool pfmemalloc)
 {
 	struct slab *slabp;
 	size_t offset;
@@ -2905,7 +3021,7 @@ static int cache_grow(struct kmem_cache *cachep,
 	 * 'nodeid'.
 	 */
 	if (!objp)
-		objp = kmem_getpages(cachep, local_flags, nodeid);
+		objp = kmem_getpages(cachep, local_flags, nodeid, &pfmemalloc);
 	if (!objp)
 		goto failed;
 
@@ -2915,6 +3031,13 @@ static int cache_grow(struct kmem_cache *cachep,
 	if (!slabp)
 		goto opps1;
 
+	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	if (pfmemalloc) {
+		struct array_cache *ac = cpu_cache_get(cachep);
+		slabp->pfmemalloc = true;
+		ac->pfmemalloc = true;
+	}
+
 	slab_map_pages(cachep, slabp, objp);
 
 	cache_init_objs(cachep, slabp);
@@ -3056,16 +3179,19 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+							bool force_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
 	struct array_cache *ac;
 	int node;
 
-retry:
 	check_irq_off();
 	node = numa_mem_id();
+	if (unlikely(force_refill))
+		goto force_grow;
+retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3083,7 +3209,7 @@ retry:
 
 	/* See if we can refill from the shared array */
 	if (l3->shared && transfer_objects(ac, l3->shared, batchcount)) {
-		l3->shared->touched = 1;
+		l3->shared->touched = true;
 		goto alloc_done;
 	}
 
@@ -3115,8 +3241,8 @@ retry:
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
-							    node);
+			ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+									node));
 		}
 		check_slabp(cachep, slabp);
 
@@ -3135,18 +3261,25 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
-		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
+force_grow:
+		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL, false);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || force_refill))
 			return NULL;
 
-		if (!ac->avail)		/* objects refilled by interrupt? */
+		/* objects refilled by interrupt? */
+		if (!ac->avail) {
+			node = numa_node_id();
 			goto retry;
+		}
 	}
-	ac->touched = 1;
-	return ac->entry[--ac->avail];
+	ac->touched = true;
+
+	return ac_get_obj(cachep, ac, flags, force_refill);
 }
 
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3228,23 +3361,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
+	bool force_refill = false;
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
-		STATS_INC_ALLOCHIT(cachep);
-		ac->touched = 1;
-		objp = ac->entry[--ac->avail];
-	} else {
-		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		ac->touched = true;
+		objp = ac_get_obj(cachep, ac, flags, false);
+
 		/*
-		 * the 'ac' may be updated by cache_alloc_refill(),
-		 * and kmemleak_erase() requires its correct value.
+		 * Allow for the possibility all avail objects are not allowed
+		 * by the current flags
 		 */
-		ac = cpu_cache_get(cachep);
+		if (objp) {
+			STATS_INC_ALLOCHIT(cachep);
+			goto out;
+		}
+		force_refill = true;
 	}
+
+	STATS_INC_ALLOCMISS(cachep);
+	objp = cache_alloc_refill(cachep, flags, force_refill);
+	/*
+	 * the 'ac' may be updated by cache_alloc_refill(),
+	 * and kmemleak_erase() requires its correct value.
+	 */
+	ac = cpu_cache_get(cachep);
+
+out:
 	/*
 	 * To avoid a false negative, if an object that is in one of the
 	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3297,6 +3442,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	int nid;
+	bool pfmemalloc;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
@@ -3333,7 +3479,8 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, numa_mem_id());
+		obj = kmem_getpages(cache, local_flags, numa_mem_id(),
+							&pfmemalloc);
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
@@ -3341,7 +3488,7 @@ retry:
 			 * Insert into the appropriate per node queues
 			 */
 			nid = page_to_nid(virt_to_page(obj));
-			if (cache_grow(cache, flags, nid, obj)) {
+			if (cache_grow(cache, flags, nid, obj, pfmemalloc)) {
 				obj = ____cache_alloc_node(cache,
 					flags | GFP_THISNODE, nid);
 				if (!obj)
@@ -3413,7 +3560,7 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
-	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
+	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL, false);
 	if (x)
 		goto retry;
 
@@ -3563,9 +3710,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 	struct kmem_list3 *l3;
 
 	for (i = 0; i < nr_objects; i++) {
-		void *objp = objpp[i];
+		void *objp;
 		struct slab *slabp;
 
+		clear_obj_pfmemalloc(&objpp[i]);
+		objp = objpp[i];
+
 		slabp = virt_to_slab(objp);
 		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
@@ -3678,12 +3828,12 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 
 	if (likely(ac->avail < ac->limit)) {
 		STATS_INC_FREEHIT(cachep);
-		ac->entry[ac->avail++] = objp;
+		ac_put_obj(cachep, ac, objp);
 		return;
 	} else {
 		STATS_INC_FREEMISS(cachep);
 		cache_flusharray(cachep, ac);
-		ac->entry[ac->avail++] = objp;
+		ac_put_obj(cachep, ac, objp);
 	}
 }
 
@@ -4110,7 +4260,7 @@ static void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3,
 	if (!ac || !ac->avail)
 		return;
 	if (ac->touched && !force) {
-		ac->touched = 0;
+		ac->touched = false;
 	} else {
 		spin_lock_irq(&l3->list_lock);
 		if (ac->avail) {
diff --git a/mm/slub.c b/mm/slub.c
index 7c54fe8..8bf91a6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -32,6 +32,8 @@
 
 #include <trace/events/kmem.h>
 
+#include "internal.h"
+
 /*
  * Lock order:
  *   1. slub_lock (Global Semaphore)
@@ -1414,7 +1416,8 @@ static void setup_object(struct kmem_cache *s, struct page *page,
 		s->ctor(object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node,
+							bool *pfmemalloc)
 {
 	struct page *page;
 	void *start;
@@ -1429,6 +1432,7 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 		goto out;
 
 	inc_slabs_node(s, page_to_nid(page), page->objects);
+	*pfmemalloc = page->pfmemalloc;
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
 
@@ -2027,6 +2031,14 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
 	}
 }
 
+static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
+{
+	if (unlikely(c->pfmemalloc))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
 /*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
@@ -2053,6 +2065,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	unsigned long flags;
 	struct page new;
 	unsigned long counters;
+	bool pfmemalloc = false;
 
 	local_irq_save(flags);
 #ifdef CONFIG_PREEMPT
@@ -2077,6 +2090,16 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto new_slab;
 	}
 
+	/*
+	 * By rights, we should be searching for a slab page that was
+	 * PFMEMALLOC but right now, we are losing the pfmemalloc
+	 * information when the page leaves the per-cpu allocator
+	 */
+	if (unlikely(!pfmemalloc_match(c, gfpflags))) {
+		deactivate_slab(s, c);
+		goto new_slab;
+	}
+
 	stat(s, ALLOC_SLOWPATH);
 
 	do {
@@ -2129,7 +2152,7 @@ new_slab:
 		goto load_freelist;
 	}
 
-	page = new_slab(s, gfpflags, node);
+	page = new_slab(s, gfpflags, node, &pfmemalloc);
 
 	if (page) {
 		c = __this_cpu_ptr(s->cpu_slab);
@@ -2147,6 +2170,7 @@ new_slab:
 		stat(s, ALLOC_SLAB);
 		c->node = page_to_nid(page);
 		c->page = page;
+		c->pfmemalloc = pfmemalloc;
 
 		if (kmem_cache_debug(s))
 			goto debug;
@@ -2209,8 +2233,8 @@ redo:
 	barrier();
 
 	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
-
+	if (unlikely(!object || !node_match(c, node) ||
+					!pfmemalloc_match(c, gfpflags)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
@@ -2669,10 +2693,11 @@ static void early_kmem_cache_node_alloc(int node)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
+	bool pfmemalloc;	/* Ignore this early in boot */
 
 	BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
+	page = new_slab(kmem_cache_node, GFP_NOWAIT, node, &pfmemalloc);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
-- 
1.7.3.4


* [PATCH 01/14] mm: Serialize access to min_free_kbytes
From: Mel Gorman @ 2011-10-06 12:41 UTC
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>

There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement.  Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.

[a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   23 +++++++++++++++--------
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..9d8bd0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5113,14 +5113,7 @@ static void setup_per_zone_lowmem_reserve(void)
 	calculate_totalreserve_pages();
 }
 
-/**
- * setup_per_zone_wmarks - called when min_free_kbytes changes
- * or when memory is hot-{added|removed}
- *
- * Ensures that the watermark[min,low,high] values for each zone are set
- * correctly with respect to min_free_kbytes.
- */
-void setup_per_zone_wmarks(void)
+static void __setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -5175,6 +5168,20 @@ void setup_per_zone_wmarks(void)
 	calculate_totalreserve_pages();
 }
 
+/**
+ * setup_per_zone_wmarks - called when min_free_kbytes changes
+ * or when memory is hot-{added|removed}
+ *
+ * Ensures that the watermark[min,low,high] values for each zone are set
+ * correctly with respect to min_free_kbytes.
+ */
+void setup_per_zone_wmarks(void)
+{
+	mutex_lock(&zonelists_mutex);
+	__setup_per_zone_wmarks();
+	mutex_unlock(&zonelists_mutex);
+}
+
 /*
  * The inactive anon list should be small enough that the VM never has to
  * do too much work, but large enough that each inactive page has a chance
-- 
1.7.3.4


^ permalink raw reply related

* [PATCH 00/14] Swap-over-NBD without deadlocking V7
From: Mel Gorman @ 2011-10-06 12:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mel Gorman

Testing rattled out a few bugs in process throttling. Otherwise,
very little has changed.

Changelog since V6
  o Rebase to 3.1-rc8
  o Use wake_up instead of wake_up_interruptible()
  o Do not throttle kernel threads
  o Avoid a potential race between kswapd going to sleep and processes being
    throttled

Changelog since V5
  o Rebase to 3.1-rc5

Changelog since V4
  o Update comment clarifying what protocols can be used		(Michal)
  o Rebase to 3.0-rc3

Changelog since V3
  o Propagate pfmemalloc from packet fragment pages to skb		(Neil)
  o Rebase to 3.0-rc2

Changelog since V2
  o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC		(Neil)
  o Use wait_event_interruptible					(Neil)
  o Use !! when casting to bool to avoid any possibility of type
    truncation								(Neil)
  o Nicer logic when using skb_pfmemalloc_protocol			(Neil)

Changelog since V1
  o Rebase on top of mmotm
  o Use atomic_t for memalloc_socks		(David Miller)
  o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
  o Check throttle within prepare_to_wait	(Neil Brown)
  o Add statistics on throttling instead of printk

When a user or administrator requires swap for their application, they
create a swap partition or file, format it with mkswap and activate
it with swapon. Swap over the network is considered as an option in
diskless systems. The two likely scenarios are blade servers used as
part of a cluster, where the form factor or maintenance costs do not
allow the use of disks, and thin clients.

The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
.  There are also documentation and tutorials
on how to set up swap over NBD at places like
https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP .  The
nbd-client also documents the use of NBD as swap. Despite this, the
fact is that a machine using NBD for swap can deadlock within minutes
if swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools
like normal block devices do. As the host cannot control where it
receives packets from, it cannot reliably work out in advance how
much memory it might need.

Some years ago, Peter Zijlstra developed a series of patches that
supported swap over NFS and that some distributions are carrying in
their kernels. This patch series borrows very heavily from Peter's
work to support swapping over NBD as a pre-requisite to supporting
swap-over-NFS. The bulk of the complexity is concerned with preserving
memory that is allocated from the PFMEMALLOC reserves for use by the
network layer which is needed for both NBD and NFS.

Patch 1 serialises access to min_free_kbytes. It's not strictly needed
	by this series but as the series cares about watermarks in
	general, it's a harmless fix. It could be merged independently.

Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patches 6-10 allow network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean
	pages. If packets are received and stored in pages that were
	allocated under low-memory situations and are unrelated to
	the VM, the packets are dropped.

Patch 11 is a micro-optimisation to avoid a function call in the
	common case.

Patch 12 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 13 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get
	throttled on a waitqueue if 50% of the PFMEMALLOC reserves are
	depleted.  It is expected that kswapd and the direct reclaimers
	already running will clean enough pages for the low watermark
	to be reached and the throttled processes are woken up.

Patch 14 adds a statistic to track how often processes get throttled.
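
As a rough illustration of the throttling rule described for Patch 13,
the sketch below sums the min watermarks of a set of zones as a stand-in
for the PFMEMALLOC reserve and reports whether free pages are still above
half of it. The struct and function names are hypothetical and are not
taken from the series; it only shows the shape of the 50% check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a zone; not the kernel's struct zone. */
struct zone_sketch {
	unsigned long min_wmark_pages;	/* pages held back as the reserve */
	unsigned long free_pages;	/* pages currently free */
};

/*
 * Returns true while free pages exceed half of the summed reserve,
 * i.e. direct reclaimers would not need to be throttled yet.
 */
static bool pfmemalloc_reserve_ok(const struct zone_sketch *zones, size_t nr)
{
	unsigned long reserve = 0, free_pages = 0;
	size_t i;

	assert(zones != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		reserve += zones[i].min_wmark_pages;
		free_pages += zones[i].free_pages;
	}

	return free_pages > reserve / 2;
}
```

A caller in the direct reclaim path would sleep on a waitqueue while this
returns false and be woken once reclaim has freed enough pages.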

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant
performance variances. Here are the results from netperf using
slab as an example.

NETPERF UDP
                   netperf-udp       udp-swapnbd
                  vanilla-slab         v4r3-slab
      64   237.47 ( 0.00%)    237.34 (-0.05%) 
     128   472.69 ( 0.00%)    465.96 (-1.44%) 
     256   926.82 ( 0.00%)    948.40 ( 2.28%) 
    1024  3260.08 ( 0.00%)   3266.50 ( 0.20%) 
    2048  5535.11 ( 0.00%)   5453.55 (-1.50%) 
    3312  7496.60 ( 0.00%)*  7574.44 ( 1.03%) 
             1.12%             1.00%        
    4096  8266.35 ( 0.00%)*  8240.06 (-0.32%)*
             1.18%             1.49%        
    8192 11026.01 ( 0.00%)  11010.44 (-0.14%) 
   16384 14653.98 ( 0.00%)  14666.97 ( 0.09%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       2156.64   1873.27
Total Elapsed Time (seconds)               2570.09   2234.10

NETPERF TCP
                   netperf-tcp       tcp-swapnbd
                  vanilla-slab         v4r3-slab
      64  1250.76 ( 0.00%)   1256.52 ( 0.46%) 
     128  2290.70 ( 0.00%)   2336.43 ( 1.96%) 
     256  3668.42 ( 0.00%)   3751.17 ( 2.21%) 
    1024  7214.33 ( 0.00%)   7237.23 ( 0.32%) 
    2048  8230.01 ( 0.00%)   8280.02 ( 0.60%) 
    3312  8634.95 ( 0.00%)   8758.62 ( 1.41%) 
    4096  8851.18 ( 0.00%)   9045.88 ( 2.15%) 
    8192 10067.59 ( 0.00%)  10263.30 ( 1.91%) 
   16384 11523.26 ( 0.00%)  11654.78 ( 1.13%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       1450.23    1389.8
Total Elapsed Time (seconds)               1450.41   1390.35

Here is the equivalent test for SLUB

                   netperf-udp       udp-swapnbd
                  vanilla-slub         v4r3-slub
      64   235.33 ( 0.00%)    237.80 ( 1.04%) 
     128   465.92 ( 0.00%)    469.98 ( 0.86%) 
     256   907.16 ( 0.00%)    907.58 ( 0.05%) 
    1024  3240.25 ( 0.00%)   3255.56 ( 0.47%) 
    2048  5564.87 ( 0.00%)   5446.46 (-2.17%) 
    3312  7427.65 ( 0.00%)*  7650.00 ( 2.91%) 
             1.33%             1.00%        
    4096  8004.51 ( 0.00%)*  8132.79 ( 1.58%)*
             1.05%             1.21%        
    8192 11079.60 ( 0.00%)  10927.09 (-1.40%) 
   16384 14737.38 ( 0.00%)  15019.50 ( 1.88%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       2056.21   2160.38
Total Elapsed Time (seconds)               2426.09   2498.16

NETPERF TCP
                   netperf-tcp       tcp-swapnbd
                  vanilla-slub         v4r3-slub
      64  1251.64 ( 0.00%)   1262.89 ( 0.89%) 
     128  2289.88 ( 0.00%)   2332.94 ( 1.85%) 
     256  3654.34 ( 0.00%)   3736.48 ( 2.20%) 
    1024  7192.47 ( 0.00%)   7286.96 ( 1.30%) 
    2048  8243.55 ( 0.00%)   8291.50 ( 0.58%) 
    3312  8664.16 ( 0.00%)   8799.88 ( 1.54%) 
    4096  8869.13 ( 0.00%)   9018.12 ( 1.65%) 
    8192 10009.53 ( 0.00%)  10214.26 ( 2.00%) 
   16384 11470.78 ( 0.00%)  11685.20 ( 1.83%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       1368.28   1511.81
Total Elapsed Time (seconds)               1370.33   1510.42

Time to completion varied a lot but this can happen with netperf as
it tries to find results within a sufficiently high confidence. There
were some small gains and losses but they are close to the variances
seen between kernel releases.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure. Without the patches, the machine locks up within
minutes; with them applied, the test runs to completion.
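
A workload of this kind can be approximated with a small userspace
program. The following is only a sketch and not the test that was actually
run: the function names are hypothetical, and it dirties each page once
before reading (an assumption, so that the pages have to go to swap rather
than being backed by the shared zero page).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map `bytes` of anonymous memory and read it linearly `passes` times. */
static int hog_one(size_t bytes, int passes)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	volatile unsigned char *p;
	unsigned long sum = 0;
	size_t off;
	void *m;
	int i;

	assert(bytes > 0 && page > 0 && passes >= 0);

	m = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED)
		return -1;
	p = m;

	/* Dirty one byte per page so each page needs swap backing. */
	for (off = 0; off < bytes; off += page)
		p[off] = 1;

	/* Linear read passes over the whole mapping. */
	for (i = 0; i < passes; i++)
		for (off = 0; off < bytes; off += page)
			sum += p[off];

	munmap(m, bytes);

	/* Every touched page must have read back the value written. */
	return sum == (unsigned long)passes * ((bytes + page - 1) / page) ? 0 : -1;
}

/* Fork `nproc` children that each run hog_one(); 0 if all succeeded. */
static int run_hogs(int nproc, size_t bytes_each, int passes)
{
	int i, status, started = 0, failed = 0;
	pid_t pid;

	for (i = 0; i < nproc; i++) {
		pid = fork();
		if (pid < 0) {
			failed = 1;
			break;
		}
		if (pid == 0)
			_exit(hog_one(bytes_each, passes) ? 1 : 0);
		started++;
	}

	/* Reap exactly the children that were started. */
	for (i = 0; i < started; i++) {
		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed = 1;
	}

	return failed ? -1 : 0;
}
```

To mirror the description above, nproc would be 8 times the value of
sysconf(_SC_NPROCESSORS_ONLN) and bytes_each would be chosen so that the
mappings add up to four times physical memory.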

 drivers/block/nbd.c             |    7 +-
 include/linux/gfp.h             |   13 ++-
 include/linux/mm_types.h        |    9 ++
 include/linux/mmzone.h          |    1 +
 include/linux/sched.h           |    7 +
 include/linux/skbuff.h          |   21 +++-
 include/linux/slub_def.h        |    1 +
 include/linux/vm_event_item.h   |    1 +
 include/net/sock.h              |   19 +++
 include/trace/events/gfpflags.h |    1 +
 kernel/softirq.c                |    3 +
 mm/page_alloc.c                 |   57 +++++++--
 mm/slab.c                       |  240 +++++++++++++++++++++++++++++++++------
 mm/slub.c                       |   35 +++++-
 mm/vmscan.c                     |   72 ++++++++++++
 mm/vmstat.c                     |    1 +
 net/core/dev.c                  |   48 +++++++-
 net/core/filter.c               |    8 ++
 net/core/skbuff.c               |   95 +++++++++++++---
 net/core/sock.c                 |   42 +++++++
 net/ipv4/tcp.c                  |    3 +-
 net/ipv4/tcp_output.c           |   13 +-
 net/ipv6/tcp_ipv6.c             |   12 ++-
 23 files changed, 623 insertions(+), 86 deletions(-)

-- 
1.7.3.4


^ permalink raw reply

* __net_exit bogusly defined as __exit_refok ?
From: Jan Beulich @ 2011-10-06 11:55 UTC (permalink / raw)
  To: davem, xemul; +Cc: netdev

Realizing that this has been this way for a rather long time, I still wonder
why it was done that way: __exit_refok (evaluating to __ref) allows these
functions to reference __init functions and __initdata objects (which is
wrong, since those can get called in the context of __exit code, at which
point .init.* sections are already gone).

Second, __exit_refok results in the code not being discarded at all
(with the original patch's description wrongly indicating that without
NET_NS the exit functions would never be called - they get called from
unregister_pernet_operations(), which generally gets invoked from
modules' __exit sections), which is the same as if no section
placement annotation was present.

Thus, rather than being the only user of __exit_refok (which by itself
is a dubious construct), it would seem to make more sense to make
__net_exit resolve to nothing regardless of NET_NS (short of going
through the code and removing all uses of it) and delete __exit_refok.

One alternative might be to make __net_exit at least resolve to
__init_or_module, as __exit functions won't be called without
MODULES. Or really, you'd want something that resolves to __init
when built into the kernel, and to nothing when built as a module.
Both, however, would require some adjustments to modpost's
section mismatch checking.

Jan

^ permalink raw reply

* Re: A new 40G Network driver ready to submit to the kernel tree
From: Francois Romieu @ 2011-10-06 11:00 UTC (permalink / raw)
  To: Joyce Yu - System Software; +Cc: netdev
In-Reply-To: <4E8CE4A0.4060202@oracle.com>

Joyce Yu - System Software <joyce.yu@oracle.com> :
[...]
> I have a new 40G Network driver ready to submit to the kernel tree.
> The driver has been ported to latest linux-3.0-rc5 and
> net-2.6-353e5c9 tree.

It is a bit old. It could be better to rebase it against David Miller's
net-next tree as the drivers/net/ethernet device tree has undergone
some changes.

The net-next tree is currently available at:

git://github.com/davem330/net-next.git

> The driver versions for 2.6.18 and 2.6.32 based kernel have been
> fully tested and released to the customer.  Shall I just send the
> driverxx.c and driverxx.h for net-2.6-353e5c9 and linux-3.0-rc5
> to this alias?

The remarks from 04/11 are still relevant but it will be a good start.

Did it go through internal review by some of the usual Linux folks at
Oracle (hint, hint)?

-- 
Ueimor

^ permalink raw reply

* [net-next 8/9] igb: Alternate MAC Address EEPROM Updates
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: Akeem G. Abodunrin, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: "Akeem G. Abodunrin" <akeem.g.abodunrin@intel.com>

This code checks word 0x37 in the EEPROM; if it is 0xFFFF _or_ 0x0000, then
there is no Alternate MAC Address in the EEPROM.

Signed-off-by: "Akeem G. Abodunrin" <akeem.g.abodunrin@intel.com>
Tested-by:  Aaron Brown  <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_mac.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_mac.c b/drivers/net/ethernet/intel/igb/e1000_mac.c
index 2b5ef76..7907183 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mac.c
+++ b/drivers/net/ethernet/intel/igb/e1000_mac.c
@@ -198,10 +198,10 @@ s32 igb_check_alt_mac_addr(struct e1000_hw *hw)
 		goto out;
 	}

-	if (nvm_alt_mac_addr_offset == 0xFFFF) {
+	if ((nvm_alt_mac_addr_offset == 0xFFFF) ||
+	    (nvm_alt_mac_addr_offset == 0x0000))
 		/* There is no Alternate MAC Address */
 		goto out;
-	}

 	if (hw->bus.func == E1000_FUNC_1)
 		nvm_alt_mac_addr_offset += E1000_ALT_MAC_ADDRESS_OFFSET_LAN1;
-- 
1.7.6.4

^ permalink raw reply related

* [net-next 9/9] igb: Alternate MAC Address Updates for Func2&3
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: Akeem G. Abodunrin, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: "Akeem G. Abodunrin" <akeem.g.abodunrin@intel.com>

Previously, only function 1 had support for an Alternate MAC Address in the
EEPROM. This update allows functions 2 and 3 to support an Alternate MAC
Address in the EEPROM as well.

Signed-off-by: "Akeem G. Abodunrin" <akeem.g.abodunrin@intel.com>
Tested-by:  Aaron Brown  <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_mac.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_mac.c b/drivers/net/ethernet/intel/igb/e1000_mac.c
index 7907183..872119d 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mac.c
+++ b/drivers/net/ethernet/intel/igb/e1000_mac.c
@@ -205,6 +205,11 @@ s32 igb_check_alt_mac_addr(struct e1000_hw *hw)

 	if (hw->bus.func == E1000_FUNC_1)
 		nvm_alt_mac_addr_offset += E1000_ALT_MAC_ADDRESS_OFFSET_LAN1;
+	if (hw->bus.func == E1000_FUNC_2)
+		nvm_alt_mac_addr_offset += E1000_ALT_MAC_ADDRESS_OFFSET_LAN2;
+
+	if (hw->bus.func == E1000_FUNC_3)
+		nvm_alt_mac_addr_offset += E1000_ALT_MAC_ADDRESS_OFFSET_LAN3;
 	for (i = 0; i < ETH_ALEN; i += 2) {
 		offset = nvm_alt_mac_addr_offset + (i >> 1);
 		ret_val = hw->nvm.ops.read(hw, offset, 1, &nvm_data);
-- 
1.7.6.4

^ permalink raw reply related

* [net-next 7/9] igb: Code to prevent overwriting SFP I2C
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: Akeem G. Abodunrin, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: "Akeem G. Abodunrin" <akeem.g.abodunrin@intel.com>

This patch fixes an "overwrite" problem. Without this fix, SFP I2C EEPROM
data, which is located at A0, can be overwritten by the phy write function.

Signed-off-by: "Akeem G. Abodunrin" <akeem.g.abodunrin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_phy.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_phy.c b/drivers/net/ethernet/intel/igb/e1000_phy.c
index e662554..7edf31e 100644
--- a/drivers/net/ethernet/intel/igb/e1000_phy.c
+++ b/drivers/net/ethernet/intel/igb/e1000_phy.c
@@ -306,6 +306,12 @@ s32 igb_write_phy_reg_i2c(struct e1000_hw *hw, u32 offset, u16 data)
 	u32 i, i2ccmd = 0;
 	u16 phy_data_swapped;

+	/* Prevent overwritting SFP I2C EEPROM which is at A0 address.*/
+	if ((hw->phy.addr == 0) || (hw->phy.addr > 7)) {
+		hw_dbg("PHY I2C Address %d is out of range.\n",
+			  hw->phy.addr);
+		return -E1000_ERR_CONFIG;
+	}

 	/* Swap the data bytes for the I2C interface */
 	phy_data_swapped = ((data >> 8) & 0x00FF) | ((data << 8) & 0xFF00);
-- 
1.7.6.4

^ permalink raw reply related

* [net-next 6/9] ixgbe: X540 devices RX PFC frames pause traffic even if disabled
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: John Fastabend, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: John Fastabend <john.r.fastabend@intel.com>

Receiving PFC (priority flow control) frames while the feature
is off should not pause the traffic class. On the X540 devices
the traffic class reacts to frames if it was previously enabled
because the field is not cleared correctly.
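
The effect of the mask change can be shown with a small standalone
sketch. It assumes the field holds one receive-enable bit per priority
starting at bit 4, which is what the shift and mask values in the diff
below suggest; the helper name is hypothetical and the real driver works
on the MFLCN register through IXGBE_READ_REG/IXGBE_WRITE_REG.

```c
#include <assert.h>
#include <stdint.h>

#define RPFCE_SHIFT	4
#define RPFCE_MASK_OLD	0x00000FE0u	/* bits 5..11: misses bit 4 */
#define RPFCE_MASK_NEW	0x00000FF0u	/* bits 4..11: whole field */

/* Clear the field selected by `mask`, then set the enable bits in pfc_en. */
static uint32_t set_rx_pfc_bits(uint32_t reg, uint8_t pfc_en, uint32_t mask)
{
	/* Only masks inside bits 4..11 make sense for this sketch. */
	assert((mask & ~0x00000FF0u) == 0);

	reg &= ~mask;
	reg |= (uint32_t)pfc_en << RPFCE_SHIFT;
	return reg;
}
```

With priority 0 previously enabled (reg = 0x10) and pfc_en = 0, the old
mask leaves bit 4 set while the new mask clears the whole field.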

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c |   12 +++++++++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h      |    2 +-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c
index 45fe710..32cd97b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c
@@ -271,13 +271,23 @@ s32 ixgbe_dcb_config_pfc_82599(struct ixgbe_hw *hw, u8 pfc_en, u8 *prio_tc)
 		reg |= IXGBE_MFLCN_RPFCE | IXGBE_MFLCN_DPF;
 
 		if (hw->mac.type == ixgbe_mac_X540) {
-			reg &= ~(IXGBE_MFLCN_RPFCE_MASK | 0x10);
+			reg &= ~IXGBE_MFLCN_RPFCE_MASK;
 			reg |= pfc_en << IXGBE_MFLCN_RPFCE_SHIFT;
 		}
 
 		IXGBE_WRITE_REG(hw, IXGBE_MFLCN, reg);
 
 	} else {
+		/* X540 devices have a RX bit that should be cleared
+		 * if PFC is disabled on all TCs but PFC features is
+		 * enabled.
+		 */
+		if (hw->mac.type == ixgbe_mac_X540) {
+			reg = IXGBE_READ_REG(hw, IXGBE_MFLCN);
+			reg &= ~IXGBE_MFLCN_RPFCE_MASK;
+			IXGBE_WRITE_REG(hw, IXGBE_MFLCN, reg);
+		}
+
 		for (i = 0; i < MAX_TRAFFIC_CLASS; i++)
 			hw->mac.ops.fc_enable(hw, i);
 	}
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index 4ea909c..d1d6894 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -1850,7 +1850,7 @@ enum {
 #define IXGBE_MFLCN_DPF         0x00000002 /* Discard Pause Frame */
 #define IXGBE_MFLCN_RPFCE       0x00000004 /* Receive Priority FC Enable */
 #define IXGBE_MFLCN_RFCE        0x00000008 /* Receive FC Enable */
-#define IXGBE_MFLCN_RPFCE_MASK	0x00000FE0 /* Receive FC Mask */
+#define IXGBE_MFLCN_RPFCE_MASK	0x00000FF0 /* Receive FC Mask */
 
 #define IXGBE_MFLCN_RPFCE_SHIFT		 4
 
-- 
1.7.6.4

^ permalink raw reply related

* [net-next 5/9] ixgbe: DCB X540 devices support max traffic class of 4
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: John Fastabend, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: John Fastabend <john.r.fastabend@intel.com>

X540 devices can only support up to 4 traffic classes and
guarantee a "lossless" traffic class on some platforms.
This patch sets the X540 devices to initialize a max
traffic class value of 4 at probe time.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 ++++++++++++++++++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h |    1 +
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2b8ff95..1f936c8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4990,17 +4990,32 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 	spin_lock_init(&adapter->fdir_perfect_lock);
 
 #ifdef CONFIG_IXGBE_DCB
+	switch (hw->mac.type) {
+	case ixgbe_mac_X540:
+		adapter->dcb_cfg.num_tcs.pg_tcs = X540_TRAFFIC_CLASS;
+		adapter->dcb_cfg.num_tcs.pfc_tcs = X540_TRAFFIC_CLASS;
+		break;
+	default:
+		adapter->dcb_cfg.num_tcs.pg_tcs = MAX_TRAFFIC_CLASS;
+		adapter->dcb_cfg.num_tcs.pfc_tcs = MAX_TRAFFIC_CLASS;
+		break;
+	}
+
 	/* Configure DCB traffic classes */
 	for (j = 0; j < MAX_TRAFFIC_CLASS; j++) {
 		tc = &adapter->dcb_cfg.tc_config[j];
 		tc->path[DCB_TX_CONFIG].bwg_id = 0;
 		tc->path[DCB_TX_CONFIG].bwg_percent = 12 + (j & 1);
-		tc->path[DCB_TX_CONFIG].up_to_tc_bitmap = 1 << j;
 		tc->path[DCB_RX_CONFIG].bwg_id = 0;
 		tc->path[DCB_RX_CONFIG].bwg_percent = 12 + (j & 1);
-		tc->path[DCB_RX_CONFIG].up_to_tc_bitmap = 1 << j;
 		tc->dcb_pfc = pfc_disabled;
 	}
+
+	/* Initialize default user to priority mapping, UPx->TC0 */
+	tc = &adapter->dcb_cfg.tc_config[0];
+	tc->path[DCB_TX_CONFIG].up_to_tc_bitmap = 0xFF;
+	tc->path[DCB_RX_CONFIG].up_to_tc_bitmap = 0xFF;
+
 	adapter->dcb_cfg.bw_percentage[DCB_TX_CONFIG][0] = 100;
 	adapter->dcb_cfg.bw_percentage[DCB_RX_CONFIG][0] = 100;
 	adapter->dcb_cfg.pfc_mode_enable = false;
@@ -7019,7 +7034,7 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
 	}
 
 	/* Hardware supports up to 8 traffic classes */
-	if (tc > MAX_TRAFFIC_CLASS ||
+	if (tc > adapter->dcb_cfg.num_tcs.pg_tcs ||
 	    (hw->mac.type == ixgbe_mac_82598EB && tc < MAX_TRAFFIC_CLASS))
 		return -EINVAL;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index baad0cb..4ea909c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -406,6 +406,7 @@
 
 /* DCB registers */
 #define MAX_TRAFFIC_CLASS        8
+#define X540_TRAFFIC_CLASS       4
 #define IXGBE_RMCS      0x03D00
 #define IXGBE_DPMCS     0x07F40
 #define IXGBE_PDPMCS    0x0CD00
-- 
1.7.6.4

^ permalink raw reply related

* [net-next 4/9] ixgbe: fixup hard dependencies on supporting 8 traffic classes
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: John Fastabend, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: John Fastabend <john.r.fastabend@intel.com>

This patch correctly configures DCB when fewer than 8 traffic classes
are available in hardware.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c       |   20 +++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.h       |    3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c |   38 ++++++++++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.h |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c    |   60 +++++++++++++++-----
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   12 +++-
 6 files changed, 101 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c
index 3d44b15..318caf4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c
@@ -231,6 +231,18 @@ void ixgbe_dcb_unpack_prio(struct ixgbe_dcb_config *cfg, int direction,
 	}
 }
 
+void ixgbe_dcb_unpack_map(struct ixgbe_dcb_config *cfg, int direction, u8 *map)
+{
+	int i, up;
+	unsigned long bitmap;
+
+	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+		bitmap = cfg->tc_config[i].path[direction].up_to_tc_bitmap;
+		for_each_set_bit(up, &bitmap, MAX_USER_PRIORITY)
+			map[up] = i;
+	}
+}
+
 /**
  * ixgbe_dcb_hw_config - Config and enable DCB
  * @hw: pointer to hardware structure
@@ -245,10 +257,9 @@ s32 ixgbe_dcb_hw_config(struct ixgbe_hw *hw,
 	u8 pfc_en;
 	u8 ptype[MAX_TRAFFIC_CLASS];
 	u8 bwgid[MAX_TRAFFIC_CLASS];
+	u8 prio_tc[MAX_TRAFFIC_CLASS];
 	u16 refill[MAX_TRAFFIC_CLASS];
 	u16 max[MAX_TRAFFIC_CLASS];
-	/* CEE does not define a priority to tc mapping so map 1:1 */
-	u8 prio_tc[MAX_TRAFFIC_CLASS] = {0, 1, 2, 3, 4, 5, 6, 7};
 
 	/* Unpack CEE standard containers */
 	ixgbe_dcb_unpack_pfc(dcb_config, &pfc_en);
@@ -256,6 +267,7 @@ s32 ixgbe_dcb_hw_config(struct ixgbe_hw *hw,
 	ixgbe_dcb_unpack_max(dcb_config, max);
 	ixgbe_dcb_unpack_bwgid(dcb_config, DCB_TX_CONFIG, bwgid);
 	ixgbe_dcb_unpack_prio(dcb_config, DCB_TX_CONFIG, ptype);
+	ixgbe_dcb_unpack_map(dcb_config, DCB_TX_CONFIG, prio_tc);
 
 	switch (hw->mac.type) {
 	case ixgbe_mac_82598EB:
@@ -274,7 +286,7 @@ s32 ixgbe_dcb_hw_config(struct ixgbe_hw *hw,
 }
 
 /* Helper routines to abstract HW specifics from DCB netlink ops */
-s32 ixgbe_dcb_hw_pfc_config(struct ixgbe_hw *hw, u8 pfc_en)
+s32 ixgbe_dcb_hw_pfc_config(struct ixgbe_hw *hw, u8 pfc_en, u8 *prio_tc)
 {
 	int ret = -EINVAL;
 
@@ -284,7 +296,7 @@ s32 ixgbe_dcb_hw_pfc_config(struct ixgbe_hw *hw, u8 pfc_en)
 		break;
 	case ixgbe_mac_82599EB:
 	case ixgbe_mac_X540:
-		ret = ixgbe_dcb_config_pfc_82599(hw, pfc_en);
+		ret = ixgbe_dcb_config_pfc_82599(hw, pfc_en, prio_tc);
 		break;
 	default:
 		break;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.h
index df095a9..e162775 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.h
@@ -145,6 +145,7 @@ void ixgbe_dcb_unpack_refill(struct ixgbe_dcb_config *, int, u16 *);
 void ixgbe_dcb_unpack_max(struct ixgbe_dcb_config *, u16 *);
 void ixgbe_dcb_unpack_bwgid(struct ixgbe_dcb_config *, int, u8 *);
 void ixgbe_dcb_unpack_prio(struct ixgbe_dcb_config *, int, u8 *);
+void ixgbe_dcb_unpack_map(struct ixgbe_dcb_config *, int, u8 *);
 
 /* DCB credits calculation */
 s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_hw *,
@@ -154,7 +155,7 @@ s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_hw *,
 s32 ixgbe_dcb_hw_ets(struct ixgbe_hw *hw, struct ieee_ets *ets, int max);
 s32 ixgbe_dcb_hw_ets_config(struct ixgbe_hw *hw, u16 *refill, u16 *max,
 			    u8 *bwg_id, u8 *prio_type, u8 *tc_prio);
-s32 ixgbe_dcb_hw_pfc_config(struct ixgbe_hw *hw, u8 pfc_en);
+s32 ixgbe_dcb_hw_pfc_config(struct ixgbe_hw *hw, u8 pfc_en, u8 *tc_prio);
 s32 ixgbe_dcb_hw_config(struct ixgbe_hw *, struct ixgbe_dcb_config *);
 
 /* DCB definitions for credit calculation */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c
index 02f6724..45fe710 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c
@@ -59,9 +59,9 @@ s32 ixgbe_dcb_config_rx_arbiter_82599(struct ixgbe_hw *hw,
 	reg = IXGBE_RTRPCS_RRM | IXGBE_RTRPCS_RAC | IXGBE_RTRPCS_ARBDIS;
 	IXGBE_WRITE_REG(hw, IXGBE_RTRPCS, reg);
 
-	/* Map all traffic classes to their UP, 1 to 1 */
+	/* Map all traffic classes to their UP */
 	reg = 0;
-	for (i = 0; i < MAX_TRAFFIC_CLASS; i++)
+	for (i = 0; i < MAX_USER_PRIORITY; i++)
 		reg |= (prio_tc[i] << (i * IXGBE_RTRUP2TC_UP_SHIFT));
 	IXGBE_WRITE_REG(hw, IXGBE_RTRUP2TC, reg);
 
@@ -169,9 +169,9 @@ s32 ixgbe_dcb_config_tx_data_arbiter_82599(struct ixgbe_hw *hw,
 	      IXGBE_RTTPCS_ARBDIS;
 	IXGBE_WRITE_REG(hw, IXGBE_RTTPCS, reg);
 
-	/* Map all traffic classes to their UP, 1 to 1 */
+	/* Map all traffic classes to their UP */
 	reg = 0;
-	for (i = 0; i < MAX_TRAFFIC_CLASS; i++)
+	for (i = 0; i < MAX_USER_PRIORITY; i++)
 		reg |= (prio_tc[i] << (i * IXGBE_RTTUP2TC_UP_SHIFT));
 	IXGBE_WRITE_REG(hw, IXGBE_RTTUP2TC, reg);
 
@@ -205,16 +205,36 @@ s32 ixgbe_dcb_config_tx_data_arbiter_82599(struct ixgbe_hw *hw,
  * ixgbe_dcb_config_pfc_82599 - Configure priority flow control
  * @hw: pointer to hardware structure
  * @pfc_en: enabled pfc bitmask
+ * @prio_tc: priority to tc assignments indexed by priority
  *
  * Configure Priority Flow Control (PFC) for each traffic class.
  */
-s32 ixgbe_dcb_config_pfc_82599(struct ixgbe_hw *hw, u8 pfc_en)
+s32 ixgbe_dcb_config_pfc_82599(struct ixgbe_hw *hw, u8 pfc_en, u8 *prio_tc)
 {
-	u32 i, reg;
+	u32 i, j, reg;
+	u8 max_tc = 0;
+
+	for (i = 0; i < MAX_USER_PRIORITY; i++)
+		if (prio_tc[i] > max_tc)
+			max_tc = prio_tc[i];
 
 	/* Configure PFC Tx thresholds per TC */
 	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
-		int enabled = pfc_en & (1 << i);
+		int enabled = 0;
+
+		if (i > max_tc) {
+			reg = 0;
+			IXGBE_WRITE_REG(hw, IXGBE_FCRTL_82599(i), reg);
+			IXGBE_WRITE_REG(hw, IXGBE_FCRTH_82599(i), reg);
+			continue;
+		}
+
+		for (j = 0; j < MAX_USER_PRIORITY; j++) {
+			if ((prio_tc[j] == i) && (pfc_en & (1 << j))) {
+				enabled = 1;
+				break;
+			}
+		}
 
 		reg = hw->fc.low_water << 10;
 
@@ -251,7 +271,7 @@ s32 ixgbe_dcb_config_pfc_82599(struct ixgbe_hw *hw, u8 pfc_en)
 		reg |= IXGBE_MFLCN_RPFCE | IXGBE_MFLCN_DPF;
 
 		if (hw->mac.type == ixgbe_mac_X540) {
-			reg &= ~IXGBE_MFLCN_RPFCE_MASK;
+			reg &= ~(IXGBE_MFLCN_RPFCE_MASK | 0x10);
 			reg |= pfc_en << IXGBE_MFLCN_RPFCE_SHIFT;
 		}
 
@@ -338,7 +358,7 @@ s32 ixgbe_dcb_hw_config_82599(struct ixgbe_hw *hw, u8 pfc_en, u16 *refill,
 					       bwg_id, prio_type);
 	ixgbe_dcb_config_tx_data_arbiter_82599(hw, refill, max,
 					       bwg_id, prio_type, prio_tc);
-	ixgbe_dcb_config_pfc_82599(hw, pfc_en);
+	ixgbe_dcb_config_pfc_82599(hw, pfc_en, prio_tc);
 	ixgbe_dcb_config_tc_stats_82599(hw);
 
 	return 0;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.h
index 08d1749..a59d5dc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.h
@@ -93,7 +93,7 @@
 /* DCB hardware-specific driver APIs */
 
 /* DCB PFC functions */
-s32 ixgbe_dcb_config_pfc_82599(struct ixgbe_hw *hw, u8 pfc_en);
+s32 ixgbe_dcb_config_pfc_82599(struct ixgbe_hw *hw, u8 pfc_en, u8 *prio_tc);
 
 /* DCB hw initialization */
 s32 ixgbe_dcb_config_rx_arbiter_82599(struct ixgbe_hw *hw,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c
index 1d38955..be66bb6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c
@@ -123,7 +123,7 @@ static u8 ixgbe_dcbnl_set_state(struct net_device *netdev, u8 state)
 		return err;
 
 	if (state > 0)
-		err = ixgbe_setup_tc(netdev, MAX_TRAFFIC_CLASS);
+		err = ixgbe_setup_tc(netdev, adapter->dcb_cfg.num_tcs.pg_tcs);
 	else
 		err = ixgbe_setup_tc(netdev, 0);
 
@@ -158,6 +158,10 @@ static void ixgbe_dcbnl_set_pg_tc_cfg_tx(struct net_device *netdev, int tc,
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 
+	/* Abort a bad configuration */
+	if (ffs(up_map) > adapter->dcb_cfg.num_tcs.pg_tcs)
+		return;
+
 	if (prio != DCB_ATTR_VALUE_UNDEFINED)
 		adapter->temp_dcb_cfg.tc_config[tc].path[0].prio_type = prio;
 	if (bwg_id != DCB_ATTR_VALUE_UNDEFINED)
@@ -178,6 +182,10 @@ static void ixgbe_dcbnl_set_pg_tc_cfg_tx(struct net_device *netdev, int tc,
 	    (adapter->temp_dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap !=
 	     adapter->dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap))
 		adapter->dcb_set_bitmap |= BIT_PG_TX;
+
+	if (adapter->temp_dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap !=
+	     adapter->dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap)
+		adapter->dcb_set_bitmap |= BIT_PFC;
 }
 
 static void ixgbe_dcbnl_set_pg_bwg_cfg_tx(struct net_device *netdev, int bwg_id,
@@ -198,6 +206,10 @@ static void ixgbe_dcbnl_set_pg_tc_cfg_rx(struct net_device *netdev, int tc,
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 
+	/* Abort bad configurations */
+	if (ffs(up_map) > adapter->dcb_cfg.num_tcs.pg_tcs)
+		return;
+
 	if (prio != DCB_ATTR_VALUE_UNDEFINED)
 		adapter->temp_dcb_cfg.tc_config[tc].path[1].prio_type = prio;
 	if (bwg_id != DCB_ATTR_VALUE_UNDEFINED)
@@ -218,6 +230,10 @@ static void ixgbe_dcbnl_set_pg_tc_cfg_rx(struct net_device *netdev, int tc,
 	    (adapter->temp_dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap !=
 	     adapter->dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap))
 		adapter->dcb_set_bitmap |= BIT_PG_RX;
+
+	if (adapter->temp_dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap !=
+	     adapter->dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap)
+		adapter->dcb_set_bitmap |= BIT_PFC;
 }
 
 static void ixgbe_dcbnl_set_pg_bwg_cfg_rx(struct net_device *netdev, int bwg_id,
@@ -296,7 +312,7 @@ static void ixgbe_dcbnl_get_pfc_cfg(struct net_device *netdev, int priority,
 static u8 ixgbe_dcbnl_set_all(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
-	int ret;
+	int ret, i;
 #ifdef IXGBE_FCOE
 	struct dcb_app app = {
 			      .selector = DCB_APP_IDTYPE_ETHTYPE,
@@ -370,18 +386,11 @@ static u8 ixgbe_dcbnl_set_all(struct net_device *netdev)
 	}
 #endif
 
-	if (adapter->dcb_set_bitmap & BIT_PFC) {
-		u8 pfc_en;
-		ixgbe_dcb_unpack_pfc(&adapter->dcb_cfg, &pfc_en);
-		ixgbe_dcb_hw_pfc_config(&adapter->hw, pfc_en);
-		ret = DCB_HW_CHG;
-	}
-
 	if (adapter->dcb_set_bitmap & (BIT_PG_TX|BIT_PG_RX)) {
 		u16 refill[MAX_TRAFFIC_CLASS], max[MAX_TRAFFIC_CLASS];
 		u8 bwg_id[MAX_TRAFFIC_CLASS], prio_type[MAX_TRAFFIC_CLASS];
 		/* Priority to TC mapping in CEE case default to 1:1 */
-		u8 prio_tc[MAX_TRAFFIC_CLASS] = {0, 1, 2, 3, 4, 5, 6, 7};
+		u8 prio_tc[MAX_USER_PRIORITY];
 		int max_frame = adapter->netdev->mtu + ETH_HLEN + ETH_FCS_LEN;
 
 #ifdef IXGBE_FCOE
@@ -401,9 +410,25 @@ static u8 ixgbe_dcbnl_set_all(struct net_device *netdev)
 				       DCB_TX_CONFIG, bwg_id);
 		ixgbe_dcb_unpack_prio(&adapter->dcb_cfg,
 				      DCB_TX_CONFIG, prio_type);
+		ixgbe_dcb_unpack_map(&adapter->dcb_cfg,
+				     DCB_TX_CONFIG, prio_tc);
 
 		ixgbe_dcb_hw_ets_config(&adapter->hw, refill, max,
 					bwg_id, prio_type, prio_tc);
+
+		for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++)
+			netdev_set_prio_tc_map(netdev, i, prio_tc[i]);
+	}
+
+	if (adapter->dcb_set_bitmap & BIT_PFC) {
+		u8 pfc_en;
+		u8 prio_tc[MAX_USER_PRIORITY];
+
+		ixgbe_dcb_unpack_map(&adapter->dcb_cfg,
+				     DCB_TX_CONFIG, prio_tc);
+		ixgbe_dcb_unpack_pfc(&adapter->dcb_cfg, &pfc_en);
+		ixgbe_dcb_hw_pfc_config(&adapter->hw, pfc_en, prio_tc);
+		ret = DCB_HW_CHG;
 	}
 
 	if (adapter->dcb_cfg.pfc_mode_enable)
@@ -460,10 +485,10 @@ static u8 ixgbe_dcbnl_getnumtcs(struct net_device *netdev, int tcid, u8 *num)
 	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
 		switch (tcid) {
 		case DCB_NUMTCS_ATTR_PG:
-			*num = MAX_TRAFFIC_CLASS;
+			*num = adapter->dcb_cfg.num_tcs.pg_tcs;
 			break;
 		case DCB_NUMTCS_ATTR_PFC:
-			*num = MAX_TRAFFIC_CLASS;
+			*num = adapter->dcb_cfg.num_tcs.pfc_tcs;
 			break;
 		default:
 			rval = -EINVAL;
@@ -532,7 +557,7 @@ static int ixgbe_dcbnl_ieee_getets(struct net_device *dev,
 	if (!my_ets)
 		return -EINVAL;
 
-	ets->ets_cap = MAX_TRAFFIC_CLASS;
+	ets->ets_cap = adapter->dcb_cfg.num_tcs.pg_tcs;
 	ets->cbs = my_ets->cbs;
 	memcpy(ets->tc_tx_bw, my_ets->tc_tx_bw, sizeof(ets->tc_tx_bw));
 	memcpy(ets->tc_rx_bw, my_ets->tc_rx_bw, sizeof(ets->tc_rx_bw));
@@ -569,6 +594,9 @@ static int ixgbe_dcbnl_ieee_setets(struct net_device *dev,
 	if (max_tc)
 		max_tc++;
 
+	if (max_tc > adapter->dcb_cfg.num_tcs.pg_tcs)
+		return -EINVAL;
+
 	if (max_tc != netdev_get_num_tc(dev))
 		ixgbe_setup_tc(dev, max_tc);
 
@@ -589,7 +617,7 @@ static int ixgbe_dcbnl_ieee_getpfc(struct net_device *dev,
 	if (!my_pfc)
 		return -EINVAL;
 
-	pfc->pfc_cap = MAX_TRAFFIC_CLASS;
+	pfc->pfc_cap = adapter->dcb_cfg.num_tcs.pfc_tcs;
 	pfc->pfc_en = my_pfc->pfc_en;
 	pfc->mbc = my_pfc->mbc;
 	pfc->delay = my_pfc->delay;
@@ -606,6 +634,7 @@ static int ixgbe_dcbnl_ieee_setpfc(struct net_device *dev,
 				   struct ieee_pfc *pfc)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	u8 *prio_tc;
 
 	if (!(adapter->dcbx_cap & DCB_CAP_DCBX_VER_IEEE))
 		return -EINVAL;
@@ -617,8 +646,9 @@ static int ixgbe_dcbnl_ieee_setpfc(struct net_device *dev,
 			return -ENOMEM;
 	}
 
+	prio_tc = adapter->ixgbe_ieee_ets->prio_tc;
 	memcpy(adapter->ixgbe_ieee_pfc, pfc, sizeof(*adapter->ixgbe_ieee_pfc));
-	return ixgbe_dcb_hw_pfc_config(&adapter->hw, pfc->pfc_en);
+	return ixgbe_dcb_hw_pfc_config(&adapter->hw, pfc->pfc_en, prio_tc);
 }
 
 #ifdef IXGBE_FCOE
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 757e98e..2b8ff95 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3363,8 +3363,10 @@ static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
 
 		if (adapter->ixgbe_ieee_pfc) {
 			struct ieee_pfc *pfc = adapter->ixgbe_ieee_pfc;
+			u8 *prio_tc = adapter->ixgbe_ieee_ets->prio_tc;
 
-			ixgbe_dcb_hw_pfc_config(&adapter->hw, pfc->pfc_en);
+			ixgbe_dcb_hw_pfc_config(&adapter->hw, pfc->pfc_en,
+						prio_tc);
 		}
 	}
 
@@ -4241,7 +4243,6 @@ static inline bool ixgbe_set_dcb_queues(struct ixgbe_adapter *adapter)
 	q = min((int)num_online_cpus(), per_tc_q);
 
 	for (i = 0; i < tcs; i++) {
-		netdev_set_prio_tc_map(dev, i, i);
 		netdev_set_tc_queue(dev, i, q, offset);
 		offset += q;
 	}
@@ -4994,8 +4995,10 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 		tc = &adapter->dcb_cfg.tc_config[j];
 		tc->path[DCB_TX_CONFIG].bwg_id = 0;
 		tc->path[DCB_TX_CONFIG].bwg_percent = 12 + (j & 1);
+		tc->path[DCB_TX_CONFIG].up_to_tc_bitmap = 1 << j;
 		tc->path[DCB_RX_CONFIG].bwg_id = 0;
 		tc->path[DCB_RX_CONFIG].bwg_percent = 12 + (j & 1);
+		tc->path[DCB_RX_CONFIG].up_to_tc_bitmap = 1 << j;
 		tc->dcb_pfc = pfc_disabled;
 	}
 	adapter->dcb_cfg.bw_percentage[DCB_TX_CONFIG][0] = 100;
@@ -6704,12 +6707,13 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 		tx_flags |= IXGBE_TX_FLAGS_SW_VLAN;
 	}
 
+	/* DCB maps skb priorities 0-7 onto 3 bit PCP of VLAN tag. */
 	if ((adapter->flags & IXGBE_FLAG_DCB_ENABLED) &&
 	    ((tx_flags & (IXGBE_TX_FLAGS_HW_VLAN | IXGBE_TX_FLAGS_SW_VLAN)) ||
 	     (skb->priority != TC_PRIO_CONTROL))) {
 		tx_flags &= ~IXGBE_TX_FLAGS_VLAN_PRIO_MASK;
-		tx_flags |= tx_ring->dcb_tc <<
-			    IXGBE_TX_FLAGS_VLAN_PRIO_SHIFT;
+		tx_flags |= (skb->priority & 0x7) <<
+					IXGBE_TX_FLAGS_VLAN_PRIO_SHIFT;
 		if (tx_flags & IXGBE_TX_FLAGS_SW_VLAN) {
 			struct vlan_ethhdr *vhdr;
 			if (skb_header_cloned(skb) &&
-- 
1.7.6.4
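
A minimal userspace sketch of the per-TC decision made by the
ixgbe_dcb_config_pfc_82599() hunk above: a traffic class gets PFC
thresholds when any user priority mapped to it has its bit set in pfc_en.
The helper name pfc_tc_mask() and its bitmask return value are
illustrative assumptions, not driver code; the driver computes a per-TC
"enabled" flag inside its register loop instead.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_USER_PRIORITY 8
#define MAX_TRAFFIC_CLASS 8

/*
 * Hypothetical helper: return a bitmask with bit t set when traffic
 * class t carries at least one user priority whose bit is set in pfc_en.
 * prio_tc[] is indexed by user priority and holds the assigned TC.
 */
static uint8_t pfc_tc_mask(uint8_t pfc_en,
			   const uint8_t prio_tc[MAX_USER_PRIORITY])
{
	uint8_t mask = 0;
	unsigned int up;

	for (up = 0; up < MAX_USER_PRIORITY; up++) {
		assert(prio_tc[up] < MAX_TRAFFIC_CLASS);
		if (pfc_en & (1u << up))
			mask |= (uint8_t)(1u << prio_tc[up]);
	}

	return mask;
}
```

With prio_tc = {0, 0, 1, 1, 2, 2, 3, 3}, enabling PFC for priority 3 only
(pfc_en = 0x08) selects TC 1, whereas the removed "pfc_en & (1 << i)" test
would have selected TC 3.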

^ permalink raw reply related

* [net-next 2/9] e1000e: WoL fails on device ID 0x1501
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

PCI device ID 0x1501 has a hardware bug when the link downshifts for
whatever reason which requires a workaround.  The workaround already exists
for other similar devices but is not called for 0x1501 (it should be called
for any ICH8-based device that uses a GbE PHY).  There is also one other
instance when the workaround should be called - after disabling gigabit
speed when going to Sx.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 4ec5a5a..ad34de0 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -811,7 +811,7 @@ static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
 	}
 
 	if ((adapter->hw.mac.type == e1000_ich8lan) &&
-	    (adapter->hw.phy.type == e1000_phy_igp_3))
+	    (adapter->hw.phy.type != e1000_phy_ife))
 		adapter->flags |= FLAG_LSC_GIG_SPEED_DROP;
 
 	/* Enable workaround for 82579 w/ ME enabled */
@@ -3642,15 +3642,14 @@ void e1000e_igp3_phy_powerdown_workaround_ich8lan(struct e1000_hw *hw)
  *  LPLU, Gig disable, MDIC PHY reset):
  *    1) Set Kumeran Near-end loopback
  *    2) Clear Kumeran Near-end loopback
- *  Should only be called for ICH8[m] devices with IGP_3 Phy.
+ *  Should only be called for ICH8[m] devices with any 1G Phy.
  **/
 void e1000e_gig_downshift_workaround_ich8lan(struct e1000_hw *hw)
 {
 	s32 ret_val;
 	u16 reg_data;
 
-	if ((hw->mac.type != e1000_ich8lan) ||
-	    (hw->phy.type != e1000_phy_igp_3))
+	if ((hw->mac.type != e1000_ich8lan) || (hw->phy.type == e1000_phy_ife))
 		return;
 
 	ret_val = e1000e_read_kmrn_reg(hw, E1000_KMRNCTRLSTA_DIAG_OFFSET,
@@ -3686,6 +3685,9 @@ void e1000_suspend_workarounds_ich8lan(struct e1000_hw *hw)
 	phy_ctrl |= E1000_PHY_CTRL_D0A_LPLU | E1000_PHY_CTRL_GBE_DISABLE;
 	ew32(PHY_CTRL, phy_ctrl);
 
+	if (hw->mac.type == e1000_ich8lan)
+		e1000e_gig_downshift_workaround_ich8lan(hw);
+
 	if (hw->mac.type >= e1000_pchlan) {
 		e1000_oem_bits_config_ich8lan(hw, false);
 		e1000_phy_hw_reset_ich8lan(hw);
-- 
1.7.6.4

^ permalink raw reply related

* [net-next 3/9] ixgbe: Fix PFC mask generation
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: Mark Rustad, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Mark Rustad <mark.d.rustad@intel.com>

Fix PFC mask generation to OR in only a single bit for each priority in
the PFC mask returned via netlink.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c
index 83bf7cc..3d44b15 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c
@@ -184,7 +184,7 @@ void ixgbe_dcb_unpack_pfc(struct ixgbe_dcb_config *cfg, u8 *pfc_en)
 
 	*pfc_en = 0;
 	for (i = 0; i < MAX_TRAFFIC_CLASS; i++)
-		*pfc_en |= (cfg->tc_config[i].dcb_pfc & 0xF) << i;
+		*pfc_en |= !!(cfg->tc_config[i].dcb_pfc & 0xF) << i;
 }
 
 void ixgbe_dcb_unpack_refill(struct ixgbe_dcb_config *cfg, int direction,
-- 
1.7.6.4
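
To see why the "!!" matters, here is a small standalone sketch comparing
the old and new expressions. The function names are made up for the
comparison, and the per-TC values 2 and 3 stand in for non-zero dcb_pfc
settings (the exact enum values are an assumption here).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRAFFIC_CLASS 8

/* Old expression: shifts the raw setting, so values > 1 leak into
 * neighbouring bits of the mask. */
static uint8_t unpack_pfc_old(const uint8_t dcb_pfc[MAX_TRAFFIC_CLASS])
{
	uint8_t pfc_en = 0;
	unsigned int i;

	assert(dcb_pfc);
	for (i = 0; i < MAX_TRAFFIC_CLASS; i++)
		pfc_en |= (uint8_t)((dcb_pfc[i] & 0xF) << i);

	return pfc_en;
}

/* Fixed expression: "!!" collapses any non-zero setting to 1, so each
 * traffic class contributes exactly one bit. */
static uint8_t unpack_pfc_fixed(const uint8_t dcb_pfc[MAX_TRAFFIC_CLASS])
{
	uint8_t pfc_en = 0;
	unsigned int i;

	assert(dcb_pfc);
	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
		unsigned int enabled = !!(dcb_pfc[i] & 0xF);

		pfc_en |= (uint8_t)(enabled << i);
	}

	return pfc_en;
}
```

For a setting of 2 on TC 1 only, the old form yields 0x04 (the bit for
TC 2) while the fixed form yields 0x02.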

^ permalink raw reply related

* [net-next 1/9] e1000e: WoL can fail on 82578DM
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1317898959-16550-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

During suspend, the PHY must be reset for workaround updates to take effect
without restarting auto-negotiation.  Also, set the GbE-disable and
Low Power Link Up (LPLU) enable bits if the EEPROM is configured to do so in
either D0 or non-D0a, instead of just the latter.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 3b063e1..4ec5a5a 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1319,16 +1319,20 @@ static s32 e1000_oem_bits_config_ich8lan(struct e1000_hw *hw, bool d0_state)
 
 		if (mac_reg & E1000_PHY_CTRL_D0A_LPLU)
 			oem_reg |= HV_OEM_BITS_LPLU;
+
+		/* Set Restart auto-neg to activate the bits */
+		if (!e1000_check_reset_block(hw))
+			oem_reg |= HV_OEM_BITS_RESTART_AN;
 	} else {
-		if (mac_reg & E1000_PHY_CTRL_NOND0A_GBE_DISABLE)
+		if (mac_reg & (E1000_PHY_CTRL_GBE_DISABLE |
+			       E1000_PHY_CTRL_NOND0A_GBE_DISABLE))
 			oem_reg |= HV_OEM_BITS_GBE_DIS;
 
-		if (mac_reg & E1000_PHY_CTRL_NOND0A_LPLU)
+		if (mac_reg & (E1000_PHY_CTRL_D0A_LPLU |
+			       E1000_PHY_CTRL_NOND0A_LPLU))
 			oem_reg |= HV_OEM_BITS_LPLU;
 	}
-	/* Restart auto-neg to activate the bits */
-	if (!e1000_check_reset_block(hw))
-		oem_reg |= HV_OEM_BITS_RESTART_AN;
+
 	ret_val = hw->phy.ops.write_reg_locked(hw, HV_OEM_BITS, oem_reg);
 
 out:
@@ -3684,6 +3688,7 @@ void e1000_suspend_workarounds_ich8lan(struct e1000_hw *hw)
 
 	if (hw->mac.type >= e1000_pchlan) {
 		e1000_oem_bits_config_ich8lan(hw, false);
+		e1000_phy_hw_reset_ich8lan(hw);
 		ret_val = hw->phy.ops.acquire(hw);
 		if (ret_val)
 			return;
-- 
1.7.6.4

^ permalink raw reply related

* [net-next 0/9][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2011-10-06 11:02 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

The following series contains updates to e1000e, igb and ixgbe.  Here
is a quick summary:
  - e1000e: fixes for 2 WoL issues
  - igb: fix for I2C, and 2 Alt. MAC address updates
  - ixgbe: fix dependencies for 8 traffic classes, add X540 traffic
    class support and a fix for PFC mask generation

The following are changes since commit f0cd7bdc042310b6b104f133bbfd520a72b3c08a:
  bnx2x: remove some dead code
and are available in the git repository at
  git://github.com/Jkirsher/net-next.git

Akeem G. Abodunrin (3):
  igb: Code to prevent overwriting SFP I2C
  igb: Alternate MAC Address EEPROM Updates
  igb: Alternate MAC Address Updates for Func2&3

Bruce Allan (2):
  e1000e: WoL can fail on 82578DM
  e1000e: WoL fails on device ID 0x1501

John Fastabend (3):
  ixgbe: fixup hard dependencies on supporting 8 traffic classes
  ixgbe: DCB X540 devices support max traffic class of 4
  ixgbe: X540 devices RX PFC frames pause traffic even if disabled

Mark Rustad (1):
  ixgbe: Fix PFC mask generation

 drivers/net/ethernet/intel/e1000e/ich8lan.c        |   25 +++++---
 drivers/net/ethernet/intel/igb/e1000_mac.c         |    9 ++-
 drivers/net/ethernet/intel/igb/e1000_phy.c         |    6 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c       |   22 ++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.h       |    3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c |   46 ++++++++++++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.h |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c    |   60 +++++++++++++++-----
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   29 ++++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h      |    3 +-
 10 files changed, 158 insertions(+), 47 deletions(-)

-- 
1.7.6.4

^ permalink raw reply

* Re: [PATCH] IPv6: DAD from bonding iface is treated as dup address from others
From: Neil Horman @ 2011-10-06 11:00 UTC (permalink / raw)
  To: Yinglin Sun
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <1317873550-1677-1-git-send-email-Yinglin.Sun@emc.com>

On Wed, Oct 05, 2011 at 08:59:10PM -0700, Yinglin Sun wrote:
> Steps to reproduce this issue:
> 1. create bond0 over eth0 and eth1, set the mode to balance-xor
> 2. add an IPv6 address to bond0
> 3. DAD packet is sent out from one slave and then is looped back from
> the other slave. Therefore, it is treated as a duplicate address and
> stays tentative afterwards:
>    kern.info:
>        Oct  5 11:50:18 testvm1 kernel: [  129.224353] bond0: IPv6 duplicate address 1234::1 detected!
> 
> Signed-off-by: Yinglin Sun <Yinglin.Sun@emc.com>
> ---
>  net/ipv6/ndisc.c |   15 +++++++++++++--
>  1 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index 9da6e02..c82f4c7 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -809,9 +809,10 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  
>  		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>  			if (dad) {
> +				const unsigned char *sadr;
> +				sadr = skb_mac_header(skb);
> +
>  				if (dev->type == ARPHRD_IEEE802_TR) {
> -					const unsigned char *sadr;
> -					sadr = skb_mac_header(skb);
>  					if (((sadr[8] ^ dev->dev_addr[0]) & 0x7f) == 0 &&
>  					    sadr[9] == dev->dev_addr[1] &&
>  					    sadr[10] == dev->dev_addr[2] &&
> @@ -821,6 +822,16 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  						/* looped-back to us */
>  						goto out;
>  					}
> +				} else if (dev->type == ARPHRD_ETHER) {
> +					if (sadr[6] == dev->dev_addr[0] &&
> +					    sadr[7] == dev->dev_addr[1] &&
> +					    sadr[8] == dev->dev_addr[2] &&
> +					    sadr[9] == dev->dev_addr[3] &&
> +					    sadr[10] == dev->dev_addr[4] &&
> +					    sadr[11] == dev->dev_addr[5]) {
> +						/* looped-back to us */
> +						goto out;
> +					}
>  				}
>  
>  				/*
> -- 
> 1.7.4.1
> 
Nack. This seems like it will completely break DAD.  What if there's another
system out there with the same MAC address?  A response from that system would
get dropped by this filter, instead of causing the local system to stop using
the address.  What you really want to do is modify
bond_should_deliver_exact_match to detect this frame on the inactive slave, or
something along those lines, and drop the frame there.

Neil

^ permalink raw reply

* bnx2 rxhash
From: Jasper Spaans @ 2011-10-06  9:57 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: mchan

[-- Attachment #1: Type: text/plain, Size: 4318 bytes --]

Hi list,

I'm doing some experiments with rxhash and packet capturing, and I'm
seeing some unexpected results when receiving data (which happens in
promiscuous mode). This behaviour seems to be similar to that seen in
the question asked previously by Christophe Ngo Van Duc:
http://kerneltrap.com/mailarchive/linux-netdev/2010/7/2/6280474

The data I'm processing consists of (almost 100% tcp) traffic between an
ssl-offloader and a cluster of webservers, so the range of mac-addresses
is rather limited. This ssl-offloader does preserve the IP-address and
tcp ports of the clients, so if the rxhash is based on that data, it
should be distributed evenly.

Is there anything I can do about this?


Thanks,
Jasper



The raw data:

spaans@dhcp-074:~$ uname -a
Linux dhcp-074 3.0.4 #7 SMP Tue Sep 13 12:52:01 CEST 2011 x86_64 GNU/Linux

spaans@dhcp-074:~$ grep eth6 /proc/interrupts  | cut -b -30,270-
 142:      57070          0      PCI-MSI-edge      eth6-0
 143:          0          0      PCI-MSI-edge      eth6-1
 144:          0          0      PCI-MSI-edge      eth6-2
 145:          0          0      PCI-MSI-edge      eth6-3
 146:          0          0      PCI-MSI-edge      eth6-4
 147:      17127          0      PCI-MSI-edge      eth6-5
 148:         24          0      PCI-MSI-edge      eth6-6
 149:          0          0      PCI-MSI-edge      eth6-7

This is rather different from the behaviour I get with an intel card:

spaans@dhcp-074:~$ grep eth2 /proc/interrupts  | cut -b -30,270-
  98:          3          0      PCI-MSI-edge      eth2
  99:       4492          0      PCI-MSI-edge      eth2-TxRx-0
 100:       4647          0      PCI-MSI-edge      eth2-TxRx-1
 101:       4369          0      PCI-MSI-edge      eth2-TxRx-2
 102:       4579          0      PCI-MSI-edge      eth2-TxRx-3
 103:       4184          0      PCI-MSI-edge      eth2-TxRx-4
 104:      51268          0      PCI-MSI-edge      eth2-TxRx-5
 105:       4610          0      PCI-MSI-edge      eth2-TxRx-6
 106:       4528          0      PCI-MSI-edge      eth2-TxRx-7

In both cases, only Tx-5 was used for sending data.


Information about the nics:

from dmesg:
[    5.857778] igb 0000:08:00.0: Intel(R) Gigabit Ethernet Network Connection
[    5.857780] igb 0000:08:00.0: eth6: (PCIe:2.5Gb/s:Width x4) 00:1b:21:b9:b8:dc
[    5.858101] igb 0000:08:00.0: eth6: PBA No: G18771-002
[    5.858102] igb 0000:08:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
note that this interface is renamed to eth2 after boot

spaans@dhcp-074:~$ ip link sh eth2
8: eth2: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:b9:b8:dc brd ff:ff:ff:ff:ff:ff


spaans@dhcp-074:~$ sudo ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: off
spaans@dhcp-074:~$ sudo ethtool -i eth2
driver: igb
version: 3.0.6-k2
firmware-version: 1.5-1
bus-info: 0000:08:00.0

vs

from dmesg:
[    5.090860] bnx2 0000:02:00.0: eth2: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem d8000000, IRQ 32, node addr 18:03:73:fa:0f:0e
note that this interface is renamed to eth6 after boot

spaans@dhcp-074:~$ ip link sh eth6
4: eth6: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 18:03:73:fa:0f:0e brd ff:ff:ff:ff:ff:ff

spaans@dhcp-074:~$ sudo ethtool -k eth6
Offload parameters for eth6:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: on
spaans@dhcp-074:~$ sudo ethtool -i eth6
driver: bnx2
version: 2.1.6
firmware-version: 6.2.12 bc 5.2.3 NCSI 2.0.11
bus-info: 0000:02:00.0


-- 
 /\____/\   Ir. Jasper Spaans      
 \   (_)/   Fox-IT Experts in IT Security!
  \    X    T: +31-15-2847999
   \  / \   M: +31-6-41588725   
    \/      KvK Haaglanden 27301624



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2679 bytes --]

^ permalink raw reply

* Asserting ECN from userspace?
From: David Täht @ 2011-10-05  6:18 UTC (permalink / raw)
  To: netdev, bloat-devel

No sooner had I noted (with pleasure) the kernel's new ability to
correctly set the DSCP bits on IPv6 TCP streams without messing with the
negotiated ECN status than I found several use cases where being able
to assert ECN from userspace (for either IPv4 or IPv6) would be useful.

1) Applications such as bittorrent (transmission, etc) that are much
more aware of their overall environment could assert ECN on their UDP
streams to indicate congestion.

2) Test tools. It would be nice to be able, from userspace, to easily
diagnose whether ECN is working on a stream, end to end, and to set and
receive the ECN bits on a less algorithmic basis (i.e., not wedged
deep within a kernel AQM such as RED or SFB).

3) Web Proxies. A web proxy could note when it was experiencing
congestion on one side of the proxied connection (or another) and signal
the other side to slow down.

Ah, ECN, we hardly know ye.

As for item 1, I'm hard pressed to think of a case where setting the ECN
bits on UDP streams would introduce a security problem.

As for 2, can live without.

As for 3... perhaps a grantable network capability? A proxy could
acquire privs to twiddle those bits before dropping root privs.

That raises the question of how to see those bits in the first place. OOB
data?

And twiddling them, on a per-stream basis, for a single packet, would
seem to require something more robust than setsockopt/getsockopt
(although that would work for UDP streams).
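
For the UDP case, a rough sketch of what the setsockopt/getsockopt route
looks like on an IPv4 datagram socket: the two ECN bits are the low bits
of the TOS byte, so they can be rewritten via IP_TOS while leaving DSCP
alone, and IP_RECVTOS asks the kernel to hand back the received TOS byte
as ancillary data on recvmsg(). The helper names and the ECN_* macros are
made up for illustration; this sets the bits per socket, not per packet,
and does nothing for TCP.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define ECN_MASK 0x03	/* low two bits of the TOS byte */
#define ECN_ECT0 0x02	/* ECN-capable transport, codepoint ECT(0) */
#define ECN_CE   0x03	/* congestion experienced */

/* Replace the ECN bits of the socket's IPv4 TOS byte, keeping DSCP.
 * Returns 0 on success, -1 on error. */
static int udp_set_ecn(int fd, int ecn)
{
	int tos = 0;
	socklen_t len = sizeof(tos);

	assert((ecn & ~ECN_MASK) == 0);

	if (getsockopt(fd, IPPROTO_IP, IP_TOS, &tos, &len) < 0)
		return -1;

	tos = (tos & ~ECN_MASK) | ecn;
	return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}

/* Request the TOS byte of each received datagram as IP_TOS ancillary
 * data, to be read with recvmsg(). Returns 0 on success, -1 on error. */
static int udp_enable_recv_tos(int fd)
{
	int on = 1;

	return setsockopt(fd, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on));
}
```

A sender would call udp_set_ecn(fd, ECN_CE) when it wants to signal
congestion and udp_set_ecn(fd, ECN_ECT0) otherwise; whether that should
be allowed without extra privileges is the open question above.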

^ permalink raw reply

* Re: [PATCH net] mscan: zero accidentally copied register content
From: Wolfram Sang @ 2011-10-06  9:24 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: Oliver Hartkopp, Linux Netdev List, Andre Naujoks
In-Reply-To: <4E8D7065.8040905@grandegger.com>

[-- Attachment #1: Type: text/plain, Size: 896 bytes --]


> Why do you want to change 16-bit accesses in general? They are faster
> than two 8 bit accesses. 

Yup, was thinking the same.

> 
> > IMHO this fix is small and clear and especially not risky. I wonder if
> > reworking the 16 bit register access is worth the effort?
> 
> I would prefer:
> 
>  	if (!(frame->can_id & CAN_RTR_FLAG)) {
>  		void __iomem *data = &regs->rx.dsr1_0;
>  		u16 *payload = (u16 *)frame->data;
>  
>  		for (i = 0; i < frame->can_dlc / 2; i++) {
>  			*payload++ = in_be16(data);
>  			data += 2 + _MSCAN_RESERVED_DSR_SIZE;
>  		}
> 		/* copy remaining byte */

if any

> 		if (frame->can_dlc & 1)
> 			frame->data[frame->can_dlc - 1] = in_8(data);

Ack.

Regards,

   Wolfram

-- 
Pengutronix e.K.                           | Wolfram Sang                |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH net] mscan: zero accidentally copied register content
From: Wolfgang Grandegger @ 2011-10-06  9:09 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Wolfram Sang, Linux Netdev List, Andre Naujoks
In-Reply-To: <4E8D528D.8020607@hartkopp.net>

Hi Oliver,

On 10/06/2011 09:02 AM, Oliver Hartkopp wrote:
> On 10/05/11 18:10, Oliver Hartkopp wrote:
> 
>> On 10/05/11 17:51, Wolfram Sang wrote:
> 
>>>> +		/* zero accidentally copied register content at odd DLCs */
>>>> +		if (frame->can_dlc & 1)
>>>> +			frame->data[frame->can_dlc] = 0;
>>>>  	}
>>>>  
>>>>  	out_8(&regs->canrflg, MSCAN_RXF);
>>>
>>> Nice catch, but wouldn't it be more elegant to never have an invalid byte
>>> in the first place?
>>>
>>> if (can_dlc & 1)
>>> 	*payload = in_be16() & mask;
>>>
>>
>>
>> Hm, then i would rather think about changing the for() statement and to read
>> byte-by-byte instead of the current in_be16() usage with the 16bit access
>> drawbacks ...
>>
> 
> 
> I think if one would like to rework the 16bit register access (which is used
> in the rx path /and/ in the tx path also) this should go via net-next after
> some discussion and testing.

Why do you want to change 16-bit accesses in general? They are faster
than two 8 bit accesses. 

> IMHO this fix is small and clear and especially not risky. I wonder if
> reworking the 16 bit register access is worth the effort?

I would prefer:

 	if (!(frame->can_id & CAN_RTR_FLAG)) {
 		void __iomem *data = &regs->rx.dsr1_0;
 		u16 *payload = (u16 *)frame->data;
 
 		for (i = 0; i < frame->can_dlc / 2; i++) {
 			*payload++ = in_be16(data);
 			data += 2 + _MSCAN_RESERVED_DSR_SIZE;
 		}
		/* copy remaining byte */
		if (frame->can_dlc & 1)
			frame->data[frame->can_dlc - 1] = in_8(data);
 	}

Wolfgang.
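
A standalone sketch of the same copy scheme, runnable in userspace: whole
16-bit words are copied first and a trailing single byte is copied only
for odd DLCs, so no stray register byte ever lands in the frame. The
register file is modelled as a plain byte array, and the stride of 4
(2 data bytes plus 2 reserved bytes) is an assumption standing in for
2 + _MSCAN_RESERVED_DSR_SIZE; copy_payload() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define DSR_STRIDE 4	/* assumption: 2 data bytes + 2 reserved bytes */

/* Copy dlc payload bytes from the modelled data segment registers into
 * dst. Bytes of dst beyond dlc are left untouched. */
static void copy_payload(const uint8_t *regs, uint8_t *dst, unsigned int dlc)
{
	unsigned int i;

	assert(dlc <= 8);

	/* full 16-bit words */
	for (i = 0; i < dlc / 2; i++) {
		dst[2 * i] = regs[0];
		dst[2 * i + 1] = regs[1];
		regs += DSR_STRIDE;
	}

	/* copy remaining byte, if any */
	if (dlc & 1)
		dst[dlc - 1] = regs[0];
}
```

With dlc = 3, only the first three data bytes are written and dst[3]
keeps whatever value it had before the call.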

^ permalink raw reply

* Re: linux-next: manual merge of the staging tree with the net tree
From: Arend van Spriel @ 2011-10-06  8:59 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Greg KH, linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	Eliad Peller, John W. Linville, David Miller,
	netdev@vger.kernel.org
In-Reply-To: <20111006155810.14a6b5c199edc388bbf11437@canb.auug.org.au>

On 10/06/2011 06:58 AM, Stephen Rothwell wrote:
> Hi Greg,
>
> Today's linux-next merge of the staging tree got a conflict in
> drivers/staging/brcm80211/brcmsmac/mac80211_if.c between commit
> 37a41b4affa3 ("mac80211: add ieee80211_vif param to tsf functions") from
> the net tree and commit 3956b4a2ddb0 ("staging: brcm80211: remove locking
> macro definitions") from the staging tree.
>
> I fixed it up (which essentially means I used the staging tree version
> with one small change to the brcms_ops_conf_tx prototype) and can carry
> the fixes as necessary.

Thanks, Stephen

I had to make the mac80211 conf_tx callback change for the mainline
patch as well.

Acked.

Gr. AvS

^ permalink raw reply

* Re: [PATCH v4 7/8] Display current tcp memory allocation in kmem cgroup
From: Kirill A. Shutemov @ 2011-10-06  8:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, avagin
In-Reply-To: <4E8ACD6E.3090208@parallels.com>

On Tue, Oct 04, 2011 at 01:10:06PM +0400, Glauber Costa wrote:
> On 10/03/2011 04:36 PM, Kirill A. Shutemov wrote:
> > On Mon, Oct 03, 2011 at 04:26:41PM +0400, Glauber Costa wrote:
> >> On 10/03/2011 04:25 PM, Kirill A. Shutemov wrote:
> >>> On Mon, Oct 03, 2011 at 04:19:18PM +0400, Glauber Costa wrote:
> >>>> On 10/03/2011 04:14 PM, Kirill A. Shutemov wrote:
> >>>>> On Mon, Oct 03, 2011 at 02:18:42PM +0400, Glauber Costa wrote:
> >>>>>> This patch introduces kmem.tcp_current_memory file, living in the
> >>>>>> kmem_cgroup filesystem. It is a simple read-only file that displays the
> >>>>>> amount of kernel memory currently consumed by the cgroup.
> >>>>>>
> >>>>>> Signed-off-by: Glauber Costa<glommer@parallels.com>
> >>>>>> CC: David S. Miller<davem@davemloft.net>
> >>>>>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
> >>>>>> CC: Eric W. Biederman<ebiederm@xmission.com>
> >>>>>> ---
> >>>>>>     Documentation/cgroups/memory.txt |    1 +
> >>>>>>     mm/memcontrol.c                  |   11 +++++++++++
> >>>>>>     2 files changed, 12 insertions(+), 0 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> >>>>>> index 1ffde3e..f5a539d 100644
> >>>>>> --- a/Documentation/cgroups/memory.txt
> >>>>>> +++ b/Documentation/cgroups/memory.txt
> >>>>>> @@ -79,6 +79,7 @@ Brief summary of control files.
> >>>>>>      memory.independent_kmem_limit	 # select whether or not kernel memory limits are
> >>>>>>     				   independent of user limits
> >>>>>>      memory.kmem.tcp.max_memory      # set/show hard limit for tcp buf memory
> >>>>>> + memory.kmem.tcp.current_memory  # show current tcp buf memory allocation
> >>>>>
> >>>>> Both are in pages, right?
> >>>>> Shouldn't it be scaled to bytes and named uniform with other memcg file?
> >>>>> memory.kmem.tcp.limit_in_bytes/usage_in_bytes.
> >>>>>
> >>>> You are absolutely correct.
> >>>> Since the internal tcp comparison works, I just ended up never noticing
> >>>> this.
> >>>
> >>> Should we have failcnt and max_usage_in_bytes for tcp as well?
> >>>
> >>
> >> Well, we get a fail count from the tracer anyway, so I don't really see
> >> a need for that. I see value in having it for the slab allocation
> >> itself, but since this only controls the memory pressure framework, I
> >> think we can live without it.
> >>
> >> That said, this is not a strong opinion. I can add it if you'd prefer.
> >
> > It's good for userspace to have the same set of files for all domains:
> >   - memory;
> >   - memory.memsw;
> >   - memory.kmem;
> >   - memory.kmem.tcp;
> >   - etc.
> > Userspace can reuse code for handling them in this case.
> >
> Ok. Back on this.
> 
> Not all domains have all files anyway.

$ ls -l *.{failcnt,limit_in_bytes,max_usage_in_bytes,usage_in_bytes}
-rw-r--r-- 1 root root 0 Oct  6 11:34 memory.failcnt
-rw-r--r-- 1 root root 0 Oct  6 11:34 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct  6 11:34 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Oct  6 11:34 memory.memsw.failcnt
-rw-r--r-- 1 root root 0 Oct  6 11:34 memory.memsw.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct  6 11:34 memory.memsw.max_usage_in_bytes
-r--r--r-- 1 root root 0 Oct  6 11:34 memory.memsw.usage_in_bytes
-r--r--r-- 1 root root 0 Oct  6 11:34 memory.usage_in_bytes

Hm?..

> max_usage seems to be a property of the main memcg, not of its domains.
> failcnt is present on memsw, and on that only. The problem here is that
> this can fail (and usually will) in codepaths outside the memory
> controller (see net/core/sock.c:__sk_mem_schedule).

One more reason to use res_counter: it provides all the data needed for these files.
 
> Also, max_usage makes sense for kernel memory as a whole, but I don't 
> think it makes sense here as we're only controlling a specific pressure 
> condition.

max_usage is reasonable for everything you can limit. It lets you
check whether the limit is set appropriately.
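
To illustrate the point, here is a rough userspace sketch of a counter that
yields usage, limit, max_usage and failcnt from a single charge path. It is
only a model under assumed names (counter_model and its helpers are made up
for this example) and is not the kernel's res_counter code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model of a resource counter (not kernel code). */
struct counter_model {
	unsigned long long usage;     /* current charge, in bytes */
	unsigned long long max_usage; /* high-water mark of usage */
	unsigned long long limit;     /* hard limit, in bytes */
	unsigned long long failcnt;   /* number of rejected charges */
};

static void counter_model_init(struct counter_model *c,
			       unsigned long long limit)
{
	assert(c != NULL);
	c->usage = 0;
	c->max_usage = 0;
	c->limit = limit;
	c->failcnt = 0;
}

/* Charge val bytes; on failure only failcnt changes. */
static int counter_model_charge(struct counter_model *c,
				unsigned long long val)
{
	/* usage never exceeds limit here, so the subtraction cannot wrap */
	if (val > c->limit - c->usage) {
		c->failcnt++;
		return -ENOMEM;
	}
	c->usage += val;
	if (c->usage > c->max_usage)
		c->max_usage = c->usage;
	return 0;
}

/* Uncharge val bytes, clamping at zero; max_usage is left untouched. */
static void counter_model_uncharge(struct counter_model *c,
				   unsigned long long val)
{
	if (val > c->usage)
		val = c->usage;
	c->usage -= val;
}
```

With a model like this, a max_usage that sits well below the limit suggests
the limit is generous, while a growing failcnt suggests it is too tight.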

-- 
 Kirill A. Shutemov



* Re: [PATCH v5 6/8] tcp buffer limitation: per-cgroup limit
From: Glauber Costa @ 2011-10-06  8:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <1317805090.2473.28.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 10/05/2011 12:58 PM, Eric Dumazet wrote:
> On Wednesday, 5 October 2011 at 12:08 +0400, Glauber Costa wrote:
>> On 10/04/2011 04:48 PM, Eric Dumazet wrote:
>
>>> 2) Could you add const qualifiers when possible to your pointers ?
>>
>> Well, I'll go over the patches again and see where I can add them.
>> Any specific place you're concerned about?
>
> Everywhere it's possible:
>
> It helps the reader instantly know whether a function is about to change
> some part of the object or only read it, without reading the function body.
Sure it does.

So, give me your opinion on this:

Most of the accessors inside struct sock do not modify the object they
are given, but return the address of an element inside it (which can
later be modified by the caller).

I think it is fine for the purpose of clarity, but to avoid warnings we 
end up having to do stuff like this:

+#define CONSTCG(m) ((struct mem_cgroup *)(m))
+long *tcp_sysctl_mem(const struct mem_cgroup *memcg)
+{
+       return CONSTCG(memcg)->tcp.tcp_prot_mem;
+}

Is it acceptable?
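
For comparison, one possible alternative that avoids the cast is to split
the accessor in two: a writable one taking a non-const pointer, and a
read-only one that is const on both sides. This is only a sketch; the
*_model structures and function names are hypothetical stand-ins, and the
real struct mem_cgroup layout is assumed, not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real structures; the layout is assumed. */
struct tcp_memcontrol_model {
	long tcp_prot_mem[3];
};

struct mem_cgroup_model {
	struct tcp_memcontrol_model tcp;
};

/* Writable accessor: the caller may modify the array, so no const here. */
static long *model_tcp_sysctl_mem(struct mem_cgroup_model *memcg)
{
	assert(memcg != NULL);
	return memcg->tcp.tcp_prot_mem;
}

/* Read-only accessor: const in, const out, so no cast is needed. */
static const long *model_tcp_sysctl_mem_ro(const struct mem_cgroup_model *memcg)
{
	assert(memcg != NULL);
	return memcg->tcp.tcp_prot_mem;
}
```

The trade-off is two functions per field instead of one, but the signature
then says honestly whether the caller can end up writing to the object.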
