Netdev List
* [PATCH 03/17] mm: slub: Optimise the SLUB fast path to avoid pfmemalloc checks
From: Mel Gorman @ 2012-05-17 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Mel Gorman
In-Reply-To: <1337266231-8031-1-git-send-email-mgorman@suse.de>

From: Christoph Lameter <cl@linux.com>

This patch removes the check for pfmemalloc from the alloc hotpath and
puts the logic after the election of a new per cpu slab. For a pfmemalloc
page we do not use the fast path but force the use of the slow path which
is also used for the debug case.

This has the side-effect of weakening pfmemalloc processing in the
following way:

1. A process that is allocating for network swap calls __slab_alloc.
   pfmemalloc_match is true so the freelist is loaded and c->freelist is
   now pointing to a pfmemalloc page.

2. A process that is attempting normal allocations calls slab_alloc,
   finds the pfmemalloc page on the freelist and uses it because it did
   not check pfmemalloc_match()

The patch allows non-pfmemalloc allocations to use pfmemalloc pages, with
the kmalloc slabs being the most vulnerable caches on the grounds that they
are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A
later patch will still protect the system, as processes will get throttled
if the pfmemalloc reserves get depleted, but performance will not degrade
as smoothly.
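
The two-step scenario above can be modelled in user space. This is only a toy sketch, not kernel code: `toy_cpu_slab`, `toy_slow_refill` and `toy_fast_alloc` are hypothetical names standing in for the per-CPU slab state, the slow path and the fast path respectively.

```c
#include <stdbool.h>

/* Toy stand-in for the per-CPU slab state (not struct kmem_cache_cpu). */
struct toy_cpu_slab {
	bool page_is_pfmemalloc;	/* current page came from the reserves */
	int free_objects;		/* objects left on the per-CPU freelist */
};

/* Slow path: loads a page; this is where entitlement is decided. */
void toy_slow_refill(struct toy_cpu_slab *c, bool from_reserves, int objects)
{
	c->page_is_pfmemalloc = from_reserves;
	c->free_objects = objects;
}

/*
 * Fast path as it behaves after this patch: no entitlement check.
 * Returns 1 if the object came from a reserve-backed page, 0 if it came
 * from an ordinary page, and -1 if the freelist is empty and the caller
 * would have to fall back to the slow path.
 */
int toy_fast_alloc(struct toy_cpu_slab *c)
{
	if (c->free_objects == 0)
		return -1;
	c->free_objects--;
	return c->page_is_pfmemalloc ? 1 : 0;
}
```

If an entitled caller refills the freelist from the reserves, the next caller through `toy_fast_alloc()` receives a reserve-backed object regardless of who it is, which is the weakening described in steps 1 and 2.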

[mgorman@suse.de: Expanded changelog]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f0909bf..f8cbec4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2298,11 +2298,11 @@ new_slab:
 		}
 	}
 
-	if (likely(!kmem_cache_debug(s)))
+	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(c, gfpflags)))
 		goto load_freelist;
 
 	/* Only entered in the debug case */
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (kmem_cache_debug(s) && !alloc_debug_processing(s, c->page, object, addr))
 		goto new_slab;	/* Slab failed checks. Next slab needed */
 
 	c->freelist = get_freepointer(s, object);
@@ -2352,8 +2352,7 @@ redo:
 	barrier();
 
 	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node) ||
-					!pfmemalloc_match(c, gfpflags)))
+	if (unlikely(!object || !node_match(c, node)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 02/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-05-17 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Mel Gorman
In-Reply-To: <1337266231-8031-1-git-send-email-mgorman@suse.de>

Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.
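
As a rough illustration of the rule in the paragraph above, the following user-space sketch mirrors the decision the patched gfp_to_alloc_flags() makes. The `toy_` names and the boolean fields are hypothetical; 0x80 matches ALLOC_PFMEMALLOC in the diff, while the value used for the no-watermarks flag is only illustrative.

```c
#include <stdbool.h>

#define TOY_ALLOC_NO_WATERMARKS	0x04u	/* illustrative value only */
#define TOY_ALLOC_PFMEMALLOC	0x80u	/* same value as in the diff */

/* Hypothetical summary of the caller state the real code inspects. */
struct toy_caller {
	bool pf_memalloc;	/* current->flags & PF_MEMALLOC */
	bool tif_memdie;	/* test_thread_flag(TIF_MEMDIE) */
	bool in_interrupt;	/* in_interrupt() */
	bool gfp_nomemalloc;	/* gfp_mask & __GFP_NOMEMALLOC */
};

unsigned int toy_alloc_flags(const struct toy_caller *c)
{
	unsigned int flags = 0;

	if (c->pf_memalloc || c->tif_memdie) {
		/* The caller is entitled to reserve-backed slab objects. */
		flags |= TOY_ALLOC_PFMEMALLOC;

		/* Ignoring watermarks additionally needs process context. */
		if (!c->gfp_nomemalloc && !c->in_interrupt)
			flags |= TOY_ALLOC_NO_WATERMARKS;
	}

	return flags;
}
```

A caller with PF_MEMALLOC set in process context gets both flags; the same caller in interrupt context keeps only the PFMEMALLOC flag.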

When this patch is applied, pages allocated from below the low watermark are
returned with page->pfmemalloc set and it is up to the caller to determine
how the page should be protected. SLAB restricts access to any page with
page->pfmemalloc set to callers which are known to be able to access the
PFMEMALLOC reserve. If one is not available, an attempt is made to allocate
a new page rather than use a reserve. SLUB is a bit more relaxed in that
it only records if the current per-CPU page was allocated from PFMEMALLOC
reserve and uses another partial slab if the caller does not have the
necessary GFP or process flags. This was found to be sufficient in tests
to avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.
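
On the SLAB side, the diff in this patch tags per-CPU array entries by setting the low bit of the object pointer (SLAB_OBJ_PFMEMALLOC). A minimal user-space sketch of that low-bit tagging follows; the `obj_` helper names are hypothetical, and the sketch assumes objects are at least 2-byte aligned so that bit 0 is otherwise unused.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bit used as the tag; assumes objects are at least 2-byte aligned. */
#define OBJ_PFMEMALLOC_BIT ((uintptr_t)1)

/* True if the stored pointer carries the tag. */
bool obj_is_tagged(const void *objp)
{
	return ((uintptr_t)objp & OBJ_PFMEMALLOC_BIT) != 0;
}

/* Return the pointer with the tag set; the result must not be dereferenced. */
void *obj_tag(void *objp)
{
	return (void *)((uintptr_t)objp | OBJ_PFMEMALLOC_BIT);
}

/* Return the original, dereferenceable pointer. */
void *obj_untag(void *objp)
{
	return (void *)((uintptr_t)objp & ~OBJ_PFMEMALLOC_BIT);
}
```

The design trade-off is that no extra storage is needed per entry, at the cost that every consumer of the array has to strip the tag before using the pointer.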

[a.p.zijlstra@chello.nl: Original implementation]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h   |    9 +++
 include/linux/page-flags.h |   28 +++++++
 mm/internal.h              |    3 +
 mm/page_alloc.c            |   27 +++++--
 mm/slab.c                  |  192 +++++++++++++++++++++++++++++++++++++++-----
 mm/slub.c                  |   27 ++++++-
 6 files changed, 261 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc3062..56a465f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -53,6 +53,15 @@ struct page {
 		union {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub first free object */
+			bool pfmemalloc;	/* If set by the page allocator,
+						 * ALLOC_PFMEMALLOC was set
+						 * and the low watermark was not
+						 * met implying that the system
+						 * is under some pressure. The
+						 * caller should try ensure
+						 * this page is only used to
+						 * free other pages.
+						 */
 		};
 
 		union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..e66eb0d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -453,6 +453,34 @@ static inline int PageTransTail(struct page *page)
 }
 #endif
 
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	SetPageActive(page);
+}
+
+static inline void __ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	__ClearPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	ClearPageActive(page);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1 << PG_mlocked)
 #else
diff --git a/mm/internal.h b/mm/internal.h
index 2189af4..bff60d8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -239,6 +239,9 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 53c5f8f..4032332 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1463,6 +1463,7 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2208,16 +2209,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+	if ((current->flags & PF_MEMALLOC) ||
+			unlikely(test_thread_flag(TIF_MEMDIE))) {
+		alloc_flags |= ALLOC_PFMEMALLOC;
+
+		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
 	return alloc_flags;
 }
 
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2405,10 +2412,18 @@ nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
+	/*
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+	 * been OOM killed. The expectation is that the caller is taking
+	 * steps that will free more memory. The caller should avoid the
+	 * page being used for !PFMEMALLOC purposes.
+	 */
+	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
-	return page;
 
+	return page;
 }
 
 /*
@@ -2459,6 +2474,8 @@ retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	else
+		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/slab.c b/mm/slab.c
index e901a36..b190cac 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@
 
 #include <trace/events/kmem.h>
 
+#include	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -151,6 +153,12 @@
 #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
 #endif
 
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active __read_mostly;
+
 /* Legal flag mask for kmem_cache_create(). */
 #if DEBUG
 # define CREATE_MASK	(SLAB_RED_ZONE | \
@@ -256,9 +264,30 @@ struct array_cache {
 			 * Must have this definition in here for the proper
 			 * alignment of array_cache. Also simplifies accessing
 			 * the entries.
+			 *
+			 * Entries should not be directly dereferenced as
+			 * entries belonging to slabs marked pfmemalloc will
+			 * have the lower bits set SLAB_OBJ_PFMEMALLOC
 			 */
 };
 
+#define SLAB_OBJ_PFMEMALLOC	1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+	return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+	return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
 /*
  * bootstrap: The caches do not work without cpuarrays anymore, but the
  * cpuarrays are allocated from the generic caches...
@@ -951,6 +980,102 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 	return nc;
 }
 
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+	struct page *page = virt_to_page(slabp->s_mem);
+
+	return PageSlabPfmemalloc(page);
+}
+
+/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
+						struct array_cache *ac)
+{
+	struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+	struct slab *slabp;
+	unsigned long flags;
+
+	if (!pfmemalloc_active)
+		return;
+
+	spin_lock_irqsave(&l3->list_lock, flags);
+	list_for_each_entry(slabp, &l3->slabs_full, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_partial, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_free, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	pfmemalloc_active = false;
+out:
+	spin_unlock_irqrestore(&l3->list_lock, flags);
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+						gfp_t flags, bool force_refill)
+{
+	int i;
+	void *objp = ac->entry[--ac->avail];
+
+	/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+	if (unlikely(is_obj_pfmemalloc(objp))) {
+		struct kmem_list3 *l3;
+
+		if (gfp_pfmemalloc_allowed(flags)) {
+			clear_obj_pfmemalloc(&objp);
+			return objp;
+		}
+
+		/* The caller cannot use PFMEMALLOC objects, find another one */
+		for (i = 1; i < ac->avail; i++) {
+			/* If a !PFMEMALLOC object is found, swap them */
+			if (!is_obj_pfmemalloc(ac->entry[i])) {
+				objp = ac->entry[i];
+				ac->entry[i] = ac->entry[ac->avail];
+				ac->entry[ac->avail] = objp;
+				return objp;
+			}
+		}
+
+		/*
+		 * If there are empty slabs on the slabs_free list and we are
+		 * being forced to refill the cache, mark this one !pfmemalloc.
+		 */
+		l3 = cachep->nodelists[numa_mem_id()];
+		if (!list_empty(&l3->slabs_free) && force_refill) {
+			struct slab *slabp = virt_to_slab(objp);
+			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+			clear_obj_pfmemalloc(&objp);
+			check_ac_pfmemalloc(cachep, ac);
+			return objp;
+		}
+
+		/* No !PFMEMALLOC objects available */
+		ac->avail++;
+		objp = NULL;
+	}
+
+	return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	if (unlikely(pfmemalloc_active)) {
+		/* Some pfmemalloc slabs exist, check if this is one */
+		struct page *page = virt_to_page(objp);
+		if (PageSlabPfmemalloc(page))
+			set_obj_pfmemalloc(&objp);
+	}
+
+	ac->entry[ac->avail++] = objp;
+}
+
 /*
  * Transfer objects in one arraycache to another.
  * Locking must be handled by the caller.
@@ -1127,7 +1252,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 			STATS_INC_ACOVERFLOW(cachep);
 			__drain_alien_cache(cachep, alien, nodeid);
 		}
-		alien->entry[alien->avail++] = objp;
+		ac_put_obj(cachep, alien, objp);
 		spin_unlock(&alien->lock);
 	} else {
 		spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1809,6 +1934,10 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 		return NULL;
 	}
 
+	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	if (unlikely(page->pfmemalloc))
+		pfmemalloc_active = true;
+
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -1816,9 +1945,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_pages; i++) {
 		__SetPageSlab(page + i);
 
+		if (page->pfmemalloc)
+			SetPageSlabPfmemalloc(page + i);
+	}
+
 	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
 		kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
 
@@ -1851,6 +1984,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 	while (i--) {
 		BUG_ON(!PageSlab(page));
 		__ClearPageSlab(page);
+		__ClearPageSlabPfmemalloc(page);
 		page++;
 	}
 	if (current->reclaim_state)
@@ -3120,16 +3254,19 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+							bool force_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
 	struct array_cache *ac;
 	int node;
 
-retry:
 	check_irq_off();
 	node = numa_mem_id();
+	if (unlikely(force_refill))
+		goto force_grow;
+retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3179,8 +3316,8 @@ retry:
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
-							    node);
+			ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+									node));
 		}
 		check_slabp(cachep, slabp);
 
@@ -3199,18 +3336,22 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || force_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
 			goto retry;
 	}
 	ac->touched = 1;
-	return ac->entry[--ac->avail];
+
+	return ac_get_obj(cachep, ac, flags, force_refill);
 }
 
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3292,23 +3433,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
+	bool force_refill = false;
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
-		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
-		objp = ac->entry[--ac->avail];
-	} else {
-		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = ac_get_obj(cachep, ac, flags, false);
+
 		/*
-		 * the 'ac' may be updated by cache_alloc_refill(),
-		 * and kmemleak_erase() requires its correct value.
+		 * Allow for the possibility all avail objects are not allowed
+		 * by the current flags
 		 */
-		ac = cpu_cache_get(cachep);
+		if (objp) {
+			STATS_INC_ALLOCHIT(cachep);
+			goto out;
+		}
+		force_refill = true;
 	}
+
+	STATS_INC_ALLOCMISS(cachep);
+	objp = cache_alloc_refill(cachep, flags, force_refill);
+	/*
+	 * the 'ac' may be updated by cache_alloc_refill(),
+	 * and kmemleak_erase() requires its correct value.
+	 */
+	ac = cpu_cache_get(cachep);
+
+out:
 	/*
 	 * To avoid a false negative, if an object that is in one of the
 	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3630,9 +3783,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 	struct kmem_list3 *l3;
 
 	for (i = 0; i < nr_objects; i++) {
-		void *objp = objpp[i];
+		void *objp;
 		struct slab *slabp;
 
+		clear_obj_pfmemalloc(&objpp[i]);
+		objp = objpp[i];
+
 		slabp = virt_to_slab(objp);
 		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
@@ -3750,7 +3906,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 		cache_flusharray(cachep, ac);
 	}
 
-	ac->entry[ac->avail++] = objp;
+	ac_put_obj(cachep, ac, objp);
 }
 
 /**
diff --git a/mm/slub.c b/mm/slub.c
index ffe13fd..f0909bf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -33,6 +33,8 @@
 
 #include <trace/events/kmem.h>
 
+#include "internal.h"
+
 /*
  * Lock order:
  *   1. slub_lock (Global Semaphore)
@@ -1370,6 +1372,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
+	if (page->pfmemalloc)
+		SetPageSlabPfmemalloc(page);
 
 	start = page_address(page);
 
@@ -1414,6 +1418,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 		-pages);
 
 	__ClearPageSlab(page);
+	__ClearPageSlabPfmemalloc(page);
 	reset_page_mapcount(page);
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += pages;
@@ -2153,6 +2158,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 	return object;
 }
 
+static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
+{
+	if (unlikely(PageSlabPfmemalloc(c->page)))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
 /*
  * Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
  * or deactivate the page.
@@ -2225,6 +2238,16 @@ redo:
 		goto new_slab;
 	}
 
+	/*
+	 * By rights, we should be searching for a slab page that was
+	 * PFMEMALLOC but right now, we are losing the pfmemalloc
+	 * information when the page leaves the per-cpu allocator
+	 */
+	if (unlikely(!pfmemalloc_match(c, gfpflags))) {
+		deactivate_slab(s, c);
+		goto new_slab;
+	}
+
 	/* must check again c->freelist in case of cpu migration or IRQ */
 	object = c->freelist;
 	if (object)
@@ -2329,8 +2352,8 @@ redo:
 	barrier();
 
 	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
-
+	if (unlikely(!object || !node_match(c, node) ||
+					!pfmemalloc_match(c, gfpflags)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-- 
1.7.9.2



* [PATCH 01/17] mm: Serialize access to min_free_kbytes
From: Mel Gorman @ 2012-05-17 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Mel Gorman
In-Reply-To: <1337266231-8031-1-git-send-email-mgorman@suse.de>

There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement.  Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.
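
The shape of the fix is a locked wrapper around an unlocked worker. The sketch below shows that pattern in user space with a pthread mutex standing in for zonelists_mutex; all `toy_` names are hypothetical and the 4 KiB page size is an assumption.

```c
#include <pthread.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB pages */

static pthread_mutex_t toy_zonelists_mutex = PTHREAD_MUTEX_INITIALIZER;

unsigned long toy_min_free_kbytes = 4096;
unsigned long toy_pages_min;

/* Unlocked worker: the caller must hold toy_zonelists_mutex. */
static void toy_setup_wmarks_unlocked(void)
{
	/* Same kilobytes-to-pages conversion as in the diff. */
	toy_pages_min = toy_min_free_kbytes >> (TOY_PAGE_SHIFT - 10);
}

/* Public entry point: every updater serialises on the same mutex. */
void toy_setup_wmarks(void)
{
	pthread_mutex_lock(&toy_zonelists_mutex);
	toy_setup_wmarks_unlocked();
	pthread_mutex_unlock(&toy_zonelists_mutex);
}
```

Keeping the worker separate lets a caller that already holds the mutex invoke it directly without deadlocking.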

[a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 918330f..53c5f8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4976,14 +4976,7 @@ static void setup_per_zone_lowmem_reserve(void)
 	calculate_totalreserve_pages();
 }
 
-/**
- * setup_per_zone_wmarks - called when min_free_kbytes changes
- * or when memory is hot-{added|removed}
- *
- * Ensures that the watermark[min,low,high] values for each zone are set
- * correctly with respect to min_free_kbytes.
- */
-void setup_per_zone_wmarks(void)
+static void __setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -5038,6 +5031,20 @@ void setup_per_zone_wmarks(void)
 	calculate_totalreserve_pages();
 }
 
+/**
+ * setup_per_zone_wmarks - called when min_free_kbytes changes
+ * or when memory is hot-{added|removed}
+ *
+ * Ensures that the watermark[min,low,high] values for each zone are set
+ * correctly with respect to min_free_kbytes.
+ */
+void setup_per_zone_wmarks(void)
+{
+	mutex_lock(&zonelists_mutex);
+	__setup_per_zone_wmarks();
+	mutex_unlock(&zonelists_mutex);
+}
+
 /*
  * The inactive anon list should be small enough that the VM never has to
  * do too much work, but large enough that each inactive page has a chance
-- 
1.7.9.2



* [PATCH 00/17] Swap-over-NBD without deadlocking V11
From: Mel Gorman @ 2012-05-17 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Mel Gorman

Mostly addressing feedback from David Miller.

Changelog since V10
  o Rebase to 3.4-rc5
  o Coding style fixups						      (davem)
  o API consistency						      (davem)
  o Rename sk_allocation to sk_gfp_atomic and use only when necessary (davem)
  o Use static branches for sk_memalloc_socks			      (davem)
  o Use static branch checks in fast paths			      (davem)
  o Document concerns about PF_MEMALLOC leaking flags		      (davem)
  o Locking fix in slab						      (mel)

Changelog since V9
  o Rebase to 3.4-rc5
  o Clarify comment on why PF_MEMALLOC is cleared in softirq handling (akpm)
  o Only set page->pfmemalloc if ALLOC_NO_WATERMARKS was required     (rientjes)

Changelog since V8
  o Rebase to 3.4-rc2
  o Use page flag instead of slab fields to keep structures the same size
  o Properly detect allocations from softirq context that use PF_MEMALLOC
  o Ensure kswapd does not sleep while processes are throttled
  o Do not accidentally throttle !__GFP_FS processes indefinitely

Changelog since V7
  o Rebase to 3.3-rc2
  o Take greater care propagating page->pfmemalloc to skb
  o Propagate pfmemalloc from netdev_alloc_page to skb where possible
  o Release RCU lock properly on preempt kernel

Changelog since V6
  o Rebase to 3.1-rc8
  o Use wake_up instead of wake_up_interruptible()
  o Do not throttle kernel threads
  o Avoid a potential race between kswapd going to sleep and processes being
    throttled

Changelog since V5
  o Rebase to 3.1-rc5

Changelog since V4
  o Update comment clarifying what protocols can be used		(Michal)
  o Rebase to 3.0-rc3

Changelog since V3
  o Propagate pfmemalloc from packet fragment pages to skb		(Neil)
  o Rebase to 3.0-rc2

Changelog since V2
  o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC		(Neil)
  o Use wait_event_interruptible					(Neil)
  o Use !! when casting to bool to avoid any possibility of type
    truncation								(Neil)
  o Nicer logic when using skb_pfmemalloc_protocol			(Neil)

Changelog since V1
  o Rebase on top of mmotm
  o Use atomic_t for memalloc_socks		(David Miller)
  o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
  o Check throttle within prepare_to_wait	(Neil Brown)
  o Add statistics on throttling instead of printk

When a user or administrator requires swap for their application, they
create a swap partition or file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.

The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There are also documentation and tutorials on how to set up swap over NBD
at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
The nbd-client also documents the use of NBD as swap. Despite this, the
fact is that a machine using NBD for swap can deadlock within minutes if
swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over NFS that at least one distribution
is carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.

Patch 1 serialises access to min_free_kbytes. It's not strictly needed
	by this series but as the series cares about watermarks in
	general, it's a harmless fix. It could be merged independently
	and may be if CMA is merged in advance.

Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patch 7 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

Patches 8-14 allow network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean pages. If
	packets are received and stored in pages that were allocated under
	low-memory situations and are unrelated to the VM, the packets
	are dropped.

	Patch 11 reintroduces __skb_alloc_page which the networking
	folk may object to but is needed in some cases to propagate
	pfmemalloc from a newly allocated page to an skb. If there is a
	strong objection, this patch can be dropped with the impact being
	that swap-over-network will be slower in some cases but it should
	not fail.

Patch 14 is a micro-optimisation to avoid a function call in the
	common case.

Patch 15 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 16 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get throttled on
	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
	expected that kswapd and the direct reclaimers already running
	will clean enough pages for the low watermark to be reached and
	the throttled processes are woken up.

Patch 17 adds a statistic to track how often processes get throttled
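
The throttling rule described for Patch 16 can be sketched as a simple predicate. This is a guess at the shape of the check, not the code from the series: the function names are hypothetical, "50% depleted" is interpreted here as free pages having fallen to half of the reserve or below, and the kernel-thread exemption is taken from the V6 changelog entry.

```c
#include <stdbool.h>

/* True while more than half of the reserve is still free. */
bool toy_reserves_ok(unsigned long free_pages, unsigned long reserve_pages)
{
	return free_pages > reserve_pages / 2;
}

/* A direct reclaimer is throttled only if it is not a kernel thread
 * and the reserves have dropped to the halfway mark or below. */
bool toy_should_throttle(bool is_kernel_thread, unsigned long free_pages,
			 unsigned long reserve_pages)
{
	if (is_kernel_thread)
		return false;
	return !toy_reserves_ok(free_pages, reserve_pages);
}
```

With a 1000-page reserve, a user process is throttled once 500 or fewer pages remain free, and is released again when the free count rises above that mark.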

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant
performance variances.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
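
A workload of the kind described above might look roughly like the following. This is not the test program that was actually used: `toy_scan_anon` is a hypothetical helper, and it writes one byte per stride first (an assumption) so that the pages are really backed by memory before being read in a loop.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory, write one byte every `stride`
 * bytes so the pages are populated, then read the same bytes linearly
 * `passes` times. Returns the sum of the bytes read, or -1 on failure.
 */
long toy_scan_anon(size_t len, size_t stride, int passes)
{
	unsigned char *buf;
	long sum = 0;

	if (len == 0 || stride == 0)
		return -1;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	for (size_t off = 0; off < len; off += stride)
		buf[off] = 1;

	for (int p = 0; p < passes; p++)
		for (size_t off = 0; off < len; off += stride)
			sum += buf[off];

	munmap(buf, len);
	return sum;
}
```

To approximate the test in the text, one such scan would run in each of 8*NUM_CPU processes with the combined mapping size set to four times physical memory.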

Without the patches and using SLUB, the machine locks up within minutes;
with them applied, it runs to completion. With SLAB, the story is different
as an unpatched kernel ran to completion. However, the patched kernel
completed the test 40% faster.

                                         3.4.0-rc2     3.4.0-rc2
                                      vanilla-slab     swapnbd
Sys Time Running Test (seconds)              87.90     73.45
User+Sys Time Running Test (seconds)         91.93     76.91
Total Elapsed Time (seconds)               4174.37   2953.96

 drivers/block/nbd.c                               |    6 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    3 +-
 drivers/net/usb/cdc-phonet.c                      |    2 +-
 drivers/usb/gadget/f_phonet.c                     |    2 +-
 include/linux/gfp.h                               |   13 +-
 include/linux/mm_types.h                          |    9 +
 include/linux/mmzone.h                            |    1 +
 include/linux/page-flags.h                        |   28 +++
 include/linux/sched.h                             |    7 +
 include/linux/skbuff.h                            |   83 +++++++-
 include/linux/vm_event_item.h                     |    1 +
 include/net/sock.h                                |   19 ++
 include/trace/events/gfpflags.h                   |    1 +
 kernel/softirq.c                                  |    9 +
 mm/page_alloc.c                                   |   69 +++++--
 mm/slab.c                                         |  216 +++++++++++++++++++--
 mm/slub.c                                         |   28 ++-
 mm/vmscan.c                                       |  131 ++++++++++++-
 mm/vmstat.c                                       |    1 +
 net/core/dev.c                                    |   53 ++++-
 net/core/filter.c                                 |    8 +
 net/core/skbuff.c                                 |   94 +++++++--
 net/core/sock.c                                   |   42 ++++
 net/ipv4/tcp_output.c                             |    9 +-
 net/ipv6/tcp_ipv6.c                               |    8 +-
 29 files changed, 764 insertions(+), 87 deletions(-)

-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* Re: [RFC 13/13] USB: Disable hub-initiated LPM for comms devices.
From: Greg Kroah-Hartman @ 2012-05-17 14:49 UTC (permalink / raw)
  To: Sarah Sharp
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, Alan Stern,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	gigaset307x-common-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	ath9k-devel-xDcbHBWguxHbcTqmT+pZeQ,
	libertas-dev-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	users-poMEt7QlJxcwIE2E9O76wjtx2kNaKg5H
In-Reply-To: <20120517045220.GA4772@xanatos>

On Wed, May 16, 2012 at 09:52:20PM -0700, Sarah Sharp wrote:
> On Wed, May 16, 2012 at 04:20:19PM -0700, Greg Kroah-Hartman wrote:
> > On Wed, May 16, 2012 at 03:45:28PM -0700, Sarah Sharp wrote:
> > > [Resending with a smaller Cc list]
> > > 
> > > Hub-initiated LPM is not good for USB communications devices.  Comms
> > > devices should be able to tell when their link can go into a lower power
> > > state, because they know when an incoming transmission is finished.
> > > Ideally, these devices would slam their links into a lower power state,
> > > using the device-initiated LPM, after finishing the last packet of their
> > > data transfer.
> > > 
> > > If we enable the idle timeouts for the parent hubs to enable
> > > hub-initiated LPM, we will get a lot of useless LPM packets on the bus
> > > as the devices reject LPM transitions when they're in the middle of
> > > receiving data.  Worse, some devices might blindly accept the
> > > hub-initiated LPM and power down their radios while they're in the
> > > middle of receiving a transmission.
> > > 
> > > The Intel Windows folks are disabling hub-initiated LPM for all USB
> > > communications devices under a xHCI USB 3.0 host.  In order to keep
> > > the Linux behavior as close as possible to Windows, we need to do the
> > > same in Linux.
> > 
> > How is the USB core on Windows determining that LPM should be turned off
> > for these devices?  Surely they aren't modifying each individual driver
> > like this is, right?  Any way we also can do this in the core?
> 
> No, I don't think they're modifying individual drivers.  Maybe they
> placed a shim/filter driver below other drivers?

They can do this in their driver by just watching the device class type.

> Basically, I don't know the exact details of what the Windows folks are
> doing.  The recommendation from the Intel Windows team was simply to
> turn hub-initiated LPM off for "all communications devices".  Perhaps
> the Windows USB core is looking for specific USB class codes?  Or maybe
> it has some older API that lets the core know it's a communications
> device?
> 
> I'm not really sure we can do it in the USB core without basically
> duplicating all the class/PID/VID matching in the communications driver.
> I think just adding a flag might be the best way.  I'm open to
> suggestions though.

You can detect something as "simple" as a class type, which I bet is all
that Windows is going to be able to do as well.
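
As a rough illustration of matching on class type, here is a user-space
sketch. It is not USB core code: the helper names are hypothetical, and
which classes count as "communications devices" is an assumption made
for the example. Only the class code values are the standard ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard USB class codes (same values as in linux/usb/ch9.h). */
#define USB_CLASS_PER_INTERFACE       0x00
#define USB_CLASS_COMM                0x02
#define USB_CLASS_CDC_DATA            0x0a
#define USB_CLASS_WIRELESS_CONTROLLER 0xe0

/* Hypothetical helper: is this class code a "communications" class?
 * The set of classes chosen here is an assumption for illustration. */
static bool class_is_comms(uint8_t class_code)
{
	switch (class_code) {
	case USB_CLASS_COMM:
	case USB_CLASS_CDC_DATA:
	case USB_CLASS_WIRELESS_CONTROLLER:
		return true;
	default:
		return false;
	}
}

/* Hypothetical policy check: look at the device class first, then fall
 * back to the interface classes when the device defers to them. */
static bool should_disable_hub_lpm(uint8_t dev_class,
				   const uint8_t *intf_classes, size_t n)
{
	size_t i;

	assert(intf_classes != NULL || n == 0);

	if (dev_class != USB_CLASS_PER_INTERFACE)
		return class_is_comms(dev_class);

	for (i = 0; i < n; i++)
		if (class_is_comms(intf_classes[i]))
			return true;
	return false;
}
```

A check like this only sees class codes, so vendor-specific devices
(class 0xff) would still need a per-driver flag.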

> > Or, turn it around the other way, and only enable it if we know it's
> > safe to do so, in each driver, but I guess that would be even messier.
> 
> Yeah, I think it would be messier.

Ok, this is probably the best solution for us as well, sorry for the
noise.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next 2/4] e1000: remove workaround for Errata 23 from jumbo alloc
From: Ben Hutchings @ 2012-05-17 14:40 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Sebastian Andrzej Siewior, netdev, gospo, sassmann
In-Reply-To: <1337254070-32500-3-git-send-email-jeffrey.t.kirsher@intel.com>

On Thu, 2012-05-17 at 04:27 -0700, Jeff Kirsher wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> According to the comment, errata 23 says that the memory we allocate
> can't cross a 64KiB boundary. In case of jumbo frames we allocate
> complete pages which can never cross the 64KiB boundary because
> PAGE_SIZE should be a multiple of 64KiB so we stop either before the

Should be a factor, not multiple.

[...]
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -4391,30 +4391,6 @@ e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
>  			break;
>  		}
>  
> -		/* Fix for errata 23, can't cross 64kB boundary */
> -		if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
> -			struct sk_buff *oldskb = skb;
> -			e_err(rx_err, "skb align check failed: %u bytes at "
> -			      "%p\n", bufsz, skb->data);
> -			/* Try again, without freeing the previous */
> -			skb = netdev_alloc_skb_ip_align(netdev, bufsz);
> -			/* Failed allocation, critical failure */
> -			if (!skb) {
> -				dev_kfree_skb(oldskb);
> -				adapter->alloc_rx_buff_failed++;
> -				break;
> -			}
> -
> -			if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
> -				/* give up */
> -				dev_kfree_skb(skb);
> -				dev_kfree_skb(oldskb);
> -				break; /* while (cleaned_count--) */
> -			}
> -
> -			/* Use new allocation */
> -			dev_kfree_skb(oldskb);
> -		}
[...]

I don't believe PAGE_SIZE is >64K on any architecture, but perhaps you
should replace the run-time check with:

		BUILD_BUG_ON(PAGE_SIZE > 0x10000);

in case that changes in future.
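
To make the factor argument concrete, a small user-space sketch follows.
It is not driver code: fits_in_64k_window() and page_fits() are
hypothetical names, and addresses are plain integers. It shows that a
page-aligned, page-sized buffer stays inside one 64KiB window whenever
the page size divides 64KiB, while an unaligned buffer of the same size
may not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BOUNDARY_64K 0x10000u

/* True if [start, start + len) lies within a single 64KiB-aligned
 * window. len must be non-zero. */
static bool fits_in_64k_window(uintptr_t start, size_t len)
{
	uintptr_t last = start + len - 1;

	return (start / BOUNDARY_64K) == (last / BOUNDARY_64K);
}

/* Hypothetical helper: a buffer that starts on a page boundary and is
 * exactly one page long. The assert mirrors the "factor" condition. */
static bool page_fits(uintptr_t page_index, size_t page_size)
{
	assert(page_size != 0 && BOUNDARY_64K % page_size == 0);
	return fits_in_64k_window(page_index * page_size, page_size);
}
```

With 4KiB pages, page 15 ends at 0xFFFF and page 16 starts at 0x10000,
so neither crosses the boundary; a 4KiB buffer starting at 0xF800 does.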

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Denys Fedoryshchenko @ 2012-05-17 13:42 UTC (permalink / raw)
  To: netdev, e1000-devel, jeffrey.t.kirsher, jesse.brandeburg,
	therbert, eric.dumazet, davem
In-Reply-To: <668eeb0d42a1678d9083a58deb3ac40d@visp.net.lb>

Found the commit that causes the problem:

author	Tom Herbert <therbert@google.com>
Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
committer	David S. Miller <davem@davemloft.net>
Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
commit	3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
tree	d6670a4f94b2b9dedacc38edb6f0e1306b889f6b
parent	114cf5802165ee93e3ab461c9c505cd94a08b800
e1000e: Support for byte queue limits

Changes to e1000e to use byte queue limits.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

If I revert it, the problem disappears.

How I reproduce it:
In two consoles, run a "fast" ping to a nearby host:
ping 194.146.XXX.XXX -s1472 -i0.0001
ping 194.146.XXX.XXX -s1472 -i0.1

In a third console, open ssh to the host with the "problem", open
mcedit, and just scroll down a large text file.
After a few seconds some "stalls" will occur, and in the ping history
I can see:
1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms


If I apply this small patch, the problem disappears. Sure, it is not a
solution, but let me know how I can help to debug the problem further.

--- netdev.c    2012-05-12 20:08:37.000000000 +0300
+++ netdev.c.patched    2012-05-17 16:32:28.895760472 +0300
@@ -1135,7 +1135,7 @@

         tx_ring->next_to_clean = i;

-       netdev_completed_queue(netdev, pkts_compl, bytes_compl);
+//     netdev_completed_queue(netdev, pkts_compl, bytes_compl);

  #define TX_WAKE_THRESHOLD 32
         if (count && netif_carrier_ok(netdev) &&
@@ -2263,7 +2263,7 @@
                 e1000_put_txbuf(adapter, buffer_info);
         }

-       netdev_reset_queue(adapter->netdev);
+//     netdev_reset_queue(adapter->netdev);
         size = sizeof(struct e1000_buffer) * tx_ring->count;
         memset(tx_ring->buffer_info, 0, size);

@@ -5056,7 +5056,7 @@
         /* if count is 0 then mapping error has occurred */
         count = e1000_tx_map(adapter, skb, first, max_per_txd, nr_frags, mss);
         if (count) {
-               netdev_sent_queue(netdev, skb->len);
+//             netdev_sent_queue(netdev, skb->len);
                 e1000_tx_queue(adapter, tx_flags, count);
                 /* Make sure there is space in the ring for the next send. */
                 e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
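
For readers unfamiliar with the three calls commented out above, here is
a toy user-space model of the accounting they perform. It is a
simplified sketch, not the kernel's dql code and not a diagnosis of this
stall: the struct, the fixed limit, and the toy_* names are all
hypothetical. The point is only that the queue is held stopped while the
bytes reported as sent exceed the bytes reported as completed by more
than the limit, so the two sides have to stay in step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-queue byte accounting. */
struct toy_bql {
	unsigned int queued;    /* bytes reported as sent */
	unsigned int completed; /* bytes reported as completed */
	unsigned int limit;     /* fixed here; dynamic in the real code */
	bool stopped;           /* would map to a stopped TX queue */
};

/* Models the role of netdev_sent_queue(): account bytes handed to HW. */
static void toy_sent(struct toy_bql *q, unsigned int bytes)
{
	q->queued += bytes;
	if (q->queued - q->completed > q->limit)
		q->stopped = true;
}

/* Models the role of netdev_completed_queue(): account TX completions
 * and wake the queue once enough bytes have drained. */
static void toy_completed(struct toy_bql *q, unsigned int bytes)
{
	assert(bytes <= q->queued - q->completed);
	q->completed += bytes;
	if (q->queued - q->completed <= q->limit)
		q->stopped = false;
}

/* Models the role of netdev_reset_queue(): forget all accounting. */
static void toy_reset(struct toy_bql *q)
{
	q->queued = 0;
	q->completed = 0;
	q->stopped = false;
}
```

In this model, a completion pass that reports zero bytes leaves the
queue stopped until a later pass reports the missing bytes.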



On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
> Hi
>
> I have two identical servers, Sun Fire X4150, running different
> flavors of Linux, x86_64 and i386.
> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> The interface I am using now:
> #ethtool -i eth0
> driver: e1000e
> version: 1.9.5-k
> firmware-version: 2.1-11
> bus-info: 0000:04:00.0
> There are 2 CPUs, Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz .
>
> The i386 box was acting as NAT and shaper. As soon as I removed the
> shaper from it, I started to experience strange lockups: traffic is
> normal for 5-30 seconds, then there is a short lockup of 500-3000ms
> (usually around 1000ms) while the dropped packets counter increases.
> I suspected this was due to load, but that seems to be wrong.
> Recently, on the other server (x86_64, which I use for development), I
> upgraded the kernel (it was old, from the 2.6 series) and started to
> see the same latency spikes on a completely idle machine: while I am
> just running mc and, for example, typing in a text editor, I notice
> "stalls". After investigating a little more, I also noticed a small
> number of drops on the interface. No tcpdump was running. The machine
> is idle, and the only traffic there is some small broadcasts from the
> network, my ssh session, and ping.
>
> Dropped packets in ifconfig
>           RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
> The counter sometimes increases when this stall happens.
>
> ethtool -S is clean, there are no dropped packets.
>
> I tried to check the load (mpstat and perf) and found nothing
> suspicious; latencytop doesn't show anything suspicious either.
> dropwatch reports a lot of drops, but mostly because of broadcasts
> and the like. tcpdump at the moment of such drops doesn't show
> anything suspicious.
> Changed the qdisc from the default pfifo_fast to bfifo, without any
> result.
> Tried:  ethtool -K eth0 tso off gso off gro off sg off , no result
> The problem occurs on 3.3.6 - 3.4.0-rc7, most probably on 3.3.0 too,
> but I don't remember for sure. I think it probably doesn't occur on
> some kernels such as 3.1; I will check that soon, because the problem
> is not always reliable to reproduce. All tests were done on 3.4.0-rc7.
>
> I also ran tcpdump in the background, plus iptables with
> timestamps. At the time a stall occurs, it seems I am still receiving
> packets properly, and with iperf udp (from some host to this SunFire)
> no packets are missing at these moments. But I am sure the RX
> interface errors are increasing.
> If I run iperf from the SunFire to the test host, there is packet loss
> at the moments when a stall occurs.
>
> I suspect that for some reason the network card stops transmitting,
> but I am unable to pinpoint the issue. All other hosts in this network
> are fine and don't have such problems.
> Can you help me with that, please? I can provide more debug
> information, compile with patches, etc. I will also try to fall back
> to 3.1 and 3.0 kernels.
>
> Here is how it occurs and how I reproduce it:
> I just open a file and start to scroll it in mc, then run ping in
> another console:
> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162
> ttl=64 time=0.485 ms
> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163
> ttl=64 time=0.470 ms
> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164
> ttl=64 time=424 ms
> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165
> ttl=64 time=324 ms
> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166
> ttl=64 time=214 ms
> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167
> ttl=64 time=104 ms
> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168
> ttl=64 time=1.15 ms
> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169
> ttl=64 time=0.477 ms
> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170
> ttl=64 time=0.667 ms
>
> Remote host tcpdump:
> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 163, length 1480
> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 164, length 1480
> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 164, length 1480
> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 165, length 1480
> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 165, length 1480
> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 166, length 1480
> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 166, length 1480
> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 167, length 1480
> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 167, length 1480
> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 168, length 1480
> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 168, length 1480
>
> Local host(SunFire) tcpdump:
> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 162, length 1480
> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 163, length 1480
> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 163, length 1480
> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 164, length 1480
> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 165, length 1480
> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 166, length 1480
> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 167, length 1480
> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 164, length 1480
> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 168, length 1480
> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 165, length 1480
> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 166, length 1480
> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 167, length 1480
> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 168, length 1480
> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 169, length 1480
> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 169, length 1480
>
> lspci -t
> centaur src # lspci -t
> -[0000:00]-+-00.0
>            +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
>            |               |               \-02.0-[04]--+-00.0
>            |               |                            \-00.1
>            |               \-00.3-[05]--
>            +-03.0-[06]--
>            +-04.0-[07]----00.0
>            +-05.0-[08]--
>            +-06.0-[09]--
>            +-07.0-[0a]--
>            +-08.0
>            +-10.0
>            +-10.1
>            +-10.2
>            +-11.0
>            +-13.0
>            +-15.0
>            +-16.0
>            +-1c.0-[0b]--+-00.0
>            |            \-00.1
>            +-1d.0
>            +-1d.1
>            +-1d.2
>            +-1d.3
>            +-1d.7
>            +-1e.0-[0c]----05.0
>            +-1f.0
>            +-1f.1
>            +-1f.2
>            \-1f.3
> lspci
> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory
> Controller Hub (rev b1)
> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 2 (rev b1)
> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 3 (rev b1)
> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x8 Port 4-5 (rev b1)
> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 5 (rev b1)
> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x8 Port 6-7 (rev b1)
> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 7 (rev b1)
> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA
> Engine (rev b1)
> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB
> Registers (rev b1)
> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB
> Registers (rev b1)
> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB
> Registers (rev b1)
> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
> Registers (rev b1)
> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
> Registers (rev b1)
> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
> Registers (rev b1)
> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
> Registers (rev b1)
> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
> PCI Express Root Port 1 (rev 09)
> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #1 (rev 09)
> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #2 (rev 09)
> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #3 (rev 09)
> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #4 (rev 09)
> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset EHCI USB2 Controller (rev 09)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
> LPC Interface Controller (rev 09)
> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE
> Controller (rev 09)
> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI
> Controller (rev 09)
> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus
> Controller (rev 09)
> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
> Upstream Port (rev 01)
> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to
> PCI-X Bridge (rev 01)
> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
> Downstream Port E1 (rev 01)
> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
> Downstream Port E3 (rev 01)
> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED
> Graphics Family
>
>
> dmesg:
> [    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version
> 7.3.21-k8-NAPI
> [    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
> [    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
> [    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
> [    4.938529] e1000e 0000:04:00.0: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
> [    4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:f8
> [    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network
> Connection
> [    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: 
> FFFFFF-0FF
> [    4.994745] e1000e 0000:04:00.1: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
> [    5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:f9
> [    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network
> Connection
> [    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: 
> FFFFFF-0FF
> [    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
> [    5.052168] e1000e 0000:0b:00.0: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
> [    5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:fa
> [    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network
> Connection
> [    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: 
> C83246-002
> [    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
> [    5.224709] e1000e 0000:0b:00.1: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:fb
> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network
> Connection
> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: 
> C83246-002
> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex,
> Flow Control: None
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

---
Network engineer
Denys Fedoryshchenko

Dora Highway - Center Cebaco - 2nd Floor
Beirut, Lebanon
Tel:	+961 1 247373
E-Mail: denys@visp.net.lb

^ permalink raw reply

* Re: [net] e1000: Prevent reset task killing itself.
From: Greg KH @ 2012-05-17 13:40 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Tushar Dave, netdev, gospo, sassmann, stable
In-Reply-To: <1337252690-29687-1-git-send-email-jeffrey.t.kirsher@intel.com>

On Thu, May 17, 2012 at 04:04:50AM -0700, Jeff Kirsher wrote:
> From: Tushar Dave <tushar.n.dave@intel.com>
> 
> Killing reset task while adapter is resetting causes deadlock.
> Only kill reset task if adapter is not resetting.
> Ref bug #43132 on bugzilla.kernel.org
> 
> CC: stable@vger.kernel.org
> Signed-off-by: Tushar Dave <tushar.n.dave@intel.com>
> Tested-by: Aaron Brown <aaron.f.brown@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 
> @stable - this patch is applicable back to 3.1 kernels

In the future, so this information will not be lost, please do as
Documentation/stable_kernel_rules.txt shows and put this information in
the cc: line to stable.  So for this patch, you should say:

 Cc: stable <stable@vger.kernel.org> [3.1+]
in the signed-off-by: area above.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 3/5] net: Change mail address of Oskar Schirmer
From: Jiri Kosina @ 2012-05-17 13:18 UTC (permalink / raw)
  To: Oskar Schirmer; +Cc: linux-kernel, hannes, shess, netdev
In-Reply-To: <1337161280-28953-4-git-send-email-oskar@scara.com>

On Wed, 16 May 2012, Oskar Schirmer wrote:

> That old mail address doesn't exist any more.
> This changes all occurrences to my new address.
> 
> Signed-off-by: Oskar Schirmer <oskar@scara.com>
> Cc: netdev@vger.kernel.org
> ---
>  drivers/net/ethernet/s6gmac.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/s6gmac.c b/drivers/net/ethernet/s6gmac.c
> index 1895605..6f0d284 100644
> --- a/drivers/net/ethernet/s6gmac.c
> +++ b/drivers/net/ethernet/s6gmac.c
> @@ -1,7 +1,7 @@
>  /*
>   * Ethernet driver for S6105 on chip network device
>   * (c)2008 emlix GmbH http://www.emlix.com
> - * Authors:	Oskar Schirmer <os@emlix.com>
> + * Authors:	Oskar Schirmer <oskar@scara.com>
>   *		Daniel Gloeckner <dg@emlix.com>
>   *
>   * This program is free software; you can redistribute it and/or
> @@ -1070,4 +1070,4 @@ module_exit(s6gmac_exit);
>  
>  MODULE_LICENSE("GPL");
>  MODULE_DESCRIPTION("S6105 on chip Ethernet driver");
> -MODULE_AUTHOR("Oskar Schirmer <os@emlix.com>");
> +MODULE_AUTHOR("Oskar Schirmer <oskar@scara.com>");

I am applying this to trivial.git.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply

* Re: [PATCH net-next] tcp: bool conversions
From: Ben Hutchings @ 2012-05-17 12:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1337246134.4740.5.camel@edumazet-laptop>

On Thu, 2012-05-17 at 11:15 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> bool conversions where possible.

There's a bit more than bool conversions here:

[...]
> --- a/net/ipv4/tcp_hybla.c
> +++ b/net/ipv4/tcp_hybla.c
[...]
> @@ -24,8 +24,7 @@ struct hybla {
>  	u32   minrtt;	      /* Minimum smoothed round trip time value seen */
>  };
>  
> -/* Hybla reference round trip time (default= 1/40 sec = 25 ms),
> -   expressed in jiffies */
> +/* Hybla reference round trip time (default= 1/40 sec = 25 ms), in ms */
>  static int rtt0 = 25;
>  module_param(rtt0, int, 0644);
>  MODULE_PARM_DESC(rtt0, "reference rout trip time (ms)");
> @@ -39,7 +38,7 @@ static inline void hybla_recalc_param (struct sock *sk)
>  	ca->rho_3ls = max_t(u32, tcp_sk(sk)->srtt / msecs_to_jiffies(rtt0), 8);
>  	ca->rho = ca->rho_3ls >> 3;
>  	ca->rho2_7ls = (ca->rho_3ls * ca->rho_3ls) << 1;
> -	ca->rho2 = ca->rho2_7ls >>7;
> +	ca->rho2 = ca->rho2_7ls >> 7;
>  }
>  
>  static void hybla_init(struct sock *sk)
[...] 
> @@ -67,6 +66,7 @@ static void hybla_init(struct sock *sk)
>  static void hybla_state(struct sock *sk, u8 ca_state)
>  {
>  	struct hybla *ca = inet_csk_ca(sk);
> +
>  	ca->hybla_en = (ca_state == TCP_CA_Open);
>  }
>  
[...]
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
[...]
> -static __inline__ int tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
> +static bool tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
>  { 
[...]
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
[...]
>  /* Does at least the first segment of SKB fit into the send window? */
> -static inline int tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
> -				   unsigned int cur_mss)
> +static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
> +			     const struct sk_buff *skb,
> +			     unsigned int cur_mss)
[...]

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [net-next 4/4] igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.
From: Jeff Kirsher @ 2012-05-17 12:31 UTC (permalink / raw)
  To: davem; +Cc: Matthew Vick, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337257912-8487-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Matthew Vick <matthew.vick@intel.com>

Under certain scenarios, it's possible that bursty manageability traffic
over the BMC-to-OS path may overrun the internal manageability receive
buffer causing dropped manageability packets. Clearing this bit prevents
this situation by interrupting coalescing to allow manageability traffic
through.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |    2 ++
 drivers/net/ethernet/intel/igb/igb_main.c      |    3 +++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 6409f85..ec7e4fe 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -301,6 +301,8 @@
 							* transactions */
 #define E1000_DMACR_DMAC_LX_SHIFT       28
 #define E1000_DMACR_DMAC_EN             0x80000000 /* Enable DMA Coalescing */
+/* DMA Coalescing BMC-to-OS Watchdog Enable */
+#define E1000_DMACR_DC_BMC2OSW_EN	0x00008000
 
 #define E1000_DMCTXTH_DMCTTHR_MASK      0x00000FFF /* DMA Coalescing Transmit
 							* Threshold */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 9bbf1a2..dd3bfe8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -7147,6 +7147,9 @@ static void igb_init_dmac(struct igb_adapter *adapter, u32 pba)
 
 			/* watchdog timer= +-1000 usec in 32usec intervals */
 			reg |= (1000 >> 5);
+
+			/* Disable BMC-to-OS Watchdog Enable */
+			reg &= ~E1000_DMACR_DC_BMC2OSW_EN;
 			wr32(E1000_DMACR, reg);
 
 			/*
-- 
1.7.7.6

^ permalink raw reply related

* [net-next v2 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 12:31 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337257912-8487-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The code seems to want to look at the last byte where the HW puts some
information. Since the skb->data area is never seen by the HW I guess it
does not work as expected. We pass the page address to the HW so I
*think* in order to get to the last byte where the information might be
one should use the page buffer and take a look.
This is of course not more than just compile tested.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index fefbf4d..37b7d1c 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 		/* errors is only valid for DD + EOP descriptors */
 		if (unlikely((status & E1000_RXD_STAT_EOP) &&
 		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
-			u8 last_byte = *(skb->data + length - 1);
+			u8 *mapped;
+			u8 last_byte;
+
+			mapped = page_address(buffer_info->page);
+			last_byte = *(mapped + length - 1);
 			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
 				       last_byte)) {
 				spin_lock_irqsave(&adapter->stats_lock,
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 2/4] e1000: remove workaround for Errata 23 from jumbo alloc
From: Jeff Kirsher @ 2012-05-17 12:31 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337257912-8487-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

According to the comment, errata 23 says that the memory we allocate
can't cross a 64KiB boundary. In case of jumbo frames we allocate
complete pages which can never cross the 64KiB boundary because
64KiB should be a multiple of PAGE_SIZE so we stop either before the
boundary or start after it but never cross it. Furthermore the check
seems bogus because it looks at skb->data which is not seen by the HW
at all because we only pass the DMA address of the page we allocated. So
I *think* the workaround is not required here.
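
The boundary argument above can be checked with a small userspace sketch.
This is not the driver's e1000_check_64k_bound(); crosses_64k() and
any_page_crosses() are hypothetical helpers written only for illustration,
and the sketch assumes buffers are page-aligned and exactly one page long.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BOUNDARY_SHIFT 16 /* 64 KiB == 1 << 16 */

/* True if [begin, begin + len) spans more than one 64 KiB window. */
static bool crosses_64k(uintptr_t begin, size_t len)
{
	uintptr_t last;

	assert(len > 0);
	last = begin + len - 1;
	/* First and last byte differ above bit 15 only if a boundary is crossed. */
	return ((begin ^ last) >> BOUNDARY_SHIFT) != 0;
}

/* Check every page-aligned, page-sized buffer in the first `span` bytes. */
static bool any_page_crosses(size_t page_size, uintptr_t span)
{
	uintptr_t addr;

	for (addr = 0; addr + page_size <= span; addr += page_size)
		if (crosses_64k(addr, page_size))
			return true;
	return false;
}
```

With a page size that divides 64 KiB (4 KiB, 16 KiB, 64 KiB) no page-aligned
buffer crosses a boundary; a hypothetical 128 KiB page would, which is why
the argument depends on the relation between the page size and 64 KiB.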

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |   24 ------------------------
 1 files changed, 0 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f1aef68..fefbf4d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4391,30 +4391,6 @@ e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
 			break;
 		}
 
-		/* Fix for errata 23, can't cross 64kB boundary */
-		if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-			struct sk_buff *oldskb = skb;
-			e_err(rx_err, "skb align check failed: %u bytes at "
-			      "%p\n", bufsz, skb->data);
-			/* Try again, without freeing the previous */
-			skb = netdev_alloc_skb_ip_align(netdev, bufsz);
-			/* Failed allocation, critical failure */
-			if (!skb) {
-				dev_kfree_skb(oldskb);
-				adapter->alloc_rx_buff_failed++;
-				break;
-			}
-
-			if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-				/* give up */
-				dev_kfree_skb(skb);
-				dev_kfree_skb(oldskb);
-				break; /* while (cleaned_count--) */
-			}
-
-			/* Use new allocation */
-			dev_kfree_skb(oldskb);
-		}
 		buffer_info->skb = skb;
 		buffer_info->length = adapter->rx_buffer_len;
 check_page:
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 1/4] e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS
From: Jeff Kirsher @ 2012-05-17 12:31 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337257912-8487-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

This define is needed by i217.
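
For illustration only, the effect of the typo can be seen by comparing the
old and new values against the neighbouring bits from the same hunk. The
macro names below are shortened, hypothetical copies of the values in the
diff, and bits_overlap() is a helper invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the hunk; names shortened for this sketch. */
#define CTRL_EXT_LPCD            0x00000004u /* LCD Power Cycle Done */
#define CTRL_EXT_SDP3_DATA       0x00000080u /* SW Definable Pin 3 */
#define CTRL_EXT_FORCE_SMBUS_OLD 0x00000004u /* value before the fix */
#define CTRL_EXT_FORCE_SMBUS_NEW 0x00000800u /* value after the fix */
#define CTRL_EXT_EE_RST          0x00002000u /* Reinitialize from EEPROM */

/* True if two register masks share at least one bit. */
static bool bits_overlap(uint32_t a, uint32_t b)
{
	return (a & b) != 0;
}

/* True if `mask` collides with none of the neighbouring definitions. */
static bool force_smbus_is_distinct(uint32_t mask)
{
	assert(mask != 0);
	return !bits_overlap(mask, CTRL_EXT_LPCD) &&
	       !bits_overlap(mask, CTRL_EXT_SDP3_DATA) &&
	       !bits_overlap(mask, CTRL_EXT_EE_RST);
}
```

With the old value, setting "force SMBus" would have touched the same bit as
E1000_CTRL_EXT_LPCD; the new value does not overlap any of the definitions
shown in the hunk.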

Reported-by: Bjorn Mork <bjorn@mork.no>
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 11c4666..351a409 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -76,7 +76,7 @@
 /* Extended Device Control */
 #define E1000_CTRL_EXT_LPCD  0x00000004     /* LCD Power Cycle Done */
 #define E1000_CTRL_EXT_SDP3_DATA 0x00000080 /* Value of SW Definable Pin 3 */
-#define E1000_CTRL_EXT_FORCE_SMBUS 0x00000004 /* Force SMBus mode*/
+#define E1000_CTRL_EXT_FORCE_SMBUS 0x00000800 /* Force SMBus mode */
 #define E1000_CTRL_EXT_EE_RST    0x00002000 /* Reinitialize from EEPROM */
 #define E1000_CTRL_EXT_SPD_BYPS  0x00008000 /* Speed Select Bypass */
 #define E1000_CTRL_EXT_RO_DIS    0x00020000 /* Relaxed Ordering disable */
-- 
1.7.7.6

^ permalink raw reply related

* [net-next v2 0/4][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-05-17 12:31 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series of patches contains updates for e1000, e1000e and igb.

v2: correct the patch for e1000 that looks for the last byte in a page to
    what was actually sent and tested.  What I originally sent in v1 of
    the series was a munge between v1 & v2 of what Sebastian sent me.

The following are changes since commit dc6b9b78234fecdc6d2ca5e1629185718202bcf5:
  net: include/net/sock.h cleanup
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Bruce Allan (1):
  e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS

Matthew Vick (1):
  igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.

Sebastian Andrzej Siewior (2):
  e1000: remove workaround for Errata 23 from jumbo alloc
  e1000: look in the page and not in skb->data for the last byte

 drivers/net/ethernet/intel/e1000/e1000_main.c  |   30 ++++--------------------
 drivers/net/ethernet/intel/e1000e/defines.h    |    2 +-
 drivers/net/ethernet/intel/igb/e1000_defines.h |    2 +
 drivers/net/ethernet/intel/igb/igb_main.c      |    3 ++
 4 files changed, 11 insertions(+), 26 deletions(-)

-- 
1.7.7.6

^ permalink raw reply

* Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Willy Tarreau @ 2012-05-17 12:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Hi Eric,

I'm facing a regression in stable 3.2.17 and 3.0.31 which is
exhibited by your patch 'tcp: allow splice() to build full TSO
packets' which unfortunately I am very interested in !

What I'm observing is that TCP transmits using splice() stall
quite quickly if I'm using pipes larger than 64kB (even 65537
is enough to reliably observe the stall).

I'm seeing this on haproxy running on a small ARM machine (a
dockstar), which exchanges data through a gig switch with my
development PC. The NIC (mv643xx) doesn't support TSO but has
GSO enabled. If I disable GSO, the problem remains. I can however
make the problem disappear by disabling SG or Tx checksumming.
BTW, using recv/send() instead of splice() also gets rid of the
problem.

I can also reduce the risk of seeing the problem by increasing
the default TCP buffer sizes in tcp_wmem. By default I'm running
at 16kB, but if I increase the output buffer size above the pipe
size, the problem *seems* to disappear though I can't be certain,
since larger buffers generally mean the problem takes longer to
appear, probably due to the fact that the buffers don't need to
be filled. Still I'm certain that with 64k TCP buffers and 128k
pipes I'm still seeing it.

With strace, I'm seeing data fill up the pipe with the splice()
call responsible for pushing the data to the output socket returning
-1 EAGAIN. During this time, the client receives no data.
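
For readers who want to reproduce the pattern, here is a minimal sketch of
the fd -> pipe -> fd forwarding step described above. It is not haproxy's
code: make_relay_pipe() and forward_once() are hypothetical helpers, and the
flags simply mirror the 0x3 (SPLICE_F_MOVE|SPLICE_F_NONBLOCK) seen in the
traces. On EAGAIN from the output side a real caller would have to wait for
the output fd to become writable and retry, which is the step that never
completes in the stall.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* splice(), pipe2(), F_SETPIPE_SZ */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a non-blocking relay pipe and try to grow it to `size` bytes.
 * Returns the resulting capacity in bytes, or -1 on error. */
static int make_relay_pipe(int p[2], int size)
{
	int cap;

	if (pipe2(p, O_NONBLOCK) < 0)
		return -1;
	cap = fcntl(p[1], F_SETPIPE_SZ, size);
	if (cap < 0)
		cap = fcntl(p[1], F_GETPIPE_SZ); /* keep the default size */
	return cap;
}

/* Move up to `len` bytes in_fd -> pipe -> out_fd.
 * Returns the number of bytes pushed to out_fd (0 if nothing could be
 * moved right now), or -1 on a hard error. */
static ssize_t forward_once(int in_fd, int p[2], int out_fd, size_t len)
{
	const unsigned int flags = SPLICE_F_MOVE | SPLICE_F_NONBLOCK;
	ssize_t n, total = 0;

	assert(len > 0);

	/* Fill the pipe from the input side. */
	n = splice(in_fd, NULL, p[1], NULL, len, flags);
	if (n < 0 && errno != EAGAIN)
		return -1;

	/* Drain the pipe to the output side until it is empty or out_fd is full. */
	for (;;) {
		n = splice(p[0], NULL, out_fd, NULL, len, flags);
		if (n > 0) {
			total += n;
			continue;
		}
		if (n == 0 || errno == EAGAIN)
			break; /* caller should poll for writability and retry */
		return -1;
	}
	return total;
}
```

A caller would create the relay with make_relay_pipe(relay, 128 * 1024) to
match the 128k pipe in the report, then call forward_once() each time the
poller reports the input or output fd as ready.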

One thing bugs me: I have tested with a dummy server of mine,
httpterm, which uses tee+splice() to push data outside, and it
has no problem filling the gig pipe, and correctly recovers
from the EAGAIN :

  send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
  tee(0x3, 0x6, 0x10000, 0x2)             = 42552
  splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
  tee(0x3, 0x6, 0x10000, 0x2)             = 13880
  splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
  ...
  tee(0x3, 0x6, 0x10000, 0x2)             = 13880
  splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
  tee(0x3, 0x6, 0x10000, 0x2)             = 50744
  splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
  tee(0x3, 0x6, 0x10000, 0x2)             = 30264
  splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)

etc...

It's only with haproxy which uses splice() to copy data between
two sockets that I'm getting the issue (data forwarded from fd 0xe
to fd 0x6) :

  16:03:17.797144 pipe([36, 37])          = 0
  16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
  16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
  16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
  16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
  16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
  16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
  16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
  16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
  16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
  16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
  16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
  16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
  16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
  16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
  16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
  16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
  16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
  16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
  16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
  16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
  16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
  16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
  16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
  16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)

So output fd 6 hangs here and does not appear again until the
point below, where I pressed Ctrl-C to stop the test :

  16:03:24.740985 gettimeofday({1324652604, 741042}, NULL) = 0
  16:03:24.741148 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
  16:03:24.951762 gettimeofday({1324652604, 951838}, NULL) = 0
  16:03:24.951956 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EPIPE (Broken pipe)

I tried disabling LRO/GRO at the input interface (which happens to be
the same) to see if fragmentation of input data had any impact on this
but nothing changes.

Please note that I'm not even certain the patch is the culprit, I'm
suspecting that by improving splice() efficiency, it might make a
latent issue become more visible. I have no data to back this
feeling, but nothing strikes me in your patch.

I don't know what I can do to troubleshoot this issue. I don't want
to pollute the list with network captures nor strace outputs, but I
have them if you're interested in verifying a few things.

I have another platform available for a test (Atom+82574L supporting
TSO). I'll rebuild and boot on this one to see if I observe the same
behaviour.

If you have any suggestion about things to check or tweaks to change
in the code, I'm quite open to experiment.

Best regards,
Willy

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 12:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <4FB4E782.3070502@linutronix.de>

[-- Attachment #1: Type: text/plain, Size: 675 bytes --]

On Thu, 2012-05-17 at 13:56 +0200, Sebastian Andrzej Siewior wrote:
> On 05/17/2012 01:50 PM, Jeff Kirsher wrote:
> >
> > You're correct, I apologize.  This was my fault, I applied your v1 of the
> > patch and then realized there was a v2.
> >
> > I will re-send the series with the correct patch.
> 
> Okay. I haven't seen [0] in the series. Did you merge it somewhere?
> 
> [0] http://thread.gmane.org/gmane.linux.drivers.e1000.devel/10019
> 
> Sebastian

No, not yet.  Aaron is still validating that patch since it was actually
the last one you sent me.  I expect to be pushing it in the next day or
so with some ixgbe patches, once it finishes validation.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Sebastian Andrzej Siewior @ 2012-05-17 11:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <1337255417.2714.49.camel@jtkirshe-mobl>

On 05/17/2012 01:50 PM, Jeff Kirsher wrote:
>
> You're correct, I apologize.  This was my fault, I applied your v1 of the
> patch and then realized there was a v2.
>
> I will re-send the series with the correct patch.

Okay. I haven't seen [0] in the series. Did you merge it somewhere?

[0] http://thread.gmane.org/gmane.linux.drivers.e1000.devel/10019

Sebastian

^ permalink raw reply

* Re: [PATCH 06/15] batman-adv: Distributed ARP Table - add snooping functions for ARP messages
From: Marek Lindner @ 2012-05-17 11:53 UTC (permalink / raw)
  To: b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <201205121626.38520.lindner_marek-LWAfsSFWpa4@public.gmane.org>


David,

> On Tuesday, May 01, 2012 08:59:04 David Miller wrote:
> > From: Antonio Quartulli <ordex-GaUfNO9RBHfsrOwW+9ziJQ@public.gmane.org>
> > Date: Tue, 1 May 2012 00:22:30 +0200
> > 
> > > However this patch also contains a procedure which queries the neigh
> > > table in order to understand whether a given host is known or not.
> > > Would it be possible to do that in another way (Without manually
> > > touching the table)?
> > > 
> > > Instead, in the next patch (patch 06/15) batman-adv manually increase
> > > the neigh timeouts. Do you think we should avoid doing that as well?
> > > If we are allowed to do that, how can we perform the same operation in
> > > a cleaner way?
> > > 
> > > Last question: why can't other modules use exported functions? Are you
> > > going to change them as well?
> > 
> > I really don't have time to discuss your neigh issues right now as I'm
> > busy speaking at conferences and dealing with the backlog of other
> > patches.
> > 
> > You'll need to find someone else to discuss it with you, sorry.
> 
> I hope now is a good moment to bring the questions back onto the table. We
> still are not sure how to proceed because we have no clear picture of what
> is going to come and how the exported functions are supposed to be used.
> 
> David, if you don't have the time to discuss the ARP handling with us could
> you name someone who knows your plans and the code equally well ? So far,
> nobody has stepped up.

let me add another piece of information: The distributed ARP table does not 
really depend on the kernel's ARP table. We can easily write our own backend 
to be totally independent of the kernel's ARP table. Initially, we thought it 
might be considered a smart move if the code made use of existing kernel 
infrastructure instead of writing our own storage / user space API / etc, 
hence duplicating what is already there. But if you feel this is the better 
way forward we certainly will make the necessary changes.

Regards,
Marek

^ permalink raw reply

* Re: [net-next 0/4][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-05-17 11:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, sassmann
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1082 bytes --]

On Thu, 2012-05-17 at 04:27 -0700, Jeff Kirsher wrote:
> This series of patches contains updates for e1000, e1000e and igb.
> 
> The following are changes since commit dc6b9b78234fecdc6d2ca5e1629185718202bcf5:
>   net: include/net/sock.h cleanup
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master
> 
> Bruce Allan (1):
>   e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS
> 
> Matthew Vick (1):
>   igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.
> 
> Sebastian Andrzej Siewior (2):
>   e1000: remove workaround for Errata 23 from jumbo alloc
>   e1000: look in the page and not in skb->data for the last byte
> 
>  drivers/net/ethernet/intel/e1000/e1000_main.c  |   30 ++++--------------------
>  drivers/net/ethernet/intel/e1000e/defines.h    |    2 +-
>  drivers/net/ethernet/intel/igb/e1000_defines.h |    2 +
>  drivers/net/ethernet/intel/igb/igb_main.c      |    3 ++
>  4 files changed, 11 insertions(+), 26 deletions(-)
> 

v2 of the series will be coming.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 11:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <4FB4E34F.2050004@linutronix.de>

[-- Attachment #1: Type: text/plain, Size: 1547 bytes --]

On Thu, 2012-05-17 at 13:38 +0200, Sebastian Andrzej Siewior wrote:
> On 05/17/2012 01:27 PM, Jeff Kirsher wrote:
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > index fefbf4d..6ac80c8 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > @@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >   		/* errors is only valid for DD + EOP descriptors */
> >   		if (unlikely((status&  E1000_RXD_STAT_EOP)&&
> >   		(rx_desc->errors&  E1000_RXD_ERR_FRAME_ERR_MASK))) {
> > -			u8 last_byte = *(skb->data + length - 1);
> > +			u8 *mapped;
> > +			u8 last_byte;
> > +
> > +			mapped = kmap_atomic(buffer_info->page);
> > +			last_byte = *(mapped + length - 1);
> >   			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
> >   				       last_byte)) {
> >   				spin_lock_irqsave(&adapter->stats_lock,
> 
> This is not what I've sent. My original patch [0] had an unmap as well.
> One comment was, that kmap_atomic() is too much overhead because the 
> page can never be highmem. So I changed it to page_address() [1].
> 
> [0] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10008
> [1] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10012
> 
> Sebastian

You're correct, I apologize.  This was my fault, I applied your v1 of the
patch and then realized there was a v2.

I will re-send the series with the correct patch.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH 1/1] smsc95xx: add FLAG_POINTTOPOINT flag for driver_info
From: Ben Hutchings @ 2012-05-17 11:45 UTC (permalink / raw)
  To: Xiao Jiang; +Cc: steve.glendinning, gregkh, netdev, linux-usb, linux-kernel
In-Reply-To: <4FB4B98E.7000208@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1193 bytes --]

On Thu, 2012-05-17 at 16:40 +0800, Xiao Jiang wrote:
> Ben Hutchings wrote:
> > On Wed, 2012-05-16 at 16:01 +0800, jgq516@gmail.com wrote:
> >   
> >> From: Xiao Jiang <jgq516@gmail.com>
> >>
> >> commit c26134 introduced FLAG_POINTTOPOINT flag for USB ethernet devices
> >> which possibly use "usb%d" names, add this flag to make sure pandaboard
> >> can mount nfs with smsc95xx NIC.
> >>     
> >
> > These are normal Ethernet interfaces, whereas FLAG_POINTTOPOINT is for
> > devices that use non-standard short physical links.
> >
> >   
> This flag is used by some USB NICs. I'm not familiar with those cards;
> perhaps they use non-standard short physical links, as you said.
> But smsc95xx seems to need this flag to use the "usb%d" name,

But this is a regular Ethernet interface and should be named
accordingly.

> at least my pandaboard can't
> mount nfs with the eth0 name. Are there other ways to avoid the nfs
> issue while keeping smsc95xx's name unchanged? Thanks.
[...]

I don't know what this NFS issue is, but I don't see how this can be the
correct solution.

Ben.

-- 
Ben Hutchings
Every program is either trivial or else contains at least one bug

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Sebastian Andrzej Siewior @ 2012-05-17 11:38 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <1337254070-32500-4-git-send-email-jeffrey.t.kirsher@intel.com>

On 05/17/2012 01:27 PM, Jeff Kirsher wrote:
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index fefbf4d..6ac80c8 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>   		/* errors is only valid for DD + EOP descriptors */
>   		if (unlikely((status&  E1000_RXD_STAT_EOP)&&
>   		(rx_desc->errors&  E1000_RXD_ERR_FRAME_ERR_MASK))) {
> -			u8 last_byte = *(skb->data + length - 1);
> +			u8 *mapped;
> +			u8 last_byte;
> +
> +			mapped = kmap_atomic(buffer_info->page);
> +			last_byte = *(mapped + length - 1);
>   			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
>   				       last_byte)) {
>   				spin_lock_irqsave(&adapter->stats_lock,

This is not what I've sent. My original patch [0] had an unmap as well.
One comment was, that kmap_atomic() is too much overhead because the 
page can never be highmem. So I changed it to page_address() [1].

[0] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10008
[1] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10012

Sebastian

^ permalink raw reply

* [net-next 2/4] e1000: remove workaround for Errata 23 from jumbo alloc
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

According to the comment, errata 23 says that the memory we allocate
can't cross a 64KiB boundary. In case of jumbo frames we allocate
complete pages which can never cross the 64KiB boundary because
64KiB should be a multiple of PAGE_SIZE so we stop either before the
boundary or start after it but never cross it. Furthermore the check
seems bogus because it looks at skb->data which is not seen by the HW
at all because we only pass the DMA address of the page we allocated. So
I *think* the workaround is not required here.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |   24 ------------------------
 1 files changed, 0 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f1aef68..fefbf4d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4391,30 +4391,6 @@ e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
 			break;
 		}
 
-		/* Fix for errata 23, can't cross 64kB boundary */
-		if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-			struct sk_buff *oldskb = skb;
-			e_err(rx_err, "skb align check failed: %u bytes at "
-			      "%p\n", bufsz, skb->data);
-			/* Try again, without freeing the previous */
-			skb = netdev_alloc_skb_ip_align(netdev, bufsz);
-			/* Failed allocation, critical failure */
-			if (!skb) {
-				dev_kfree_skb(oldskb);
-				adapter->alloc_rx_buff_failed++;
-				break;
-			}
-
-			if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-				/* give up */
-				dev_kfree_skb(skb);
-				dev_kfree_skb(oldskb);
-				break; /* while (cleaned_count--) */
-			}
-
-			/* Use new allocation */
-			dev_kfree_skb(oldskb);
-		}
 		buffer_info->skb = skb;
 		buffer_info->length = adapter->rx_buffer_len;
 check_page:
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The code seems to want to look at the last byte where the HW puts some
information. Since the skb->data area is never seen by the HW I guess it
does not work as expected. We pass the page address to the HW so I
*think* in order to get to the last byte where the information might be
one should use the page buffer and take a look.
This is of course not more than just compile tested.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index fefbf4d..6ac80c8 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 		/* errors is only valid for DD + EOP descriptors */
 		if (unlikely((status & E1000_RXD_STAT_EOP) &&
 		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
-			u8 last_byte = *(skb->data + length - 1);
+			u8 *mapped;
+			u8 last_byte;
+
+			mapped = kmap_atomic(buffer_info->page);
+			last_byte = *(mapped + length - 1);
 			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
 				       last_byte)) {
 				spin_lock_irqsave(&adapter->stats_lock,
-- 
1.7.7.6

^ permalink raw reply related

