* [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator @ 2012-03-02 11:23 Mel Gorman 2012-03-02 16:19 ` Christoph Lameter 2012-03-02 21:21 ` Peter Zijlstra 0 siblings, 2 replies; 20+ messages in thread From: Mel Gorman @ 2012-03-02 11:23 UTC (permalink / raw) To: Andrew Morton Cc: Miao Xie, Peter Zijlstra, Christoph Lameter, linux-mm, linux-kernel

Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing cpuset's mems] wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating a large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex.

This patch ditches the memory barriers and introduces something similar to a seq lock using atomics on the write side. I know that atomics have similar cost to memory barriers ordinarily but for the fast paths, an atomic read should be a simple read on most architectures and be less painful in general.

The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets updated each time the mask is about to change. An allocation in progress may succeed if the nodemask update happens during allocation but this is not of concern. More importantly, it should not fail an allocation that should have succeeded.

In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were

                              3.3.0-rc3             3.3.0-rc3
                            rc3-vanilla        nobarrier-v1r1
Clients 1 UserTime        0.07 (  0.00%)        0.07 (  8.11%)
Clients 2 UserTime        0.07 (  0.00%)        0.07 ( -0.68%)
Clients 4 UserTime        0.08 (  0.00%)        0.08 (  0.66%)
Clients 1 SysTime         0.70 (  0.00%)        0.67 (  4.15%)
Clients 2 SysTime         0.85 (  0.00%)        0.82 (  3.47%)
Clients 4 SysTime         1.41 (  0.00%)        1.41 (  0.39%)
Clients 1 WallTime        0.77 (  0.00%)        0.74 (  4.19%)
Clients 2 WallTime        0.47 (  0.00%)        0.45 (  3.94%)
Clients 4 WallTime        0.38 (  0.00%)        0.38 (  1.18%)
Clients 1 Flt/sec/cpu  497620.28 (  0.00%)  519277.40 (  4.35%)
Clients 2 Flt/sec/cpu  414639.05 (  0.00%)  429866.80 (  3.67%)
Clients 4 Flt/sec/cpu  257959.16 (  0.00%)  258865.73 (  0.35%)
Clients 1 Flt/sec      495161.39 (  0.00%)  516980.82 (  4.41%)
Clients 2 Flt/sec      820325.95 (  0.00%)  849962.87 (  3.61%)
Clients 4 Flt/sec     1020068.93 (  0.00%) 1023208.69 (  0.31%)

MMTests Statistics: duration
Sys Time Running Test (seconds)            135.68    133.18
User+Sys Time Running Test (seconds)        164.2    161.41
Total Elapsed Time (seconds)               123.46    122.20

The overall improvement is tiny but the System CPU time is much improved and roughly in line with what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced.

To test the actual bug the commit fixed, I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data.
In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> --- include/linux/cpuset.h | 39 ++++++++++++++------------------------- include/linux/sched.h | 2 +- kernel/cpuset.c | 44 ++++++++++---------------------------------- mm/filemap.c | 11 +++++++---- mm/hugetlb.c | 7 +++++-- mm/mempolicy.c | 28 +++++++++++++++++++++------- mm/page_alloc.c | 33 +++++++++++++++++++++++---------- mm/slab.c | 11 +++++++---- mm/slub.c | 32 +++++++++++++++++--------------- mm/vmscan.c | 2 -- 10 files changed, 105 insertions(+), 104 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index e9eaec5..ba6d217 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -92,38 +92,25 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); * reading current mems_allowed and mempolicy in the fastpath must protected * by get_mems_allowed() */ -static inline void get_mems_allowed(void) +static inline unsigned long get_mems_allowed(void) { - current->mems_allowed_change_disable++; - - /* - * ensure that reading mems_allowed and mempolicy happens after the - * update of ->mems_allowed_change_disable. - * - * the write-side task finds ->mems_allowed_change_disable is not 0, - * and knows the read-side task is reading mems_allowed or mempolicy, - * so it will clear old bits lazily. - */ - smp_mb(); + return atomic_read(¤t->mems_allowed_seq); } -static inline void put_mems_allowed(void) +/* + * If this returns false, the operation that took place after get_mems_allowed + * may have failed. It is up to the caller to retry the operation if + * appropriate + */ +static inline bool put_mems_allowed(unsigned long seq) { - /* - * ensure that reading mems_allowed and mempolicy before reducing - * mems_allowed_change_disable. - * - * the write-side task will know that the read-side task is still - * reading mems_allowed or mempolicy, don't clears old bits in the - * nodemask. 
- */ - smp_mb(); - --ACCESS_ONCE(current->mems_allowed_change_disable); + return seq == atomic_read(¤t->mems_allowed_seq); } static inline void set_mems_allowed(nodemask_t nodemask) { task_lock(current); + atomic_inc(¤t->mems_allowed_seq); current->mems_allowed = nodemask; task_unlock(current); } @@ -234,12 +221,14 @@ static inline void set_mems_allowed(nodemask_t nodemask) { } -static inline void get_mems_allowed(void) +static inline unsigned long get_mems_allowed(void) { + return 0; } -static inline void put_mems_allowed(void) +static inline bool put_mems_allowed(unsigned long seq) { + return true; } #endif /* !CONFIG_CPUSETS */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 7d379a6..295dc1b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1498,7 +1498,7 @@ struct task_struct { #endif #ifdef CONFIG_CPUSETS nodemask_t mems_allowed; /* Protected by alloc_lock */ - int mems_allowed_change_disable; + atomic_t mems_allowed_seq; /* Seqence no to catch updates */ int cpuset_mem_spread_rotor; int cpuset_slab_spread_rotor; #endif diff --git a/kernel/cpuset.c b/kernel/cpuset.c index a09ac2b..6bdefed 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk, { bool need_loop; -repeat: /* * Allow tasks that have access to memory reserves because they have * been OOM killed to get memory anywhere. @@ -983,45 +982,22 @@ repeat: */ need_loop = task_has_mempolicy(tsk) || !nodes_intersects(*newmems, tsk->mems_allowed); - nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); - mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); - /* - * ensure checking ->mems_allowed_change_disable after setting all new - * allowed nodes. - * - * the read-side task can see an nodemask with new allowed nodes and - * old allowed nodes. and if it allocates page when cpuset clears newly - * disallowed ones continuous, it can see the new allowed bits. - * - * And if setting all new allowed nodes is after the checking, setting - * all new allowed nodes and clearing newly disallowed ones will be done - * continuous, and the read-side task may find no node to alloc page. - */ - smp_mb(); + if (need_loop) + atomic_inc(&tsk->mems_allowed_seq); - /* - * Allocation of memory is very fast, we needn't sleep when waiting - * for the read-side. - */ - while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) { - task_unlock(tsk); - if (!task_curr(tsk)) - yield(); - goto repeat; - } + nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); + mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); - /* - * ensure checking ->mems_allowed_change_disable before clearing all new - * disallowed nodes. - * - * if clearing newly disallowed bits before the checking, the read-side - * task may find no node to alloc page. 
- */ - smp_mb(); + if (need_loop) + atomic_inc(&tsk->mems_allowed_seq); mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); tsk->mems_allowed = *newmems; + + if (need_loop) + atomic_inc(&tsk->mems_allowed_seq); + task_unlock(tsk); } diff --git a/mm/filemap.c b/mm/filemap.c index b662757..cf40c4c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -498,12 +498,15 @@ struct page *__page_cache_alloc(gfp_t gfp) { int n; struct page *page; + unsigned long cpuset_mems_cookie; if (cpuset_do_page_mem_spread()) { - get_mems_allowed(); - n = cpuset_mem_spread_node(); - page = alloc_pages_exact_node(n, gfp, 0); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + n = cpuset_mem_spread_node(); + page = alloc_pages_exact_node(n, gfp, 0); + } while (!put_mems_allowed(cpuset_mems_cookie) && !page); + return page; } return alloc_pages(gfp, 0); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f34bd8..613698e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -460,8 +460,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, struct zonelist *zonelist; struct zone *zone; struct zoneref *z; + unsigned long cpuset_mems_cookie; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); zonelist = huge_zonelist(vma, address, htlb_alloc_mask, &mpol, &nodemask); /* @@ -490,7 +492,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, } err: mpol_cond_put(mpol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 06b145f..d80e16f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1843,18 +1843,24 @@ struct page * alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, int node) { - struct mempolicy *pol = get_vma_policy(current, vma, addr); + struct mempolicy *pol; struct zonelist *zl; struct page *page; + unsigned long cpuset_mems_cookie; + +retry_cpuset: + pol = get_vma_policy(current, vma, addr); + cpuset_mems_cookie = get_mems_allowed(); - get_mems_allowed(); if (unlikely(pol->mode == MPOL_INTERLEAVE)) { unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } zl = policy_zonelist(gfp, pol, node); @@ -1865,7 +1871,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, struct page *page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); __mpol_put(pol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } /* @@ -1873,7 +1880,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } @@ -1900,11 +1908,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) { struct mempolicy *pol = current->mempolicy; struct page *page; + unsigned long cpuset_mems_cookie; if (!pol || in_interrupt() || (gfp & __GFP_THISNODE)) pol = &default_policy; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* * No reference counting needed for current->mempolicy * nor system default_policy @@ -1915,7 +1926,10 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) page = 
__alloc_pages_nodemask(gfp, order, policy_zonelist(gfp, pol, numa_node_id()), policy_nodemask(gfp, pol)); - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(alloc_pages_current); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d2186ec..b8084d0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2378,8 +2378,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, { enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; - struct page *page; + struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); + unsigned long cpuset_mems_cookie; gfp_mask &= gfp_allowed_mask; @@ -2398,15 +2399,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (unlikely(!zonelist->_zonerefs->zone)) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* The preferred zone is used for statistics later */ first_zones_zonelist(zonelist, high_zoneidx, nodemask ? : &cpuset_current_mems_allowed, &preferred_zone); - if (!preferred_zone) { - put_mems_allowed(); - return NULL; - } + if (!preferred_zone) + goto out; /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, @@ -2416,9 +2417,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, migratetype); - put_mems_allowed(); trace_mm_page_alloc(page, order, gfp_mask, migratetype); + +out: + /* + * When updating a tasks mems_allowed, it is possible to race with + * parallel threads in such a way that an allocation can fail while + * the mask is being updated. If a page allocation is about to fail, + * check if the cpuset changed during allocation and if so, retry. 
+ */ + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); @@ -2632,13 +2643,15 @@ void si_meminfo_node(struct sysinfo *val, int nid) bool skip_free_areas_node(unsigned int flags, int nid) { bool ret = false; + unsigned long cpuset_mems_cookie; if (!(flags & SHOW_MEM_FILTER_NODES)) goto out; - get_mems_allowed(); - ret = !node_isset(nid, cpuset_current_mems_allowed); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + ret = !node_isset(nid, cpuset_current_mems_allowed); + } while (!put_mems_allowed(cpuset_mems_cookie)); out: return ret; } diff --git a/mm/slab.c b/mm/slab.c index f0bd785..0b4f6ea 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3284,12 +3284,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; nid_alloc = nid_here = numa_mem_id(); - get_mems_allowed(); if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD)) nid_alloc = cpuset_slab_spread_node(); else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); - put_mems_allowed(); if (nid_alloc != nid_here) return ____cache_alloc_node(cachep, flags, nid_alloc); return NULL; @@ -3312,11 +3310,14 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; int nid; + unsigned long cpuset_mems_cookie; if (flags & __GFP_THISNODE) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); @@ -3372,7 +3373,9 @@ retry: } } } - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj)) + goto retry_cpuset; return obj; } diff --git a/mm/slub.c b/mm/slub.c index 4907563..6515c98 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1581,6 +1581,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *object; + unsigned long cpuset_mems_cookie; /* * The defrag ratio allows a configuration of the tradeoffs between @@ -1604,23 +1605,24 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, get_cycles() % 1024 > s->remote_node_defrag_ratio) return NULL; - get_mems_allowed(); - zonelist = node_zonelist(slab_node(current->mempolicy), flags); - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { - struct kmem_cache_node *n; - - n = get_node(s, zone_to_nid(zone)); - - if (n && cpuset_zone_allowed_hardwall(zone, flags) && - n->nr_partial > s->min_partial) { - object = get_partial_node(s, n, c); - if (object) { - put_mems_allowed(); - return object; + do { + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + struct kmem_cache_node *n; + + n = get_node(s, zone_to_nid(zone)); + + if (n && cpuset_zone_allowed_hardwall(zone, flags) && + n->nr_partial > s->min_partial) { + object = get_partial_node(s, n, c); + if (object) { + put_mems_allowed(cpuset_mems_cookie); + return object; + } } } - } - put_mems_allowed(); + } while (!put_mems_allowed(cpuset_mems_cookie)); #endif return NULL; } diff --git a/mm/vmscan.c b/mm/vmscan.c index c52b235..fccc048 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2337,7 +2337,6 @@ static unsigned long do_try_to_free_pages(struct 
zonelist *zonelist, unsigned long writeback_threshold; bool aborted_reclaim; - get_mems_allowed(); delayacct_freepages_start(); if (global_reclaim(sc)) @@ -2401,7 +2400,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, out: delayacct_freepages_end(); - put_mems_allowed(); if (sc->nr_reclaimed) return sc->nr_reclaimed; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 20+ messages in thread
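Condensed out of the diff above, the scheme amounts to something like the following sketch: plain C11 atomics with hypothetical names (mems_seq, try_alloc, update_mask). It mirrors the retry control flow of the patch only, not the kernel's exact memory-ordering behaviour, which is exactly the point debated in the replies below.

/*
 * Toy model of the cookie/retry scheme: the reader snapshots a sequence
 * counter, attempts the allocation, and retries only when the allocation
 * failed *and* the counter moved (i.e. a nodemask update raced with it).
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint mems_seq;	/* bumped around nodemask updates that could cause false failures */

static unsigned int get_mems_cookie(void)
{
	return atomic_load(&mems_seq);
}

static bool mems_unchanged(unsigned int cookie)
{
	return cookie == atomic_load(&mems_seq);
}

/* Write side: only bother readers when a false failure is actually possible. */
void update_mask(bool may_cause_false_failure)
{
	if (may_cause_false_failure)
		atomic_fetch_add(&mems_seq, 1);
	/* ... rewrite the task's nodemask and mempolicy here ... */
	if (may_cause_false_failure)
		atomic_fetch_add(&mems_seq, 1);
}

void *try_alloc(void);	/* stand-in for the real page allocator */

void *alloc_with_retry(void)
{
	void *page;
	unsigned int cookie;

	do {
		cookie = get_mems_cookie();
		page = try_alloc();
		/* retry only if the allocation failed *and* the mask moved under us */
	} while (!page && !mems_unchanged(cookie));

	return page;
}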
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 11:23 [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator Mel Gorman @ 2012-03-02 16:19 ` Christoph Lameter 2012-03-02 17:43 ` Mel Gorman 2012-03-02 21:21 ` Peter Zijlstra 1 sibling, 1 reply; 20+ messages in thread From: Christoph Lameter @ 2012-03-02 16:19 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Miao Xie, Peter Zijlstra, linux-mm, linux-kernel On Fri, 2 Mar 2012, Mel Gorman wrote: > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index e9eaec5..ba6d217 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -92,38 +92,25 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); > * reading current mems_allowed and mempolicy in the fastpath must protected > * by get_mems_allowed() > */ > -static inline void get_mems_allowed(void) > +static inline unsigned long get_mems_allowed(void) > { > - current->mems_allowed_change_disable++; > - > - /* > - * ensure that reading mems_allowed and mempolicy happens after the > - * update of ->mems_allowed_change_disable. > - * > - * the write-side task finds ->mems_allowed_change_disable is not 0, > - * and knows the read-side task is reading mems_allowed or mempolicy, > - * so it will clear old bits lazily. > - */ > - smp_mb(); > + return atomic_read(¤t->mems_allowed_seq); > } > > -static inline void put_mems_allowed(void) > +/* > + * If this returns false, the operation that took place after get_mems_allowed > + * may have failed. It is up to the caller to retry the operation if > + * appropriate > + */ > +static inline bool put_mems_allowed(unsigned long seq) > { > - /* > - * ensure that reading mems_allowed and mempolicy before reducing > - * mems_allowed_change_disable. > - * > - * the write-side task will know that the read-side task is still > - * reading mems_allowed or mempolicy, don't clears old bits in the > - * nodemask. > - */ > - smp_mb(); > - --ACCESS_ONCE(current->mems_allowed_change_disable); > + return seq == atomic_read(¤t->mems_allowed_seq); > } Use seqlock instead of the counter? Seems that you are recoding much of what a seqlock does. A seqlock also allows you to have a writer that sort of blocks the reades if necessary. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 16:19 ` Christoph Lameter @ 2012-03-02 17:43 ` Mel Gorman 2012-03-02 19:53 ` Christoph Lameter 2012-03-02 21:25 ` Peter Zijlstra 0 siblings, 2 replies; 20+ messages in thread From: Mel Gorman @ 2012-03-02 17:43 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Miao Xie, Peter Zijlstra, linux-mm, linux-kernel On Fri, Mar 02, 2012 at 10:19:55AM -0600, Christoph Lameter wrote: > On Fri, 2 Mar 2012, Mel Gorman wrote: > > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > > index e9eaec5..ba6d217 100644 > > --- a/include/linux/cpuset.h > > +++ b/include/linux/cpuset.h > > @@ -92,38 +92,25 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); > > * reading current mems_allowed and mempolicy in the fastpath must protected > > * by get_mems_allowed() > > */ > > -static inline void get_mems_allowed(void) > > +static inline unsigned long get_mems_allowed(void) > > { > > - current->mems_allowed_change_disable++; > > - > > - /* > > - * ensure that reading mems_allowed and mempolicy happens after the > > - * update of ->mems_allowed_change_disable. > > - * > > - * the write-side task finds ->mems_allowed_change_disable is not 0, > > - * and knows the read-side task is reading mems_allowed or mempolicy, > > - * so it will clear old bits lazily. > > - */ > > - smp_mb(); > > + return atomic_read(¤t->mems_allowed_seq); > > } > > > > -static inline void put_mems_allowed(void) > > +/* > > + * If this returns false, the operation that took place after get_mems_allowed > > + * may have failed. It is up to the caller to retry the operation if > > + * appropriate > > + */ > > +static inline bool put_mems_allowed(unsigned long seq) > > { > > - /* > > - * ensure that reading mems_allowed and mempolicy before reducing > > - * mems_allowed_change_disable. > > - * > > - * the write-side task will know that the read-side task is still > > - * reading mems_allowed or mempolicy, don't clears old bits in the > > - * nodemask. > > - */ > > - smp_mb(); > > - --ACCESS_ONCE(current->mems_allowed_change_disable); > > + return seq == atomic_read(¤t->mems_allowed_seq); > > } > > Use seqlock instead of the counter? Seems that you are recoding much of > what a seqlock does. A seqlock also allows you to have a writer that sort > of blocks the reades if necessary. > I considered using a seqlock but it isn't cheap. The read side is heavy with the possibility that it starts spinning and incurs a read barrier (looking at read_seqbegin()) here. The retry block incurs another read barrier so basically it would not be no better than what is there currently (which at a 4% performance hit, sucks) In the case of seqlocks, a reader will backoff if a writer is in progress but the page allocator doesn't need that which is why I felt it was ok to special case. Instead, it will try allocate a page while the update is in progress and only take special action if the allocation will fail. Allocation failure is an unusual situation that can trigger application exit or an OOM so it's ok to treat it as a slow path. A normal seqlock would retry unconditionally and potentially have to handle the case where it needs to free the page before retrying which is pointless. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 17:43 ` Mel Gorman @ 2012-03-02 19:53 ` Christoph Lameter 2012-03-02 21:25 ` Peter Zijlstra 1 sibling, 0 replies; 20+ messages in thread From: Christoph Lameter @ 2012-03-02 19:53 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Miao Xie, Peter Zijlstra, linux-mm, linux-kernel On Fri, 2 Mar 2012, Mel Gorman wrote: > I considered using a seqlock but it isn't cheap. The read side is heavy > with the possibility that it starts spinning and incurs a read barrier > (looking at read_seqbegin()) here. The retry block incurs another read > barrier so basically it would not be no better than what is there currently > (which at a 4% performance hit, sucks) Oh. You dont have a read barrier? So your approach is buggy? We could have read a state before someone else incremented the seq counter, then cached it, then we read the counter, did the processing and found that the sequid was not changed? > In the case of seqlocks, a reader will backoff if a writer is in progress > but the page allocator doesn't need that which is why I felt it was ok You can just not use the writer section if you think that is ok. Doubt it but lets at least start using a known serialization construct that would allow us to fix it up if we find that we need to update multiple variables protected by the seqlock. > Allocation failure is an unusual situation that can trigger application > exit or an OOM so it's ok to treat it as a slow path. A normal seqlock > would retry unconditionally and potentially have to handle the case > where it needs to free the page before retrying which is pointless. It will only retry as long as the writer hold the "lock". Like a spinlock the holdoff times depends on the size of the critical section and initially you could just avoid having write sections. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
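The ordering problem Christoph is pointing at is the standard seqcount-reader requirement. A rough C11 model is below, illustrative only: in the kernel the same ordering comes from the smp_rmb() calls inside read_seqcount_begin() and read_seqcount_retry(), and it assumes the writer publishes its update with matching (release) ordering on the counter.

#include <stdatomic.h>

extern atomic_uint seq;		/* bumped by the writer around updates, with release ordering */
extern atomic_int guarded;	/* stands in for the protected state, e.g. mems_allowed */

/*
 * Broken reader: nothing orders the read of `guarded` against the two reads
 * of `seq`, so a stale value can be returned even though the counter appears
 * unchanged -- exactly the scenario described above.
 */
int racy_read(void)
{
	unsigned int s;
	int val;

	do {
		s = atomic_load_explicit(&seq, memory_order_relaxed);
		val = atomic_load_explicit(&guarded, memory_order_relaxed);
	} while (s != atomic_load_explicit(&seq, memory_order_relaxed));

	return val;
}

/*
 * Reader with the ordering a seqcount provides: the acquire load keeps the
 * data read after the first counter read, and the acquire fence (the C11
 * analogue of smp_rmb()) keeps it before the recheck.
 */
int ordered_read(void)
{
	unsigned int s;
	int val;

	do {
		s = atomic_load_explicit(&seq, memory_order_acquire);
		val = atomic_load_explicit(&guarded, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (s != atomic_load_explicit(&seq, memory_order_relaxed));

	return val;
}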
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 17:43 ` Mel Gorman 2012-03-02 19:53 ` Christoph Lameter @ 2012-03-02 21:25 ` Peter Zijlstra 2012-03-02 23:47 ` David Rientjes 2012-03-05 9:35 ` Mel Gorman 1 sibling, 2 replies; 20+ messages in thread From: Peter Zijlstra @ 2012-03-02 21:25 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel

On Fri, 2012-03-02 at 17:43 +0000, Mel Gorman wrote:
>
> I considered using a seqlock but it isn't cheap. The read side is heavy
> with the possibility that it starts spinning and incurs a read barrier
> (looking at read_seqbegin()) here. The retry block incurs another read
> barrier so basically it would not be no better than what is there currently
> (which at a 4% performance hit, sucks)

Use seqcount.

Also, for the write side it doesn't really matter, changing mems_allowed should be rare and is an 'expensive' operation anyway.

For the read side you can do:

again:
	seq = read_seqcount_begin(&current->mems_seq);

	page = do_your_allocator_muck();

	if (!page && read_seqcount_retry(&current->mems_seq, seq))
		goto again;

	oom();

That way, you only have one smp_rmb() in your fast path, read_seqcount_begin() doesn't spin, and you only incur the second smp_rmb() when you've completely failed to allocate anything.

smp_rmb() is basically free on x86, other archs will incur some overhead, but you need a barrier as Christoph pointed out.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
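The write side that pairs with this read loop would look roughly like the fragment below. This is a sketch only; the v2 patch later in the thread does essentially this in cpuset_change_task_nodemask() once mems_allowed_seq becomes a seqcount_t in task_struct.

	/* Writer: enter the write section only when the update could cause
	 * a false allocation failure, so readers rarely see it. */
	if (need_loop)
		write_seqcount_begin(&tsk->mems_allowed_seq);

	/* ... rewrite tsk->mems_allowed and rebind the mempolicy ... */

	if (need_loop)
		write_seqcount_end(&tsk->mems_allowed_seq);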
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 21:25 ` Peter Zijlstra @ 2012-03-02 23:47 ` David Rientjes 2012-03-05 9:44 ` Mel Gorman 2012-03-05 9:35 ` Mel Gorman 1 sibling, 1 reply; 20+ messages in thread From: David Rientjes @ 2012-03-02 23:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Fri, 2 Mar 2012, Peter Zijlstra wrote: > Also, for the write side it doesn't really matter, changing mems_allowed > should be rare and is an 'expensive' operation anyway. > It's very expensive even without memory barriers since the page allocator wraps itself in {get,put}_mems_allowed() until a page or NULL is returned and an update to current's set of allowed mems can stall indefinitely trying to change the nodemask during this time. The thread changing cpuset.mems is holding cgroup_mutex the entire time which locks out changes, including adding additional nodes to current's set of allowed mems. If direct reclaim takes a long time or an oom killed task fails to exit quickly (or the allocation is __GFP_NOFAIL and we just spin indefinitely holding get_mems_allowed()), then it's not uncommon to see a write to cpuset.mems taking minutes while holding the mutex, if it ever actually returns at all. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 23:47 ` David Rientjes @ 2012-03-05 9:44 ` Mel Gorman 2012-03-06 23:31 ` David Rientjes 0 siblings, 1 reply; 20+ messages in thread From: Mel Gorman @ 2012-03-05 9:44 UTC (permalink / raw) To: David Rientjes Cc: Peter Zijlstra, Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Fri, Mar 02, 2012 at 03:47:17PM -0800, David Rientjes wrote: > On Fri, 2 Mar 2012, Peter Zijlstra wrote: > > > Also, for the write side it doesn't really matter, changing mems_allowed > > should be rare and is an 'expensive' operation anyway. > > > > It's very expensive even without memory barriers since the page allocator > wraps itself in {get,put}_mems_allowed() until a page or NULL is returned > and an update to current's set of allowed mems can stall indefinitely > trying to change the nodemask during this time. Hmm, this sounds problematic. Are you seeing a problem with the behaviour with the patch applied or the existing behaviour? If you are talking about the patch, I am missing something. The retry only takes place if there is a parallel update of the nodemask and the page allocation fails. There would need to be continual updates of the nodemask that led to false allocation failures for it to stall indefinitely. On the updating of the cpumask side, there is no longer the "yield and retry" logic so it also should not stall indefinitely trying to change the nodemask. The write_seqcount_begin() does not wait for the reader side to complete so it should also not be stalling for long periods of time. Did I miss something? -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-05 9:44 ` Mel Gorman @ 2012-03-06 23:31 ` David Rientjes 0 siblings, 0 replies; 20+ messages in thread From: David Rientjes @ 2012-03-06 23:31 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Mon, 5 Mar 2012, Mel Gorman wrote: > > It's very expensive even without memory barriers since the page allocator > > wraps itself in {get,put}_mems_allowed() until a page or NULL is returned > > and an update to current's set of allowed mems can stall indefinitely > > trying to change the nodemask during this time. > > Hmm, this sounds problematic. Are you seeing a problem with the behaviour > with the patch applied or the existing behaviour? > Sorry, yes, this is with the existing behavior prior to your patch. We definitely need fixes for get_mems_allowed() because it's possible that a write to cpuset.mems will never return even when trying to add nodes to its nodemask in oom conditions if one of the cpuset's tasks is looping forever in the page allocator. I'll review your updated version posted from today. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 21:25 ` Peter Zijlstra 2012-03-02 23:47 ` David Rientjes @ 2012-03-05 9:35 ` Mel Gorman 1 sibling, 0 replies; 20+ messages in thread From: Mel Gorman @ 2012-03-05 9:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Fri, Mar 02, 2012 at 10:25:29PM +0100, Peter Zijlstra wrote: > On Fri, 2012-03-02 at 17:43 +0000, Mel Gorman wrote: > > > > I considered using a seqlock but it isn't cheap. The read side is heavy > > with the possibility that it starts spinning and incurs a read barrier > > (looking at read_seqbegin()) here. The retry block incurs another read > > barrier so basically it would not be no better than what is there currently > > (which at a 4% performance hit, sucks) > > Use seqcount. > Ok. I wanted to avoid the stall on the read side (it stalls if there is a writer in progress) but it's not worth a special case. As you say, changing mems_allowed should be rare and even then it only becomes a problem if the new mask may cause false failures. > Also, for the write side it doesn't really matter, changing mems_allowed > should be rare and is an 'expensive' operation anyway. > > For the read side you can do: > > again: > seq = read_seqcount_begin(¤t->mems_seq); > > page = do_your_allocator_muck(); > > if (!page && read_seqcount_retry(¤t->mems_seq, seq)) > goto again; > > oom(); > This is effectively what I did. It's still behind get/put mems allowed so it can be disabled in !CPUSET configurations. > That way, you only have one smp_rmb() in your fath path, > read_seqcount_begin() doesn't spin, and you only incur the second > smp_rmb() when you've completely failed to allocate anything. > > smp_rmb() is basicaly free on x86, other archs will incur some overhead, > but you need a barrier as Christoph pointed out. Thanks Peter and Christoph for the feedback. Can you give this version a look over? The performance figures are updated so the performance gain is still there. ---8<--- cpuset: mm: Reduce large amounts of memory barrier related damage Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing cpuset's mems] wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. 
The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> --- include/linux/cpuset.h | 38 +++++++++++++------------------------- include/linux/sched.h | 2 +- kernel/cpuset.c | 43 ++++++++----------------------------------- mm/filemap.c | 11 +++++++---- mm/hugetlb.c | 7 +++++-- mm/mempolicy.c | 28 +++++++++++++++++++++------- mm/page_alloc.c | 33 +++++++++++++++++++++++---------- mm/slab.c | 11 +++++++---- mm/slub.c | 32 +++++++++++++++++--------------- mm/vmscan.c | 2 -- 10 files changed, 102 insertions(+), 105 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index e9eaec5..104f701 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -92,33 +92,19 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); * reading current mems_allowed and mempolicy in the fastpath must protected * by get_mems_allowed() */ -static inline void get_mems_allowed(void) +static inline unsigned int get_mems_allowed(void) { - current->mems_allowed_change_disable++; - - /* - * ensure that reading mems_allowed and mempolicy happens after the - * update of ->mems_allowed_change_disable. - * - * the write-side task finds ->mems_allowed_change_disable is not 0, - * and knows the read-side task is reading mems_allowed or mempolicy, - * so it will clear old bits lazily. - */ - smp_mb(); + return read_seqcount_begin(¤t->mems_allowed_seq); } -static inline void put_mems_allowed(void) +/* + * If this returns false, the operation that took place after get_mems_allowed + * may have failed. 
It is up to the caller to retry the operation if + * appropriate + */ +static inline bool put_mems_allowed(unsigned int seq) { - /* - * ensure that reading mems_allowed and mempolicy before reducing - * mems_allowed_change_disable. - * - * the write-side task will know that the read-side task is still - * reading mems_allowed or mempolicy, don't clears old bits in the - * nodemask. - */ - smp_mb(); - --ACCESS_ONCE(current->mems_allowed_change_disable); + return !read_seqcount_retry(¤t->mems_allowed_seq, seq); } static inline void set_mems_allowed(nodemask_t nodemask) @@ -234,12 +220,14 @@ static inline void set_mems_allowed(nodemask_t nodemask) { } -static inline void get_mems_allowed(void) +static inline unsigned int get_mems_allowed(void) { + return 0; } -static inline void put_mems_allowed(void) +static inline bool put_mems_allowed(unsigned int seq) { + return true; } #endif /* !CONFIG_CPUSETS */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 7d379a6..a0bb87a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1498,7 +1498,7 @@ struct task_struct { #endif #ifdef CONFIG_CPUSETS nodemask_t mems_allowed; /* Protected by alloc_lock */ - int mems_allowed_change_disable; + seqcount_t mems_allowed_seq; /* Seqence no to catch updates */ int cpuset_mem_spread_rotor; int cpuset_slab_spread_rotor; #endif diff --git a/kernel/cpuset.c b/kernel/cpuset.c index a09ac2b..5014493 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk, { bool need_loop; -repeat: /* * Allow tasks that have access to memory reserves because they have * been OOM killed to get memory anywhere. @@ -983,45 +982,19 @@ repeat: */ need_loop = task_has_mempolicy(tsk) || !nodes_intersects(*newmems, tsk->mems_allowed); - nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); - mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); - /* - * ensure checking ->mems_allowed_change_disable after setting all new - * allowed nodes. - * - * the read-side task can see an nodemask with new allowed nodes and - * old allowed nodes. and if it allocates page when cpuset clears newly - * disallowed ones continuous, it can see the new allowed bits. - * - * And if setting all new allowed nodes is after the checking, setting - * all new allowed nodes and clearing newly disallowed ones will be done - * continuous, and the read-side task may find no node to alloc page. - */ - smp_mb(); + if (need_loop) + write_seqcount_begin(&tsk->mems_allowed_seq); - /* - * Allocation of memory is very fast, we needn't sleep when waiting - * for the read-side. - */ - while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) { - task_unlock(tsk); - if (!task_curr(tsk)) - yield(); - goto repeat; - } - - /* - * ensure checking ->mems_allowed_change_disable before clearing all new - * disallowed nodes. - * - * if clearing newly disallowed bits before the checking, the read-side - * task may find no node to alloc page. 
- */ - smp_mb(); + nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); + mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); tsk->mems_allowed = *newmems; + + if (need_loop) + write_seqcount_end(&tsk->mems_allowed_seq); + task_unlock(tsk); } diff --git a/mm/filemap.c b/mm/filemap.c index b662757..56a1e11 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -498,12 +498,15 @@ struct page *__page_cache_alloc(gfp_t gfp) { int n; struct page *page; + unsigned int cpuset_mems_cookie; if (cpuset_do_page_mem_spread()) { - get_mems_allowed(); - n = cpuset_mem_spread_node(); - page = alloc_pages_exact_node(n, gfp, 0); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + n = cpuset_mem_spread_node(); + page = alloc_pages_exact_node(n, gfp, 0); + } while (!put_mems_allowed(cpuset_mems_cookie) && !page); + return page; } return alloc_pages(gfp, 0); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f34bd8..5f1e959 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -460,8 +460,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, struct zonelist *zonelist; struct zone *zone; struct zoneref *z; + unsigned int cpuset_mems_cookie; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); zonelist = huge_zonelist(vma, address, htlb_alloc_mask, &mpol, &nodemask); /* @@ -490,7 +492,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, } err: mpol_cond_put(mpol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 06b145f..013d981 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1843,18 +1843,24 @@ struct page * alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, int node) { - struct mempolicy *pol = get_vma_policy(current, vma, addr); + struct mempolicy *pol; struct zonelist *zl; struct page *page; + unsigned int cpuset_mems_cookie; + +retry_cpuset: + pol = get_vma_policy(current, vma, addr); + cpuset_mems_cookie = get_mems_allowed(); - get_mems_allowed(); if (unlikely(pol->mode == MPOL_INTERLEAVE)) { unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } zl = policy_zonelist(gfp, pol, node); @@ -1865,7 +1871,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, struct page *page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); __mpol_put(pol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } /* @@ -1873,7 +1880,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } @@ -1900,11 +1908,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) { struct mempolicy *pol = current->mempolicy; struct page *page; + unsigned int cpuset_mems_cookie; if (!pol || in_interrupt() || (gfp & __GFP_THISNODE)) pol = &default_policy; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* * No reference counting needed for current->mempolicy * nor system default_policy @@ -1915,7 +1926,10 @@ struct page 
*alloc_pages_current(gfp_t gfp, unsigned order) page = __alloc_pages_nodemask(gfp, order, policy_zonelist(gfp, pol, numa_node_id()), policy_nodemask(gfp, pol)); - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(alloc_pages_current); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d2186ec..3a667da 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2378,8 +2378,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, { enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; - struct page *page; + struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); + unsigned int cpuset_mems_cookie; gfp_mask &= gfp_allowed_mask; @@ -2398,15 +2399,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (unlikely(!zonelist->_zonerefs->zone)) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* The preferred zone is used for statistics later */ first_zones_zonelist(zonelist, high_zoneidx, nodemask ? : &cpuset_current_mems_allowed, &preferred_zone); - if (!preferred_zone) { - put_mems_allowed(); - return NULL; - } + if (!preferred_zone) + goto out; /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, @@ -2416,9 +2417,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, migratetype); - put_mems_allowed(); trace_mm_page_alloc(page, order, gfp_mask, migratetype); + +out: + /* + * When updating a tasks mems_allowed, it is possible to race with + * parallel threads in such a way that an allocation can fail while + * the mask is being updated. If a page allocation is about to fail, + * check if the cpuset changed during allocation and if so, retry. 
+ */ + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); @@ -2632,13 +2643,15 @@ void si_meminfo_node(struct sysinfo *val, int nid) bool skip_free_areas_node(unsigned int flags, int nid) { bool ret = false; + unsigned int cpuset_mems_cookie; if (!(flags & SHOW_MEM_FILTER_NODES)) goto out; - get_mems_allowed(); - ret = !node_isset(nid, cpuset_current_mems_allowed); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + ret = !node_isset(nid, cpuset_current_mems_allowed); + } while (!put_mems_allowed(cpuset_mems_cookie)); out: return ret; } diff --git a/mm/slab.c b/mm/slab.c index f0bd785..ae2db04 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3284,12 +3284,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; nid_alloc = nid_here = numa_mem_id(); - get_mems_allowed(); if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD)) nid_alloc = cpuset_slab_spread_node(); else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); - put_mems_allowed(); if (nid_alloc != nid_here) return ____cache_alloc_node(cachep, flags, nid_alloc); return NULL; @@ -3312,11 +3310,14 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; int nid; + unsigned int cpuset_mems_cookie; if (flags & __GFP_THISNODE) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); @@ -3372,7 +3373,9 @@ retry: } } } - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj)) + goto retry_cpuset; return obj; } diff --git a/mm/slub.c b/mm/slub.c index 4907563..ac5dddc 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1581,6 +1581,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *object; + unsigned int cpuset_mems_cookie; /* * The defrag ratio allows a configuration of the tradeoffs between @@ -1604,23 +1605,24 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, get_cycles() % 1024 > s->remote_node_defrag_ratio) return NULL; - get_mems_allowed(); - zonelist = node_zonelist(slab_node(current->mempolicy), flags); - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { - struct kmem_cache_node *n; - - n = get_node(s, zone_to_nid(zone)); - - if (n && cpuset_zone_allowed_hardwall(zone, flags) && - n->nr_partial > s->min_partial) { - object = get_partial_node(s, n, c); - if (object) { - put_mems_allowed(); - return object; + do { + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + struct kmem_cache_node *n; + + n = get_node(s, zone_to_nid(zone)); + + if (n && cpuset_zone_allowed_hardwall(zone, flags) && + n->nr_partial > s->min_partial) { + object = get_partial_node(s, n, c); + if (object) { + put_mems_allowed(cpuset_mems_cookie); + return object; + } } } - } - put_mems_allowed(); + } while (!put_mems_allowed(cpuset_mems_cookie)); #endif return NULL; } diff --git a/mm/vmscan.c b/mm/vmscan.c index c52b235..fccc048 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2337,7 +2337,6 @@ static unsigned long do_try_to_free_pages(struct zonelist 
*zonelist, unsigned long writeback_threshold; bool aborted_reclaim; - get_mems_allowed(); delayacct_freepages_start(); if (global_reclaim(sc)) @@ -2401,7 +2400,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, out: delayacct_freepages_end(); - put_mems_allowed(); if (sc->nr_reclaimed) return sc->nr_reclaimed; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 11:23 [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator Mel Gorman 2012-03-02 16:19 ` Christoph Lameter @ 2012-03-02 21:21 ` Peter Zijlstra 2012-03-05 20:18 ` Andrew Morton 1 sibling, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2012-03-02 21:21 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Miao Xie, Christoph Lameter, linux-mm, linux-kernel

On Fri, 2012-03-02 at 11:23 +0000, Mel Gorman wrote:
> For extra style points, the commit introduced the use of yield() in an
> implementation of what looks like a spinning mutex.

Andrew, could you simply say no to any patch adding a yield()? There's a 99% chance it's a bug, as was this.

This code would live-lock when cpuset_change_task_nodemask() would be called by the highest priority FIFO task on UP or when pinned to the same cpu as the task doing get_mems_allowed().

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 21:21 ` Peter Zijlstra @ 2012-03-05 20:18 ` Andrew Morton 2012-03-06 2:01 ` [RFC PATCH] checkpatch: Warn on use of yield() Joe Perches 0 siblings, 1 reply; 20+ messages in thread From: Andrew Morton @ 2012-03-05 20:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Joe Perches On Fri, 02 Mar 2012 22:21:02 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Fri, 2012-03-02 at 11:23 +0000, Mel Gorman wrote: > > For extra style points, the commit introduced the use of yield() in an > > implementation of what looks like a spinning mutex. > > Andrew, could you simply say no to any patch adding a yield()? There's a > 99% chance its a bug, as was this. I'd normally at least poke my tongue out at it - I must have missed this one. > This code would life-lock when cpuset_change_task_nodemask() would be > called by the highest priority FIFO task on UP or when pinned to the > same cpu the task doing get_mems_allowed(). Joe, can we please have a checkpatch rule? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH] checkpatch: Warn on use of yield() 2012-03-05 20:18 ` Andrew Morton @ 2012-03-06 2:01 ` Joe Perches 2012-03-06 12:45 ` Peter Zijlstra 0 siblings, 1 reply; 20+ messages in thread From: Joe Perches @ 2012-03-06 2:01 UTC (permalink / raw) To: Andrew Morton Cc: Peter Zijlstra, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel Use of yield() is deprecated or at least generally undesirable. Add a checkpatch warning when it's used. Suggest cpu_relax instead. Signed-off-by: Joe Perches <joe@perches.com> --- > Joe, can we please have a checkpatch rule? Something like this? scripts/checkpatch.pl | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index e32ea7f..80ad474 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -3300,6 +3300,11 @@ sub process { "__func__ should be used instead of gcc specific __FUNCTION__\n" . $herecurr); } +# check for use of yield() + if ($line =~ /\byield\s*\(\s*\)/) { + WARN("YIELD", + "yield() is deprecated, consider cpu_relax()\n" . $herecurr); + } # check for semaphores initialized locked if ($line =~ /^.\s*sema_init.+,\W?0\W?\)/) { WARN("CONSIDER_COMPLETION", -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Peter Zijlstra @ 2012-03-06 12:45 UTC
To: Joe Perches
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Mon, 2012-03-05 at 18:01 -0800, Joe Perches wrote:

> +# check for use of yield()
> +		if ($line =~ /\byield\s*\(\s*\)/) {
> +			WARN("YIELD",
> +			     "yield() is deprecated, consider cpu_relax()\n" . $herecurr);
> +		}

It's not deprecated as such, it's just a very dangerous and ill-considered
API.

cpu_relax() is not a good substitute suggestion in that it is still a busy
wait and prone to many of the same problems.

The case at hand was a live-lock due to expecting that yield() would run
another process which it needed in order to complete. Yield() does not
provide that guarantee.

Looking at fs/ext4/mballoc.c, we have this gem:

	/*
	 * Yield the CPU here so that we don't get soft lockup
	 * in non preempt case.
	 */
	yield();

This is of course complete crap as well.. I suspect they want
cond_resched() there. And:

	/* let others to free the space */
	yield();

Like I said, yield() doesn't guarantee anything like running anybody else;
does it rely on that, or is it optimistic?

Another fun user:

void tasklet_kill(struct tasklet_struct *t)
{
	if (in_interrupt())
		printk("Attempt to kill tasklet from interrupt\n");

	while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) {
		do {
			yield();
		} while (test_bit(TASKLET_STATE_SCHED, &t->state));
	}
	tasklet_unlock_wait(t);
	clear_bit(TASKLET_STATE_SCHED, &t->state);
}

The only reason that doesn't explode is that running tasklets is
non-preemptible. However, since they're non-preemptible they shouldn't run
for long, and you might as well busy-spin. If they can run long, yield()
isn't your biggest problem.

mm/memory_hotplug.c has two yield() calls in offline_pages() and I've no
idea what they're trying to achieve.

But really, yield() is basically _always_ the wrong thing. The right thing
can be:

	cond_resched();
	wait_event();

or something entirely different.

So instead of suggesting an alternative, I would suggest thinking about
the actual problem in order to avoid the non-thinking solutions the
checkpatch brigade is so overly fond of :/

Maybe something like:

	"yield() is dangerous and wrong, rework your code to not use it."

That at least requires some sort of thinking and doesn't suggest blind
substitution.
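The wait_event() alternative Peter points to looks roughly like this in
practice. This is a minimal sketch with invented names (event_wq,
event_ready and the three functions), not code from ext4, the tasklet code
or any other file discussed above.

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(event_wq);
static bool event_ready;

/* Broken: assumes yield() will run whatever task sets event_ready. */
static void wait_broken(void)
{
	while (!event_ready)
		yield();
}

/* Waiting for another task's progress: sleep until woken. */
static void wait_correct(void)
{
	wait_event(event_wq, event_ready);
}

/* Producer side pairs the condition update with a wakeup. */
static void make_ready(void)
{
	event_ready = true;
	wake_up(&event_wq);
}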
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Glauber Costa @ 2012-03-06 13:14 UTC
To: Peter Zijlstra
Cc: Joe Perches, Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On 03/06/2012 04:45 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-05 at 18:01 -0800, Joe Perches wrote:
>> +# check for use of yield()
>> +		if ($line =~ /\byield\s*\(\s*\)/) {
>> +			WARN("YIELD",
>> +			     "yield() is deprecated, consider cpu_relax()\n" . $herecurr);
>> +		}
>
> It's not deprecated as such, it's just a very dangerous and ill-considered
> API.
>
> cpu_relax() is not a good substitute suggestion in that it is still a busy
> wait and prone to many of the same problems.
>
> [...]
>
> So instead of suggesting an alternative, I would suggest thinking about
> the actual problem in order to avoid the non-thinking solutions the
> checkpatch brigade is so overly fond of :/
>
> Maybe something like:
>
>	"yield() is dangerous and wrong, rework your code to not use it."
>
> That at least requires some sort of thinking and doesn't suggest blind
> substitution.

Can't we point people to some Documentation file that explains the
alternatives?
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Peter Zijlstra @ 2012-03-06 13:25 UTC
To: Glauber Costa
Cc: Joe Perches, Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 17:14 +0400, Glauber Costa wrote:
>
> Can't we point people to some Documentation file that explains the
> alternatives?

Not sure that's a finite set.. however I think we covered the most
popular ones in this thread. One could use a lkml.kernel.org link.
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Glauber Costa @ 2012-03-06 13:27 UTC
To: Peter Zijlstra
Cc: Joe Perches, Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On 03/06/2012 05:25 PM, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 17:14 +0400, Glauber Costa wrote:
>>
>> Can't we point people to some Documentation file that explains the
>> alternatives?
>
> Not sure that's a finite set.. however I think we covered the most
> popular ones in this thread. One could use a lkml.kernel.org link.

Yes, I think that would work. Summarizing your arguments in an in-tree
Documentation file would be good as well, IMHO.
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Joe Perches @ 2012-03-06 17:41 UTC
To: Peter Zijlstra
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 13:45 +0100, Peter Zijlstra wrote:
> The case at hand was a live-lock due to expecting that yield() would run
> another process which it needed in order to complete. Yield() does not
> provide that guarantee.

OK. Perhaps the kernel-doc comments in sched/core.c should/could be
expanded/updated.

/**
 * sys_sched_yield - yield the current processor to other threads.
 *
 * This function yields the current CPU to other tasks. If there are no
 * other threads running on this CPU then this function will return.
 */
[]
/**
 * yield - yield the current processor to other threads.
 *
 * This is a shortcut for kernel-space yielding - it marks the
 * thread runnable and calls sys_sched_yield().
 */
void __sched yield(void)
{
	set_current_state(TASK_RUNNING);
	sys_sched_yield();
}
EXPORT_SYMBOL(yield);
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Peter Zijlstra @ 2012-03-06 17:54 UTC
To: Joe Perches
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 09:41 -0800, Joe Perches wrote:
> Perhaps the kernel-doc comments in sched/core.c should/could be
> expanded/updated.

Something like this?

---
 kernel/sched/core.c |   20 ++++++++++++++++++--
 1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2963fbb..a05a0f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4577,8 +4577,24 @@ EXPORT_SYMBOL(__cond_resched_softirq);
 /**
  * yield - yield the current processor to other threads.
  *
- * This is a shortcut for kernel-space yielding - it marks the
- * thread runnable and calls sys_sched_yield().
+ * Do not ever use this function, there's a 99% chance you're doing it wrong.
+ *
+ * The scheduler is at all times free to pick the calling task as the most
+ * eligible task to run; if removing the yield() call from your code breaks
+ * it, it's already broken.
+ *
+ * Typical broken usage is:
+ *
+ * while (!event)
+ *	yield();
+ *
+ * where one assumes that yield() will let 'the other' process run that will
+ * make event true. If the current task is a SCHED_FIFO task that will never
+ * happen. Never use yield() as a progress guarantee!!
+ *
+ * If you want to use yield() to wait for something, use wait_event().
+ * If you want to use yield() to be 'nice' for others, use cond_resched().
+ * If you still want to use yield(), do not!
  */
 void __sched yield(void)
 {
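For the 'be nice' case the new comment recommends cond_resched(); in a long
non-preemptible loop that looks roughly like the sketch below. It is purely
illustrative: the function name, the loop and its work are invented and not
tied to any real subsystem.

#include <linux/kernel.h>
#include <linux/sched.h>

/* Hypothetical long-running scan. */
static void long_scan(unsigned long nr_items)
{
	unsigned long i;

	for (i = 0; i < nr_items; i++) {
		/* ... one bounded unit of work ... */

		/*
		 * Let other runnable tasks in and keep the soft-lockup
		 * watchdog quiet on non-preemptible kernels. Unlike yield(),
		 * this makes no pretence of waiting for another task's
		 * progress.
		 */
		cond_resched();
	}
}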
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Joe Perches @ 2012-03-06 18:00 UTC
To: Peter Zijlstra
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 18:54 +0100, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 09:41 -0800, Joe Perches wrote:
> > Perhaps the kernel-doc comments in sched/core.c should/could be
> > expanded/updated.
>
> Something like this?
>
> [...]
>
> + * If you want to use yield() to wait for something, use wait_event().
> + * If you want to use yield() to be 'nice' for others, use cond_resched().
> + * If you still want to use yield(), do not!
>  */

Yes. I'll update the checkpatch message to say something like:

	"Using yield() is generally wrong.
	 See yield() kernel-doc (sched/core.c)"
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Joe Perches @ 2012-03-06 18:17 UTC
To: Peter Zijlstra
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 10:00 -0800, Joe Perches wrote:
> On Tue, 2012-03-06 at 18:54 +0100, Peter Zijlstra wrote:
> > On Tue, 2012-03-06 at 09:41 -0800, Joe Perches wrote:
> > > Perhaps the kernel-doc comments in sched/core.c should/could be
> > > expanded/updated.
> > Something like this?
[]
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
[]
> > @@ -4577,8 +4577,24 @@ EXPORT_SYMBOL(__cond_resched_softirq);
> >  /**
> >   * yield - yield the current processor to other threads.

Perhaps the phrase "other threads" is poor word choice. Maybe:

	yield - yield the current processor to a runnable thread
	        (might be the current thread)
Thread overview: 20+ messages (newest: 2012-03-06 23:31 UTC)

2012-03-02 11:23 [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator Mel Gorman
2012-03-02 16:19 ` Christoph Lameter
2012-03-02 17:43 ` Mel Gorman
2012-03-02 19:53 ` Christoph Lameter
2012-03-02 21:25 ` Peter Zijlstra
2012-03-02 23:47 ` David Rientjes
2012-03-05  9:44 ` Mel Gorman
2012-03-06 23:31 ` David Rientjes
2012-03-05  9:35 ` Mel Gorman
2012-03-02 21:21 ` Peter Zijlstra
2012-03-05 20:18 ` Andrew Morton
2012-03-06  2:01 ` [RFC PATCH] checkpatch: Warn on use of yield() Joe Perches
2012-03-06 12:45 ` Peter Zijlstra
2012-03-06 13:14 ` Glauber Costa
2012-03-06 13:25 ` Peter Zijlstra
2012-03-06 13:27 ` Glauber Costa
2012-03-06 17:41 ` Joe Perches
2012-03-06 17:54 ` Peter Zijlstra
2012-03-06 18:00 ` Joe Perches
2012-03-06 18:17 ` Joe Perches