* [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator @ 2012-03-02 11:23 Mel Gorman 2012-03-02 16:19 ` Christoph Lameter 2012-03-02 21:21 ` Peter Zijlstra 0 siblings, 2 replies; 20+ messages in thread From: Mel Gorman @ 2012-03-02 11:23 UTC (permalink / raw) To: Andrew Morton Cc: Miao Xie, Peter Zijlstra, Christoph Lameter, linux-mm, linux-kernel

Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing cpuset's mems] wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating a large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex.

This patch ditches the memory barriers and introduces something similar to a seq lock using atomics on the write side. I know that atomics have similar cost to memory barriers ordinarily but for the fast paths, an atomic read should be a simple read on most architectures and be less painful in general.

The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets updated each time the mask is about to change. An allocation in progress may succeed if the nodemask update happens during allocation but this is not of concern. More importantly, it should not fail an allocation that should have succeeded.

In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were

                              3.3.0-rc3             3.3.0-rc3
                            rc3-vanilla        nobarrier-v1r1
Clients 1 UserTime        0.07 (  0.00%)        0.07 (  8.11%)
Clients 2 UserTime        0.07 (  0.00%)        0.07 ( -0.68%)
Clients 4 UserTime        0.08 (  0.00%)        0.08 (  0.66%)
Clients 1 SysTime         0.70 (  0.00%)        0.67 (  4.15%)
Clients 2 SysTime         0.85 (  0.00%)        0.82 (  3.47%)
Clients 4 SysTime         1.41 (  0.00%)        1.41 (  0.39%)
Clients 1 WallTime        0.77 (  0.00%)        0.74 (  4.19%)
Clients 2 WallTime        0.47 (  0.00%)        0.45 (  3.94%)
Clients 4 WallTime        0.38 (  0.00%)        0.38 (  1.18%)
Clients 1 Flt/sec/cpu  497620.28 (  0.00%)  519277.40 (  4.35%)
Clients 2 Flt/sec/cpu  414639.05 (  0.00%)  429866.80 (  3.67%)
Clients 4 Flt/sec/cpu  257959.16 (  0.00%)  258865.73 (  0.35%)
Clients 1 Flt/sec      495161.39 (  0.00%)  516980.82 (  4.41%)
Clients 2 Flt/sec      820325.95 (  0.00%)  849962.87 (  3.61%)
Clients 4 Flt/sec     1020068.93 (  0.00%) 1023208.69 (  0.31%)

MMTests Statistics: duration
Sys Time Running Test (seconds)            135.68    133.18
User+Sys Time Running Test (seconds)        164.2    161.41
Total Elapsed Time (seconds)               123.46    122.20

The overall improvement is tiny but the System CPU time is much improved and roughly in line with what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced.

To test the actual bug the commit fixed, I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data.
In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> --- include/linux/cpuset.h | 39 ++++++++++++++------------------------- include/linux/sched.h | 2 +- kernel/cpuset.c | 44 ++++++++++---------------------------------- mm/filemap.c | 11 +++++++---- mm/hugetlb.c | 7 +++++-- mm/mempolicy.c | 28 +++++++++++++++++++++------- mm/page_alloc.c | 33 +++++++++++++++++++++++---------- mm/slab.c | 11 +++++++---- mm/slub.c | 32 +++++++++++++++++--------------- mm/vmscan.c | 2 -- 10 files changed, 105 insertions(+), 104 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index e9eaec5..ba6d217 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -92,38 +92,25 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); * reading current mems_allowed and mempolicy in the fastpath must protected * by get_mems_allowed() */ -static inline void get_mems_allowed(void) +static inline unsigned long get_mems_allowed(void) { - current->mems_allowed_change_disable++; - - /* - * ensure that reading mems_allowed and mempolicy happens after the - * update of ->mems_allowed_change_disable. - * - * the write-side task finds ->mems_allowed_change_disable is not 0, - * and knows the read-side task is reading mems_allowed or mempolicy, - * so it will clear old bits lazily. - */ - smp_mb(); + return atomic_read(¤t->mems_allowed_seq); } -static inline void put_mems_allowed(void) +/* + * If this returns false, the operation that took place after get_mems_allowed + * may have failed. It is up to the caller to retry the operation if + * appropriate + */ +static inline bool put_mems_allowed(unsigned long seq) { - /* - * ensure that reading mems_allowed and mempolicy before reducing - * mems_allowed_change_disable. - * - * the write-side task will know that the read-side task is still - * reading mems_allowed or mempolicy, don't clears old bits in the - * nodemask. 
- */ - smp_mb(); - --ACCESS_ONCE(current->mems_allowed_change_disable); + return seq == atomic_read(¤t->mems_allowed_seq); } static inline void set_mems_allowed(nodemask_t nodemask) { task_lock(current); + atomic_inc(¤t->mems_allowed_seq); current->mems_allowed = nodemask; task_unlock(current); } @@ -234,12 +221,14 @@ static inline void set_mems_allowed(nodemask_t nodemask) { } -static inline void get_mems_allowed(void) +static inline unsigned long get_mems_allowed(void) { + return 0; } -static inline void put_mems_allowed(void) +static inline bool put_mems_allowed(unsigned long seq) { + return true; } #endif /* !CONFIG_CPUSETS */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 7d379a6..295dc1b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1498,7 +1498,7 @@ struct task_struct { #endif #ifdef CONFIG_CPUSETS nodemask_t mems_allowed; /* Protected by alloc_lock */ - int mems_allowed_change_disable; + atomic_t mems_allowed_seq; /* Seqence no to catch updates */ int cpuset_mem_spread_rotor; int cpuset_slab_spread_rotor; #endif diff --git a/kernel/cpuset.c b/kernel/cpuset.c index a09ac2b..6bdefed 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk, { bool need_loop; -repeat: /* * Allow tasks that have access to memory reserves because they have * been OOM killed to get memory anywhere. @@ -983,45 +982,22 @@ repeat: */ need_loop = task_has_mempolicy(tsk) || !nodes_intersects(*newmems, tsk->mems_allowed); - nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); - mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); - /* - * ensure checking ->mems_allowed_change_disable after setting all new - * allowed nodes. - * - * the read-side task can see an nodemask with new allowed nodes and - * old allowed nodes. and if it allocates page when cpuset clears newly - * disallowed ones continuous, it can see the new allowed bits. - * - * And if setting all new allowed nodes is after the checking, setting - * all new allowed nodes and clearing newly disallowed ones will be done - * continuous, and the read-side task may find no node to alloc page. - */ - smp_mb(); + if (need_loop) + atomic_inc(&tsk->mems_allowed_seq); - /* - * Allocation of memory is very fast, we needn't sleep when waiting - * for the read-side. - */ - while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) { - task_unlock(tsk); - if (!task_curr(tsk)) - yield(); - goto repeat; - } + nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); + mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); - /* - * ensure checking ->mems_allowed_change_disable before clearing all new - * disallowed nodes. - * - * if clearing newly disallowed bits before the checking, the read-side - * task may find no node to alloc page. 
- */ - smp_mb(); + if (need_loop) + atomic_inc(&tsk->mems_allowed_seq); mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); tsk->mems_allowed = *newmems; + + if (need_loop) + atomic_inc(&tsk->mems_allowed_seq); + task_unlock(tsk); } diff --git a/mm/filemap.c b/mm/filemap.c index b662757..cf40c4c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -498,12 +498,15 @@ struct page *__page_cache_alloc(gfp_t gfp) { int n; struct page *page; + unsigned long cpuset_mems_cookie; if (cpuset_do_page_mem_spread()) { - get_mems_allowed(); - n = cpuset_mem_spread_node(); - page = alloc_pages_exact_node(n, gfp, 0); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + n = cpuset_mem_spread_node(); + page = alloc_pages_exact_node(n, gfp, 0); + } while (!put_mems_allowed(cpuset_mems_cookie) && !page); + return page; } return alloc_pages(gfp, 0); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f34bd8..613698e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -460,8 +460,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, struct zonelist *zonelist; struct zone *zone; struct zoneref *z; + unsigned long cpuset_mems_cookie; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); zonelist = huge_zonelist(vma, address, htlb_alloc_mask, &mpol, &nodemask); /* @@ -490,7 +492,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, } err: mpol_cond_put(mpol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 06b145f..d80e16f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1843,18 +1843,24 @@ struct page * alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, int node) { - struct mempolicy *pol = get_vma_policy(current, vma, addr); + struct mempolicy *pol; struct zonelist *zl; struct page *page; + unsigned long cpuset_mems_cookie; + +retry_cpuset: + pol = get_vma_policy(current, vma, addr); + cpuset_mems_cookie = get_mems_allowed(); - get_mems_allowed(); if (unlikely(pol->mode == MPOL_INTERLEAVE)) { unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } zl = policy_zonelist(gfp, pol, node); @@ -1865,7 +1871,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, struct page *page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); __mpol_put(pol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } /* @@ -1873,7 +1880,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } @@ -1900,11 +1908,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) { struct mempolicy *pol = current->mempolicy; struct page *page; + unsigned long cpuset_mems_cookie; if (!pol || in_interrupt() || (gfp & __GFP_THISNODE)) pol = &default_policy; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* * No reference counting needed for current->mempolicy * nor system default_policy @@ -1915,7 +1926,10 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) page = 
__alloc_pages_nodemask(gfp, order, policy_zonelist(gfp, pol, numa_node_id()), policy_nodemask(gfp, pol)); - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(alloc_pages_current); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d2186ec..b8084d0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2378,8 +2378,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, { enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; - struct page *page; + struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); + unsigned long cpuset_mems_cookie; gfp_mask &= gfp_allowed_mask; @@ -2398,15 +2399,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (unlikely(!zonelist->_zonerefs->zone)) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* The preferred zone is used for statistics later */ first_zones_zonelist(zonelist, high_zoneidx, nodemask ? : &cpuset_current_mems_allowed, &preferred_zone); - if (!preferred_zone) { - put_mems_allowed(); - return NULL; - } + if (!preferred_zone) + goto out; /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, @@ -2416,9 +2417,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, migratetype); - put_mems_allowed(); trace_mm_page_alloc(page, order, gfp_mask, migratetype); + +out: + /* + * When updating a tasks mems_allowed, it is possible to race with + * parallel threads in such a way that an allocation can fail while + * the mask is being updated. If a page allocation is about to fail, + * check if the cpuset changed during allocation and if so, retry. 
+ */ + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); @@ -2632,13 +2643,15 @@ void si_meminfo_node(struct sysinfo *val, int nid) bool skip_free_areas_node(unsigned int flags, int nid) { bool ret = false; + unsigned long cpuset_mems_cookie; if (!(flags & SHOW_MEM_FILTER_NODES)) goto out; - get_mems_allowed(); - ret = !node_isset(nid, cpuset_current_mems_allowed); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + ret = !node_isset(nid, cpuset_current_mems_allowed); + } while (!put_mems_allowed(cpuset_mems_cookie)); out: return ret; } diff --git a/mm/slab.c b/mm/slab.c index f0bd785..0b4f6ea 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3284,12 +3284,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; nid_alloc = nid_here = numa_mem_id(); - get_mems_allowed(); if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD)) nid_alloc = cpuset_slab_spread_node(); else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); - put_mems_allowed(); if (nid_alloc != nid_here) return ____cache_alloc_node(cachep, flags, nid_alloc); return NULL; @@ -3312,11 +3310,14 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; int nid; + unsigned long cpuset_mems_cookie; if (flags & __GFP_THISNODE) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); @@ -3372,7 +3373,9 @@ retry: } } } - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj)) + goto retry_cpuset; return obj; } diff --git a/mm/slub.c b/mm/slub.c index 4907563..6515c98 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1581,6 +1581,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *object; + unsigned long cpuset_mems_cookie; /* * The defrag ratio allows a configuration of the tradeoffs between @@ -1604,23 +1605,24 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, get_cycles() % 1024 > s->remote_node_defrag_ratio) return NULL; - get_mems_allowed(); - zonelist = node_zonelist(slab_node(current->mempolicy), flags); - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { - struct kmem_cache_node *n; - - n = get_node(s, zone_to_nid(zone)); - - if (n && cpuset_zone_allowed_hardwall(zone, flags) && - n->nr_partial > s->min_partial) { - object = get_partial_node(s, n, c); - if (object) { - put_mems_allowed(); - return object; + do { + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + struct kmem_cache_node *n; + + n = get_node(s, zone_to_nid(zone)); + + if (n && cpuset_zone_allowed_hardwall(zone, flags) && + n->nr_partial > s->min_partial) { + object = get_partial_node(s, n, c); + if (object) { + put_mems_allowed(cpuset_mems_cookie); + return object; + } } } - } - put_mems_allowed(); + } while (!put_mems_allowed(cpuset_mems_cookie)); #endif return NULL; } diff --git a/mm/vmscan.c b/mm/vmscan.c index c52b235..fccc048 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2337,7 +2337,6 @@ static unsigned long do_try_to_free_pages(struct 
zonelist *zonelist, unsigned long writeback_threshold; bool aborted_reclaim; - get_mems_allowed(); delayacct_freepages_start(); if (global_reclaim(sc)) @@ -2401,7 +2400,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, out: delayacct_freepages_end(); - put_mems_allowed(); if (sc->nr_reclaimed) return sc->nr_reclaimed; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 20+ messages in thread
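Condensed out of the diff above, the scheme amounts to something like the following sketch: plain C11 atomics with hypothetical names (mems_seq, try_alloc, update_mask). It mirrors the retry control flow of the patch only, not the kernel's exact memory-ordering behaviour, which is exactly the point debated in the replies below.

/*
 * Toy model of the cookie/retry scheme: the reader snapshots a sequence
 * counter, attempts the allocation, and retries only when the allocation
 * failed *and* the counter moved (i.e. a nodemask update raced with it).
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint mems_seq;	/* bumped around nodemask updates that could cause false failures */

static unsigned int get_mems_cookie(void)
{
	return atomic_load(&mems_seq);
}

static bool mems_unchanged(unsigned int cookie)
{
	return cookie == atomic_load(&mems_seq);
}

/* Write side: only bother readers when a false failure is actually possible. */
void update_mask(bool may_cause_false_failure)
{
	if (may_cause_false_failure)
		atomic_fetch_add(&mems_seq, 1);
	/* ... rewrite the task's nodemask and mempolicy here ... */
	if (may_cause_false_failure)
		atomic_fetch_add(&mems_seq, 1);
}

void *try_alloc(void);	/* stand-in for the real page allocator */

void *alloc_with_retry(void)
{
	void *page;
	unsigned int cookie;

	do {
		cookie = get_mems_cookie();
		page = try_alloc();
		/* retry only if the allocation failed *and* the mask moved under us */
	} while (!page && !mems_unchanged(cookie));

	return page;
}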
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 11:23 [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator Mel Gorman @ 2012-03-02 16:19 ` Christoph Lameter 2012-03-02 17:43 ` Mel Gorman 2012-03-02 21:21 ` Peter Zijlstra 1 sibling, 1 reply; 20+ messages in thread From: Christoph Lameter @ 2012-03-02 16:19 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Miao Xie, Peter Zijlstra, linux-mm, linux-kernel On Fri, 2 Mar 2012, Mel Gorman wrote: > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index e9eaec5..ba6d217 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -92,38 +92,25 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); > * reading current mems_allowed and mempolicy in the fastpath must protected > * by get_mems_allowed() > */ > -static inline void get_mems_allowed(void) > +static inline unsigned long get_mems_allowed(void) > { > - current->mems_allowed_change_disable++; > - > - /* > - * ensure that reading mems_allowed and mempolicy happens after the > - * update of ->mems_allowed_change_disable. > - * > - * the write-side task finds ->mems_allowed_change_disable is not 0, > - * and knows the read-side task is reading mems_allowed or mempolicy, > - * so it will clear old bits lazily. > - */ > - smp_mb(); > + return atomic_read(¤t->mems_allowed_seq); > } > > -static inline void put_mems_allowed(void) > +/* > + * If this returns false, the operation that took place after get_mems_allowed > + * may have failed. It is up to the caller to retry the operation if > + * appropriate > + */ > +static inline bool put_mems_allowed(unsigned long seq) > { > - /* > - * ensure that reading mems_allowed and mempolicy before reducing > - * mems_allowed_change_disable. > - * > - * the write-side task will know that the read-side task is still > - * reading mems_allowed or mempolicy, don't clears old bits in the > - * nodemask. > - */ > - smp_mb(); > - --ACCESS_ONCE(current->mems_allowed_change_disable); > + return seq == atomic_read(¤t->mems_allowed_seq); > } Use seqlock instead of the counter? Seems that you are recoding much of what a seqlock does. A seqlock also allows you to have a writer that sort of blocks the reades if necessary. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 16:19 ` Christoph Lameter @ 2012-03-02 17:43 ` Mel Gorman 2012-03-02 19:53 ` Christoph Lameter 2012-03-02 21:25 ` Peter Zijlstra 0 siblings, 2 replies; 20+ messages in thread From: Mel Gorman @ 2012-03-02 17:43 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Miao Xie, Peter Zijlstra, linux-mm, linux-kernel On Fri, Mar 02, 2012 at 10:19:55AM -0600, Christoph Lameter wrote: > On Fri, 2 Mar 2012, Mel Gorman wrote: > > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > > index e9eaec5..ba6d217 100644 > > --- a/include/linux/cpuset.h > > +++ b/include/linux/cpuset.h > > @@ -92,38 +92,25 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); > > * reading current mems_allowed and mempolicy in the fastpath must protected > > * by get_mems_allowed() > > */ > > -static inline void get_mems_allowed(void) > > +static inline unsigned long get_mems_allowed(void) > > { > > - current->mems_allowed_change_disable++; > > - > > - /* > > - * ensure that reading mems_allowed and mempolicy happens after the > > - * update of ->mems_allowed_change_disable. > > - * > > - * the write-side task finds ->mems_allowed_change_disable is not 0, > > - * and knows the read-side task is reading mems_allowed or mempolicy, > > - * so it will clear old bits lazily. > > - */ > > - smp_mb(); > > + return atomic_read(¤t->mems_allowed_seq); > > } > > > > -static inline void put_mems_allowed(void) > > +/* > > + * If this returns false, the operation that took place after get_mems_allowed > > + * may have failed. It is up to the caller to retry the operation if > > + * appropriate > > + */ > > +static inline bool put_mems_allowed(unsigned long seq) > > { > > - /* > > - * ensure that reading mems_allowed and mempolicy before reducing > > - * mems_allowed_change_disable. > > - * > > - * the write-side task will know that the read-side task is still > > - * reading mems_allowed or mempolicy, don't clears old bits in the > > - * nodemask. > > - */ > > - smp_mb(); > > - --ACCESS_ONCE(current->mems_allowed_change_disable); > > + return seq == atomic_read(¤t->mems_allowed_seq); > > } > > Use seqlock instead of the counter? Seems that you are recoding much of > what a seqlock does. A seqlock also allows you to have a writer that sort > of blocks the reades if necessary. > I considered using a seqlock but it isn't cheap. The read side is heavy with the possibility that it starts spinning and incurs a read barrier (looking at read_seqbegin()) here. The retry block incurs another read barrier so basically it would not be no better than what is there currently (which at a 4% performance hit, sucks) In the case of seqlocks, a reader will backoff if a writer is in progress but the page allocator doesn't need that which is why I felt it was ok to special case. Instead, it will try allocate a page while the update is in progress and only take special action if the allocation will fail. Allocation failure is an unusual situation that can trigger application exit or an OOM so it's ok to treat it as a slow path. A normal seqlock would retry unconditionally and potentially have to handle the case where it needs to free the page before retrying which is pointless. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 17:43 ` Mel Gorman @ 2012-03-02 19:53 ` Christoph Lameter 2012-03-02 21:25 ` Peter Zijlstra 1 sibling, 0 replies; 20+ messages in thread From: Christoph Lameter @ 2012-03-02 19:53 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Miao Xie, Peter Zijlstra, linux-mm, linux-kernel On Fri, 2 Mar 2012, Mel Gorman wrote: > I considered using a seqlock but it isn't cheap. The read side is heavy > with the possibility that it starts spinning and incurs a read barrier > (looking at read_seqbegin()) here. The retry block incurs another read > barrier so basically it would not be no better than what is there currently > (which at a 4% performance hit, sucks) Oh. You dont have a read barrier? So your approach is buggy? We could have read a state before someone else incremented the seq counter, then cached it, then we read the counter, did the processing and found that the sequid was not changed? > In the case of seqlocks, a reader will backoff if a writer is in progress > but the page allocator doesn't need that which is why I felt it was ok You can just not use the writer section if you think that is ok. Doubt it but lets at least start using a known serialization construct that would allow us to fix it up if we find that we need to update multiple variables protected by the seqlock. > Allocation failure is an unusual situation that can trigger application > exit or an OOM so it's ok to treat it as a slow path. A normal seqlock > would retry unconditionally and potentially have to handle the case > where it needs to free the page before retrying which is pointless. It will only retry as long as the writer hold the "lock". Like a spinlock the holdoff times depends on the size of the critical section and initially you could just avoid having write sections. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
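The ordering problem Christoph is pointing at is the standard seqcount-reader requirement. A rough C11 model is below, illustrative only: in the kernel the same ordering comes from the smp_rmb() calls inside read_seqcount_begin() and read_seqcount_retry(), and it assumes the writer publishes its update with matching (release) ordering on the counter.

#include <stdatomic.h>

extern atomic_uint seq;		/* bumped by the writer around updates, with release ordering */
extern atomic_int guarded;	/* stands in for the protected state, e.g. mems_allowed */

/*
 * Broken reader: nothing orders the read of `guarded` against the two reads
 * of `seq`, so a stale value can be returned even though the counter appears
 * unchanged -- exactly the scenario described above.
 */
int racy_read(void)
{
	unsigned int s;
	int val;

	do {
		s = atomic_load_explicit(&seq, memory_order_relaxed);
		val = atomic_load_explicit(&guarded, memory_order_relaxed);
	} while (s != atomic_load_explicit(&seq, memory_order_relaxed));

	return val;
}

/*
 * Reader with the ordering a seqcount provides: the acquire load keeps the
 * data read after the first counter read, and the acquire fence (the C11
 * analogue of smp_rmb()) keeps it before the recheck.
 */
int ordered_read(void)
{
	unsigned int s;
	int val;

	do {
		s = atomic_load_explicit(&seq, memory_order_acquire);
		val = atomic_load_explicit(&guarded, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (s != atomic_load_explicit(&seq, memory_order_relaxed));

	return val;
}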
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 17:43 ` Mel Gorman 2012-03-02 19:53 ` Christoph Lameter @ 2012-03-02 21:25 ` Peter Zijlstra 2012-03-02 23:47 ` David Rientjes 2012-03-05 9:35 ` Mel Gorman 1 sibling, 2 replies; 20+ messages in thread From: Peter Zijlstra @ 2012-03-02 21:25 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel

On Fri, 2012-03-02 at 17:43 +0000, Mel Gorman wrote:
>
> I considered using a seqlock but it isn't cheap. The read side is heavy
> with the possibility that it starts spinning and incurs a read barrier
> (looking at read_seqbegin()) here. The retry block incurs another read
> barrier so basically it would not be no better than what is there currently
> (which at a 4% performance hit, sucks)

Use seqcount.

Also, for the write side it doesn't really matter, changing mems_allowed should be rare and is an 'expensive' operation anyway.

For the read side you can do:

again:
	seq = read_seqcount_begin(&current->mems_seq);

	page = do_your_allocator_muck();

	if (!page && read_seqcount_retry(&current->mems_seq, seq))
		goto again;

	oom();

That way, you only have one smp_rmb() in your fast path, read_seqcount_begin() doesn't spin, and you only incur the second smp_rmb() when you've completely failed to allocate anything.

smp_rmb() is basically free on x86, other archs will incur some overhead, but you need a barrier as Christoph pointed out.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
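The write side that pairs with this read loop would look roughly like the fragment below. This is a sketch only; the v2 patch later in the thread does essentially this in cpuset_change_task_nodemask() once mems_allowed_seq becomes a seqcount_t in task_struct.

	/* Writer: enter the write section only when the update could cause
	 * a false allocation failure, so readers rarely see it. */
	if (need_loop)
		write_seqcount_begin(&tsk->mems_allowed_seq);

	/* ... rewrite tsk->mems_allowed and rebind the mempolicy ... */

	if (need_loop)
		write_seqcount_end(&tsk->mems_allowed_seq);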
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 21:25 ` Peter Zijlstra @ 2012-03-02 23:47 ` David Rientjes 2012-03-05 9:44 ` Mel Gorman 2012-03-05 9:35 ` Mel Gorman 1 sibling, 1 reply; 20+ messages in thread From: David Rientjes @ 2012-03-02 23:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Fri, 2 Mar 2012, Peter Zijlstra wrote: > Also, for the write side it doesn't really matter, changing mems_allowed > should be rare and is an 'expensive' operation anyway. > It's very expensive even without memory barriers since the page allocator wraps itself in {get,put}_mems_allowed() until a page or NULL is returned and an update to current's set of allowed mems can stall indefinitely trying to change the nodemask during this time. The thread changing cpuset.mems is holding cgroup_mutex the entire time which locks out changes, including adding additional nodes to current's set of allowed mems. If direct reclaim takes a long time or an oom killed task fails to exit quickly (or the allocation is __GFP_NOFAIL and we just spin indefinitely holding get_mems_allowed()), then it's not uncommon to see a write to cpuset.mems taking minutes while holding the mutex, if it ever actually returns at all. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 23:47 ` David Rientjes @ 2012-03-05 9:44 ` Mel Gorman 2012-03-06 23:31 ` David Rientjes 0 siblings, 1 reply; 20+ messages in thread From: Mel Gorman @ 2012-03-05 9:44 UTC (permalink / raw) To: David Rientjes Cc: Peter Zijlstra, Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Fri, Mar 02, 2012 at 03:47:17PM -0800, David Rientjes wrote: > On Fri, 2 Mar 2012, Peter Zijlstra wrote: > > > Also, for the write side it doesn't really matter, changing mems_allowed > > should be rare and is an 'expensive' operation anyway. > > > > It's very expensive even without memory barriers since the page allocator > wraps itself in {get,put}_mems_allowed() until a page or NULL is returned > and an update to current's set of allowed mems can stall indefinitely > trying to change the nodemask during this time. Hmm, this sounds problematic. Are you seeing a problem with the behaviour with the patch applied or the existing behaviour? If you are talking about the patch, I am missing something. The retry only takes place if there is a parallel update of the nodemask and the page allocation fails. There would need to be continual updates of the nodemask that led to false allocation failures for it to stall indefinitely. On the updating of the cpumask side, there is no longer the "yield and retry" logic so it also should not stall indefinitely trying to change the nodemask. The write_seqcount_begin() does not wait for the reader side to complete so it should also not be stalling for long periods of time. Did I miss something? -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-05 9:44 ` Mel Gorman @ 2012-03-06 23:31 ` David Rientjes 0 siblings, 0 replies; 20+ messages in thread From: David Rientjes @ 2012-03-06 23:31 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Mon, 5 Mar 2012, Mel Gorman wrote: > > It's very expensive even without memory barriers since the page allocator > > wraps itself in {get,put}_mems_allowed() until a page or NULL is returned > > and an update to current's set of allowed mems can stall indefinitely > > trying to change the nodemask during this time. > > Hmm, this sounds problematic. Are you seeing a problem with the behaviour > with the patch applied or the existing behaviour? > Sorry, yes, this is with the existing behavior prior to your patch. We definitely need fixes for get_mems_allowed() because it's possible that a write to cpuset.mems will never return even when trying to add nodes to its nodemask in oom conditions if one of the cpuset's tasks is looping forever in the page allocator. I'll review your updated version posted from today. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 21:25 ` Peter Zijlstra 2012-03-02 23:47 ` David Rientjes @ 2012-03-05 9:35 ` Mel Gorman 1 sibling, 0 replies; 20+ messages in thread From: Mel Gorman @ 2012-03-05 9:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Christoph Lameter, Andrew Morton, Miao Xie, linux-mm, linux-kernel On Fri, Mar 02, 2012 at 10:25:29PM +0100, Peter Zijlstra wrote: > On Fri, 2012-03-02 at 17:43 +0000, Mel Gorman wrote: > > > > I considered using a seqlock but it isn't cheap. The read side is heavy > > with the possibility that it starts spinning and incurs a read barrier > > (looking at read_seqbegin()) here. The retry block incurs another read > > barrier so basically it would not be no better than what is there currently > > (which at a 4% performance hit, sucks) > > Use seqcount. > Ok. I wanted to avoid the stall on the read side (it stalls if there is a writer in progress) but it's not worth a special case. As you say, changing mems_allowed should be rare and even then it only becomes a problem if the new mask may cause false failures. > Also, for the write side it doesn't really matter, changing mems_allowed > should be rare and is an 'expensive' operation anyway. > > For the read side you can do: > > again: > seq = read_seqcount_begin(¤t->mems_seq); > > page = do_your_allocator_muck(); > > if (!page && read_seqcount_retry(¤t->mems_seq, seq)) > goto again; > > oom(); > This is effectively what I did. It's still behind get/put mems allowed so it can be disabled in !CPUSET configurations. > That way, you only have one smp_rmb() in your fath path, > read_seqcount_begin() doesn't spin, and you only incur the second > smp_rmb() when you've completely failed to allocate anything. > > smp_rmb() is basicaly free on x86, other archs will incur some overhead, > but you need a barrier as Christoph pointed out. Thanks Peter and Christoph for the feedback. Can you give this version a look over? The performance figures are updated so the performance gain is still there. ---8<--- cpuset: mm: Reduce large amounts of memory barrier related damage Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing cpuset's mems] wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. 
The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> --- include/linux/cpuset.h | 38 +++++++++++++------------------------- include/linux/sched.h | 2 +- kernel/cpuset.c | 43 ++++++++----------------------------------- mm/filemap.c | 11 +++++++---- mm/hugetlb.c | 7 +++++-- mm/mempolicy.c | 28 +++++++++++++++++++++------- mm/page_alloc.c | 33 +++++++++++++++++++++++---------- mm/slab.c | 11 +++++++---- mm/slub.c | 32 +++++++++++++++++--------------- mm/vmscan.c | 2 -- 10 files changed, 102 insertions(+), 105 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index e9eaec5..104f701 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -92,33 +92,19 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p); * reading current mems_allowed and mempolicy in the fastpath must protected * by get_mems_allowed() */ -static inline void get_mems_allowed(void) +static inline unsigned int get_mems_allowed(void) { - current->mems_allowed_change_disable++; - - /* - * ensure that reading mems_allowed and mempolicy happens after the - * update of ->mems_allowed_change_disable. - * - * the write-side task finds ->mems_allowed_change_disable is not 0, - * and knows the read-side task is reading mems_allowed or mempolicy, - * so it will clear old bits lazily. - */ - smp_mb(); + return read_seqcount_begin(¤t->mems_allowed_seq); } -static inline void put_mems_allowed(void) +/* + * If this returns false, the operation that took place after get_mems_allowed + * may have failed. 
It is up to the caller to retry the operation if + * appropriate + */ +static inline bool put_mems_allowed(unsigned int seq) { - /* - * ensure that reading mems_allowed and mempolicy before reducing - * mems_allowed_change_disable. - * - * the write-side task will know that the read-side task is still - * reading mems_allowed or mempolicy, don't clears old bits in the - * nodemask. - */ - smp_mb(); - --ACCESS_ONCE(current->mems_allowed_change_disable); + return !read_seqcount_retry(¤t->mems_allowed_seq, seq); } static inline void set_mems_allowed(nodemask_t nodemask) @@ -234,12 +220,14 @@ static inline void set_mems_allowed(nodemask_t nodemask) { } -static inline void get_mems_allowed(void) +static inline unsigned int get_mems_allowed(void) { + return 0; } -static inline void put_mems_allowed(void) +static inline bool put_mems_allowed(unsigned int seq) { + return true; } #endif /* !CONFIG_CPUSETS */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 7d379a6..a0bb87a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1498,7 +1498,7 @@ struct task_struct { #endif #ifdef CONFIG_CPUSETS nodemask_t mems_allowed; /* Protected by alloc_lock */ - int mems_allowed_change_disable; + seqcount_t mems_allowed_seq; /* Seqence no to catch updates */ int cpuset_mem_spread_rotor; int cpuset_slab_spread_rotor; #endif diff --git a/kernel/cpuset.c b/kernel/cpuset.c index a09ac2b..5014493 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk, { bool need_loop; -repeat: /* * Allow tasks that have access to memory reserves because they have * been OOM killed to get memory anywhere. @@ -983,45 +982,19 @@ repeat: */ need_loop = task_has_mempolicy(tsk) || !nodes_intersects(*newmems, tsk->mems_allowed); - nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); - mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); - /* - * ensure checking ->mems_allowed_change_disable after setting all new - * allowed nodes. - * - * the read-side task can see an nodemask with new allowed nodes and - * old allowed nodes. and if it allocates page when cpuset clears newly - * disallowed ones continuous, it can see the new allowed bits. - * - * And if setting all new allowed nodes is after the checking, setting - * all new allowed nodes and clearing newly disallowed ones will be done - * continuous, and the read-side task may find no node to alloc page. - */ - smp_mb(); + if (need_loop) + write_seqcount_begin(&tsk->mems_allowed_seq); - /* - * Allocation of memory is very fast, we needn't sleep when waiting - * for the read-side. - */ - while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) { - task_unlock(tsk); - if (!task_curr(tsk)) - yield(); - goto repeat; - } - - /* - * ensure checking ->mems_allowed_change_disable before clearing all new - * disallowed nodes. - * - * if clearing newly disallowed bits before the checking, the read-side - * task may find no node to alloc page. 
- */ - smp_mb(); + nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); + mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); tsk->mems_allowed = *newmems; + + if (need_loop) + write_seqcount_end(&tsk->mems_allowed_seq); + task_unlock(tsk); } diff --git a/mm/filemap.c b/mm/filemap.c index b662757..56a1e11 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -498,12 +498,15 @@ struct page *__page_cache_alloc(gfp_t gfp) { int n; struct page *page; + unsigned int cpuset_mems_cookie; if (cpuset_do_page_mem_spread()) { - get_mems_allowed(); - n = cpuset_mem_spread_node(); - page = alloc_pages_exact_node(n, gfp, 0); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + n = cpuset_mem_spread_node(); + page = alloc_pages_exact_node(n, gfp, 0); + } while (!put_mems_allowed(cpuset_mems_cookie) && !page); + return page; } return alloc_pages(gfp, 0); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f34bd8..5f1e959 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -460,8 +460,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, struct zonelist *zonelist; struct zone *zone; struct zoneref *z; + unsigned int cpuset_mems_cookie; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); zonelist = huge_zonelist(vma, address, htlb_alloc_mask, &mpol, &nodemask); /* @@ -490,7 +492,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, } err: mpol_cond_put(mpol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 06b145f..013d981 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1843,18 +1843,24 @@ struct page * alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, int node) { - struct mempolicy *pol = get_vma_policy(current, vma, addr); + struct mempolicy *pol; struct zonelist *zl; struct page *page; + unsigned int cpuset_mems_cookie; + +retry_cpuset: + pol = get_vma_policy(current, vma, addr); + cpuset_mems_cookie = get_mems_allowed(); - get_mems_allowed(); if (unlikely(pol->mode == MPOL_INTERLEAVE)) { unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } zl = policy_zonelist(gfp, pol, node); @@ -1865,7 +1871,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, struct page *page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); __mpol_put(pol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } /* @@ -1873,7 +1880,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } @@ -1900,11 +1908,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) { struct mempolicy *pol = current->mempolicy; struct page *page; + unsigned int cpuset_mems_cookie; if (!pol || in_interrupt() || (gfp & __GFP_THISNODE)) pol = &default_policy; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* * No reference counting needed for current->mempolicy * nor system default_policy @@ -1915,7 +1926,10 @@ struct page 
*alloc_pages_current(gfp_t gfp, unsigned order) page = __alloc_pages_nodemask(gfp, order, policy_zonelist(gfp, pol, numa_node_id()), policy_nodemask(gfp, pol)); - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(alloc_pages_current); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d2186ec..3a667da 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2378,8 +2378,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, { enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; - struct page *page; + struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); + unsigned int cpuset_mems_cookie; gfp_mask &= gfp_allowed_mask; @@ -2398,15 +2399,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (unlikely(!zonelist->_zonerefs->zone)) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* The preferred zone is used for statistics later */ first_zones_zonelist(zonelist, high_zoneidx, nodemask ? : &cpuset_current_mems_allowed, &preferred_zone); - if (!preferred_zone) { - put_mems_allowed(); - return NULL; - } + if (!preferred_zone) + goto out; /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, @@ -2416,9 +2417,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, migratetype); - put_mems_allowed(); trace_mm_page_alloc(page, order, gfp_mask, migratetype); + +out: + /* + * When updating a tasks mems_allowed, it is possible to race with + * parallel threads in such a way that an allocation can fail while + * the mask is being updated. If a page allocation is about to fail, + * check if the cpuset changed during allocation and if so, retry. 
+ */ + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); @@ -2632,13 +2643,15 @@ void si_meminfo_node(struct sysinfo *val, int nid) bool skip_free_areas_node(unsigned int flags, int nid) { bool ret = false; + unsigned int cpuset_mems_cookie; if (!(flags & SHOW_MEM_FILTER_NODES)) goto out; - get_mems_allowed(); - ret = !node_isset(nid, cpuset_current_mems_allowed); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + ret = !node_isset(nid, cpuset_current_mems_allowed); + } while (!put_mems_allowed(cpuset_mems_cookie)); out: return ret; } diff --git a/mm/slab.c b/mm/slab.c index f0bd785..ae2db04 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3284,12 +3284,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; nid_alloc = nid_here = numa_mem_id(); - get_mems_allowed(); if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD)) nid_alloc = cpuset_slab_spread_node(); else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); - put_mems_allowed(); if (nid_alloc != nid_here) return ____cache_alloc_node(cachep, flags, nid_alloc); return NULL; @@ -3312,11 +3310,14 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; int nid; + unsigned int cpuset_mems_cookie; if (flags & __GFP_THISNODE) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); @@ -3372,7 +3373,9 @@ retry: } } } - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj)) + goto retry_cpuset; return obj; } diff --git a/mm/slub.c b/mm/slub.c index 4907563..ac5dddc 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1581,6 +1581,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *object; + unsigned int cpuset_mems_cookie; /* * The defrag ratio allows a configuration of the tradeoffs between @@ -1604,23 +1605,24 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, get_cycles() % 1024 > s->remote_node_defrag_ratio) return NULL; - get_mems_allowed(); - zonelist = node_zonelist(slab_node(current->mempolicy), flags); - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { - struct kmem_cache_node *n; - - n = get_node(s, zone_to_nid(zone)); - - if (n && cpuset_zone_allowed_hardwall(zone, flags) && - n->nr_partial > s->min_partial) { - object = get_partial_node(s, n, c); - if (object) { - put_mems_allowed(); - return object; + do { + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + struct kmem_cache_node *n; + + n = get_node(s, zone_to_nid(zone)); + + if (n && cpuset_zone_allowed_hardwall(zone, flags) && + n->nr_partial > s->min_partial) { + object = get_partial_node(s, n, c); + if (object) { + put_mems_allowed(cpuset_mems_cookie); + return object; + } } } - } - put_mems_allowed(); + } while (!put_mems_allowed(cpuset_mems_cookie)); #endif return NULL; } diff --git a/mm/vmscan.c b/mm/vmscan.c index c52b235..fccc048 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2337,7 +2337,6 @@ static unsigned long do_try_to_free_pages(struct zonelist 
*zonelist, unsigned long writeback_threshold; bool aborted_reclaim; - get_mems_allowed(); delayacct_freepages_start(); if (global_reclaim(sc)) @@ -2401,7 +2400,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, out: delayacct_freepages_end(); - put_mems_allowed(); if (sc->nr_reclaimed) return sc->nr_reclaimed; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 11:23 [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator Mel Gorman 2012-03-02 16:19 ` Christoph Lameter @ 2012-03-02 21:21 ` Peter Zijlstra 2012-03-05 20:18 ` Andrew Morton 1 sibling, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2012-03-02 21:21 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Miao Xie, Christoph Lameter, linux-mm, linux-kernel

On Fri, 2012-03-02 at 11:23 +0000, Mel Gorman wrote:
> For extra style points, the commit introduced the use of yield() in an
> implementation of what looks like a spinning mutex.

Andrew, could you simply say no to any patch adding a yield()? There's a 99% chance it's a bug, as was this.

This code would live-lock when cpuset_change_task_nodemask() would be called by the highest priority FIFO task on UP or when pinned to the same cpu as the task doing get_mems_allowed().

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator 2012-03-02 21:21 ` Peter Zijlstra @ 2012-03-05 20:18 ` Andrew Morton 2012-03-06 2:01 ` [RFC PATCH] checkpatch: Warn on use of yield() Joe Perches 0 siblings, 1 reply; 20+ messages in thread From: Andrew Morton @ 2012-03-05 20:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Joe Perches On Fri, 02 Mar 2012 22:21:02 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Fri, 2012-03-02 at 11:23 +0000, Mel Gorman wrote: > > For extra style points, the commit introduced the use of yield() in an > > implementation of what looks like a spinning mutex. > > Andrew, could you simply say no to any patch adding a yield()? There's a > 99% chance its a bug, as was this. I'd normally at least poke my tongue out at it - I must have missed this one. > This code would life-lock when cpuset_change_task_nodemask() would be > called by the highest priority FIFO task on UP or when pinned to the > same cpu the task doing get_mems_allowed(). Joe, can we please have a checkpatch rule? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH] checkpatch: Warn on use of yield() 2012-03-05 20:18 ` Andrew Morton @ 2012-03-06 2:01 ` Joe Perches 2012-03-06 12:45 ` Peter Zijlstra 0 siblings, 1 reply; 20+ messages in thread From: Joe Perches @ 2012-03-06 2:01 UTC (permalink / raw) To: Andrew Morton Cc: Peter Zijlstra, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel Use of yield() is deprecated or at least generally undesirable. Add a checkpatch warning when it's used. Suggest cpu_relax instead. Signed-off-by: Joe Perches <joe@perches.com> --- > Joe, can we please have a checkpatch rule? Something like this? scripts/checkpatch.pl | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index e32ea7f..80ad474 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -3300,6 +3300,11 @@ sub process { "__func__ should be used instead of gcc specific __FUNCTION__\n" . $herecurr); } +# check for use of yield() + if ($line =~ /\byield\s*\(\s*\)/) { + WARN("YIELD", + "yield() is deprecated, consider cpu_relax()\n" . $herecurr); + } # check for semaphores initialized locked if ($line =~ /^.\s*sema_init.+,\W?0\W?\)/) { WARN("CONSIDER_COMPLETION", -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Peter Zijlstra @ 2012-03-06 12:45 UTC
To: Joe Perches
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Mon, 2012-03-05 at 18:01 -0800, Joe Perches wrote:

> +# check for use of yield()
> +		if ($line =~ /\byield\s*\(\s*\)/) {
> +			WARN("YIELD",
> +			     "yield() is deprecated, consider cpu_relax()\n" . $herecurr);
> +		}

It's not deprecated as such, it's just a very dangerous and ill-considered
API.

cpu_relax() is not a good substitute suggestion in that it is still a busy
wait and prone to many of the same problems.

The case at hand was a live-lock due to expecting that yield() would run
another process which it needed in order to complete. Yield() does not
provide that guarantee.

Looking at fs/ext4/mballoc.c, we have this gem:

	/*
	 * Yield the CPU here so that we don't get soft lockup
	 * in non preempt case.
	 */
	yield();

This is of course complete crap as well.. I suspect they want
cond_resched() there. And:

	/* let others to free the space */
	yield();

Like I said, yield() doesn't guarantee anything like running anybody else;
does it rely on that, or is it optimistic?

Another fun user:

void tasklet_kill(struct tasklet_struct *t)
{
	if (in_interrupt())
		printk("Attempt to kill tasklet from interrupt\n");

	while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) {
		do {
			yield();
		} while (test_bit(TASKLET_STATE_SCHED, &t->state));
	}
	tasklet_unlock_wait(t);
	clear_bit(TASKLET_STATE_SCHED, &t->state);
}

The only reason that doesn't explode is that running tasklets is
non-preemptible. However, since they're non-preemptible they shouldn't run
for long, and you might as well busy-spin. If they can run long, yield()
isn't your biggest problem.

mm/memory_hotplug.c has two yield() calls in offline_pages() and I've no
idea what they're trying to achieve.

But really, yield() is basically _always_ the wrong thing. The right thing
can be:

	cond_resched();
	wait_event();

or something entirely different.

So instead of suggesting an alternative, I would suggest thinking about
the actual problem in order to avoid the non-thinking solutions the
checkpatch brigade is so overly fond of :/

Maybe something like:

	"yield() is dangerous and wrong, rework your code to not use it."

That at least requires some sort of thinking and doesn't suggest blind
substitution.
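The wait_event() alternative Peter points to looks roughly like this in
practice. This is a minimal sketch with invented names (event_wq,
event_ready and the three functions), not code from ext4, the tasklet code
or any other file discussed above.

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(event_wq);
static bool event_ready;

/* Broken: assumes yield() will run whatever task sets event_ready. */
static void wait_broken(void)
{
	while (!event_ready)
		yield();
}

/* Waiting for another task's progress: sleep until woken. */
static void wait_correct(void)
{
	wait_event(event_wq, event_ready);
}

/* Producer side pairs the condition update with a wakeup. */
static void make_ready(void)
{
	event_ready = true;
	wake_up(&event_wq);
}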
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Glauber Costa @ 2012-03-06 13:14 UTC
To: Peter Zijlstra
Cc: Joe Perches, Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On 03/06/2012 04:45 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-05 at 18:01 -0800, Joe Perches wrote:
>> +# check for use of yield()
>> +		if ($line =~ /\byield\s*\(\s*\)/) {
>> +			WARN("YIELD",
>> +			     "yield() is deprecated, consider cpu_relax()\n" . $herecurr);
>> +		}
>
> It's not deprecated as such, it's just a very dangerous and ill-considered
> API.
>
> cpu_relax() is not a good substitute suggestion in that it is still a busy
> wait and prone to many of the same problems.
>
> [...]
>
> So instead of suggesting an alternative, I would suggest thinking about
> the actual problem in order to avoid the non-thinking solutions the
> checkpatch brigade is so overly fond of :/
>
> Maybe something like:
>
>	"yield() is dangerous and wrong, rework your code to not use it."
>
> That at least requires some sort of thinking and doesn't suggest blind
> substitution.

Can't we point people to some Documentation file that explains the
alternatives?
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Peter Zijlstra @ 2012-03-06 13:25 UTC
To: Glauber Costa
Cc: Joe Perches, Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 17:14 +0400, Glauber Costa wrote:
>
> Can't we point people to some Documentation file that explains the
> alternatives?

Not sure that's a finite set.. however I think we covered the most
popular ones in this thread. One could use a lkml.kernel.org link.
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Glauber Costa @ 2012-03-06 13:27 UTC
To: Peter Zijlstra
Cc: Joe Perches, Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On 03/06/2012 05:25 PM, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 17:14 +0400, Glauber Costa wrote:
>>
>> Can't we point people to some Documentation file that explains the
>> alternatives?
>
> Not sure that's a finite set.. however I think we covered the most
> popular ones in this thread. One could use a lkml.kernel.org link.

Yes, I think that would work. Summarizing your arguments in an in-tree
Documentation file would be good as well, IMHO.
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Joe Perches @ 2012-03-06 17:41 UTC
To: Peter Zijlstra
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 13:45 +0100, Peter Zijlstra wrote:
> The case at hand was a live-lock due to expecting that yield() would run
> another process which it needed in order to complete. Yield() does not
> provide that guarantee.

OK. Perhaps the kernel-doc comments in sched/core.c should/could be
expanded/updated.

/**
 * sys_sched_yield - yield the current processor to other threads.
 *
 * This function yields the current CPU to other tasks. If there are no
 * other threads running on this CPU then this function will return.
 */
[]
/**
 * yield - yield the current processor to other threads.
 *
 * This is a shortcut for kernel-space yielding - it marks the
 * thread runnable and calls sys_sched_yield().
 */
void __sched yield(void)
{
	set_current_state(TASK_RUNNING);
	sys_sched_yield();
}
EXPORT_SYMBOL(yield);
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Peter Zijlstra @ 2012-03-06 17:54 UTC
To: Joe Perches
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 09:41 -0800, Joe Perches wrote:
> Perhaps the kernel-doc comments in sched/core.c should/could be
> expanded/updated.

Something like this?

---
 kernel/sched/core.c |   20 ++++++++++++++++++--
 1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2963fbb..a05a0f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4577,8 +4577,24 @@ EXPORT_SYMBOL(__cond_resched_softirq);
 /**
  * yield - yield the current processor to other threads.
  *
- * This is a shortcut for kernel-space yielding - it marks the
- * thread runnable and calls sys_sched_yield().
+ * Do not ever use this function, there's a 99% chance you're doing it wrong.
+ *
+ * The scheduler is at all times free to pick the calling task as the most
+ * eligible task to run; if removing the yield() call from your code breaks
+ * it, it's already broken.
+ *
+ * Typical broken usage is:
+ *
+ * while (!event)
+ *	yield();
+ *
+ * where one assumes that yield() will let 'the other' process run that will
+ * make event true. If the current task is a SCHED_FIFO task that will never
+ * happen. Never use yield() as a progress guarantee!!
+ *
+ * If you want to use yield() to wait for something, use wait_event().
+ * If you want to use yield() to be 'nice' for others, use cond_resched().
+ * If you still want to use yield(), do not!
  */
 void __sched yield(void)
 {
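For the 'be nice' case the new comment recommends cond_resched(); in a long
non-preemptible loop that looks roughly like the sketch below. It is purely
illustrative: the function name, the loop and its work are invented and not
tied to any real subsystem.

#include <linux/kernel.h>
#include <linux/sched.h>

/* Hypothetical long-running scan. */
static void long_scan(unsigned long nr_items)
{
	unsigned long i;

	for (i = 0; i < nr_items; i++) {
		/* ... one bounded unit of work ... */

		/*
		 * Let other runnable tasks in and keep the soft-lockup
		 * watchdog quiet on non-preemptible kernels. Unlike yield(),
		 * this makes no pretence of waiting for another task's
		 * progress.
		 */
		cond_resched();
	}
}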
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Joe Perches @ 2012-03-06 18:00 UTC
To: Peter Zijlstra
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 18:54 +0100, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 09:41 -0800, Joe Perches wrote:
> > Perhaps the kernel-doc comments in sched/core.c should/could be
> > expanded/updated.
>
> Something like this?
>
> [...]
>
> + * If you want to use yield() to wait for something, use wait_event().
> + * If you want to use yield() to be 'nice' for others, use cond_resched().
> + * If you still want to use yield(), do not!
>  */

Yes. I'll update the checkpatch message to say something like:

	"Using yield() is generally wrong.
	 See yield() kernel-doc (sched/core.c)"
* Re: [RFC PATCH] checkpatch: Warn on use of yield()
From: Joe Perches @ 2012-03-06 18:17 UTC
To: Peter Zijlstra
Cc: Andrew Morton, Mel Gorman, Miao Xie, Christoph Lameter, linux-mm, linux-kernel, Theodore Ts'o

On Tue, 2012-03-06 at 10:00 -0800, Joe Perches wrote:
> On Tue, 2012-03-06 at 18:54 +0100, Peter Zijlstra wrote:
> > On Tue, 2012-03-06 at 09:41 -0800, Joe Perches wrote:
> > > Perhaps the kernel-doc comments in sched/core.c should/could be
> > > expanded/updated.
> > Something like this?
[]
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
[]
> > @@ -4577,8 +4577,24 @@ EXPORT_SYMBOL(__cond_resched_softirq);
> >  /**
> >   * yield - yield the current processor to other threads.

Perhaps the phrase "other threads" is poor word choice. Maybe:

	yield - yield the current processor to a runnable thread
	        (might be the current thread)
Thread overview: 20+ messages (newest: 2012-03-06 23:31 UTC)

2012-03-02 11:23 [PATCH] cpuset: mm: Remove memory barrier damage from the page allocator Mel Gorman
2012-03-02 16:19 ` Christoph Lameter
2012-03-02 17:43 ` Mel Gorman
2012-03-02 19:53 ` Christoph Lameter
2012-03-02 21:25 ` Peter Zijlstra
2012-03-02 23:47 ` David Rientjes
2012-03-05  9:44 ` Mel Gorman
2012-03-06 23:31 ` David Rientjes
2012-03-05  9:35 ` Mel Gorman
2012-03-02 21:21 ` Peter Zijlstra
2012-03-05 20:18 ` Andrew Morton
2012-03-06  2:01 ` [RFC PATCH] checkpatch: Warn on use of yield() Joe Perches
2012-03-06 12:45 ` Peter Zijlstra
2012-03-06 13:14 ` Glauber Costa
2012-03-06 13:25 ` Peter Zijlstra
2012-03-06 13:27 ` Glauber Costa
2012-03-06 17:41 ` Joe Perches
2012-03-06 17:54 ` Peter Zijlstra
2012-03-06 18:00 ` Joe Perches
2012-03-06 18:17 ` Joe Perches