* percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-13 14:45 UTC
To: Tejun Heo; +Cc: akpm, linux-mm, eric.dumazet

Use this_cpu_cmpxchg to avoid preempt_disable/enable in __percpu_counter_add.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2011-04-13 09:26:19.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2011-04-13 09:36:37.000000000 -0500
@@ -71,19 +71,22 @@ EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	s64 count;
+	s64 count, new;
 
-	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
-	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
-	} else {
-		__this_cpu_write(*fbc->counters, count);
-	}
-	preempt_enable();
+	do {
+		count = this_cpu_read(*fbc->counters);
+
+		new = count + amount;
+		/* In case of overflow fold it into the global counter instead */
+		if (new >= batch || new <= -batch) {
+			spin_lock(&fbc->lock);
+			fbc->count += __this_cpu_read(*fbc->counters) + amount;
+			spin_unlock(&fbc->lock);
+			amount = 0;
+			new = 0;
+		}
+
+	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
 }
 EXPORT_SYMBOL(__percpu_counter_add);

* Re: percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-13 16:49 UTC
To: Tejun Heo; +Cc: akpm, linux-mm, eric.dumazet

Duh, the retry setup when the number overflows is not correct.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2011-04-13 11:43:23.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2011-04-13 11:43:30.000000000 -0500
@@ -80,9 +80,14 @@ void __percpu_counter_add(struct percpu_
 		/* In case of overflow fold it into the global counter instead */
 		if (new >= batch || new <= -batch) {
 			spin_lock(&fbc->lock);
-			fbc->count += __this_cpu_read(*fbc->counters) + amount;
+			count = __this_cpu_read(*fbc->counters);
+			fbc->count += count + amount;
 			spin_unlock(&fbc->lock);
-			amount = 0;
+			/*
+			 * If cmpxchg fails then we need to subtract the amount that
+			 * we found in the percpu value.
+			 */
+			amount = -count;
 			new = 0;
 		}

* Re: percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-13 18:56 UTC
To: Christoph Lameter; +Cc: akpm, linux-mm, eric.dumazet

On Wed, Apr 13, 2011 at 11:49:51AM -0500, Christoph Lameter wrote:
> Duh, the retry setup when the number overflows is not correct.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Can you please repost the folded patch with a proper [PATCH] subject line
and cc shaohua.li@intel.com so that he can resolve conflicts?

Thanks.

--
tejun

* [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-13 20:22 UTC
To: Tejun Heo; +Cc: akpm, linux-mm, eric.dumazet, shaohua.li

On Thu, 14 Apr 2011, Tejun Heo wrote:

> On Wed, Apr 13, 2011 at 11:49:51AM -0500, Christoph Lameter wrote:
> > Duh, the retry setup when the number overflows is not correct.
> >
> > Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Can you please repost the folded patch with a proper [PATCH] subject line
> and cc shaohua.li@intel.com so that he can resolve conflicts?
>
> Thanks.

Ok, here it is:


From: Christoph Lameter <cl@linux.com>
Subject: [PATCH] percpu: preemptless __per_cpu_counter_add

Use this_cpu_cmpxchg to avoid preempt_disable/enable in __percpu_counter_add.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2011-04-13 09:26:19.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2011-04-13 09:36:37.000000000 -0500
@@ -71,19 +71,22 @@ EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	s64 count;
+	s64 count, new;
 
-	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
-	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
-	} else {
-		__this_cpu_write(*fbc->counters, count);
-	}
-	preempt_enable();
+	do {
+		count = this_cpu_read(*fbc->counters);
+
+		new = count + amount;
+		/* In case of overflow fold it into the global counter instead */
+		if (new >= batch || new <= -batch) {
+			spin_lock(&fbc->lock);
+			fbc->count += __this_cpu_read(*fbc->counters) + amount;
+			spin_unlock(&fbc->lock);
+			amount = 0;
+			new = 0;
+		}
+
+	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
 }
 EXPORT_SYMBOL(__percpu_counter_add);

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-13 21:50 UTC
To: Christoph Lameter; +Cc: akpm, linux-mm, eric.dumazet, shaohua.li

Hello,

On Wed, Apr 13, 2011 at 03:22:36PM -0500, Christoph Lameter wrote:
> +	do {
> +		count = this_cpu_read(*fbc->counters);
> +
> +		new = count + amount;
> +		/* In case of overflow fold it into the global counter instead */
> +		if (new >= batch || new <= -batch) {
> +			spin_lock(&fbc->lock);
> +			fbc->count += __this_cpu_read(*fbc->counters) + amount;
> +			spin_unlock(&fbc->lock);
> +			amount = 0;
> +			new = 0;
> +		}
> +
> +	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);

Is this correct?  If the percpu count changes in the middle, doesn't
the count get added twice?  Can you please use the cmpxchg() only in
the fast path?  ie.

	do {
		count = this_cpu_read();
		if (overflow) {
			disable preemption and do the slow thing.
			return;
		}
	} while (this_cpu_cmpxchg());

Thanks.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-13 22:17 UTC
To: Tejun Heo; +Cc: akpm, linux-mm, eric.dumazet, shaohua.li

On Thu, 14 Apr 2011, Tejun Heo wrote:

> Hello,
>
> On Wed, Apr 13, 2011 at 03:22:36PM -0500, Christoph Lameter wrote:
> > +	do {
> > +		count = this_cpu_read(*fbc->counters);
> > +
> > +		new = count + amount;
> > +		/* In case of overflow fold it into the global counter instead */
> > +		if (new >= batch || new <= -batch) {
> > +			spin_lock(&fbc->lock);
> > +			fbc->count += __this_cpu_read(*fbc->counters) + amount;
> > +			spin_unlock(&fbc->lock);
> > +			amount = 0;
> > +			new = 0;
> > +		}
> > +
> > +	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
>
> Is this correct?  If the percpu count changes in the middle, doesn't
> the count get added twice?  Can you please use the cmpxchg() only in
> the fast path?  ie.

Oh gosh, this is the old version. The fixed version should do this
differently. We need to update the cpu-specific counter as well,
otherwise it will overflow again soon. Hmmm... Yes, it could be done
while holding the spinlock, which disables preemption.


Subject: [PATCH] percpu: preemptless __per_cpu_counter_add

Use this_cpu_cmpxchg to avoid preempt_disable/enable in __percpu_counter_add.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2011-04-13 15:19:54.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2011-04-13 15:20:04.000000000 -0500
@@ -71,19 +71,27 @@ EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	s64 count;
+	s64 count, new;
 
-	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
-	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
-	} else {
-		__this_cpu_write(*fbc->counters, count);
-	}
-	preempt_enable();
+	do {
+		count = this_cpu_read(*fbc->counters);
+
+		new = count + amount;
+		/* In case of overflow fold it into the global counter instead */
+		if (new >= batch || new <= -batch) {
+			spin_lock(&fbc->lock);
+			count = __this_cpu_read(*fbc->counters);
+			fbc->count += count + amount;
+			spin_unlock(&fbc->lock);
+			/*
+			 * If cmpxchg fails then we need to subtract the amount that
+			 * we found in the percpu value.
+			 */
+			amount = -count;
+			new = 0;
+		}
+
+	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
 }
 EXPORT_SYMBOL(__percpu_counter_add);

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-13 22:23 UTC
To: Tejun Heo; +Cc: akpm, linux-mm, eric.dumazet, shaohua.li

Suggested fixup. Return from slowpath and update percpu variable under
spinlock.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2011-04-13 17:20:41.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2011-04-13 17:21:33.000000000 -0500
@@ -82,13 +82,9 @@ void __percpu_counter_add(struct percpu_
 			spin_lock(&fbc->lock);
 			count = __this_cpu_read(*fbc->counters);
 			fbc->count += count + amount;
+			__this_cpu_write(*fbc->counters, 0);
 			spin_unlock(&fbc->lock);
-			/*
-			 * If cmpxchg fails then we need to subtract the amount that
-			 * we found in the percpu value.
-			 */
-			amount = -count;
-			new = 0;
+			return;
 		}
 
 	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-13 23:55 UTC
To: Christoph Lameter; +Cc: akpm, linux-mm, eric.dumazet, shaohua.li

Hello, Christoph.

On Wed, Apr 13, 2011 at 05:23:04PM -0500, Christoph Lameter wrote:
>
> Suggested fixup. Return from slowpath and update percpu variable under
> spinlock.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> ---
>  lib/percpu_counter.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> Index: linux-2.6/lib/percpu_counter.c
> ===================================================================
> --- linux-2.6.orig/lib/percpu_counter.c	2011-04-13 17:20:41.000000000 -0500
> +++ linux-2.6/lib/percpu_counter.c	2011-04-13 17:21:33.000000000 -0500
> @@ -82,13 +82,9 @@ void __percpu_counter_add(struct percpu_
> 			spin_lock(&fbc->lock);
> 			count = __this_cpu_read(*fbc->counters);
> 			fbc->count += count + amount;
> +			__this_cpu_write(*fbc->counters, 0);
> 			spin_unlock(&fbc->lock);
> -			/*
> -			 * If cmpxchg fails then we need to subtract the amount that
> -			 * we found in the percpu value.
> -			 */
> -			amount = -count;
> -			new = 0;
> +			return;

Yeah, looks pretty good to me now. Just a couple more things.

* Please fold this one into the original patch.

* While you're restructuring the functions, can you add unlikely to
  the slow path?

It now looks correct to me but just in case, Eric, do you mind
reviewing and acking it?

Thanks.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Eric Dumazet @ 2011-04-14 2:00 UTC
To: Tejun Heo; +Cc: Christoph Lameter, akpm, linux-mm, shaohua.li

On Thursday 14 April 2011 at 08:55 +0900, Tejun Heo wrote:
> Hello, Christoph.
>
> On Wed, Apr 13, 2011 at 05:23:04PM -0500, Christoph Lameter wrote:
> >
> > Suggested fixup. Return from slowpath and update percpu variable under
> > spinlock.
> >
> > Signed-off-by: Christoph Lameter <cl@linux.com>
> > ...
>
> Yeah, looks pretty good to me now. Just a couple more things.
>
> * Please fold this one into the original patch.
>
> * While you're restructuring the functions, can you add unlikely to
>   the slow path?
>
> It now looks correct to me but just in case, Eric, do you mind
> reviewing and acking it?
>
> Thanks.

I am not sure it's worth it, considering we hit this on machines where
preemption is off (CONFIG_PREEMPT_NONE=y)?

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Eric Dumazet @ 2011-04-14 2:14 UTC
To: Christoph Lameter; +Cc: Tejun Heo, akpm, linux-mm, shaohua.li

On Wednesday 13 April 2011 at 15:22 -0500, Christoph Lameter wrote:
> On Thu, 14 Apr 2011, Tejun Heo wrote:
>
> > Can you please repost the folded patch with a proper [PATCH] subject line
> > and cc shaohua.li@intel.com so that he can resolve conflicts?
>
> Ok, here it is:
>
> From: Christoph Lameter <cl@linux.com>
> Subject: [PATCH] percpu: preemptless __per_cpu_counter_add
>
> Use this_cpu_cmpxchg to avoid preempt_disable/enable in __percpu_counter_add.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
> ...
> +	do {
> +		count = this_cpu_read(*fbc->counters);
> +
> +		new = count + amount;
> +		/* In case of overflow fold it into the global counter instead */
> +		if (new >= batch || new <= -batch) {
> +			spin_lock(&fbc->lock);
> +			fbc->count += __this_cpu_read(*fbc->counters) + amount;
> +			spin_unlock(&fbc->lock);
> +			amount = 0;
> +			new = 0;
> +		}
> +
> +	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
> }
> EXPORT_SYMBOL(__percpu_counter_add);

Not sure it's a win for my servers, where CONFIG_PREEMPT_NONE=y.

Maybe use the latest cmpxchg16b stuff here instead and get rid of the
spinlock?

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-14 21:10 UTC
To: Eric Dumazet; +Cc: Tejun Heo, akpm, linux-mm, shaohua.li

On Thu, 14 Apr 2011, Eric Dumazet wrote:

> Not sure it's a win for my servers, where CONFIG_PREEMPT_NONE=y.

Well, the fast path would then also be irq safe. Does that bring us
anything?

We could skip the cmpxchg in the !PREEMPT case and instead simply store
the value.

The preempt on/off seems to be a bigger deal for realtime.

> Maybe use the latest cmpxchg16b stuff here instead and get rid of the
> spinlock?

Shaohua already got an atomic in there. You mean get rid of his preempt
disable/enable in the slow path?

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-14 21:15 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello,

On Thu, Apr 14, 2011 at 04:10:34PM -0500, Christoph Lameter wrote:
> On Thu, 14 Apr 2011, Eric Dumazet wrote:
>
> > Not sure it's a win for my servers, where CONFIG_PREEMPT_NONE=y.
>
> Well, the fast path would then also be irq safe. Does that bring us
> anything?
>
> We could skip the cmpxchg in the !PREEMPT case and instead simply store
> the value.
>
> The preempt on/off seems to be a bigger deal for realtime.

Also, the cmpxchg used is a local one w/o the LOCK prefix. It might not
bring anything to the table on !PREEMPT kernels, but at the same time it
shouldn't hurt either. One way or the other, some benchmark numbers
showing that it at least doesn't hurt would be nice.

> > Maybe use the latest cmpxchg16b stuff here instead and get rid of the
> > spinlock?
>
> Shaohua already got an atomic in there. You mean get rid of his preempt
> disable/enable in the slow path?

I personally care much less about the slow path. According to Shaohua,
atomic64_t behaves pretty nicely and it isn't too complex, so I'd like
to stick with that unless complex this_cpu ops can deliver something
much better.

Thanks.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-15 17:37 UTC
To: Tejun Heo; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

On Fri, 15 Apr 2011, Tejun Heo wrote:

> > The preempt on/off seems to be a bigger deal for realtime.
>
> Also, the cmpxchg used is a local one w/o the LOCK prefix. It might not
> bring anything to the table on !PREEMPT kernels, but at the same time it
> shouldn't hurt either. One way or the other, some benchmark numbers
> showing that it at least doesn't hurt would be nice.

Maybe just fall back to this_cpu_write()?

> > > Maybe use the latest cmpxchg16b stuff here instead and get rid of the
> > > spinlock?
> >
> > Shaohua already got an atomic in there. You mean get rid of his preempt
> > disable/enable in the slow path?
>
> I personally care much less about the slow path. According to Shaohua,
> atomic64_t behaves pretty nicely and it isn't too complex, so I'd like
> to stick with that unless complex this_cpu ops can deliver something
> much better.

Ok, here is a new patch that will allow Shaohua to simply convert the
slowpath to a single atomic op. No preemption anymore anywhere.


Subject: percpu: preemptless __per_cpu_counter_add V3

Use this_cpu_cmpxchg to avoid preempt_disable/enable in __percpu_counter_add.

V3
- Separate out the slow path so that the slowpath can also be done with a
  simple atomic add without preemption enable/disable.
- Fall back in the !PREEMPT case to a simple this_cpu_write().

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2011-04-13 17:12:59.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2011-04-15 12:34:39.000000000 -0500
@@ -71,19 +71,30 @@ EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	s64 count;
+	s64 count, new, overflow;
 
-	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
-	if (count >= batch || count <= -batch) {
+	do {
+		overflow = 0;
+		count = this_cpu_read(*fbc->counters);
+
+		new = count + amount;
+		/* In case of overflow fold it into the global counter instead */
+		if (new >= batch || new <= -batch) {
+			overflow = new;
+			new = 0;
+		}
+#ifdef CONFIG_PREEMPT
+	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
+#else
+	} while (0);
+	this_cpu_write(*fbc->counters, new);
+#endif
+
+	if (unlikely(overflow)) {
 		spin_lock(&fbc->lock);
-		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
+		fbc->count += overflow;
 		spin_unlock(&fbc->lock);
-	} else {
-		__this_cpu_write(*fbc->counters, count);
 	}
-	preempt_enable();
 }
 EXPORT_SYMBOL(__percpu_counter_add);

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-15 18:27 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello, Christoph.

> void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> -	s64 count;
> +	s64 count, new, overflow;
>
> -	preempt_disable();
> -	count = __this_cpu_read(*fbc->counters) + amount;
> -	if (count >= batch || count <= -batch) {
> +	do {
> +		overflow = 0;
> +		count = this_cpu_read(*fbc->counters);
> +
> +		new = count + amount;
> +		/* In case of overflow fold it into the global counter instead */
> +		if (new >= batch || new <= -batch) {
> +			overflow = new;
> +			new = 0;
> +		}
> +#ifdef CONFIG_PREEMPT
> +	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
> +#else
> +	} while (0);
> +	this_cpu_write(*fbc->counters, new);
> +#endif

Eeeek, no. If you want to do the above, please put it in a separate
inline function with sufficient comment.

> +	if (unlikely(overflow)) {
> 		spin_lock(&fbc->lock);
> -		fbc->count += count;
> -		__this_cpu_write(*fbc->counters, 0);
> +		fbc->count += overflow;
> 		spin_unlock(&fbc->lock);

Why put this outside and use yet another branch?

Thanks.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-15 19:43 UTC
To: Tejun Heo; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

On Sat, 16 Apr 2011, Tejun Heo wrote:

> > +			new = 0;
> > +		}
> > +#ifdef CONFIG_PREEMPT
> > +	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
> > +#else
> > +	} while (0);
> > +	this_cpu_write(*fbc->counters, new);
> > +#endif
>
> Eeeek, no. If you want to do the above, please put it in a separate
> inline function with sufficient comment.

That would not work well with the control flow. Just leave the cmpxchg
for both cases? That would make the function irq safe as well!

> > +	if (unlikely(overflow)) {
> > 		spin_lock(&fbc->lock);
> > -		fbc->count += count;
> > -		__this_cpu_write(*fbc->counters, 0);
> > +		fbc->count += overflow;
> > 		spin_unlock(&fbc->lock);
>
> Why put this outside and use yet another branch?

Because that way we do not need preempt enable/disable. The cmpxchg is
used to update the per cpu counter in the slow case as well. All that is
left then is to add the count to the global counter.

The branches are not an issue since they are forward branches over one
(after converting to an atomic operation) or two instructions each. A
stall is only possible in case of the cmpxchg failing.

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-15 23:52 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello, Christoph.

On Fri, Apr 15, 2011 at 02:43:15PM -0500, Christoph Lameter wrote:
> On Sat, 16 Apr 2011, Tejun Heo wrote:
>
> > > +			new = 0;
> > > +		}
> > > +#ifdef CONFIG_PREEMPT
> > > +	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
> > > +#else
> > > +	} while (0);
> > > +	this_cpu_write(*fbc->counters, new);
> > > +#endif
> >
> > Eeeek, no. If you want to do the above, please put it in a separate
> > inline function with sufficient comment.
>
> That would not work well with the control flow.

It doesn't have to be that way. ie.

	static inline bool pcnt_add_cmpxchg(counter, count, new)
	{
		/* blah blah */
	#ifdef PREEMPT
		return this_cpu_cmpxchg() == count;
	#else
		this_cpu_write();
		return true;
	#endif
	}

	void __percpu_counter_add(...)
	{
		...
		do {
			...
		} while (!pcnt_add_cmpxchg(counter, count, new));
		...
	}

It's the same thing but ifdef'd "} while()"'s are just too ugly.

> Just leave the cmpxchg for both cases? That would make the function
> irq safe as well!

Maybe, I don't know. On x86, it shouldn't be a problem on both 32 and
64bit. Even on archs which lack local cmpxchg, preemption flips are
cheap anyway, so yeah, maybe.

> > > +	if (unlikely(overflow)) {
> > > 		spin_lock(&fbc->lock);
> > > -		fbc->count += count;
> > > -		__this_cpu_write(*fbc->counters, 0);
> > > +		fbc->count += overflow;
> > > 		spin_unlock(&fbc->lock);
> >
> > Why put this outside and use yet another branch?
>
> Because that way we do not need preempt enable/disable. The cmpxchg is
> used to update the per cpu counter in the slow case as well. All that is
> left then is to add the count to the global counter.
>
> The branches are not an issue since they are forward branches over one
> (after converting to an atomic operation) or two instructions each. A
> stall is only possible in case of the cmpxchg failing.

It's slow path and IMHO it's needlessly complex. I really don't care
whether the counter is reloaded once more or the task gets migrated to
another cpu before spin_lock() and ends up flushing the local counter on
a cpu where it isn't strictly necessary. Let's keep it simple.

Thank you.

--
tejun

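(Illustrative aside, not part of the original mail: one way the helper
sketched above could be filled in. pcnt_add_cmpxchg() is Tejun's
placeholder name, not an existing kernel interface, and the exact
signature is an assumption.)

	/*
	 * Filled-in sketch of the helper outlined above.  On PREEMPT kernels
	 * the local (non-LOCKed) cmpxchg detects that the per-cpu slot changed
	 * underneath us and asks the caller to retry; on !PREEMPT kernels
	 * nothing can preempt us between the read and the write, so a plain
	 * store is enough.
	 */
	static inline bool pcnt_add_cmpxchg(struct percpu_counter *fbc,
					    s64 count, s64 new)
	{
	#ifdef CONFIG_PREEMPT
		return this_cpu_cmpxchg(*fbc->counters, count, new) == count;
	#else
		this_cpu_write(*fbc->counters, new);
		return true;
	#endif
	}

__percpu_counter_add() would then loop with
"} while (!pcnt_add_cmpxchg(fbc, count, new));" exactly as in the
outline, keeping the #ifdef out of the control flow.
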
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-18 14:38 UTC
To: Tejun Heo; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

On Sat, 16 Apr 2011, Tejun Heo wrote:

> Maybe, I don't know. On x86, it shouldn't be a problem on both 32 and
> 64bit. Even on archs which lack local cmpxchg, preemption flips are
> cheap anyway, so yeah, maybe.

Preemption flips are not cheap since enabling preemption may mean a call
into the scheduler. On RT things get more expensive.

Preempt_enable means at least one additional branch. We are saving a
branch by not using preempt.

> > The branches are not an issue since they are forward branches over one
> > (after converting to an atomic operation) or two instructions each. A
> > stall is only possible in case of the cmpxchg failing.
>
> It's slow path and IMHO it's needlessly complex. I really don't care
> whether the counter is reloaded once more or the task gets migrated to
> another cpu before spin_lock() and ends up flushing the local counter on
> a cpu where it isn't strictly necessary. Let's keep it simple.

In order to make it simple I avoided a preempt enable/disable. With
Shaohua's patches there will be a simple atomic_add within the last if
clause. I was able to consolidate multiple code paths into the cmpxchg
loop with this approach.

The version below avoids the ugly #ifdef...


Subject: percpu: preemptless __per_cpu_counter_add V4

Use this_cpu_cmpxchg to avoid preempt_disable/enable in __percpu_counter_add.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 lib/percpu_counter.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2011-04-15 15:34:23.000000000 -0500
+++ linux-2.6/lib/percpu_counter.c	2011-04-18 09:31:37.000000000 -0500
@@ -71,19 +71,25 @@ EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	s64 count;
+	s64 count, new, overflow;
 
-	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
-	if (count >= batch || count <= -batch) {
+	do {
+		count = this_cpu_read(*fbc->counters);
+
+		new = count + amount;
+		/* In case of overflow fold it into the global counter instead */
+		if (new >= batch || new <= -batch) {
+			overflow = new;
+			new = 0;
+		} else
+			overflow = 0;
+	} while (this_cpu_cmpxchg(*fbc->counters, count, new) != count);
+
+	if (unlikely(overflow)) {
 		spin_lock(&fbc->lock);
-		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
+		fbc->count += overflow;
 		spin_unlock(&fbc->lock);
-	} else {
-		__this_cpu_write(*fbc->counters, count);
 	}
-	preempt_enable();
 }
 EXPORT_SYMBOL(__percpu_counter_add);

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-21 14:43 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello, Christoph.

On Mon, Apr 18, 2011 at 09:38:03AM -0500, Christoph Lameter wrote:
> Preemption flips are not cheap since enabling preemption may mean a call
> into the scheduler. On RT things get more expensive.
>
> Preempt_enable means at least one additional branch. We are saving a
> branch by not using preempt.

It is cheap. The cost of preempt_enable() regarding the scheduler call is
the TIF_NEED_RESCHED check. The scheduler call occurring afterwards is not
the overhead of the preemption check, but the overhead of preemption
itself. Also, in cases where the preemption check doesn't make sense (I
don't think that's the case here), the right thing to do is using
preempt_enable_no_resched().

> In order to make it simple I avoided a preempt enable/disable. With
> Shaohua's patches there will be a simple atomic_add within the last if
> clause. I was able to consolidate multiple code paths into the cmpxchg
> loop with this approach.
>
> The version below avoids the ugly #ifdef...

That said, combined with Shaohua's patch, maybe it's better this way.
Let's see...

Thanks.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-21 14:58 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello, Christoph, Shaohua.

On Thu, Apr 21, 2011 at 04:43:00PM +0200, Tejun Heo wrote:
> > In order to make it simple I avoided a preempt enable/disable. With
> > Shaohua's patches there will be a simple atomic_add within the last if
> > clause. I was able to consolidate multiple code paths into the cmpxchg
> > loop with this approach.
> >
> > The version below avoids the ugly #ifdef...
>
> That said, combined with Shaohua's patch, maybe it's better this way.
> Let's see...

Unfortunately, I have a new concern for __percpu_counter_sum(), which
applies to both your and Shaohua's change. Before these changes,
percpu_counter->lock protects the whole batch transfer. IOW, while
__percpu_counter_sum() is holding ->lock, it can be sure that a batch
transfer from a percpu counter to the main counter isn't in progress and
that the deviation it might see is limited by the number of on-going
percpu inc/dec's, which is much lower than batch transfers.

With the proposed changes to percpu counter, this no longer holds.
cl's patch de-couples the local counter update from the global counter
update and __percpu_counter_sum() can see a batch amount of deviation
per concurrent updater, making the whole visit-each-counter thing more
or less meaningless. This, however, can be fixed by putting the whole
slow path inside spin_lock() as suggested before, so that the whole
batch transfer from local to global is enclosed inside the spinlock.

Unfortunately, Shaohua's atomic64_t update ain't that easy. The whole
point of that update was avoiding spinlocks in favor of atomic64_t,
which naturally collides with the ability to enclose local and global
updates in the same exclusion block, which is necessary for
__percpu_counter_sum() accuracy.

So, Christoph, please put the whole slow path inside spin_lock().
Shaohua, unfortunately, I think your change is caught between a rock and
a hard place. Any ideas?

Thanks.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-21 17:50 UTC
To: Tejun Heo; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

On Thu, 21 Apr 2011, Tejun Heo wrote:

> Unfortunately, I have a new concern for __percpu_counter_sum(), which
> applies to both your and Shaohua's change. Before these changes,
> percpu_counter->lock protects the whole batch transfer. IOW, while
> __percpu_counter_sum() is holding ->lock, it can be sure that a batch
> transfer from a percpu counter to the main counter isn't in progress and
> that the deviation it might see is limited by the number of on-going
> percpu inc/dec's, which is much lower than batch transfers.

Yes. But there was already fuzziness before this patch: __percpu_counter_sum()
does not see the percpu counters that are concurrently being updated, due to
unserialized access to the counters. There is no material difference here.
The VM statistics counters work the same way and have to deal with similar
fuzziness effects.

> With the proposed changes to percpu counter, this no longer holds.
> cl's patch de-couples the local counter update from the global counter
> update and __percpu_counter_sum() can see a batch amount of deviation
> per concurrent updater, making the whole visit-each-counter thing more
> or less meaningless. This, however, can be fixed by putting the whole
> slow path inside spin_lock() as suggested before, so that the whole
> batch transfer from local to global is enclosed inside the spinlock.

The local counter increment was already decoupled before. The shifting of
the overflow into the global counter was also not serialized before.

> Unfortunately, Shaohua's atomic64_t update ain't that easy. The whole
> point of that update was avoiding spinlocks in favor of atomic64_t,
> which naturally collides with the ability to enclose local and global
> updates in the same exclusion block, which is necessary for
> __percpu_counter_sum() accuracy.

There was no total accuracy before either.

> So, Christoph, please put the whole slow path inside spin_lock().
> Shaohua, unfortunately, I think your change is caught between a rock and
> a hard place. Any ideas?

I think there is no new problem here.

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-21 18:01 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello, Christoph.

On Thu, Apr 21, 2011 at 12:50:11PM -0500, Christoph Lameter wrote:
> Yes. But there was already fuzziness before this patch: __percpu_counter_sum()
> does not see the percpu counters that are concurrently being updated, due to
> unserialized access to the counters. There is no material difference here.
> The VM statistics counters work the same way and have to deal with similar
> fuzziness effects.

That basically amounts to "there's no difference between
percpu_counter_sum() and percpu_counter_read()", which is true in the
sense that neither guarantees complete correctness.

The only difference between the two is the level of fuzziness. The
former deviates only by the number of concurrent updaters (and maybe
cacheline update latencies) while the latter may deviate in multiples
of @batch.

If you want to say that the difference in the level of fuzziness is
irrelevant, the first patch of this series should be removing
percpu_counter_sum() before making any other changes.

So, yeah, here, the level of fuzziness matters and that's exactly why we
have the hugely costly percpu_counter_sum(). Even with the proposed
change, percpu_counter_sum() would tend to be more accurate because the
number of batch deviations is limited by the number of concurrent
updaters, but it would still be much worse than before.

> The local counter increment was already decoupled before. The shifting of
> the overflow into the global counter was also not serialized before.

No, it wasn't.

	...
	if (count >= batch || count <= -batch) {
		spin_lock(&fbc->lock);
		fbc->count += count;
		__this_cpu_write(*fbc->counters, 0);
		spin_unlock(&fbc->lock);
	} else {
	...

percpu_counter_sum() would see either both the percpu and global
counters updated or un-updated. It will never see the local counter
reset with the global counter not yet updated.

> There was no total accuracy before either.

It's not about total accuracy. It's about different levels of
fuzziness. If it can be shown that the different levels of fuzziness
don't matter and thus percpu_counter_sum() can be removed, I'll be a
happy camper.

Thanks.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-21 18:20 UTC
To: Tejun Heo; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

On Thu, 21 Apr 2011, Tejun Heo wrote:

> The only difference between the two is the level of fuzziness. The
> former deviates only by the number of concurrent updaters (and maybe
> cacheline update latencies) while the latter may deviate in multiples
> of @batch.

I don't think multiples of batch are such a concern. Either the per cpu
counter is high or the overflow has been folded into the global counter.

The interregnum is very short and since the counters are already fuzzy
this is tolerable. We do the same thing elsewhere for vmstats.

> If you want to say that the difference in the level of fuzziness is
> irrelevant, the first patch of this series should be removing
> percpu_counter_sum() before making any other changes.

percpu_counter_sum() is more accurate since it considers the per cpu
counters. That is vastly different.

> > The local counter increment was already decoupled before. The shifting of
> > the overflow into the global counter was also not serialized before.
>
> No, it wasn't.
>
> 	...
> 	if (count >= batch || count <= -batch) {
> 		spin_lock(&fbc->lock);
> 		fbc->count += count;
> 		__this_cpu_write(*fbc->counters, 0);
> 		spin_unlock(&fbc->lock);
> 	} else {
> 	...
>
> percpu_counter_sum() would see either both the percpu and global
> counters updated or un-updated. It will never see the local counter
> reset with the global counter not yet updated.

Sure, there is a slight race there and there is no way to avoid that race
without a lock.

> > There was no total accuracy before either.
>
> It's not about total accuracy. It's about different levels of
> fuzziness. If it can be shown that the different levels of fuzziness
> don't matter and thus percpu_counter_sum() can be removed, I'll be a
> happy camper.

percpu_counter_sum() is a totally different animal since it considers the
per cpu differentials, but while it does that the per cpu differentials
can be updated. So the fuzziness is much lower than just looking at the
global counter, for which all sorts of counter differentials on multiple
cpus can be outstanding over long time periods.

Look at mm/vmstat.c. There is __inc_zone_state(), which does an analogous
thing, and include/linux/vmstat.h:zone_page_state_snapshot(), which is
analogous to percpu_counter_sum(). In fact, as far as I can tell, the
percpu_counter stuff was cribbed from that one.

What I did is the same process as in mm/vmstat.c:mod_state().

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-21 18:37 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello, Christoph.

On Thu, Apr 21, 2011 at 01:20:39PM -0500, Christoph Lameter wrote:
> I don't think multiples of batch are such a concern. Either the per cpu
> counter is high or the overflow has been folded into the global counter.
>
> The interregnum is very short and since the counters are already fuzzy
> this is tolerable. We do the same thing elsewhere for vmstats.

We're talking about three different levels of fuzziness.

1. percpu_counter_sum() before the changes

   May deviate by the number of concurrent updaters and cacheline
   update latencies.

2. percpu_counter_sum() after the changes

   May deviate by multiples of @batch; however, the duration
   during which the deviation may be visible is brief (really?
   we're allowing preemption between local and global updates).

3. percpu_counter_read()

   May deviate by multiples of @batch. Deviations are visible
   almost always.

You're arguing that the change from #1 to #2 should be okay, which might
as well be true, but your change per se doesn't require such a
compromise and there's no reason to bundle the two changes together, so,
again, please update your patch to avoid the transition from #1 to #2.

Shaohua's change requires the transition from #1 to #2, which might or
might not be okay. I really don't know. You say it should be okay as it
came from vmstat and vmstat is updated the same way; however, no matter
where it came from, percpu_counter is now used in different places which
may or may not have different expectations regarding the level of
fuzziness in percpu_counter_sum(), so we would need more than "but
vmstat does that too" to make the change.

If you haven't noticed yet, I'm not feeling too enthusiastic about cold
path optimizations. If a cold path is kicking in too often, change the
code such that things don't happen that way instead of trying to make
cold paths go faster. Leave cold paths robust and easy to understand.

So, unless someone can show me that percpu_counter_sum() is unnecessary
(ie. the differences between not only #1 and #2 but also between #1 and
#3 are irrelevant), I don't think I'm gonna change the slow path. It's
silly to micro-optimize the slow path to begin with and I'm not gonna do
that at the cost of a subtle functionality change which can bite us in
the ass in twisted ways.

Thanks.

--
tejun

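(Illustrative aside, not part of the original mail, with invented
numbers: take batch = 32 and 64 online CPUs. percpu_counter_read()
returns only fbc->count, and every per-cpu delta may sit anywhere in
(-batch, batch), so its result can be off by roughly 32 * 64 = 2048 at
any moment — level #3. Level #1 walks the per-cpu deltas while holding
fbc->lock, so it can only be off by the handful of updates racing with
the walk itself. Level #2 can additionally miss up to ~batch, i.e. ~32
here, for every updater that has already cleared its per-cpu slot but
has not yet folded the overflow into the global count.)
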
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Christoph Lameter @ 2011-04-21 18:54 UTC
To: Tejun Heo; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

On Thu, 21 Apr 2011, Tejun Heo wrote:

> > The interregnum is very short and since the counters are already fuzzy
> > this is tolerable. We do the same thing elsewhere for vmstats.
>
> We're talking about three different levels of fuzziness.
>
> 1. percpu_counter_sum() before the changes
>
>    May deviate by the number of concurrent updaters and cacheline
>    update latencies.

Not true. Any of the updaters may have done multiple updates by the time
we are through cycling through the list.

> 2. percpu_counter_sum() after the changes
>
>    May deviate by multiples of @batch; however, the duration
>    during which the deviation may be visible is brief (really?
>    we're allowing preemption between local and global updates).

Well, again, there is general fuzziness here and we are trying to make
the best of it without compromising performance too much. Shaohua's
numbers indicate that removing the lock is very advantageous. Moreover,
we do the same thing in other places.

> So, unless someone can show me that percpu_counter_sum() is unnecessary
> (ie. the differences between not only #1 and #2 but also between #1 and
> #3 are irrelevant), I don't think I'm gonna change the slow path. It's
> silly to micro-optimize the slow path to begin with and I'm not gonna do
> that at the cost of a subtle functionality change which can bite us in
> the ass in twisted ways.

Actually it's good to make the code paths for vmstats and percpu counters
similar. That is what this does too.

Preempt enable/disable in any function that is supposedly fast is
something bad that can be avoided with these patches as well.

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add

From: Tejun Heo @ 2011-04-21 19:08 UTC
To: Christoph Lameter; +Cc: Eric Dumazet, akpm, linux-mm, shaohua.li

Hello,

On Thu, Apr 21, 2011 at 01:54:51PM -0500, Christoph Lameter wrote:
> Well, again, there is general fuzziness here and we are trying to make
> the best of it without compromising performance too much. Shaohua's
> numbers indicate that removing the lock is very advantageous. Moreover,
> we do the same thing in other places.

The problem with Shaohua's numbers is that it's a pessimistic test case
with too low a batch count. If an optimization improves such situations
without compromising functionality or introducing too much complexity,
sure, why not? But I'm not sure that's the case here.

> Actually it's good to make the code paths for vmstats and percpu counters
> similar. That is what this does too.
>
> Preempt enable/disable in any function that is supposedly fast is
> something bad that can be avoided with these patches as well.

If you really want to push the _sum() fuzziness change, the only way to
do that would be auditing all the current users and making sure that it
won't affect any of them. It really doesn't matter what vmstat is doing.
They're different users.

And, no matter what, that's a separate issue from the this_cpu hot path
optimizations and should be done separately. So, _please_ update the
this_cpu patch so that it doesn't change the slow path semantics.

Thank you.

--
tejun

* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-21 19:08 ` Tejun Heo @ 2011-04-22 2:33 ` Shaohua Li 2011-04-26 12:10 ` Tejun Heo 0 siblings, 1 reply; 64+ messages in thread From: Shaohua Li @ 2011-04-22 2:33 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, 2011-04-22 at 03:08 +0800, Tejun Heo wrote: > Hello, > > On Thu, Apr 21, 2011 at 01:54:51PM -0500, Christoph Lameter wrote: > > Well again there is general fuzziness here and we are trying to make the > > best of it without compromising performance too much. Shaohua's numbers > > indicate that removing the lock is very advantageous. Moreover we do the > > same thing in other places. > > The problem with Shaohua's numbers is that it's a pessimistic test > case with too low a batch count. If an optimization improves such > situations without compromising functionality or introducing too much > complexity, sure, why not? But I'm not sure that's the case here. > > > Actually it's good to make the code paths for vmstats and percpu counters > > similar. That is what this does too. > > > > Preempt enable/disable in any function that is supposedly fast is > > something bad that can be avoided with these patches as well. > > If you really wanna push the _sum() fuzziness change, the only way to > do that would be auditing all the current users and making sure that > it won't affect any of them. It really doesn't matter what vmstat is > doing. They're different users. > > And, no matter what, that's a separate issue from the this_cpu hot > path optimizations and should be done separately. So, _please_ update > the this_cpu patch so that it doesn't change the slow path semantics. In the original implementation an updater can change the count several times too; it can update it from -(batch - 1) to (batch - 1) without holding the lock, so we already have a possible deviation of batch*num_cpus*2. If we really worry about _sum deviating too much, can we do something like this:

percpu_counter_sum
{
again:
	sum = 0
	old = atomic64_read(&fbc->counter)
	for_each_online_cpu()
		sum += per cpu counter
	new = atomic64_read(&fbc->counter)
	if (new - old > batch * num_cpus || old - new > batch * num_cpus)
		goto again;
	return new + sum;
}

In this way we limit the deviation to the number of concurrent updaters. This doesn't make _sum too slow either, because we have the batch * num_cpus check. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
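For readers following along, the retry check sketched above can be modelled in ordinary user-space C. Everything below is illustrative only: the struct layout, the NCPUS constant, and the model_* names are stand-ins rather than the kernel's percpu_counter implementation, and an atomic int64 plays the role of fbc->count.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NCPUS 4

struct model_counter {
	_Atomic int64_t count;     /* stands in for fbc->count */
	int32_t pcpu[NCPUS];       /* stands in for the per-cpu deltas */
	int32_t batch;
};

static int64_t model_sum(struct model_counter *c)
{
	int64_t old, new, sum;

	do {
		old = atomic_load(&c->count);
		sum = 0;
		for (int cpu = 0; cpu < NCPUS; cpu++)
			sum += c->pcpu[cpu];
		new = atomic_load(&c->count);
		/* Restart if the global part moved by more than the
		 * worst-case per-cpu slack while we were summing. */
	} while (new - old > (int64_t)c->batch * NCPUS ||
		 old - new > (int64_t)c->batch * NCPUS);

	return new + sum;
}

int main(void)
{
	struct model_counter c = { .batch = 32 };

	atomic_store(&c.count, 100);
	c.pcpu[0] = 5;
	c.pcpu[2] = -3;
	printf("sum = %lld\n", (long long)model_sum(&c));
	return 0;
}

Note that the check only bounds how far the global part moved while the walk was in flight; it says nothing about the per-cpu deltas changing under the walk, which is the hole Tejun points at in the reply below.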
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-22 2:33 ` Shaohua Li @ 2011-04-26 12:10 ` Tejun Heo 2011-04-26 19:02 ` Hugh Dickins 2011-04-27 5:43 ` Shaohua Li 0 siblings, 2 replies; 64+ messages in thread From: Tejun Heo @ 2011-04-26 12:10 UTC (permalink / raw) To: Shaohua Li Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hello, please pardon the delay (and probably bad temper). I'm still sick & slow. On Fri, Apr 22, 2011 at 10:33:00AM +0800, Shaohua Li wrote: > > And, no matter what, that's a separate issue from the this_cpu hot > > path optimizations and should be done separately. So, _please_ update > > the this_cpu patch so that it doesn't change the slow path semantics. > > In the original implementation an updater can change the count several times too; > it can update it from -(batch - 1) to (batch - 1) without holding > the lock, so we already have a possible deviation of batch*num_cpus*2. That would be a pathological case but, even then, after the change the number becomes much higher as it becomes a function of batch * num_updaters, right? I'll try to re-summarize my concerns as my communications don't seem to be getting through very well these few days (likely my fault). The biggest issue I have with the change is that with the suggested changes, the deviation seen by _sum becomes much less predictable. _sum can't be accurate. It never was and never will be, but the deviations have been quite predictable regardless of @batch. It's dependent only on the number and frequency of concurrent updaters. If concurrent updates aren't very frequent and numerous, the caller is guaranteed to get a result which deviates only by a quite small margin. If concurrent updates are very frequent and numerous, the caller naturally can't expect a very accurate result. However, after the change, especially with high @batch count, the result may deviate significantly even with low frequency concurrent updates. @batch deviations won't happen often but will happen once in a while, which is just nasty and makes the API much less useful and those occasional deviations can cause sporadic erratic behaviors - e.g. filesystems use it for free block accounting. It's actually used for somewhat critical decision making. If it were in the fast path, sure, we might, and plan for slower contingencies where accuracy is more important, but we're talking about slow path already - it's visiting each per-cpu area for $DEITY's sake, so the tradeoff doesn't make a lot of sense to me. > If we really worry about _sum deviating too much, can we do something > like this:
> percpu_counter_sum
> {
> again:
> 	sum = 0
> 	old = atomic64_read(&fbc->counter)
> 	for_each_online_cpu()
> 		sum += per cpu counter
> 	new = atomic64_read(&fbc->counter)
> 	if (new - old > batch * num_cpus || old - new > batch * num_cpus)
> 		goto again;
> 	return new + sum;
> }
> In this way we limit the deviation to the number of concurrent updaters. This > doesn't make _sum too slow either, because we have the batch * num_cpus > check. I don't really worry about _sum performance. It's a quite slow path and most of the cost is from causing cacheline bounces anyway. That said, I don't see how the above would help the deviation problem. Let's say an updater reset per cpu counter but got preempted before updating the global counter. What difference does it make to check fbc->counter before & after like above? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-26 12:10 ` Tejun Heo @ 2011-04-26 19:02 ` Hugh Dickins 2011-04-27 10:28 ` Tejun Heo 2011-04-27 5:43 ` Shaohua Li 1 sibling, 1 reply; 64+ messages in thread From: Hugh Dickins @ 2011-04-26 19:02 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Tue, 26 Apr 2011, Tejun Heo wrote: > > However, after the change, especially with high @batch count, the > result may deviate significantly even with low frequency concurrent > updates. @batch deviations won't happen often but will happen once in > a while, which is just nasty and makes the API much less useful and > those occasional deviations can cause sporadic erratic behaviors - > e.g. filesystems use it for free block accounting. It's actually used > for somewhat critical decision making. This worried me a little when the percpu block counting went into tmpfs, though it's not really critical there. Would it be feasible, with these counters that are used against limits, to have an adaptive batching scheme such that the batches get smaller and smaller, down to 1 and to 0, as the total approaches the limit? (Of course a single global percpu_counter_batch won't do for this.) Perhaps it's a demonstrable logical impossibility, perhaps it would slow down the fast (far from limit) path more than we can afford, perhaps I haven't read enough of this thread and I'm taking it off-topic. Forgive me if so. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
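One way to read the adaptive-batching idea above is that the batch becomes a function of the remaining headroom to the limit. A minimal sketch, assuming a per-counter limit is known to the caller; the function name, the divisor and the clamping policy are arbitrary illustrative choices, not an existing kernel interface:

#include <stdint.h>

/* Shrink the per-cpu batch as the counter approaches its limit.
 * Far from the limit this returns max_batch; near the limit it
 * drops toward 1 and finally 0, forcing updates to hit the global
 * counter directly. */
static int32_t adaptive_batch(int64_t approx_total, int64_t limit,
			      int32_t max_batch, int ncpus)
{
	int64_t headroom = limit - approx_total;
	int64_t b;

	if (headroom <= 0)
		return 0;			/* at or over the limit */

	/* Keep the total per-cpu slack (batch * ncpus) well under
	 * the remaining headroom. */
	b = headroom / (4 * (int64_t)ncpus);
	if (b > max_batch)
		b = max_batch;
	return (int32_t)b;
}

A caller that keeps its own limit could then pass such a value as the @batch argument of __percpu_counter_add() instead of the global percpu_counter_batch, which is roughly what "a single global percpu_counter_batch won't do for this" suggests.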
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-26 19:02 ` Hugh Dickins @ 2011-04-27 10:28 ` Tejun Heo 0 siblings, 0 replies; 64+ messages in thread From: Tejun Heo @ 2011-04-27 10:28 UTC (permalink / raw) To: Hugh Dickins Cc: Shaohua Li, Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hello, Hugh. On Tue, Apr 26, 2011 at 12:02:15PM -0700, Hugh Dickins wrote: > This worried me a little when the percpu block counting went into tmpfs, > though it's not really critical there. > > Would it be feasible, with these counters that are used against limits, > to have an adaptive batching scheme such that the batches get smaller > and smaller, down to 1 and to 0, as the total approaches the limit? > (Of course a single global percpu_counter_batch won't do for this.) > > Perhaps it's a demonstrable logical impossibility, perhaps it would > slow down the fast (far from limit) path more than we can afford, > perhaps I haven't read enough of this thread and I'm taking it > off-topic. Forgive me if so. Hmmm... it seems like what you're suggesting is basically a simple per-cpu slot allocator.

* The global counter is initialized with a predefined number of slots and the per-cpu buffers start at zero (or with initial chunks).

* Allocation first consults the per-cpu buffer and if enough slots are there, they're consumed. If not, the global counter is accessed and some portion of it is transferred to the per-cpu buffer and allocation is retried. The amount transferred is adjusted according to the amount of slots available in the global pool.

* Free populates the per-cpu buffer. If it meets certain criteria, part of the buffer is returned to the global pool.

The principles are easy but as with any per-cpu allocation scheme balancing and reclaiming would be the complex part, i.e. what to do when the global and some per-cpu buffers are draining while a few still have some left. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
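A rough, single-threaded sketch of that slot-allocator shape, for readers who want to see it in code. No locking is shown, and the struct, names and refill/return thresholds are invented for illustration only:

#include <stdbool.h>
#include <stdint.h>

#define NCPUS     4
#define REFILL    16	/* slots pulled from the global pool on a miss */
#define RETURN_AT 64	/* per-cpu level at which slots are given back */

struct slot_pool {
	int64_t global;		/* slots left in the global pool */
	int64_t pcpu[NCPUS];	/* slots cached per cpu */
};

static bool slot_alloc(struct slot_pool *p, int cpu)
{
	if (p->pcpu[cpu] == 0) {
		int64_t take = p->global < REFILL ? p->global : REFILL;

		if (take == 0)
			return false;	/* genuinely out of slots */
		p->global -= take;
		p->pcpu[cpu] += take;
	}
	p->pcpu[cpu]--;
	return true;
}

static void slot_free(struct slot_pool *p, int cpu)
{
	p->pcpu[cpu]++;
	if (p->pcpu[cpu] >= RETURN_AT) {
		int64_t back = p->pcpu[cpu] / 2;

		p->global += back;
		p->pcpu[cpu] -= back;
	}
}

The hard part Tejun mentions, rebalancing when the global pool and some per-cpu caches run dry while others still hold slots, is exactly what this sketch leaves out.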
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-26 12:10 ` Tejun Heo 2011-04-26 19:02 ` Hugh Dickins @ 2011-04-27 5:43 ` Shaohua Li 2011-04-27 10:20 ` Tejun Heo 1 sibling, 1 reply; 64+ messages in thread From: Shaohua Li @ 2011-04-27 5:43 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Tue, 2011-04-26 at 20:10 +0800, Tejun Heo wrote: > Hello, please pardon delay (and probably bad temper). I'm still sick > & slow. no problem. > On Fri, Apr 22, 2011 at 10:33:00AM +0800, Shaohua Li wrote: > > > And, no matter what, that's a separate issue from the this_cpu hot > > > path optimizations and should be done separately. So, _please_ update > > > this_cpu patch so that it doesn't change the slow path semantics. > > > > in the original implementation, a updater can change several times too, > > it can update the count from -(batch -1) to (batch -1) without holding > > the lock. so we always have batch*num_cpus*2 deviate > > That would be a pathelogical case but, even then, after the change the > number becomes much higher as it becomes a function of batch * > num_updaters, right? I don't understand the difference between batch * num_updaters and batch * num_cpus except preempt. So the only problem here is _add should have preempt disabled? I agree preempt can make deviation worse. except the preempt issue, are there other concerns against the atomic convert? in the preempt disabled case, before/after the atomic convert the deviation is the same (batch*num_cpus) > I'll try to re-summarize my concerns as my communications don't seem > to be getting through very well these few days (likely my fault). > > The biggest issue I have with the change is that with the suggested > changes, the devaition seen by _sum becomes much less predictable. > _sum can't be accurate. It never was and never will be, but the > deviations have been quite predictable regardless of @batch. It's > dependent only on the number and frequency of concurrent updaters. > > If concurrent updates aren't very frequent and numerous, the caller is > guaranteed to get a result which deviates only by quite small margin. > If concurrent updates are very frequent and numerous, the caller > natuarally can't expect a very accurate result. > > However, after the change, especially with high @batch count, the > result may deviate significantly even with low frequency concurrent > updates. @batch deviations won't happen often but will happen once in > a while, which is just nasty and makes the API much less useful and > those occasional deviations can cause sporadic erratic behaviors - > e.g. filesystems use it for free block accounting. It's actually used > for somewhat critical decision making. > > If it were in the fast path, sure, we might and plan for slower > contingencies where accuracy is more important, but we're talking > about slow path already - it's visiting each per-cpu area for $DEITY's > sake, so the tradeoff doesn't make a lot of sense to me. > > > if we really worry about _sum deviates too much. can we do something > > like this: > > percpu_counter_sum > > { > > again: > > sum=0 > > old = atomic64_read(&fbc->counter) > > for_each_online_cpu() > > sum += per cpu counter > > new = atomic64_read(&fbc->counter) > > if (new - old > batch * num_cpus || old - new > batch * num_cpus) > > goto again; > > return new + sum; > > } > > in this way we limited the deviate to number of concurrent updater. 
This > > doesn't make _sum too slow too, because we have the batch * num_cpus > > check. > > I don't really worry about _sum performance. It's a quite slow path > and most of the cost is from causing cacheline bounces anyway. That > said, I don't see how the above would help the deviation problem. > Let's say an updater reset per cpu counter but got preempted before > updating the global counter. What differences does it make to check > fbc->counter before & after like above? yes, this is a problem. Again I don't mind to disable preempt in _add. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-27 5:43 ` Shaohua Li @ 2011-04-27 10:20 ` Tejun Heo 2011-04-28 3:28 ` Shaohua Li 0 siblings, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-27 10:20 UTC (permalink / raw) To: Shaohua Li Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hello, Shaohua. On Wed, Apr 27, 2011 at 01:43:29PM +0800, Shaohua Li wrote: > > That would be a pathological case but, even then, after the change the > > number becomes much higher as it becomes a function of batch * > > num_updaters, right? > > I don't understand the difference between batch * num_updaters and batch > * num_cpus except preempt. So the only problem here is _add should have > preempt disabled? I agree preempt can make deviation worse. > except the preempt issue, are there other concerns against the atomic > convert? in the preempt disabled case, before/after the atomic convert > the deviation is the same (batch*num_cpus) Yes, with preemption disabled, I think the pathological worst case wouldn't be too different. > > I don't really worry about _sum performance. It's a quite slow path > > and most of the cost is from causing cacheline bounces anyway. That > > said, I don't see how the above would help the deviation problem. > > Let's say an updater reset per cpu counter but got preempted before > > updating the global counter. What difference does it make to check > > fbc->counter before & after like above? > > yes, this is a problem. Again I don't mind to disable preempt in _add. Okay, this communication failure isn't my fault. Please re-read what I wrote before; my concern wasn't primarily about the pathological worst case - if that many concurrent updates are happening && the counter needs to be accurate, it can't even use atomic counter. It should be doing full exclusion around the counter and the associated operation _together_. I'm worried about sporadic erratic behavior happening regardless of update frequency and preemption would contribute but isn't necessary for that to happen. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-27 10:20 ` Tejun Heo @ 2011-04-28 3:28 ` Shaohua Li 2011-04-28 10:09 ` Tejun Heo 0 siblings, 1 reply; 64+ messages in thread From: Shaohua Li @ 2011-04-28 3:28 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Wed, 2011-04-27 at 18:20 +0800, Tejun Heo wrote: > Hello, Shaohua. > > On Wed, Apr 27, 2011 at 01:43:29PM +0800, Shaohua Li wrote: > > > That would be a pathelogical case but, even then, after the change the > > > number becomes much higher as it becomes a function of batch * > > > num_updaters, right? > > > > I don't understand the difference between batch * num_updaters and batch > > * num_cpus except preempt. So the only problem here is _add should have > > preempt disabled? I agree preempt can make deviation worse. > > except the preempt issue, are there other concerns against the atomic > > convert? in the preempt disabled case, before/after the atomic convert > > the deviation is the same (batch*num_cpus) > > Yes, with preemption disabled, I think the patheological worst case > wouldn't be too different. > > > > I don't really worry about _sum performance. It's a quite slow path > > > and most of the cost is from causing cacheline bounces anyway. That > > > said, I don't see how the above would help the deviation problem. > > > Let's say an updater reset per cpu counter but got preempted before > > > updating the global counter. What differences does it make to check > > > fbc->counter before & after like above? > > > > yes, this is a problem. Again I don't mind to disable preempt in _add. > > Okay, this communication failure isn't my fault. Please re-read what > I wrote before, my concern wasn't primarily about pathological worst > case - if that many concurrent updates are happening && the counter > needs to be accurate, it can't even use atomic counter. It should be > doing full exclusion around the counter and the associated operation > _together_. > > I'm worried about sporadic erratic behavior happening regardless of > update frequency and preemption would contribute but isn't necessary > for that to happen. Ok, I misunderstood the mail you sent to Christoph, sorry. So you have no problem about the atomic convert. I'll update the patch against base tree, given the preemptless patch has problem. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 3:28 ` Shaohua Li @ 2011-04-28 10:09 ` Tejun Heo 2011-04-28 14:11 ` Christoph Lameter 2011-04-29 8:19 ` Shaohua Li 0 siblings, 2 replies; 64+ messages in thread From: Tejun Heo @ 2011-04-28 10:09 UTC (permalink / raw) To: Shaohua Li Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hey, On Thu, Apr 28, 2011 at 11:28:04AM +0800, Shaohua Li wrote: > > Okay, this communication failure isn't my fault. Please re-read what > > I wrote before, my concern wasn't primarily about pathological worst > > case - if that many concurrent updates are happening && the counter > > needs to be accurate, it can't even use atomic counter. It should be > > doing full exclusion around the counter and the associated operation > > _together_. > > > > I'm worried about sporadic erratic behavior happening regardless of > > update frequency and preemption would contribute but isn't necessary > > for that to happen. > > Ok, I misunderstood the mail you sent to Christoph, sorry. So you have > no problem about the atomic convert. I'll update the patch against base > tree, given the preemptless patch has problem. Hmm... we're now more lost than ever. :-( Can you please re-read my message two replies ago? The one where I talked about sporadic erratic behaviors in length and why I was worried about it. In your last reply, you talked about preemption and that you didn't have problems with disabling preemption, which, unfortunately, doesn't have much to do with my concern with the sporadic erratic behaviors and that's what I pointed out in my previous reply. So, it doesn't feel like anything is resolved. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 10:09 ` Tejun Heo @ 2011-04-28 14:11 ` Christoph Lameter 2011-04-28 14:23 ` Tejun Heo 2011-04-29 8:19 ` Shaohua Li 1 sibling, 1 reply; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 14:11 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Tejun Heo wrote: > Hmm... we're now more lost than ever. :-( Can you please re-read my > message two replies ago? The one where I talked about sporadic > erratic behaviors in length and why I was worried about it. > > In your last reply, you talked about preemption and that you didn't > have problems with disabling preemption, which, unfortunately, doesn't > have much to do with my concern with the sporadic erratic behaviors > and that's what I pointed out in my previous reply. So, it doesn't > feel like anything is resolved. Sporadic erratic behavior exists today since any thread can add an arbitrary number to its local counter while you are adding up all the per cpu differentials. If this happens just after you picked up the value then a single cpu can cause a high deviation. If multiple cpus do this then a high degree of deviation can even be had with today's implementation. Can you show in some tests how the chance of deviations is increased? If at all, then in some special situations. Maybe others get better? The counters were always designed to be racy for performance reasons. Trying to serialize them goes against the design of these things. In order to increase accuracy you have to decrease the allowable delta in the per cpu differentials. Looping over all differentials to get more accuracy is something that may not work, as we have seen recently with the VM counter issues that caused bad behavior during reclaim. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:11 ` Christoph Lameter @ 2011-04-28 14:23 ` Tejun Heo 2011-04-28 14:30 ` Tejun Heo 2011-04-28 14:42 ` Christoph Lameter 0 siblings, 2 replies; 64+ messages in thread From: Tejun Heo @ 2011-04-28 14:23 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hello, Christoph. On Thu, Apr 28, 2011 at 09:11:20AM -0500, Christoph Lameter wrote: > Sporadic erratic behavior exists today since any thread can add an > arbitrary number to its local counter while you are adding up all the per > cpu differentials. If this happens just after you picked up the value then > a single cpu can cause a high deviation. If multiple cpus do this then a > high degree of deviation can even be had with today's implementation. Yeah, but that's still something which is expected. If the user is adding/subtracting large numbers concurrently, sure. What I'm concerned about is adding unexpected deviation. Currently, the _sum interface provides basically the same level of expectedness (is this even a word?) as atomic_t - the deviations are solely caused and limited by concurrent updates. After the proposed changes, one concurrent updater doing +1 can cause @batch deviation. > Can you show in some tests how the chance of deviations is increased? If > at all, then in some special situations. Maybe others get better? It's kinda obvious, isn't it? Do relatively low freq (say, every 10ms) +1's and continuously do _sum(). Before, _sum() would never deviate much from the real count. After, there will be @batch jumps. If you still need proof code, I would write it but please note that I'm pretty backed up. > The counters were always designed to be racy for performance reasons. > Trying to serialize them goes against the design of these things. In order > to increase accuracy you have to decrease the allowable delta in the per > cpu differentials. > > Looping over all differentials to get more accuracy is something that may > not work, as we have seen recently with the VM counter issues that caused > bad behavior during reclaim. Such users then shouldn't use _sum() - maybe rename it to _very_slow_sum() if you're concerned about misuse. percpu_counter() is already used in filesystems to count free blocks and there are times where atomic_t type accuracy is needed and _sum() achieves that. The proposed changes break that. Why do I need to say this over and over again? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:23 ` Tejun Heo @ 2011-04-28 14:30 ` Tejun Heo 2011-04-28 14:58 ` Christoph Lameter 2011-04-28 14:42 ` Christoph Lameter 1 sibling, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-28 14:30 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, Apr 28, 2011 at 04:23:31PM +0200, Tejun Heo wrote: > Such users then shouldn't use _sum() - maybe rename it to > _very_slow_sum() if you're concerned about misusage. percpu_counter() > is already used in filesystems to count free blocks and there are > times where atomic_t type accuracy is needed and _sum() achieves that. > The proposed changes break that. Why do I need to say this over and > over again? And I'm getting more and more frustrated. THIS IS SLOW PATH. If it's showing up on your profile, bump up @batch. It doesn't make any sense to micro optimize slow path at the cost of introducing such nastiness. Unless someone can show me such nastiness doesn't exist, I'm not gonna take this change. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:30 ` Tejun Heo @ 2011-04-28 14:58 ` Christoph Lameter 0 siblings, 0 replies; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 14:58 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Tejun Heo wrote: > And I'm getting more and more frustrated. THIS IS SLOW PATH. If it's > showing up on your profile, bump up @batch. It doesn't make any sense > to micro optimize slow path at the cost of introducing such nastiness. > Unless someone can show me such nastiness doesn't exist, I'm not gonna > take this change. I think it's dangerous to think about these counters as comparable to atomics. The user always has to keep in mind that these counters are fuzzy and the coder has to consider that for any use of these. Simplifying the slow path (and the rest of the code) by dropping the spinlock is a worthy goal and may allow inlining the whole _add function at some point since it then becomes quite small and no longer needs to do function calls. As a side effect: seeing the code without a spinlock will make sure that no one will assume that he will get atomic_t type consistency from these counters. Having a lock in there seems to cause strange expectations. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:23 ` Tejun Heo 2011-04-28 14:30 ` Tejun Heo @ 2011-04-28 14:42 ` Christoph Lameter 2011-04-28 14:44 ` Tejun Heo 1 sibling, 1 reply; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 14:42 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Tejun Heo wrote: > On Thu, Apr 28, 2011 at 09:11:20AM -0500, Christoph Lameter wrote: > > Sporadic erratic behavior exists today since any thread can add an > > abitrary number to its local counter while you are adding up all the per > > cpu differentials. If this happens just after you picked up the value then > > a single cpu can cause a high deviation. If multiple cpus do this then a > > high degree of deviation can even be had with todays implementation. > > Yeah, but that's still something which is expected. If the user is > adding/subtracing large number concurrently, sure. What I'm concerned > about is adding unexpected deviation. Currently, the _sum interface > basically provides basically the same level of expectedness (is this > even a word?) as atomic_t - the deviations are solely caused and > limited by concurrent updates. After the proposed changes, one > concurrent updater doing +1 can cause @batch deviation. As I explained before _sum does not provide the accuracy that you think. > > Can you show in some tests how the chance of deviations is increased? If > > at all then in some special sitations. Maybe others get better? > > It's kinda obvious, isn't it? Do relatively low freq (say, every > 10ms) +1's and continuously do _sum(). Before, _sum() would never > deviate much from the real count. After, there will be @batch jumps. > If you still need proof code, I would write it but please note that > I'm pretty backed up. "Obvious" could mean that you are drawing conclusions without a proper reasoning chain. Here you assume certain things about the users of the counters. The same assumptions were made when we had the vm counter issues. The behavior of counter increments is typically not a regular stream but occurs in spurts. > > Looping over all differentials to get more accuracy is something that may > > not work as we have seen recently with the VM counters issues that caused > > bad behavior during reclaim. > > Such users then shouldn't use _sum() - maybe rename it to > _very_slow_sum() if you're concerned about misusage. percpu_counter() > is already used in filesystems to count free blocks and there are > times where atomic_t type accuracy is needed and _sum() achieves that. > The proposed changes break that. Why do I need to say this over and > over again? Because you are making strange assumptions about "accuracy" of the counters? There is no atomic_t type accuracy with per cpu counters as we have shown multiple times. Why are you repeating the same nonsense again and again? If you want atomic_t accuracy then you need to use atomic_t and take the performance penalty. Per cpu counters involve fuzziness and through that fuzziness we gain a performance advantage. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:42 ` Christoph Lameter @ 2011-04-28 14:44 ` Tejun Heo 2011-04-28 14:52 ` Christoph Lameter 0 siblings, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-28 14:44 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, Apr 28, 2011 at 09:42:49AM -0500, Christoph Lameter wrote: > > > Can you show in some tests how the chance of deviations is increased? If > > > at all then in some special sitations. Maybe others get better? > > > > It's kinda obvious, isn't it? Do relatively low freq (say, every > > 10ms) +1's and continuously do _sum(). Before, _sum() would never > > deviate much from the real count. After, there will be @batch jumps. > > If you still need proof code, I would write it but please note that > > I'm pretty backed up. > > "Obvious" could mean that you are drawing conclusions without a proper > reasoning chain. Here you assume certain things about the users of the > counters. The same assumptions were made when we had the vm counter > issues. The behavior of counter increments is typically not a regular > stream but occurs in spurts. Eh? Are you saying the above can't happen or the above doesn't matter? -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:44 ` Tejun Heo @ 2011-04-28 14:52 ` Christoph Lameter 2011-04-28 14:56 ` Tejun Heo 0 siblings, 1 reply; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 14:52 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Tejun Heo wrote: > Eh? Are you saying the above can't happen or the above doesn't > matter? It's an artificial use case that does not reflect the realities of how these counters are typically used. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:52 ` Christoph Lameter @ 2011-04-28 14:56 ` Tejun Heo 2011-04-28 15:05 ` Christoph Lameter 0 siblings, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-28 14:56 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hey, On Thu, Apr 28, 2011 at 09:52:35AM -0500, Christoph Lameter wrote: > On Thu, 28 Apr 2011, Tejun Heo wrote: > > > Eh? Are you saying the above can't happen or the above doesn't > > matter? > > Its an artificial use case that does not reflect the realities on how > these counters are typically used. Gees, Christoph. That is a test case to show the issue prominently, which is what a test case is supposed to do. What it means is that _any_ update can trigger @batch deviation on _sum() regardless of its frequency or concurrency level and that's the nastiness I've been talking about over and over again. For fast path, sure. For slow path, I don't think so. If the tradeoff still doesn't make sense to you, I don't know how to persuade you guys. I'm not gonna take it. You and Shaohua are welcome to go over my head and send the changes to Andrew or Linus, but please keep me cc'd so that I can voice my objection. Thank you. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 14:56 ` Tejun Heo @ 2011-04-28 15:05 ` Christoph Lameter 2011-04-28 15:12 ` Tejun Heo 0 siblings, 1 reply; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 15:05 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Tejun Heo wrote: > Hey, > > On Thu, Apr 28, 2011 at 09:52:35AM -0500, Christoph Lameter wrote: > > On Thu, 28 Apr 2011, Tejun Heo wrote: > > > > > Eh? Are you saying the above can't happen or the above doesn't > > > matter? > > > > Its an artificial use case that does not reflect the realities on how > > these counters are typically used. > > Gees, Christoph. That is a test case to show the issue prominently, > which is what a test case is supposed to do. What it means is that > _any_ update can trigger @batch deviation on _sum() regardless of its > frequency or concurrency level and that's the nastiness I've been > talking about over and over again. As far as I understand it: This is a test case where you want to show us the atomic_t type behavior of _sum. This only works in such an artificial test case. In reality batches of updates will modify any 'accurate' result that you may have obtained from the _sum function. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:05 ` Christoph Lameter @ 2011-04-28 15:12 ` Tejun Heo 2011-04-28 15:22 ` Christoph Lameter 0 siblings, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-28 15:12 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, Apr 28, 2011 at 10:05:09AM -0500, Christoph Lameter wrote: > On Thu, 28 Apr 2011, Tejun Heo wrote: > > Gees, Christoph. That is a test case to show the issue prominently, > > which is what a test case is supposed to do. What it means is that > > _any_ update can trigger @batch deviation on _sum() regardless of its > > frequency or concurrency level and that's the nastiness I've been > > talking about over and over again. > > As far as I understand it: This is a test case where you want to show us > the atomic_t type behavior of _sum. This only works in such an artificial > test case. In reality batches of updates will modify any 'accurate' result > that you may have obtained from the _sum function. It seems like we can split hairs all day long about the similarities and differences with atomics, so let's forget about atomics for now. I don't like any update having possibility of causing @batch jumps in _sum() result. That severely limits the usefulness of hugely expensive _sum() and the ability to scale @batch. Not everything in the world is vmstat. Think about other _CURRENT_ use cases in filesystems. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:12 ` Tejun Heo @ 2011-04-28 15:22 ` Christoph Lameter 2011-04-28 15:31 ` Tejun Heo 2011-04-28 15:48 ` Eric Dumazet 0 siblings, 2 replies; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 15:22 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Tejun Heo wrote: > It seems like we can split hairs all day long about the similarities > and differences with atomics, so let's forget about atomics for now. Ok good idea. So you accept that per cpu counters are inevitably fuzzy and that the result cannot be relied upon in the same way as on atomics? > I don't like any update having possibility of causing @batch jumps in > _sum() result. That severely limits the usefulness of hugely > expensive _sum() and the ability to scale @batch. Not everything in > the world is vmstat. Think about other _CURRENT_ use cases in > filesystems. Of course you can cause jumps > batch. You are looping over 8 cpus and have done cpu 1 and 2 already. Then cpu 1 adds batch-1 to the local per cpu counter. cpu 2 increments its per cpu counter. When _sum returns we have a deviation of batch. In the worst case the deviation can increase to (NR_CPUS - 1) * (batch - 1). That is for the current code!!! The hugely expensive _sum() is IMHO pretty useless given the above. It is a function that is called with the *hope* of getting a more accurate result. The only way to reduce the fuzziness is by reducing batch. But then you don't need the _sum function. Either that or you need additional external synchronization by the subsystem that prevents updates. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:22 ` Christoph Lameter @ 2011-04-28 15:31 ` Tejun Heo 2011-04-28 15:40 ` Tejun Heo 2011-04-28 15:48 ` Eric Dumazet 1 sibling, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-28 15:31 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, Apr 28, 2011 at 10:22:19AM -0500, Christoph Lameter wrote: > On Thu, 28 Apr 2011, Tejun Heo wrote: > > > It seems like we can split hairs all day long about the similarities > > and differences with atomics, so let's forget about atomics for now. > > Ok good idea. So you accept that per cpu counters are inevitably fuzzy and > that the result cannot be relied upon in the same way as on atomics? No, I don't, at all. > > I don't like any update having possibility of causing @batch jumps in > > _sum() result. That severely limits the usefulness of hugely > > expensive _sum() and the ability to scale @batch. Not everything in > > the world is vmstat. Think about other _CURRENT_ use cases in > > filesystems. > > Of course you can cause jumps > batch. You are looping over 8 cpus and > have done cpu 1 and 2 already. Then cpu 1 adds batch-1 to the local per > cpu counter. cpu 2 increments its per cpu counter. When _sum returns we > have deviation of batch. > > In the worst case the deviation can increase to (NR_CPUS - 1) * (batch > -1). That is for the current code!!! > > The hugely expensive _sum() is IMHO pretty useless given the above. It is > a function that is called with the *hope* of getting a more accurate > result. No, it isn't. It provides meaningful enough accuracy for many use cases. It is not about extreme limits on pathological cases. It's being reasonably accurate so that it doesn't deviate too much from expectations arising from given level of update frequency and concurrency. Christoph, I'm sorry but I really can't explain it any better and am out of this thread. If you still wanna proceed, you'll need to route the patches yourself. Please keep me cc'd. Thank you. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:31 ` Tejun Heo @ 2011-04-28 15:40 ` Tejun Heo 2011-04-28 15:47 ` Christoph Lameter 0 siblings, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-28 15:40 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, Apr 28, 2011 at 05:31:01PM +0200, Tejun Heo wrote: > > The hugely expensive _sum() is IMHO pretty useless given the above. It is > > a function that is called with the *hope* of getting a more accurate > > result. ... > Christoph, I'm sorry but I really can't explain it any better and am > out of this thread. If you still wanna proceed, you'll need to route > the patches yourself. Please keep me cc'd. Oh, another way to proceed would be, if _sum() is as useless as you suggested above, remove _sum() first and see how that goes. If that flies, there will be no reason to argue over anything, right? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:40 ` Tejun Heo @ 2011-04-28 15:47 ` Christoph Lameter 0 siblings, 0 replies; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 15:47 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Tejun Heo wrote: > On Thu, Apr 28, 2011 at 05:31:01PM +0200, Tejun Heo wrote: > > > The hugely expensive _sum() is IMHO pretty useless given the above. It is > > > a function that is called with the *hope* of getting a more accurate > > > result. > ... > > Christoph, I'm sorry but I really can't explain it any better and am > > out of this thread. If you still wanna proceed, you'll need to route > > the patches yourself. Please keep me cc'd. > > Oh, another way to proceed would be, if _sum() is as useless as you > suggested above, remove _sum() first and see how that goes. If that > flies, there will be no reason to argue over anything, right? There is no point in arguing anymore since you do not accept that the percpu counters do not have atomic_t consistency properties. The relaxing of the consistency requirements in order to increase performance was the intent when these schemes were created. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:22 ` Christoph Lameter 2011-04-28 15:31 ` Tejun Heo @ 2011-04-28 15:48 ` Eric Dumazet 2011-04-28 15:59 ` Eric Dumazet 1 sibling, 1 reply; 64+ messages in thread From: Eric Dumazet @ 2011-04-28 15:48 UTC (permalink / raw) To: Christoph Lameter Cc: Tejun Heo, Shaohua Li, akpm@linux-foundation.org, linux-mm@kvack.org On Thursday 28 April 2011 at 10:22 -0500, Christoph Lameter wrote: > On Thu, 28 Apr 2011, Tejun Heo wrote: > > > It seems like we can split hairs all day long about the similarities > > and differences with atomics, so let's forget about atomics for now. > > Ok good idea. So you accept that per cpu counters are inevitably fuzzy and > that the result cannot be relied upon in the same way as on atomics? > > > I don't like any update having possibility of causing @batch jumps in > > _sum() result. That severely limits the usefulness of hugely > > expensive _sum() and the ability to scale @batch. Not everything in > > the world is vmstat. Think about other _CURRENT_ use cases in > > filesystems. > > Of course you can cause jumps > batch. You are looping over 8 cpus and > have done cpu 1 and 2 already. Then cpu 1 adds batch-1 to the local per > cpu counter. cpu 2 increments its per cpu counter. When _sum returns we > have a deviation of batch. > > In the worst case the deviation can increase to (NR_CPUS - 1) * (batch - 1). > That is for the current code!!! > > The hugely expensive _sum() is IMHO pretty useless given the above. It is > a function that is called with the *hope* of getting a more accurate > result. > > The only way to reduce the fuzziness is by reducing batch. But then you > don't need the _sum function. > > Either that or you need additional external synchronization by the > subsystem that prevents updates. > > We could add a seqcount (shared), and increment it each time one cpu changes the global count. _sum() could get an additional parameter, the max number of allowed changes during the _sum() run. If _sum() notices seqcount was changed too much, restart the loop.

s64 __percpu_counter_sum(struct percpu_counter *fbc, unsigned maxfuzzy)
{
	s64 ret;
	unsigned int oldseq, newseq;
	int cpu;
restart:
	oldseq = fbc->seqcount;
	smp_rmb();
	ret = fbc->count;
	for_each_online_cpu(cpu) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
		ret += *pcount;
	}
	smp_rmb();
	newseq = fbc->count;
	if (newseq - oldseq >= maxfuzzy)
		goto restart;
	return ret;
}

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:48 ` Eric Dumazet @ 2011-04-28 15:59 ` Eric Dumazet 2011-04-28 16:17 ` Christoph Lameter 0 siblings, 1 reply; 64+ messages in thread From: Eric Dumazet @ 2011-04-28 15:59 UTC (permalink / raw) To: Christoph Lameter Cc: Tejun Heo, Shaohua Li, akpm@linux-foundation.org, linux-mm@kvack.org On Thursday 28 April 2011 at 17:48 +0200, Eric Dumazet wrote: > We could add a seqcount (shared), and increment it each time one cpu > changes the global count. > > _sum() could get an additional parameter, the max number of allowed > changes during the _sum() run. > > If _sum() notices seqcount was changed too much, restart the loop.
>
> s64 __percpu_counter_sum(struct percpu_counter *fbc, unsigned maxfuzzy)
> {
> 	s64 ret;
> 	unsigned int oldseq, newseq;
> 	int cpu;
> restart:
> 	oldseq = fbc->seqcount;
> 	smp_rmb();
> 	ret = fbc->count;
> 	for_each_online_cpu(cpu) {
> 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> 		ret += *pcount;
> 	}
> 	smp_rmb();
> 	newseq = fbc->count;

Sorry, it should be: newseq = fbc->seqcount;

> 	if (newseq - oldseq >= maxfuzzy)
> 		goto restart;
> 	return ret;
> }

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 15:59 ` Eric Dumazet @ 2011-04-28 16:17 ` Christoph Lameter 2011-04-28 16:35 ` Eric Dumazet 0 siblings, 1 reply; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 16:17 UTC (permalink / raw) To: Eric Dumazet Cc: Tejun Heo, Shaohua Li, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Eric Dumazet wrote: > > If _sum() notices seqcount was changed too much, restart the loop. This does not address the issue of cpus adding batch -1 while the loop is going on. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 16:17 ` Christoph Lameter @ 2011-04-28 16:35 ` Eric Dumazet 2011-04-28 16:52 ` Christoph Lameter 2011-04-29 8:32 ` Shaohua Li 0 siblings, 2 replies; 64+ messages in thread From: Eric Dumazet @ 2011-04-28 16:35 UTC (permalink / raw) To: Christoph Lameter Cc: Tejun Heo, Shaohua Li, akpm@linux-foundation.org, linux-mm@kvack.org On Thursday 28 April 2011 at 11:17 -0500, Christoph Lameter wrote: > On Thu, 28 Apr 2011, Eric Dumazet wrote: > > > > If _sum() notices seqcount was changed too much, restart the loop. > > This does not address the issue of cpus adding batch -1 while the > loop is going on. Yes, it does, I left the needed changes to the write side as an exercise ;) For example, based on current linux-2.6 code:

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (count >= batch || count <= -batch) {
		spin_lock(&fbc->lock);
		fbc->seqcount++;
		fbc->count += count;
		__this_cpu_write(*fbc->counters, 0);
		spin_unlock(&fbc->lock);
	} else {
		__this_cpu_write(*fbc->counters, count);
	}
	preempt_enable();
}

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
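Putting the two halves of the seqcount idea together (the _sum() posted earlier with its seqcount fix, plus the writer above), a compilable user-space model might look like the following. A pthread mutex stands in for the spinlock, plain arrays for the per-cpu data, and the memory barriers are omitted, so this only illustrates the bookkeeping, not the kernel code; maxfuzzy is the parameter Eric proposes and must be at least 1 here.

#include <pthread.h>
#include <stdint.h>

#define NCPUS 4

struct model_fbc {
	pthread_mutex_t lock;		/* stands in for fbc->lock */
	int64_t count;
	unsigned int seqcount;		/* bumped on every fold into count */
	int32_t counters[NCPUS];	/* stands in for the per-cpu counters */
	int32_t batch;
};

static struct model_fbc fbc = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.batch = 32,
};

static void model_add(struct model_fbc *f, int cpu, int64_t amount)
{
	int64_t count = f->counters[cpu] + amount;

	if (count >= f->batch || count <= -f->batch) {
		pthread_mutex_lock(&f->lock);
		f->seqcount++;
		f->count += count;
		f->counters[cpu] = 0;
		pthread_mutex_unlock(&f->lock);
	} else {
		f->counters[cpu] = (int32_t)count;
	}
}

static int64_t model_sum(struct model_fbc *f, unsigned int maxfuzzy)
{
	int64_t ret;
	unsigned int oldseq, newseq;

	do {
		oldseq = f->seqcount;
		ret = f->count;
		for (int cpu = 0; cpu < NCPUS; cpu++)
			ret += f->counters[cpu];
		newseq = f->seqcount;
		/* Retry if too many folds raced with the walk. */
	} while (newseq - oldseq >= maxfuzzy);

	return ret;
}

As the follow-up below points out, updates smaller than @batch never bump seqcount, so they remain invisible to the retry check.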
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 16:35 ` Eric Dumazet @ 2011-04-28 16:52 ` Christoph Lameter 2011-04-28 16:59 ` Eric Dumazet 2011-04-29 8:32 ` Shaohua Li 1 sibling, 1 reply; 64+ messages in thread From: Christoph Lameter @ 2011-04-28 16:52 UTC (permalink / raw) To: Eric Dumazet Cc: Tejun Heo, Shaohua Li, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, 28 Apr 2011, Eric Dumazet wrote: > On Thursday 28 April 2011 at 11:17 -0500, Christoph Lameter wrote: > > On Thu, 28 Apr 2011, Eric Dumazet wrote: > > > > > > If _sum() notices seqcount was changed too much, restart the loop. > > > > This does not address the issue of cpus adding batch -1 while the > > loop is going on. > > > Yes, it does, I left the needed changes to the write side as an exercise ;) > > > For example, based on current linux-2.6 code:
>
> void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> 	s64 count;
>
> 	preempt_disable();
> 	count = __this_cpu_read(*fbc->counters) + amount;
> 	if (count >= batch || count <= -batch) {
> 		spin_lock(&fbc->lock);
> 		fbc->seqcount++;
> 		fbc->count += count;
> 		__this_cpu_write(*fbc->counters, 0);
> 		spin_unlock(&fbc->lock);
> 	} else {
> 		__this_cpu_write(*fbc->counters, count);
> 	}
> 	preempt_enable();
> }

I can still add (batch - 1) without causing the seqcount to be incremented. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 16:52 ` Christoph Lameter @ 2011-04-28 16:59 ` Eric Dumazet 2011-04-29 8:52 ` Tejun Heo 0 siblings, 1 reply; 64+ messages in thread From: Eric Dumazet @ 2011-04-28 16:59 UTC (permalink / raw) To: Christoph Lameter Cc: Tejun Heo, Shaohua Li, akpm@linux-foundation.org, linux-mm@kvack.org On Thursday 28 April 2011 at 11:52 -0500, Christoph Lameter wrote: > I can still add (batch - 1) without causing the seqcount to be > incremented. It has always been like that, from the very beginning. The point was trying to remove the lock, and Tejun said it was going to increase fuzziness. I said: readers should specify what max fuzziness they allow. Most users don't care at all, but we can provide an API. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 16:59 ` Eric Dumazet @ 2011-04-29 8:52 ` Tejun Heo 0 siblings, 0 replies; 64+ messages in thread From: Tejun Heo @ 2011-04-29 8:52 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Lameter, Shaohua Li, akpm@linux-foundation.org, linux-mm@kvack.org On Thu, Apr 28, 2011 at 06:59:56PM +0200, Eric Dumazet wrote: > On Thursday 28 April 2011 at 11:52 -0500, Christoph Lameter wrote: > > > I can still add (batch - 1) without causing the seqcount to be > > incremented. > > It has always been like that, from the very beginning. This doesn't matter. At this level, the order of concurrent operations is not well defined. You might as well say "oh well, then the update happened after the sum is calculated". The problem I have with the interface is two-fold. 1. Is it even necessary? With concurrent updates, we don't and can't define a strict order of updates across multiple CPUs. If we sweep the counters without being interrupted (IRQ or, well, NMI), it should be and has been acceptable enough. 2. Let's say we need this. Then, @maxfuzzy. Few people are gonna understand it well and use it properly. Why can't you track the actual deviation introduced since the sum started instead of tracking the number of deviation events? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
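A rough illustration of what point 2 asks for, tracking the folded value rather than the number of folds, might look like the sketch below. The in_sum flag, the sum_deviation field and the maxdev argument are hypothetical additions made for the sake of the example, not existing percpu_counter members, and s64 tearing on 32-bit reads of fbc->count is ignored.

/* write side: inside the existing overflow fold, under fbc->lock */
if (fbc->in_sum)				/* hypothetical flag: a sum is in flight */
	fbc->sum_deviation += count;		/* track the value moved, not the event */

/* read side */
s64 percpu_counter_sum_bounded(struct percpu_counter *fbc, s64 maxdev)
{
	s64 ret, dev;
	int cpu;

retry:
	spin_lock(&fbc->lock);
	fbc->in_sum = true;
	fbc->sum_deviation = 0;
	spin_unlock(&fbc->lock);

	ret = fbc->count;
	for_each_online_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);

	spin_lock(&fbc->lock);
	fbc->in_sum = false;
	dev = fbc->sum_deviation;
	spin_unlock(&fbc->lock);

	/* folds moved more than the caller tolerates: retry */
	if (dev > maxdev || dev < -maxdev)
		goto retry;
	return ret;
}

The caller then bounds the error in counter units rather than in number of fold events, which is arguably easier to reason about than @maxfuzzy.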
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 16:35 ` Eric Dumazet 2011-04-28 16:52 ` Christoph Lameter @ 2011-04-29 8:32 ` Shaohua Li 1 sibling, 0 replies; 64+ messages in thread From: Shaohua Li @ 2011-04-29 8:32 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Lameter, Tejun Heo, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, 2011-04-29 at 00:35 +0800, Eric Dumazet wrote: > On Thursday 28 April 2011 at 11:17 -0500, Christoph Lameter wrote: > > On Thu, 28 Apr 2011, Eric Dumazet wrote: > > > > > > If _sum() notices seqcount was changed too much, restart the loop. > > > > This does not address the issue of CPUs adding (batch - 1) while the > > loop is going on. > > > Yes, it does; I left the needed changes to the write side as an exercise ;) > > > For example, based on current linux-2.6 code:

> void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> 	s64 count;
>
> 	preempt_disable();
> 	count = __this_cpu_read(*fbc->counters) + amount;
> 	if (count >= batch || count <= -batch) {
> 		spin_lock(&fbc->lock);
> 		fbc->seqcount++;
> 		fbc->count += count;
> 		__this_cpu_write(*fbc->counters, 0);
> 		spin_unlock(&fbc->lock);
> 	} else {
> 		__this_cpu_write(*fbc->counters, count);
> 	}
> 	preempt_enable();
> }

Yep, this can resolve Tejun's concern. The problem is how you determine maxfuzzy. This can mislead users too, because in the worst case the deviation is num_cpu*batch. If a user picks a small maxfuzzy and expects the max deviation to be maxfuzzy, he will actually get a bigger deviation in the worst case. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-28 10:09 ` Tejun Heo 2011-04-28 14:11 ` Christoph Lameter @ 2011-04-29 8:19 ` Shaohua Li 2011-04-29 8:44 ` Tejun Heo 1 sibling, 1 reply; 64+ messages in thread From: Shaohua Li @ 2011-04-29 8:19 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hi, On Thu, 2011-04-28 at 18:09 +0800, Tejun Heo wrote: > On Thu, Apr 28, 2011 at 11:28:04AM +0800, Shaohua Li wrote: > > > Okay, this communication failure isn't my fault. Please re-read what > > > I wrote before, my concern wasn't primarily about pathological worst > > > case - if that many concurrent updates are happening && the counter > > > needs to be accurate, it can't even use atomic counter. It should be > > > doing full exclusion around the counter and the associated operation > > > _together_. > > > > > > I'm worried about sporadic erratic behavior happening regardless of > > > update frequency and preemption would contribute but isn't necessary > > > for that to happen. > > > > Ok, I misunderstood the mail you sent to Christoph, sorry. So you have > > no problem about the atomic convert. I'll update the patch against base > > tree, given the preemptless patch has problem. > > Hmm... we're now more lost than ever. :-( Can you please re-read my > message two replies ago? The one where I talked about sporadic > erratic behaviors in length and why I was worried about it. > > In your last reply, you talked about preemption and that you didn't > have problems with disabling preemption, which, unfortunately, doesn't > have much to do with my concern with the sporadic erratic behaviors > and that's what I pointed out in my previous reply. So, it doesn't > feel like anything is resolved. OK, I got your point. I'd agree there are sporadic erratic behaviors, but I expect there is no problem here. We all agree the worst case is the same before/after the change. Any program should be able to handle the worst case, otherwise the program itself is buggy. Discussing a buggy program is meaningless. After the change, some behavior is changed, but the worst case isn't. So I don't think this is a big problem. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 8:19 ` Shaohua Li @ 2011-04-29 8:44 ` Tejun Heo 2011-04-29 14:02 ` Christoph Lameter 0 siblings, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-29 8:44 UTC (permalink / raw) To: Shaohua Li Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, Apr 29, 2011 at 04:19:31PM +0800, Shaohua Li wrote: > > In your last reply, you talked about preemption and that you didn't > > have problems with disabling preemption, which, unfortunately, doesn't > > have much to do with my concern with the sporadic erratic behaviors > > and that's what I pointed out in my previous reply. So, it doesn't > > feel like anything is resolved. > > OK, I got your point. I'd agree there are sporadic erratic behaviors, but > I expect there is no problem here. We all agree the worst case is the > same before/after the change. Any program should be able to handle the > worst case, otherwise the program itself is buggy. Discussing a buggy > program is meaningless. After the change, some behavior is changed, > but the worst case isn't. So I don't think this is a big problem. If you really think that, go ahead and remove _sum(), really. If you still can't see the difference between "reasonably accurate unless there's concurrent high frequency update" and "can jump on whatever", I can't help you. Worst case is important to consider but that's not the only criterion you base your decisions on. Think about it. It becomes the difference between "oh yeah, while my crazy concurrent FS benchmark is running, free block count is an estimate but otherwise it's pretty accurate" and "holy shit, it jumped while there's almost nothing going on the filesystem". It drastically limits both the usefulness of _sum() and thus the percpu counter and how much we can scale @batch on heavily loaded counters because it ends up directly affecting the accuracy of _sum(). Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 8:44 ` Tejun Heo @ 2011-04-29 14:02 ` Christoph Lameter 2011-04-29 14:03 ` Christoph Lameter 2011-04-29 14:18 ` Tejun Heo 0 siblings, 2 replies; 64+ messages in thread From: Christoph Lameter @ 2011-04-29 14:02 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, 29 Apr 2011, Tejun Heo wrote: > > OK, I got your point. I'd agree there are sporadic erratic behaviors, but > > I expect there is no problem here. We all agree the worst case is the > > same before/after the change. Any program should be able to handle the > > worst case, otherwise the program itself is buggy. Discussing a buggy > > program is meaningless. After the change, some behavior is changed, > > but the worst case isn't. So I don't think this is a big problem. > > If you really think that, go ahead and remove _sum(), really. If you > still can't see the difference between "reasonably accurate unless > there's concurrent high frequency update" and "can jump on whatever", > I can't help you. Worst case is important to consider but that's not > the only criterion you base your decisions on. We agree, it seems, that _sum is a pretty attempt to be more accurate but does not really get you total accuracy (atomic_t). Our experiences with adding a _sum > Think about it. It becomes the difference between "oh yeah, while my > crazy concurrent FS benchmark is running, free block count is an > estimate but otherwise it's pretty accurate" and "holy shit, it jumped > while there's almost nothing going on the filesystem". It drastically > limits both the usefulness of _sum() and thus the percpu counter and > how much we can scale @batch on heavily loaded counters because it > ends up directly affecting the accuracy of _sum(). The free block count is always an estimate if it only uses percpu_counter without other serialization. "Pretty accurate" is saying you feel good about it. In fact the potential deviations are not that much different. The problem with quietness arises because per_cpu_counter_add does not have the time boundary that f.e. the VM statistics have. Those fold the differentials into the global sum every second or so. It would be better to replace _sum() with a loop that adds the differential and then zaps it. With the cmpxchg solution this is possible. Could we do the handling here in the same way that the vm stat per-cpu counters are done: 1. Bound the differential by time and size (we already have a batch notion here but we need to add a time boundary. Run a function over all counters every second or so that sums up the diffs). 2. Use lockless operations for differentials and atomics for globals to make it scale well. If someone wants more accuracy then we need the ability to dynamically set the batch limit similar to what the vm statistics do. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
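Point 2 above (lockless differentials, atomics for the globals) could be sketched roughly as follows. This is not code proposed in the thread: it assumes the global count were turned into an atomic64_t and it ignores overflow of the s32 per-cpu slot; this_cpu_add_return() and this_cpu_xchg() are existing percpu operations.

struct percpu_counter_sketch {
	atomic64_t count;		/* global part, no spinlock */
	s32 __percpu *counters;		/* per-cpu differentials */
};

void percpu_counter_add_sketch(struct percpu_counter_sketch *fbc, s32 amount, s32 batch)
{
	s64 count;

	count = this_cpu_add_return(*fbc->counters, amount);
	if (count >= batch || count <= -batch)
		/* fold the local differential into the global atomic */
		atomic64_add(this_cpu_xchg(*fbc->counters, 0), &fbc->count);
}

The remaining cost is that the per-cpu residue is still only bounded by batch, which is exactly what the time-based folding in point 1 is meant to address.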
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 14:02 ` Christoph Lameter @ 2011-04-29 14:03 ` Christoph Lameter 0 siblings, 0 replies; 64+ messages in thread From: Christoph Lameter @ 2011-04-29 14:03 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, 29 Apr 2011, Christoph Lameter wrote: > If someone wants more accuracy then we need the ability to dynamically set > the batch limit similar to what the vm statistics do. Forgot the key point: After these measures it should be possible to remove _sum. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 14:02 ` Christoph Lameter 2011-04-29 14:03 ` Christoph Lameter @ 2011-04-29 14:18 ` Tejun Heo 2011-04-29 14:25 ` Christoph Lameter 1 sibling, 1 reply; 64+ messages in thread From: Tejun Heo @ 2011-04-29 14:18 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, Apr 29, 2011 at 09:02:23AM -0500, Christoph Lameter wrote: > On Fri, 29 Apr 2011, Tejun Heo wrote: > We agree it seems that _sum is a pretty attempt to be more accurate but > not really gets you total accuracy (atomic_t). Our experiences with adding > a _sum With concurrent updates, what is total accuracy anyway? The order of operations isn't defined. Sure, atomic_t is serialized and in that sense it might be totally accurate but without outer synchronization the concurrent events it is counting aren't ordered. Let's forget about total accuracy. I don't care and I don't think anyone should care, but I do want reasonable output without out-of-place hiccups. > free block count is always an estimate if it only uses percpu_counter > without other serialization. "Pretty accurate" is saying you feel good > about it. In fact the potential deviations are not that much different. Well, I feel good for a reason - it doesn't hiccup on a single touch(1) going on somewhere. > 1. Bound the differential by time and size (we already have a batch notion > here but we need to add a time boundary. Run a function over all counters > every second or so that sums up the diffs). > > 2. Use lockless operations for differentials and atomics for globals to > make it scale well. > > If someone wants more accuracy then we need the ability to dynamically set > the batch limit similar to what the vm statistics do. So, if you can remove _sum() by doing the above without introducing excessive complexity or penalizing use cases which might not have too much commonality with vmstat, by all means, but please pay attention to the current users. Actually take a look at them. Also, if someone is gonna put considerable amount of effort into it, please also invest some time into showing actual performance benefits, or at least possibility of actual benefits. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 14:18 ` Tejun Heo @ 2011-04-29 14:25 ` Christoph Lameter 2011-04-29 14:43 ` Tejun Heo 0 siblings, 1 reply; 64+ messages in thread From: Christoph Lameter @ 2011-04-29 14:25 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, 29 Apr 2011, Tejun Heo wrote: > > If someone wants more accuracy then we need the ability to dynamically set > > the batch limit similar to what the vm statistics do. > > So, if you can remove _sum() by doing the above without introducing > excessive complexity or penalizing use cases which might not have too > much commonality with vmstat, by all means, but please pay attention > to the current users. Actually take a look at them. I am content to be maintaining the vm statistics.... But Shaohua may want to have a look at it? > Also, if someone is gonna put considerable amount of effort into it, > please also invest some time into showing actual performance benefits, > or at least possibility of actual benefits. Yes. I think Shaohua already has some tests that could use some improving. So we have kind of an agreement on how to move forward? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 14:25 ` Christoph Lameter @ 2011-04-29 14:43 ` Tejun Heo 2011-04-29 14:55 ` Christoph Lameter 2011-05-05 4:08 ` Shaohua Li 0 siblings, 2 replies; 64+ messages in thread From: Tejun Heo @ 2011-04-29 14:43 UTC (permalink / raw) To: Christoph Lameter Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org Hello, On Fri, Apr 29, 2011 at 09:25:09AM -0500, Christoph Lameter wrote: > On Fri, 29 Apr 2011, Tejun Heo wrote: > > > > If someone wants more accuracy then we need the ability to dynamically set > > > the batch limit similar to what the vm statistics do. > > > > So, if you can remove _sum() by doing the above without introducing > > excessive complexity or penalizing use cases which might not have too > > much commonality with vmstat, by all means, but please pay attention > > to the current users. Actually take a look at them. > > I am content to be maintaining the vm statistics.... But Shaohua may want > to have a look at it? It would be nice if vmstat can be merged with percpu counter tho so that the flushing can be done together. If we do such piggybacking, the flushing overhead becomes much easier to justify. How does vmstat collect the percpu counters? Does one CPU visit all of them, or does each CPU flush its local counter to the global one periodically? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 14:43 ` Tejun Heo @ 2011-04-29 14:55 ` Christoph Lameter 2011-05-05 4:08 ` Shaohua Li 0 siblings, 0 replies; 64+ messages in thread From: Christoph Lameter @ 2011-04-29 14:55 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, 29 Apr 2011, Tejun Heo wrote: > > I am content to be maintaining the vm statistics.... But Shaohua may want > > to have a look at it? > > It would be nice if vmstat can be merged with percpu counter tho so > that the flushing can be done together. If we do such piggybacking, the > flushing overhead becomes much easier to justify. Right. We could put the folding of the percpu counter diffs into the vmstats function. See vmstat.c:refresh_cpu_vm_stats(). They would be folded with the same logic as the VM stats. The percpu counters are a linked list though, and therefore it's expensive to scan that list. Maybe we can convert that to a vmalloced table? > How does vmstat collect the percpu counters? Does one CPU visit all > of them, or does each CPU flush its local counter to the global one > periodically? Each cpu folds its differentials into the per-zone and the global counters every N seconds. The interval is configurable via /proc/sys/vm/stat_interval. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
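A rough sketch of such a fold, run from each CPU in the spirit of refresh_cpu_vm_stats(), might look as follows. The percpu_counters list and percpu_counters_lock do exist in lib/percpu_counter.c, but only under CONFIG_HOTPLUG_CPU; making them unconditional and hooking this into the periodic vmstat update path are assumptions made for illustration.

/* fold this CPU's percpu_counter differentials into the globals, vmstat style */
static void fold_local_percpu_counters(void)
{
	struct percpu_counter *fbc;

	mutex_lock(&percpu_counters_lock);
	list_for_each_entry(fbc, &percpu_counters, list) {
		/* atomically grab and clear this CPU's differential */
		s32 diff = this_cpu_xchg(*fbc->counters, 0);

		if (diff) {
			spin_lock(&fbc->lock);
			fbc->count += diff;
			spin_unlock(&fbc->lock);
		}
	}
	mutex_unlock(&percpu_counters_lock);
}

Swapping the list for a vmalloced table, as suggested above, would only change the iteration; the folding itself stays the same.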
* Re: [PATCH] percpu: preemptless __per_cpu_counter_add 2011-04-29 14:43 ` Tejun Heo 2011-04-29 14:55 ` Christoph Lameter @ 2011-05-05 4:08 ` Shaohua Li 1 sibling, 0 replies; 64+ messages in thread From: Shaohua Li @ 2011-05-05 4:08 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, Eric Dumazet, akpm@linux-foundation.org, linux-mm@kvack.org On Fri, 2011-04-29 at 22:43 +0800, Tejun Heo wrote: > Hello, > > On Fri, Apr 29, 2011 at 09:25:09AM -0500, Christoph Lameter wrote: > > On Fri, 29 Apr 2011, Tejun Heo wrote: > > > > > > If someone wants more accuracy then we need the ability to dynamically set > > > > the batch limit similar to what the vm statistics do. > > > > > > So, if you can remove _sum() by doing the above without introducing > > > excessive complexity or penalizing use cases which might not have too > > > much commonality with vmstat, by all means, but please pay attention > > > to the current users. Actually take a look at them. > > > > I am content to be maintaining the vm statistics.... But Shaohua may want > > to have a look at it? > > It would be nice if vmstat can be merged with percpu counter tho so > that the flushing can be done together. If we do such piggybacking, the > flushing overhead becomes much easier to justify. > > How does vmstat collect the percpu counters? Does one CPU visit all > of them, or does each CPU flush its local counter to the global one > periodically? I thought we could use lglock-like locking to address the issue. Each CPU holds its own lock to do an update; _sum holds all CPUs' locks to get a consistent result. This will slow down _sum a little bit, but I assume this doesn't matter. I haven't checked if we can still make the fast path preemptless, but we use atomics now. This can still give me a similar performance boost to the previous implementation. Thanks, Shaohua -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 64+ messages in thread
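A very rough sketch of that lglock-style scheme might look like this; the per-cpu lock array and the atomic64_t global part are assumptions made for illustration (struct percpu_counter has neither), and, as noted above, this fast path still disables preemption.

struct pcpu_locked_counter {
	atomic64_t count;			/* global part */
	spinlock_t __percpu *locks;		/* hypothetical per-cpu locks */
	s32 __percpu *counters;			/* per-cpu differentials */
};

void pcpu_locked_add(struct pcpu_locked_counter *c, s64 amount, s32 batch)
{
	s64 count;

	preempt_disable();
	spin_lock(this_cpu_ptr(c->locks));
	count = __this_cpu_read(*c->counters) + amount;
	if (count >= batch || count <= -batch) {
		atomic64_add(count, &c->count);
		__this_cpu_write(*c->counters, 0);
	} else {
		__this_cpu_write(*c->counters, count);
	}
	spin_unlock(this_cpu_ptr(c->locks));
	preempt_enable();
}

s64 pcpu_locked_sum(struct pcpu_locked_counter *c)
{
	s64 ret;
	int cpu;

	/* take every CPU's lock so no fold can race with the walk */
	for_each_possible_cpu(cpu)
		spin_lock(per_cpu_ptr(c->locks, cpu));

	ret = atomic64_read(&c->count);
	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(c->counters, cpu);

	for_each_possible_cpu(cpu)
		spin_unlock(per_cpu_ptr(c->locks, cpu));

	return ret;
}

Taking nr_cpu_ids locks makes the read side expensive and would need lockdep nesting annotations, which is presumably the "slow down _sum a little bit" cost Shaohua is willing to pay.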