From: Peter Zijlstra
Subject: Re: [PATCH] percpu_counter: Fix __percpu_counter_sum()
Date: Tue, 09 Dec 2008 09:34:13 +0100
Message-ID: <1228811653.6809.26.camel@twins>
In-Reply-To: <493E2884.6010600@cosmosbay.com>
References: <4936D287.6090206@cosmosbay.com> <4936EB04.8000609@cosmosbay.com>
 <20081206202233.3b74febc.akpm@linux-foundation.org> <493BCF60.1080409@cosmosbay.com>
 <20081207092854.f6bcbfae.akpm@linux-foundation.org> <493C0F40.7040304@cosmosbay.com>
 <20081207205250.dbb7fe4b.akpm@linux-foundation.org> <20081208221241.GA2501@mit.edu>
 <1228774836.16244.22.camel@lappy.programming.kicks-ass.net> <20081208230047.GC2501@mit.edu>
 <1228777500.12729.4.camel@twins> <493E2884.6010600@cosmosbay.com>
To: Eric Dumazet
Cc: Theodore Tso, Andrew Morton, linux kernel, "David S. Miller",
 Mingming Cao, linux-ext4@vger.kernel.org

On Tue, 2008-12-09 at 09:12 +0100, Eric Dumazet wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-12-08 at 18:00 -0500, Theodore Tso wrote:
> >> On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
> >>> atomic_t is pretty good on all archs, but you get to keep the cacheline
> >>> ping-pong.
> >>>
> >> Stupid question --- if you're worried about cacheline ping-pongs, why
> >> isn't each cpu's delta counter cacheline aligned?  With a 64-byte
> >> cache line and 32-bit counter entries, with fewer than 16 CPUs we're
> >> going to be getting cache ping-pong effects with percpu_counters,
> >> right?  Or am I missing something?
> >
> > Sorta - a new per-cpu allocator is in the works, but we do cacheline
> > align the per-cpu allocations (or used to); also, the allocations are
> > node affine.
> >
>
> I did work on a 'light weight percpu counter', aka percpu_lcounter, for
> all metrics that don't need a 64-bit wide counter, but just a plain 'long'
> (network, nr_files, nr_dentry, nr_inodes, ...):
>
> struct percpu_lcounter {
>         atomic_long_t count;
> #ifdef CONFIG_SMP
> #ifdef CONFIG_HOTPLUG_CPU
>         struct list_head list;  /* All percpu_counters are on a list */
> #endif
>         long *counters;
> #endif
> };
>
> (No more spinlock.)
>
> Then I tried to have atomic_t (or atomic_long_t) for 'counters', but got a
> 10% slowdown of __percpu_lcounter_add(), even when never hitting the
> 'slow path': atomic_long_add_return() is really expensive, even on a
> non-contended cache line.
>
> struct percpu_lcounter {
>         atomic_long_t count;
> #ifdef CONFIG_SMP
> #ifdef CONFIG_HOTPLUG_CPU
>         struct list_head list;  /* All percpu_counters are on a list */
> #endif
>         atomic_long_t *counters;
> #endif
> };
>
> So I believe a percpu_lcounter_sum() that tries to reset all cpu-local
> counts to 0 would be really too expensive, if it slows down _add() so much.
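Just to put a rough number behind "really expensive": a minimal userspace
toy along the lines below (an illustrative sketch only, not the kernel
path; the cost of a LOCKed RMW varies with the micro-architecture, and the
file name, iteration count and build line are arbitrary) typically shows an
uncontended atomic add costing several times a plain add, which fits the
~10% hit on the whole _add() fast path that Eric measured.

/*
 * Userspace toy, not kernel code: compare a plain add with a LOCKed
 * atomic add on an uncontended, thread-private counter.
 * Build with: gcc -O2 -o lcount lcount.c   (add -lrt on older glibc)
 */
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static volatile long plain_counter;     /* volatile so -O2 keeps the loop */
static long atomic_counter;

static double now(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        unsigned long i;
        double t0, t1, t2;

        t0 = now();
        for (i = 0; i < ITERS; i++)
                plain_counter += 1;                       /* plain load/add/store */
        t1 = now();
        for (i = 0; i < ITERS; i++)
                __sync_fetch_and_add(&atomic_counter, 1); /* LOCK XADD on x86 */
        t2 = now();

        printf("plain  add: %.2f ns/op\n", (t1 - t0) * 1e9 / ITERS);
        printf("atomic add: %.2f ns/op\n", (t2 - t1) * 1e9 / ITERS);
        return 0;
}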
>
> long percpu_lcounter_sum(struct percpu_lcounter *fblc)
> {
>         long acc = 0;
>         int cpu;
>
>         for_each_online_cpu(cpu)
>                 acc += atomic_long_xchg(per_cpu_ptr(fblc->counters, cpu), 0);
>         return atomic_long_add_return(acc, &fblc->count);
> }
>
> void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
> {
>         long count;
>         atomic_long_t *pcount;
>
>         pcount = per_cpu_ptr(flbc->counters, get_cpu());
>         count = atomic_long_add_return(amount, pcount); /* way too expensive !!! */

Yeah, it's an extra LOCKed instruction where there wasn't one before.

>         if (unlikely(count >= batch || count <= -batch)) {
>                 atomic_long_add(count, &flbc->count);
>                 atomic_long_sub(count, pcount);

Also, these are two LOCKed instructions where, with the spinlock, you'd
likely only have one.

So yes, having the per-cpu variable be an atomic seems like a way too
expensive idea.

That xchg-based _sum is cool though.

>         }
>         put_cpu();
> }
>
> Just forget about it, let percpu_lcounter_sum() only read the values, and
> let percpu_lcounter_add() not use atomic ops in the fast path:
>
> void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
> {
>         long count;
>         long *pcount;
>
>         pcount = per_cpu_ptr(flbc->counters, get_cpu());
>         count = *pcount + amount;
>         if (unlikely(count >= batch || count <= -batch)) {
>                 atomic_long_add(count, &flbc->count);
>                 count = 0;
>         }
>         *pcount = count;
>         put_cpu();
> }
> EXPORT_SYMBOL(__percpu_lcounter_add);
>
>
> Also, with the upcoming NR_CPUS=4096, it may be time to design a hierarchical
> percpu_counter, to avoid hitting the one shared "fbc->count" every time a
> local counter overflows.

So we'd normally write to the shared cacheline every cpus/batch.
Cascading this you'd get ln(cpus)/(batch^ln(cpus)) or something like
that, right?

Won't just increasing batch give the same result - or are we going to
play funny games with the topology information?
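For concreteness, a two-level cascade could look something like the sketch
below (completely untested, modeled on Eric's __percpu_lcounter_add() above;
the percpu_lcounter2 name, the node_count array and the cpu_to_node()
grouping are invented here for illustration, and allocation, hotplug and the
choice of the second-level batch are all hand-waved):

/*
 * Untested two-level sketch: per-cpu deltas spill into a per-node
 * counter, and only the per-node counter spills into the globally
 * shared count.
 */
struct percpu_lcounter2 {
        atomic_long_t count;            /* global approximate value */
        atomic_long_t *node_count;      /* one entry per NUMA node */
        long *counters;                 /* per-cpu deltas, as before */
};

void __percpu_lcounter2_add(struct percpu_lcounter2 *flbc, long amount,
                            s32 batch)
{
        int cpu = get_cpu();
        long *pcount = per_cpu_ptr(flbc->counters, cpu);
        long count = *pcount + amount;

        if (unlikely(count >= batch || count <= -batch)) {
                atomic_long_t *ncount = &flbc->node_count[cpu_to_node(cpu)];
                long nbatch = (long)batch * 8;  /* arbitrary 2nd-level batch */
                long nval = atomic_long_add_return(count, ncount);

                if (nval >= nbatch || nval <= -nbatch) {
                        /* Fold the node counter into the global counter. */
                        atomic_long_add(nval, &flbc->count);
                        atomic_long_sub(nval, ncount);
                }
                count = 0;
        }
        *pcount = count;
        put_cpu();
}

A cpu then dirties its node's cacheline only once per 'batch' local
increments, and the globally shared line only once its node has accumulated
'nbatch' worth of deltas; the price is a larger worst-case error in the
approximate count (roughly nr_cpus*batch + nr_nodes*nbatch instead of
nr_cpus*batch), which is the trade-off behind the batch-vs-topology
question above.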