From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: Re: [PATCH RFC] mm: Implement balance_dirty_pages() through waiting for flusher thread
Date: Tue, 22 Jun 2010 21:52:34 +0800
Message-ID: <20100622135234.GA11561@localhost>
References: <1276797878-28893-1-git-send-email-jack@suse.cz>
 <20100618060901.GA6590@dastard>
 <20100621233628.GL3828@quack.suse.cz>
 <20100622054409.GP7869@dastard>
 <20100621231416.904c50c7.akpm@linux-foundation.org>
 <20100622100924.GQ7869@dastard>
 <20100622131745.GB3338@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Dave Chinner, Andrew Morton, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, hch@infradead.org, peterz@infradead.org
To: Jan Kara
Return-path:
Received: from mga09.intel.com ([134.134.136.24]:55717 "EHLO mga09.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1752584Ab0FVNwm (ORCPT ); Tue, 22 Jun 2010 09:52:42 -0400
Content-Disposition: inline
In-Reply-To: <20100622131745.GB3338@quack.suse.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

> On the other hand I think we will have to come up with something
> more clever than what I do now because for some huge machines with
> nr_cpu_ids == 256, the error of the counter is 256*9*8 = 18432 so that's
> already unacceptable given the amounts we want to check (like 1536) -
> already for nr_cpu_ids == 32, the error is the same as the difference we
> want to check. I think we'll have to come up with some scheme whose error
> is not dependent on the number of cpus or if it is dependent, it's only a
> weak dependency (like a logarithm or so).
> Or we could rely on the fact that IO completions for a bdi won't happen on
> all CPUs and thus the error would be much more bounded. But I'm not sure
> how much that is true or not.

Yes, the per-CPU counter seems tricky. How about plain atomic operations?

The test below shows that atomic_dec_and_test() is about 4.2 times slower
than a plain i-- on a 4-core CPU (roughly 904 ms versus 215 ms of task
clock for 100 million iterations). Not bad.
Note that

1) we can avoid the atomic operations when there are no active waiters

2) most writeback will be submitted by one per-bdi flusher, so cache
   bouncing should not be a concern (this also means the per-CPU counter
   error is normally bounded by the batch size)

3) based on 2), the cost of atomic inc/dec will be weakly related to the
   number of cores but never to the number of sockets, so it should not
   scale too badly

Thanks,
Fengguang
---

$ perf stat ./atomic

 Performance counter stats for './atomic':

     903.875304  task-clock-msecs         #      0.998 CPUs
             76  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
             98  page-faults              #      0.000 M/sec
     3011186459  cycles                   #   3331.418 M/sec
     1608926490  instructions             #      0.534 IPC
      301481656  branches                 #    333.543 M/sec
          94932  branch-misses            #      0.031 %
          88687  cache-references         #      0.098 M/sec
           1286  cache-misses             #      0.001 M/sec

    0.905576197  seconds time elapsed

$ perf stat ./non-atomic

 Performance counter stats for './non-atomic':

     215.315814  task-clock-msecs         #      0.996 CPUs
             18  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
             99  page-faults              #      0.000 M/sec
      704358635  cycles                   #   3271.281 M/sec
      303445790  instructions             #      0.431 IPC
      100574889  branches                 #    467.104 M/sec
          39323  branch-misses            #      0.039 %
          36064  cache-references         #      0.167 M/sec
            850  cache-misses             #      0.004 M/sec

    0.216175521  seconds time elapsed

--------------------------------------------------------------------------------
$ cat atomic.c
typedef struct {
	int counter;
} atomic_t;

/* Atomically decrement v->counter; return nonzero if the result is zero. */
static inline int atomic_dec_and_test(atomic_t *v)
{
	unsigned char c;

	asm volatile("lock; decl %0; sete %1"
		     : "+m" (v->counter), "=qm" (c)
		     : : "memory");
	return c != 0;
}

int main(void)
{
	atomic_t i;

	i.counter = 100000000;

	/* 100 million locked decrements */
	for (; !atomic_dec_and_test(&i);)
		;

	return 0;
}

--------------------------------------------------------------------------------
$ cat non-atomic.c
int main(void)
{
	int i;

	/* 100 million plain decrements */
	for (i = 100000000; i; i--)
		;

	return 0;
}