Subject: Re: find_busiest_group using lots of CPU
From: Peter Zijlstra
To: Jens Axboe
Cc: Linux Kernel <linux-kernel@vger.kernel.org>, mingo@elte.hu
Date: Mon, 05 Oct 2009 14:31:38 +0200
Message-Id: <1254745898.26976.52.camel@twins>
In-Reply-To: <20090930081811.GP23126@kernel.dk>
References: <20090930081811.GP23126@kernel.dk>

On Wed, 2009-09-30 at 10:18 +0200, Jens Axboe wrote:
> Hi,
>
> I stuffed a few more SSDs into my test box. Running a simple workload
> that just does streaming reads from 10 processes (throughput is around
> 2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This
> is a 64 thread box.
>
> The top two profile entries are:
>
>     10.86%  fio  [kernel]  [k] find_busiest_group
>             |
>             |--99.91%-- thread_return
>             |           io_schedule
>             |           sys_io_getevents
>             |           system_call_fastpath
>             |           0x7f4b50b61604
>             |           |
>             |           --100.00%-- td_io_getevents
>             |                       io_u_queued_complete
>             |                       thread_main
>             |                       run_threads
>             |                       main
>             |                       __libc_start_main
>             --0.09%-- [...]
>
>      5.78%  fio  [kernel]  [k] cpumask_next_and
>             |
>             |--67.21%-- thread_return
>             |           io_schedule
>             |           sys_io_getevents
>             |           system_call_fastpath
>             |           0x7f4b50b61604
>             |           |
>             |           --100.00%-- td_io_getevents
>             |                       io_u_queued_complete
>             |                       thread_main
>             |                       run_threads
>             |                       main
>             |                       __libc_start_main
>             |
>             --32.79%-- find_busiest_group
>                        thread_return
>                        io_schedule
>                        sys_io_getevents
>                        system_call_fastpath
>                        0x7f4b50b61604
>                        |
>                        --100.00%-- td_io_getevents
>                                    io_u_queued_complete
>                                    thread_main
>                                    run_threads
>                                    main
>                                    __libc_start_main
>
> This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled. I just tried with
> both disabled, but that yields the same result (well, actually worse:
> 22% spent in there, dunno if that's normal "fluctuation"). GROUP_SCHED
> is not set. This seems way excessive!

io_schedule() straight into find_busiest_group() leads me to think this
could be SD_BALANCE_NEWIDLE. Does something like:

  for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
      val=`cat $i`
      echo $((val & ~0x02)) > $i
  done

[ assuming SCHED_DEBUG=y ]

cure things? If so, then it's spending time looking for work, which
there might not be on your machine, since everything is waiting for IO
or some such.

Not really sure what to do about it, though. This is a quad socket
Nehalem, right? We could possibly disable SD_BALANCE_NEWIDLE on the
NODE level, but that would again decrease throughput in things like
kbuild.
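
For completeness, a slightly expanded sketch of the flag-clearing loop
above. It is only an illustration under the same assumptions
(SCHED_DEBUG=y exposing /proc/sys/kernel/sched_domain, and
SD_BALANCE_NEWIDLE being bit 0x02); the scratch file name is arbitrary.
It records each domain's original flags so the change can be undone:

  #!/bin/sh
  # Clear SD_BALANCE_NEWIDLE (0x02) on all scheduling domains, saving
  # the original flag values so they can be restored later.
  # Requires SCHED_DEBUG=y so /proc/sys/kernel/sched_domain exists.

  SAVE=/tmp/sched_domain_flags.save      # scratch file, any path works
  : > $SAVE

  for f in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
          val=`cat $f`
          echo "$f $val" >> $SAVE        # remember the old value
          echo $((val & ~0x02)) > $f     # drop SD_BALANCE_NEWIDLE
  done

  # To put the flags back afterwards:
  #   while read f val; do echo $val > $f; done < $SAVE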