From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
  id S1757198Ab2IXQzI (ORCPT ); Mon, 24 Sep 2012 12:55:08 -0400
Received: from merlin.infradead.org ([205.233.59.134]:57707 "EHLO
  merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
  with ESMTP id S1756692Ab2IXQzF convert rfc822-to-8bit (ORCPT );
  Mon, 24 Sep 2012 12:55:05 -0400
Message-ID: <1348505683.11847.111.camel@twins>
Subject: Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to
  3.6-rc5 on AMD chipsets - bisected
From: Peter Zijlstra
To: Linus Torvalds
Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
  linux-kernel@vger.kernel.org, Andreas Herrmann, Andrew Morton,
  Thomas Gleixner, Ingo Molnar, Suresh Siddha
Date: Mon, 24 Sep 2012 18:54:43 +0200
In-Reply-To:
References: <20120914212717.GA29307@liondog.tnic>
  <20120924150048.GB11266@suse.de>
  <1348500647.11847.69.camel@twins>
  <1348503163.11847.97.camel@twins>
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Mailer: Evolution 3.2.2-
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2012-09-24 at 09:30 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 9:12 AM, Peter Zijlstra wrote:
> >
> > So we're looking for an idle cpu around @target. We prefer a cpu of an
> > idle core, since SMT-siblings share L[12] cache. The way we do this is
> > by iterating the topology tree downwards starting at the LLC (L3) cache
> > level. Its groups are either the SMT-siblings or singleton groups.
>
> So if it's really guaranteed to be SMT-siblings or singleton groups, then
> the whole "for_each_cpu()" is a total disaster. That's a truly
> expensive way to look up adjacent CPU's. Is there no saner way to look
> up that thing? Like a simple circular list of SMT siblings (I realize
> that on x86 that list is either one or two, but other SMT
> implementations are groups of four or more).

SMT siblings aren't actually adjacent in the cpu number space (on x86 at
least).

So the alternative you suggest is pointer chasing a list, is that really
much better than scanning a mostly empty bitmap? I've no idea how bad
these bitmap scanning instructions are on modern chips.

But let me try and come up with the list thing, I think we've actually
got that someplace as well.

> So I suspect your patch largely makes things faster (avoid those
> insane cpumask operations), but the for_each_cpu() one is still an
> absolutely horrible way to find a couple of basically statically known
> (modulo hotplug, which is disabled here anyway) CPU's. So even if the
> algorithm makes sense at some higher level, it doesn't really seem to
> make sense from an implementation standpoint.

Agreed.

> Also, do we really want to spread things out that aggressively?
> How/why do we know that we don't want to share L2 caches, for example?
> It sounds like a bad idea from a power standpoint, and possibly
> performance too.

IIRC this current stuff is the result of Mike and Suresh running a few
benchmarks..

Mike, Suresh, either one of you remember this?

Otherwise I'll have to go trawl the archives.
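
For illustration only, here is a minimal userspace sketch of the two
lookups being compared above: scanning a mostly empty sibling bitmap
versus chasing a circular list of SMT siblings. Everything in it is an
assumption made for the example, not kernel code: the names cpu_idle,
smt_next, find_idle_by_mask() and find_idle_by_list() are hypothetical,
the bitmap is a single 64-bit word rather than a struct cpumask, and
__builtin_ctzll() is a GCC/Clang builtin standing in for whatever
bit-scan the architecture provides.

```c
#include <stdint.h>

#define NR_CPUS 64	/* hypothetical: keeps the mask to one 64-bit word */

/* Hypothetical per-cpu state; these are not the kernel's data structures. */
static int cpu_idle[NR_CPUS];	/* non-zero if the cpu is idle */
static int smt_next[NR_CPUS];	/* circular list: next SMT sibling of each cpu */

/* Make every cpu busy and a singleton group (its own only sibling). */
static void init_singletons(void)
{
	for (int i = 0; i < NR_CPUS; i++) {
		cpu_idle[i] = 0;
		smt_next[i] = i;
	}
}

/*
 * Approach 1: scan the sibling bitmap, in the spirit of for_each_cpu().
 * Returns the lowest-numbered idle cpu in the mask, or -1 if none is idle.
 */
static int find_idle_by_mask(uint64_t sibling_mask)
{
	while (sibling_mask) {
		int cpu = __builtin_ctzll(sibling_mask);	/* lowest set bit */

		if (cpu_idle[cpu])
			return cpu;
		sibling_mask &= sibling_mask - 1;	/* clear that bit */
	}
	return -1;
}

/*
 * Approach 2: walk the circular sibling list starting at @target.
 * Returns the first idle cpu found (possibly @target itself), or -1.
 */
static int find_idle_by_list(int target)
{
	int cpu = target;

	do {
		if (cpu_idle[cpu])
			return cpu;
		cpu = smt_next[cpu];
	} while (cpu != target);

	return -1;
}
```

With two siblings that are far apart in the cpu number space (say cpus
2 and 34, numbers picked arbitrarily), the list walk touches exactly two
entries, while the bitmap scan costs one bit-scan per set bit in this
single-word case. A real cpumask spanning several words would also have
to step over the empty words, which this sketch does not model, so it
says nothing about which approach wins on real hardware.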