Subject: Re: [PATCH] a patch to fix the cpu-offline-online problem caused by pm_idle
From: Peter Zijlstra
To: Luming Yu
Cc: LKML, Len Brown, "H. Peter Anvin", tglx
References: <1295894492.28776.470.camel@laptop> <1295946736.28776.479.camel@laptop> <1296210619.15234.263.camel@laptop> <1296405366.2274.60.camel@twins>
Date: Mon, 31 Jan 2011 11:16:46 +0100
Message-ID: <1296469006.15234.359.camel@laptop>

On Sun, 2011-01-30 at 22:26 -0500, Luming Yu wrote:
> > Guessing is totally the wrong thing when you're sending stuff upstream,
> > esp ugly patches such as this. .32 is more than a year old, anything
> > could have happened.
>
> Ok. the default upstream kernel seems to have NMI watchdog disabled?

Then enable it already, its a whole CONFIG option away..

> It's not working because of NMI watchdog. If you ignore NMI watchdog,
> then I guess it works but just slow..

Don't guess, test it dammit. And then figure out why it triggers, I
haven't seen _anything_ that would cause it to trigger, nor a sane
explanation for your patch.

> > Ok, so one IPI costs 50-100 us, even with 64 cpu, that's at most 6.4ms
> > nowhere near enough to trigger the NMI watchdog. So what does go wrong?
>
> Good question!
> But we also can't forget there were large latency from C3.

Not 60+ seconds large I hope, I know NHM-EX has some suckage, but surely
not that bad?

> And I guess some reschedule ticks get lost to kick some CPUs out of
> idle due to the side effects of the CPU PM feature. if use nohz=off,
> everything seems to just work.
> Yes, I agree we need to dig it out either.
> But it's kind of combination problem between the special stop_machine
> context and CPU power management...

Yeah, so?

Also, incidentally, stop-machine got a rewrite around .35 and again
significant changes in .37, so please do test mainline and not your
dinosaur.

> > Yeah, what are you smoking? Why do you wreck perfectly fine code for one
> > backward ass piece of hardware.
>
> Just make things less complex...

But its wrong, it very clearly works around a real problem, don't ever
do that, fix the problem!