From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753581AbZBCDvp (ORCPT ); Mon, 2 Feb 2009 22:51:45 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751501AbZBCDvf (ORCPT ); Mon, 2 Feb 2009 22:51:35 -0500 Received: from gate.crashing.org ([63.228.1.57]:33717 "EHLO gate.crashing.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751371AbZBCDve (ORCPT ); Mon, 2 Feb 2009 22:51:34 -0500 Subject: Re: PCI PM: Restore standard config registers of all devices early From: Benjamin Herrenschmidt To: Linus Torvalds Cc: "Rafael J. Wysocki" , Linux Kernel Mailing List , Jesse Barnes , Andreas Schwab , Len Brown , Ingo Molnar In-Reply-To: References: <200901261904.n0QJ4Q9c016709@hera.kernel.org> <200902022239.02453.rjw@sisk.pl> <200902022331.49131.rjw@sisk.pl> Content-Type: text/plain Date: Tue, 03 Feb 2009 14:51:01 +1100 Message-Id: <1233633061.16867.41.camel@pasglop> Mime-Version: 1.0 X-Mailer: Evolution 2.24.3 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 2009-02-02 at 15:18 -0800, Linus Torvalds wrote: > (background: pci_restore_standard_config() will have done > pci_raw_set_power_state(PCI_D0) with the device clocks off, which > wouldn't > actualyl have _done_ anythign to the device, but then it does > > dev->current_state = PCI_D0; > > Maybe the simplest thing to do migth be to replace that with a > > pci_update_current_state(dev, PCI_D0); > > instead, to try to read back the state explicitly in > pci_restore_standard_config()). Note that I notice on various powerbook models a bunch of things like : restoring config space at offset XX (was: 0xffffffff, writin 0 ...) I'm not sure if that will do any good :-) The reason is that some devices (here sungem, ide-pmac and the firewire OHCI) -do- get clock gated until either you hit some of my magic pmac_feature calls in the driver or the driver calls pci_enable_device() (I have a hook there). So far, it appears to be harmless ... at least on the one machine I have here, but it shows that we are doing something wrong. So to recapitulate our entire discussion, let me know if I'm off the mark here: - We agree that in general, the arch must have a chance to turn things on before we do anything such as restoring the config space. This includes my pmac clock gating stuff and ACPI - The problem with ACPI is that it needs interrupts because it uses mutexes, GFP_KERNEL, etc.... There are two approaches to solving that problem. We can either make the mutexes not warn etc... or we can use a loop of disable_irq() instead of local_irq_save() between our normal and our "late" phase of PM. - There are some arguments for the fact that making mutexes not warn and GFP_KERNEL silently degrade to at least GFP_NOIO is a good thing to do at that stage in the suspend process (and the GFP_KERNEL -> NOIO should probably happen even earlier) so we might get ACPI for free there. Now, here's just a few things to throw into the discussion: - I think sysdevs should still have real IRQs off (local_irq_off) since that includes shutting down the PIC itself, the timers, etc... - Doing the disable_irq() loop instead of local_irq_save() will introduce one subtle change to driver semantics, ie, they -might- be entered while in their suspend_late() via some timer interrupt (ie, IO scheduler acting on a timer, network stuff, tty output for the serial driver, etc...) while we had a pretty strong guarantee with the local_irq_save() case that nothing would happen at that stage, this isn't true anymore. That means more chances for drivers to get it wrong... - Because of the two things above, I tend to favor myself the option of making mutexes not warn etc... - There are some additional reasons where generalizing that to the "late" suspend and early resume stages could be useful, such as allowing early video resume of fbdev/fbcon or the new KMS etc... while those things may internally use mutexes or do GFP_KERNEL allocs etc... One example hitting me right now is my AGP PM code that I need to run early to bring the video back properly, which calls agp_collect_device_status() which calls pci_get_class() which does a kzalloc(...,GFP_KERNEL). Another one is the fb_set_suspend() itself uses the fb_notifier which is a blocking notifier. This is annoying because this notifier isn't used only for suspend/resume, for for other things that can legitimately be blocking, so making it an atomic notifier isn't really the right options, etc... For those reason, I think we need to look at the few changes to the scheduler/mutex code, might_sleep() and allocators, so that: - During the main suspend/resume phase: GFP_KERNEL -> GFP_NOIO - During the "irq off" phase: * GFP_KERNEL/NOIO -> GFP_ATOMIC * msleep -> mdelay * might_sleep() doesn't warn * actual blocking on a mutex errors out verbosely That would allow a whole chunk of code to be used in the later phase as if it was boot time, including probably ACPI. Of course, we still want to be careful, we aren't going to allow explicit scheduling, ie, things like sysdev's should still be written with the idea that it -is- atomic context, but at least cases such as the ones we've exposed above would be much more manageable. >>From there, we can then rework PCI to properly call the arch stuff first thing on power on and be happy with it. Comments ? Cheers, Ben.