* Re: PCI PM: Restore standard config registers of all devices early
[not found] <200901261904.n0QJ4Q9c016709@hera.kernel.org>
@ 2009-02-02 9:54 ` Benjamin Herrenschmidt
2009-02-02 17:06 ` Linus Torvalds
2009-02-02 17:20 ` Rafael J. Wysocki
0 siblings, 2 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 9:54 UTC (permalink / raw)
To: Linux Kernel Mailing List
Cc: Linus Torvalds, Jesse Barnes, Rafael J. Wysocki, Andreas Schwab
On Mon, 2009-01-26 at 19:04 +0000, Linux Kernel Mailing List wrote:
> Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=aa8c6c93747f7b55fa11e1624fec8ca33763a805
> Commit: aa8c6c93747f7b55fa11e1624fec8ca33763a805
> Parent: 0db29af1e767464d71b89410d61a1e5b668d0370
> Author: Rafael J. Wysocki <rjw@sisk.pl>
> AuthorDate: Fri Jan 16 21:54:43 2009 +0100
> Committer: Jesse Barnes <jbarnes@virtuousgeek.org>
> CommitDate: Fri Jan 16 12:57:58 2009 -0800
>
> PCI PM: Restore standard config registers of all devices early
This seems to break resume on some PowerBooks (just reported by
Andreas).
I'm not yet sure what's up, but I can imagine various problems with the
approach. For one, pci_enable_device() is the only place where the arch
has a hook to turn back on things like clocks etc... to a device. You
aren't supposed to touch at the config space before the arch got a
chance to turn things back on. I agree there's somewhat a chicken and
egg problem here, so we might need to look more closely at what that
restore thing does and maybe add a call to the arch low level enable
hook first... Or create a new hook.
I don't know how x86 does but I'm sure there must be some kind ACPI
thingy that must be called too before you can touch a device, in case it
got powered off by more than just the standard D states (ie, clock
stopped on the bus or whole power plane switched off).
I'll try to reproduce and think about it more, but it looks to me that
this patch might not be quite the right approach yet.
Cheers,
Ben.
> There is a problem in our handling of suspend-resume of PCI devices that
> many of them have their standard config registers restored with
> interrupts enabled and they are put into the full power state with
> interrupts enabled as well. This may lead to the following scenario:
> * an interrupt vector is shared between two or more devices
> * one device is resumed earlier and generates an interrupt
> * the interrupt handler of another device tries to handle it and
> attempts to access the device the config space of which hasn't been
> restored yet and/or which still is in a low power state
> * the system crashes as a result
>
> To prevent this from happening we should restore the standard
> configuration registers of all devices with interrupts disabled and we
> should put them into the D0 power state right after that.
> Unfortunately, this cannot be done using the existing
> pci_set_power_state(), because it can sleep. Also, to do it we have to
> make sure that the config spaces of all devices were actually saved
> during suspend.
>
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
> ---
> drivers/pci/pci-driver.c | 91 ++++++++++++++--------------------------------
> drivers/pci/pci.c | 63 +++++++++++++++++++++++++++++---
> drivers/pci/pci.h | 6 +++
> include/linux/pci.h | 5 +++
> 4 files changed, 95 insertions(+), 70 deletions(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index c697f26..9de07b7 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -355,17 +355,27 @@ static int pci_legacy_suspend(struct device *dev, pm_message_t state)
> int i = 0;
>
> if (drv && drv->suspend) {
> + pci_dev->state_saved = false;
> +
> i = drv->suspend(pci_dev, state);
> suspend_report_result(drv->suspend, i);
> - } else {
> - pci_save_state(pci_dev);
> - /*
> - * This is for compatibility with existing code with legacy PM
> - * support.
> - */
> - pci_pm_set_unknown_state(pci_dev);
> + if (i)
> + return i;
> +
> + if (pci_dev->state_saved)
> + goto Fixup;
> +
> + if (WARN_ON_ONCE(pci_dev->current_state != PCI_D0))
> + goto Fixup;
> }
>
> + pci_save_state(pci_dev);
> + /*
> + * This is for compatibility with existing code with legacy PM support.
> + */
> + pci_pm_set_unknown_state(pci_dev);
> +
> + Fixup:
> pci_fixup_device(pci_fixup_suspend, pci_dev);
>
> return i;
> @@ -386,81 +396,34 @@ static int pci_legacy_suspend_late(struct device *dev, pm_message_t state)
>
> static int pci_legacy_resume_early(struct device *dev)
> {
> - int error = 0;
> struct pci_dev * pci_dev = to_pci_dev(dev);
> struct pci_driver * drv = pci_dev->driver;
>
> - pci_fixup_device(pci_fixup_resume_early, pci_dev);
> -
> - if (drv && drv->resume_early)
> - error = drv->resume_early(pci_dev);
> - return error;
> + return drv && drv->resume_early ?
> + drv->resume_early(pci_dev) : 0;
> }
>
> static int pci_legacy_resume(struct device *dev)
> {
> - int error;
> struct pci_dev * pci_dev = to_pci_dev(dev);
> struct pci_driver * drv = pci_dev->driver;
>
> pci_fixup_device(pci_fixup_resume, pci_dev);
>
> - if (drv && drv->resume) {
> - error = drv->resume(pci_dev);
> - } else {
> - /* restore the PCI config space */
> - pci_restore_state(pci_dev);
> - error = pci_pm_reenable_device(pci_dev);
> - }
> - return error;
> + return drv && drv->resume ?
> + drv->resume(pci_dev) : pci_pm_reenable_device(pci_dev);
> }
>
> /* Auxiliary functions used by the new power management framework */
>
> -static int pci_restore_standard_config(struct pci_dev *pci_dev)
> -{
> - struct pci_dev *parent = pci_dev->bus->self;
> - int error = 0;
> -
> - /* Check if the device's bus is operational */
> - if (!parent || parent->current_state == PCI_D0) {
> - pci_restore_state(pci_dev);
> - pci_update_current_state(pci_dev, PCI_D0);
> - } else {
> - dev_warn(&pci_dev->dev, "unable to restore config, "
> - "bridge %s in low power state D%d\n", pci_name(parent),
> - parent->current_state);
> - pci_dev->current_state = PCI_UNKNOWN;
> - error = -EAGAIN;
> - }
> -
> - return error;
> -}
> -
> -static bool pci_is_bridge(struct pci_dev *pci_dev)
> -{
> - return !!(pci_dev->subordinate);
> -}
> -
> static void pci_pm_default_resume_noirq(struct pci_dev *pci_dev)
> {
> - if (pci_restore_standard_config(pci_dev))
> - pci_fixup_device(pci_fixup_resume_early, pci_dev);
> + pci_restore_standard_config(pci_dev);
> + pci_fixup_device(pci_fixup_resume_early, pci_dev);
> }
>
> static int pci_pm_default_resume(struct pci_dev *pci_dev)
> {
> - /*
> - * pci_restore_standard_config() should have been called once already,
> - * but it would have failed if the device's parent bridge had not been
> - * in power state D0 at that time. Check it and try again if necessary.
> - */
> - if (pci_dev->current_state == PCI_UNKNOWN) {
> - int error = pci_restore_standard_config(pci_dev);
> - if (error)
> - return error;
> - }
> -
> pci_fixup_device(pci_fixup_resume, pci_dev);
>
> if (!pci_is_bridge(pci_dev))
> @@ -575,11 +538,11 @@ static int pci_pm_resume_noirq(struct device *dev)
> struct device_driver *drv = dev->driver;
> int error = 0;
>
> + pci_pm_default_resume_noirq(pci_dev);
> +
> if (pci_has_legacy_pm_support(pci_dev))
> return pci_legacy_resume_early(dev);
>
> - pci_pm_default_resume_noirq(pci_dev);
> -
> if (drv && drv->pm && drv->pm->resume_noirq)
> error = drv->pm->resume_noirq(dev);
>
> @@ -730,11 +693,11 @@ static int pci_pm_restore_noirq(struct device *dev)
> struct device_driver *drv = dev->driver;
> int error = 0;
>
> + pci_pm_default_resume_noirq(pci_dev);
> +
> if (pci_has_legacy_pm_support(pci_dev))
> return pci_legacy_resume_early(dev);
>
> - pci_pm_default_resume_noirq(pci_dev);
> -
> if (drv && drv->pm && drv->pm->restore_noirq)
> error = drv->pm->restore_noirq(dev);
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e491fde..17bd932 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -22,7 +22,7 @@
> #include <asm/dma.h> /* isa_dma_bridge_buggy */
> #include "pci.h"
>
> -unsigned int pci_pm_d3_delay = 10;
> +unsigned int pci_pm_d3_delay = PCI_PM_D3_WAIT;
>
> #ifdef CONFIG_PCI_DOMAINS
> int pci_domains_supported = 1;
> @@ -426,6 +426,7 @@ static inline int platform_pci_sleep_wake(struct pci_dev *dev, bool enable)
> * given PCI device
> * @dev: PCI device to handle.
> * @state: PCI power state (D0, D1, D2, D3hot) to put the device into.
> + * @wait: If 'true', wait for the device to change its power state
> *
> * RETURN VALUE:
> * -EINVAL if the requested state is invalid.
> @@ -435,7 +436,7 @@ static inline int platform_pci_sleep_wake(struct pci_dev *dev, bool enable)
> * 0 if device's power state has been successfully changed.
> */
> static int
> -pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> +pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state, bool wait)
> {
> u16 pmcsr;
> bool need_restore = false;
> @@ -480,8 +481,10 @@ pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> break;
> case PCI_UNKNOWN: /* Boot-up */
> if ((pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D3hot
> - && !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET))
> + && !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET)) {
> need_restore = true;
> + wait = true;
> + }
> /* Fall-through: force to D0 */
> default:
> pmcsr = 0;
> @@ -491,12 +494,15 @@ pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> /* enter specified state */
> pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);
>
> + if (!wait)
> + return 0;
> +
> /* Mandatory power management transition delays */
> /* see PCI PM 1.1 5.6.1 table 18 */
> if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
> msleep(pci_pm_d3_delay);
> else if (state == PCI_D2 || dev->current_state == PCI_D2)
> - udelay(200);
> + udelay(PCI_PM_D2_DELAY);
>
> dev->current_state = state;
>
> @@ -515,7 +521,7 @@ pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> if (need_restore)
> pci_restore_bars(dev);
>
> - if (dev->bus->self)
> + if (wait && dev->bus->self)
> pcie_aspm_pm_state_change(dev->bus->self);
>
> return 0;
> @@ -585,7 +591,7 @@ int pci_set_power_state(struct pci_dev *dev, pci_power_t state)
> if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
> return 0;
>
> - error = pci_raw_set_power_state(dev, state);
> + error = pci_raw_set_power_state(dev, state, true);
>
> if (state > PCI_D0 && platform_pci_power_manageable(dev)) {
> /* Allow the platform to finalize the transition */
> @@ -730,6 +736,7 @@ pci_save_state(struct pci_dev *dev)
> /* XXX: 100% dword access ok here? */
> for (i = 0; i < 16; i++)
> pci_read_config_dword(dev, i * 4,&dev->saved_config_space[i]);
> + dev->state_saved = true;
> if ((i = pci_save_pcie_state(dev)) != 0)
> return i;
> if ((i = pci_save_pcix_state(dev)) != 0)
> @@ -1374,6 +1381,50 @@ void pci_allocate_cap_save_buffers(struct pci_dev *dev)
> }
>
> /**
> + * pci_restore_standard_config - restore standard config registers of PCI device
> + * @dev: PCI device to handle
> + *
> + * This function assumes that the device's configuration space is accessible.
> + * If the device needs to be powered up, the function will wait for it to
> + * change the state.
> + */
> +int pci_restore_standard_config(struct pci_dev *dev)
> +{
> + pci_power_t prev_state;
> + int error;
> +
> + pci_restore_state(dev);
> + pci_update_current_state(dev, PCI_D0);
> +
> + prev_state = dev->current_state;
> + if (prev_state == PCI_D0)
> + return 0;
> +
> + error = pci_raw_set_power_state(dev, PCI_D0, false);
> + if (error)
> + return error;
> +
> + if (pci_is_bridge(dev)) {
> + if (prev_state > PCI_D1)
> + mdelay(PCI_PM_BUS_WAIT);
> + } else {
> + switch(prev_state) {
> + case PCI_D3cold:
> + case PCI_D3hot:
> + mdelay(pci_pm_d3_delay);
> + break;
> + case PCI_D2:
> + udelay(PCI_PM_D2_DELAY);
> + break;
> + }
> + }
> +
> + dev->current_state = PCI_D0;
> +
> + return 0;
> +}
> +
> +/**
> * pci_enable_ari - enable ARI forwarding if hardware support it
> * @dev: the PCI device
> */
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 1351bb4..26ddf78 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -49,6 +49,12 @@ extern void pci_disable_enabled_device(struct pci_dev *dev);
> extern void pci_pm_init(struct pci_dev *dev);
> extern void platform_pci_wakeup_init(struct pci_dev *dev);
> extern void pci_allocate_cap_save_buffers(struct pci_dev *dev);
> +extern int pci_restore_standard_config(struct pci_dev *dev);
> +
> +static inline bool pci_is_bridge(struct pci_dev *pci_dev)
> +{
> + return !!(pci_dev->subordinate);
> +}
>
> extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
> extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 80f8b8b..48890cf 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -117,6 +117,10 @@ typedef int __bitwise pci_power_t;
> #define PCI_UNKNOWN ((pci_power_t __force) 5)
> #define PCI_POWER_ERROR ((pci_power_t __force) -1)
>
> +#define PCI_PM_D2_DELAY 200
> +#define PCI_PM_D3_WAIT 10
> +#define PCI_PM_BUS_WAIT 50
> +
> /** The pci_channel state describes connectivity between the CPU and
> * the pci device. If some PCI bus between here and the pci device
> * has crashed or locked up, this info is reflected here.
> @@ -252,6 +256,7 @@ struct pci_dev {
> unsigned int ari_enabled:1; /* ARI forwarding */
> unsigned int is_managed:1;
> unsigned int is_pcie:1;
> + unsigned int state_saved:1;
> pci_dev_flags_t dev_flags;
> atomic_t enable_cnt; /* pci_enable_device has been called */
>
> --
> To unsubscribe from this list: send the line "unsubscribe git-commits-head" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 9:54 ` PCI PM: Restore standard config registers of all devices early Benjamin Herrenschmidt
@ 2009-02-02 17:06 ` Linus Torvalds
2009-02-02 20:29 ` Benjamin Herrenschmidt
2009-02-02 20:33 ` Benjamin Herrenschmidt
2009-02-02 17:20 ` Rafael J. Wysocki
1 sibling, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 17:06 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
On Mon, 2 Feb 2009, Benjamin Herrenschmidt wrote:
>
> I don't know how x86 does but I'm sure there must be some kind ACPI
> thingy that must be called too before you can touch a device, in case it
> got powered off by more than just the standard D states (ie, clock
> stopped on the bus or whole power plane switched off).
It gets called later in that case.
> I'll try to reproduce and think about it more, but it looks to me that
> this patch might not be quite the right approach yet.
I think you'll need to move the clock gating thing to be a sysdev
suspend/resume event, which gets done really early along with things like
core timekeeping etc. We kind of hit that already with the USB driver,
where CONFIG_PPC does all kinds of wrong things:
#ifdef CONFIG_PPC_PMAC
if (retval == 0) {
/* Disable ASIC clocks for USB */
if (machine_is(powermac)) {
struct device_node *of_node;
of_node = pci_device_to_OF_node(dev);
if (of_node)
pmac_call_feature(PMAC_FTR_USB_ENABLE,
of_node, 0, 0);
}
}
#endif
inside the hcd-pci.c driver. If that whole thing was a sysdev feature, you
wouldn't need that kind of insane "do my own arch-specific thing in a
generic driver" thing. AND waking it up would work too.
I'm assuming this is exactly the kind of thing that is now biting you?
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 9:54 ` PCI PM: Restore standard config registers of all devices early Benjamin Herrenschmidt
2009-02-02 17:06 ` Linus Torvalds
@ 2009-02-02 17:20 ` Rafael J. Wysocki
1 sibling, 0 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 17:20 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linux Kernel Mailing List, Linus Torvalds, Jesse Barnes,
Andreas Schwab
On Monday 02 February 2009, Benjamin Herrenschmidt wrote:
> On Mon, 2009-01-26 at 19:04 +0000, Linux Kernel Mailing List wrote:
> > Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=aa8c6c93747f7b55fa11e1624fec8ca33763a805
> > Commit: aa8c6c93747f7b55fa11e1624fec8ca33763a805
> > Parent: 0db29af1e767464d71b89410d61a1e5b668d0370
> > Author: Rafael J. Wysocki <rjw@sisk.pl>
> > AuthorDate: Fri Jan 16 21:54:43 2009 +0100
> > Committer: Jesse Barnes <jbarnes@virtuousgeek.org>
> > CommitDate: Fri Jan 16 12:57:58 2009 -0800
> >
> > PCI PM: Restore standard config registers of all devices early
>
> This seems to break resume on some PowerBooks (just reported by
> Andreas).
Yes, I saw the report in Bugzilla.
> I'm not yet sure what's up, but I can imagine various problems with the
> approach. For one, pci_enable_device() is the only place where the arch
> has a hook to turn back on things like clocks etc... to a device. You
> aren't supposed to touch at the config space before the arch got a
> chance to turn things back on. I agree there's somewhat a chicken and
> egg problem here, so we might need to look more closely at what that
> restore thing does and maybe add a call to the arch low level enable
> hook first... Or create a new hook.
I'd prefer to create a new hook, althouth there may be a problem with ACPI
vs interrupts off. In the meantime, I'd like to test some recent fixes on top
of this patch and try a couple of debug patches to see what's up (in the
Bugzilla if that's not a problem).
> I don't know how x86 does but I'm sure there must be some kind ACPI
> thingy that must be called too before you can touch a device, in case it
> got powered off by more than just the standard D states (ie, clock
> stopped on the bus or whole power plane switched off).
>
> I'll try to reproduce and think about it more, but it looks to me that
> this patch might not be quite the right approach yet.
Yeah.
Well, the approach is a bit x86-ish ...
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 17:06 ` Linus Torvalds
@ 2009-02-02 20:29 ` Benjamin Herrenschmidt
2009-02-02 20:41 ` Linus Torvalds
2009-02-02 20:33 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 20:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
> I think you'll need to move the clock gating thing to be a sysdev
> suspend/resume event, which gets done really early along with things like
> core timekeeping etc. We kind of hit that already with the USB driver,
> where CONFIG_PPC does all kinds of wrong things:
But I want the gating to be tied to the device. IE. If the device is
ever suspended on its own, or the driver removed, I want the clocks
off. Ideally, I want USB autosuspend to be able to clock gate or that
sort of thing ..
I'd rather hook it up inside pci_set_power_state()...
> inside the hcd-pci.c driver. If that whole thing was a sysdev feature, you
> wouldn't need that kind of insane "do my own arch-specific thing in a
> generic driver" thing. AND waking it up would work too.
>
> I'm assuming this is exactly the kind of thing that is now biting you?
I don't know yet what is biting us, though I suppose so. I only saw
Andreas report before going to bed yesterday ;-)
I wouldn't need a sysdev if I was going to put the gating under arch
control, I could just do that from various arch code I already have
there doing bits and pieces. But I like it being tied to the driver, it
makes more sense in many ways to have a driver control the clocks of its
device. Maybe the best approach is to stick a hook into
pci_set_power_state() ... This should really be the very first thing to
happen, even before whacking back the BARs.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 17:06 ` Linus Torvalds
2009-02-02 20:29 ` Benjamin Herrenschmidt
@ 2009-02-02 20:33 ` Benjamin Herrenschmidt
2009-02-02 20:50 ` Linus Torvalds
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 20:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
On Mon, 2009-02-02 at 09:06 -0800, Linus Torvalds wrote:
> > I don't know how x86 does but I'm sure there must be some kind ACPI
> > thingy that must be called too before you can touch a device, in case it
> > got powered off by more than just the standard D states (ie, clock
> > stopped on the bus or whole power plane switched off).
>
> It gets called later in that case.
And won't you have a potential problem here if ACPI is doing clock
gating or turning off the whole power plane ? IE. If that happens, your
pci_restore_state() done early with interrupts off will fail as well...
Since the root of that problem seems to be related to interrupts, maybe
the right approach is to have drivers disable it on suspend (ie,
disable_irq -> disabled at PIC level) and thus it would only get
re-enabled when all the drivers on a shared line have re-enabled it...
but that would mean drivers have to know they can not synchronously wait
for an interrupt from the device to happen at resume() time ...
It's not a trivial problem...
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 20:29 ` Benjamin Herrenschmidt
@ 2009-02-02 20:41 ` Linus Torvalds
2009-02-02 21:00 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 20:41 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
>
> I'd rather hook it up inside pci_set_power_state()...
Umm. But you already _have_ that. Look at platform_pci_set_power_state().
It gets called before turning the device on (pci_raw_set_power_state(D0))
and after turning the device off (pci_raw_set_power_state(D1+))
Maybe you missed it? It's how ACPI does things.
However, the issue you see is that pci_restore_standard_config() doesn't
call it, because at least with ACPI, the ACPI code simply isn't ready to
be called with interrupts off. So it looks like you may be looking at the
wrong thing, hmm?
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 20:33 ` Benjamin Herrenschmidt
@ 2009-02-02 20:50 ` Linus Torvalds
2009-02-02 20:55 ` Linus Torvalds
2009-02-02 21:07 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 20:50 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
>
> And won't you have a potential problem here if ACPI is doing clock
> gating or turning off the whole power plane ? IE. If that happens, your
> pci_restore_state() done early with interrupts off will fail as well...
Well, it won't fail any worse than it does anyway.
> Since the root of that problem seems to be related to interrupts, maybe
> the right approach is to have drivers disable it on suspend (ie,
> disable_irq -> disabled at PIC level)
No.
We went through this, you apparently didn't follow it.
You cannot disable_irq() at suspend/resume time, because that means that
when you have shared interrupts the device can no longer use interrupts in
those paths - because they may be disabled by totally different devices.
And a lot of devices WILL NOT WORK without interrupts. Including
suspend/resume events. Think something as common as USB.
> It's not a trivial problem...
That's the understatement of the year.
The whole reason we want to do the two-phase commit thing where
"suspend()" starts the thing and "suspend_late()" finalizes things is
exactly all about these inter-connections and interrupts in particular.
I suspect that we could possibly make ACPI happy by actually leaving
interrupts "enabled" in the suspend-late (and early-resume) paths, but
with all hardware interrupts actually turned off. But that's really just a
"let's fool people by turning off interrupts a different way" thing - it
in no way really changes any fundamental issues.
Whether you use "disable_irq() over all interrupts" or "local_irq_save ->
local_irq_restore" really doesn't change anything. You cannot do this in a
single phase, because that means that you randomly disable interrupts too
early (or enable them too late) when drivers still _require_ them.
So the only workable way to handle interrupts is one of two:
- the two-phase thing we do. Do a first phase with interrupts enabled,
then the actual low-level "turn off" with interrupts disabled (and the
reverse on resume), so that device drivers never have to see the case
of "interrupt happens with dead device", while still having the
_guarantee_ that interrupts work for part of their suspend/resume
cycle.
- expecting all drivers to be perfect and handle interrupts correcly in
the driver.
Quite frankly, I don't think the second one is workable. It may be the
optimal one in theory, but it's never worked for us in practice.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 20:50 ` Linus Torvalds
@ 2009-02-02 20:55 ` Linus Torvalds
2009-02-02 21:19 ` Benjamin Herrenschmidt
2009-02-02 21:39 ` Rafael J. Wysocki
2009-02-02 21:07 ` Benjamin Herrenschmidt
1 sibling, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 20:55 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
On Mon, 2 Feb 2009, Linus Torvalds wrote:
>
> I suspect that we could possibly make ACPI happy by actually leaving
> interrupts "enabled" in the suspend-late (and early-resume) paths, but
> with all hardware interrupts actually turned off. But that's really just a
> "let's fool people by turning off interrupts a different way" thing - it
> in no way really changes any fundamental issues.
Btw, I do think that we can make ACPI happy regardless.
We quite commonly call into ACPI during the early boot sequence, when
interrupts are disabled for all the same reasons. We don't get the
"might_sleep()" warnings, simply because we have these kinds of checks:
..
if ((!in_atomic() && !irqs_disabled()) ||
system_state != SYSTEM_RUNNING || oops_in_progress)
return;
..
ie we know that "system_state != SYSTEM_RUNNING" is a special case where
things are allowed to do things that they aren't normally allowed to do.
I suspect that late-suspend/early-resume is just exactly the same. It's a
boot, after all. Just let ACPI do its odd things, despite the fact that
interrupts are disabled. The fact that doing them while the system is
_running_ is invalid doesn't necessarily mean that it is invalid under
bootup or suspend/resume.
We might even make it possible to have timers going, if we end up saying
"we'll mask all hardware interrupts _except_ for the timer". I'm not sure
that is necessarily something we can do portably, though..
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 20:41 ` Linus Torvalds
@ 2009-02-02 21:00 ` Benjamin Herrenschmidt
2009-02-02 21:32 ` Rafael J. Wysocki
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 21:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
On Mon, 2009-02-02 at 12:41 -0800, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
> >
> > I'd rather hook it up inside pci_set_power_state()...
>
> Umm. But you already _have_ that. Look at platform_pci_set_power_state().
> It gets called before turning the device on (pci_raw_set_power_state(D0))
> and after turning the device off (pci_raw_set_power_state(D1+))
>
> Maybe you missed it? It's how ACPI does things.
I may well have, it didn't exist when my stuff was written.
> However, the issue you see is that pci_restore_standard_config() doesn't
> call it, because at least with ACPI, the ACPI code simply isn't ready to
> be called with interrupts off. So it looks like you may be looking at the
> wrong thing, hmm?
Well.. yes and no... I still don't see how it can be sane to whack config
space back before the device has been turned back on. I wouldn't be surprised
if some devices don't grok well their BARs being written while not in D0.
I suppose I can always add another hook inside pci_restore_blah for my specific
case but I can also very easily see that can of worms hitting even more badly
in tight embedded environments such as handhelds, who use things like fine clock
control a lot more extensively.
In fact, even on x86, I'm not sure it's kosher to restore the config space
before you called ACPI...
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 20:50 ` Linus Torvalds
2009-02-02 20:55 ` Linus Torvalds
@ 2009-02-02 21:07 ` Benjamin Herrenschmidt
2009-02-02 21:49 ` Rafael J. Wysocki
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 21:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
> I suspect that we could possibly make ACPI happy by actually leaving
> interrupts "enabled" in the suspend-late (and early-resume) paths, but
> with all hardware interrupts actually turned off. But that's really just a
> "let's fool people by turning off interrupts a different way" thing - it
> in no way really changes any fundamental issues.
Well, it might be the right approach ... for the simple reason that ACPI
doesn't -really- need external interrupts off ... it's a side effect of
the interpreter using mutexes etc...
So maybe indeed we should look into doing something like that, in fact,
I quite like it:
* suspend:
- suspend devices with IRQs on
- suspend PICs (ie, mask all interrupts). the main kernel clock
source should be kept running tho
- do the device suspend "late" thing involving ACPI etc..
* resume:
the other way around.
Do we still need a real suspend "late" with hard IRQs off in that
scenario ?
Granted, because we are still effectively scheduling etc... we might
have the driver being hit by requests etc... so the suspend_late in the
above setup loses the property of being shielded against that...
> Whether you use "disable_irq() over all interrupts" or "local_irq_save ->
> local_irq_restore" really doesn't change anything. You cannot do this in a
> single phase, because that means that you randomly disable interrupts too
> early (or enable them too late) when drivers still _require_ them.
Well, sort-of. IE. local_irq_save() vs. disable_irq() has a relevant
difference in our context ... the ability to use mutexes, msleep, etc...
which may allow ACPI to operate.
I don't care about ACPI on powerpc, so I'm happy hooking the gating with
irqs off ... but the existing hook isn't in the right place imho.
> Quite frankly, I don't think the second one is workable. It may be the
> optimal one in theory, but it's never worked for us in practice.
True, it's going to be a pain. However, I still think you're going to
be bitten somewhere if you whack config space back before you called
the appropriate ACPI magic to resume the device, but I may be wrong,
maybe nobody will ever implement really aggressive power management
in their AML ? :-)
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 20:55 ` Linus Torvalds
@ 2009-02-02 21:19 ` Benjamin Herrenschmidt
2009-02-02 21:39 ` Rafael J. Wysocki
1 sibling, 0 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 21:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Jesse Barnes, Rafael J. Wysocki,
Andreas Schwab
> I suspect that late-suspend/early-resume is just exactly the same. It's a
> boot, after all.
I agree. We know we have turned off all other CPUs, we can't race,
etc...
> Just let ACPI do its odd things, despite the fact that
> interrupts are disabled. The fact that doing them while the system is
> _running_ is invalid doesn't necessarily mean that it is invalid under
> bootup or suspend/resume.
>
> We might even make it possible to have timers going, if we end up saying
> "we'll mask all hardware interrupts _except_ for the timer". I'm not sure
> that is necessarily something we can do portably, though..
Dunno. I can do it for powerpc easily enough but it's hard to tell for
others. The problem with timers running is that it means a slight change
of semantics for drivers in the sense that suspend_late now -might- be
interrupted by some request hitting the driver from interrupt time ...
ie, we had the property that suspend_late was not going to bother with
any exclusion vs. normal processing code path, but with timers on,
that's no longer the case...
I prefer the hack to get ACPI working early. Sounds saner to me in a
wicked way :-) And being able to restore the device via ACPI before
touching its config space -at-all- in the kernel side might end up
making things more robust for us .. who knows what those AML do tho...
BTW. Somebody happens to know what Windows does in that area ? Does it
restore config space ?
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 21:00 ` Benjamin Herrenschmidt
@ 2009-02-02 21:32 ` Rafael J. Wysocki
0 siblings, 0 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 21:32 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab
On Monday 02 February 2009, Benjamin Herrenschmidt wrote:
> On Mon, 2009-02-02 at 12:41 -0800, Linus Torvalds wrote:
> >
> > On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
> > >
> > > I'd rather hook it up inside pci_set_power_state()...
> >
> > Umm. But you already _have_ that. Look at platform_pci_set_power_state().
> > It gets called before turning the device on (pci_raw_set_power_state(D0))
> > and after turning the device off (pci_raw_set_power_state(D1+))
> >
> > Maybe you missed it? It's how ACPI does things.
>
> I may well have, it didn't exist when my stuff was written.
>
> > However, the issue you see is that pci_restore_standard_config() doesn't
> > call it, because at least with ACPI, the ACPI code simply isn't ready to
> > be called with interrupts off. So it looks like you may be looking at the
> > wrong thing, hmm?
>
> Well.. yes and no... I still don't see how it can be sane to whack config
> space back before the device has been turned back on. I wouldn't be surprised
> if some devices don't grok well their BARs being written while not in D0.
>
> I suppose I can always add another hook inside pci_restore_blah for my specific
> case but I can also very easily see that can of worms hitting even more badly
> in tight embedded environments such as handhelds, who use things like fine clock
> control a lot more extensively.
>
> In fact, even on x86, I'm not sure it's kosher to restore the config space
> before you called ACPI...
It need not be in theory, but I haven't seen any real life example of that.
I think devices without the PM capability are suspicious, so to speak.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 20:55 ` Linus Torvalds
2009-02-02 21:19 ` Benjamin Herrenschmidt
@ 2009-02-02 21:39 ` Rafael J. Wysocki
2009-02-02 22:05 ` Linus Torvalds
2009-02-02 22:28 ` Benjamin Herrenschmidt
1 sibling, 2 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 21:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown
On Monday 02 February 2009, Linus Torvalds wrote:
>
> On Mon, 2 Feb 2009, Linus Torvalds wrote:
> >
> > I suspect that we could possibly make ACPI happy by actually leaving
> > interrupts "enabled" in the suspend-late (and early-resume) paths, but
> > with all hardware interrupts actually turned off. But that's really just a
> > "let's fool people by turning off interrupts a different way" thing - it
> > in no way really changes any fundamental issues.
>
> Btw, I do think that we can make ACPI happy regardless.
>
> We quite commonly call into ACPI during the early boot sequence, when
> interrupts are disabled for all the same reasons. We don't get the
> "might_sleep()" warnings, simply because we have these kinds of checks:
>
> ..
> if ((!in_atomic() && !irqs_disabled()) ||
> system_state != SYSTEM_RUNNING || oops_in_progress)
> return;
> ..
>
> ie we know that "system_state != SYSTEM_RUNNING" is a special case where
> things are allowed to do things that they aren't normally allowed to do.
Yes, which is why I thought it might be a good idea to make the AML interpreter
allow is to execute AML with interrupts off, so that we can put devices into
low power states and/or put them into D0 with interrupts off.
Then, we'll be able to make ACPI happy (up to some strange ordering
expectations of some insane BIOSes maybe) and fix the interrupts issue at
the same time.
The idea would be to have a special code path(s) where AML can be executed
with interrupts off and a couple of special entry points into the AML
interpreter for this purpose.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 21:07 ` Benjamin Herrenschmidt
@ 2009-02-02 21:49 ` Rafael J. Wysocki
2009-02-02 22:15 ` Linus Torvalds
0 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 21:49 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown
On Monday 02 February 2009, Benjamin Herrenschmidt wrote:
>
> > I suspect that we could possibly make ACPI happy by actually leaving
> > interrupts "enabled" in the suspend-late (and early-resume) paths, but
> > with all hardware interrupts actually turned off. But that's really just a
> > "let's fool people by turning off interrupts a different way" thing - it
> > in no way really changes any fundamental issues.
>
> Well, it might be the right approach ... for the simple reason that ACPI
> doesn't -really- need external interrupts off ... it's a side effect of
> the interpreter using mutexes etc...
>
> So maybe indeed we should look into doing something like that, in fact,
> I quite like it:
>
> * suspend:
>
> - suspend devices with IRQs on
> - suspend PICs (ie, mask all interrupts). the main kernel clock
> source should be kept running tho
> - do the device suspend "late" thing involving ACPI etc..
I think it would be easier to make ACPI allow us to run AML with interrupts
off.
> * resume:
>
> the other way around.
>
> Do we still need a real suspend "late" with hard IRQs off in that
> scenario ?
>
> Granted, because we are still effectively scheduling etc... we might
> have the driver being hit by requests etc... so the suspend_late in the
> above setup loses the property of being shielded against that...
Well, I don't think we'd need it in this case.
> > Whether you use "disable_irq() over all interrupts" or "local_irq_save ->
> > local_irq_restore" really doesn't change anything. You cannot do this in a
> > single phase, because that means that you randomly disable interrupts too
> > early (or enable them too late) when drivers still _require_ them.
>
> Well, sort-of. IE. local_irq_save() vs. disable_irq() has a relevant
> difference in our context ... the ability to use mutexes, msleep, etc...
> which may allow ACPI to operate.
>
> I don't care about ACPI on powerpc, so I'm happy hooking the gating with
> irqs off ... but the existing hook isn't in the right place imho.
>
> > Quite frankly, I don't think the second one is workable. It may be the
> > optimal one in theory, but it's never worked for us in practice.
>
> True, it's going to be a pain. However, I still think you're going to
> be bitten somewhere if you whack config space back before you called
> the appropriate ACPI magic to resume the device, but I may be wrong,
> maybe nobody will ever implement really aggressive power management
> in their AML ? :-)
In theory, if the device supports native PCI PM, it should be safe to put it
into D0 using that and restore the config registers.
Still, there may be some strange planar devices that need AML to power them
up. We'll see. :-)
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 21:39 ` Rafael J. Wysocki
@ 2009-02-02 22:05 ` Linus Torvalds
2009-02-02 22:09 ` Linus Torvalds
` (3 more replies)
2009-02-02 22:28 ` Benjamin Herrenschmidt
1 sibling, 4 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 22:05 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
>
> Yes, which is why I thought it might be a good idea to make the AML interpreter
> allow is to execute AML with interrupts off, so that we can put devices into
> low power states and/or put them into D0 with interrupts off.
>
> Then, we'll be able to make ACPI happy (up to some strange ordering
> expectations of some insane BIOSes maybe) and fix the interrupts issue at
> the same time.
>
> The idea would be to have a special code path(s) where AML can be executed
> with interrupts off and a couple of special entry points into the AML
> interpreter for this purpose.
Well, I don't think anybody wants to (or necessarily has the ability)
touch ACPI internals.
How about this kind of approach:
- step#1: split "sysdev" device suspend/resume from the general
"device_power_on/off()" step
Rationale: sysdev's include things like the actual interrupt controller
etc, and we almost certainly really do want to have hard interrupts
disabled over that whole thing.
I'm appending a suggested trivial patch for this.
- step#2: make suspend/resume actually do the three phases:
a) existing device->suspend() (interrupts on, everything live)
b) disable_device_irq()'s: things are live, but device interrupts
are turned off by essentially looping over the irq_desc_ptr[]
table.
c) existing device->suspend_late(). System is kind of live, but
all interrupts have been shut off at the source. But we really
could try to keep timers etc alive.
d) disable CPU interrupts.
e) sysdev_suspend(). This turns off the interrupt controller etc
*suspend cpu*
and then do the resume in reverse order.
The reason step#1 is needed is just because we have that irq handling
difference between the two. And if timers are running, we really can
pretty much run all normal code (turning timers off does hurt anything
that uses timeouts like msleep() or schedule_timeout()). Otherwise step#1
isn't important in itself.
Hmm? It doesn't look too painful, and it would mean that we could go back
to a perfectly regular pci_set_power_state() in the early-resume codepath.
Ingo: how do we walk over all the interrupt descriptors in todays world?
The whole sparse-vs-nonsparse makes things a bit more complex.
Linus
---
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon Feb 2 13:36:19 2009 -0800
Split up sysdev_[suspend|resume] from device_power_[down|up]
This just moves the sysdev_suspend/resume from the callee to the
callers, with no real change in semantics.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
arch/x86/kernel/apm_32.c | 4 ++++
drivers/base/power/main.c | 3 ---
drivers/xen/manage.c | 8 ++++++++
kernel/kexec.c | 7 +++++++
kernel/power/disk.c | 11 +++++++++++
kernel/power/main.c | 8 ++++++--
6 files changed, 36 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index 98807bb..67021ad 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -1192,6 +1192,7 @@ static int suspend(int vetoable)
device_suspend(PMSG_SUSPEND);
local_irq_disable();
device_power_down(PMSG_SUSPEND);
+ sysdev_syspend();
local_irq_enable();
@@ -1208,6 +1209,7 @@ static int suspend(int vetoable)
if (err != APM_SUCCESS)
apm_error("suspend", err);
err = (err == APM_SUCCESS) ? 0 : -EIO;
+ sysdev_resume();
device_power_up(PMSG_RESUME);
local_irq_enable();
device_resume(PMSG_RESUME);
@@ -1228,6 +1230,7 @@ static void standby(void)
local_irq_disable();
device_power_down(PMSG_SUSPEND);
+ sysdev_suspend();
local_irq_enable();
err = set_system_power_state(APM_STATE_STANDBY);
@@ -1235,6 +1238,7 @@ static void standby(void)
apm_error("standby", err);
local_irq_disable();
+ sysdev_resume();
device_power_up(PMSG_RESUME);
local_irq_enable();
}
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 670c9d6..2d14f4a 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -333,7 +333,6 @@ static void dpm_power_up(pm_message_t state)
*/
void device_power_up(pm_message_t state)
{
- sysdev_resume();
dpm_power_up(state);
}
EXPORT_SYMBOL_GPL(device_power_up);
@@ -577,8 +576,6 @@ int device_power_down(pm_message_t state)
}
dev->power.status = DPM_OFF_IRQ;
}
- if (!error)
- error = sysdev_suspend(state);
if (error)
dpm_power_up(resume_event(state));
return error;
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 9b91617..fb183a0 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -45,6 +45,13 @@ static int xen_suspend(void *data)
err);
return err;
}
+ err = sysdev_suspend();
+ if (err) {
+ printk(KERN_ERR "xen_suspend: sysdev_suspend failed: %d\n",
+ err);
+ device_power_up(PMSG_RESUME);
+ return err;
+ }
xen_mm_pin_all();
gnttab_suspend();
@@ -61,6 +68,7 @@ static int xen_suspend(void *data)
gnttab_resume();
xen_mm_unpin_all();
+ sysdev_resume();
device_power_up(PMSG_RESUME);
if (!*cancelled) {
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 8a6d7b0..b780288 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1465,6 +1465,11 @@ int kernel_kexec(void)
error = device_power_down(PMSG_FREEZE);
if (error)
goto Enable_irqs;
+
+ /* Suspend system devices */
+ error = sysdev_suspend();
+ if (error)
+ goto Power_up_devices;
} else
#endif
{
@@ -1477,6 +1482,8 @@ int kernel_kexec(void)
#ifdef CONFIG_KEXEC_JUMP
if (kexec_image->preserve_context) {
+ sysdev_resume();
+ Power_up_devices:
device_power_up(PMSG_RESTORE);
Enable_irqs:
local_irq_enable();
diff --git a/kernel/power/disk.c b/kernel/power/disk.c
index 432ee57..07dc9a2 100644
--- a/kernel/power/disk.c
+++ b/kernel/power/disk.c
@@ -227,6 +227,12 @@ static int create_image(int platform_mode)
"aborting hibernation\n");
goto Enable_irqs;
}
+ sysdev_suspend();
+ if (error) {
+ printk(KERN_ERR "PM: Some devices failed to power down, "
+ "aborting hibernation\n");
+ goto Power_up_devices;
+ }
if (hibernation_test(TEST_CORE))
goto Power_up;
@@ -242,9 +248,11 @@ static int create_image(int platform_mode)
if (!in_suspend)
platform_leave(platform_mode);
Power_up:
+ sysdev_resume();
/* NOTE: device_power_up() is just a resume() for devices
* that suspended with irqs off ... no overall powerup.
*/
+ Power_up_devices:
device_power_up(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
Enable_irqs:
@@ -335,6 +343,7 @@ static int resume_target_kernel(void)
"aborting resume\n");
goto Enable_irqs;
}
+ sysdev_suspend();
/* We'll ignore saved state, but this gets preempt count (etc) right */
save_processor_state();
error = restore_highmem();
@@ -357,6 +366,7 @@ static int resume_target_kernel(void)
swsusp_free();
restore_processor_state();
touch_softlockup_watchdog();
+ sysdev_resume();
device_power_up(PMSG_RECOVER);
Enable_irqs:
local_irq_enable();
@@ -440,6 +450,7 @@ int hibernation_platform_enter(void)
local_irq_disable();
error = device_power_down(PMSG_HIBERNATE);
if (!error) {
+ sysdev_suspend();
hibernation_ops->enter();
/* We should never get here */
while (1);
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 2399888..7709556 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -298,8 +298,12 @@ static int suspend_enter(suspend_state_t state)
goto Done;
}
- if (!suspend_test(TEST_CORE))
- error = suspend_ops->enter(state);
+ error = sysdev_suspend();
+ if (!error) {
+ if (!suspend_test(TEST_CORE))
+ error = suspend_ops->enter(state);
+ sysdev_resume();
+ }
device_power_up(PMSG_RESUME);
Done:
^ permalink raw reply related [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:05 ` Linus Torvalds
@ 2009-02-02 22:09 ` Linus Torvalds
2009-02-02 22:31 ` Rafael J. Wysocki
` (2 subsequent siblings)
3 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 22:09 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2 Feb 2009, Linus Torvalds wrote:
>
> Ingo: how do we walk over all the interrupt descriptors in todays world?
> The whole sparse-vs-nonsparse makes things a bit more complex.
Stage#2 would look something like this, except I haven't written the
"[disable|enable]_device_interrupts()" for the simple reason that I'd like
somebody else to do it ;)
But it would basically be just
for (irq = 1; irq < NR_IRQS; irq++)
[disable|enable]_irq(irq);
with maybe some day some way of marking system interrupts separately (ie
timers).
Linus
---
kernel/power/main.c | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 7709556..f54acb6 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -290,14 +290,15 @@ static int suspend_enter(suspend_state_t state)
int error = 0;
device_pm_lock();
- arch_suspend_disable_irqs();
- BUG_ON(!irqs_disabled());
-
+ disable_device_interrupts();
if ((error = device_power_down(PMSG_SUSPEND))) {
printk(KERN_ERR "PM: Some devices failed to power down\n");
goto Done;
}
+ arch_suspend_disable_irqs();
+ BUG_ON(!irqs_disabled());
+
error = sysdev_suspend();
if (!error) {
if (!suspend_test(TEST_CORE))
@@ -305,10 +306,12 @@ static int suspend_enter(suspend_state_t state)
sysdev_resume();
}
- device_power_up(PMSG_RESUME);
- Done:
arch_suspend_enable_irqs();
BUG_ON(irqs_disabled());
+
+ device_power_up(PMSG_RESUME);
+ Done:
+ enable_device_interrupts();
device_pm_unlock();
return error;
}
^ permalink raw reply related [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 21:49 ` Rafael J. Wysocki
@ 2009-02-02 22:15 ` Linus Torvalds
2009-02-02 22:33 ` Rafael J. Wysocki
2009-02-02 22:57 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 22:15 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown
On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
>
> I think it would be easier to make ACPI allow us to run AML with interrupts
> off.
Well, I'd agree, except I have this strong memory of us having known bugs
with ACPI turning hard-interrupts on again. Similarly, it uses mutexes etc
that simply don't work with interrupts off and/or may turn them on again
thanks to scheduling.
"Fixing" that seems not very easy. ACPI has a bad habit of being _really_
hard to fix in this area.
I do agree that _if_ we can just fix ACPI, we wouldn't have these issues,
and we should just call it with interrupts disabled with our existing
code. But my previous email was a "maybe we can do it like this" kind of
thing, which might allow us to use ACPI with none of the irq-off issues.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 21:39 ` Rafael J. Wysocki
2009-02-02 22:05 ` Linus Torvalds
@ 2009-02-02 22:28 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 22:28 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown
On Mon, 2009-02-02 at 22:39 +0100, Rafael J. Wysocki wrote:
> Yes, which is why I thought it might be a good idea to make the AML interpreter
> allow is to execute AML with interrupts off, so that we can put devices into
> low power states and/or put them into D0 with interrupts off.
>
> Then, we'll be able to make ACPI happy (up to some strange ordering
> expectations of some insane BIOSes maybe) and fix the interrupts issue at
> the same time.
>
> The idea would be to have a special code path(s) where AML can be executed
> with interrupts off and a couple of special entry points into the AML
> interpreter for this purpose.
Right. That way, I can also use the same hooks for my clock gating and
be confident they'll be called at the right time.
Now, we need some ACPI folks to tell us why this won't work :-)
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:05 ` Linus Torvalds
2009-02-02 22:09 ` Linus Torvalds
@ 2009-02-02 22:31 ` Rafael J. Wysocki
2009-02-02 23:18 ` Linus Torvalds
2009-02-02 22:48 ` Benjamin Herrenschmidt
2009-02-02 23:49 ` Ingo Molnar
3 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 22:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Monday 02 February 2009, Linus Torvalds wrote:
>
> On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
> >
> > Yes, which is why I thought it might be a good idea to make the AML interpreter
> > allow is to execute AML with interrupts off, so that we can put devices into
> > low power states and/or put them into D0 with interrupts off.
> >
> > Then, we'll be able to make ACPI happy (up to some strange ordering
> > expectations of some insane BIOSes maybe) and fix the interrupts issue at
> > the same time.
> >
> > The idea would be to have a special code path(s) where AML can be executed
> > with interrupts off and a couple of special entry points into the AML
> > interpreter for this purpose.
>
> Well, I don't think anybody wants to (or necessarily has the ability)
> touch ACPI internals.
That need not be very difficult AFAICS.
> How about this kind of approach:
>
> - step#1: split "sysdev" device suspend/resume from the general
> "device_power_on/off()" step
>
> Rationale: sysdev's include things like the actual interrupt controller
> etc, and we almost certainly really do want to have hard interrupts
> disabled over that whole thing.
Fine by me.
> I'm appending a suggested trivial patch for this.
Looks correct.
> - step#2: make suspend/resume actually do the three phases:
>
> a) existing device->suspend() (interrupts on, everything live)
>
> b) disable_device_irq()'s: things are live, but device interrupts
> are turned off by essentially looping over the irq_desc_ptr[]
> table.
Well, do we actually need to turn off all device interrupts?
Shared interrupts are the source of the problem, so perhaps we can
only disable interrupts of devices that use interrupt pins at this point
(MSI/MSI-X need not be disabled, for example, and the timer interrupts most
probably too)?
> c) existing device->suspend_late(). System is kind of live, but
> all interrupts have been shut off at the source. But we really
> could try to keep timers etc alive.
>
> d) disable CPU interrupts.
At what point do we disable the other CPUs?
> e) sysdev_suspend(). This turns off the interrupt controller etc
>
> *suspend cpu*
>
> and then do the resume in reverse order.
>
> The reason step#1 is needed is just because we have that irq handling
> difference between the two. And if timers are running, we really can
> pretty much run all normal code (turning timers off does hurt anything
> that uses timeouts like msleep() or schedule_timeout()). Otherwise step#1
> isn't important in itself.
>
> Hmm? It doesn't look too painful,
Well, it means reworking the entire suspend sequence (again) or we will
break assumptions made by some existing drivers (interrupts off during
suspend_late and resume_early). And that affects all drivers, not only PCI.
> and it would mean that we could go back to a perfectly regular
> pci_set_power_state() in the early-resume codepath.
I first would like to understand what _exactly_ breaks on the iBook reported to
have problems.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:15 ` Linus Torvalds
@ 2009-02-02 22:33 ` Rafael J. Wysocki
2009-02-02 22:56 ` Rafael J. Wysocki
2009-02-02 22:57 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 22:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown
On Monday 02 February 2009, Linus Torvalds wrote:
>
> On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
> >
> > I think it would be easier to make ACPI allow us to run AML with interrupts
> > off.
>
> Well, I'd agree, except I have this strong memory of us having known bugs
> with ACPI turning hard-interrupts on again. Similarly, it uses mutexes etc
> that simply don't work with interrupts off and/or may turn them on again
> thanks to scheduling.
>
> "Fixing" that seems not very easy. ACPI has a bad habit of being _really_
> hard to fix in this area.
>
> I do agree that _if_ we can just fix ACPI, we wouldn't have these issues,
> and we should just call it with interrupts disabled with our existing
> code. But my previous email was a "maybe we can do it like this" kind of
> thing, which might allow us to use ACPI with none of the irq-off issues.
Well, I think both will be difficult, this way or another. :-)
I'll have a look at the ACPI thing.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:05 ` Linus Torvalds
2009-02-02 22:09 ` Linus Torvalds
2009-02-02 22:31 ` Rafael J. Wysocki
@ 2009-02-02 22:48 ` Benjamin Herrenschmidt
2009-02-02 23:00 ` Rafael J. Wysocki
2009-02-02 23:49 ` Ingo Molnar
3 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 22:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 14:05 -0800, Linus Torvalds wrote:
> Well, I don't think anybody wants to (or necessarily has the ability)
> touch ACPI internals.
>
Somebody has a rough idea of what underlying kernel support ACPI uses ?
I know from discussions with Len that it's mostly mutexes... timers
too ?
It might end up being trivial to make it safe to call early with
interrupts off with the trick you mentioned, since it's really just the
same as boot as you mentioned.. ie, it's actually safe to take mutexes
etc... since we -know- no other CPU is running and we aren't scheduling
etc...
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:33 ` Rafael J. Wysocki
@ 2009-02-02 22:56 ` Rafael J. Wysocki
2009-02-03 0:11 ` Benjamin Herrenschmidt
2009-02-10 20:25 ` Pavel Machek
0 siblings, 2 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 22:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown
On Monday 02 February 2009, Rafael J. Wysocki wrote:
> On Monday 02 February 2009, Linus Torvalds wrote:
> >
> > On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
> > >
> > > I think it would be easier to make ACPI allow us to run AML with interrupts
> > > off.
> >
> > Well, I'd agree, except I have this strong memory of us having known bugs
> > with ACPI turning hard-interrupts on again. Similarly, it uses mutexes etc
> > that simply don't work with interrupts off and/or may turn them on again
> > thanks to scheduling.
> >
> > "Fixing" that seems not very easy. ACPI has a bad habit of being _really_
> > hard to fix in this area.
> >
> > I do agree that _if_ we can just fix ACPI, we wouldn't have these issues,
> > and we should just call it with interrupts disabled with our existing
> > code. But my previous email was a "maybe we can do it like this" kind of
> > thing, which might allow us to use ACPI with none of the irq-off issues.
>
> Well, I think both will be difficult, this way or another. :-)
>
> I'll have a look at the ACPI thing.
Generally speaking, we'd need to run acpi_evaluate_object() with interrupts
off.
There are two apparent problems with that, from a quick look:
* The ACPI_MTX_INTERPRETER mutex needs to be acquired, but we know we won't
need that mutex with interrupts off, so presumably we can work around this.
* Memory allocations with GFP_KERNEL are made, which is even worse, because
we really shouldn't do that during suspend _at_ _all_, even during the regular
->suspend() with interrupts on, because there's not guarantee that swap will
will be available at that time. So, for the sake of correctness, we should
get rid of the GFP_KERNEL from the ACPI code paths executed during
suspend-resume anyway.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:15 ` Linus Torvalds
2009-02-02 22:33 ` Rafael J. Wysocki
@ 2009-02-02 22:57 ` Benjamin Herrenschmidt
2009-02-02 23:22 ` Rafael J. Wysocki
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-02 22:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 14:15 -0800, Linus Torvalds wrote:
>
> On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
> >
> > I think it would be easier to make ACPI allow us to run AML with interrupts
> > off.
>
> Well, I'd agree, except I have this strong memory of us having known bugs
> with ACPI turning hard-interrupts on again. Similarly, it uses mutexes etc
> that simply don't work with interrupts off and/or may turn them on again
> thanks to scheduling.
>
> "Fixing" that seems not very easy. ACPI has a bad habit of being _really_
> hard to fix in this area.
>
> I do agree that _if_ we can just fix ACPI, we wouldn't have these issues,
> and we should just call it with interrupts disabled with our existing
> code. But my previous email was a "maybe we can do it like this" kind of
> thing, which might allow us to use ACPI with none of the irq-off issues.
Len, what's your take here ? How much of that stuff is burried deep and
how much is nicely split in a "helper" layer we could more easily fix ?
I'm adding Ingo to the CC as he might have more ideas on how best to
just make the mutexes work & not complain rather than touching ACPI
itself... again, just like boot, might just be a matter of instructing
the mutexes/lockdep to shut up and ignore in_atomic() in those "special"
phases such as late suspend and early resume().
That would help me for something else that broke recently too ... I have
a special "hook" in radeonfb that my arch calls to resume it -very-
early (interrupts off, I haven't even re-enabled the L2 cache). This is
very useful to help debugging problems at resume since without that you
basically don't see a thing and we have no serial port on most of these
machines.
However, that started breaking recently due to fb_set_suspend() calling
into various bits of infrastructure that is no longer safe to call in
atomic context.
Here too, in fact, those -would- be safe since it's mostly a matter of
teaching things like mutex of kmalloc that we are not in standard
SYSTEM_RUNNING state, and thus mutex can just pretty much ignore the
problem of being in atomic state and kmalloc/gfp could automatically
degrade to GFP_ATOMIC (*)
So it might just all be a matter of making might_sleep() shut up in
late suspend/early resume, and possibly msleep() silently turn into
mdelay or something like that. Just make sure we don't actually try to
schedule (and possibly BUG_ON if we actually end up blocking on a mutex,
we should not).
Len, do you think that would work with ACPI or it's more convoluted than
that ?
(*) There are reasons to think that kmalloc/gfp should both silently
turn into GFP_NOIO always while the suspend process is started, but
that's somewhat a different subject. Rafael, did we ever act on that ?
It's an old discussion we had but I don't know if we actually
implemented anything.
IE. Without that, afaik, a driver that hasn't suspend yet might end up
being blocked in some allocation somewhere due to an attempt to page
things out on to an already sleeping device. That driver might be in
such blockage while holding one of its internal mutexes or other thing
that will cause it's own suspend routine later on to screw up. etc etc
etc... In general, best to avoid having to teach drivers that in
suspend-land, non-atomic, allocations may block for ever. Better to make
them all atomic magically.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:48 ` Benjamin Herrenschmidt
@ 2009-02-02 23:00 ` Rafael J. Wysocki
2009-02-03 0:23 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 23:00 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Monday 02 February 2009, Benjamin Herrenschmidt wrote:
> On Mon, 2009-02-02 at 14:05 -0800, Linus Torvalds wrote:
> > Well, I don't think anybody wants to (or necessarily has the ability)
> > touch ACPI internals.
> >
> Somebody has a rough idea of what underlying kernel support ACPI uses ?
> I know from discussions with Len that it's mostly mutexes... timers
> too ?
Please see my last message in this thread (just sent).
> It might end up being trivial to make it safe to call early with
> interrupts off with the trick you mentioned, since it's really just the
> same as boot as you mentioned.. ie, it's actually safe to take mutexes
> etc... since we -know- no other CPU is running and we aren't scheduling
> etc...
But we can't tell no one is holding the mutex in question, AFAICS.
I'm afraid we'd really need a special "no mutexes, no GFP_KERNEL allocations"
code path for that.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:31 ` Rafael J. Wysocki
@ 2009-02-02 23:18 ` Linus Torvalds
2009-02-02 23:45 ` Rafael J. Wysocki
` (2 more replies)
0 siblings, 3 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 23:18 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
> >
> > b) disable_device_irq()'s: things are live, but device interrupts
> > are turned off by essentially looping over the irq_desc_ptr[]
> > table.
>
> Well, do we actually need to turn off all device interrupts?
>
> Shared interrupts are the source of the problem, so perhaps we can
> only disable interrupts of devices that use interrupt pins at this point
> (MSI/MSI-X need not be disabled, for example, and the timer interrupts most
> probably too)?
We could try that, yes.
> > d) disable CPU interrupts.
>
> At what point do we disable the other CPUs?
I left it out, because I don't much care or think it matters. So take your
pick. I'd suggest keeping the current setup, and literally just insert the
new point between "device_power_off()" and "sysdev_suspend()", with _zero_
other changes.
> Well, it means reworking the entire suspend sequence (again) or we will
> break assumptions made by some existing drivers (interrupts off during
> suspend_late and resume_early). And that affects all drivers, not only PCI.
No it doesn't.
No changes AT ALL to the suspend sequence. We do everything in the same
order: look at my patch. The only difference is that instead of doing that
"cli" we do the "for_each_irq(disable_irq)" instead (and do the 'cli' a
bit later).
ZERO effect on drivers. The calling convention is 100% the same as far as
the driver is concerned: ->suspend() is called with interrupts on and a
fully working machine, and ->suspend_late() called with interrupts off.
The only difference is the _mechanism_ of turning interrupts off. NOTHING
else.
> I first would like to understand what _exactly_ breaks on the iBook reported to
> have problems.
I bet it's code like the USB one:
int usb_hcd_pci_resume(struct pci_dev *dev)
{
#ifdef CONFIG_PPC_PMAC
/* Reenable ASIC clocks for USB */
if (machine_is(powermac)) {
struct device_node *of_node;
of_node = pci_device_to_OF_node(dev);
if (of_node)
pmac_call_feature(PMAC_FTR_USB_ENABLE,
of_node, 0, 1);
}
#endif
..
retval = pci_enable_device(dev);
and now pci_enable_device() calls pci_raw_set_power_state(), which does:
if (dev->current_state == state) {
/* we're already there */
return 0;
} else ..
which means that it doesn't actually _do_ anything, because it thinks that
'current_state' was already PCI_D0. But if the device was totally turned
off, that's wrong.
(background: pci_restore_standard_config() will have done
pci_raw_set_power_state(PCI_D0) with the device clocks off, which wouldn't
actualyl have _done_ anythign to the device, but then it does
dev->current_state = PCI_D0;
Maybe the simplest thing to do migth be to replace that with a
pci_update_current_state(dev, PCI_D0);
instead, to try to read back the state explicitly in
pci_restore_standard_config()).
Best test:
Ben, does this trivial patch make any difference for those powermacs?
Linus
---
drivers/pci/pci.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 17bd932..97e1c38 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1419,7 +1419,7 @@ int pci_restore_standard_config(struct pci_dev *dev)
}
}
- dev->current_state = PCI_D0;
+ pci_update_current_state(dev, PCI_D0);
return 0;
}
^ permalink raw reply related [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:57 ` Benjamin Herrenschmidt
@ 2009-02-02 23:22 ` Rafael J. Wysocki
2009-02-03 1:03 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 23:22 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Monday 02 February 2009, Benjamin Herrenschmidt wrote:
> On Mon, 2009-02-02 at 14:15 -0800, Linus Torvalds wrote:
> >
> > On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
> > >
> > > I think it would be easier to make ACPI allow us to run AML with interrupts
> > > off.
> >
> > Well, I'd agree, except I have this strong memory of us having known bugs
> > with ACPI turning hard-interrupts on again. Similarly, it uses mutexes etc
> > that simply don't work with interrupts off and/or may turn them on again
> > thanks to scheduling.
> >
> > "Fixing" that seems not very easy. ACPI has a bad habit of being _really_
> > hard to fix in this area.
> >
> > I do agree that _if_ we can just fix ACPI, we wouldn't have these issues,
> > and we should just call it with interrupts disabled with our existing
> > code. But my previous email was a "maybe we can do it like this" kind of
> > thing, which might allow us to use ACPI with none of the irq-off issues.
>
> Len, what's your take here ? How much of that stuff is burried deep and
> how much is nicely split in a "helper" layer we could more easily fix ?
>
> I'm adding Ingo to the CC as he might have more ideas on how best to
> just make the mutexes work & not complain rather than touching ACPI
> itself... again, just like boot, might just be a matter of instructing
> the mutexes/lockdep to shut up and ignore in_atomic() in those "special"
> phases such as late suspend and early resume().
>
> That would help me for something else that broke recently too ... I have
> a special "hook" in radeonfb that my arch calls to resume it -very-
> early (interrupts off, I haven't even re-enabled the L2 cache). This is
> very useful to help debugging problems at resume since without that you
> basically don't see a thing and we have no serial port on most of these
> machines.
>
> However, that started breaking recently due to fb_set_suspend() calling
> into various bits of infrastructure that is no longer safe to call in
> atomic context.
>
> Here too, in fact, those -would- be safe since it's mostly a matter of
> teaching things like mutex of kmalloc that we are not in standard
> SYSTEM_RUNNING state, and thus mutex can just pretty much ignore the
> problem of being in atomic state and kmalloc/gfp could automatically
> degrade to GFP_ATOMIC (*)
>
> So it might just all be a matter of making might_sleep() shut up in
> late suspend/early resume, and possibly msleep() silently turn into
> mdelay or something like that. Just make sure we don't actually try to
> schedule (and possibly BUG_ON if we actually end up blocking on a mutex,
> we should not).
>
> Len, do you think that would work with ACPI or it's more convoluted than
> that ?
>
> (*) There are reasons to think that kmalloc/gfp should both silently
> turn into GFP_NOIO always while the suspend process is started, but
> that's somewhat a different subject. Rafael, did we ever act on that ?
> It's an old discussion we had but I don't know if we actually
> implemented anything.
We have the ->prepare(), ->complete() callbacks that, among other things,
can be used for allocating and freeing memory with GFP_KERNEL safely.
> IE. Without that, afaik, a driver that hasn't suspend yet might end up
> being blocked in some allocation somewhere due to an attempt to page
> things out on to an already sleeping device. That driver might be in
> such blockage while holding one of its internal mutexes or other thing
> that will cause it's own suspend routine later on to screw up. etc etc
> etc...
Yes, that's possible in theory, never observed in practice from what I can
tell.
> In general, best to avoid having to teach drivers that in suspend-land,
> non-atomic, allocations may block for ever. Better to make them all atomic
> magically.
Hm, atomic allocations may cause other problems to happen (ie. fail easily).
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:18 ` Linus Torvalds
@ 2009-02-02 23:45 ` Rafael J. Wysocki
2009-02-02 23:59 ` Linus Torvalds
2009-02-03 0:58 ` Benjamin Herrenschmidt
2009-02-03 3:51 ` Benjamin Herrenschmidt
2 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-02 23:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Linus Torvalds wrote:
>
> On Mon, 2 Feb 2009, Rafael J. Wysocki wrote:
> > >
> > > b) disable_device_irq()'s: things are live, but device interrupts
> > > are turned off by essentially looping over the irq_desc_ptr[]
> > > table.
> >
> > Well, do we actually need to turn off all device interrupts?
> >
> > Shared interrupts are the source of the problem, so perhaps we can
> > only disable interrupts of devices that use interrupt pins at this point
> > (MSI/MSI-X need not be disabled, for example, and the timer interrupts most
> > probably too)?
>
> We could try that, yes.
So, perhaps we can just loop over the interrupt links and disable them all
at this point?
> > > d) disable CPU interrupts.
> >
> > At what point do we disable the other CPUs?
>
> I left it out, because I don't much care or think it matters. So take your
> pick. I'd suggest keeping the current setup, and literally just insert the
> new point between "device_power_off()" and "sysdev_suspend()", with _zero_
> other changes.
Ah, OK.
> > Well, it means reworking the entire suspend sequence (again) or we will
> > break assumptions made by some existing drivers (interrupts off during
> > suspend_late and resume_early). And that affects all drivers, not only PCI.
>
> No it doesn't.
>
> No changes AT ALL to the suspend sequence. We do everything in the same
> order: look at my patch. The only difference is that instead of doing that
> "cli" we do the "for_each_irq(disable_irq)" instead (and do the 'cli' a
> bit later).
Sounds good.
> ZERO effect on drivers. The calling convention is 100% the same as far as
> the driver is concerned: ->suspend() is called with interrupts on and a
> fully working machine, and ->suspend_late() called with interrupts off.
>
> The only difference is the _mechanism_ of turning interrupts off. NOTHING
> else.
>
> > I first would like to understand what _exactly_ breaks on the iBook reported to
> > have problems.
>
> I bet it's code like the USB one:
>
> int usb_hcd_pci_resume(struct pci_dev *dev)
> {
>
> #ifdef CONFIG_PPC_PMAC
> /* Reenable ASIC clocks for USB */
> if (machine_is(powermac)) {
> struct device_node *of_node;
>
> of_node = pci_device_to_OF_node(dev);
> if (of_node)
> pmac_call_feature(PMAC_FTR_USB_ENABLE,
> of_node, 0, 1);
> }
> #endif
> ..
> retval = pci_enable_device(dev);
>
> and now pci_enable_device() calls pci_raw_set_power_state(), which does:
>
> if (dev->current_state == state) {
> /* we're already there */
> return 0;
> } else ..
>
> which means that it doesn't actually _do_ anything, because it thinks that
> 'current_state' was already PCI_D0. But if the device was totally turned
> off, that's wrong.
>
> (background: pci_restore_standard_config() will have done
> pci_raw_set_power_state(PCI_D0) with the device clocks off, which wouldn't
> actualyl have _done_ anythign to the device, but then it does
>
> dev->current_state = PCI_D0;
>
> Maybe the simplest thing to do migth be to replace that with a
>
> pci_update_current_state(dev, PCI_D0);
>
> instead, to try to read back the state explicitly in
> pci_restore_standard_config()).
>
> Best test:
>
> Ben, does this trivial patch make any difference for those powermacs?
>
> Linus
> ---
> drivers/pci/pci.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 17bd932..97e1c38 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1419,7 +1419,7 @@ int pci_restore_standard_config(struct pci_dev *dev)
> }
> }
>
> - dev->current_state = PCI_D0;
> + pci_update_current_state(dev, PCI_D0);
Good idea anyway.
> return 0;
> }
Or perhaps put
current_state = PCI_UNKNOWN;
under that 'if (machine_is(powermac))' ?
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:05 ` Linus Torvalds
` (2 preceding siblings ...)
2009-02-02 22:48 ` Benjamin Herrenschmidt
@ 2009-02-02 23:49 ` Ingo Molnar
2009-02-03 22:09 ` Rafael J. Wysocki
3 siblings, 1 reply; 98+ messages in thread
From: Ingo Molnar @ 2009-02-02 23:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Jesse Barnes, Andreas Schwab,
Len Brown
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Ingo: how do we walk over all the interrupt descriptors in todays world?
> The whole sparse-vs-nonsparse makes things a bit more complex.
this:
struct irq_desc *desc;
int irq;
for_each_irq_desc(i, desc) {
...
}
should do the trick.
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:45 ` Rafael J. Wysocki
@ 2009-02-02 23:59 ` Linus Torvalds
2009-02-03 0:15 ` Rafael J. Wysocki
2009-02-03 0:15 ` Linus Torvalds
0 siblings, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-02 23:59 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
>
> So, perhaps we can just loop over the interrupt links and disable them all
> at this point?
Well, we could, but I think it's actually inferior. Why? Because timer
interrupts can come over those too.
So we're really better off just looping over our own software irq_desc
arrays. That has the feature that
(a) we can easily mark the individual irqs if we want to - in ways that
we can never do if we end up looping over some hardware concept like
the IO-APIC inputs.
IOW, we can easily add a magic flag for registering a timer
interrupt (in fact, I guess we already have IRQF_TIMER), but we can
also expand on this in the future with other purely software
interrupt flags - ie a driver that knows it is irq-safe could just
tell us so, and we wouldn't need to disable that irq if there are
only irq-safe drivers registered on it.
(b) it's portable, and handles re-sending the interrupts correctly
(unlike just turning the interrupts off), and doesn't have any issues
with worrying about polarity etc complicating issues.
Looping over the irq links is really quite hard. Sure, we can do the
ACPI thing, but that's platform-specific, _and_ it's not actually
guaranteed to hit everything anyway (ie it likely only disables the
mapped ones, not necessarily at all directly connected ones - IDE
and USB tend to be direct irq's because they are often on the same
chip as the irq controller).
> > @@ -1419,7 +1419,7 @@ int pci_restore_standard_config(struct pci_dev *dev)
> > - dev->current_state = PCI_D0;
> > + pci_update_current_state(dev, PCI_D0);
>
> Good idea anyway.
Yeah, it probably makes sense to do this regardless of any PPC issues.
Exactly because we don't call the platform code.
> Or perhaps put
>
> current_state = PCI_UNKNOWN;
>
> under that 'if (machine_is(powermac))' ?
Yes. That would be ok, but in _theory_ ACPI could hit the same thing, and
then the pci_restore_standard_config() really needs to do that anyway.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:56 ` Rafael J. Wysocki
@ 2009-02-03 0:11 ` Benjamin Herrenschmidt
2009-02-03 0:21 ` Linus Torvalds
2009-02-10 20:25 ` Pavel Machek
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 0:11 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 23:56 +0100, Rafael J. Wysocki wrote:
> * The ACPI_MTX_INTERPRETER mutex needs to be acquired, but we know we won't
> need that mutex with interrupts off, so presumably we can work around this.
Or just make the mutex code not care when in the late suspend/early
resume. I think that's the right approach and involves no change to ACPI
itself.
> * Memory allocations with GFP_KERNEL are made, which is even worse, because
> we really shouldn't do that during suspend _at_ _all_, even during the regular
> ->suspend() with interrupts on, because there's not guarantee that swap will
> will be available at that time. So, for the sake of correctness, we should
> get rid of the GFP_KERNEL from the ACPI code paths executed during
> suspend-resume anyway.
Well, as I said, this is a problem with more than just ACPI and again, I
don't think the fix is to be done in ACPI itself. The buddy and slab
should basically stick NOIO or ATOMIC to any allocation done after we
started suspending the system.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:59 ` Linus Torvalds
@ 2009-02-03 0:15 ` Rafael J. Wysocki
2009-02-03 0:28 ` Linus Torvalds
2009-02-03 0:15 ` Linus Torvalds
1 sibling, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 0:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
> >
> > So, perhaps we can just loop over the interrupt links and disable them all
> > at this point?
>
> Well, we could, but I think it's actually inferior. Why? Because timer
> interrupts can come over those too.
>
> So we're really better off just looping over our own software irq_desc
> arrays. That has the feature that
>
> (a) we can easily mark the individual irqs if we want to - in ways that
> we can never do if we end up looping over some hardware concept like
> the IO-APIC inputs.
>
> IOW, we can easily add a magic flag for registering a timer
> interrupt (in fact, I guess we already have IRQF_TIMER), but we can
> also expand on this in the future with other purely software
> interrupt flags - ie a driver that knows it is irq-safe could just
> tell us so, and we wouldn't need to disable that irq if there are
> only irq-safe drivers registered on it.
>
> (b) it's portable, and handles re-sending the interrupts correctly
> (unlike just turning the interrupts off), and doesn't have any issues
> with worrying about polarity etc complicating issues.
>
> Looping over the irq links is really quite hard. Sure, we can do the
> ACPI thing, but that's platform-specific, _and_ it's not actually
> guaranteed to hit everything anyway (ie it likely only disables the
> mapped ones, not necessarily at all directly connected ones - IDE
> and USB tend to be direct irq's because they are often on the same
> chip as the irq controller).
OK
> > > @@ -1419,7 +1419,7 @@ int pci_restore_standard_config(struct pci_dev *dev)
> > > - dev->current_state = PCI_D0;
> > > + pci_update_current_state(dev, PCI_D0);
> >
> > Good idea anyway.
>
> Yeah, it probably makes sense to do this regardless of any PPC issues.
> Exactly because we don't call the platform code.
OK, I'll send a (yet another) patch to Jesse for that.
> > Or perhaps put
> >
> > current_state = PCI_UNKNOWN;
> >
> > under that 'if (machine_is(powermac))' ?
>
> Yes. That would be ok, but in _theory_ ACPI could hit the same thing, and
> then the pci_restore_standard_config() really needs to do that anyway.
OK
BTW, on the PMAC the problematic driver appears to be the radeon driver,
according to Ben, and the breakage is not related to USB.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:59 ` Linus Torvalds
2009-02-03 0:15 ` Rafael J. Wysocki
@ 2009-02-03 0:15 ` Linus Torvalds
1 sibling, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 0:15 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2 Feb 2009, Linus Torvalds wrote:
>
> IOW, we can easily add a magic flag for registering a timer
> interrupt (in fact, I guess we already have IRQF_TIMER)
Side note: we may have IRQF_TIMER (and some timers even set it), but since
it's not actually _used_ for anything it should come as no surprise that
it's not set by the x86 code, for example (nor by a lot of other
architectures).
But we can already notice system interrupts automatically by them being
registered by "setup_irq()" instead of "request_irq()". And on x86, many
of the really core interrupts aren't actually seen as device interrupts at
all, but are directly taken by the kernel through local apic stuff. So
they wouldn't get disabled anyway (example: local APIC timer interrupts
and the inter-processor IPI).
So it should be pretty easy to do a reasonable job with the information
that we already have available.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:11 ` Benjamin Herrenschmidt
@ 2009-02-03 0:21 ` Linus Torvalds
0 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 0:21 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
>
> > * Memory allocations with GFP_KERNEL are made, which is even worse, because
> > we really shouldn't do that during suspend _at_ _all_, even during the regular
> > ->suspend() with interrupts on, because there's not guarantee that swap will
> > will be available at that time. So, for the sake of correctness, we should
> > get rid of the GFP_KERNEL from the ACPI code paths executed during
> > suspend-resume anyway.
>
> Well, as I said, this is a problem with more than just ACPI and again, I
> don't think the fix is to be done in ACPI itself. The buddy and slab
> should basically stick NOIO or ATOMIC to any allocation done after we
> started suspending the system.
We already do have magic code for things like PF_MEMALLOC (recursion), or
TIF_MEMDIE (oom) or "(rt_task(p)) && !in_interrupt()" (give RT tasks
higher priorities) in the page allocator.
Having a special case for NOT_RUNNING wouldn't be _pretty_, but it
certainly isn't totally unreasonable either.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:00 ` Rafael J. Wysocki
@ 2009-02-03 0:23 ` Benjamin Herrenschmidt
2009-02-03 0:29 ` Rafael J. Wysocki
2009-02-03 0:44 ` Linus Torvalds
0 siblings, 2 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 0:23 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
> But we can't tell no one is holding the mutex in question, AFAICS.
>
> I'm afraid we'd really need a special "no mutexes, no GFP_KERNEL allocations"
> code path for that.
No, the mutex shouldn't be held already, if it is, you're probably
already in deep trouble. IE, you probably want to enfore that anyway,
ie, it wouldn't be very sane to suspend the machine while ACPI was
already in the -middle- of interpreting something anyway.
IE, you should have something to ensure, before you turn interrupts off,
that nobody else is inside the AML interpreter. You already know there
are no other CPUs, so it's just a matter of making sure no other process
has scheduled while holding that mutex.
The easy way to do that is to do something like taking the mutex
yourself and then setting a flag so that the intepreter stops trying to
take it or release it itself, maybe just using the global system state.
Then release the mutex on resume.
All of these are issues that exist today. IE. Regardless of that
powermac problem, which is unrelated (see other posts), I think these
things need to be sorted cleanly or suspend will not be as rock solid as
it could/should be. IE. It's several order of magnitude better than it
was, I agree, but I believe we have here a few reasonably simple things
we can/should do to make it more robust.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:15 ` Rafael J. Wysocki
@ 2009-02-03 0:28 ` Linus Torvalds
2009-02-03 1:12 ` Benjamin Herrenschmidt
2009-02-03 6:07 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 0:28 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
>
> BTW, on the PMAC the problematic driver appears to be the radeon driver,
> according to Ben, and the breakage is not related to USB.
Hmm. atyfb_base.c has the same kind of things with magic PMAC code, but it
doesn't follow the USB pattern - it just replaces "pci_set_power_state()"
_entirely_.
It seems a very interesting suspend routine, btw. It doesn't seem to do
any of the pci_save_state() at all, just re-initializes from scratch.
Maybe it is unhappy with the PM layer deciding to try to restore state
since it clearly didn't..
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:23 ` Benjamin Herrenschmidt
@ 2009-02-03 0:29 ` Rafael J. Wysocki
2009-02-03 0:44 ` Linus Torvalds
1 sibling, 0 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 0:29 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Benjamin Herrenschmidt wrote:
>
> > But we can't tell no one is holding the mutex in question, AFAICS.
> >
> > I'm afraid we'd really need a special "no mutexes, no GFP_KERNEL allocations"
> > code path for that.
>
> No, the mutex shouldn't be held already, if it is, you're probably
> already in deep trouble. IE, you probably want to enfore that anyway,
> ie, it wouldn't be very sane to suspend the machine while ACPI was
> already in the -middle- of interpreting something anyway.
>
> IE, you should have something to ensure, before you turn interrupts off,
> that nobody else is inside the AML interpreter. You already know there
> are no other CPUs, so it's just a matter of making sure no other process
> has scheduled while holding that mutex.
>
> The easy way to do that is to do something like taking the mutex
> yourself and then setting a flag so that the intepreter stops trying to
> take it or release it itself, maybe just using the global system state.
>
> Then release the mutex on resume.
Yes, that should work.
> All of these are issues that exist today. IE. Regardless of that
> powermac problem, which is unrelated (see other posts), I think these
> things need to be sorted cleanly or suspend will not be as rock solid as
> it could/should be. IE. It's several order of magnitude better than it
> was, I agree, but I believe we have here a few reasonably simple things
> we can/should do to make it more robust.
Agreed.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:23 ` Benjamin Herrenschmidt
2009-02-03 0:29 ` Rafael J. Wysocki
@ 2009-02-03 0:44 ` Linus Torvalds
2009-02-03 1:32 ` Benjamin Herrenschmidt
2009-02-03 5:06 ` Ingo Molnar
1 sibling, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 0:44 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
>
> IE, you should have something to ensure, before you turn interrupts off,
> that nobody else is inside the AML interpreter. You already know there
> are no other CPUs, so it's just a matter of making sure no other process
> has scheduled while holding that mutex.
>
> The easy way to do that is to do something like taking the mutex
> yourself and then setting a flag so that the intepreter stops trying to
> take it or release it itself, maybe just using the global system state.
>
> Then release the mutex on resume.
Why do you think this improves on anything?
Basically, it turns the mutex into a non-entity - but if your whole
argument is that it might as well be a non-entity because nobody else can
take it anyway, then why not just leave it around?
IOW, if your argument boils down to "there can be no contention", then you
might as well say "just use the mutex, it will never block".
So the only thing you really need is to just disable the _debugging_ code
that mutexes have (if they get built with debugging in the first place).
I can't find the bothersome code anyway: I do find
DEBUG_LOCKS_WARN_ON(in_interrupt());
but that's just saying that you shouldn't be using mutexes from
interrupts, not from irq-off segments. There's probably something I'm
missing, like the preempt_check_resched() causing a schedule event with
irq's disabled, and the "might_sleep()" thing. But the latter should
already be disabled by the "system_state != SYSTEM_RUNNING" thing.
Oh, except we don't seem to have a SYSTEM_SUSPEND thing. Is that what
people are complaining about?
Just do
system_state = SYSTEM_SUSPEND;
..
system_state = SYSTEM_RUNNING;
around the irqs-off section in suspending if so.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:18 ` Linus Torvalds
2009-02-02 23:45 ` Rafael J. Wysocki
@ 2009-02-03 0:58 ` Benjamin Herrenschmidt
2009-02-03 3:51 ` Benjamin Herrenschmidt
2 siblings, 0 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 0:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
> (background: pci_restore_standard_config() will have done
> pci_raw_set_power_state(PCI_D0) with the device clocks off, which wouldn't
> actualyl have _done_ anythign to the device, but then it does
Actually, we are lucky we don't even crash with a machine check when
hitting an unclocked device ...
I found a problem with radeonfb (patch sent separately) that explains
the breakage on one of my recent machines, but I wouldn't be surprised
if some other models of powerbooks/ibooks have also problem with USB
as you mentioned. IE. I have to check whether the ones doing PCI D state
can also use the Apple specific clock stuff. We might be lucky and they
are exclusive from each other.
Also, the nice thing with turning clocks off on these Apple ASICs is
that it doesn't lose the config space content. So we may just get lucky
here.
> Ben, does this trivial patch make any difference for those powermacs?
On my test setup, the problem is radeonfb and I just made a patch to fix
it. But it's using some standard PCI USB, not the older fancy Apple
stuff. I'll check on an older machine tongight when I get back home, I
don't have one here.
Cheers,
Ben.
> Linus
> ---
> drivers/pci/pci.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 17bd932..97e1c38 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1419,7 +1419,7 @@ int pci_restore_standard_config(struct pci_dev *dev)
> }
> }
>
> - dev->current_state = PCI_D0;
> + pci_update_current_state(dev, PCI_D0);
>
> return 0;
> }
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:22 ` Rafael J. Wysocki
@ 2009-02-03 1:03 ` Benjamin Herrenschmidt
2009-02-10 20:25 ` kmalloc during suspend, was " Pavel Machek
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 1:03 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
> > (*) There are reasons to think that kmalloc/gfp should both silently
> > turn into GFP_NOIO always while the suspend process is started, but
> > that's somewhat a different subject. Rafael, did we ever act on that ?
> > It's an old discussion we had but I don't know if we actually
> > implemented anything.
>
> We have the ->prepare(), ->complete() callbacks that, among other things,
> can be used for allocating and freeing memory with GFP_KERNEL safely.
First, that's assuming drivers will be smart enough to figure that out.
They won't, believe me.
Then, as I said, this doesn't work in practice because there is no way
drivers and/or underlying subsystems will start "remembering" when they
are within a prepare/complete pair, and suddenly do allocations
differently. That's just going to break.
It's the allocator itself that -must- degrade to NOIO or ATOMIC, or
we'll just never get it right imho.
> Yes, that's possible in theory, never observed in practice from what I can
> tell.
Sure, neither did I, though I could manufacture a case. IE, you'll need
memory pressure at suspend time to start having problems with
allocations blocking on disk access. So I'm sure most of the time we
never hit it ... until we do one day and don't know why the whole system
deadlocked somewhere in suspend or resume.
Here we have a few low hanging fruits like that we know about, that can
make the difference between suspend/resume works always vs. it works
-most- of the time, unless I happened to have been under heavy memory
pressure while playing an MP3 on the train a 3rd day of the month....
IE. We don't want that rare case to break, because you can be sure that
that once out of 1000 times where it does is going to piss me off real
bad and probably just the day I typed something for 2h in a row without
saving etc...
> Hm, atomic allocations may cause other problems to happen (ie. fail easily).
That's true, but at the end of the day, if the choice is between
deadlock and failure, pick your medicine...
Maybe we could do it differently and have the allocators basically just
give up if they are going to trigger IOs or something but at the end of
the day, we need to make it robust, not just "will work most of the
time".
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:28 ` Linus Torvalds
@ 2009-02-03 1:12 ` Benjamin Herrenschmidt
2009-02-03 1:32 ` Linus Torvalds
2009-02-03 6:07 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 1:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 16:28 -0800, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
> >
> > BTW, on the PMAC the problematic driver appears to be the radeon driver,
> > according to Ben, and the breakage is not related to USB.
> Hmm. atyfb_base.c has the same kind of things with magic PMAC code, but it
> doesn't follow the USB pattern - it just replaces "pci_set_power_state()"
> _entirely_.
radeonfb_pm.c is the culprit here, I haven't looked at atyfb (mach64)
yet, but it has similar issues yes. I just attached a patch to the BZ
cleaning up radeonfb. I can have a look at doing a similar cleanup to
atyfb and aty128fb next.
> It seems a very interesting suspend routine, btw. It doesn't seem to do
> any of the pci_save_state() at all, just re-initializes from scratch.
> Maybe it is unhappy with the PM layer deciding to try to restore state
> since it clearly didn't..
Well, I wrote that so ... :-) bear in mind that this code was mostly
written way before the PCI layer knew how to save and restore state.
Also, atyfb has some cases of chips that don't support PCI D states but
use private MMIO register to control suspend/resume.
Anyway, my proposed radeonfb patch is at:
http://bugzilla.kernel.org/attachment.cgi?id=20085
I'll look at cleaning up atyfb and aty128fb later today if needed.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 1:12 ` Benjamin Herrenschmidt
@ 2009-02-03 1:32 ` Linus Torvalds
2009-02-03 1:46 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 1:32 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
>
> Anyway, my proposed radeonfb patch is at:
>
> http://bugzilla.kernel.org/attachment.cgi?id=20085
>
> I'll look at cleaning up atyfb and aty128fb later today if needed.
Ok.
Note how the PCI layer (currently) only saves the low _16_ dwords of
config space (64 bytes). The radeonfb code that did it by hand saved the
whole 64 dwords (256 bytes). It _probably_ doesn't matter, but..
The reason the PCI layer only does 64 bytes is that that was the really
old PCI config space model - the rest was undefined and apparently a few
cards reportedly even crashed when accessing it (but who knows, that may
be urban folklore).
But if that patch works for you, it's clearly already better than _not_
applying the patch, so..
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:44 ` Linus Torvalds
@ 2009-02-03 1:32 ` Benjamin Herrenschmidt
2009-02-03 5:06 ` Ingo Molnar
1 sibling, 0 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 1:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
> Why do you think this improves on anything?
>
> Basically, it turns the mutex into a non-entity - but if your whole
> argument is that it might as well be a non-entity because nobody else can
> take it anyway, then why not just leave it around?
>
> IOW, if your argument boils down to "there can be no contention", then you
> might as well say "just use the mutex, it will never block".
Oh sure, you all argument is fine, there's just one little nit for which
I think it made sense to take the mutex that way...
The possible problem I see is on the "boundary" when you are about to
get interrupts off. You want to make sure that any other kernel thread /
process that was inside ACPI is completed before you do that.
Taking the mutex will do that for you.
Then you can disable interrupts and just ignore the mutex until you
re-enable them.
Now, of course, in a perfect world, anything that can poke ACPI would have
got some suspend calls before hand and won't be around touching it but
I wouldn't bet my life on this.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 1:32 ` Linus Torvalds
@ 2009-02-03 1:46 ` Benjamin Herrenschmidt
2009-02-03 3:30 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 1:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
> Note how the PCI layer (currently) only saves the low _16_ dwords of
> config space (64 bytes). The radeonfb code that did it by hand saved the
> whole 64 dwords (256 bytes). It _probably_ doesn't matter, but..
>
> The reason the PCI layer only does 64 bytes is that that was the really
> old PCI config space model - the rest was undefined and apparently a few
> cards reportedly even crashed when accessing it (but who knows, that may
> be urban folklore).
>
> But if that patch works for you, it's clearly already better than _not_
> applying the patch, so..
Radeons don't do much with config space... the worst we may miss I
suppose is subsystem vendor/device... Maybe I'll add something to
explicitely save and restore it or X might get upset. I'll have a look.
At some point, that whole code will migrate out of radeonfb into the new
radeon DRM with kernel mode setting in which case even X should stop
caring ... mostly.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 1:46 ` Benjamin Herrenschmidt
@ 2009-02-03 3:30 ` Benjamin Herrenschmidt
2009-02-03 3:47 ` Linus Torvalds
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 3:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 2009-02-03 at 12:46 +1100, Benjamin Herrenschmidt wrote:
> > Note how the PCI layer (currently) only saves the low _16_ dwords of
> > config space (64 bytes). The radeonfb code that did it by hand saved the
> > whole 64 dwords (256 bytes). It _probably_ doesn't matter, but..
> >
> > The reason the PCI layer only does 64 bytes is that that was the really
> > old PCI config space model - the rest was undefined and apparently a few
> > cards reportedly even crashed when accessing it (but who knows, that may
> > be urban folklore).
> >
> > But if that patch works for you, it's clearly already better than _not_
> > applying the patch, so..
>
> Radeons don't do much with config space... the worst we may miss I
> suppose is subsystem vendor/device... Maybe I'll add something to
> explicitely save and restore it or X might get upset. I'll have a look.
Actually, subsystem stuff is below 0x40 so it should be fine too.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 3:30 ` Benjamin Herrenschmidt
@ 2009-02-03 3:47 ` Linus Torvalds
2009-02-03 4:03 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 3:47 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
> On Tue, 2009-02-03 at 12:46 +1100, Benjamin Herrenschmidt wrote:
> >
> > Radeons don't do much with config space... the worst we may miss I
> > suppose is subsystem vendor/device... Maybe I'll add something to
> > explicitely save and restore it or X might get upset. I'll have a look.
>
> Actually, subsystem stuff is below 0x40 so it should be fine too.
The things above 0x40 tend to be:
- capabilities (values and next-pointers)
The PCI layer will save a random couple of these (read: the ones it
cares about)
- random non-architected values specific to that chip. And sometimes
these are important. Like ISA interrupt routing information for cardbus
controllers. Or timing values set up by the BIOS.
but in 99% of all cases the stuff isn't really anything special, or is
just fine at its cold-reset default values. But I could in theory see that
some graphics card could hide things like DRAM timings in there, that get
initialized on POST, but not (obviously) on a STR cycle.
So it could go either way. Totally unimportant or hugely important. Most
cards don't seem to care, and saving just the low 64 bytes seems to work
for almost all drivers.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:18 ` Linus Torvalds
2009-02-02 23:45 ` Rafael J. Wysocki
2009-02-03 0:58 ` Benjamin Herrenschmidt
@ 2009-02-03 3:51 ` Benjamin Herrenschmidt
2009-02-03 3:55 ` Benjamin Herrenschmidt
2009-02-03 9:26 ` Rafael J. Wysocki
2 siblings, 2 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 3:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 15:18 -0800, Linus Torvalds wrote:
> (background: pci_restore_standard_config() will have done
> pci_raw_set_power_state(PCI_D0) with the device clocks off, which
> wouldn't
> actualyl have _done_ anythign to the device, but then it does
>
> dev->current_state = PCI_D0;
>
> Maybe the simplest thing to do migth be to replace that with a
>
> pci_update_current_state(dev, PCI_D0);
>
> instead, to try to read back the state explicitly in
> pci_restore_standard_config()).
Note that I notice on various powerbook models a bunch of things like
<some pci device> : restoring config space at offset XX (was: 0xffffffff, writin 0 ...)
I'm not sure if that will do any good :-)
The reason is that some devices (here sungem, ide-pmac and the firewire OHCI) -do-
get clock gated until either you hit some of my magic pmac_feature calls in
the driver or the driver calls pci_enable_device() (I have a hook there).
So far, it appears to be harmless ... at least on the one machine I have here,
but it shows that we are doing something wrong.
So to recapitulate our entire discussion, let me know if I'm off the mark here:
- We agree that in general, the arch must have a chance to turn things on
before we do anything such as restoring the config space. This includes
my pmac clock gating stuff and ACPI
- The problem with ACPI is that it needs interrupts because it uses mutexes,
GFP_KERNEL, etc.... There are two approaches to solving that problem. We can
either make the mutexes not warn etc... or we can use a loop of disable_irq()
instead of local_irq_save() between our normal and our "late" phase of PM.
- There are some arguments for the fact that making mutexes not warn and
GFP_KERNEL silently degrade to at least GFP_NOIO is a good thing to do at
that stage in the suspend process (and the GFP_KERNEL -> NOIO should probably
happen even earlier) so we might get ACPI for free there.
Now, here's just a few things to throw into the discussion:
- I think sysdevs should still have real IRQs off (local_irq_off) since that
includes shutting down the PIC itself, the timers, etc...
- Doing the disable_irq() loop instead of local_irq_save() will introduce one
subtle change to driver semantics, ie, they -might- be entered while in their
suspend_late() via some timer interrupt (ie, IO scheduler acting on a timer,
network stuff, tty output for the serial driver, etc...) while we had a pretty
strong guarantee with the local_irq_save() case that nothing would happen at
that stage, this isn't true anymore. That means more chances for drivers to
get it wrong...
- Because of the two things above, I tend to favor myself the option of
making mutexes not warn etc...
- There are some additional reasons where generalizing that to the "late"
suspend and early resume stages could be useful, such as allowing early video
resume of fbdev/fbcon or the new KMS etc... while those things may internally
use mutexes or do GFP_KERNEL allocs etc... One example hitting me right now is
my AGP PM code that I need to run early to bring the video back properly,
which calls agp_collect_device_status() which calls pci_get_class() which
does a kzalloc(...,GFP_KERNEL). Another one is the fb_set_suspend() itself
uses the fb_notifier which is a blocking notifier. This is annoying because
this notifier isn't used only for suspend/resume, for for other things that
can legitimately be blocking, so making it an atomic notifier isn't really
the right options, etc...
For those reason, I think we need to look at the few changes to the
scheduler/mutex code, might_sleep() and allocators, so that:
- During the main suspend/resume phase: GFP_KERNEL -> GFP_NOIO
- During the "irq off" phase:
* GFP_KERNEL/NOIO -> GFP_ATOMIC
* msleep -> mdelay
* might_sleep() doesn't warn
* actual blocking on a mutex errors out verbosely
That would allow a whole chunk of code to be used in the later phase as if it
was boot time, including probably ACPI. Of course, we still want to be careful,
we aren't going to allow explicit scheduling, ie, things like sysdev's should
still be written with the idea that it -is- atomic context, but at least cases
such as the ones we've exposed above would be much more manageable.
>From there, we can then rework PCI to properly call the arch stuff first thing
on power on and be happy with it.
Comments ?
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 3:51 ` Benjamin Herrenschmidt
@ 2009-02-03 3:55 ` Benjamin Herrenschmidt
2009-02-03 4:09 ` Linus Torvalds
2009-02-03 9:26 ` Rafael J. Wysocki
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 3:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
> Note that I notice on various powerbook models a bunch of things like
>
> <some pci device> : restoring config space at offset XX (was: 0xffffffff, writin 0 ...)
>
> I'm not sure if that will do any good :-)
>
> The reason is that some devices (here sungem, ide-pmac and the firewire OHCI) -do-
> get clock gated until either you hit some of my magic pmac_feature calls in
> the driver or the driver calls pci_enable_device() (I have a hook there).
>
> So far, it appears to be harmless ... at least on the one machine I have here,
> but it shows that we are doing something wrong.
Another little nit here...
The new code will basically make the core bring back everything to D0 on
resume. That's a huge power surge and often not needed at all...
If we set aside the shared interrupt problem for a moment, why should
waking up from sleep suddenly D0 everything in my system including stuff
that might have self suspended due to not being used (wifi killswitched,
audio off, etc...) and is still not going to be used until I actually
decide to do so (aka the user)...
I suppose the drivers can then turn those things back off right away but
it's still a bit fishy...
Of course, drivers that can do that sort of dynamic power management
must already be ready to cope with interrupts coming from a shared line
while asleep...
Maybe we need some flags in the drivers telling the core "I know what
I'm doing and I'm ready to take shared interrupts after suspended, but
please don't mess with my device behind my back"... but then, even
stupid driver writers will abuse the flags ... hard.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 3:47 ` Linus Torvalds
@ 2009-02-03 4:03 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 4:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 19:47 -0800, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
> > On Tue, 2009-02-03 at 12:46 +1100, Benjamin Herrenschmidt wrote:
> > >
> > > Radeons don't do much with config space... the worst we may miss I
> > > suppose is subsystem vendor/device... Maybe I'll add something to
> > > explicitely save and restore it or X might get upset. I'll have a look.
> >
> > Actually, subsystem stuff is below 0x40 so it should be fine too.
>
> The things above 0x40 tend to be:
>
> - capabilities (values and next-pointers)
>
> The PCI layer will save a random couple of these (read: the ones it
> cares about)
>
> - random non-architected values specific to that chip. And sometimes
> these are important. Like ISA interrupt routing information for cardbus
> controllers. Or timing values set up by the BIOS.
Right, I've seen such things, but in our case, all the radeons I support
in there should be ok with just 64 bytes.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 3:55 ` Benjamin Herrenschmidt
@ 2009-02-03 4:09 ` Linus Torvalds
2009-02-03 4:21 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 4:09 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
>
> The new code will basically make the core bring back everything to D0 on
> resume. That's a huge power surge and often not needed at all...
Well, we could easily do it only if it was D0 _before_ resume.
But quite frankly, before I worry about "huge power surge" issues, I want
to feel like the architecture allows for clueless driver writers, and
still giving us reliable suspend/resume.
Once we have that reliability, and only _then_, do I care at all about
power issues. Make it work first, and not result in a dead machine if
there is unlucky shared interrupt timings.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 4:09 ` Linus Torvalds
@ 2009-02-03 4:21 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 4:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 20:09 -0800, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
> >
> > The new code will basically make the core bring back everything to D0 on
> > resume. That's a huge power surge and often not needed at all...
>
> Well, we could easily do it only if it was D0 _before_ resume.
>
> But quite frankly, before I worry about "huge power surge" issues, I want
> to feel like the architecture allows for clueless driver writers, and
> still giving us reliable suspend/resume.
>
> Once we have that reliability, and only _then_, do I care at all about
> power issues. Make it work first, and not result in a dead machine if
> there is unlucky shared interrupt timings.
Right.
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:44 ` Linus Torvalds
2009-02-03 1:32 ` Benjamin Herrenschmidt
@ 2009-02-03 5:06 ` Ingo Molnar
2009-02-03 11:06 ` Peter Zijlstra
1 sibling, 1 reply; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 5:06 UTC (permalink / raw)
To: Linus Torvalds, Peter Zijlstra
Cc: Benjamin Herrenschmidt, Rafael J. Wysocki,
Linux Kernel Mailing List, Jesse Barnes, Andreas Schwab,
Len Brown
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
> >
> > IE, you should have something to ensure, before you turn interrupts off,
> > that nobody else is inside the AML interpreter. You already know there
> > are no other CPUs, so it's just a matter of making sure no other process
> > has scheduled while holding that mutex.
> >
> > The easy way to do that is to do something like taking the mutex
> > yourself and then setting a flag so that the intepreter stops trying to
> > take it or release it itself, maybe just using the global system state.
> >
> > Then release the mutex on resume.
>
> Why do you think this improves on anything?
>
> Basically, it turns the mutex into a non-entity - but if your whole
> argument is that it might as well be a non-entity because nobody else can
> take it anyway, then why not just leave it around?
>
> IOW, if your argument boils down to "there can be no contention", then you
> might as well say "just use the mutex, it will never block".
>
> So the only thing you really need is to just disable the _debugging_ code
> that mutexes have (if they get built with debugging in the first place).
>
> I can't find the bothersome code anyway: I do find
>
> DEBUG_LOCKS_WARN_ON(in_interrupt());
>
> but that's just saying that you shouldn't be using mutexes from
> interrupts, not from irq-off segments. There's probably something I'm
> missing, like the preempt_check_resched() causing a schedule event with
> irq's disabled, and the "might_sleep()" thing. But the latter should
> already be disabled by the "system_state != SYSTEM_RUNNING" thing.
Mutexes should work just fine in irqs-off sections - they'll safely
save/restore interrupts, even the debug variants.
We used to have code in the mutex code that unconditionally enabled
interrupts (a spin_unlock_irq() iirc) - but we fixed that pretty
early on because it surprised some early boot code. Maybe this is
the case you remember?
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 0:28 ` Linus Torvalds
2009-02-03 1:12 ` Benjamin Herrenschmidt
@ 2009-02-03 6:07 ` Benjamin Herrenschmidt
2009-02-03 15:48 ` Linus Torvalds
2009-02-03 16:33 ` Jesse Barnes
1 sibling, 2 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 6:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Mon, 2009-02-02 at 16:28 -0800, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
> >
> > BTW, on the PMAC the problematic driver appears to be the radeon driver,
> > according to Ben, and the breakage is not related to USB.
>
> Hmm. atyfb_base.c has the same kind of things with magic PMAC code, but it
> doesn't follow the USB pattern - it just replaces "pci_set_power_state()"
> _entirely_.
>
> It seems a very interesting suspend routine, btw. It doesn't seem to do
> any of the pci_save_state() at all, just re-initializes from scratch.
> Maybe it is unhappy with the PM layer deciding to try to restore state
> since it clearly didn't..
In fact, atyfb is also busted in -rc3 for different reasons though
probably by the same patch. I just dug an old powerbook with a mach64
and it crashes on resume with a machine check in there. (Among 2 or 3
other problems introduced by recent kernels, such as pci_get_* now
kmallocs with GFP_KERNEL internally which makes it WARN when I use it in
my low level suspend/resume code to whack the memory controller, etc...
this one is going to bite others I reckon, or IDE having some interrupt
problems on resume).
Adding a pci_save_state() to atyfb_pci_suspend() and a
pci_set_power_state() + pci_restore_state() at the beginning of
atyfb_pci_resume() fixes the machine check here.
Now where it gets funny is that I've added code to read the BAR and
command register content before, between, and after those calls and
print it and .. they are sane... Until i discovered that what happens is
that the new generic code seems to actually blast 0 all over my config
space if I don't call pci_save_state() in suspend(). I suppose I was
missing a "mandatory" call here... but the core should be more robust,
ie it shouldn't erase the config space of something because a driver
"forgot" to call pci_save_state() !
I'll do a patch fixing atyfb. I'll check if aty128fb tomorrow when
paulus brings a machine with such a chip in, but it looks like it will
have the same problem.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 3:51 ` Benjamin Herrenschmidt
2009-02-03 3:55 ` Benjamin Herrenschmidt
@ 2009-02-03 9:26 ` Rafael J. Wysocki
2009-02-03 17:04 ` Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early) Rafael J. Wysocki
2009-02-03 17:53 ` PCI PM: Restore standard config registers of all devices early Linus Torvalds
1 sibling, 2 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 9:26 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Benjamin Herrenschmidt wrote:
> On Mon, 2009-02-02 at 15:18 -0800, Linus Torvalds wrote:
> > (background: pci_restore_standard_config() will have done
> > pci_raw_set_power_state(PCI_D0) with the device clocks off, which
> > wouldn't
> > actualyl have _done_ anythign to the device, but then it does
> >
> > dev->current_state = PCI_D0;
> >
> > Maybe the simplest thing to do migth be to replace that with a
> >
> > pci_update_current_state(dev, PCI_D0);
> >
> > instead, to try to read back the state explicitly in
> > pci_restore_standard_config()).
>
> Note that I notice on various powerbook models a bunch of things like
>
> <some pci device> : restoring config space at offset XX (was: 0xffffffff, writin 0 ...)
>
> I'm not sure if that will do any good :-)
>
> The reason is that some devices (here sungem, ide-pmac and the firewire OHCI) -do-
> get clock gated until either you hit some of my magic pmac_feature calls in
> the driver or the driver calls pci_enable_device() (I have a hook there).
>
> So far, it appears to be harmless ... at least on the one machine I have here,
> but it shows that we are doing something wrong.
>
> So to recapitulate our entire discussion, let me know if I'm off the mark here:
>
> - We agree that in general, the arch must have a chance to turn things on
> before we do anything such as restoring the config space. This includes
> my pmac clock gating stuff and ACPI
Well, in fact that depends on whether or not the $subject patch causes more
breakage to happen. If it doesn't, then this probably is only a theoretical
issue.
That said, since we appear to be able to fix it, we should do that anyway IMO.
> - The problem with ACPI is that it needs interrupts because it uses mutexes,
> GFP_KERNEL, etc.... There are two approaches to solving that problem. We can
> either make the mutexes not warn etc... or we can use a loop of disable_irq()
> instead of local_irq_save() between our normal and our "late" phase of PM.
>
> - There are some arguments for the fact that making mutexes not warn and
> GFP_KERNEL silently degrade to at least GFP_NOIO is a good thing to do at
> that stage in the suspend process (and the GFP_KERNEL -> NOIO should probably
> happen even earlier) so we might get ACPI for free there.
I like the "loop of disable_irq()" idea, with a few modifications (see below).
> Now, here's just a few things to throw into the discussion:
>
> - I think sysdevs should still have real IRQs off (local_irq_off) since that
> includes shutting down the PIC itself, the timers, etc...
Agreed. That's quite easy to implement, though.
> - Doing the disable_irq() loop instead of local_irq_save() will introduce one
> subtle change to driver semantics, ie, they -might- be entered while in their
> suspend_late() via some timer interrupt (ie, IO scheduler acting on a timer,
> network stuff, tty output for the serial driver, etc...) while we had a pretty
> strong guarantee with the local_irq_save() case that nothing would happen at
> that stage, this isn't true anymore. That means more chances for drivers to
> get it wrong...
Fortunately, only a few drivers implement suspend_late, so that should be
easy to audit.
Also, the driver core code will probably need some modifications, because it
was written with the assumption that interrupts will be off in certain places,
but that's not a very big deal.
> - Because of the two things above, I tend to favor myself the option of
> making mutexes not warn etc...
>
> - There are some additional reasons where generalizing that to the "late"
> suspend and early resume stages could be useful, such as allowing early video
> resume of fbdev/fbcon or the new KMS etc... while those things may internally
> use mutexes or do GFP_KERNEL allocs etc... One example hitting me right now is
> my AGP PM code that I need to run early to bring the video back properly,
> which calls agp_collect_device_status() which calls pci_get_class() which
> does a kzalloc(...,GFP_KERNEL). Another one is the fb_set_suspend() itself
> uses the fb_notifier which is a blocking notifier. This is annoying because
> this notifier isn't used only for suspend/resume, for for other things that
> can legitimately be blocking, so making it an atomic notifier isn't really
> the right options, etc...
>
> For those reason, I think we need to look at the few changes to the
> scheduler/mutex code, might_sleep() and allocators, so that:
>
> - During the main suspend/resume phase: GFP_KERNEL -> GFP_NOIO
That will have to be done anyway at one point IMO.
> - During the "irq off" phase:
> * GFP_KERNEL/NOIO -> GFP_ATOMIC
> * msleep -> mdelay
> * might_sleep() doesn't warn
> * actual blocking on a mutex errors out verbosely
That should not be necessary if we do the "loop of disable_irq()" version
(except for the GFP_KERNEL -> GFP_NOIO thing).
> That would allow a whole chunk of code to be used in the later phase as if it
> was boot time, including probably ACPI. Of course, we still want to be careful,
> we aren't going to allow explicit scheduling, ie, things like sysdev's should
> still be written with the idea that it -is- atomic context, but at least cases
> such as the ones we've exposed above would be much more manageable.
>
> >From there, we can then rework PCI to properly call the arch stuff first thing
> on power on and be happy with it.
>
> Comments ?
As I said, I tend to prefer the "loop of disable_irq()" approach, because it
would allow us to preserve the current ordering of ACPI operations. Namely,
if we do:
suspend devices (normal suspend)
loop of disable_irq()
late suspend of devices
_PTS
disable nonboot CPUs
local_irq_disable()
sysdev suspend
enter sleep state
get control from the BIOS
sysdev resume
(*)
local_irq_enable()
enable nonboot CPUs
_WAK
early resume of devices
loop of enable_irq()
resume devices (normal resume)
the ordering of _PTS with respect to putting devices into low power states and
disabling the nonboot CPUs will be the same as it is now and the same applies
to _WAK and putting devices into D0 etc. (I really _really_ wouldn't like to
change this ordering, since this alone is likely to break things badly).
Now, there's one subtle problem with resume in this picture. Namely, before
running the "early resume of devices" we have to make sure that the interrupts
will be masked. However, masking MSI-X, for example, means writing into
the memory space of the device, so we can't do it at this point. Of course, we
can assume that MSI/MSI-X will be masked when we get control from the BIOS
(moreover, they are not shareable, so we can just ignore them at this point),
but still we'll have to mask the other interrupts before doing the
local_irq_enable() on resume - marked by the (*) above. This appears to be
doable, though.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 5:06 ` Ingo Molnar
@ 2009-02-03 11:06 ` Peter Zijlstra
2009-02-03 12:09 ` Ingo Molnar
0 siblings, 1 reply; 98+ messages in thread
From: Peter Zijlstra @ 2009-02-03 11:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Benjamin Herrenschmidt, Rafael J. Wysocki,
Linux Kernel Mailing List, Jesse Barnes, Andreas Schwab,
Len Brown
On Tue, 2009-02-03 at 06:06 +0100, Ingo Molnar wrote:
>
> Mutexes should work just fine in irqs-off sections - they'll safely
> save/restore interrupts, even the debug variants.
mutex_lock() has might_sleep() which will generate a splat in atomic
contexts.
That said, afaik the acpi code uses semaphores.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 11:06 ` Peter Zijlstra
@ 2009-02-03 12:09 ` Ingo Molnar
0 siblings, 0 replies; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 12:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Benjamin Herrenschmidt, Rafael J. Wysocki,
Linux Kernel Mailing List, Jesse Barnes, Andreas Schwab,
Len Brown
* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2009-02-03 at 06:06 +0100, Ingo Molnar wrote:
> >
> > Mutexes should work just fine in irqs-off sections - they'll safely
> > save/restore interrupts, even the debug variants.
>
> mutex_lock() has might_sleep() which will generate a splat in atomic
> contexts.
>
> That said, afaik the acpi code uses semaphores.
even the might_sleep() should be OK with Linus's system_state change to the
suspend code:
void __might_sleep(char *file, int line)
{
#ifdef in_atomic
static unsigned long prev_jiffy; /* ratelimiting */
if ((!in_atomic() && !irqs_disabled()) ||
system_state != SYSTEM_RUNNING || oops_in_progress)
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 6:07 ` Benjamin Herrenschmidt
@ 2009-02-03 15:48 ` Linus Torvalds
2009-02-03 22:59 ` Benjamin Herrenschmidt
2009-02-03 16:33 ` Jesse Barnes
1 sibling, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 15:48 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Benjamin Herrenschmidt wrote:
>
> Now where it gets funny is that I've added code to read the BAR and
> command register content before, between, and after those calls and
> print it and .. they are sane... Until i discovered that what happens is
> that the new generic code seems to actually blast 0 all over my config
> space if I don't call pci_save_state() in suspend(). I suppose I was
> missing a "mandatory" call here... but the core should be more robust,
> ie it shouldn't erase the config space of something because a driver
> "forgot" to call pci_save_state() !
You've found a bug somewhere.
We _should_ be saving things, the legacy code does something like this:
if (drv && drv->suspend) {
pci_dev->state_saved = false;
i = drv->suspend(pci_dev, state);
suspend_report_result(drv->suspend, i);
if (i)
return i;
if (pci_dev->state_saved)
goto Fixup;
if (WARN_ON_ONCE(pci_dev->current_state != PCI_D0))
goto Fixup;
}
pci_save_state(pci_dev);
ie if your ->suspend function doesn't use pci_save_state() itself (which
sets that "state_saved" flag to true), then the generic code will do it
for you.
Also, on the resume path, we actually have
if (pci_dev->state_saved)
pci_restore_standard_config(pci_dev);
so I wonder how the heck you got that blast of all zeroes - because we
clearly shouldn't be trying to restore any unsaved state!
So if you can figure out how it does all that...
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 6:07 ` Benjamin Herrenschmidt
2009-02-03 15:48 ` Linus Torvalds
@ 2009-02-03 16:33 ` Jesse Barnes
1 sibling, 0 replies; 98+ messages in thread
From: Jesse Barnes @ 2009-02-03 16:33 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Rafael J. Wysocki, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Monday, February 2, 2009 10:07 pm Benjamin Herrenschmidt wrote:
> On Mon, 2009-02-02 at 16:28 -0800, Linus Torvalds wrote:
> > On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
> > > BTW, on the PMAC the problematic driver appears to be the radeon
> > > driver, according to Ben, and the breakage is not related to USB.
> >
> > Hmm. atyfb_base.c has the same kind of things with magic PMAC code, but
> > it doesn't follow the USB pattern - it just replaces
> > "pci_set_power_state()" _entirely_.
> >
> > It seems a very interesting suspend routine, btw. It doesn't seem to do
> > any of the pci_save_state() at all, just re-initializes from scratch.
> > Maybe it is unhappy with the PM layer deciding to try to restore state
> > since it clearly didn't..
>
> In fact, atyfb is also busted in -rc3 for different reasons though
> probably by the same patch. I just dug an old powerbook with a mach64
> and it crashes on resume with a machine check in there. (Among 2 or 3
> other problems introduced by recent kernels, such as pci_get_* now
> kmallocs with GFP_KERNEL internally which makes it WARN when I use it in
> my low level suspend/resume code to whack the memory controller, etc...
> this one is going to bite others I reckon, or IDE having some interrupt
> problems on resume).
>
> Adding a pci_save_state() to atyfb_pci_suspend() and a
> pci_set_power_state() + pci_restore_state() at the beginning of
> atyfb_pci_resume() fixes the machine check here.
>
> Now where it gets funny is that I've added code to read the BAR and
> command register content before, between, and after those calls and
> print it and .. they are sane... Until i discovered that what happens is
> that the new generic code seems to actually blast 0 all over my config
> space if I don't call pci_save_state() in suspend(). I suppose I was
> missing a "mandatory" call here... but the core should be more robust,
> ie it shouldn't erase the config space of something because a driver
> "forgot" to call pci_save_state() !
Whoa, I don't think we actually zero the contents in the suspend/resume core,
but if the device goes into D3 the config space contents may be lost, maybe
that's what happening here?
--
Jesse Barnes, Intel Open Source Technology Center
^ permalink raw reply [flat|nested] 98+ messages in thread
* Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 9:26 ` Rafael J. Wysocki
@ 2009-02-03 17:04 ` Rafael J. Wysocki
2009-02-03 17:59 ` Linus Torvalds
2009-02-03 21:02 ` Benjamin Herrenschmidt
2009-02-03 17:53 ` PCI PM: Restore standard config registers of all devices early Linus Torvalds
1 sibling, 2 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 17:04 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Linus Torvalds
Cc: Linux Kernel Mailing List, Jesse Barnes, Andreas Schwab,
Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Rafael J. Wysocki wrote:
> On Tuesday 03 February 2009, Benjamin Herrenschmidt wrote:
> > On Mon, 2009-02-02 at 15:18 -0800, Linus Torvalds wrote:
[--snip--]
> > Comments ?
>
> As I said, I tend to prefer the "loop of disable_irq()" approach, because it
> would allow us to preserve the current ordering of ACPI operations. Namely,
> if we do:
>
> suspend devices (normal suspend)
> loop of disable_irq()
> late suspend of devices
> _PTS
> disable nonboot CPUs
> local_irq_disable()
> sysdev suspend
> enter sleep state
> get control from the BIOS
> sysdev resume
> (*)
> local_irq_enable()
> enable nonboot CPUs
> _WAK
> early resume of devices
> loop of enable_irq()
> resume devices (normal resume)
>
> the ordering of _PTS with respect to putting devices into low power states and
> disabling the nonboot CPUs will be the same as it is now and the same applies
> to _WAK and putting devices into D0 etc. (I really _really_ wouldn't like to
> change this ordering, since this alone is likely to break things badly).
>
> Now, there's one subtle problem with resume in this picture. Namely, before
> running the "early resume of devices" we have to make sure that the interrupts
> will be masked. However, masking MSI-X, for example, means writing into
> the memory space of the device, so we can't do it at this point. Of course, we
> can assume that MSI/MSI-X will be masked when we get control from the BIOS
> (moreover, they are not shareable, so we can just ignore them at this point),
> but still we'll have to mask the other interrupts before doing the
> local_irq_enable() on resume - marked by the (*) above. This appears to be
> doable, though.
Having reconsidered it, I think that the "loop of disable_irq()" may be
problematic due to MSI/MSI-X and devices that are put into D3 during the
"normal" suspend. That is, we shouldn't try to mask MSI/MSI-X for devices in
D3 (especially MSI-X, since that involves writing to the device's memory
space). This implies that devices in D3 should be avoided in the "loop of
disable_irq()", but that could be tricky if we loop over struct irq_desc
objects.
Still, we can modify pci_pm_suspend() (and the other PCI callbacks analogously)
so that it masks the interrupt of the device right before returning to the
caller if the device has not been put into a low power state before. After
that all devices will either be in low power states, so they won't be able to
generate interrupts, or have their interrupts masked. In the latter case the
core can then put them into low power states in suspend_late().
To summarize, I'd like to do the following:
suspend devices (normal suspend, mask interrupts for devices still in D0)
late suspend of devices (devices cannot generate interrupts)
_PTS
disable nonboot CPUs
local_irq_disable()
sysdev suspend
enter sleep state
get control from the BIOS
sysdev resume
(*)
local_irq_enable()
enable nonboot CPUs
_WAK
early resume of devices {
- devices cannot generate interrupts
- devices are being put into D0
- standard config registers are being restored
}
resume devices (unmask device interrupts, normal resume)
The "early resume of devices" step can effectively unmask MSI/MSI-X, but that
shouldn't matter, since they are not shared anyway.
IOW, I would just move the late suspend of devices before _PTS and the
early resume of devices after _WAK, with the additional disabling/enabling of
device interrups (all PCI devices must enter the late suspend phase either in a
low power state, or with their interrupts masked; analogously for resume).
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 9:26 ` Rafael J. Wysocki
2009-02-03 17:04 ` Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early) Rafael J. Wysocki
@ 2009-02-03 17:53 ` Linus Torvalds
2009-02-03 21:57 ` Rafael J. Wysocki
1 sibling, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 17:53 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
>
> As I said, I tend to prefer the "loop of disable_irq()" approach, because it
> would allow us to preserve the current ordering of ACPI operations. Namely,
> if we do:
>
> suspend devices (normal suspend)
> loop of disable_irq()
> late suspend of devices
> _PTS
> disable nonboot CPUs
> local_irq_disable()
> sysdev suspend
> enter sleep state
> get control from the BIOS
> sysdev resume
> (*)
> local_irq_enable()
> enable nonboot CPUs
> _WAK
> early resume of devices
> loop of enable_irq()
> resume devices (normal resume)
>
> the ordering of _PTS with respect to putting devices into low power states and
> disabling the nonboot CPUs will be the same as it is now and the same applies
> to _WAK and putting devices into D0 etc. (I really _really_ wouldn't like to
> change this ordering, since this alone is likely to break things badly).
Yes.
Also, make the "loop of disable/enable_irq()" phase be a helper function
that also sets system_state to SYSTEM_SUSPENDING/SYSTEM_RUNNING
respectively, and it should all be pretty clean, and the changes really
should be pretty minimal.
> Now, there's one subtle problem with resume in this picture. Namely, before
> running the "early resume of devices" we have to make sure that the interrupts
> will be masked. However, masking MSI-X, for example, means writing into
> the memory space of the device, so we can't do it at this point.
I really don't think it matters. Why? We simply don't care.
All MSI-X things will still have to go through the regular irq layer, so
the disable/enable_irq part, so even though we've done the
"local_irq_enable()", we just don't care.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 17:04 ` Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early) Rafael J. Wysocki
@ 2009-02-03 17:59 ` Linus Torvalds
2009-02-03 18:31 ` Linus Torvalds
2009-02-03 18:32 ` Jesse Barnes
2009-02-03 21:02 ` Benjamin Herrenschmidt
1 sibling, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 17:59 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
>
> Having reconsidered it, I think that the "loop of disable_irq()" may be
> problematic due to MSI/MSI-X and devices that are put into D3 during the
> "normal" suspend. That is, we shouldn't try to mask MSI/MSI-X for devices in
> D3
Rafael, you seem to be confused about what "disable_irq()" does.
It does not touch the driver hardware AT ALL. It literally just touches
the interrupt controller, and even that only indirectly.
What disable_irq() does it literally:
void disable_irq_nosync(unsigned int irq)
{
struct irq_desc *desc = irq_to_desc(irq);
unsigned long flags;
if (!desc)
return;
spin_lock_irqsave(&desc->lock, flags);
if (!desc->depth++) {
desc->status |= IRQ_DISABLED;
desc->chip->disable(irq);
}
spin_unlock_irqrestore(&desc->lock, flags);
}
and then it does a "synchronize_irq()" to wait to make sure that there are
no pending ones.
And in many cases, even the
desc->chip->disable(irq);
doesn't actually _do_ anything - we'll quite possibly continue to take the
interrupt, and only when the interrupt happens will it see the "oh,
IRQ_DISABLED is set" thing, and do something about it.
So don't worry about putting devices in D3 - disable_irq() will not care
AT ALL whether the device is alive or not, and will never try to touch it
anyway.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 17:59 ` Linus Torvalds
@ 2009-02-03 18:31 ` Linus Torvalds
2009-02-03 18:41 ` Ingo Molnar
2009-02-03 18:32 ` Jesse Barnes
1 sibling, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 18:31 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Linus Torvalds wrote:
>
> So don't worry about putting devices in D3 - disable_irq() will not care
> AT ALL whether the device is alive or not, and will never try to touch it
> anyway.
Btw, this is very much the case for MSI irq's in particular. If you ask to
_mask_ them, it will go to the look up the device list and try to mask
them (quite frankly, that sounds insane to me, but whatever), but that not
what the irq layer does for the simple "disable()" case.
It will literally just set the flag, and then even if an interrupt happens
afterwards, it will just ->ack it, and then call the ->end thing - and
doesn't need to do anything else in the whole "disable" path because MSI's
are obviously edge-triggered.
And the ACK is a pure (x2/io)apic thing, and again doesn't actually touch
the device itself, only the irq controller itself.
So while it is possible in theory that some irq controller ends up trying
to access the device for disable_irq, I don't think it's ever true in
reality.
But it's possible that I overlooked some really odd case, of course.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 17:59 ` Linus Torvalds
2009-02-03 18:31 ` Linus Torvalds
@ 2009-02-03 18:32 ` Jesse Barnes
2009-02-03 18:46 ` Linus Torvalds
1 sibling, 1 reply; 98+ messages in thread
From: Jesse Barnes @ 2009-02-03 18:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday, February 3, 2009 9:59 am Linus Torvalds wrote:
> On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
> > Having reconsidered it, I think that the "loop of disable_irq()" may be
> > problematic due to MSI/MSI-X and devices that are put into D3 during the
> > "normal" suspend. That is, we shouldn't try to mask MSI/MSI-X for
> > devices in D3
>
> Rafael, you seem to be confused about what "disable_irq()" does.
>
> It does not touch the driver hardware AT ALL. It literally just touches
> the interrupt controller, and even that only indirectly.
>
> What disable_irq() does it literally:
>
> void disable_irq_nosync(unsigned int irq)
> {
> struct irq_desc *desc = irq_to_desc(irq);
> unsigned long flags;
>
> if (!desc)
> return;
>
> spin_lock_irqsave(&desc->lock, flags);
> if (!desc->depth++) {
> desc->status |= IRQ_DISABLED;
> desc->chip->disable(irq);
> }
> spin_unlock_irqrestore(&desc->lock, flags);
> }
>
> and then it does a "synchronize_irq()" to wait to make sure that there are
> no pending ones.
>
> And in many cases, even the
>
> desc->chip->disable(irq);
>
> doesn't actually _do_ anything - we'll quite possibly continue to take the
> interrupt, and only when the interrupt happens will it see the "oh,
> IRQ_DISABLED is set" thing, and do something about it.
But won't ->disable point at ->mask in the MSI case (msi_chip doesn't have
a ->disable, so the IRQ core will make ->disable point at ->mask)?
And in mask we do go poke at PCI regs (msi_set_mask_bits) to mask interrupts
if possible (though if there's no mask bit in the legacy MSI case we don't do
anything).
But again, these interrupts won't be shared, so maybe it doesn't matter, and
maybe the IRQ_DISABLED flag will keep any drivers from seeing post suspend
interrupts anyway...
--
Jesse Barnes, Intel Open Source Technology Center
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 18:31 ` Linus Torvalds
@ 2009-02-03 18:41 ` Ingo Molnar
0 siblings, 0 replies; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 18:41 UTC (permalink / raw)
To: Linus Torvalds, Thomas Gleixner
Cc: Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Jesse Barnes, Andreas Schwab,
Len Brown
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Feb 2009, Linus Torvalds wrote:
> >
> > So don't worry about putting devices in D3 - disable_irq() will not care
> > AT ALL whether the device is alive or not, and will never try to touch
> > it anyway.
>
> Btw, this is very much the case for MSI irq's in particular. If you ask to
> _mask_ them, it will go to the look up the device list and try to mask
> them (quite frankly, that sounds insane to me, but whatever), but that not
> what the irq layer does for the simple "disable()" case.
>
> It will literally just set the flag, and then even if an interrupt happens
> afterwards, it will just ->ack it, and then call the ->end thing - and
> doesn't need to do anything else in the whole "disable" path because MSI's
> are obviously edge-triggered.
>
> And the ACK is a pure (x2/io)apic thing, and again doesn't actually touch
> the device itself, only the irq controller itself.
>
> So while it is possible in theory that some irq controller ends up trying
> to access the device for disable_irq, I don't think it's ever true in
> reality.
>
> But it's possible that I overlooked some really odd case, of course.
Sidenote: one future usecase that moves towards this might be threaded IRQ
handlers - but it's not a real problem.
For them to run well we want to split IRQ processing into two bits: the
"fast and atomic" bit, which just goes in, checks which hw generated the
IRQ, then disables that hw via its device, and then wakes up the irq kernel
thread to do all the fancy stuff.
[ at this point that 'complete the IRQ' processing might even be done on
another CPU - if we decide that the irq was requested from a given CPU. So
there's even performance advantage to such a splitup. We have not yet
built up much local CPU state so we can migrate the work cheaply. ]
But it's not an issue: this too will be driven centrally and disable_irq()
will obviously inhibit irqs to all handlers on the chain [to stay
compatible] and not touch the device hw in the suspend case.
So the 'disable the device irq itself' functionality is hidden behind the
irq flow machinery - and never executed once you've done a widespread
disable_irq() calls on (almost) all devices.
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 18:32 ` Jesse Barnes
@ 2009-02-03 18:46 ` Linus Torvalds
2009-02-03 19:03 ` Linus Torvalds
0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 18:46 UTC (permalink / raw)
To: Jesse Barnes
Cc: Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Jesse Barnes wrote:
>
> But won't ->disable point at ->mask in the MSI case (msi_chip doesn't have
> a ->disable, so the IRQ core will make ->disable point at ->mask)?
No. I clarified this in the next message, but basically it boils down to a
simple case: "for edge-triggered interrupts, it's actually stupid to mask
things, because it's both simpler and _cheaper_ to just take change that
the interrupt might happen once" (especially since it almost never
happens).
In fact, for an edge-triggered interrupt it is often a _bug_ to mask the
interrupt source, because you often lose the interrupt. So masking is not
only complex and expensive, it's also often _wrong_. Instead, what
disable_irq() does is to take the interrupt, but instead of calling the
interrupt handlers (it's disabled!) it just sets a flag.
And that is a big _correctness_ issue, because the setting of the flag is
how we know to resend the interrupt when we enable things again - even
though the hardware itself actually lost the edge.
[ Of course, you can hope that the PCI device itself doesn't lose it, and
sees the edge again when unmasking, but the fact that many interrupt
controllers get this fundamental masking issue wrong means that I'd
almost bet that lots of PCI devices get it wrong too! ]
> And in mask we do go poke at PCI regs (msi_set_mask_bits) to mask interrupts
> if possible (though if there's no mask bit in the legacy MSI case we don't do
> anything).
I actually think that's a bug. I think it's just horribly stupid for to
->mask to go down to the device layer (for all the same reasons that we
don't do it for disable), but it doesn't much matter. We only use it for
"shutdown" of the irq controller - we should do the same thing we do for
irq_disable and just set a flag.
The only reason to actually _mask_ an interrupt is if it is
level-triggered and can cause screaming. But even then it's often better
to just wait for the interrupt to happen (and mask it then), because 99%
of the time the interrupt obviously never happens.
But we've never had reason to optimize ->mask/->unmask.
That said, we've also never had much reason to _care_ deeply, so it's also
possible that we do mask things over some path. I didn't actually walk
_all_ the paths, and the logic for irq handling has changed enough over
the years that I don't know all the paths any more. Maybe we do that
explicit mask in some path I missed. We _shouldn't_, but who knows..
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 18:46 ` Linus Torvalds
@ 2009-02-03 19:03 ` Linus Torvalds
2009-02-03 19:13 ` Ingo Molnar
2009-02-03 19:19 ` Linus Torvalds
0 siblings, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 19:03 UTC (permalink / raw)
To: Jesse Barnes
Cc: Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Linus Torvalds wrote:
>
> That said, we've also never had much reason to _care_ deeply, so it's also
> possible that we do mask things over some path. I didn't actually walk
> _all_ the paths, and the logic for irq handling has changed enough over
> the years that I don't know all the paths any more. Maybe we do that
> explicit mask in some path I missed. We _shouldn't_, but who knows..
Ok, so I decided to actually try to walk it all. Better look at the actual
code.
Hmm. The _normal_ simple irq handler does this the way I described, but
for some reason the "handle_edge_irq()" does not. And the reason is
actually a buglet: it needs to mask things for the "recursive interrupt"
case.
But that literally just looks like a small implementation detail (the code
decided to share the code for IRQ_INPROGRESS and IRQ_DISABLED). We should
fix it, so that you _can_ disable irqs and not have to worry about this
all.
I'm really not sure why that handle_edge_irq thing uses "ack_and_mask()"
instead of just "desc->chip->ack()"? I'm also totally flummoxed as to why
it feels it needs to go all the way out to the device to mask things,
instead of just masking at an apic level, which is much simpler and faster
(especially since masking should never happen in practice anyway).
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 19:03 ` Linus Torvalds
@ 2009-02-03 19:13 ` Ingo Molnar
2009-02-03 19:38 ` Linus Torvalds
2009-02-03 19:19 ` Linus Torvalds
1 sibling, 1 reply; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 19:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jesse Barnes, Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Feb 2009, Linus Torvalds wrote:
> >
> > That said, we've also never had much reason to _care_ deeply, so it's also
> > possible that we do mask things over some path. I didn't actually walk
> > _all_ the paths, and the logic for irq handling has changed enough over
> > the years that I don't know all the paths any more. Maybe we do that
> > explicit mask in some path I missed. We _shouldn't_, but who knows..
>
> Ok, so I decided to actually try to walk it all. Better look at the actual
> code.
>
> Hmm. The _normal_ simple irq handler does this the way I described, but
> for some reason the "handle_edge_irq()" does not. And the reason is
> actually a buglet: it needs to mask things for the "recursive interrupt"
> case.
>
> But that literally just looks like a small implementation detail (the code
> decided to share the code for IRQ_INPROGRESS and IRQ_DISABLED). We should
> fix it, so that you _can_ disable irqs and not have to worry about this
> all.
>
> I'm really not sure why that handle_edge_irq thing uses "ack_and_mask()"
> instead of just "desc->chip->ack()"? I'm also totally flummoxed as to why
> it feels it needs to go all the way out to the device to mask things,
> instead of just masking at an apic level, which is much simpler and faster
> (especially since masking should never happen in practice anyway).
Hm, do you mean mask_ack_irq()? The ->mask_ack() irqchip method is just a
small tweak normally: if we get an irq while the irq was disabled, we can
mark it pending and masks it for real.
It's optional for a PIC implementation to provide it and the generic code
does it via ->mask() + ->ack() if the PIC implementation keeps it NULL.
[ In theory this also solves screaming level-triggered irqs that advertise
themselves as edge-triggered [due to firmware/BIOS bug - these do happen]
and then keep spamming the system. ]
I have not done a deep audit, normally (on x86 at least) ->mask_ack() should
not touch any lowlevel device bits (only the interrupt controller bits).
Have you found a case where it does?
That would be arguably broken i think - we should not touch lowlevel device
bits from the current generation of PIC code really, there's just no point.
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 19:03 ` Linus Torvalds
2009-02-03 19:13 ` Ingo Molnar
@ 2009-02-03 19:19 ` Linus Torvalds
2009-02-03 21:11 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 19:19 UTC (permalink / raw)
To: Jesse Barnes
Cc: Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 3 Feb 2009, Linus Torvalds wrote:
>
> Hmm. The _normal_ simple irq handler does this the way I described, but
> for some reason the "handle_edge_irq()" does not. And the reason is
> actually a buglet: it needs to mask things for the "recursive interrupt"
> case.
Btw, just to clarify: none of this happens at the actual "irq_disable()"
time: it only happens if you get an interrupt _while_ it's disabled. Which
obviously shouldn't happen in the shutdown/wakeup path anyway for MSI,
since the interrupts aren't shared, but it would be good to just be extra
safe.
I do suspect we could/should just get rid of the msi masking entirely, but
that may be too scary a step.
For the current suspend/resume situation, maybe it's enough to know that
it shouldn't be happening anyway, and even if it _does_ happen on a device
that has been shut down, it's just not going to do anything. Sure, it's
doing that "writel/readl", but if it gets lost, who really cares? Nobody.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 19:13 ` Ingo Molnar
@ 2009-02-03 19:38 ` Linus Torvalds
2009-02-03 19:53 ` Ingo Molnar
0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 19:38 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jesse Barnes, Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown
On Tue, 3 Feb 2009, Ingo Molnar wrote:
> >
> > I'm really not sure why that handle_edge_irq thing uses "ack_and_mask()"
> > instead of just "desc->chip->ack()"? I'm also totally flummoxed as to why
> > it feels it needs to go all the way out to the device to mask things,
> > instead of just masking at an apic level, which is much simpler and faster
> > (especially since masking should never happen in practice anyway).
>
> Hm, do you mean mask_ack_irq()?
Yes.
> The ->mask_ack() irqchip method is just a
> small tweak normally: if we get an irq while the irq was disabled, we can
> mark it pending and masks it for real.
No, I know why mask_ack_irq() does what it does and I agree with it. What
I was really reacting to was that handle_edge_irq() calls it at _all_.
IOW, I'm talking about this code:
handle_edge_irq(unsigned int irq, struct irq_desc *desc)
...
if (unlikely((desc->status & (IRQ_INPROGRESS | IRQ_DISABLED)) ||
!desc->action)) {
desc->status |= (IRQ_PENDING | IRQ_MASKED);
mask_ack_irq(desc, irq);
..
where the masking part seems a bit pointless. And in the case of MSI, it
causes us to go all the way out to the device, which sounds pretty
expensive too.
So if you have high enough interrupt load that you get another interrupt
on another CPU while handling one, I think the "mask_ack_irq()" likely
makes things go slower. It also causes us to have that horrible:
/*
* When another irq arrived while we were handling
* one, we could have masked the irq.
* Renable it, if it was not disabled in meantime.
*/
if (unlikely((desc->status &
(IRQ_PENDING | IRQ_MASKED | IRQ_DISABLED)) ==
(IRQ_PENDING | IRQ_MASKED))) {
desc->chip->unmask(irq);
desc->status &= ~IRQ_MASKED;
}
code there in the middle of the loop to handle the interrupt.
That function has another oddity too: it does
if (unlikely(!action)) {
desc->chip->mask(irq);
goto out_unlock;
}
but the thing is, if somebody unregistered the interrupt, then it should
have been masked to begin with. Why would be need to mask it in the
interrupt handler?
Maybe I'm missing something really subtle, but my reaction would be that
the function _should_ look just something like the appended.
But no, I'm really not going to commit this. Consider this patch to be a
question ("why does it do that?") rather than a suggestion ("we should do
this").
Linus
---
kernel/irq/chip.c | 20 +++-----------------
1 files changed, 3 insertions(+), 17 deletions(-)
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 7de11bd..ad852ce 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -468,8 +468,8 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc)
*/
if (unlikely((desc->status & (IRQ_INPROGRESS | IRQ_DISABLED)) ||
!desc->action)) {
- desc->status |= (IRQ_PENDING | IRQ_MASKED);
- mask_ack_irq(desc, irq);
+ desc->status |= IRQ_PENDING;
+ desc->chip->ack(irq);
desc = irq_remap_to_desc(irq, desc);
goto out_unlock;
}
@@ -486,22 +486,8 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc)
struct irqaction *action = desc->action;
irqreturn_t action_ret;
- if (unlikely(!action)) {
- desc->chip->mask(irq);
+ if (unlikely(!action))
goto out_unlock;
- }
-
- /*
- * When another irq arrived while we were handling
- * one, we could have masked the irq.
- * Renable it, if it was not disabled in meantime.
- */
- if (unlikely((desc->status &
- (IRQ_PENDING | IRQ_MASKED | IRQ_DISABLED)) ==
- (IRQ_PENDING | IRQ_MASKED))) {
- desc->chip->unmask(irq);
- desc->status &= ~IRQ_MASKED;
- }
desc->status &= ~IRQ_PENDING;
spin_unlock(&desc->lock);
^ permalink raw reply related [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 19:38 ` Linus Torvalds
@ 2009-02-03 19:53 ` Ingo Molnar
2009-02-03 20:04 ` Ingo Molnar
2009-02-03 20:18 ` Linus Torvalds
0 siblings, 2 replies; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 19:53 UTC (permalink / raw)
To: Linus Torvalds, Thomas Gleixner
Cc: Jesse Barnes, Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Feb 2009, Ingo Molnar wrote:
> > >
> > > I'm really not sure why that handle_edge_irq thing uses "ack_and_mask()"
> > > instead of just "desc->chip->ack()"? I'm also totally flummoxed as to why
> > > it feels it needs to go all the way out to the device to mask things,
> > > instead of just masking at an apic level, which is much simpler and faster
> > > (especially since masking should never happen in practice anyway).
> >
> > Hm, do you mean mask_ack_irq()?
>
> Yes.
>
> > The ->mask_ack() irqchip method is just a
> > small tweak normally: if we get an irq while the irq was disabled, we can
> > mark it pending and masks it for real.
>
> No, I know why mask_ack_irq() does what it does and I agree with it. What
> I was really reacting to was that handle_edge_irq() calls it at _all_.
> IOW, I'm talking about this code:
>
> handle_edge_irq(unsigned int irq, struct irq_desc *desc)
> ...
> if (unlikely((desc->status & (IRQ_INPROGRESS | IRQ_DISABLED)) ||
> !desc->action)) {
> desc->status |= (IRQ_PENDING | IRQ_MASKED);
> mask_ack_irq(desc, irq);
> ..
>
> where the masking part seems a bit pointless. And in the case of MSI, it
> causes us to go all the way out to the device, which sounds pretty
> expensive too.
hm, i agree that your patched version looks much simpler.
There's two reasons we have this variant:
- huge bikeshed painting thread in the early days of the genirq patchset,
with folks expressing concern about our original plans to keep
edge-triggered unmasked _always_. (which your patch does too)
So we just went with the path of least resistence and used this hybride.
- the screaming-irq observation i had - do you consider that valid?:
>> [ In theory this also solves screaming level-triggered irqs that
>> advertise themselves as edge-triggered [due to firmware/BIOS bug -
>> these do happen] and then keep spamming the system. ]
I wanted to have a pretty much interchangeable flow method between edge
and level triggered - so that the BIOS cannot screw us by enumerating an
irq as edge-triggered while it's level-triggered.
Especially for legacy x86 irqs in the low <16 range the trigger mode can
be influenced by chipset settings and might not always be what we think
it is.
That's my rough recollection - Thomas, is that correct and do you have
anything to add here?
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 19:53 ` Ingo Molnar
@ 2009-02-03 20:04 ` Ingo Molnar
2009-02-03 20:18 ` Linus Torvalds
1 sibling, 0 replies; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 20:04 UTC (permalink / raw)
To: Linus Torvalds, Thomas Gleixner
Cc: Jesse Barnes, Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown
* Ingo Molnar <mingo@elte.hu> wrote:
> - the screaming-irq observation i had - do you consider that valid?:
>
> >> [ In theory this also solves screaming level-triggered irqs that
> >> advertise themselves as edge-triggered [due to firmware/BIOS bug -
> >> these do happen] and then keep spamming the system. ]
>
> I wanted to have a pretty much interchangeable flow method between edge
> and level triggered - so that the BIOS cannot screw us by enumerating an
> irq as edge-triggered while it's level-triggered.
>
> Especially for legacy x86 irqs in the low <16 range the trigger mode can
> be influenced by chipset settings and might not always be what we think
> it is.
For apic and MSI based methods that's not a big issue: the trigger mode is
explicitly set by us, so if there's a mismatch it's a kernel bug.
And even for legacy ISA i8259a.c it should be fine after all, as we
initialize it via:
set_irq_chip_and_handler_name(irq, &i8259A_chip, handle_level_irq,
Which is a screaming-safe sequence. (and the x86 i8259 PIC does not lose
edges.)
That still leaves other architectures ... but i think the argument is a lot
weaker than i thought it to be.
Could you send a Signed-off-by so that i can queue it up and test it a bit?
If it does not blow up on x86 in practice then we should be fine, and it
avoids the MSI ->mask() stupidity as well.
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 19:53 ` Ingo Molnar
2009-02-03 20:04 ` Ingo Molnar
@ 2009-02-03 20:18 ` Linus Torvalds
2009-02-03 20:57 ` Ingo Molnar
1 sibling, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 20:18 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, Jesse Barnes, Rafael J. Wysocki,
Benjamin Herrenschmidt, Linux Kernel Mailing List, Andreas Schwab,
Len Brown
On Tue, 3 Feb 2009, Ingo Molnar wrote:
>
> - the screaming-irq observation i had - do you consider that valid?:
>
> >> [ In theory this also solves screaming level-triggered irqs that
> >> advertise themselves as edge-triggered [due to firmware/BIOS bug -
> >> these do happen] and then keep spamming the system. ]
>
> I wanted to have a pretty much interchangeable flow method between edge
> and level triggered - so that the BIOS cannot screw us by enumerating an
> irq as edge-triggered while it's level-triggered.
Yes, if we can't be 100% sure it's really edge-triggered, I guess the mask
thing is really worth it. So maybe "handle_edge_irq()" is actually doing
everything right.
Of course, with MSI, we can fundamentally really be sure that it's
edge-triggered (since it's literally a packet on the PCI bus that
generates it), and that actually brings up another possibility: assuming
handle_edge_irq() is doing the correct "safe" thing, maybe the answer is
to just get rid of the MSI "mask()" operation as being unnecessary, and
catch it at that level.
NOTE! From a correctness standpoint I think this is all irrelevant. Even
if we have turned off the power of some device, the msi irq masking isn't
going to hurt (apart from _possibly_ causing a machine check, but that's
nothing new - architectures that enable machine checks on accesses to
non-responding PCI hardware have to handle those anyway).
So I wouldn't worry too much. I think this is interesting mostly from a
performance standpoint - MSI interrupts are supposed to be fast, and under
heavy interrupt load I could easily see something like
- cpu1: handles interrupt, has acked it, calls down to the handler
- the handler clears the original irq source, but another packet (or disk
completion) happens almost immediately
- cpu2 takes the second interrupt, but it's still IRQ_INPROGRESS, so it
masks.
- cpu1 gets back and unmasks etc and now really handles it because of
IRQ_PENDING.
Note how the mask/unmask were all just costly extra overhead over the PCI
bus. If we're talking something like high-performance 10Gbit ethernet (or
even maybe fast SSD disks), driver writers actually do count PCI cycles,
because a single PCI read can be several hundred ns, and if you take a
thousand interrupts per second, it does add up.
Of course, ethernet tends to do things like interrupt mitigation to avoid
this, but that has its own downsides (longer latencies) and isn't really
considered optimal in some RT environments (wall street trading kind of
things).
I really don't know how big an issue this all is. It probably isn't really
noticeable.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 20:18 ` Linus Torvalds
@ 2009-02-03 20:57 ` Ingo Molnar
2009-02-03 21:04 ` Ingo Molnar
0 siblings, 1 reply; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 20:57 UTC (permalink / raw)
To: Linus Torvalds, David S. Miller
Cc: Thomas Gleixner, Jesse Barnes, Rafael J. Wysocki,
Benjamin Herrenschmidt, Linux Kernel Mailing List, Andreas Schwab,
Len Brown
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> So I wouldn't worry too much. I think this is interesting mostly from a
> performance standpoint - MSI interrupts are supposed to be fast, and under
> heavy interrupt load I could easily see something like
>
> - cpu1: handles interrupt, has acked it, calls down to the handler
>
> - the handler clears the original irq source, but another packet (or disk
> completion) happens almost immediately
>
> - cpu2 takes the second interrupt, but it's still IRQ_INPROGRESS, so it
> masks.
>
> - cpu1 gets back and unmasks etc and now really handles it because of
> IRQ_PENDING.
>
> Note how the mask/unmask were all just costly extra overhead over the PCI
> bus. If we're talking something like high-performance 10Gbit ethernet (or
> even maybe fast SSD disks), driver writers actually do count PCI cycles,
> because a single PCI read can be several hundred ns, and if you take a
> thousand interrupts per second, it does add up.
In practice MSI (and in particular MSI-X) irq sources tend to be bound to a
single CPU on modern x86 hardware. The kernel does not do IRQ balancing
anymore, nor does the hardware. We have a slow irq-balancer daemon
(irqbalanced) in user-space. So singular IRQ sources, especially when they
are MSI, tend to be 99.9% on the same CPU. Changing affinity is possible and
has to always work reliably, but it is a performance slowpath.
An increasing trend is to have multiple irqs per device (multiple descriptor
rings, split rx and tx rings with separate irq sources): and each IRQ can
get balanced to a separate CPU. But those irqs cannot interact on a ->mask()
level as each IRQ has its separate irq_desc.
The most advanced way of balancing IRQs is not widespread yet: it is where
devices actually interpret the payload and send completions dynamically to
differing CPUs - depending on things like the TCP/IP hash value or a
in-descriptor "target CPU". That way we could get completion on the CPU
where the work was submitted from. (and where the data structures are the
most cache-localized)
That principle works both for networking and for other IO transports - but
we have little support for it yet. It would work really well for workloads
where one physical device is shared by many CPUs.
(A lesser method that approximates this is the use of lots of
submission/completion rings per device and their binding to cpus - but that
can never really approach the number of CPUs really possible in a system.)
And in this most advanced mode of MSI IRQs, and if MSI devices had the
ability to direct IRQs to a specific CPU (they dont have that right now
AFAICT), we'd run into the overhead scenarios you describe above, and your
edge-triggered flow is the most performant one.
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 17:04 ` Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early) Rafael J. Wysocki
2009-02-03 17:59 ` Linus Torvalds
@ 2009-02-03 21:02 ` Benjamin Herrenschmidt
2009-02-03 21:56 ` Rafael J. Wysocki
1 sibling, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 21:02 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 2009-02-03 at 18:04 +0100, Rafael J. Wysocki wrote:
> > Now, there's one subtle problem with resume in this picture. Namely, before
> > running the "early resume of devices" we have to make sure that the interrupts
> > will be masked. However, masking MSI-X, for example, means writing into
> > the memory space of the device, so we can't do it at this point. Of course, we
> > can assume that MSI/MSI-X will be masked when we get control from the BIOS
> > (moreover, they are not shareable, so we can just ignore them at this point),
> > but still we'll have to mask the other interrupts before doing the
> > local_irq_enable() on resume - marked by the (*) above. This appears to be
> > doable, though.
Which is why I prefer making mutex/semaphores/allocations "safe" to use
in that late suspend phase with IRQs off.
It sounds like a less invasive thing, simpler, change, allowing to move
the ACPI stuff back to where it belongs, and it would help solving other
problems such as the problems I exposed with video resume, which I'm
trying to do -very- early (ie, before sysdev's even).
In fact, as I may have said elsewhere, I'm also being bitten by the PCI
layer doing kmalloc(...GFP_KERNEL) all over the place nowadays including
in things like pci_get_device() which are hurting some memory controller
code I have that runs in late suspend (I could refactor that code to
do the pci_get_* earlier, it's just one more thing..).
> Having reconsidered it, I think that the "loop of disable_irq()" may be
> problematic due to MSI/MSI-X and devices that are put into D3 during the
> "normal" suspend. That is, we shouldn't try to mask MSI/MSI-X for devices in
> D3 (especially MSI-X, since that involves writing to the device's memory
> space). This implies that devices in D3 should be avoided in the "loop of
> disable_irq()", but that could be tricky if we loop over struct irq_desc
> objects.
>
> Still, we can modify pci_pm_suspend() (and the other PCI callbacks analogously)
> so that it masks the interrupt of the device right before returning to the
> caller if the device has not been put into a low power state before. After
> that all devices will either be in low power states, so they won't be able to
> generate interrupts, or have their interrupts masked. In the latter case the
> core can then put them into low power states in suspend_late().
That's going to be hard to get right vs. shared interrupts no ?
I think the "other" solution overall is much more simple.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 20:57 ` Ingo Molnar
@ 2009-02-03 21:04 ` Ingo Molnar
2009-02-03 21:12 ` Thomas Gleixner
2009-02-03 21:18 ` Linus Torvalds
0 siblings, 2 replies; 98+ messages in thread
From: Ingo Molnar @ 2009-02-03 21:04 UTC (permalink / raw)
To: Linus Torvalds, David S. Miller
Cc: Thomas Gleixner, Jesse Barnes, Rafael J. Wysocki,
Benjamin Herrenschmidt, Linux Kernel Mailing List, Andreas Schwab,
Len Brown
* Ingo Molnar <mingo@elte.hu> wrote:
[...]
> That principle works both for networking and for other IO transports - but
> we have little support for it yet. It would work really well for workloads
> where one physical device is shared by many CPUs.
Which is really what we have in _practice_ - only benchmarketing sets up one
physical device per CPU and avoids all the ugly consequences of
gathering/spreadig a channel of information from/to a single device to
multiple CPUs.
> (A lesser method that approximates this is the use of lots of
> submission/completion rings per device and their binding to cpus - but
> that can never really approach the number of CPUs really possible in a
> system.)
btw., more advanced device IRQ models was one of the thinking behind sparse
IRQ support: defining a really large NR_IRQS limit on x86 by default, on all
form factors, and making it really easy and cheap to have a _ton_ of IRQs in
a Linux system might give hw designers ideas to create such hardware.
/me dreams on ;-)
> And in this most advanced mode of MSI IRQs, and if MSI devices had the
> ability to direct IRQs to a specific CPU (they dont have that right now
> AFAICT), we'd run into the overhead scenarios you describe above, and your
> edge-triggered flow is the most performant one.
So i'd still like your tentative Signed-off-by for your patch - it's i think
not v2.6.29 material but if it stays problem free in testing we can try it
in v2.6.30. If it causes problem it will be clearly bisectable and clearly
revertable.
Maybe we could split it in two: and for MSI we could introduce a 'simpler
and faster' edge flow as well - and keep the legacy handler untouched. That
way it's low-risk in its entirety. (and avoids the MSI ->mask complication
as well.)
Ingo
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 19:19 ` Linus Torvalds
@ 2009-02-03 21:11 ` Benjamin Herrenschmidt
2009-02-03 21:53 ` Rafael J. Wysocki
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 21:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jesse Barnes, Rafael J. Wysocki, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 2009-02-03 at 11:19 -0800, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Linus Torvalds wrote:
> >
> > Hmm. The _normal_ simple irq handler does this the way I described, but
> > for some reason the "handle_edge_irq()" does not. And the reason is
> > actually a buglet: it needs to mask things for the "recursive interrupt"
> > case.
>
> Btw, just to clarify: none of this happens at the actual "irq_disable()"
> time: it only happens if you get an interrupt _while_ it's disabled. Which
> obviously shouldn't happen in the shutdown/wakeup path anyway for MSI,
> since the interrupts aren't shared, but it would be good to just be extra
> safe.
>
> I do suspect we could/should just get rid of the msi masking entirely, but
> that may be too scary a step.
I agree :-) It's been a source of problem anyway, I remember hearing
some device reacting strangely (aka, losing MSIs) when masked, and it's
not even implemented by all devices (somebody had the "smart" idea of
making an optional feature). So it ends up being a lot of not very
useful code...
> For the current suspend/resume situation, maybe it's enough to know that
> it shouldn't be happening anyway, and even if it _does_ happen on a device
> that has been shut down, it's just not going to do anything. Sure, it's
> doing that "writel/readl", but if it gets lost, who really cares? Nobody.
Note that I still believe that life would be simpler to just keep the
current local_irq_save() and just tweak might_sleep() etc... so that
ACPI is safe to call. At least it can be done orthogonally to this
interrupt change.
IE. As I said earlier, that interrupt masking -will- change the exposed
semantics of suspend_late() to drivers. In fact, iirc, it's you who
advertised initially suspend_late() as being the 'easy' way to write a
PCI driver suspend routine specifically because you don't have to bother
about being re-entered from anywhere for things like request processing
etc... With the change, the kernel is essentially still operating,
timers are ticking, we may even be scheduling, so this important
assumption is gone. That means a lot more chances for driver to screw
up.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 21:04 ` Ingo Molnar
@ 2009-02-03 21:12 ` Thomas Gleixner
2009-02-04 10:07 ` Russell King
2009-02-03 21:18 ` Linus Torvalds
1 sibling, 1 reply; 98+ messages in thread
From: Thomas Gleixner @ 2009-02-03 21:12 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, David S. Miller, Jesse Barnes, Rafael J. Wysocki,
Benjamin Herrenschmidt, Linux Kernel Mailing List, Andreas Schwab,
Len Brown, Russell King
On Tue, 3 Feb 2009, Ingo Molnar wrote:
> So i'd still like your tentative Signed-off-by for your patch - it's i think
> not v2.6.29 material but if it stays problem free in testing we can try it
> in v2.6.30. If it causes problem it will be clearly bisectable and clearly
> revertable.
>
> Maybe we could split it in two: and for MSI we could introduce a 'simpler
> and faster' edge flow as well - and keep the legacy handler untouched. That
> way it's low-risk in its entirety. (and avoids the MSI ->mask complication
> as well.)
Yes, seperating out the MSI handler into it's own flow control is the
right way to go. We had trouble with real edge hardware and IIRC most
of the problems originated from ARM. rmk ??
Thanks,
tglx
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 21:04 ` Ingo Molnar
2009-02-03 21:12 ` Thomas Gleixner
@ 2009-02-03 21:18 ` Linus Torvalds
1 sibling, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 21:18 UTC (permalink / raw)
To: Ingo Molnar
Cc: David S. Miller, Thomas Gleixner, Jesse Barnes, Rafael J. Wysocki,
Benjamin Herrenschmidt, Linux Kernel Mailing List, Andreas Schwab,
Len Brown
On Tue, 3 Feb 2009, Ingo Molnar wrote:
>
> So i'd still like your tentative Signed-off-by for your patch - it's i think
> not v2.6.29 material but if it stays problem free in testing we can try it
> in v2.6.30. If it causes problem it will be clearly bisectable and clearly
> revertable.
Hmm. You can have my sign-off, because I certainly don't think the patch
is _wrong_, but I also don't think it's necessarily really worth it unless
somebody can show a real upside. As mentioned, I don't think this matters
in practice for the whole MSI suspend issue - doing the "mask()" thing
feels stupid and certainl;y wasn't what I expected us to do, but it really
shouldn't matter from an actual behavioural standpoint at all.
And if, as you say, we don't expect to have interrupts spread out across
CPUs (and with the current io-apic they certainly don't happen - I don't
know about x2apic) my performance arguments are bogus too.
So the patch probably is the right thing to do, but the upsides are slim
to nonexistent, and the downside you pointed out of somebody using that
"handle_edge_irq()" thing with an interrupt that turns out to be level
after all makes me worry a bit.
But feel free to put it in some experimental branch with my sign-off.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 21:11 ` Benjamin Herrenschmidt
@ 2009-02-03 21:53 ` Rafael J. Wysocki
2009-02-03 22:33 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 21:53 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Benjamin Herrenschmidt wrote:
> On Tue, 2009-02-03 at 11:19 -0800, Linus Torvalds wrote:
> >
> > On Tue, 3 Feb 2009, Linus Torvalds wrote:
> > >
> > > Hmm. The _normal_ simple irq handler does this the way I described, but
> > > for some reason the "handle_edge_irq()" does not. And the reason is
> > > actually a buglet: it needs to mask things for the "recursive interrupt"
> > > case.
> >
> > Btw, just to clarify: none of this happens at the actual "irq_disable()"
> > time: it only happens if you get an interrupt _while_ it's disabled. Which
> > obviously shouldn't happen in the shutdown/wakeup path anyway for MSI,
> > since the interrupts aren't shared, but it would be good to just be extra
> > safe.
> >
> > I do suspect we could/should just get rid of the msi masking entirely, but
> > that may be too scary a step.
>
> I agree :-) It's been a source of problem anyway, I remember hearing
> some device reacting strangely (aka, losing MSIs) when masked, and it's
> not even implemented by all devices (somebody had the "smart" idea of
> making an optional feature). So it ends up being a lot of not very
> useful code...
>
> > For the current suspend/resume situation, maybe it's enough to know that
> > it shouldn't be happening anyway, and even if it _does_ happen on a device
> > that has been shut down, it's just not going to do anything. Sure, it's
> > doing that "writel/readl", but if it gets lost, who really cares? Nobody.
>
> Note that I still believe that life would be simpler to just keep the
> current local_irq_save() and just tweak might_sleep() etc... so that
> ACPI is safe to call. At least it can be done orthogonally to this
> interrupt change.
That would change the ordering of ACPI method calls, which also is important
and prone to breaking, as I wrote in the original message.
> IE. As I said earlier, that interrupt masking -will- change the exposed
> semantics of suspend_late() to drivers. In fact, iirc, it's you who
> advertised initially suspend_late() as being the 'easy' way to write a
> PCI driver suspend routine specifically because you don't have to bother
> about being re-entered from anywhere for things like request processing
> etc... With the change, the kernel is essentially still operating,
> timers are ticking, we may even be scheduling, so this important
> assumption is gone. That means a lot more chances for driver to screw
> up.
Whatever we do, it will have drawbacks and the approach with moving
local_irq_disable() seems to be more straightforward to me.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 21:02 ` Benjamin Herrenschmidt
@ 2009-02-03 21:56 ` Rafael J. Wysocki
0 siblings, 0 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 21:56 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Benjamin Herrenschmidt wrote:
> On Tue, 2009-02-03 at 18:04 +0100, Rafael J. Wysocki wrote:
>
> > > Now, there's one subtle problem with resume in this picture. Namely, before
> > > running the "early resume of devices" we have to make sure that the interrupts
> > > will be masked. However, masking MSI-X, for example, means writing into
> > > the memory space of the device, so we can't do it at this point. Of course, we
> > > can assume that MSI/MSI-X will be masked when we get control from the BIOS
> > > (moreover, they are not shareable, so we can just ignore them at this point),
> > > but still we'll have to mask the other interrupts before doing the
> > > local_irq_enable() on resume - marked by the (*) above. This appears to be
> > > doable, though.
>
> Which is why I prefer making mutex/semaphores/allocations "safe" to use
> in that late suspend phase with IRQs off.
>
> It sounds like a less invasive thing, simpler, change, allowing to move
> the ACPI stuff back to where it belongs, and it would help solving other
> problems such as the problems I exposed with video resume, which I'm
> trying to do -very- early (ie, before sysdev's even).
>
> In fact, as I may have said elsewhere, I'm also being bitten by the PCI
> layer doing kmalloc(...GFP_KERNEL) all over the place nowadays including
> in things like pci_get_device() which are hurting some memory controller
> code I have that runs in late suspend (I could refactor that code to
> do the pci_get_* earlier, it's just one more thing..).
>
> > Having reconsidered it, I think that the "loop of disable_irq()" may be
> > problematic due to MSI/MSI-X and devices that are put into D3 during the
> > "normal" suspend. That is, we shouldn't try to mask MSI/MSI-X for devices in
> > D3 (especially MSI-X, since that involves writing to the device's memory
> > space). This implies that devices in D3 should be avoided in the "loop of
> > disable_irq()", but that could be tricky if we loop over struct irq_desc
> > objects.
> >
> > Still, we can modify pci_pm_suspend() (and the other PCI callbacks analogously)
> > so that it masks the interrupt of the device right before returning to the
> > caller if the device has not been put into a low power state before. After
> > that all devices will either be in low power states, so they won't be able to
> > generate interrupts, or have their interrupts masked. In the latter case the
> > core can then put them into low power states in suspend_late().
>
> That's going to be hard to get right vs. shared interrupts no ?
>
> I think the "other" solution overall is much more simple.
No, it is not and the reason is the ACPI ordering (sorry for repeating myself).
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 17:53 ` PCI PM: Restore standard config registers of all devices early Linus Torvalds
@ 2009-02-03 21:57 ` Rafael J. Wysocki
0 siblings, 0 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 21:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Linus Torvalds wrote:
>
> On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
> >
> > As I said, I tend to prefer the "loop of disable_irq()" approach, because it
> > would allow us to preserve the current ordering of ACPI operations. Namely,
> > if we do:
> >
> > suspend devices (normal suspend)
> > loop of disable_irq()
> > late suspend of devices
> > _PTS
> > disable nonboot CPUs
> > local_irq_disable()
> > sysdev suspend
> > enter sleep state
> > get control from the BIOS
> > sysdev resume
> > (*)
> > local_irq_enable()
> > enable nonboot CPUs
> > _WAK
> > early resume of devices
> > loop of enable_irq()
> > resume devices (normal resume)
> >
> > the ordering of _PTS with respect to putting devices into low power states and
> > disabling the nonboot CPUs will be the same as it is now and the same applies
> > to _WAK and putting devices into D0 etc. (I really _really_ wouldn't like to
> > change this ordering, since this alone is likely to break things badly).
>
> Yes.
>
> Also, make the "loop of disable/enable_irq()" phase be a helper function
> that also sets system_state to SYSTEM_SUSPENDING/SYSTEM_RUNNING
> respectively, and it should all be pretty clean, and the changes really
> should be pretty minimal.
OK
So, I'm going to implement something along these lines. We'll see how it
works out.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 23:49 ` Ingo Molnar
@ 2009-02-03 22:09 ` Rafael J. Wysocki
2009-02-03 23:13 ` Linus Torvalds
0 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 22:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Benjamin Herrenschmidt, Linux Kernel Mailing List,
Jesse Barnes, Andreas Schwab, Len Brown
On Tuesday 03 February 2009, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Ingo: how do we walk over all the interrupt descriptors in todays world?
> > The whole sparse-vs-nonsparse makes things a bit more complex.
>
> this:
>
> struct irq_desc *desc;
> int irq;
>
> for_each_irq_desc(i, desc) {
> ...
> }
>
> should do the trick.
So, what do I do in the loop to disable irqs for all devices on the IOAPIC
level. Would disable_irq(irq) be sufficient?
Also, how can I skip the timer interrupt (we don't want to disable that)?
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 21:53 ` Rafael J. Wysocki
@ 2009-02-03 22:33 ` Benjamin Herrenschmidt
2009-02-03 22:44 ` Rafael J. Wysocki
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 22:33 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 2009-02-03 at 22:53 +0100, Rafael J. Wysocki wrote:
> That would change the ordering of ACPI method calls, which also is important
> and prone to breaking, as I wrote in the original message.
Ok, I'm not -that- familiar with ACPI, but I don't see where this
ordering change you seem to fear is ... Ie, what gets re-ordered vs.
what ?
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 22:33 ` Benjamin Herrenschmidt
@ 2009-02-03 22:44 ` Rafael J. Wysocki
2009-02-03 23:05 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 22:44 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Benjamin Herrenschmidt wrote:
> On Tue, 2009-02-03 at 22:53 +0100, Rafael J. Wysocki wrote:
>
> > That would change the ordering of ACPI method calls, which also is important
> > and prone to breaking, as I wrote in the original message.
>
> Ok, I'm not -that- familiar with ACPI, but I don't see where this
> ordering change you seem to fear is ... Ie, what gets re-ordered vs.
> what ?
(Newer) ACPI says that devices should be put into low power states (presumably
with the help of appropriate ACPI AML routines) before the _PTS method is
called. In turn, we're supposed to disable nonboot CPUs after calling _PTS.
There is analogous requirement for the _WAK method during resume.
Currently, the suspend code ordering follows these rules, but if we move
the putting of devices into low power states into the suspend_late part, they
will have to be done after _PTS and that is likely to break things (we've
already had this problem once and I have really bad memories related to it).
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 15:48 ` Linus Torvalds
@ 2009-02-03 22:59 ` Benjamin Herrenschmidt
2009-02-03 23:23 ` Rafael J. Wysocki
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 22:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
> You've found a bug somewhere.
Yup :-)
> We _should_ be saving things, the legacy code does something like this:
>
> if (drv && drv->suspend) {
> pci_dev->state_saved = false;
>
> i = drv->suspend(pci_dev, state);
> suspend_report_result(drv->suspend, i);
> if (i)
> return i;
>
> if (pci_dev->state_saved)
> goto Fixup;
>
> if (WARN_ON_ONCE(pci_dev->current_state != PCI_D0))
> goto Fixup;
It looks like the above is what breaks. Looks like current_state is
UNKNOWN. The device is an old mach64 that has no PCI PM capability, thus
the driver doesn't call any PCI PM stuff, the state basically stays set
to what the core set it to at probe time which appears to be
PCI_UNKNOWN.
Thus we don't call pci_save_state().
Then ...
> }
>
> pci_save_state(pci_dev);
>
> ie if your ->suspend function doesn't use pci_save_state() itself (which
> sets that "state_saved" flag to true), then the generic code will do it
> for you.
>
> Also, on the resume path, we actually have
>
> if (pci_dev->state_saved)
> pci_restore_standard_config(pci_dev);
>
> so I wonder how the heck you got that blast of all zeroes - because we
> clearly shouldn't be trying to restore any unsaved state!
Well, that's it ... we don't actually test pci_dev->state_saved in
whatever is currently upstream. The code is:
static void pci_pm_default_resume_noirq(struct pci_dev *pci_dev)
{
pci_restore_standard_config(pci_dev);
pci_fixup_device(pci_fixup_resume_early, pci_dev);
}
Oops...
Rafael, the second one is trivial to fix, but what about the first one ?
Should we not count UNKNOWN in that goto or should we set legacy stuff
that don't do PCI PM to PCI_D0 somewhere ? Or both ? :-)
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 22:44 ` Rafael J. Wysocki
@ 2009-02-03 23:05 ` Benjamin Herrenschmidt
2009-02-03 23:18 ` Linus Torvalds
2009-02-03 23:25 ` Rafael J. Wysocki
0 siblings, 2 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-03 23:05 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
> (Newer) ACPI says that devices should be put into low power states (presumably
> with the help of appropriate ACPI AML routines) before the _PTS method is
> called. In turn, we're supposed to disable nonboot CPUs after calling _PTS.
> There is analogous requirement for the _WAK method during resume.
>
> Currently, the suspend code ordering follows these rules, but if we move
> the putting of devices into low power states into the suspend_late part, they
> will have to be done after _PTS and that is likely to break things (we've
> already had this problem once and I have really bad memories related to it).
Wait wait wait ... the -whole- point of the exercise, wether using
local_irq_save or disable_irq, -is- to put the ACPI bit -after- setting
the device in low power state and before the restore on wakeup...
So basically, that isn't changing.
The -one- thing that indeed conflicts here is that we disable nonboot
CPUs earlier. Right ?
Now, I doubt that would be a big issue, ie, we are supposedly capable of
dynamically disabling/enabling CPUs anyway, but if it is, then indeed I
see how the using of higher level PIC irq disabling would allow to move
the whole suspend_late() over to before disabling non-boot CPUs.
That does introduce a significant change in semantics for drivers in the
sense that now, suspend_late will be called with timers running, things
scheduling, requests coming in, etc... One of the big reasons for doing
suspend_late with IRQs off was precisely that drivers wouldn't have to
synchronize with all these things. They now do.
Thus I tend to think that keeping the disabling of nonboot CPUs earlier
than the suspending of devices is the least of two evils. _BUT_ As I
said, I'm no ACPI expert and not -that- familiar with x86 land, it might
indeed be a can of worms.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 22:09 ` Rafael J. Wysocki
@ 2009-02-03 23:13 ` Linus Torvalds
0 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 23:13 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Ingo Molnar, Benjamin Herrenschmidt, Linux Kernel Mailing List,
Jesse Barnes, Andreas Schwab, Len Brown
On Tue, 3 Feb 2009, Rafael J. Wysocki wrote:
> >
> > struct irq_desc *desc;
> > int irq;
> >
> > for_each_irq_desc(i, desc) {
> > ...
> > }
> >
> > should do the trick.
>
> So, what do I do in the loop to disable irqs for all devices on the IOAPIC
> level. Would disable_irq(irq) be sufficient?
Yes. It literally should be something as simple as
void disable_device_irqs(void)
{
struct irq_desc *desc;
int i;
for_each_irq_desc(i, desc)
disable_irq(i);
}
void enable_device_irqs(void)
{
struct irq_desc *desc;
int i;
for_each_irq_desc(i, desc)
enable_irq(i);
}
although we might do something smarter eventually (ie start adding tests
for specific flags etc - at that point we'll also need to mark which
interrupts we've disabled so that we get the disable/enable nesting
right).
And it probably makes sense to do this inside kernel/irq/manage.c or
similar, since any extensions are going to be very aware of irq internals.
For example, the thing might well end up taking the desc->lock and peek
inside the descriptor to see if it's really a "device" interrupt or a
"system" interrupt etc.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 23:05 ` Benjamin Herrenschmidt
@ 2009-02-03 23:18 ` Linus Torvalds
2009-02-04 0:27 ` Benjamin Herrenschmidt
2009-03-04 8:02 ` Pavel Machek
2009-02-03 23:25 ` Rafael J. Wysocki
1 sibling, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-03 23:18 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Wed, 4 Feb 2009, Benjamin Herrenschmidt wrote:
>
> Wait wait wait ... the -whole- point of the exercise, wether using
> local_irq_save or disable_irq, -is- to put the ACPI bit -after- setting
> the device in low power state and before the restore on wakeup...
You're missing context.
ACPI isn't just for the device power settings. It's for the CPU's too, and
it's for things like "prepare to sleep". And they all have _different_
requirements.
What Rafael is trying to tell you is that we have ACPI-initiated ordering
requirements between turning CPU's off, and turning devices off, and he's
not willing to change that.
> The -one- thing that indeed conflicts here is that we disable nonboot
> CPUs earlier. Right ?
This.
> Now, I doubt that would be a big issue
With ACPI, there is no such thing as a "big issue". There are only tons of
small horrid details that. And the "big issue" is that all the small
details are insane.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 22:59 ` Benjamin Herrenschmidt
@ 2009-02-03 23:23 ` Rafael J. Wysocki
0 siblings, 0 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 23:23 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Linux Kernel Mailing List, Jesse Barnes,
Andreas Schwab, Len Brown, Ingo Molnar
On Tuesday 03 February 2009, Benjamin Herrenschmidt wrote:
>
> > You've found a bug somewhere.
>
> Yup :-)
>
> > We _should_ be saving things, the legacy code does something like this:
> >
> > if (drv && drv->suspend) {
> > pci_dev->state_saved = false;
> >
> > i = drv->suspend(pci_dev, state);
> > suspend_report_result(drv->suspend, i);
> > if (i)
> > return i;
> >
> > if (pci_dev->state_saved)
> > goto Fixup;
> >
> > if (WARN_ON_ONCE(pci_dev->current_state != PCI_D0))
> > goto Fixup;
>
> It looks like the above is what breaks. Looks like current_state is
> UNKNOWN. The device is an old mach64 that has no PCI PM capability, thus
> the driver doesn't call any PCI PM stuff, the state basically stays set
> to what the core set it to at probe time which appears to be
> PCI_UNKNOWN.
Overlooked, sorry.
> Thus we don't call pci_save_state().
>
> Then ...
>
> > }
> >
> > pci_save_state(pci_dev);
> >
> > ie if your ->suspend function doesn't use pci_save_state() itself (which
> > sets that "state_saved" flag to true), then the generic code will do it
> > for you.
> >
> > Also, on the resume path, we actually have
> >
> > if (pci_dev->state_saved)
> > pci_restore_standard_config(pci_dev);
> >
> > so I wonder how the heck you got that blast of all zeroes - because we
> > clearly shouldn't be trying to restore any unsaved state!
>
> Well, that's it ... we don't actually test pci_dev->state_saved in
> whatever is currently upstream. The code is:
>
> static void pci_pm_default_resume_noirq(struct pci_dev *pci_dev)
> {
> pci_restore_standard_config(pci_dev);
> pci_fixup_device(pci_fixup_resume_early, pci_dev);
> }
>
> Oops...
The assumption here is that the state will be saved either by the driver
or by the core, so the bug here is a consequence of the previous one.
I'll add a check.
> Rafael, the second one is trivial to fix, but what about the first one ?
> Should we not count UNKNOWN in that goto or should we set legacy stuff
> that don't do PCI PM to PCI_D0 somewhere ? Or both ? :-)
Well, if the state is UNKNOWN, I think it's safe to save the config, so the
WARN_ON should really catch the low power states only. I'll fix that.
Thanks for debugging it!
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 23:05 ` Benjamin Herrenschmidt
2009-02-03 23:18 ` Linus Torvalds
@ 2009-02-03 23:25 ` Rafael J. Wysocki
2009-02-04 0:46 ` Linus Torvalds
1 sibling, 1 reply; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-02-03 23:25 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Wednesday 04 February 2009, Benjamin Herrenschmidt wrote:
>
> > (Newer) ACPI says that devices should be put into low power states (presumably
> > with the help of appropriate ACPI AML routines) before the _PTS method is
> > called. In turn, we're supposed to disable nonboot CPUs after calling _PTS.
> > There is analogous requirement for the _WAK method during resume.
> >
> > Currently, the suspend code ordering follows these rules, but if we move
> > the putting of devices into low power states into the suspend_late part, they
> > will have to be done after _PTS and that is likely to break things (we've
> > already had this problem once and I have really bad memories related to it).
>
> Wait wait wait ... the -whole- point of the exercise, wether using
> local_irq_save or disable_irq, -is- to put the ACPI bit -after- setting
> the device in low power state and before the restore on wakeup...
>
> So basically, that isn't changing.
>
> The -one- thing that indeed conflicts here is that we disable nonboot
> CPUs earlier. Right ?
>
> Now, I doubt that would be a big issue, ie, we are supposedly capable of
> dynamically disabling/enabling CPUs anyway, but if it is, then indeed I
> see how the using of higher level PIC irq disabling would allow to move
> the whole suspend_late() over to before disabling non-boot CPUs.
>
> That does introduce a significant change in semantics for drivers in the
> sense that now, suspend_late will be called with timers running, things
> scheduling, requests coming in, etc... One of the big reasons for doing
> suspend_late with IRQs off was precisely that drivers wouldn't have to
> synchronize with all these things. They now do.
I realize the problem, but IMO only a few drivers will be affected, since only
a few of them implement suspend_late/resume_early.
> Thus I tend to think that keeping the disabling of nonboot CPUs earlier
> than the suspending of devices is the least of two evils. _BUT_ As I
> said, I'm no ACPI expert and not -that- familiar with x86 land, it might
> indeed be a can of worms.
It is. :-)
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 23:18 ` Linus Torvalds
@ 2009-02-04 0:27 ` Benjamin Herrenschmidt
2009-03-04 8:02 ` Pavel Machek
1 sibling, 0 replies; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-04 0:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Tue, 2009-02-03 at 15:18 -0800, Linus Torvalds wrote:
> With ACPI, there is no such thing as a "big issue". There are only tons of
> small horrid details that. And the "big issue" is that all the small
> details are insane.
Allright, fair enough :-) Here's my grand plan defeated by the uglyness
of ACPI ... argh ! :-)
In any case, both approach will solve the problem at hand and I'll be
able to move some of that pmac uglyness into the platform hooks.
BTW. Do we want one or two system states ? IE. regarding silently
"upgrading" GFP_KERNEL, I believe it needs to be done in two different
phases. The main suspend loop (after all the prepare() calls complete)
should be done with GFP_KERNEL -> GFP_NOIO, and the last phase with IRQs
really disabled (local_irq_save, aka. sysdev's) could use being done
with GFP_KERNEL -> GFP_ATOMIC.
The later will not be used much, but with the PCI layer now doing
kmalloc GFP_KERNEL all over the place in trivial things like pci_get_*,
I think it's worth it as sysdev's like cpufreq, or other low level arch
bits like memory controller tweaking does involve mucking around with
PCI devices.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 23:25 ` Rafael J. Wysocki
@ 2009-02-04 0:46 ` Linus Torvalds
0 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2009-02-04 0:46 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Benjamin Herrenschmidt, Jesse Barnes, Linux Kernel Mailing List,
Andreas Schwab, Len Brown, Ingo Molnar
On Wed, 4 Feb 2009, Rafael J. Wysocki wrote:
>
> I realize the problem, but IMO only a few drivers will be affected, since only
> a few of them implement suspend_late/resume_early.
Also, since we are fixing the irq issues and the basic PCI suspend/resume
logic to be handled by the generic code, I suspect a lot of the drivers
that _do_ the late/early stuff will be happy to simply get rid of them.
Linus
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 21:12 ` Thomas Gleixner
@ 2009-02-04 10:07 ` Russell King
0 siblings, 0 replies; 98+ messages in thread
From: Russell King @ 2009-02-04 10:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Ingo Molnar, Linus Torvalds, David S. Miller, Jesse Barnes,
Rafael J. Wysocki, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Andreas Schwab, Len Brown
On Tue, Feb 03, 2009 at 10:12:04PM +0100, Thomas Gleixner wrote:
> On Tue, 3 Feb 2009, Ingo Molnar wrote:
> > So i'd still like your tentative Signed-off-by for your patch - it's i think
> > not v2.6.29 material but if it stays problem free in testing we can try it
> > in v2.6.30. If it causes problem it will be clearly bisectable and clearly
> > revertable.
> >
> > Maybe we could split it in two: and for MSI we could introduce a 'simpler
> > and faster' edge flow as well - and keep the legacy handler untouched. That
> > way it's low-risk in its entirety. (and avoids the MSI ->mask complication
> > as well.)
>
> Yes, seperating out the MSI handler into it's own flow control is the
> right way to go. We had trouble with real edge hardware and IIRC most
> of the problems originated from ARM. rmk ??
No idea, too long ago.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: PCI PM: Restore standard config registers of all devices early
2009-02-02 22:56 ` Rafael J. Wysocki
2009-02-03 0:11 ` Benjamin Herrenschmidt
@ 2009-02-10 20:25 ` Pavel Machek
1 sibling, 0 replies; 98+ messages in thread
From: Pavel Machek @ 2009-02-10 20:25 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Benjamin Herrenschmidt, Linux Kernel Mailing List,
Jesse Barnes, Andreas Schwab, Len Brown
Hi!
> > I'll have a look at the ACPI thing.
>
> Generally speaking, we'd need to run acpi_evaluate_object() with interrupts
> off.
>
> There are two apparent problems with that, from a quick look:
>
> * The ACPI_MTX_INTERPRETER mutex needs to be acquired, but we know we won't
> need that mutex with interrupts off, so presumably we can work around this.
In such case we either need refrigerator, or we should grab
MTX_INTERPRETER before we start suspending, hmm.
> * Memory allocations with GFP_KERNEL are made, which is even worse, because
> we really shouldn't do that during suspend _at_ _all_, even during the regular
> ->suspend() with interrupts on, because there's not guarantee that swap will
> will be available at that time. So, for the sake of correctness, we should
> get rid of the GFP_KERNEL from the ACPI code paths executed during
> suspend-resume anyway.
Well, we have some small reserves for stuff like that...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 98+ messages in thread
* kmalloc during suspend, was Re: PCI PM: Restore standard config registers of all devices early
2009-02-03 1:03 ` Benjamin Herrenschmidt
@ 2009-02-10 20:25 ` Pavel Machek
0 siblings, 0 replies; 98+ messages in thread
From: Pavel Machek @ 2009-02-10 20:25 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Rafael J. Wysocki, Linus Torvalds, Linux Kernel Mailing List,
Jesse Barnes, Andreas Schwab, Len Brown, Ingo Molnar
On Tue 2009-02-03 12:03:46, Benjamin Herrenschmidt wrote:
>
> > > (*) There are reasons to think that kmalloc/gfp should both silently
> > > turn into GFP_NOIO always while the suspend process is started, but
> > > that's somewhat a different subject. Rafael, did we ever act on that ?
> > > It's an old discussion we had but I don't know if we actually
> > > implemented anything.
> >
> > We have the ->prepare(), ->complete() callbacks that, among other things,
> > can be used for allocating and freeing memory with GFP_KERNEL safely.
>
> First, that's assuming drivers will be smart enough to figure that out.
> They won't, believe me.
For big alocations, I see no other way. If someone wants 100MB for
firmmware image...
> Then, as I said, this doesn't work in practice because there is no way
> drivers and/or underlying subsystems will start "remembering" when they
> are within a prepare/complete pair, and suddenly do allocations
> differently. That's just going to break.
They could set a flag in ->prepare()... but yes, drivers will get that
wrong.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-02-03 23:18 ` Linus Torvalds
2009-02-04 0:27 ` Benjamin Herrenschmidt
@ 2009-03-04 8:02 ` Pavel Machek
2009-03-04 23:25 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 98+ messages in thread
From: Pavel Machek @ 2009-03-04 8:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, Rafael J. Wysocki, Jesse Barnes,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
On Tue 2009-02-03 15:18:14, Linus Torvalds wrote:
>
>
> On Wed, 4 Feb 2009, Benjamin Herrenschmidt wrote:
> >
> > Wait wait wait ... the -whole- point of the exercise, wether using
> > local_irq_save or disable_irq, -is- to put the ACPI bit -after- setting
> > the device in low power state and before the restore on wakeup...
>
> You're missing context.
>
> ACPI isn't just for the device power settings. It's for the CPU's too, and
> it's for things like "prepare to sleep". And they all have _different_
> requirements.
As long as we claim to support cpu hotplug, suspending the cpus early
should work. User might have unplugged the cpu manually...
If something is broken in acpi, we may need to disable cpu*/onlone on
affected systems.
Pavel
>
> What Rafael is trying to tell you is that we have ACPI-initiated ordering
> requirements between turning CPU's off, and turning devices off, and he's
> not willing to change that.
>
> > The -one- thing that indeed conflicts here is that we disable nonboot
> > CPUs earlier. Right ?
>
> This.
>
> > Now, I doubt that would be a big issue
>
> With ACPI, there is no such thing as a "big issue". There are only tons of
> small horrid details that. And the "big issue" is that all the small
> details are insane.
>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-03-04 8:02 ` Pavel Machek
@ 2009-03-04 23:25 ` Benjamin Herrenschmidt
2009-03-05 8:19 ` Pavel Machek
0 siblings, 1 reply; 98+ messages in thread
From: Benjamin Herrenschmidt @ 2009-03-04 23:25 UTC (permalink / raw)
To: Pavel Machek
Cc: Linus Torvalds, Rafael J. Wysocki, Jesse Barnes,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
> As long as we claim to support cpu hotplug, suspending the cpus early
> should work. User might have unplugged the cpu manually...
>
> If something is broken in acpi, we may need to disable cpu*/onlone on
> affected systems.
You seem to miss some context here... it seems that ACPI mandates that
CPUs are suspended after devices and you know that we can't just start
blacklisting half of the machines out there just because we happen not
to do the same as what Windows does...
So let's be on the safe side.
Cheers,
Ben
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-03-04 23:25 ` Benjamin Herrenschmidt
@ 2009-03-05 8:19 ` Pavel Machek
2009-03-05 19:09 ` Rafael J. Wysocki
0 siblings, 1 reply; 98+ messages in thread
From: Pavel Machek @ 2009-03-05 8:19 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Rafael J. Wysocki, Jesse Barnes,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
On Thu 2009-03-05 10:25:04, Benjamin Herrenschmidt wrote:
>
> > As long as we claim to support cpu hotplug, suspending the cpus early
> > should work. User might have unplugged the cpu manually...
> >
> > If something is broken in acpi, we may need to disable cpu*/onlone on
> > affected systems.
>
> You seem to miss some context here... it seems that ACPI mandates that
> CPUs are suspended after devices and you know that we can't just start
> blacklisting half of the machines out there just because we happen not
> to do the same as what Windows does...
I did not see that ACPI mandate... where is it?
That would interfere with user's ability to online/offline cpus,
badly.
Are you saying that we should online all the CPUs before starting
suspend to stay on the "safe side"?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
2009-03-05 8:19 ` Pavel Machek
@ 2009-03-05 19:09 ` Rafael J. Wysocki
0 siblings, 0 replies; 98+ messages in thread
From: Rafael J. Wysocki @ 2009-03-05 19:09 UTC (permalink / raw)
To: Pavel Machek
Cc: Benjamin Herrenschmidt, Linus Torvalds, Jesse Barnes,
Linux Kernel Mailing List, Andreas Schwab, Len Brown, Ingo Molnar
On Thursday 05 March 2009, Pavel Machek wrote:
> On Thu 2009-03-05 10:25:04, Benjamin Herrenschmidt wrote:
> >
> > > As long as we claim to support cpu hotplug, suspending the cpus early
> > > should work. User might have unplugged the cpu manually...
> > >
> > > If something is broken in acpi, we may need to disable cpu*/onlone on
> > > affected systems.
> >
> > You seem to miss some context here... it seems that ACPI mandates that
> > CPUs are suspended after devices and you know that we can't just start
> > blacklisting half of the machines out there just because we happen not
> > to do the same as what Windows does...
>
> I did not see that ACPI mandate... where is it?
15.1.6 in ACPI 3.0b (for example).
This is more complicated, though. Turning the CPUs on/off involves things
that need to be ordered with respect to _PTS, _WAK in a specific way (ie. just
like we do it right now, which is _PTS before disable_nonboot_cpus() and
enable_nonboot_cpus() before _WAK). Otherwise at least some machines will
break (verified experimentally some time ago).
Thanks,
Rafael
^ permalink raw reply [flat|nested] 98+ messages in thread
end of thread, other threads:[~2009-03-05 19:10 UTC | newest]
Thread overview: 98+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <200901261904.n0QJ4Q9c016709@hera.kernel.org>
2009-02-02 9:54 ` PCI PM: Restore standard config registers of all devices early Benjamin Herrenschmidt
2009-02-02 17:06 ` Linus Torvalds
2009-02-02 20:29 ` Benjamin Herrenschmidt
2009-02-02 20:41 ` Linus Torvalds
2009-02-02 21:00 ` Benjamin Herrenschmidt
2009-02-02 21:32 ` Rafael J. Wysocki
2009-02-02 20:33 ` Benjamin Herrenschmidt
2009-02-02 20:50 ` Linus Torvalds
2009-02-02 20:55 ` Linus Torvalds
2009-02-02 21:19 ` Benjamin Herrenschmidt
2009-02-02 21:39 ` Rafael J. Wysocki
2009-02-02 22:05 ` Linus Torvalds
2009-02-02 22:09 ` Linus Torvalds
2009-02-02 22:31 ` Rafael J. Wysocki
2009-02-02 23:18 ` Linus Torvalds
2009-02-02 23:45 ` Rafael J. Wysocki
2009-02-02 23:59 ` Linus Torvalds
2009-02-03 0:15 ` Rafael J. Wysocki
2009-02-03 0:28 ` Linus Torvalds
2009-02-03 1:12 ` Benjamin Herrenschmidt
2009-02-03 1:32 ` Linus Torvalds
2009-02-03 1:46 ` Benjamin Herrenschmidt
2009-02-03 3:30 ` Benjamin Herrenschmidt
2009-02-03 3:47 ` Linus Torvalds
2009-02-03 4:03 ` Benjamin Herrenschmidt
2009-02-03 6:07 ` Benjamin Herrenschmidt
2009-02-03 15:48 ` Linus Torvalds
2009-02-03 22:59 ` Benjamin Herrenschmidt
2009-02-03 23:23 ` Rafael J. Wysocki
2009-02-03 16:33 ` Jesse Barnes
2009-02-03 0:15 ` Linus Torvalds
2009-02-03 0:58 ` Benjamin Herrenschmidt
2009-02-03 3:51 ` Benjamin Herrenschmidt
2009-02-03 3:55 ` Benjamin Herrenschmidt
2009-02-03 4:09 ` Linus Torvalds
2009-02-03 4:21 ` Benjamin Herrenschmidt
2009-02-03 9:26 ` Rafael J. Wysocki
2009-02-03 17:04 ` Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early) Rafael J. Wysocki
2009-02-03 17:59 ` Linus Torvalds
2009-02-03 18:31 ` Linus Torvalds
2009-02-03 18:41 ` Ingo Molnar
2009-02-03 18:32 ` Jesse Barnes
2009-02-03 18:46 ` Linus Torvalds
2009-02-03 19:03 ` Linus Torvalds
2009-02-03 19:13 ` Ingo Molnar
2009-02-03 19:38 ` Linus Torvalds
2009-02-03 19:53 ` Ingo Molnar
2009-02-03 20:04 ` Ingo Molnar
2009-02-03 20:18 ` Linus Torvalds
2009-02-03 20:57 ` Ingo Molnar
2009-02-03 21:04 ` Ingo Molnar
2009-02-03 21:12 ` Thomas Gleixner
2009-02-04 10:07 ` Russell King
2009-02-03 21:18 ` Linus Torvalds
2009-02-03 19:19 ` Linus Torvalds
2009-02-03 21:11 ` Benjamin Herrenschmidt
2009-02-03 21:53 ` Rafael J. Wysocki
2009-02-03 22:33 ` Benjamin Herrenschmidt
2009-02-03 22:44 ` Rafael J. Wysocki
2009-02-03 23:05 ` Benjamin Herrenschmidt
2009-02-03 23:18 ` Linus Torvalds
2009-02-04 0:27 ` Benjamin Herrenschmidt
2009-03-04 8:02 ` Pavel Machek
2009-03-04 23:25 ` Benjamin Herrenschmidt
2009-03-05 8:19 ` Pavel Machek
2009-03-05 19:09 ` Rafael J. Wysocki
2009-02-03 23:25 ` Rafael J. Wysocki
2009-02-04 0:46 ` Linus Torvalds
2009-02-03 21:02 ` Benjamin Herrenschmidt
2009-02-03 21:56 ` Rafael J. Wysocki
2009-02-03 17:53 ` PCI PM: Restore standard config registers of all devices early Linus Torvalds
2009-02-03 21:57 ` Rafael J. Wysocki
2009-02-02 22:48 ` Benjamin Herrenschmidt
2009-02-02 23:00 ` Rafael J. Wysocki
2009-02-03 0:23 ` Benjamin Herrenschmidt
2009-02-03 0:29 ` Rafael J. Wysocki
2009-02-03 0:44 ` Linus Torvalds
2009-02-03 1:32 ` Benjamin Herrenschmidt
2009-02-03 5:06 ` Ingo Molnar
2009-02-03 11:06 ` Peter Zijlstra
2009-02-03 12:09 ` Ingo Molnar
2009-02-02 23:49 ` Ingo Molnar
2009-02-03 22:09 ` Rafael J. Wysocki
2009-02-03 23:13 ` Linus Torvalds
2009-02-02 22:28 ` Benjamin Herrenschmidt
2009-02-02 21:07 ` Benjamin Herrenschmidt
2009-02-02 21:49 ` Rafael J. Wysocki
2009-02-02 22:15 ` Linus Torvalds
2009-02-02 22:33 ` Rafael J. Wysocki
2009-02-02 22:56 ` Rafael J. Wysocki
2009-02-03 0:11 ` Benjamin Herrenschmidt
2009-02-03 0:21 ` Linus Torvalds
2009-02-10 20:25 ` Pavel Machek
2009-02-02 22:57 ` Benjamin Herrenschmidt
2009-02-02 23:22 ` Rafael J. Wysocki
2009-02-03 1:03 ` Benjamin Herrenschmidt
2009-02-10 20:25 ` kmalloc during suspend, was " Pavel Machek
2009-02-02 17:20 ` Rafael J. Wysocki
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox