Re: Linux 4.15-rc2: Regression in resume from ACPI S3

Linux Power Management development
 help / color / mirror / Atom feed

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
       [not found]                 ` <alpine.DEB.2.20.1712131507160.1885@nanos>
@ 2017-12-13 16:23                   ` Bjorn Helgaas
  2017-12-13 16:41                     ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Bjorn Helgaas @ 2017-12-13 16:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maarten Lankhorst, Michal Hocko, Linus Torvalds,
	Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

[+cc linux-pci, linux-pm]

On Wed, Dec 13, 2017 at 04:57:56PM +0100, Thomas Gleixner wrote:
> So I was finally able to figure out what the hell is going on:
> 
> Suspend:
> 
>  - The device suspend code puts the graphics card into a power
>    state != PCI_D0.
> 
>  - Offline non boot CPUs
> 
>  - Break interrupt affinity. Allocate new vector on CPU 0, compose and
>    write MSI message which ends up in:
> 
>    __pci_write_msi_msg(entry, msg)
>    {
> 	if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
> 	   /* Don't touch the hardware now */
> 	} else {
> 	   ....
> 	}
> 	entry->msg = *msg;
>    }
>  
>   So because the device is not in PCI_D0 the message is not written. It's
>   written in the device resume path.

I'm not a PM guru, but this ordering seems fragile.  If we offline
CPUs before re-targeting interrupts directed at those CPUs, aren't we
always going to be at risk of sending interrupts to an offline CPU?

Even if the device is now asleep and therefore should not generate an
interrupt, it seems like there's a window when the device returns to
PCI_D0 where it could generate an interrupt before we have a chance to
update the MSI message.

> Resume:
> [  139.670446] ACPI: Low-level resume complete
> [  139.670541] PM: Restoring platform NVS memory
> [  139.672462] do_IRQ: 0.55 No irq handler for vector
> [  139.672475] Enabling non-boot CPUs ...
> 
> So the spurious interrupt happens early and way before the device resume
> code writes the new MSI message.
> 
> I checked the behaviour on 4.14. The MSI write is delayed there in the same
> way, but there is no spurious interrupt. There is no interrupt coming in at
> all _BEFORE_ the device is put out of PCI_D0.
> 
> And this has certainly nothing to do with the vector management changes,
> but I can't figure yet what makes that spurious interrupt to be sent.
> 
> Any ideas welcome.
> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 16:23                   ` Linux 4.15-rc2: Regression in resume from ACPI S3 Bjorn Helgaas
@ 2017-12-13 16:41                     ` Thomas Gleixner
  2017-12-13 17:45                       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-13 16:41 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Maarten Lankhorst, Michal Hocko, Linus Torvalds,
	Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wed, 13 Dec 2017, Bjorn Helgaas wrote:
> [+cc linux-pci, linux-pm]
> 
> On Wed, Dec 13, 2017 at 04:57:56PM +0100, Thomas Gleixner wrote:
> > So I was finally able to figure out what the hell is going on:
> > 
> > Suspend:
> > 
> >  - The device suspend code puts the graphics card into a power
> >    state != PCI_D0.
> > 
> >  - Offline non boot CPUs
> > 
> >  - Break interrupt affinity. Allocate new vector on CPU 0, compose and
> >    write MSI message which ends up in:
> > 
> >    __pci_write_msi_msg(entry, msg)
> >    {
> > 	if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
> > 	   /* Don't touch the hardware now */
> > 	} else {
> > 	   ....
> > 	}
> > 	entry->msg = *msg;
> >    }
> >  
> >   So because the device is not in PCI_D0 the message is not written. It's
> >   written in the device resume path.
> 
> I'm not a PM guru, but this ordering seems fragile.  If we offline
> CPUs before re-targeting interrupts directed at those CPUs, aren't we
> always going to be at risk of sending interrupts to an offline CPU?
> 
> Even if the device is now asleep and therefore should not generate an
> interrupt, it seems like there's a window when the device returns to
> PCI_D0 where it could generate an interrupt before we have a chance to
> update the MSI message.

Definitely. That was fragile forever but puzzles me is that I can't figure
out what now causes that spurious interrupt to surface out of the blue.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 16:41                     ` Thomas Gleixner
@ 2017-12-13 17:45                       ` Linus Torvalds
  2017-12-13 18:19                         ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2017-12-13 17:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Definitely. That was fragile forever but puzzles me is that I can't figure
> out what now causes that spurious interrupt to surface out of the blue.

Perhaps just timing?

How hard would it be to change the ordering to just redirect irqs first?

                 Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 17:45                       ` Linus Torvalds
@ 2017-12-13 18:19                         ` Thomas Gleixner
  2017-12-13 20:52                           ` Thomas Gleixner
  2017-12-13 22:39                           ` Rafael J. Wysocki
  0 siblings, 2 replies; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-13 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wed, 13 Dec 2017, Linus Torvalds wrote:

> On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Definitely. That was fragile forever but puzzles me is that I can't figure
> > out what now causes that spurious interrupt to surface out of the blue.
> 
> Perhaps just timing?

That's what I'm trying to figure out right now, because that is the only
sensible explanation left. The whole machinery of suspend is exactly the
same with and without the vector changes. I instrumented all functions
involved and the picture is the same. I even do not see any fundamental
timing differences where one would say: That's it.

What puzzles me even more is that in the range of commits I'm fiddling with
there is no other change than the vector management stuff and the point
where it breaks makes no sense at all. The point Maarten bisected it to
works nicely here, so that might just point to a very subtle timing issue.

> How hard would it be to change the ordering to just redirect irqs first?

The whole interrupt redirection happens when the non boot CPUs are brought
down, which is the very last step before the actual suspend happens.

We could probably do that earlier, but that's something Rafael needs to
answer ultimately.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 18:19                         ` Thomas Gleixner
@ 2017-12-13 20:52                           ` Thomas Gleixner
  2017-12-13 21:06                             ` Thomas Gleixner
  2017-12-13 22:39                           ` Rafael J. Wysocki
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-13 20:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> On Wed, 13 Dec 2017, Linus Torvalds wrote:
> 
> > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Definitely. That was fragile forever but puzzles me is that I can't figure
> > > out what now causes that spurious interrupt to surface out of the blue.
> > 
> > Perhaps just timing?
> 
> That's what I'm trying to figure out right now, because that is the only
> sensible explanation left. The whole machinery of suspend is exactly the
> same with and without the vector changes. I instrumented all functions
> involved and the picture is the same. I even do not see any fundamental
> timing differences where one would say: That's it.
> 
> What puzzles me even more is that in the range of commits I'm fiddling with
> there is no other change than the vector management stuff and the point
> where it breaks makes no sense at all. The point Maarten bisected it to
> works nicely here, so that might just point to a very subtle timing issue.

After doing more debugging on this it turns out that this looks like a
legacy interrupt coming in. The vector number is always 55, which is legacy
IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is
masked and vector 55 is completely unused.

More questions than answers. Still investigating.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 20:52                           ` Thomas Gleixner
@ 2017-12-13 21:06                             ` Thomas Gleixner
  2017-12-13 22:48                               ` Rafael J. Wysocki
  2017-12-14 11:54                               ` Thomas Gleixner
  0 siblings, 2 replies; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-13 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > On Wed, 13 Dec 2017, Linus Torvalds wrote:
> > 
> > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > Definitely. That was fragile forever but puzzles me is that I can't figure
> > > > out what now causes that spurious interrupt to surface out of the blue.
> > > 
> > > Perhaps just timing?
> > 
> > That's what I'm trying to figure out right now, because that is the only
> > sensible explanation left. The whole machinery of suspend is exactly the
> > same with and without the vector changes. I instrumented all functions
> > involved and the picture is the same. I even do not see any fundamental
> > timing differences where one would say: That's it.
> > 
> > What puzzles me even more is that in the range of commits I'm fiddling with
> > there is no other change than the vector management stuff and the point
> > where it breaks makes no sense at all. The point Maarten bisected it to
> > works nicely here, so that might just point to a very subtle timing issue.
> 
> After doing more debugging on this it turns out that this looks like a
> legacy interrupt coming in. The vector number is always 55, which is legacy
> IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is
> masked and vector 55 is completely unused.
> 
> More questions than answers. Still investigating.

And it does not explain Maartens report which gets a spurious vector 33 on
CPU4 after the non boot cpus have been brought online again. And that's the
vector which was assigned before the affinity was moved by unplugging CPU4.

Hrmpf. Even more mystery to solve.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 18:19                         ` Thomas Gleixner
  2017-12-13 20:52                           ` Thomas Gleixner
@ 2017-12-13 22:39                           ` Rafael J. Wysocki
  2017-12-13 23:26                             ` Rafael J. Wysocki
  1 sibling, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-13 22:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wednesday, December 13, 2017 7:19:17 PM CET Thomas Gleixner wrote:
> On Wed, 13 Dec 2017, Linus Torvalds wrote:
> 
> > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Definitely. That was fragile forever but puzzles me is that I can't figure
> > > out what now causes that spurious interrupt to surface out of the blue.
> > 
> > Perhaps just timing?
> 
> That's what I'm trying to figure out right now, because that is the only
> sensible explanation left. The whole machinery of suspend is exactly the
> same with and without the vector changes. I instrumented all functions
> involved and the picture is the same. I even do not see any fundamental
> timing differences where one would say: That's it.
> 
> What puzzles me even more is that in the range of commits I'm fiddling with
> there is no other change than the vector management stuff and the point
> where it breaks makes no sense at all. The point Maarten bisected it to
> works nicely here, so that might just point to a very subtle timing issue.
> 
> > How hard would it be to change the ordering to just redirect irqs first?
> 
> The whole interrupt redirection happens when the non boot CPUs are brought
> down, which is the very last step before the actual suspend happens.
> 
> We could probably do that earlier, but that's something Rafael needs to
> answer ultimately.

Well, that's both flattering and concerning. ;-)

Anyway, yes, we can do that earlier AFAICS.  Action handlers are not going to
run after we've called suspend_device_irqs() which happens before the final
stage of PCI devices suspend (suspend_noirq) and it doesn't matter which CPU
gets the interrupt from that point on (it is either wakeup or unwanted then).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 21:06                             ` Thomas Gleixner
@ 2017-12-13 22:48                               ` Rafael J. Wysocki
  2017-12-14 11:54                               ` Thomas Gleixner
  1 sibling, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-13 22:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wednesday, December 13, 2017 10:06:40 PM CET Thomas Gleixner wrote:
> On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > > On Wed, 13 Dec 2017, Linus Torvalds wrote:
> > > 
> > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >
> > > > > Definitely. That was fragile forever but puzzles me is that I can't figure
> > > > > out what now causes that spurious interrupt to surface out of the blue.
> > > > 
> > > > Perhaps just timing?
> > > 
> > > That's what I'm trying to figure out right now, because that is the only
> > > sensible explanation left. The whole machinery of suspend is exactly the
> > > same with and without the vector changes. I instrumented all functions
> > > involved and the picture is the same. I even do not see any fundamental
> > > timing differences where one would say: That's it.
> > > 
> > > What puzzles me even more is that in the range of commits I'm fiddling with
> > > there is no other change than the vector management stuff and the point
> > > where it breaks makes no sense at all. The point Maarten bisected it to
> > > works nicely here, so that might just point to a very subtle timing issue.
> > 
> > After doing more debugging on this it turns out that this looks like a
> > legacy interrupt coming in. The vector number is always 55, which is legacy
> > IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is
> > masked and vector 55 is completely unused.
> > 
> > More questions than answers. Still investigating.
> 
> And it does not explain Maartens report which gets a spurious vector 33 on
> CPU4 after the non boot cpus have been brought online again. And that's the
> vector which was assigned before the affinity was moved by unplugging CPU4.
> 
> Hrmpf. Even more mystery to solve.

Any chance to look at /proc/interrupts from a machine where that can be
reproduced?

I'm also curious if that can be reproduced by doing CPU offline/online
without suspending?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 22:39                           ` Rafael J. Wysocki
@ 2017-12-13 23:26                             ` Rafael J. Wysocki
  0 siblings, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-13 23:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, Linux PCI, Linux PM

On Wed, Dec 13, 2017 at 11:39 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Wednesday, December 13, 2017 7:19:17 PM CET Thomas Gleixner wrote:
>> On Wed, 13 Dec 2017, Linus Torvalds wrote:
>>
>> > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > >
>> > > Definitely. That was fragile forever but puzzles me is that I can't figure
>> > > out what now causes that spurious interrupt to surface out of the blue.
>> >
>> > Perhaps just timing?
>>
>> That's what I'm trying to figure out right now, because that is the only
>> sensible explanation left. The whole machinery of suspend is exactly the
>> same with and without the vector changes. I instrumented all functions
>> involved and the picture is the same. I even do not see any fundamental
>> timing differences where one would say: That's it.
>>
>> What puzzles me even more is that in the range of commits I'm fiddling with
>> there is no other change than the vector management stuff and the point
>> where it breaks makes no sense at all. The point Maarten bisected it to
>> works nicely here, so that might just point to a very subtle timing issue.
>>
>> > How hard would it be to change the ordering to just redirect irqs first?
>>
>> The whole interrupt redirection happens when the non boot CPUs are brought
>> down, which is the very last step before the actual suspend happens.
>>
>> We could probably do that earlier, but that's something Rafael needs to
>> answer ultimately.
>
> Well, that's both flattering and concerning. ;-)
>
> Anyway, yes, we can do that earlier AFAICS.  Action handlers are not going to
> run after we've called suspend_device_irqs() which happens before the final
> stage of PCI devices suspend (suspend_noirq) and it doesn't matter which CPU
> gets the interrupt from that point on (it is either wakeup or unwanted then).

There is a catch that we don't and likely should not do that for
suspend-to-idle, but since we have pm_suspend_target_state now, that
case can be distinguished from the "full suspend" one readily.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-13 21:06                             ` Thomas Gleixner
  2017-12-13 22:48                               ` Rafael J. Wysocki
@ 2017-12-14 11:54                               ` Thomas Gleixner
  2017-12-14 12:12                                 ` Rafael J. Wysocki
                                                   ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-14 11:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > > On Wed, 13 Dec 2017, Linus Torvalds wrote:
> > > 
> > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >
> > > > > Definitely. That was fragile forever but puzzles me is that I can't figure
> > > > > out what now causes that spurious interrupt to surface out of the blue.
> > > > 
> > > > Perhaps just timing?
> > > 
> > > That's what I'm trying to figure out right now, because that is the only
> > > sensible explanation left. The whole machinery of suspend is exactly the
> > > same with and without the vector changes. I instrumented all functions
> > > involved and the picture is the same. I even do not see any fundamental
> > > timing differences where one would say: That's it.
> > > 
> > > What puzzles me even more is that in the range of commits I'm fiddling with
> > > there is no other change than the vector management stuff and the point
> > > where it breaks makes no sense at all. The point Maarten bisected it to
> > > works nicely here, so that might just point to a very subtle timing issue.
> > 
> > After doing more debugging on this it turns out that this looks like a
> > legacy interrupt coming in. The vector number is always 55, which is legacy
> > IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is
> > masked and vector 55 is completely unused.
> > 
> > More questions than answers. Still investigating.

At least that one could be explained by the changes. In the previous
management scheme the IOAPIC interrupts were always allocated even when the
interrupt was not in use. The new scheme does not longer do that because
people complained about the vector waste (16 vectors on each CPU) and it
got rid of all the special casing of IRQ0-15.

So the old scheme silently consumed the spurious vector. I added debug code
to that effect to 4.14 and on that machine IRQ7 is triggered at the same
point post resume and the core code drops it silently because the interrupt
is marked masked and no action assigned.

So the only difference to today is that the new code complains, while the
old one does an extra mask of the already masked IOAPIC pin and silently
returns.

After quite some investigation I found out that its independent of the
graphics thing. That's a genuine issue on that platform which seems to emit
random legacy vectors which were never ever used for unknown reasons. I
verified that both the IOAPIC and the PIC are masked, so they cannot send
crap. Though it turned out that the silly firmware unmasks the PIC and
leaves it that way when it returns from suspend. Now there is a race
whether the kernel resume path manages to mask the PIC again early enough
before something triggers IRQ7 or not. Adding/removing debug code makes the
problem come and go. So I really don't worry about that one and rather
prefer to have the spurious interrupt printed than silently consumed by
chance.

Now the graphics issue is a different story. That only happens on
hibernation after doing the snapshot. There all non boot cpus are onlined
again and after that the devices are 'thawed'. The following reenable of
interrupts fails because i915 is not in PCI_D0 state.

Suspend:

   irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
   __pci_write_msi_msg: Not written <- Device not in PCI_D0
   ....
   device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume]
   pci_pm_resume_noirq <-dpm_run_callback
   pci_pm_resume_noirq <-dpm_run_callback
   pci_pm_default_resume_early <-pci_pm_resume_noirq
   pci_pm_default_resume_early <-pci_pm_resume_noirq
   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a  <-- Set the new affinity
   device_pm_callback_end: i915 0000:00:02.0, err=0

Hibernate:

   irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
   __pci_write_msi_msg: Not written <- Device not in PCI_D0
   ....
   device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw]
   pci_pm_thaw_noirq <-dpm_run_callback
   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
   __pci_write_msi_msg: Not written  <--- Device is not in PCI_D0
   device_pm_callback_end: i915 0000:00:02.0, err=0

So that code path fails to set the new affinity because at the point where
the MSI msg should be written the device state is != PCI_D0.

Now, what's different vs. 4.14:

The 4.14 code accidentaly had the irq descriptor for this vector still
populated in the old CPU due to the convoluted way the vector allocation
worked. I have still to investigate if one of those cases is actually
leaking the descriptor, which would be a fatal bug.

But the new code does a proper cleanup and does not repopulate it on the
offline CPU. So that unearthes the issue. I'm handing that over to the PM
folks to look at. I got lost in that maze of callbacks.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 11:54                               ` Thomas Gleixner
@ 2017-12-14 12:12                                 ` Rafael J. Wysocki
  2017-12-14 12:30                                   ` Thomas Gleixner
  2017-12-14 13:24                                 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Thomas Gleixner
  2017-12-14 19:03                                 ` Linus Torvalds
  2 siblings, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-14 12:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thursday, December 14, 2017 12:54:05 PM CET Thomas Gleixner wrote:
> On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > > On Wed, 13 Dec 2017, Thomas Gleixner wrote:
> > > > On Wed, 13 Dec 2017, Linus Torvalds wrote:
> > > > 
> > > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > > >
> > > > > > Definitely. That was fragile forever but puzzles me is that I can't figure
> > > > > > out what now causes that spurious interrupt to surface out of the blue.
> > > > > 
> > > > > Perhaps just timing?
> > > > 
> > > > That's what I'm trying to figure out right now, because that is the only
> > > > sensible explanation left. The whole machinery of suspend is exactly the
> > > > same with and without the vector changes. I instrumented all functions
> > > > involved and the picture is the same. I even do not see any fundamental
> > > > timing differences where one would say: That's it.
> > > > 
> > > > What puzzles me even more is that in the range of commits I'm fiddling with
> > > > there is no other change than the vector management stuff and the point
> > > > where it breaks makes no sense at all. The point Maarten bisected it to
> > > > works nicely here, so that might just point to a very subtle timing issue.
> > > 
> > > After doing more debugging on this it turns out that this looks like a
> > > legacy interrupt coming in. The vector number is always 55, which is legacy
> > > IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is
> > > masked and vector 55 is completely unused.
> > > 
> > > More questions than answers. Still investigating.
> 
> At least that one could be explained by the changes. In the previous
> management scheme the IOAPIC interrupts were always allocated even when the
> interrupt was not in use. The new scheme does not longer do that because
> people complained about the vector waste (16 vectors on each CPU) and it
> got rid of all the special casing of IRQ0-15.
> 
> So the old scheme silently consumed the spurious vector. I added debug code
> to that effect to 4.14 and on that machine IRQ7 is triggered at the same
> point post resume and the core code drops it silently because the interrupt
> is marked masked and no action assigned.
> 
> So the only difference to today is that the new code complains, while the
> old one does an extra mask of the already masked IOAPIC pin and silently
> returns.
> 
> After quite some investigation I found out that its independent of the
> graphics thing. That's a genuine issue on that platform which seems to emit
> random legacy vectors which were never ever used for unknown reasons. I
> verified that both the IOAPIC and the PIC are masked, so they cannot send
> crap. Though it turned out that the silly firmware unmasks the PIC and
> leaves it that way when it returns from suspend. Now there is a race
> whether the kernel resume path manages to mask the PIC again early enough
> before something triggers IRQ7 or not. Adding/removing debug code makes the
> problem come and go. So I really don't worry about that one and rather
> prefer to have the spurious interrupt printed than silently consumed by
> chance.

OK

> Now the graphics issue is a different story. That only happens on
> hibernation after doing the snapshot. There all non boot cpus are onlined
> again and after that the devices are 'thawed'. The following reenable of
> interrupts fails because i915 is not in PCI_D0 state.
> 
> Suspend:
> 
>    irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
>    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
>    __pci_write_msi_msg: Not written <- Device not in PCI_D0
>    ....
>    device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume]
>    pci_pm_resume_noirq <-dpm_run_callback
>    pci_pm_resume_noirq <-dpm_run_callback
>    pci_pm_default_resume_early <-pci_pm_resume_noirq
>    pci_pm_default_resume_early <-pci_pm_resume_noirq
>    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a  <-- Set the new affinity
>    device_pm_callback_end: i915 0000:00:02.0, err=0

So this works, because we power up the device during resume even if it
had been suspended (via runtime PM) before the suspend started.

> Hibernate:
> 
>    irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
>    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
>    __pci_write_msi_msg: Not written <- Device not in PCI_D0
>    ....
>    device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw]
>    pci_pm_thaw_noirq <-dpm_run_callback
>    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
>    __pci_write_msi_msg: Not written  <--- Device is not in PCI_D0
>    device_pm_callback_end: i915 0000:00:02.0, err=0

And here we try to leave the device alone which is OK for devices in D0,
but not for suspended ones.

It looks like we need to power up them at the "thaw" time too or at least
I don't see how to address that differently.

> So that code path fails to set the new affinity because at the point where
> the MSI msg should be written the device state is != PCI_D0.
> 
> Now, what's different vs. 4.14:
> 
> The 4.14 code accidentaly had the irq descriptor for this vector still
> populated in the old CPU due to the convoluted way the vector allocation
> worked. I have still to investigate if one of those cases is actually
> leaking the descriptor, which would be a fatal bug.
> 
> But the new code does a proper cleanup and does not repopulate it on the
> offline CPU. So that unearthes the issue. I'm handing that over to the PM
> folks to look at. I got lost in that maze of callbacks.

OK, thanks so much for getting to the bottom of this!

Rafael

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 12:12                                 ` Rafael J. Wysocki
@ 2017-12-14 12:30                                   ` Thomas Gleixner
  2017-12-14 15:30                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-14 12:30 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thu, 14 Dec 2017, Rafael J. Wysocki wrote:
> On Thursday, December 14, 2017 12:54:05 PM CET Thomas Gleixner wrote:
> > Now the graphics issue is a different story. That only happens on
> > hibernation after doing the snapshot. There all non boot cpus are onlined
> > again and after that the devices are 'thawed'. The following reenable of
> > interrupts fails because i915 is not in PCI_D0 state.
> > 
> > Suspend:
> > 
> >    irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
> >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
> >    __pci_write_msi_msg: Not written <- Device not in PCI_D0
> >    ....
> >    device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume]
> >    pci_pm_resume_noirq <-dpm_run_callback
> >    pci_pm_resume_noirq <-dpm_run_callback
> >    pci_pm_default_resume_early <-pci_pm_resume_noirq
> >    pci_pm_default_resume_early <-pci_pm_resume_noirq
> >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a  <-- Set the new affinity
> >    device_pm_callback_end: i915 0000:00:02.0, err=0
> 
> So this works, because we power up the device during resume even if it
> had been suspended (via runtime PM) before the suspend started.
> 
> > Hibernate:
> > 
> >    irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
> >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
> >    __pci_write_msi_msg: Not written <- Device not in PCI_D0
> >    ....
> >    device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw]
> >    pci_pm_thaw_noirq <-dpm_run_callback
> >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
> >    __pci_write_msi_msg: Not written  <--- Device is not in PCI_D0
> >    device_pm_callback_end: i915 0000:00:02.0, err=0
> 
> And here we try to leave the device alone which is OK for devices in D0,
> but not for suspended ones.
> 
> It looks like we need to power up them at the "thaw" time too or at least
> I don't see how to address that differently.

The question is whether the code which brings the device out of D0 should
write the message unconditionally. That would be sufficient I think.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 11:54                               ` Thomas Gleixner
  2017-12-14 12:12                                 ` Rafael J. Wysocki
@ 2017-12-14 13:24                                 ` Thomas Gleixner
  2017-12-14 19:03                                 ` Linus Torvalds
  2 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-14 13:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thu, 14 Dec 2017, Thomas Gleixner wrote:
> Now, what's different vs. 4.14:
> 
> The 4.14 code accidentaly had the irq descriptor for this vector still
> populated in the old CPU due to the convoluted way the vector allocation
> worked. I have still to investigate if one of those cases is actually
> leaking the descriptor, which would be a fatal bug.

It doesn't leak. It repopulates it at the same place out of sheer luck.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 12:30                                   ` Thomas Gleixner
@ 2017-12-14 15:30                                     ` Rafael J. Wysocki
  2017-12-14 15:52                                       ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-14 15:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thursday, December 14, 2017 1:30:37 PM CET Thomas Gleixner wrote:
> On Thu, 14 Dec 2017, Rafael J. Wysocki wrote:
> > On Thursday, December 14, 2017 12:54:05 PM CET Thomas Gleixner wrote:
> > > Now the graphics issue is a different story. That only happens on
> > > hibernation after doing the snapshot. There all non boot cpus are onlined
> > > again and after that the devices are 'thawed'. The following reenable of
> > > interrupts fails because i915 is not in PCI_D0 state.
> > > 
> > > Suspend:
> > > 
> > >    irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
> > >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
> > >    __pci_write_msi_msg: Not written <- Device not in PCI_D0
> > >    ....
> > >    device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume]
> > >    pci_pm_resume_noirq <-dpm_run_callback
> > >    pci_pm_resume_noirq <-dpm_run_callback
> > >    pci_pm_default_resume_early <-pci_pm_resume_noirq
> > >    pci_pm_default_resume_early <-pci_pm_resume_noirq
> > >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a  <-- Set the new affinity
> > >    device_pm_callback_end: i915 0000:00:02.0, err=0
> > 
> > So this works, because we power up the device during resume even if it
> > had been suspended (via runtime PM) before the suspend started.
> > 
> > > Hibernate:
> > > 
> > >    irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
> > >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
> > >    __pci_write_msi_msg: Not written <- Device not in PCI_D0
> > >    ....
> > >    device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw]
> > >    pci_pm_thaw_noirq <-dpm_run_callback
> > >    __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a
> > >    __pci_write_msi_msg: Not written  <--- Device is not in PCI_D0
> > >    device_pm_callback_end: i915 0000:00:02.0, err=0
> > 
> > And here we try to leave the device alone which is OK for devices in D0,
> > but not for suspended ones.
> > 
> > It looks like we need to power up them at the "thaw" time too or at least
> > I don't see how to address that differently.
> 
> The question is whether the code which brings the device out of D0 should
> write the message unconditionally. That would be sufficient I think.

It doesn't have to do that.

The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which
in fact requires the device to be in D0, so the caller should put it into
D0 instead of trying to "update" its power state.

[Note that the PCI layer doesn't put devices into low-power states during the
hibernation's "freeze" transition, but drivers can legitimately do that in
their "freeze" callbacks which was overlooked in that code and that's what
i915 does.]

So IMO what we need is the change below.  I'm going to test it shortly,
but please give it a go too.

---
 drivers/pci/pci-driver.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/pci/pci-driver.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-driver.c
+++ linux-pm/drivers/pci/pci-driver.c
@@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi
 	if (pci_has_legacy_pm_support(pci_dev))
 		return pci_legacy_resume_early(dev);
 
-	pci_update_current_state(pci_dev, PCI_D0);
+	/*
+	 * pci_restore_state() requires the device to be in D0 (because of MSI
+	 * restoration among other things), so force it into D0 in case the
+	 * driver's "freeze" callbacks put it into a low-power state directly.
+	 */
+	pci_set_power_state(pci_dev, PCI_D0);
 	pci_restore_state(pci_dev);
 
 	if (drv && drv->pm && drv->pm->thaw_noirq)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 15:30                                     ` Rafael J. Wysocki
@ 2017-12-14 15:52                                       ` Thomas Gleixner
  2017-12-14 15:54                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-14 15:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thu, 14 Dec 2017, Rafael J. Wysocki wrote:
> The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which
> in fact requires the device to be in D0, so the caller should put it into
> D0 instead of trying to "update" its power state.
> 
> [Note that the PCI layer doesn't put devices into low-power states during the
> hibernation's "freeze" transition, but drivers can legitimately do that in
> their "freeze" callbacks which was overlooked in that code and that's what
> i915 does.]
> 
> So IMO what we need is the change below.  I'm going to test it shortly,
> but please give it a go too.

So now this looks more reasonable:

  irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
  __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a 
  __pci_write_msi_msg: Not written
  ...
  device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw]
  pci_pm_thaw_noirq <-dpm_run_callback
  __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a 
  device_pm_callback_end: i915 0000:00:02.0, err=0
  ...
  resume_irqs: Resume 125
  ...
  irq_handler_entry: irq=125 name=i915

Thanks,

	tglx

> ---
>  drivers/pci/pci-driver.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/pci/pci-driver.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci-driver.c
> +++ linux-pm/drivers/pci/pci-driver.c
> @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi
>  	if (pci_has_legacy_pm_support(pci_dev))
>  		return pci_legacy_resume_early(dev);
>  
> -	pci_update_current_state(pci_dev, PCI_D0);
> +	/*
> +	 * pci_restore_state() requires the device to be in D0 (because of MSI
> +	 * restoration among other things), so force it into D0 in case the
> +	 * driver's "freeze" callbacks put it into a low-power state directly.
> +	 */
> +	pci_set_power_state(pci_dev, PCI_D0);
>  	pci_restore_state(pci_dev);
>  
>  	if (drv && drv->pm && drv->pm->thaw_noirq)
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 15:52                                       ` Thomas Gleixner
@ 2017-12-14 15:54                                         ` Rafael J. Wysocki
  2017-12-14 16:17                                           ` Maarten Lankhorst
  2017-12-15  2:07                                           ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki
  0 siblings, 2 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-14 15:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thursday, December 14, 2017 4:52:22 PM CET Thomas Gleixner wrote:
> On Thu, 14 Dec 2017, Rafael J. Wysocki wrote:
> > The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which
> > in fact requires the device to be in D0, so the caller should put it into
> > D0 instead of trying to "update" its power state.
> > 
> > [Note that the PCI layer doesn't put devices into low-power states during the
> > hibernation's "freeze" transition, but drivers can legitimately do that in
> > their "freeze" callbacks which was overlooked in that code and that's what
> > i915 does.]
> > 
> > So IMO what we need is the change below.  I'm going to test it shortly,
> > but please give it a go too.
> 
> So now this looks more reasonable:
> 
>   irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
>   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a 
>   __pci_write_msi_msg: Not written
>   ...
>   device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw]
>   pci_pm_thaw_noirq <-dpm_run_callback
>   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a 
>   device_pm_callback_end: i915 0000:00:02.0, err=0
>   ...
>   resume_irqs: Resume 125
>   ...
>   irq_handler_entry: irq=125 name=i915

Cool.

Let me respin it with a changelog etc then.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 15:54                                         ` Rafael J. Wysocki
@ 2017-12-14 16:17                                           ` Maarten Lankhorst
  2017-12-15  2:07                                           ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki
  1 sibling, 0 replies; 26+ messages in thread
From: Maarten Lankhorst @ 2017-12-14 16:17 UTC (permalink / raw)
  To: Rafael J. Wysocki, Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Michal Hocko, Andy Lutomirski,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci,
	linux-pm

Op 14-12-17 om 16:54 schreef Rafael J. Wysocki:
> On Thursday, December 14, 2017 4:52:22 PM CET Thomas Gleixner wrote:
>> On Thu, 14 Dec 2017, Rafael J. Wysocki wrote:
>>> The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which
>>> in fact requires the device to be in D0, so the caller should put it into
>>> D0 instead of trying to "update" its power state.
>>>
>>> [Note that the PCI layer doesn't put devices into low-power states during the
>>> hibernation's "freeze" transition, but drivers can legitimately do that in
>>> their "freeze" callbacks which was overlooked in that code and that's what
>>> i915 does.]
>>>
>>> So IMO what we need is the change below.  I'm going to test it shortly,
>>> but please give it a go too.
>> So now this looks more reasonable:
>>
>>   irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10
>>   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a 
>>   __pci_write_msi_msg: Not written
>>   ...
>>   device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw]
>>   pci_pm_thaw_noirq <-dpm_run_callback
>>   __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a 
>>   device_pm_callback_end: i915 0000:00:02.0, err=0
>>   ...
>>   resume_irqs: Resume 125
>>   ...
>>   irq_handler_entry: irq=125 name=i915
> Cool.
>
> Let me respin it with a changelog etc then.
>
> Thanks,
> Rafael
>
>
The machine I was using for reproducing the bug appears to be fixed with this patch, so I now sent
it to intel's trybot for results.

https://patchwork.freedesktop.org/series/35367/

Thanks for looking at the bug!

~Maarten

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 11:54                               ` Thomas Gleixner
  2017-12-14 12:12                                 ` Rafael J. Wysocki
  2017-12-14 13:24                                 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Thomas Gleixner
@ 2017-12-14 19:03                                 ` Linus Torvalds
  2017-12-14 22:36                                   ` Thomas Gleixner
  2 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2017-12-14 19:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thu, Dec 14, 2017 at 3:54 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> So the old scheme silently consumed the spurious vector. I added debug code
> to that effect to 4.14 and on that machine IRQ7 is triggered at the same
> point post resume and the core code drops it silently because the interrupt
> is marked masked and no action assigned.
>
> So the only difference to today is that the new code complains, while the
> old one does an extra mask of the already masked IOAPIC pin and silently
> returns.

Great debugging, and it looks like Rafael has a patch that already got
positive testing.

I just wanted to pipe up about that "irq7", because judging from your
email it seems like you think it's a real irq:

>        Now there is a race
> whether the kernel resume path manages to mask the PIC again early enough
> before something triggers IRQ7 or not.

..and that's not how the PIC works.

In fact, "legacy irq 7" is the _normal_ and very traditional spurious
interrupt, and it's documented. If the PIC gets an interrupt from
_any_ source, but the interrupt goes away before the PIC gets an
acknowledge from the CPU (and by "acknowledge", I'm not talking about
the explicit software IRQ ACK, I'm talking about the hardware
protocol, between the PIC and the CPU), the PIC will then report irq 7
as the interrupt - regardless of what the original was.

The reason is almost always something like

 - CPU interrupts are disabled or masked

 - driver does a write to the external hardware that causes an
interrupt to be raised

 - CPU doesn't react to the irq due to the disabled/masked nature

 - but the driver then does something that masks the interrupt again

 - interrupts are enabled/unmasked on the CPU

 - CPU now acks the interrupt, but the PIC no longer sees any
interrupt source, so the PIC (that has to reply with *something*)
replies with that documented spurious irq7.

To confuse things further, irq7 is not _exclusively_ the spurious
interrupt, You can definitely put real hardware and connect it to pin7
of the PIC, and get real irq7 reports.

And to confuse things even *more*, this "irq7" thing is per-PIC, and
the PC model obviously has the whole "nested PIC" thing where the
second PIC is connected to irq2 of the first PIC. So there are *two*
different "spurious interrupt" reports, one for each PIC.

Anyway, to avoid this issue, drivers should strive to

 (a) actually take the interrupt when doing things that can cause
them, and have the interrupt handler do whatever it is that causes the
interrupt to go away (ie: "normal operation")

 (b) if you play games with clearing the source of the interrupt
_without_ taking the interrupt, you should strive to basically mask
the interrupt first.

So to do (b) you can do something like

      mask_device_interrupt(dev);
      read_from_device_to_synchronize(dev);

instead of (or perhaps _before_) disabling interrupts at a CPU level.
Suspend/resume obviously does tend to play games with these kinds of
things where you are no longer in "normal operation" and you do setup
without having interrupts actually enabled.

Or you can just decide that spurious interrupts are ok, and ignore the
issue. But they *can* be very confusing, and obviously in this case
that confusion then seems to have caused actual problems.

            Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 19:03                                 ` Linus Torvalds
@ 2017-12-14 22:36                                   ` Thomas Gleixner
  2017-12-14 22:47                                     ` Linus Torvalds
  2017-12-15  0:34                                     ` Rafael J. Wysocki
  0 siblings, 2 replies; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-14 22:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thu, 14 Dec 2017, Linus Torvalds wrote:
> On Thu, Dec 14, 2017 at 3:54 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> I just wanted to pipe up about that "irq7", because judging from your
> email it seems like you think it's a real irq:
> 
> >        Now there is a race
> > whether the kernel resume path manages to mask the PIC again early enough
> > before something triggers IRQ7 or not.
> 
> ..and that's not how the PIC works.
> 
> In fact, "legacy irq 7" is the _normal_ and very traditional spurious
> interrupt, and it's documented. If the PIC gets an interrupt from
> _any_ source, but the interrupt goes away before the PIC gets an
> acknowledge from the CPU (and by "acknowledge", I'm not talking about
> the explicit software IRQ ACK, I'm talking about the hardware
> protocol, between the PIC and the CPU), the PIC will then report irq 7
> as the interrupt - regardless of what the original was.
> 
> The reason is almost always something like
> 
>  - CPU interrupts are disabled or masked
> 
>  - driver does a write to the external hardware that causes an
> interrupt to be raised

Which should be a non issue because _ALL_ PIC irq lines are masked at the
PIC itself. All interrupts are routed through IOAPIC. So unless the IOAPIC
sports similar behaviour the PIC should not ever observe that scenario.

But, because the silly firmware comes out of suspend with all PIC lines
unmasked for whatever reason, the PIC can observe that IRQ being raised and
the CPU not handling it. So yes, I forgot about 7 being magic, but I still
think it's the firmware which causes it by unmasking the PIC irqs.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 22:36                                   ` Thomas Gleixner
@ 2017-12-14 22:47                                     ` Linus Torvalds
  2017-12-15  9:05                                       ` Thomas Gleixner
  2017-12-15  0:34                                     ` Rafael J. Wysocki
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2017-12-14 22:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thu, Dec 14, 2017 at 2:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> But, because the silly firmware comes out of suspend with all PIC lines
> unmasked for whatever reason, the PIC can observe that IRQ being raised and
> the CPU not handling it. So yes, I forgot about 7 being magic, but I still
> think it's the firmware which causes it by unmasking the PIC irqs.

Yes, that sounds quite likely.

            Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 22:36                                   ` Thomas Gleixner
  2017-12-14 22:47                                     ` Linus Torvalds
@ 2017-12-15  0:34                                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-15  0:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, Linux PCI, Linux PM

On Thu, Dec 14, 2017 at 11:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 14 Dec 2017, Linus Torvalds wrote:
>> On Thu, Dec 14, 2017 at 3:54 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> I just wanted to pipe up about that "irq7", because judging from your
>> email it seems like you think it's a real irq:
>>
>> >        Now there is a race
>> > whether the kernel resume path manages to mask the PIC again early enough
>> > before something triggers IRQ7 or not.
>>
>> ..and that's not how the PIC works.
>>
>> In fact, "legacy irq 7" is the _normal_ and very traditional spurious
>> interrupt, and it's documented. If the PIC gets an interrupt from
>> _any_ source, but the interrupt goes away before the PIC gets an
>> acknowledge from the CPU (and by "acknowledge", I'm not talking about
>> the explicit software IRQ ACK, I'm talking about the hardware
>> protocol, between the PIC and the CPU), the PIC will then report irq 7
>> as the interrupt - regardless of what the original was.
>>
>> The reason is almost always something like
>>
>>  - CPU interrupts are disabled or masked
>>
>>  - driver does a write to the external hardware that causes an
>> interrupt to be raised
>
> Which should be a non issue because _ALL_ PIC irq lines are masked at the
> PIC itself. All interrupts are routed through IOAPIC. So unless the IOAPIC
> sports similar behaviour the PIC should not ever observe that scenario.
>
> But, because the silly firmware comes out of suspend with all PIC lines
> unmasked for whatever reason, the PIC can observe that IRQ being raised and
> the CPU not handling it. So yes, I forgot about 7 being magic, but I still
> think it's the firmware which causes it by unmasking the PIC irqs.

That's my understanding too.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq()
  2017-12-14 15:54                                         ` Rafael J. Wysocki
  2017-12-14 16:17                                           ` Maarten Lankhorst
@ 2017-12-15  2:07                                           ` Rafael J. Wysocki
  2017-12-15 14:28                                             ` Rafael J. Wysocki
  2017-12-15 18:30                                             ` Bjorn Helgaas
  1 sibling, 2 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-15  2:07 UTC (permalink / raw)
  To: Thomas Gleixner, linux-pci
  Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pm, Chen Yu

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

It is incorrect to call pci_restore_state() for devices in low-power
states (D1-D3), as that involves the restoration of MSI setup which
requires MMIO to be operational and that is only the case in D0.

However, pci_pm_thaw_noirq() may do that if the driver's "freeze"
callbacks put the device into a low-power state, so fix it by making
it force devices into D0 via pci_set_power_state() instead of trying
to "update" their power state which is pointless.

Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation)
Cc: 4.13+ <stable@vger.kernel.org> # 4.13+
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Maarten Lankhorst <dev@mblankhorst.nl>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Maarten Lankhorst <dev@mblankhorst.nl>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

The bug is not as old as I thought, actually.

Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq()
forever, but it started to be problematic in 4.13, when we started
to call pci_restore_state() in addition to it to fix another issue.

---
 drivers/pci/pci-driver.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/pci/pci-driver.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-driver.c
+++ linux-pm/drivers/pci/pci-driver.c
@@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi
 	if (pci_has_legacy_pm_support(pci_dev))
 		return pci_legacy_resume_early(dev);
 
-	pci_update_current_state(pci_dev, PCI_D0);
+	/*
+	 * pci_restore_state() requires the device to be in D0 (because of MSI
+	 * restoration among other things), so force it into D0 in case the
+	 * driver's "freeze" callbacks put it into a low-power state directly.
+	 */
+	pci_set_power_state(pci_dev, PCI_D0);
 	pci_restore_state(pci_dev);
 
 	if (drv && drv->pm && drv->pm->thaw_noirq)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Linux 4.15-rc2: Regression in resume from ACPI S3
  2017-12-14 22:47                                     ` Linus Torvalds
@ 2017-12-15  9:05                                       ` Thomas Gleixner
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2017-12-15  9:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki,
	Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pci, linux-pm

On Thu, 14 Dec 2017, Linus Torvalds wrote:

> On Thu, Dec 14, 2017 at 2:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > But, because the silly firmware comes out of suspend with all PIC lines
> > unmasked for whatever reason, the PIC can observe that IRQ being raised and
> > the CPU not handling it. So yes, I forgot about 7 being magic, but I still
> > think it's the firmware which causes it by unmasking the PIC irqs.
> 
> Yes, that sounds quite likely.

And just for the record I was able to figure out which interrupt comes in
and goes away again. It's the only level triggered interrupt, which is the
ACPI interrupt.....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq()
  2017-12-15  2:07                                           ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki
@ 2017-12-15 14:28                                             ` Rafael J. Wysocki
  2017-12-15 18:30                                             ` Bjorn Helgaas
  1 sibling, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-15 14:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Thomas Gleixner, Linux PCI, Linus Torvalds, Maarten Lankhorst,
	Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, Linux PM, Chen Yu

On Fri, Dec 15, 2017 at 3:07 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> It is incorrect to call pci_restore_state() for devices in low-power
> states (D1-D3), as that involves the restoration of MSI setup which
> requires MMIO to be operational and that is only the case in D0.
>
> However, pci_pm_thaw_noirq() may do that if the driver's "freeze"
> callbacks put the device into a low-power state, so fix it by making
> it force devices into D0 via pci_set_power_state() instead of trying
> to "update" their power state which is pointless.
>
> Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation)
> Cc: 4.13+ <stable@vger.kernel.org> # 4.13+
> Reported-by: Thomas Gleixner <tglx@linutronix.de>
> Reported-by: Maarten Lankhorst <dev@mblankhorst.nl>
> Tested-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: Maarten Lankhorst <dev@mblankhorst.nl>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> The bug is not as old as I thought, actually.
>
> Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq()
> forever, but it started to be problematic in 4.13, when we started
> to call pci_restore_state() in addition to it to fix another issue.

Bjorn, any concerns about this one?

> ---
>  drivers/pci/pci-driver.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/pci/pci-driver.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci-driver.c
> +++ linux-pm/drivers/pci/pci-driver.c
> @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi
>         if (pci_has_legacy_pm_support(pci_dev))
>                 return pci_legacy_resume_early(dev);
>
> -       pci_update_current_state(pci_dev, PCI_D0);
> +       /*
> +        * pci_restore_state() requires the device to be in D0 (because of MSI
> +        * restoration among other things), so force it into D0 in case the
> +        * driver's "freeze" callbacks put it into a low-power state directly.
> +        */
> +       pci_set_power_state(pci_dev, PCI_D0);
>         pci_restore_state(pci_dev);
>
>         if (drv && drv->pm && drv->pm->thaw_noirq)
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq()
  2017-12-15  2:07                                           ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki
  2017-12-15 14:28                                             ` Rafael J. Wysocki
@ 2017-12-15 18:30                                             ` Bjorn Helgaas
  2017-12-15 23:44                                               ` Rafael J. Wysocki
  1 sibling, 1 reply; 26+ messages in thread
From: Bjorn Helgaas @ 2017-12-15 18:30 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, linux-pci, Linus Torvalds, Maarten Lankhorst,
	Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List,
	the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas,
	Rafael J. Wysocki, linux-pm, Chen Yu

On Fri, Dec 15, 2017 at 03:07:18AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> It is incorrect to call pci_restore_state() for devices in low-power
> states (D1-D3), as that involves the restoration of MSI setup which
> requires MMIO to be operational and that is only the case in D0.
> 
> However, pci_pm_thaw_noirq() may do that if the driver's "freeze"
> callbacks put the device into a low-power state, so fix it by making
> it force devices into D0 via pci_set_power_state() instead of trying
> to "update" their power state which is pointless.
> 
> Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation)
> Cc: 4.13+ <stable@vger.kernel.org> # 4.13+
> Reported-by: Thomas Gleixner <tglx@linutronix.de>
> Reported-by: Maarten Lankhorst <dev@mblankhorst.nl>
> Tested-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: Maarten Lankhorst <dev@mblankhorst.nl>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Let me know if you want me to take this.  I don't have anything
currently queued up that touches pci-driver.c, so I'm happy if you
take it yourself.

> ---
> 
> The bug is not as old as I thought, actually.
> 
> Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq()
> forever, but it started to be problematic in 4.13, when we started
> to call pci_restore_state() in addition to it to fix another issue.
> 
> ---
>  drivers/pci/pci-driver.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/pci/pci-driver.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci-driver.c
> +++ linux-pm/drivers/pci/pci-driver.c
> @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi
>  	if (pci_has_legacy_pm_support(pci_dev))
>  		return pci_legacy_resume_early(dev);
>  
> -	pci_update_current_state(pci_dev, PCI_D0);
> +	/*
> +	 * pci_restore_state() requires the device to be in D0 (because of MSI
> +	 * restoration among other things), so force it into D0 in case the
> +	 * driver's "freeze" callbacks put it into a low-power state directly.
> +	 */
> +	pci_set_power_state(pci_dev, PCI_D0);
>  	pci_restore_state(pci_dev);
>  
>  	if (drv && drv->pm && drv->pm->thaw_noirq)
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq()
  2017-12-15 18:30                                             ` Bjorn Helgaas
@ 2017-12-15 23:44                                               ` Rafael J. Wysocki
  0 siblings, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2017-12-15 23:44 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Thomas Gleixner, Linux PCI, Linus Torvalds,
	Maarten Lankhorst, Michal Hocko, Andy Lutomirski,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, Linux PM,
	Chen Yu

On Fri, Dec 15, 2017 at 7:30 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Fri, Dec 15, 2017 at 03:07:18AM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> It is incorrect to call pci_restore_state() for devices in low-power
>> states (D1-D3), as that involves the restoration of MSI setup which
>> requires MMIO to be operational and that is only the case in D0.
>>
>> However, pci_pm_thaw_noirq() may do that if the driver's "freeze"
>> callbacks put the device into a low-power state, so fix it by making
>> it force devices into D0 via pci_set_power_state() instead of trying
>> to "update" their power state which is pointless.
>>
>> Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation)
>> Cc: 4.13+ <stable@vger.kernel.org> # 4.13+
>> Reported-by: Thomas Gleixner <tglx@linutronix.de>
>> Reported-by: Maarten Lankhorst <dev@mblankhorst.nl>
>> Tested-by: Thomas Gleixner <tglx@linutronix.de>
>> Tested-by: Maarten Lankhorst <dev@mblankhorst.nl>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>
> Let me know if you want me to take this.  I don't have anything
> currently queued up that touches pci-driver.c, so I'm happy if you
> take it yourself.

I will take it.

Depending of what Yu finds, we may need an additional fix to make the
Purley system work.

>> ---
>>
>> The bug is not as old as I thought, actually.
>>
>> Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq()
>> forever, but it started to be problematic in 4.13, when we started
>> to call pci_restore_state() in addition to it to fix another issue.
>>
>> ---
>>  drivers/pci/pci-driver.c |    7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> Index: linux-pm/drivers/pci/pci-driver.c
>> ===================================================================
>> --- linux-pm.orig/drivers/pci/pci-driver.c
>> +++ linux-pm/drivers/pci/pci-driver.c
>> @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi
>>       if (pci_has_legacy_pm_support(pci_dev))
>>               return pci_legacy_resume_early(dev);
>>
>> -     pci_update_current_state(pci_dev, PCI_D0);
>> +     /*
>> +      * pci_restore_state() requires the device to be in D0 (because of MSI
>> +      * restoration among other things), so force it into D0 in case the
>> +      * driver's "freeze" callbacks put it into a low-power state directly.
>> +      */
>> +     pci_set_power_state(pci_dev, PCI_D0);
>>       pci_restore_state(pci_dev);
>>
>>       if (drv && drv->pm && drv->pm->thaw_noirq)
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-12-15 23:44 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <168050887.sZlTFXWCmO@aspire.rjw.lan>
     [not found] ` <CA+55aFwsMuHUBQz5kDNwRf17JnasXMWjvmLq5qXGH-694yeq1w@mail.gmail.com>
     [not found]   ` <20171206121452.GA6320@dhcp22.suse.cz>
     [not found]     ` <db81e89d-eee6-868c-dc10-c028f4c4be84@mblankhorst.nl>
     [not found]       ` <alpine.DEB.2.20.1712061344560.1724@nanos>
     [not found]         ` <0f1d3d63-fa10-5cef-8014-81753dc60243@mblankhorst.nl>
     [not found]           ` <alpine.DEB.2.20.1712061514420.1724@nanos>
     [not found]             ` <57c8679e-1b88-c9ad-2299-2bea7560b28f@mblankhorst.nl>
     [not found]               ` <alpine.DEB.2.20.1712081129450.1840@nanos>
     [not found]                 ` <alpine.DEB.2.20.1712131507160.1885@nanos>
2017-12-13 16:23                   ` Linux 4.15-rc2: Regression in resume from ACPI S3 Bjorn Helgaas
2017-12-13 16:41                     ` Thomas Gleixner
2017-12-13 17:45                       ` Linus Torvalds
2017-12-13 18:19                         ` Thomas Gleixner
2017-12-13 20:52                           ` Thomas Gleixner
2017-12-13 21:06                             ` Thomas Gleixner
2017-12-13 22:48                               ` Rafael J. Wysocki
2017-12-14 11:54                               ` Thomas Gleixner
2017-12-14 12:12                                 ` Rafael J. Wysocki
2017-12-14 12:30                                   ` Thomas Gleixner
2017-12-14 15:30                                     ` Rafael J. Wysocki
2017-12-14 15:52                                       ` Thomas Gleixner
2017-12-14 15:54                                         ` Rafael J. Wysocki
2017-12-14 16:17                                           ` Maarten Lankhorst
2017-12-15  2:07                                           ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki
2017-12-15 14:28                                             ` Rafael J. Wysocki
2017-12-15 18:30                                             ` Bjorn Helgaas
2017-12-15 23:44                                               ` Rafael J. Wysocki
2017-12-14 13:24                                 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Thomas Gleixner
2017-12-14 19:03                                 ` Linus Torvalds
2017-12-14 22:36                                   ` Thomas Gleixner
2017-12-14 22:47                                     ` Linus Torvalds
2017-12-15  9:05                                       ` Thomas Gleixner
2017-12-15  0:34                                     ` Rafael J. Wysocki
2017-12-13 22:39                           ` Rafael J. Wysocki
2017-12-13 23:26                             ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox