* [5.2-rc1 regression]: nvme vs. hibernation
From: Jiri Kosina @ 2019-05-24 15:22 UTC
To: Jens Axboe, Christoph Hellwig
Cc: Hannes Reinecke, Keith Busch, Sagi Grimberg, linux-kernel, linux-nvme

Hi,

Something is broken in Linus' tree (4dde821e429) with respect to
hibernation on my thinkpad x270, and it seems to be nvme related.

I reliably see the warning below during hibernation, and then sometimes
resume sort of works but the machine misbehaves here and there (seems like
lost IRQs); sometimes it never comes back from the hibernated state.

I will not have too much time to look into this over the weekend, so I am
sending this out as-is in case anyone has an immediate idea. Otherwise I'll
bisect it on Monday (I don't even know at the moment what exactly was the
last version that worked reliably; I'll have to figure that out as well
later).

WARNING: CPU: 0 PID: 363 at kernel/irq/chip.c:210 irq_startup+0xff/0x110
Modules linked in: bnep ccm af_packet fuse 8021q garp stp mrp llc tun ip6table_mangle ip6table_filter ip6_tables iptable_mangle xt_DSCP xt_tcpudp xt_conntrac
snd_hda_core aes_x86_64 glue_helper crypto_simd snd_pcm cryptd e1000e ptp pcspkr joydev pps_core snd_timer i2c_i801 cfg80211 mei_me mei intel_pch_thermal th
CPU: 0 PID: 363 Comm: kworker/u8:4 Not tainted 5.1.0-08122-ga2d635decbfa #9
Hardware name: LENOVO 20K5S22R00/20K5S22R00, BIOS R0IET38W (1.16 ) 05/31/2017
Workqueue: events_unbound async_run_entry_fn
RIP: 0010:irq_startup+0xff/0x110
Code: f6 4c 89 e7 e8 92 34 00 00 85 c0 75 21 4c 89 e7 31 d2 4c 89 ee e8 e1 cc ff ff 48 89 df e8 89 fe ff ff 41 89 c4 e9 37 ff ff ff <0f> 0b eb b0 0f 0b eb ac 66 0f 1f 84 00 00 44 00 00
RSP: 0018:ffffa05100f13bd0 EFLAGS: 00010002
RAX: 0000000000000200 RBX: ffff9168e360ec00 RCX: 0000000000000200
RDX: 0000000000000200 RSI: ffffffff9f383600 RDI: ffff9168e360ec18
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000007
R10: 000000007a6a7b55 R11: 0000000000000000 R12: 0000000000000001
R13: ffff9168e360ec18 R14: ffff9168e6c97000 R15: ffff9168df76c000
FS:  0000000000000000(0000) GS:ffff9168e7280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc942c4bf60 CR3: 00000001f1210002 CR4: 00000000003606e0
Call Trace:
 ? __irq_get_desc_lock+0x4e/0x90
 enable_irq+0x39/0x70
 nvme_poll_irqdisable+0x3a3/0x470 [nvme]
 __nvme_disable_io_queues.isra.42+0x16a/0x1d0 [nvme]
 nvme_dev_disable+0x17e/0x1e0 [nvme]
 ? pci_pm_poweroff+0xf0/0xf0
 nvme_suspend+0x13/0x20 [nvme]
 pci_pm_freeze+0x52/0xd0
 dpm_run_callback+0x6b/0x2e0
 __device_suspend+0x147/0x630
 ? dpm_show_time+0xe0/0xe0
 async_suspend+0x1a/0x90
 async_run_entry_fn+0x39/0x160
 process_one_work+0x1f0/0x5b0
 ? process_one_work+0x16a/0x5b0
 worker_thread+0x4c/0x3f0
 kthread+0x103/0x140
 ? process_one_work+0x5b0/0x5b0
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x3a/0x50
irq event stamp: 381230
hardirqs last enabled at (381229): [<ffffffff9e90910d>] _raw_spin_unlock_irqrestore+0x4d/0x70
hardirqs last disabled at (381230): [<ffffffff9e908fa4>] _raw_spin_lock_irqsave+0x24/0x60
softirqs last enabled at (381104): [<ffffffffc0eeb734>] __iwl_mvm_mac_stop+0xa4/0x1a0 [iwlmvm]
softirqs last disabled at (381102): [<ffffffffc0eeeda6>] iwl_mvm_async_handlers_purge+0x26/0xa0 [iwlmvm]

-- 
Jiri Kosina
SUSE Labs
* Re: [5.2-rc1 regression]: nvme vs. hibernation
From: Keith Busch @ 2019-05-24 15:44 UTC
To: Jiri Kosina
Cc: Jens Axboe, Christoph Hellwig, Hannes Reinecke, Keith Busch, Sagi Grimberg, linux-kernel, linux-nvme

On Fri, May 24, 2019 at 05:22:30PM +0200, Jiri Kosina wrote:
> Hi,
>
> Something is broken in Linus' tree (4dde821e429) with respect to
> hibernation on my thinkpad x270, and it seems to be nvme related.
>
> I reliably see the warning below during hibernation, and then sometimes
> resume sort of works but the machine misbehaves here and there (seems like
> lost IRQs); sometimes it never comes back from the hibernated state.
>
> I will not have too much time to look into this over the weekend, so I am
> sending this out as-is in case anyone has an immediate idea. Otherwise I'll
> bisect it on Monday (I don't even know at the moment what exactly was the
> last version that worked reliably; I'll have to figure that out as well
> later).

I believe the warning call trace was introduced when we converted nvme to
lock-less completions. On device shutdown, we check the queues for any
pending completions, and we temporarily disable the interrupts to make
sure that a queue's interrupt handler can't run concurrently.

On hibernation, most CPUs are offline, and the interrupt re-enabling is
hitting this warning that says the IRQ is not associated with any online
CPUs.

I'm sure we can find a way to fix this warning, but I'm not sure that
explains the rest of the symptoms you're describing, though.
> [ full WARNING trace from the original report quoted verbatim; trimmed ]
* Re: [5.2-rc1 regression]: nvme vs. hibernation
From: Jiri Kosina @ 2019-05-24 22:27 UTC
To: Keith Busch
Cc: Jens Axboe, Christoph Hellwig, Hannes Reinecke, Keith Busch, Sagi Grimberg, linux-kernel, linux-nvme

On Fri, 24 May 2019, Keith Busch wrote:

> > [ original report trimmed ]
>
> I believe the warning call trace was introduced when we converted nvme to
> lock-less completions. On device shutdown, we check the queues for any
> pending completions, and we temporarily disable the interrupts to make
> sure that a queue's interrupt handler can't run concurrently.

Yeah, the completion changes were the primary reason why I brought this up
with all of you guys in CC.

> On hibernation, most CPUs are offline, and the interrupt re-enabling
> is hitting this warning that says the IRQ is not associated with any
> online CPUs.
>
> I'm sure we can find a way to fix this warning, but I'm not sure that
> explains the rest of the symptoms you're describing, though.

It seems to be more or less reliable enough for a bisect. I'll try that on
Monday and will let you know.

Thanks,

-- 
Jiri Kosina
SUSE Labs
* Re: [5.2-rc1 regression]: nvme vs. hibernation
From: Dongli Zhang @ 2019-05-24 23:58 UTC
To: Jiri Kosina
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch, Hannes Reinecke, Christoph Hellwig

Hi Jiri,

Looks like this has been discussed in the past:

http://lists.infradead.org/pipermail/linux-nvme/2019-April/023234.html

I created a fix for one case, but it is not good enough:

http://lists.infradead.org/pipermail/linux-nvme/2019-April/023277.html

Perhaps people would have a better solution.

Dongli Zhang

On 05/25/2019 06:27 AM, Jiri Kosina wrote:
> On Fri, 24 May 2019, Keith Busch wrote:
>
>>> [ original report trimmed ]
>>
>> I believe the warning call trace was introduced when we converted nvme to
>> lock-less completions. On device shutdown, we check the queues for any
>> pending completions, and we temporarily disable the interrupts to make
>> sure that a queue's interrupt handler can't run concurrently.
>
> Yeah, the completion changes were the primary reason why I brought this up
> with all of you guys in CC.
>
>> On hibernation, most CPUs are offline, and the interrupt re-enabling
>> is hitting this warning that says the IRQ is not associated with any
>> online CPUs.
>>
>> I'm sure we can find a way to fix this warning, but I'm not sure that
>> explains the rest of the symptoms you're describing, though.
>
> It seems to be more or less reliable enough for a bisect. I'll try that on
> Monday and will let you know.
* Re: [5.2-rc1 regression]: nvme vs. hibernation
From: Jiri Kosina @ 2019-05-27 9:29 UTC
To: Dongli Zhang
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch, Hannes Reinecke, Christoph Hellwig

On Sat, 25 May 2019, Dongli Zhang wrote:

> Looks like this has been discussed in the past:
>
> http://lists.infradead.org/pipermail/linux-nvme/2019-April/023234.html
>
> I created a fix for one case, but it is not good enough:
>
> http://lists.infradead.org/pipermail/linux-nvme/2019-April/023277.html

That removes the warning, but I still seem to have an ~1:1 chance of a
reboot (triple fault?) immediately after the hibernation image is read
from disk. That seems to have been going on all the way down to 4.19,
which seems to be rock stable. It's a bit hard to bisect, as I am not
really 100% sure whether this is one issue or two intermixed ones, and it
doesn't reproduce completely reliably.

-- 
Jiri Kosina
SUSE Labs
* Re: [5.2-rc1 regression]: nvme vs. hibernation
From: Jiri Kosina @ 2019-05-27 11:15 UTC
To: Dongli Zhang
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch, Hannes Reinecke, Christoph Hellwig

On Mon, 27 May 2019, Jiri Kosina wrote:

> That removes the warning, but I still seem to have an ~1:1 chance of a
> reboot (triple fault?) immediately after the hibernation image is read
> from disk.

So far this seems to be an independent issue, related to kASLR; I'll look
into that separately. Still, we should either remove the warning or fix
the underlying issue.

Thanks,

-- 
Jiri Kosina
SUSE Labs
* "nosmt" breaks resuming from hibernation (was Re: [5.2-rc1 regression]: nvme vs. hibernation) 2019-05-27 9:29 ` Jiri Kosina 2019-05-27 11:15 ` Jiri Kosina @ 2019-05-28 15:21 ` Jiri Kosina 2019-05-28 19:22 ` Jiri Kosina 1 sibling, 1 reply; 10+ messages in thread From: Jiri Kosina @ 2019-05-28 15:21 UTC (permalink / raw) To: Dongli Zhang Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch, Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Ingo Molnar, x86, Rafael J. Wysocki, linux-pm, Josh Poimboeuf On Mon, 27 May 2019, Jiri Kosina wrote: > > Looks this has been discussed in the past. > > > > http://lists.infradead.org/pipermail/linux-nvme/2019-April/023234.html > > > > I created a fix for a case but not good enough. > > > > http://lists.infradead.org/pipermail/linux-nvme/2019-April/023277.html > > That removes the warning, but I still seem to have ~1:1 chance of reboot > (triple fault?) immediately after hibernation image is read from disk. [ some x86/PM folks added ] I isolated this to 'nosmt' being present in the "outer" (resuming) kernel, and am still not sure whether this is x86 issue or nvme/PCI/blk-mq issue. For the newcomers to this thread: on my thinkpad x270, 'nosmt' reliably breaks resume from hibernation; after the image is read out from disk and attempt is made to jump to the old kernel, machine reboots. I verified that it succesfully makes it to the point where restore_image() is called from swsusp_arch_resume() (and verified that only BSP is alive at that time), but the old kernel never comes back and triplefault-like reboot happens. It's sufficient to remove "nosmt" from the *resuming* kernel, and that makes the issue go away (and we resume to the old kernel that has SMT correctly disabled). So it has something to do with enabling & disabling the siblings before we do the CR3 dance and jump to the old kernel. 
I haven't yet been able to isolate this to being (or not being) relevant to the pending nvme CQS warning above. Any ideas how to debug this welcome. I haven't been able to reproduce it in a VM, so it's either something specific to that machine in general, or to nvme specifically. Dongli Zhang, could you please try hibernation with "nosmt" on the system where you originally saw the initial pending CQS warning? Are you by any chance seeing the issue as well? Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: "nosmt" breaks resuming from hibernation (was Re: [5.2-rc1 regression]: nvme vs. hibernation) 2019-05-28 15:21 ` "nosmt" breaks resuming from hibernation (was Re: [5.2-rc1 regression]: nvme vs. hibernation) Jiri Kosina @ 2019-05-28 19:22 ` Jiri Kosina 2019-05-29 8:56 ` Peter Zijlstra 0 siblings, 1 reply; 10+ messages in thread From: Jiri Kosina @ 2019-05-28 19:22 UTC (permalink / raw) To: Dongli Zhang Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch, Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Ingo Molnar, x86, Rafael J. Wysocki, linux-pm, Josh Poimboeuf On Tue, 28 May 2019, Jiri Kosina wrote: > [ some x86/PM folks added ] > > I isolated this to 'nosmt' being present in the "outer" (resuming) kernel, > and am still not sure whether this is x86 issue or nvme/PCI/blk-mq issue. > > For the newcomers to this thread: on my thinkpad x270, 'nosmt' reliably > breaks resume from hibernation; after the image is read out from disk and > attempt is made to jump to the old kernel, machine reboots. Thomas figured it out (and this should be really more widespread than just my machine :) ). nosmt forces HT siblings to mwait, but that explodes after %cr3 change during resume, as the mwait target address is all of a sudden not valid anymore for neither of the hyperthreads. That also explains why I apparently didn't see it that regularly with kaslr disabled. Nothing to do with nvme, so let's not continue coming up with proper fix in this thread. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: "nosmt" breaks resuming from hibernation (was Re: [5.2-rc1 regression]: nvme vs. hibernation) 2019-05-28 19:22 ` Jiri Kosina @ 2019-05-29 8:56 ` Peter Zijlstra 2019-05-29 9:20 ` Jiri Kosina 0 siblings, 1 reply; 10+ messages in thread From: Peter Zijlstra @ 2019-05-29 8:56 UTC (permalink / raw) To: Jiri Kosina Cc: Dongli Zhang, Keith Busch, Jens Axboe, Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch, Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Ingo Molnar, x86, Rafael J. Wysocki, linux-pm, Josh Poimboeuf On Tue, May 28, 2019 at 09:22:14PM +0200, Jiri Kosina wrote: > On Tue, 28 May 2019, Jiri Kosina wrote: > > > [ some x86/PM folks added ] > > > > I isolated this to 'nosmt' being present in the "outer" (resuming) kernel, > > and am still not sure whether this is x86 issue or nvme/PCI/blk-mq issue. > > > > For the newcomers to this thread: on my thinkpad x270, 'nosmt' reliably > > breaks resume from hibernation; after the image is read out from disk and > > attempt is made to jump to the old kernel, machine reboots. > > Thomas figured it out (and this should be really more widespread than just > my machine :) ). > > nosmt forces HT siblings to mwait, but that explodes after %cr3 change > during resume, as the mwait target address is all of a sudden not valid > anymore for neither of the hyperthreads. ARGH!!! But also, you wrote: > > I verified that it succesfully makes it to the point where restore_image() > > is called from swsusp_arch_resume() (and verified that only BSP is alive > > at that time), but the old kernel never comes back and triplefault-like > > reboot happens. which means that even without nosmt all 'other' CPUs are offline. And when I look at resume_target_kernel() I see it call hibernate_resume_nonboot_cpu_disable(). So how is the SMT offline different from that offline? afaict they all get into play_dead()->native_play_dead()->mwait_play_dead(). ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: "nosmt" breaks resuming from hibernation (was Re: [5.2-rc1 regression]: nvme vs. hibernation) 2019-05-29 8:56 ` Peter Zijlstra @ 2019-05-29 9:20 ` Jiri Kosina 0 siblings, 0 replies; 10+ messages in thread From: Jiri Kosina @ 2019-05-29 9:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Dongli Zhang, Keith Busch, Jens Axboe, Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch, Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Ingo Molnar, x86, Rafael J. Wysocki, linux-pm, Josh Poimboeuf On Wed, 29 May 2019, Peter Zijlstra wrote: > > > I verified that it succesfully makes it to the point where restore_image() > > > is called from swsusp_arch_resume() (and verified that only BSP is alive > > > at that time), but the old kernel never comes back and triplefault-like > > > reboot happens. > > which means that even without nosmt all 'other' CPUs are offline. And > when I look at resume_target_kernel() I see it call > hibernate_resume_nonboot_cpu_disable(). > > So how is the SMT offline different from that offline? afaict they all > get into play_dead()->native_play_dead()->mwait_play_dead(). There is no way those other CPUs have been offlined before to the native_play_dead() state, as this is way before any userspace was alive to initiate any kind of hotplug. So they are guaranteed to have been all online, and then offlined properly to resume_play_dead(). 'nosmt' is the only exception there, as it's the only kind of offlining that has already happened at this point. Let's continue in the other thread. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread