[PATCH] Fix scheduler crash after s3 resume

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Fix scheduler crash after s3 resume
@ 2013-01-23 15:51 Tomasz Wroblewski
  2013-01-23 16:11 ` Jan Beulich
  2013-01-24  6:18 ` Juergen Gross
  0 siblings, 2 replies; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-23 15:51 UTC (permalink / raw)
  To: xen-devel@lists.xen.org; +Cc: george.dunlap, keir, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 4706 bytes --]

Hi all,

This was also discussed earlier, for example here 
http://xen.markmail.org/thread/iqvkylp3mclmsnbw

Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a 
global variable, which, among other things, is used to prevent disabling 
cpu scheduler, prevent breaking vcpu affinities, prevent removing the 
cpu from cpupool on suspend. However, it missed one place where cpu is 
removed from the cpupool valid cpus mask, in smpboot.c, __cpu_disable(), 
line 840:

     cpumask_clear_cpu(cpu, cpupool0->cpu_valid);

This causes the vcpu in the default pool to be considered inactive, and 
the following assertion is violated in sched_credit.c soon after resume 
transitions out of xen, causing a platform reboot:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' 
failed at sched_credit.c:507
(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
(XEN) rax: 0000000000000001   rbx: 0000000000000008   rcx: 0000000000000008
(XEN) rdx: 00000000000000ff   rsi: 0000000000000008   rdi: 0000000000000000
(XEN) rbp: ffff83011415fdd8   rsp: ffff83011415fcf8   r8:  0000000000000000
(XEN) r9:  000000000000003e   r10: 00000008f3de731f   r11: ffffea0000063800
(XEN) r12: ffff82c480261720   r13: ffff830137b4d950   r14: ffff830137beb010
(XEN) r15: ffff82c480261720   cr0: 0000000080050033   cr4: 00000000000026f0
(XEN) cr3: 000000013c17d000   cr2: ffff8800ac6ef8f0
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff83011415fcf8:
(XEN)    00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
(XEN)    0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
(XEN)    ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
(XEN)    ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
(XEN)    0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
(XEN)    ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
(XEN)    000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
(XEN)    e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
(XEN)    00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
(XEN)    0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
(XEN)    ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
(XEN)    80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
(XEN)    ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
(XEN)    ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
(XEN)    ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
(XEN)    00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
(XEN)    ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
(XEN)    00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
(XEN) Xen call trace:
(XEN)    [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN)    [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
(XEN)    [<ffff82c48012748f>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
(XEN)    [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
(XEN)    [<ffff82c480124bb4>] do_softirq+0x13/0x15
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' 
failed at sched_credit.c:507
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

^ reason for above being that "cpus" cpumask is empty as it is a logical 
"and" between cpupool's valid cpus (from which the cpu was removed) and 
cpu affinity mask.

Attached patch follows the spirit of the changeset 25079:d5ccb2d1dbd1 
(which blocked removal of the cpu from the cpupool in cpupool.c) by also 
blocking it's removal from the cpupool's valid cpumask. So cpu 
affinities are still preserved across suspend/resume, and scheuduler 
does not need to be disabled, as per original intent (I think). Would 
welcome comments.

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Commit message:
Fix s3 resume regression (crash in scheduler) after c-s 
25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the 
cpupool's cpu_valid mask - in the spirit of mentioned c-s.


[-- Attachment #2: fix-suspend-cpu-valid-mask --]
[-- Type: text/plain, Size: 501 bytes --]

diff -r 4b476378fc35 xen/arch/x86/smpboot.c
--- a/xen/arch/x86/smpboot.c	Mon Jan 21 17:03:10 2013 +0000
+++ b/xen/arch/x86/smpboot.c	Wed Jan 23 15:25:28 2013 +0000
@@ -837,7 +837,8 @@
     remove_siblinginfo(cpu);
 
     /* It's now safe to remove this processor from the online map */
-    cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
+    if (system_state != SYS_STATE_suspend)
+        cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
     cpumask_clear_cpu(cpu, &cpu_online_map);
     fixup_irqs();
 

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Fix scheduler crash after s3 resume
  2013-01-23 15:51 [PATCH] Fix scheduler crash after s3 resume Tomasz Wroblewski
@ 2013-01-23 16:11 ` Jan Beulich
  2013-01-23 16:57   ` Tomasz Wroblewski
  2013-01-24  6:18 ` Juergen Gross
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2013-01-23 16:11 UTC (permalink / raw)
  To: Tomasz Wroblewski, xen-devel@lists.xen.org
  Cc: george.dunlap, Juergen Gross, keir

>>> On 23.01.13 at 16:51, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> Hi all,
> 
> This was also discussed earlier, for example here 
> http://xen.markmail.org/thread/iqvkylp3mclmsnbw 
> 
> Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a 
> global variable, which, among other things, is used to prevent disabling 
> cpu scheduler, prevent breaking vcpu affinities, prevent removing the 
> cpu from cpupool on suspend. However, it missed one place where cpu is 
> removed from the cpupool valid cpus mask, in smpboot.c, __cpu_disable(), 
> line 840:
> 
>      cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
> 
> This causes the vcpu in the default pool to be considered inactive, and 
> the following assertion is violated in sched_credit.c soon after resume 
> transitions out of xen, causing a platform reboot:
> 
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) Enabling non-boot CPUs  ...
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' 
> failed at sched_credit.c:507
> (XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    1
> (XEN) RIP:    e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
> (XEN) rax: 0000000000000001   rbx: 0000000000000008   rcx: 0000000000000008
> (XEN) rdx: 00000000000000ff   rsi: 0000000000000008   rdi: 0000000000000000
> (XEN) rbp: ffff83011415fdd8   rsp: ffff83011415fcf8   r8:  0000000000000000
> (XEN) r9:  000000000000003e   r10: 00000008f3de731f   r11: ffffea0000063800
> (XEN) r12: ffff82c480261720   r13: ffff830137b4d950   r14: ffff830137beb010
> (XEN) r15: ffff82c480261720   cr0: 0000000080050033   cr4: 00000000000026f0
> (XEN) cr3: 000000013c17d000   cr2: ffff8800ac6ef8f0
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen stack trace from rsp=ffff83011415fcf8:
> (XEN)    00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
> (XEN)    0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
> (XEN)    ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
> (XEN)    ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
> (XEN)    0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
> (XEN)    ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
> (XEN)    000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
> (XEN)    e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
> (XEN)    00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
> (XEN)    0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
> (XEN)    ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
> (XEN)    80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
> (XEN)    ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
> (XEN)    ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
> (XEN)    ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
> (XEN)    00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
> (XEN)    ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
> (XEN)    00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
> (XEN) Xen call trace:
> (XEN)    [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN)    [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
> (XEN)    [<ffff82c48012748f>] execute_timer+0x4e/0x6c
> (XEN)    [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
> (XEN)    [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
> (XEN)    [<ffff82c480124bb4>] do_softirq+0x13/0x15
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' 
> failed at sched_credit.c:507
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
> 
> ^ reason for above being that "cpus" cpumask is empty as it is a logical 
> "and" between cpupool's valid cpus (from which the cpu was removed) and 
> cpu affinity mask.

So can you confirm that this is not a problem on a non-debug
hypervisor? I'm particularly asking because, leaving the ASSERT()
aside, such a fundamental flaw would have made it impossible
for S3 to work for anyone, and that's reportedly not the case.

> Attached patch follows the spirit of the changeset 25079:d5ccb2d1dbd1 
> (which blocked removal of the cpu from the cpupool in cpupool.c) by also 
> blocking it's removal from the cpupool's valid cpumask. So cpu 
> affinities are still preserved across suspend/resume, and scheuduler 
> does not need to be disabled, as per original intent (I think). Would 
> welcome comments.

Looks reasonable (and consistent with the earlier change), but
I'd still like to wait for at least Keir's and Juergen's opinion.

Jan

> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
> 
> Commit message:
> Fix s3 resume regression (crash in scheduler) after c-s 
> 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the 
> cpupool's cpu_valid mask - in the spirit of mentioned c-s.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Fix scheduler crash after s3 resume
  2013-01-23 16:11 ` Jan Beulich
@ 2013-01-23 16:57   ` Tomasz Wroblewski
  2013-01-23 17:01     ` Tomasz Wroblewski
  2013-01-23 17:50     ` Tomasz Wroblewski
  0 siblings, 2 replies; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-23 16:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

On 23/01/13 17:11, Jan Beulich wrote:
> So can you confirm that this is not a problem on a non-debug
> hypervisor? I'm particularly asking because, leaving the ASSERT()
> aside, such a fundamental flaw would have made it impossible
> for S3 to work for anyone, and that's reportedly not the case.
>    
Jan,

I tested with non debug hypervisor, without the patch, and it seems I 
get one of two outcomes: either it works, but the cpupool is emptied of 
all cpus but cpu 0 (so I presume nothing will get scheduled on them 
after resume), evidenced by "dump run queues" debug key, or I get the 
following crash:

(XEN) ----[ Xen-4.3-unstable  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48012113a>] vcpu_wake+0x4a/0x3b0
(XEN) RFLAGS: 0000000000010092   CONTEXT: hypervisor
(XEN) rax: 00007d3b7fd10180   rbx: ffff8300ba6f3000   rcx: 0000000000000000
(XEN) rdx: 0000000000000000   rsi: 0000000000000040   rdi: ffff8300ba6f3000
(XEN) rbp: ffff82c4802da1c0   rsp: ffff82c4802af998   r8:  0000000000000016
(XEN) r9:  ffff8301300bc0c8   r10: 0000000000000002   r11: ffff83013797d010
(XEN) r12: ffff82c4802efee0   r13: 0000000000000092   r14: ffff82c4802efee0
(XEN) r15: ffff82c4802da1c0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000000ba65d000   cr2: 0000000000000060
(XEN) ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c4802af998:
(XEN)    ffff82c4802afbc0 ffffffffffffff49 0000000000000000 ffff82c400000010
(XEN)    0000000800000008 ffff8300ba6f3000 0000000000000000 000000000000001b
(XEN)    ffff830137984000 ffff83013c5e00f0 0000000000000001 ffff82c48015b6fd
(XEN)    000000000013c4be ffff830137984000 0000000000000027 ffff82c480106375
(XEN)    ffff82c4802a8000 7400000000000000 ffff8301379c1b80 ffff82c4802c7800
(XEN)    ffff82c4802f02e8 ffff82c480165a2b ffff8800ba60e3c8 ffff8301300bc800
(XEN)    000000000000001b ffff82c4802afbf8 ffff8301379c1ba4 0000000000000000
(XEN)    ffffffffffff8000 2000000000000001 ffff82c4802afab8 ffff82c4802c7800
(XEN)    ffff82c4802f02e8 0000000000000000 ffff8301379c0080 ffff82c4802afbf8
(XEN)    00000000000000f0 ffff82c48015d507 00000000000000f0 ffff82c4802afbf8
(XEN)    ffff8301379c0080 0000000000000000 ffff82c4802f02e8 ffff82c4802c7800
(XEN)    0000000000000400 0000000000000002 0000000000000000 0000000000000002
(XEN)    ffff82c4802efea0 ffff82c48024dfc0 ffff82c4802a8000 ffff8301379c00a4
(XEN)    ffff8301379c00a4 0000006900000000 ffff82c480165cf7 000000000000e008
(XEN)    0000000000000202 ffff82c4802afb78 000000000000e010 ffff82c480165cf7
(XEN)    0000000000000096 0000000000000000 ffff82c48024dfc0 0000000000000000
(XEN)    ffff8301379c00a4 0000000000000092 0000000000000096 ffff82c48013bb56
(XEN)    ffff82c480263700 0000000000000000 ffff82c4802efea0 0000000000000003
(XEN)    ffff82c4802f0290 ffff82c4802a8000 00000000000026f0 ffff82c48015d507
(XEN)    00000000000026f0 ffff82c4802a8000 ffff82c4802f0290 0000000000000003
(XEN) Xen call trace:
(XEN)    [<ffff82c48012113a>] vcpu_wake+0x4a/0x3b0
(XEN)    [<ffff82c48015b6fd>] vcpu_kick+0x1d/0x80
(XEN)    [<ffff82c480106375>] evtchn_set_pending+0x105/0x180
(XEN)    [<ffff82c480165a2b>] do_IRQ+0x1db/0x5e0
(XEN)    [<ffff82c48015d507>] common_interrupt+0x57/0x60
(XEN)    [<ffff82c480165cf7>] do_IRQ+0x4a7/0x5e0
(XEN)    [<ffff82c480165cf7>] do_IRQ+0x4a7/0x5e0
(XEN)    [<ffff82c48013bb56>] __serial_putc+0x76/0x160
(XEN)    [<ffff82c48015d507>] common_interrupt+0x57/0x60
(XEN)    [<ffff82c48019a0ef>] enter_state+0x29f/0x390
(XEN)    [<ffff82c480139ce0>] serial_rx+0/0xa0
(XEN)    [<ffff82c480100000>] _stext+0/0x14
(XEN)    [<ffff82c480110b57>] handle_keypress+0x67/0xd0
(XEN)    [<ffff82c48013bca6>] serial_rx_interrupt+0x66/0xe0
(XEN)    [<ffff82c48013a1db>] __ns16550_poll+0x4b/0xb0
(XEN)    [<ffff82c48013a190>] __ns16550_poll+0/0xb0
(XEN)    [<ffff82c480139fd6>] ns16550_poll+0x26/0x30
(XEN)    [<ffff82c480182499>] do_invalid_op+0x3d9/0x3f0
(XEN)    [<ffff82c480139fb0>] ns16550_poll+0/0x30
(XEN)    [<ffff82c480216355>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82c480139fb0>] ns16550_poll+0/0x30
(XEN)    [<ffff82c480139fcc>] ns16550_poll+0x1c/0x30
(XEN)    [<ffff82c480125a3b>] timer_softirq_action+0xbb/0x2b0
(XEN)    [<ffff82c48012311a>] __do_softirq+0x5a/0x90
(XEN)    [<ffff82c480156e95>] idle_loop+0x25/0x50
(XEN)
(XEN) Pagetable walk from 0000000000000060:
(XEN)  L4[0x000] = 000000013a80d063 ffffffffffffffff
(XEN)  L3[0x000] = 000000013a80c063 ffffffffffffffff
(XEN)  L2[0x000] = 000000013a80b063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000060
(XEN) ****************************************

With the patch, there's no crash and the cpupool correctly contains 4 
cpus after resume (same as before suspend).

Sidenote: I was also getting the vcpu_wake crashes earlier, on debug 
versions, when I just commented out the asserts for test purposes. So 
just assumed its a consequence of a violated assert later on.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Fix scheduler crash after s3 resume
  2013-01-23 16:57   ` Tomasz Wroblewski
@ 2013-01-23 17:01     ` Tomasz Wroblewski
  2013-01-23 17:50     ` Tomasz Wroblewski
  1 sibling, 0 replies; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-23 17:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

Aha, also don't be offended by the stacktraces attached having a serial 
port activity in the trace, I had to get them by hooking the most of xen 
suspend logic (without actual do_lowlevel_suspend) onto a debug key 
instead of using real s3 - since I can't otherwise get resume serial 
trace on this laptop, which uses pci serial card. But I did test the 
resulting patch against real s3 cycle.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Fix scheduler crash after s3 resume
  2013-01-23 16:57   ` Tomasz Wroblewski
  2013-01-23 17:01     ` Tomasz Wroblewski
@ 2013-01-23 17:50     ` Tomasz Wroblewski
  1 sibling, 0 replies; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-23 17:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

Okay, sadly after more testing it seems I am getting the same vcpu_wake 
crash, even with patch (although its somewhat rare) From the disassembly 
it actually looks to be crashing somewhere in vcpu_schedule_lock(), 
which gets inlined. So whilst patch fixes the fact that cpus are removed 
from cpupool0 after resume, the crashiness remains, which might be a 
separate bug - so I'm going to attempt to dig here more.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Fix scheduler crash after s3 resume
  2013-01-23 15:51 [PATCH] Fix scheduler crash after s3 resume Tomasz Wroblewski
  2013-01-23 16:11 ` Jan Beulich
@ 2013-01-24  6:18 ` Juergen Gross
  2013-01-24 14:26   ` [PATCH v2] " Tomasz Wroblewski
  1 sibling, 1 reply; 26+ messages in thread
From: Juergen Gross @ 2013-01-24  6:18 UTC (permalink / raw)
  To: Tomasz Wroblewski
  Cc: george.dunlap, keir, Jan Beulich, xen-devel@lists.xen.org

Am 23.01.2013 16:51, schrieb Tomasz Wroblewski:
> Hi all,
>
> This was also discussed earlier, for example here
> http://xen.markmail.org/thread/iqvkylp3mclmsnbw
>
> Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a
> global variable, which, among other things, is used to prevent disabling
> cpu scheduler, prevent breaking vcpu affinities, prevent removing the
> cpu from cpupool on suspend. However, it missed one place where cpu is
> removed from the cpupool valid cpus mask, in smpboot.c, __cpu_disable(),
> line 840:
>
> cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
>
> This causes the vcpu in the default pool to be considered inactive, and
> the following assertion is violated in sched_credit.c soon after resume
> transitions out of xen, causing a platform reboot:
>
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) Enabling non-boot CPUs ...
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)'
> failed at sched_credit.c:507
> (XEN) ----[ Xen-4.3-unstable x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 1
> (XEN) RIP: e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN) RFLAGS: 0000000000010202 CONTEXT: hypervisor
> (XEN) rax: 0000000000000001 rbx: 0000000000000008 rcx: 0000000000000008
> (XEN) rdx: 00000000000000ff rsi: 0000000000000008 rdi: 0000000000000000
> (XEN) rbp: ffff83011415fdd8 rsp: ffff83011415fcf8 r8: 0000000000000000
> (XEN) r9: 000000000000003e r10: 00000008f3de731f r11: ffffea0000063800
> (XEN) r12: ffff82c480261720 r13: ffff830137b4d950 r14: ffff830137beb010
> (XEN) r15: ffff82c480261720 cr0: 0000000080050033 cr4: 00000000000026f0
> (XEN) cr3: 000000013c17d000 cr2: ffff8800ac6ef8f0
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
> (XEN) Xen stack trace from rsp=ffff83011415fcf8:
> (XEN) 00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
> (XEN) 0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
> (XEN) 0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
> (XEN) ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
> (XEN) ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
> (XEN) 0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
> (XEN) 0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
> (XEN) ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
> (XEN) 000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
> (XEN) e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
> (XEN) 00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
> (XEN) 0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
> (XEN) ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
> (XEN) 80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
> (XEN) ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
> (XEN) ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
> (XEN) ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
> (XEN) 00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
> (XEN) ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
> (XEN) 00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
> (XEN) Xen call trace:
> (XEN) [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN) [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
> (XEN) [<ffff82c48012748f>] execute_timer+0x4e/0x6c
> (XEN) [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
> (XEN) [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
> (XEN) [<ffff82c480124bb4>] do_softirq+0x13/0x15
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)'
> failed at sched_credit.c:507
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
>
> ^ reason for above being that "cpus" cpumask is empty as it is a logical
> "and" between cpupool's valid cpus (from which the cpu was removed) and
> cpu affinity mask.
>
> Attached patch follows the spirit of the changeset 25079:d5ccb2d1dbd1
> (which blocked removal of the cpu from the cpupool in cpupool.c) by also
> blocking it's removal from the cpupool's valid cpumask. So cpu
> affinities are still preserved across suspend/resume, and scheuduler
> does not need to be disabled, as per original intent (I think). Would
> welcome comments.
>
> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

>
> Commit message:
> Fix s3 resume regression (crash in scheduler) after c-s
> 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the
> cpupool's cpu_valid mask - in the spirit of mentioned c-s.
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-24  6:18 ` Juergen Gross
@ 2013-01-24 14:26   ` Tomasz Wroblewski
  2013-01-24 15:36     ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-24 14:26 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Keir (Xen.org), Jan Beulich,
	xen-devel@lists.xen.org

[-- Attachment #1: Type: text/plain, Size: 2336 bytes --]

Hi, had to made another version of this patch which also fixes the 
additional interminent crash in vcpu_wake(). Would be grateful for 
comments / possible ack again.

This crash was happening when vcpu_wake is called between 
disable_nonboot_cpus() and enable_nonboot_cpus() on s3 path (happens for 
example from the insides of rcu_barrier() sometime). It turns out that 
it was due to vcpu_schedule_lock() accessing per_cpu area, which however 
is freed at this point as it's freed in percpu.c cpu down callback.

I tried the approach of preserving per cpu area on cpu down/up (during 
s3), as well as testing for cpu being online in vcpu_wake before 
acquiring this lock, but ultimately although helping a bit these were 
not fully succesful. So concluded it's probably not really correct to 
let the scheduler run rampart during the disable_nonboot_cpus() / 
enable_nonboot_cpus() window on s3 path and made a new patch version. 
Tested it across many s3 iterations (on lenovo T520), with no problems. 
It should be pretty uninvasive as it only touches S3 path.

Changes from v1:
- modified cpu_disable_scheduler (schedule.c) to run most of stop 
scheduler logic again (i.e. the vcpu migrate). This is partial revert of 
c-s 25079:d5ccb2d1dbd1 . However, breaking of domain vcpu affinities 
seems to be avoidable on this path, so added a condition to do just that

- instead of skipping cpupool0->is_valid cpumask clear on suspend path, 
added restore of that bit on resume path. Moving this was needed because 
otherwise cpu_disable_scheduler() fails to migrate the vcpu (as it's 
still in cpu_valid mask when it's attempted), return EAGAIN and BUG() 
will fire in __cpu_disable()

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Commit message:
Fix S3 resume regression after C-S 25079:d5ccb2d1dbd1. Regression causes 
either an interminent crash in vcpu_wake (attempt to access 
vcpu_schedule_lock which is in freed per cpu area at this point), or, in 
debug xen, more frequent ASSERT(!cpumask_empty(&cpus)...) firing in 
_csched_cpu_pick.

Fix this by reverting the hunk which turned off disabling cpu scheduler 
on suspend path. Additionally, avoid breaking domain vcpu affinities on 
suspend path. On resume, restore the frozen cpus in cpupool's cpu_valid 
mask, so they can once again be used by scheduler.

[-- Attachment #2: fix-suspend-scheduler-v2 --]
[-- Type: text/plain, Size: 1512 bytes --]

diff -r 4b476378fc35 -r b7fa7fdaad59 xen/common/cpu.c
--- a/xen/common/cpu.c	Mon Jan 21 17:03:10 2013 +0000
+++ b/xen/common/cpu.c	Thu Jan 24 13:40:31 2013 +0000
@@ -5,6 +5,7 @@
 #include <xen/init.h>
 #include <xen/sched.h>
 #include <xen/stop_machine.h>
+#include <xen/sched-if.h>

 unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
 #ifndef nr_cpumask_bits
@@ -212,6 +213,8 @@
             BUG_ON(error == -EBUSY);
             printk("Error taking CPU%d up: %d\n", cpu, error);
         }
+        if (system_state == SYS_STATE_resume)
+            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
     }

     cpumask_clear(&frozen_cpus);
diff -r 4b476378fc35 -r b7fa7fdaad59 xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Jan 21 17:03:10 2013 +0000
+++ b/xen/common/schedule.c	Thu Jan 24 13:40:31 2013 +0000
@@ -545,7 +545,7 @@
     int    ret = 0;

     c = per_cpu(cpupool, cpu);
-    if ( (c == NULL) || (system_state == SYS_STATE_suspend) )
+    if ( c == NULL )
         return ret;

     for_each_domain_in_cpupool ( d, c )
@@ -556,7 +556,8 @@

             cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
             if ( cpumask_empty(&online_affinity) &&
-                 cpumask_test_cpu(cpu, v->cpu_affinity) )
+                 cpumask_test_cpu(cpu, v->cpu_affinity) &&
+                 system_state != SYS_STATE_suspend )
             {
                 printk("Breaking vcpu affinity for domain %d vcpu %d\n",
                         v->domain->domain_id, v->vcpu_id);

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-24 14:26   ` [PATCH v2] " Tomasz Wroblewski
@ 2013-01-24 15:36     ` Jan Beulich
  2013-01-24 15:57       ` George Dunlap
  2013-01-24 16:25       ` Tomasz Wroblewski
  0 siblings, 2 replies; 26+ messages in thread
From: Jan Beulich @ 2013-01-24 15:36 UTC (permalink / raw)
  To: Tomasz Wroblewski, Juergen Gross
  Cc: George Dunlap, Keir (Xen.org), xen-devel@lists.xen.org

>>> On 24.01.13 at 15:26, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>@@ -212,6 +213,8 @@
>             BUG_ON(error == -EBUSY);
>             printk("Error taking CPU%d up: %d\n", cpu, error);
>         }
>+        if (system_state == SYS_STATE_resume)
>+            cpumask_set_cpu(cpu, cpupool0->cpu_valid);

This can't be right: What tells you that all CPUs were in pool 0?

Also, for the future - generating patches with -p helps quite
a bit in reviewing them.

>--- a/xen/common/schedule.c	Mon Jan 21 17:03:10 2013 +0000
>+++ b/xen/common/schedule.c	Thu Jan 24 13:40:31 2013 +0000
>@@ -545,7 +545,7 @@
>     int    ret = 0;
> 
>     c = per_cpu(cpupool, cpu);
>-    if ( (c == NULL) || (system_state == SYS_STATE_suspend) )
>+    if ( c == NULL )
>         return ret;
> 
>     for_each_domain_in_cpupool ( d, c )
>@@ -556,7 +556,8 @@
> 
>             cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
>             if ( cpumask_empty(&online_affinity) &&
>-                 cpumask_test_cpu(cpu, v->cpu_affinity) )
>+                 cpumask_test_cpu(cpu, v->cpu_affinity) &&
>+                 system_state != SYS_STATE_suspend )
>             {
>                 printk("Breaking vcpu affinity for domain %d vcpu %d\n",
>                         v->domain->domain_id, v->vcpu_id);

I doubt this is correct, as you don't restore any of the settings
during resume that you tear down here.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-24 15:36     ` Jan Beulich
@ 2013-01-24 15:57       ` George Dunlap
  2013-01-24 16:25       ` Tomasz Wroblewski
  1 sibling, 0 replies; 26+ messages in thread
From: George Dunlap @ 2013-01-24 15:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tomasz Wroblewski, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

On 24/01/13 15:36, Jan Beulich wrote:
>>>> On 24.01.13 at 15:26, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> @@ -212,6 +213,8 @@
>>              BUG_ON(error == -EBUSY);
>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>          }
>> +        if (system_state == SYS_STATE_resume)
>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
> This can't be right: What tells you that all CPUs were in pool 0?
>
> Also, for the future - generating patches with -p helps quite
> a bit in reviewing them.

If you're using mercurial e-mail / mercurial queues, you can get this 
effect by adding the following lines to your ~/.hgrc:

[diff]
showfunc = True
nodates = 1
git = 1

  -George

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-24 15:36     ` Jan Beulich
  2013-01-24 15:57       ` George Dunlap
@ 2013-01-24 16:25       ` Tomasz Wroblewski
  2013-01-24 16:56         ` Jan Beulich
  1 sibling, 1 reply; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-24 16:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

On 24/01/13 16:36, Jan Beulich wrote:
>>>> On 24.01.13 at 15:26, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>>>          
>> @@ -212,6 +213,8 @@
>>              BUG_ON(error == -EBUSY);
>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>          }
>> +        if (system_state == SYS_STATE_resume)
>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>      
> This can't be right: What tells you that all CPUs were in pool 0?
>
>    
You're right, in my simple tests this was the case, but generally 
speaking it might not be.. Would an approach based on storing cpupool0 
mask in disable_nonboot_cpus() and restoring it in enable_nonboot_cpus() 
be more acceptable?
> Also, for the future - generating patches with -p helps quite
> a bit in reviewing them.
>
>    
Ok, thanks!
>> --- a/xen/common/schedule.c	Mon Jan 21 17:03:10 2013 +0000
>> +++ b/xen/common/schedule.c	Thu Jan 24 13:40:31 2013 +0000
>> @@ -545,7 +545,7 @@
>>      int    ret = 0;
>>
>>      c = per_cpu(cpupool, cpu);
>> -    if ( (c == NULL) || (system_state == SYS_STATE_suspend) )
>> +    if ( c == NULL )
>>          return ret;
>>
>>      for_each_domain_in_cpupool ( d, c )
>> @@ -556,7 +556,8 @@
>>
>>              cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
>>              if ( cpumask_empty(&online_affinity)&&
>> -                 cpumask_test_cpu(cpu, v->cpu_affinity) )
>> +                 cpumask_test_cpu(cpu, v->cpu_affinity)&&
>> +                 system_state != SYS_STATE_suspend )
>>              {
>>                  printk("Breaking vcpu affinity for domain %d vcpu %d\n",
>>                          v->domain->domain_id, v->vcpu_id);
>>      
> I doubt this is correct, as you don't restore any of the settings
> during resume that you tear down here.
>
>    
Is the objection about the affinity part or also the (c == NULL) bit? 
The cpu_disable_scheduler() function is currently part of a regular cpu 
down process, and was also part of suspend process before the "system 
state variable" changeset which regressed it. So the (c==NULL) hunk 
mostly just returns to previous state where this was working alot better 
(by empirical testing). But I am no expert on this, so would be grateful 
for ideas how this could be fixed in a better way!

Just to recap, the current problem boils down, I believe,  to the fact 
that vcpu_wake (schedule.c) function keeps getting called occasionally 
during the S3 path for cpus which have the per_cpu data freed, causing a 
crash. Safest way of fixing it seemed to be just put the suspend 
cpu_disable_scheduler under regular path again - it probably isn't the 
best..

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-24 16:25       ` Tomasz Wroblewski
@ 2013-01-24 16:56         ` Jan Beulich
  2013-01-25  9:07           ` Tomasz Wroblewski
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2013-01-24 16:56 UTC (permalink / raw)
  To: Tomasz Wroblewski
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

>>> On 24.01.13 at 17:25, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> On 24/01/13 16:36, Jan Beulich wrote:
>>>>> On 24.01.13 at 15:26, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>>>>          
>>> @@ -212,6 +213,8 @@
>>>              BUG_ON(error == -EBUSY);
>>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>>          }
>>> +        if (system_state == SYS_STATE_resume)
>>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>>      
>> This can't be right: What tells you that all CPUs were in pool 0?
>>
>>    
> You're right, in my simple tests this was the case, but generally 
> speaking it might not be.. Would an approach based on storing cpupool0 
> mask in disable_nonboot_cpus() and restoring it in enable_nonboot_cpus() 
> be more acceptable?

As long as that doesn't lead to other pools still come back
corrupted after resume.

You may need to work this out with Juergen.

>>> --- a/xen/common/schedule.c	Mon Jan 21 17:03:10 2013 +0000
>>> +++ b/xen/common/schedule.c	Thu Jan 24 13:40:31 2013 +0000
>>> @@ -545,7 +545,7 @@
>>>      int    ret = 0;
>>>
>>>      c = per_cpu(cpupool, cpu);
>>> -    if ( (c == NULL) || (system_state == SYS_STATE_suspend) )
>>> +    if ( c == NULL )
>>>          return ret;
>>>
>>>      for_each_domain_in_cpupool ( d, c )
>>> @@ -556,7 +556,8 @@
>>>
>>>              cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
>>>              if ( cpumask_empty(&online_affinity)&&
>>> -                 cpumask_test_cpu(cpu, v->cpu_affinity) )
>>> +                 cpumask_test_cpu(cpu, v->cpu_affinity)&&
>>> +                 system_state != SYS_STATE_suspend )
>>>              {
>>>                  printk("Breaking vcpu affinity for domain %d vcpu %d\n",
>>>                          v->domain->domain_id, v->vcpu_id);
>>>      
>> I doubt this is correct, as you don't restore any of the settings
>> during resume that you tear down here.
>>
>>    
> Is the objection about the affinity part or also the (c == NULL) bit? 
> The cpu_disable_scheduler() function is currently part of a regular cpu 
> down process, and was also part of suspend process before the "system 
> state variable" changeset which regressed it. So the (c==NULL) hunk 
> mostly just returns to previous state where this was working alot better 
> (by empirical testing). But I am no expert on this, so would be grateful 
> for ideas how this could be fixed in a better way!

The change you're about to partly revert was correcting one
fundamental mistake: bringing down a CPU at runtime is a
different thing than doing the same during suspend. In the
former case you indeed want all associations to it to be cut off,
whereas in the S3 case you want everything to come back at
resume the way it was before suspend. So I think you're just
trying to revert to much of that original change.

But again, Keir and Juergen (who collectively did that change
iirc) would be good for you to consult with.

> Just to recap, the current problem boils down, I believe,  to the fact 
> that vcpu_wake (schedule.c) function keeps getting called occasionally 
> during the S3 path for cpus which have the per_cpu data freed, causing a 
> crash. Safest way of fixing it seemed to be just put the suspend 
> cpu_disable_scheduler under regular path again - it probably isn't the 
> best..

It certainly looks wrong for vcpu_wake() to be called on an offline
CPU in the first place (i.e. I'm getting the impression you're trying
to cure symptoms rather than the root cause). Did you note down
the call tree that got you there?

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-24 16:56         ` Jan Beulich
@ 2013-01-25  9:07           ` Tomasz Wroblewski
  2013-01-25  9:36             ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-25  9:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org


> As long as that doesn't lead to other pools still come back
> corrupted after resume.
>
> You may need to work this out with Juergen.
>
>    
Nods. My intention is to just offset the cpupool0 valid mask clear done 
on __cpu_disable().

>
> The change you're about to partly revert was correcting one
> fundamental mistake: bringing down a CPU at runtime is a
> different thing than doing the same during suspend. In the
> former case you indeed want all associations to it to be cut off,
> whereas in the S3 case you want everything to come back at
> resume the way it was before suspend. So I think you're just
> trying to revert to much of that original change.
>
> But again, Keir and Juergen (who collectively did that change
> iirc) would be good for you to consult with.
>
>    
Understood, it'd certainly be preferable if this could be fixed leaving 
the original intent in (I tried but so far failed). From my limited 
understanding of reading the scheduler code, it is actually the 
vcpu_migrate() function called from cpu_disable_scheduler which 
substitutes new v->processor in the vcpu struct, which in turn causes 
the vcpu_wakes to stop happening on these partially downed cpus, and 
hence avoids the crash.

> It certainly looks wrong for vcpu_wake() to be called on an offline
> CPU in the first place (i.e. I'm getting the impression you're trying
> to cure symptoms rather than the root cause). Did you note down
> the call tree that got you there?
>
>    
Here's an example one, taken from unpatched suspend path - was pretty 
hard to get as for some reason I am getting a lot of serial output eaten 
even with sync_console. I also seen it from few other places, believe 
the window of opportunity for it to happen is before the 
enable_nonboot_boot_cpus() is called in enter_state(). If it survives 
past this point it won't crash.


(XEN) Finishing wakeup from ACPI S3 state.
(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
(XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor
(XEN) rax: 00007d3b7fcf7580   rbx: ffff8300ba70f000   rcx: ffff8301020180c8
(XEN) rdx: 0000000000000000   rsi: 0000000000000040   rdi: ffff8300ba70f000
(XEN) rbp: ffff82c4802c7bf0   rsp: ffff82c4802c7bb0   r8:  0000000000000002
(XEN) r9:  ffff83013a805380   r10: 00000009713faecf   r11: 000000000000000a
(XEN) r12: ffff82c480308ae0   r13: ffff82c4802f2dc0   r14: 0000000000000246
(XEN) r15: 0000000000000082   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000000ba674000   cr2: 0000000000000060
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c4802c7bb0:
(XEN)    ffff82c4802c7c00 0000000000000286 ffff83013a8053a8 ffff8300ba70f000
(XEN)    0000000000000000 ffff8300ba70f15c 0000000000000246 ffff82c4802c7e3c
(XEN)    ffff82c4802c7c00 ffff82c48012337b ffff82c4802c7c20 ffff82c48015eaf6
(XEN)    ffff830137b75000 0000000000000013 ffff82c4802c7c30 ffff82c48015eb75
(XEN)    ffff82c4802c7c60 ffff82c4801067e5 ffff82c4802c7c60 ffff8300ba70f000
(XEN)    0000000000000000 ffff8300ba70f15c ffff82c4802c7c90 ffff82c480106a88
(XEN)    ffff82c480308c80 ffff8300ba70f000 ffff82c480121590 00000009a5a50985
(XEN)    ffff82c4802c7ca0 ffff82c480181af7 ffff82c4802c7cb0 ffff82c480121599
(XEN)    ffff82c4802c7ce0 ffff82c4801274cf 0000000000000002 ffff8300ba70f060
(XEN)    ffff82c480308c80 ffff830137beb240 ffff82c4802c7d30 ffff82c4801275cb
(XEN)    ffff82c480308e28 0000000000000286 0000000000000000 ffff82c4802e0000
(XEN)    ffff82c4802e0000 ffff82c4802c0000 fffffffffffffffd ffff82c4802c7e3c
(XEN)    ffff82c4802c7d60 ffff82c480124b8e 0000000000000000 ffff82c4802655e0
(XEN)    0000000000000003 0000000000000100 ffff82c4802c7d70 ffff82c480124bdf
(XEN)    ffff82c4802c7db0 ffff82c48012cd1f ffff82c4802c7da0 0000000000000000
(XEN)    ffff82c48012c79c ffff82c4802c7e3c 0000000000000008 0000000000000000
(XEN)    ffff82c4802c7e20 ffff82c480125a77 ffff82c4802c7e40 ffff82c48012cccb
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffff82c4802c7e00 ffff82c4802c0000 0000000000000000 0000000000000003
(XEN)    ffff82c480308e90 00000000000026f0 ffff82c4802c7e40 ffff82c48012cbf8
(XEN) Xen call trace:
(XEN)    [<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
(XEN)    [<ffff82c48012337b>] vcpu_unblock+0x4b/0x4d
(XEN)    [<ffff82c48015eaf6>] vcpu_kick+0x20/0x71
(XEN)    [<ffff82c48015eb75>] vcpu_mark_events_pending+0x2e/0x39
(XEN)    [<ffff82c4801067e5>] evtchn_set_pending+0xba/0x185
(XEN)    [<ffff82c480106a88>] send_guest_vcpu_virq+0x62/0x7f
(XEN)    [<ffff82c480181af7>] send_timer_event+0xe/0x10
(XEN)    [<ffff82c480121599>] vcpu_singleshot_timer_fn+0x9/0xb
(XEN)    [<ffff82c4801274cf>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c4801275cb>] timer_softirq_action+0xde/0x206
(XEN)    [<ffff82c480124b8e>] __do_softirq+0x8e/0x99
(XEN)    [<ffff82c480124bdf>] process_pending_softirqs+0x46/0x48
(XEN)    [<ffff82c48012cd1f>] rcu_barrier_action+0x54/0x7d
(XEN)    [<ffff82c480125a77>] stop_machine_run+0x1b5/0x1fe
(XEN)    [<ffff82c48012cbf8>] rcu_barrier+0x24/0x26
(XEN)    [<ffff82c48019dae4>] enter_state+0x2cb/0x36b
(XEN)    [<ffff82c48019db9c>] enter_state_action+0x18/0x24
(XEN)    [<ffff82c480126ac6>] do_tasklet_work+0x8d/0xc7
(XEN)    [<ffff82c480126e14>] do_tasklet+0x65/0x95
(XEN)    [<ffff82c480159945>] idle_loop+0x63/0x6a
(XEN)
(XEN) Pagetable walk from 0000000000000060:
(XEN)  L4[0x000] = 000000013a808063 ffffffffffffffff
(XEN)  L3[0x000] = 000000013a807063 ffffffffffffffff
(XEN)  L2[0x000] = 000000013a806063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000060
(XEN) ****************************************

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25  9:07           ` Tomasz Wroblewski
@ 2013-01-25  9:36             ` Jan Beulich
  2013-01-25  9:45               ` Tomasz Wroblewski
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2013-01-25  9:36 UTC (permalink / raw)
  To: Tomasz Wroblewski
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

>>> On 25.01.13 at 10:07, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> It certainly looks wrong for vcpu_wake() to be called on an offline
>> CPU in the first place (i.e. I'm getting the impression you're trying
>> to cure symptoms rather than the root cause). Did you note down
>> the call tree that got you there?
>>
> Here's an example one, taken from unpatched suspend path - was pretty 
> hard to get as for some reason I am getting a lot of serial output eaten 
> even with sync_console. I also seen it from few other places, believe 
> the window of opportunity for it to happen is before the 
> enable_nonboot_boot_cpus() is called in enter_state(). If it survives 
> past this point it won't crash.
> 
> 
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
> (XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor
> (XEN) rax: 00007d3b7fcf7580   rbx: ffff8300ba70f000   rcx: ffff8301020180c8
> (XEN) rdx: 0000000000000000   rsi: 0000000000000040   rdi: ffff8300ba70f000
> (XEN) rbp: ffff82c4802c7bf0   rsp: ffff82c4802c7bb0   r8:  0000000000000002
> (XEN) r9:  ffff83013a805380   r10: 00000009713faecf   r11: 000000000000000a
> (XEN) r12: ffff82c480308ae0   r13: ffff82c4802f2dc0   r14: 0000000000000246
> (XEN) r15: 0000000000000082   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 00000000ba674000   cr2: 0000000000000060
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff82c4802c7bb0:
> (XEN)    ffff82c4802c7c00 0000000000000286 ffff83013a8053a8 ffff8300ba70f000
> (XEN)    0000000000000000 ffff8300ba70f15c 0000000000000246 ffff82c4802c7e3c
> (XEN)    ffff82c4802c7c00 ffff82c48012337b ffff82c4802c7c20 ffff82c48015eaf6
> (XEN)    ffff830137b75000 0000000000000013 ffff82c4802c7c30 ffff82c48015eb75
> (XEN)    ffff82c4802c7c60 ffff82c4801067e5 ffff82c4802c7c60 ffff8300ba70f000
> (XEN)    0000000000000000 ffff8300ba70f15c ffff82c4802c7c90 ffff82c480106a88
> (XEN)    ffff82c480308c80 ffff8300ba70f000 ffff82c480121590 00000009a5a50985
> (XEN)    ffff82c4802c7ca0 ffff82c480181af7 ffff82c4802c7cb0 ffff82c480121599
> (XEN)    ffff82c4802c7ce0 ffff82c4801274cf 0000000000000002 ffff8300ba70f060
> (XEN)    ffff82c480308c80 ffff830137beb240 ffff82c4802c7d30 ffff82c4801275cb
> (XEN)    ffff82c480308e28 0000000000000286 0000000000000000 ffff82c4802e0000
> (XEN)    ffff82c4802e0000 ffff82c4802c0000 fffffffffffffffd ffff82c4802c7e3c
> (XEN)    ffff82c4802c7d60 ffff82c480124b8e 0000000000000000 ffff82c4802655e0
> (XEN)    0000000000000003 0000000000000100 ffff82c4802c7d70 ffff82c480124bdf
> (XEN)    ffff82c4802c7db0 ffff82c48012cd1f ffff82c4802c7da0 0000000000000000
> (XEN)    ffff82c48012c79c ffff82c4802c7e3c 0000000000000008 0000000000000000
> (XEN)    ffff82c4802c7e20 ffff82c480125a77 ffff82c4802c7e40 ffff82c48012cccb
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    ffff82c4802c7e00 ffff82c4802c0000 0000000000000000 0000000000000003
> (XEN)    ffff82c480308e90 00000000000026f0 ffff82c4802c7e40 ffff82c48012cbf8
> (XEN) Xen call trace:
> (XEN)    [<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
> (XEN)    [<ffff82c48012337b>] vcpu_unblock+0x4b/0x4d
> (XEN)    [<ffff82c48015eaf6>] vcpu_kick+0x20/0x71
> (XEN)    [<ffff82c48015eb75>] vcpu_mark_events_pending+0x2e/0x39
> (XEN)    [<ffff82c4801067e5>] evtchn_set_pending+0xba/0x185
> (XEN)    [<ffff82c480106a88>] send_guest_vcpu_virq+0x62/0x7f
> (XEN)    [<ffff82c480181af7>] send_timer_event+0xe/0x10
> (XEN)    [<ffff82c480121599>] vcpu_singleshot_timer_fn+0x9/0xb
> (XEN)    [<ffff82c4801274cf>] execute_timer+0x4e/0x6c
> (XEN)    [<ffff82c4801275cb>] timer_softirq_action+0xde/0x206
> (XEN)    [<ffff82c480124b8e>] __do_softirq+0x8e/0x99
> (XEN)    [<ffff82c480124bdf>] process_pending_softirqs+0x46/0x48
> (XEN)    [<ffff82c48012cd1f>] rcu_barrier_action+0x54/0x7d
> (XEN)    [<ffff82c480125a77>] stop_machine_run+0x1b5/0x1fe
> (XEN)    [<ffff82c48012cbf8>] rcu_barrier+0x24/0x26

I think I had already raised the question of the placement of
this rcu_barrier() here, and the lack of a counterpart in the
suspend portion of the path. Keir? Or should
rcu_barrier_action() avoid calling process_pending_softirqs()
while still resuming, and instead call __do_softirq() with all but
RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
or alternatively by open-coding its effect)?

Jan

> (XEN)    [<ffff82c48019dae4>] enter_state+0x2cb/0x36b
> (XEN)    [<ffff82c48019db9c>] enter_state_action+0x18/0x24
> (XEN)    [<ffff82c480126ac6>] do_tasklet_work+0x8d/0xc7
> (XEN)    [<ffff82c480126e14>] do_tasklet+0x65/0x95
> (XEN)    [<ffff82c480159945>] idle_loop+0x63/0x6a
> (XEN)
> (XEN) Pagetable walk from 0000000000000060:
> (XEN)  L4[0x000] = 000000013a808063 ffffffffffffffff
> (XEN)  L3[0x000] = 000000013a807063 ffffffffffffffff
> (XEN)  L2[0x000] = 000000013a806063 ffffffffffffffff
> (XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0000]
> (XEN) Faulting linear address: 0000000000000060
> (XEN) ****************************************

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25  9:36             ` Jan Beulich
@ 2013-01-25  9:45               ` Tomasz Wroblewski
  2013-01-25 10:15                 ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-25  9:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org


> I think I had already raised the question of the placement of
> this rcu_barrier() here, and the lack of a counterpart in the
> suspend portion of the path. Keir? Or should
> rcu_barrier_action() avoid calling process_pending_softirqs()
> while still resuming, and instead call __do_softirq() with all but
> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
> or alternatively by open-coding its effect)?
>
>    
Though I recall these vcpu_wake crashes happen also from other entry 
points in enter_state but rcu_barrier, so I dont think removing that 
helps much. Just was unable to get a proper log of them today due to 
most of them being cut in half. Will try bit more.

My belief is that as long as vcpu_migrate is not called in 
cpu_disable_scheduler, the vcpu->processor shall continue to point to 
offline cpu. Which will crash if the vcpu_wake is called for that vcpu. 
If vcpu_migrate is called, then vcpu_wake will still be called with some 
frequency but since vcpu->processor shall point to online cpu, and it 
won't crash. So likely avoiding the wakes here completely is not the 
goal, just the offline ones.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25  9:45               ` Tomasz Wroblewski
@ 2013-01-25 10:15                 ` Jan Beulich
  2013-01-25 10:18                   ` Tomasz Wroblewski
  2013-01-25 10:23                   ` Juergen Gross
  0 siblings, 2 replies; 26+ messages in thread
From: Jan Beulich @ 2013-01-25 10:15 UTC (permalink / raw)
  To: Tomasz Wroblewski
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

>>> On 25.01.13 at 10:45, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:

>> I think I had already raised the question of the placement of
>> this rcu_barrier() here, and the lack of a counterpart in the
>> suspend portion of the path. Keir? Or should
>> rcu_barrier_action() avoid calling process_pending_softirqs()
>> while still resuming, and instead call __do_softirq() with all but
>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>> or alternatively by open-coding its effect)?
>>
> Though I recall these vcpu_wake crashes happen also from other entry 
> points in enter_state but rcu_barrier, so I dont think removing that 
> helps much. Just was unable to get a proper log of them today due to 
> most of them being cut in half. Will try bit more.

In which case making __do_softirq() itself honor being in the
suspend/resume path might still be an option.

> My belief is that as long as vcpu_migrate is not called in 
> cpu_disable_scheduler, the vcpu->processor shall continue to point to 
> offline cpu. Which will crash if the vcpu_wake is called for that vcpu. 
> If vcpu_migrate is called, then vcpu_wake will still be called with some 
> frequency but since vcpu->processor shall point to online cpu, and it 
> won't crash. So likely avoiding the wakes here completely is not the 
> goal, just the offline ones.

But you neglect the fact that waking vCPU-s at this point is
unnecessary anyway (they have nowhere to run on).

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:15                 ` Jan Beulich
@ 2013-01-25 10:18                   ` Tomasz Wroblewski
  2013-01-25 10:29                     ` Jan Beulich
  2013-01-25 10:23                   ` Juergen Gross
  1 sibling, 1 reply; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-25 10:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org


> But you neglect the fact that waking vCPU-s at this point is
> unnecessary anyway (they have nowhere to run on).
>
>    
Hm, can't they still run on cpu 0? Also, I did earlier try to patch this 
more primitively by adding a check to vcpu_wake whether the cpu is in 
cpu_online_map, and whilst it stopped the crashing, I observed some 
unexpected hangs w/o any serial output - sometimes just the dom0 would 
hang, sometimes also xen would stop being responsive on serial. So I 
figured out that these missed vcpu wakes account for something still.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:15                 ` Jan Beulich
  2013-01-25 10:18                   ` Tomasz Wroblewski
@ 2013-01-25 10:23                   ` Juergen Gross
  2013-01-25 10:29                     ` Tomasz Wroblewski
  2013-01-25 10:31                     ` Jan Beulich
  1 sibling, 2 replies; 26+ messages in thread
From: Juergen Gross @ 2013-01-25 10:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Tomasz Wroblewski, Keir (Xen.org),
	xen-devel@lists.xen.org

Am 25.01.2013 11:15, schrieb Jan Beulich:
>>>> On 25.01.13 at 10:45, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>
>>> I think I had already raised the question of the placement of
>>> this rcu_barrier() here, and the lack of a counterpart in the
>>> suspend portion of the path. Keir? Or should
>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>> while still resuming, and instead call __do_softirq() with all but
>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>> or alternatively by open-coding its effect)?
>>>
>> Though I recall these vcpu_wake crashes happen also from other entry
>> points in enter_state but rcu_barrier, so I dont think removing that
>> helps much. Just was unable to get a proper log of them today due to
>> most of them being cut in half. Will try bit more.
>
> In which case making __do_softirq() itself honor being in the
> suspend/resume path might still be an option.
>
>> My belief is that as long as vcpu_migrate is not called in
>> cpu_disable_scheduler, the vcpu->processor shall continue to point to
>> offline cpu. Which will crash if the vcpu_wake is called for that vcpu.
>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>> frequency but since vcpu->processor shall point to online cpu, and it
>> won't crash. So likely avoiding the wakes here completely is not the
>> goal, just the offline ones.
>
> But you neglect the fact that waking vCPU-s at this point is
> unnecessary anyway (they have nowhere to run on).

What about adding a global scheduler_disable() in freeze_domains() and a 
scheduler_enable() in thaw_domains() which will switch scheduler locking to
a global lock (or disable it at all?). This should solve all problems without
any complex changes of current behavior.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:23                   ` Juergen Gross
@ 2013-01-25 10:29                     ` Tomasz Wroblewski
  2013-01-25 10:31                     ` Jan Beulich
  1 sibling, 0 replies; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-25 10:29 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Keir (Xen.org), Jan Beulich,
	xen-devel@lists.xen.org

On 25/01/13 11:23, Juergen Gross wrote:
> What about adding a global scheduler_disable() in freeze_domains() and a
> scheduler_enable() in thaw_domains() which will switch scheduler locking to
> a global lock (or disable it at all?). This should solve all problems without
> any complex changes of current behavior.
>
>    
Yes, I'd be very happy to test such a patch! Probably not familiar 
enough with this to be able to write the disable/enable logic myself though

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:18                   ` Tomasz Wroblewski
@ 2013-01-25 10:29                     ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2013-01-25 10:29 UTC (permalink / raw)
  To: Tomasz Wroblewski
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

>>> On 25.01.13 at 11:18, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:

>> But you neglect the fact that waking vCPU-s at this point is
>> unnecessary anyway (they have nowhere to run on).
>>    
> Hm, can't they still run on cpu 0?

They could, but there's no point in even trying to before having
thawed domains.

> Also, I did earlier try to patch this 
> more primitively by adding a check to vcpu_wake whether the cpu is in 
> cpu_online_map, and whilst it stopped the crashing, I observed some 
> unexpected hangs w/o any serial output - sometimes just the dom0 would 
> hang, sometimes also xen would stop being responsive on serial. So I 
> figured out that these missed vcpu wakes account for something still.

Oh, yes, of course, keeping vcpu_wake() from doing its job
like this is surely calling for other problems.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:23                   ` Juergen Gross
  2013-01-25 10:29                     ` Tomasz Wroblewski
@ 2013-01-25 10:31                     ` Jan Beulich
  2013-01-25 10:35                       ` Juergen Gross
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2013-01-25 10:31 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Tomasz Wroblewski, Keir (Xen.org),
	xen-devel@lists.xen.org

>>> On 25.01.13 at 11:23, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> Am 25.01.2013 11:15, schrieb Jan Beulich:
>>>>> On 25.01.13 at 10:45, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>
>>>> I think I had already raised the question of the placement of
>>>> this rcu_barrier() here, and the lack of a counterpart in the
>>>> suspend portion of the path. Keir? Or should
>>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>>> while still resuming, and instead call __do_softirq() with all but
>>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>>> or alternatively by open-coding its effect)?
>>>>
>>> Though I recall these vcpu_wake crashes happen also from other entry
>>> points in enter_state but rcu_barrier, so I dont think removing that
>>> helps much. Just was unable to get a proper log of them today due to
>>> most of them being cut in half. Will try bit more.
>>
>> In which case making __do_softirq() itself honor being in the
>> suspend/resume path might still be an option.
>>
>>> My belief is that as long as vcpu_migrate is not called in
>>> cpu_disable_scheduler, the vcpu->processor shall continue to point to
>>> offline cpu. Which will crash if the vcpu_wake is called for that vcpu.
>>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>>> frequency but since vcpu->processor shall point to online cpu, and it
>>> won't crash. So likely avoiding the wakes here completely is not the
>>> goal, just the offline ones.
>>
>> But you neglect the fact that waking vCPU-s at this point is
>> unnecessary anyway (they have nowhere to run on).
> 
> What about adding a global scheduler_disable() in freeze_domains() and a 
> scheduler_enable() in thaw_domains() which will switch scheduler locking to
> a global lock (or disable it at all?). This should solve all problems 
> without
> any complex changes of current behavior.

I don't see how this would address the so far described
shortcomings.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:31                     ` Jan Beulich
@ 2013-01-25 10:35                       ` Juergen Gross
  2013-01-25 10:40                         ` Jan Beulich
  2013-01-25 11:56                         ` Tomasz Wroblewski
  0 siblings, 2 replies; 26+ messages in thread
From: Juergen Gross @ 2013-01-25 10:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Tomasz Wroblewski, Keir (Xen.org),
	xen-devel@lists.xen.org

Am 25.01.2013 11:31, schrieb Jan Beulich:
>>>> On 25.01.13 at 11:23, Juergen Gross<juergen.gross@ts.fujitsu.com>  wrote:
>> Am 25.01.2013 11:15, schrieb Jan Beulich:
>>>>>> On 25.01.13 at 10:45, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>
>>>>> I think I had already raised the question of the placement of
>>>>> this rcu_barrier() here, and the lack of a counterpart in the
>>>>> suspend portion of the path. Keir? Or should
>>>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>>>> while still resuming, and instead call __do_softirq() with all but
>>>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>>>> or alternatively by open-coding its effect)?
>>>>>
>>>> Though I recall these vcpu_wake crashes happen also from other entry
>>>> points in enter_state but rcu_barrier, so I dont think removing that
>>>> helps much. Just was unable to get a proper log of them today due to
>>>> most of them being cut in half. Will try bit more.
>>>
>>> In which case making __do_softirq() itself honor being in the
>>> suspend/resume path might still be an option.
>>>
>>>> My belief is that as long as vcpu_migrate is not called in
>>>> cpu_disable_scheduler, the vcpu->processor shall continue to point to
>>>> offline cpu. Which will crash if the vcpu_wake is called for that vcpu.
>>>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>>>> frequency but since vcpu->processor shall point to online cpu, and it
>>>> won't crash. So likely avoiding the wakes here completely is not the
>>>> goal, just the offline ones.
>>>
>>> But you neglect the fact that waking vCPU-s at this point is
>>> unnecessary anyway (they have nowhere to run on).
>>
>> What about adding a global scheduler_disable() in freeze_domains() and a
>> scheduler_enable() in thaw_domains() which will switch scheduler locking to
>> a global lock (or disable it at all?). This should solve all problems
>> without
>> any complex changes of current behavior.
>
> I don't see how this would address the so far described
> shortcomings.

The crash happens due to an access to the scheduler percpu area which isn't
allocated at the moment. The accessed element is the address of the scheduler
lock for this cpu. Disabling the percpu locking scheme of the scheduler while
the non-boot cpus are offline will avoid the crash.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:35                       ` Juergen Gross
@ 2013-01-25 10:40                         ` Jan Beulich
  2013-01-25 11:05                           ` Juergen Gross
  2013-01-25 11:56                         ` Tomasz Wroblewski
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2013-01-25 10:40 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Tomasz Wroblewski, Keir (Xen.org),
	xen-devel@lists.xen.org

>>> On 25.01.13 at 11:35, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> Am 25.01.2013 11:31, schrieb Jan Beulich:
>>>>> On 25.01.13 at 11:23, Juergen Gross<juergen.gross@ts.fujitsu.com>  wrote:
>>> Am 25.01.2013 11:15, schrieb Jan Beulich:
>>>>>>> On 25.01.13 at 10:45, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>>
>>>>>> I think I had already raised the question of the placement of
>>>>>> this rcu_barrier() here, and the lack of a counterpart in the
>>>>>> suspend portion of the path. Keir? Or should
>>>>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>>>>> while still resuming, and instead call __do_softirq() with all but
>>>>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>>>>> or alternatively by open-coding its effect)?
>>>>>>
>>>>> Though I recall these vcpu_wake crashes happen also from other entry
>>>>> points in enter_state but rcu_barrier, so I dont think removing that
>>>>> helps much. Just was unable to get a proper log of them today due to
>>>>> most of them being cut in half. Will try bit more.
>>>>
>>>> In which case making __do_softirq() itself honor being in the
>>>> suspend/resume path might still be an option.
>>>>
>>>>> My belief is that as long as vcpu_migrate is not called in
>>>>> cpu_disable_scheduler, the vcpu->processor shall continue to point to
>>>>> offline cpu. Which will crash if the vcpu_wake is called for that vcpu.
>>>>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>>>>> frequency but since vcpu->processor shall point to online cpu, and it
>>>>> won't crash. So likely avoiding the wakes here completely is not the
>>>>> goal, just the offline ones.
>>>>
>>>> But you neglect the fact that waking vCPU-s at this point is
>>>> unnecessary anyway (they have nowhere to run on).
>>>
>>> What about adding a global scheduler_disable() in freeze_domains() and a
>>> scheduler_enable() in thaw_domains() which will switch scheduler locking to
>>> a global lock (or disable it at all?). This should solve all problems
>>> without
>>> any complex changes of current behavior.
>>
>> I don't see how this would address the so far described
>> shortcomings.
> 
> The crash happens due to an access to the scheduler percpu area which isn't
> allocated at the moment. The accessed element is the address of the 
> scheduler
> lock for this cpu. Disabling the percpu locking scheme of the scheduler 
> while
> the non-boot cpus are offline will avoid the crash.

Ah, okay. But that wouldn't prevent other bad effects that could
result from vCPU-s pointing to offline pCPU-s. Hence I think such
a solution, even if sufficient for now, would set us up for future
similar (and similarly hard to debug) issues.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:40                         ` Jan Beulich
@ 2013-01-25 11:05                           ` Juergen Gross
  0 siblings, 0 replies; 26+ messages in thread
From: Juergen Gross @ 2013-01-25 11:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Tomasz Wroblewski, Keir (Xen.org),
	xen-devel@lists.xen.org

Am 25.01.2013 11:40, schrieb Jan Beulich:
>>>> On 25.01.13 at 11:35, Juergen Gross<juergen.gross@ts.fujitsu.com>  wrote:
>> Am 25.01.2013 11:31, schrieb Jan Beulich:
>>>>>> On 25.01.13 at 11:23, Juergen Gross<juergen.gross@ts.fujitsu.com>   wrote:
>>>> Am 25.01.2013 11:15, schrieb Jan Beulich:
>>>>>>>> On 25.01.13 at 10:45, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>    wrote:
>>>>>
>>>>>>> I think I had already raised the question of the placement of
>>>>>>> this rcu_barrier() here, and the lack of a counterpart in the
>>>>>>> suspend portion of the path. Keir? Or should
>>>>>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>>>>>> while still resuming, and instead call __do_softirq() with all but
>>>>>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>>>>>> or alternatively by open-coding its effect)?
>>>>>>>
>>>>>> Though I recall these vcpu_wake crashes happen also from other entry
>>>>>> points in enter_state but rcu_barrier, so I dont think removing that
>>>>>> helps much. Just was unable to get a proper log of them today due to
>>>>>> most of them being cut in half. Will try bit more.
>>>>>
>>>>> In which case making __do_softirq() itself honor being in the
>>>>> suspend/resume path might still be an option.
>>>>>
>>>>>> My belief is that as long as vcpu_migrate is not called in
>>>>>> cpu_disable_scheduler, the vcpu->processor shall continue to point to
>>>>>> offline cpu. Which will crash if the vcpu_wake is called for that vcpu.
>>>>>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>>>>>> frequency but since vcpu->processor shall point to online cpu, and it
>>>>>> won't crash. So likely avoiding the wakes here completely is not the
>>>>>> goal, just the offline ones.
>>>>>
>>>>> But you neglect the fact that waking vCPU-s at this point is
>>>>> unnecessary anyway (they have nowhere to run on).
>>>>
>>>> What about adding a global scheduler_disable() in freeze_domains() and a
>>>> scheduler_enable() in thaw_domains() which will switch scheduler locking to
>>>> a global lock (or disable it at all?). This should solve all problems
>>>> without
>>>> any complex changes of current behavior.
>>>
>>> I don't see how this would address the so far described
>>> shortcomings.
>>
>> The crash happens due to an access to the scheduler percpu area which isn't
>> allocated at the moment. The accessed element is the address of the
>> scheduler
>> lock for this cpu. Disabling the percpu locking scheme of the scheduler
>> while
>> the non-boot cpus are offline will avoid the crash.
>
> Ah, okay. But that wouldn't prevent other bad effects that could
> result from vCPU-s pointing to offline pCPU-s. Hence I think such
> a solution, even if sufficient for now, would set us up for future
> similar (and similarly hard to debug) issues.

To avoid this problem you would have to change the suspend logic, I think.

We could take the cpus just physically offline without removing all the
state information from the system. During resume the old state and percpu areas
would just be reused.

While this would be a clean solution, it's not a really small task to do...


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 10:35                       ` Juergen Gross
  2013-01-25 10:40                         ` Jan Beulich
@ 2013-01-25 11:56                         ` Tomasz Wroblewski
  2013-01-25 12:27                           ` Jan Beulich
  1 sibling, 1 reply; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-25 11:56 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, Keir (Xen.org), Jan Beulich,
	xen-devel@lists.xen.org


> The crash happens due to an access to the scheduler percpu area which isn't
> allocated at the moment. The accessed element is the address of the scheduler
> lock for this cpu. Disabling the percpu locking scheme of the scheduler while
> the non-boot cpus are offline will avoid the crash.
>
>    
Ok, so I tried this approach (by turning the locking in vcpu_wake to be 
conditional based on system_state), and whilst it stopped vcpu_wake 
crash I traded it for a crash in acpi_cpufreq_target:

(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e008:[<ffff82c4801a0594>] acpi_cpufreq_target+0x165/0x33b
(XEN) RFLAGS: 0000000000010293   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff830137bc7300   rcx: 0000000000000000
(XEN) rdx: 0000000000000009   rsi: ffff82c480265460   rdi: ffff830137bd7d60
(XEN) rbp: ffff830137bd7db0   rsp: ffff830137bd7d30   r8:  0000000000000004
(XEN) r9:  00000000fffffffe   r10: 0000000000000009   r11: 0000000000000000
(XEN) r12: ffff830137bc7c70   r13: ffff8301025444f8   r14: ffff830137bc7c70
(XEN) r15: 0000000001b5b14c   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000000ba674000   cr2: 000000000000004c
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff830137bd7d30:
(XEN)    000000008017d626 0000000000000009 00000009000000fb ffff830100000001
(XEN)    ffff830137bd7d60 0000080000000199 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffffffff37bd7da0 00000000ffffffea
(XEN)    ffff830137bc7c70 00000000002936c8 0000000006d9e30a 0000000001b5b14c
(XEN)    ffff830137bd7df0 ffff82c4801414ee ffff830137bc7c70 0000000000000003
(XEN)    ffff830137bd7df0 0000000000000008 0000000000000008 ffff830102ae1340
(XEN)    ffff830137bd7e50 ffff82c480140815 ffff8301141624a0 002936c800000286
(XEN)    ffff82c480308dc0 ffff830137bc7c70 0000000000000003 ffff830102ae1380
(XEN)    ffff830137bebb50 ffff830137bebc00 0000000000000010 0000000000000030
(XEN)    ffff830137bd7e70 ffff82c480140a2b ffff830137bd7e70 0000001548c205b8
(XEN)    ffff830137bd7ef0 ffff82c4801a31da 0000000000000002(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

(call graph sadly got eaten)

which corresponds to the following lines in cpufreq.c

     freqs.old = perf->states[perf->state].core_frequency * 1000;
     freqs.new = data->freq_table[next_state].frequency;
ffff82c4801a058d:       8b 55 94                mov    -0x6c(%rbp),%edx
ffff82c4801a0590:       48 8b 43 08             mov    0x8(%rbx),%rax
ffff82c4801a0594:       8b 44 d0 04             mov    0x4(%rax,%rdx,8),%eax
ffff82c4801a0598:       89 45 8c                mov    %eax,-0x74(%rbp)
ffff82c4801a059b:       48 c7 c0 00 80 ff ff    mov    $0xffffffffffff8000,%rax
ffff82c4801a05a2:       48 21 e0                and    %rsp,%rax

which I guess crashes because either freq_table or data is freed at this point (indeed seems that cpufreq driver has some cpu up/down logic which frees it). Given this is not even first place in acpi_freq_target this is accessed, it looks like the cpu got torn down halfway thru this function... Suspect there are likely to be more sites affected by this.

I also tried Jan's suggestion of making do_softirq skip its job if we are resuming, that causes a hang in rcu_barrier(), adding another resume conditional rcu_barrier() made it progress further but crash elsewhere (don't remember where exactly, this approach looked a bit like dead end so i abandoned it quickly)

So still not having a better solution than the revert of the cpu_disable_schedule() hunk.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 11:56                         ` Tomasz Wroblewski
@ 2013-01-25 12:27                           ` Jan Beulich
  2013-01-25 13:58                             ` Tomasz Wroblewski
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2013-01-25 12:27 UTC (permalink / raw)
  To: Tomasz Wroblewski
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

>>> On 25.01.13 at 12:56, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> I also tried Jan's suggestion of making do_softirq skip its job if we are 
> resuming, that causes a hang in rcu_barrier(), adding another resume 
> conditional rcu_barrier() made it progress further but crash elsewhere (don't 
> remember where exactly, this approach looked a bit like dead end so i 
> abandoned it quickly)

I didn't say exactly that - for avoiding the hang, you still need to
service the RCU softirq.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] Fix scheduler crash after s3 resume
  2013-01-25 12:27                           ` Jan Beulich
@ 2013-01-25 13:58                             ` Tomasz Wroblewski
  0 siblings, 0 replies; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-25 13:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Juergen Gross, Keir (Xen.org),
	xen-devel@lists.xen.org

On 25/01/13 13:27, Jan Beulich wrote:
>>>> On 25.01.13 at 12:56, Tomasz Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>>>          
>> I also tried Jan's suggestion of making do_softirq skip its job if we are
>> resuming, that causes a hang in rcu_barrier(), adding another resume
>> conditional rcu_barrier() made it progress further but crash elsewhere (don't
>> remember where exactly, this approach looked a bit like dead end so i
>> abandoned it quickly)
>>      
> I didn't say exactly that - for avoiding the hang, you still need to
> service the RCU softirq.
>    
Ah sorry. Ok this approach seems to work for avoiding vcpu_wake crash, 
although I am still getting this occasional crash in acpi_freq_target, 
same as when I tried Jurgen's suggestion. Not getting call stack 
captured for that so not sure where's that called from, but will try to 
pursue it abit.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-01-25 13:58 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-23 15:51 [PATCH] Fix scheduler crash after s3 resume Tomasz Wroblewski
2013-01-23 16:11 ` Jan Beulich
2013-01-23 16:57   ` Tomasz Wroblewski
2013-01-23 17:01     ` Tomasz Wroblewski
2013-01-23 17:50     ` Tomasz Wroblewski
2013-01-24  6:18 ` Juergen Gross
2013-01-24 14:26   ` [PATCH v2] " Tomasz Wroblewski
2013-01-24 15:36     ` Jan Beulich
2013-01-24 15:57       ` George Dunlap
2013-01-24 16:25       ` Tomasz Wroblewski
2013-01-24 16:56         ` Jan Beulich
2013-01-25  9:07           ` Tomasz Wroblewski
2013-01-25  9:36             ` Jan Beulich
2013-01-25  9:45               ` Tomasz Wroblewski
2013-01-25 10:15                 ` Jan Beulich
2013-01-25 10:18                   ` Tomasz Wroblewski
2013-01-25 10:29                     ` Jan Beulich
2013-01-25 10:23                   ` Juergen Gross
2013-01-25 10:29                     ` Tomasz Wroblewski
2013-01-25 10:31                     ` Jan Beulich
2013-01-25 10:35                       ` Juergen Gross
2013-01-25 10:40                         ` Jan Beulich
2013-01-25 11:05                           ` Juergen Gross
2013-01-25 11:56                         ` Tomasz Wroblewski
2013-01-25 12:27                           ` Jan Beulich
2013-01-25 13:58                             ` Tomasz Wroblewski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).