xen-devel.lists.xenproject.org archive mirror
* [PATCH] Fix scheduler crash after s3 resume
@ 2013-01-23 15:51 Tomasz Wroblewski
  2013-01-23 16:11 ` Jan Beulich
  2013-01-24  6:18 ` Juergen Gross
  0 siblings, 2 replies; 26+ messages in thread
From: Tomasz Wroblewski @ 2013-01-23 15:51 UTC
  To: xen-devel@lists.xen.org; +Cc: george.dunlap, keir, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 4706 bytes --]

Hi all,

This problem was also discussed earlier, for example here:
http://xen.markmail.org/thread/iqvkylp3mclmsnbw

Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a
global variable which, among other things, is used on suspend to avoid
disabling the cpu scheduler, breaking vcpu affinities, and removing the
cpu from its cpupool. However, it missed one place where the cpu is
removed from the cpupool's valid cpus mask: __cpu_disable() in
smpboot.c, line 840:

     cpumask_clear_cpu(cpu, cpupool0->cpu_valid);

This causes the vcpu in the default pool to be considered inactive, and
the following assertion in sched_credit.c is violated soon after resume
transitions out of Xen, causing a platform reboot:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' 
failed at sched_credit.c:507
(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
(XEN) rax: 0000000000000001   rbx: 0000000000000008   rcx: 0000000000000008
(XEN) rdx: 00000000000000ff   rsi: 0000000000000008   rdi: 0000000000000000
(XEN) rbp: ffff83011415fdd8   rsp: ffff83011415fcf8   r8:  0000000000000000
(XEN) r9:  000000000000003e   r10: 00000008f3de731f   r11: ffffea0000063800
(XEN) r12: ffff82c480261720   r13: ffff830137b4d950   r14: ffff830137beb010
(XEN) r15: ffff82c480261720   cr0: 0000000080050033   cr4: 00000000000026f0
(XEN) cr3: 000000013c17d000   cr2: ffff8800ac6ef8f0
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff83011415fcf8:
(XEN)    00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
(XEN)    0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
(XEN)    ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
(XEN)    ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
(XEN)    0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
(XEN)    ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
(XEN)    000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
(XEN)    e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
(XEN)    00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
(XEN)    0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
(XEN)    ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
(XEN)    80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
(XEN)    ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
(XEN)    ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
(XEN)    ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
(XEN)    00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
(XEN)    ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
(XEN)    00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
(XEN) Xen call trace:
(XEN)    [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN)    [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
(XEN)    [<ffff82c48012748f>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
(XEN)    [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
(XEN)    [<ffff82c480124bb4>] do_softirq+0x13/0x15
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' 
failed at sched_credit.c:507
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

The assertion fails because the "cpus" cpumask is empty: it is the
logical "and" of the cpupool's valid cpus (from which the cpu was
removed) and the cpu affinity mask.
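
To make the failure mode concrete, here is a minimal standalone sketch
of the check, assuming a toy 64-bit bitmask in place of Xen's cpumask_t.
The names toy_cpumask_t and pick_assertion_holds are hypothetical and
exist only for this illustration; this is not the code in
sched_credit.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy cpumask (an assumption, not Xen's cpumask_t): bit n set means
 * CPU n is in the mask. Only CPUs 0..63 can be represented. */
typedef uint64_t toy_cpumask_t;

/* Model of the failing check: "cpus" is the AND of the pool's valid
 * mask and the affinity mask, and the assertion requires that the
 * result is non-empty and contains the given cpu. */
static bool pick_assertion_holds(toy_cpumask_t pool_valid,
                                 toy_cpumask_t affinity,
                                 unsigned int cpu)
{
    toy_cpumask_t cpus = pool_valid & affinity;

    /* Mirrors '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)'. */
    return cpus != 0 && ((cpus >> cpu) & 1u) != 0;
}
```

With a pool valid mask of 0x3 and an affinity of 0x2, the check holds
for cpu 1. Once cpu 1 has been cleared from the valid mask (0x1), the
"and" is empty and the check fails, which matches the panic above.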

The attached patch follows the spirit of changeset 25079:d5ccb2d1dbd1
(which blocked removal of the cpu from the cpupool in cpupool.c) by also
blocking its removal from the cpupool's valid cpumask. So cpu
affinities are still preserved across suspend/resume, and the scheduler
does not need to be disabled, as per the original intent (I think). I
would welcome comments.
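
The intended behaviour can be sketched in isolation as follows. This is
a simplified model under stated assumptions, not the hypervisor code:
struct toy_host, enum toy_system_state and toy_cpu_disable are
hypothetical stand-ins for the real system_state variable, cpupool0 and
__cpu_disable(), and masks are plain 64-bit integers.

```c
#include <stdint.h>

/* Toy cpumask (an assumption): bit n set means CPU n is in the mask. */
typedef uint64_t toy_cpumask_t;

/* Hypothetical stand-in for the global system_state variable. */
enum toy_system_state { TOY_STATE_active, TOY_STATE_suspend };

/* Hypothetical container for the state the patch touches. */
struct toy_host {
    enum toy_system_state state;
    toy_cpumask_t pool_valid;   /* models cpupool0->cpu_valid */
    toy_cpumask_t online;       /* models cpu_online_map */
};

/* Take a CPU offline. The pool's valid mask is left untouched while
 * suspending, so it is still correct when the CPU comes back on
 * resume; the online map is always updated. */
static void toy_cpu_disable(struct toy_host *h, unsigned int cpu)
{
    toy_cpumask_t bit = (toy_cpumask_t)1 << cpu;

    if (h->state != TOY_STATE_suspend)
        h->pool_valid &= ~bit;
    h->online &= ~bit;
}
```

Calling toy_cpu_disable() for cpu 1 in the suspend state leaves
pool_valid unchanged and only clears the online bit, whereas the same
call in the active state clears both.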

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Commit message:
Fix s3 resume regression (crash in scheduler) after c-s
25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the
cpupool's cpu_valid mask, in the spirit of the mentioned c-s.


[-- Attachment #2: fix-suspend-cpu-valid-mask --]
[-- Type: text/plain, Size: 501 bytes --]

diff -r 4b476378fc35 xen/arch/x86/smpboot.c
--- a/xen/arch/x86/smpboot.c	Mon Jan 21 17:03:10 2013 +0000
+++ b/xen/arch/x86/smpboot.c	Wed Jan 23 15:25:28 2013 +0000
@@ -837,7 +837,8 @@
     remove_siblinginfo(cpu);
 
     /* It's now safe to remove this processor from the online map */
-    cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
+    if (system_state != SYS_STATE_suspend)
+        cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
     cpumask_clear_cpu(cpu, &cpu_online_map);
     fixup_irqs();
 



Thread overview: 26+ messages
2013-01-23 15:51 [PATCH] Fix scheduler crash after s3 resume Tomasz Wroblewski
2013-01-23 16:11 ` Jan Beulich
2013-01-23 16:57   ` Tomasz Wroblewski
2013-01-23 17:01     ` Tomasz Wroblewski
2013-01-23 17:50     ` Tomasz Wroblewski
2013-01-24  6:18 ` Juergen Gross
2013-01-24 14:26   ` [PATCH v2] " Tomasz Wroblewski
2013-01-24 15:36     ` Jan Beulich
2013-01-24 15:57       ` George Dunlap
2013-01-24 16:25       ` Tomasz Wroblewski
2013-01-24 16:56         ` Jan Beulich
2013-01-25  9:07           ` Tomasz Wroblewski
2013-01-25  9:36             ` Jan Beulich
2013-01-25  9:45               ` Tomasz Wroblewski
2013-01-25 10:15                 ` Jan Beulich
2013-01-25 10:18                   ` Tomasz Wroblewski
2013-01-25 10:29                     ` Jan Beulich
2013-01-25 10:23                   ` Juergen Gross
2013-01-25 10:29                     ` Tomasz Wroblewski
2013-01-25 10:31                     ` Jan Beulich
2013-01-25 10:35                       ` Juergen Gross
2013-01-25 10:40                         ` Jan Beulich
2013-01-25 11:05                           ` Juergen Gross
2013-01-25 11:56                         ` Tomasz Wroblewski
2013-01-25 12:27                           ` Jan Beulich
2013-01-25 13:58                             ` Tomasz Wroblewski
