From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: "Jürgen Groß" <jgross@suse.com>
Cc: "Andrew Cooper" <andrew.cooper3@citrix.com>,
"Michał Kowalczyk" <mkow@invisiblethingslab.com>,
xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [Xen-devel] Xen crash on S3 resume on 4.13 and unstable if any CPU is re-offlined
Date: Sun, 5 Jan 2020 10:02:06 +0100
Message-ID: <20200105090206.GG1314@mail-itl>
In-Reply-To: <fe785b74-5e54-26e6-ffc6-6bc2741b35ee@suse.com>
On Sun, Jan 05, 2020 at 09:25:42AM +0100, Jürgen Groß wrote:
> On 05.01.20 08:39, Marek Marczykowski-Górecki wrote:
> > On Sun, Jan 05, 2020 at 12:42:30AM +0000, Andrew Cooper wrote:
> > > On 04/01/2020 15:30, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > >
> > > > I have a reliable crash on resume from S3. I can reproduce it on both
> > > > real hardware and nested within KVM, although call traces are different
> > > > between those platforms. In any case, it happens only if some CPU is to
> > > > be re-offlined after resume (smt=off and/or maxcpus=... options).
> > > >
> > > > I think the crash from the real hardware gives more clues, but the one
> > > > from qemu may also be interesting, maybe it's even another bug?
> > > >
> > > > The crash message (full console log attached):
...
> > > > (XEN) Xen call trace:
> > > > (XEN) [<ffff82d08023beb7>] R schedule.c#cpu_schedule_callback+0xea/0x1a1
> > > > (XEN) [<ffff82d080221289>] F notifier_call_chain+0x6b/0x96
> > > > (XEN) [<ffff82d080203476>] F cpu.c#cpu_notifier_call_chain+0x1b/0x33
> > > > (XEN) [<ffff82d080203550>] F cpu_down+0x5e/0x15c
> > > > (XEN) [<ffff82d080203999>] F enable_nonboot_cpus+0x113/0x1fb
> > > > (XEN) [<ffff82d0802e4240>] F power.c#enter_state_helper+0x107/0x51b
> > > > (XEN) [<ffff82d08020828f>] F domain.c#continue_hypercall_tasklet_handler+0x8b/0xb7
> > > > (XEN) [<ffff82d08023fd39>] F tasklet.c#do_tasklet_work+0x76/0xa9
> > > > (XEN) [<ffff82d08024001a>] F do_tasklet+0x58/0x8a
> > > > (XEN) [<ffff82d08027247a>] F domain.c#idle_loop+0x40/0x96
...
> > > > And the one from qemu:
> > > >
> > > > (XEN) mce_intel.c:772: MCA Capability: firstbank 1, extended MCE MSR 0, SER
> > > > (XEN) Finishing wakeup from ACPI S3 state.
> > > > (XEN) Enabling non-boot CPUs ...
> > > > (XEN) Assertion 'c2rqd(ops, sched_unit_master(unit)) == svc->rqd' failed at sched_credit2.c:2137
> > > > (XEN) ----[ Xen-4.14-unstable x86_64 debug=y Not tainted ]----
> > > > (XEN) CPU: 1
> > > > (XEN) RIP: e008:[<ffff82d08022fe1a>] sched_credit2.c#csched2_unit_wake+0x174/0x176
> > > > (XEN) RFLAGS: 0000000000010097 CONTEXT: hypervisor (d0v0)
> > > > (XEN) rax: ffff83013a7313e8 rbx: ffff83013a6bdf40 rcx: 0000000000000051
> > > > (XEN) rdx: ffff83013a731160 rsi: ffff83013a7310e0 rdi: 0000000000000003
> > > > (XEN) rbp: ffff83013a6f7d98 rsp: ffff83013a6f7d78 r8: deadbeefdeadf00d
> > > > (XEN) r9: deadbeefdeadf00d r10: 0000000000000000 r11: 0000000000000000
> > > > (XEN) r12: ffff83013a6bc7e0 r13: ffff82d08043e720 r14: 0000000000000003
> > > > (XEN) r15: 00000003c5ffecac cr0: 0000000080050033 cr4: 0000000000000660
> > > > (XEN) cr3: 000000004b005000 cr2: 0000000000000000
> > > > (XEN) fsb: 00007751649f4740 gsb: ffff888134a00000 gss: 0000000000000000
> > > > (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> > > > (XEN) Xen code around <ffff82d08022fe1a> (sched_credit2.c#csched2_unit_wake+0x174/0x176):
> > > > (XEN) ef e8 1e c1 ff ff eb a7 <0f> 0b 55 48 89 e5 41 57 41 56 41 55 41 54 53 48
> > > > (XEN) Xen stack trace from rsp=ffff83013a6f7d78:
> > > > (XEN) ffff83013a6a3000 ffff83013a6bdf40 ffff83013a6bdf40 ffff83013a7313e8
> > > > (XEN) ffff83013a6f7de8 ffff82d0802391f8 0000000000000202 ffff83013a7313e8
> > > > (XEN) ffff83013a6c1018 0000000000000001 0000000000000000 0000000000000000
> > > > (XEN) ffff83013a6c1018 ffff83013a6a3000 ffff83013a6f7e58 ffff82d08020906c
> > > > (XEN) ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff82d08035d3c8
> > > > (XEN) ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff83013a6f7ef8
> > > > (XEN) 0000000000000180 ffff83013a6aa000 deadbeefdeadf00d 0000000000000003
> > > > (XEN) ffff83013a6f7ee8 ffff82d0803570c7 0000000000000001 0000000000000001
> > > > (XEN) 0000000000000000 deadbeefdeadf00d deadbeefdeadf00d ffff82d08035d3c8
> > > > (XEN) ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff82d08035d3c8
> > > > (XEN) ffff82d08035d3d4 ffff83013a6aa000 0000000000000000 0000000000000000
> > > > (XEN) 0000000000000000 0000000000000000 00007cfec59080e7 ffff82d08035d432
> > > > (XEN) 0000000000015120 0000000000000001 0000000000000000 ffff88813024a540
> > > > (XEN) 0000000000000000 0000000000000001 0000000000000246 0000000000140000
> > > > (XEN) ffff8880bf7db000 ffffea0004be4508 0000000000000018 ffffffff8100130a
> > > > (XEN) 0000000000000000 0000000000000001 0000000000000001 0000010000000000
> > > > (XEN) ffffffff8100130a 000000000000e033 0000000000000246 ffffc90000c97c98
> > > > (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000
> > > > (XEN) 0000000000000000 0000e01000000001 ffff83013a6aa000 00000030ba196000
> > > > (XEN) 0000000000000660 0000000000000000 000000013a6e2000 0000040000000000
> > > > (XEN) Xen call trace:
> > > > (XEN) [<ffff82d08022fe1a>] R sched_credit2.c#csched2_unit_wake+0x174/0x176
> > > > (XEN) [<ffff82d0802391f8>] F vcpu_wake+0xea/0x4d8
> > > > (XEN) [<ffff82d08020906c>] F do_vcpu_op+0x36f/0x687
> > > > (XEN) [<ffff82d0803570c7>] F pv_hypercall+0x28f/0x57d
> > > > (XEN) [<ffff82d08035d432>] F lstar_enter+0x112/0x120
> > > > (XEN)
> > > > (XEN)
> > > > (XEN) ****************************************
> > > > (XEN) Panic on CPU 1:
> > > > (XEN) Assertion 'c2rqd(ops, sched_unit_master(unit)) == svc->rqd' failed at sched_credit2.c:2137
> > > > (XEN) ****************************************
> > >
> > > This looks very much like the core scheduling crash found on specific
> > > machines in S5. From my analysis, it was a use-after-free on a
> > > scheduling resource.
> > >
> > > Does switching back to thread mode (as opposed to core mode) make the
> > > crash go away?
> >
> > It is the thread mode (unless default has changed).
>
> Does the attached patch fix it for you?
Yes, it helps with the issue on the real hardware, thanks! On qemu it helps only
partially - I don't get the crash with "qemu ... -smp 4 -append maxcpus=1"
anymore, but still get it with just "qemu ... -smp 4". It looks like a
different issue.
> From f53e105a9789b6d268e7fe4d05e4b989b9143338 Mon Sep 17 00:00:00 2001
> From: Juergen Gross <jgross@suse.com>
> To: xen-devel@lists.xenproject.org
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Dario Faggioli <dfaggioli@suse.com>
> Date: Sun, 5 Jan 2020 09:21:41 +0100
> Subject: [PATCH] xen/sched: fix resuming from S3 with smt=0
>
> When resuming from S3 and smt=0 or maxcpus= are specified we must not
> do anything in cpu_schedule_callback(). This is not true today for
> taking down a cpu during resume.
>
> If anything goes wrong during resume all the scheduler related error
> handling is in cpupool.c, so we can just bail out early from
> cpu_schedule_callback() when suspending or resuming.
>
> This fixes commit 0763cd2687897b55e7 ("xen/sched: don't disable
> scheduler on cpus during suspend").
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
> xen/common/schedule.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index e70cc70a65..54a07ff9e8 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -2562,6 +2562,13 @@ static int cpu_schedule_callback(
>      unsigned int cpu = (unsigned long)hcpu;
>      int rc = 0;
>
> +    /*
> +     * All scheduler related suspend/resume handling needed is done in
> +     * cpupool.c.
> +     */
> +    if ( system_state > SYS_STATE_active )
> +        return NOTIFY_DONE;
> +
>      rcu_read_lock(&sched_res_rculock);
>
>      /*
> @@ -2589,8 +2596,7 @@ static int cpu_schedule_callback(
>      switch ( action )
>      {
>      case CPU_UP_PREPARE:
> -        if ( system_state != SYS_STATE_resume )
> -            rc = cpu_schedule_up(cpu);
> +        rc = cpu_schedule_up(cpu);
>          break;
>      case CPU_DOWN_PREPARE:
>          rcu_read_lock(&domlist_read_lock);
> @@ -2598,13 +2604,10 @@ static int cpu_schedule_callback(
>          rcu_read_unlock(&domlist_read_lock);
>          break;
>      case CPU_DEAD:
> -        if ( system_state == SYS_STATE_suspend )
> -            break;
>          sched_rm_cpu(cpu);
>          break;
>      case CPU_UP_CANCELED:
> -        if ( system_state != SYS_STATE_resume )
> -            cpu_schedule_down(cpu);
> +        cpu_schedule_down(cpu);
>          break;
>      default:
>          break;
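
For readers following along, here is a minimal standalone sketch of the
control flow the patch introduces. It is not Xen code. The sketch_* names,
the counters and the reduced set of actions are hypothetical stand-ins, and
the enum ordering is an assumption taken from the patch's own comparison
(suspend and resume compare greater than active). It only illustrates how
one early guard replaces the per-case system_state checks. In the removed
lines, CPU_DEAD was skipped only while suspending, not while resuming.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the system state; assumed ordering: the
 * suspend and resume states compare greater than the active state. */
enum sketch_state {
    SKETCH_STATE_boot,
    SKETCH_STATE_active,
    SKETCH_STATE_suspend,
    SKETCH_STATE_resume,
};

/* Hypothetical subset of the CPU notifier actions touched by the patch. */
enum sketch_action {
    SKETCH_CPU_UP_PREPARE,
    SKETCH_CPU_DEAD,
    SKETCH_CPU_UP_CANCELED,
};

/* Counters make it observable which per-CPU scheduler step ran. */
struct sketch_counters {
    int up;      /* stands in for cpu_schedule_up() */
    int removed; /* stands in for sched_rm_cpu() */
    int down;    /* stands in for cpu_schedule_down() */
};

/* Returns true if scheduler work was done, false on the early bail-out. */
static bool sketch_cpu_callback(enum sketch_state state,
                                enum sketch_action action,
                                struct sketch_counters *c)
{
    assert(c);

    /* Single guard: do nothing at all while suspending or resuming. */
    if ( state > SKETCH_STATE_active )
        return false;

    switch ( action )
    {
    case SKETCH_CPU_UP_PREPARE:
        c->up++;
        break;
    case SKETCH_CPU_DEAD:
        c->removed++;
        break;
    case SKETCH_CPU_UP_CANCELED:
        c->down++;
        break;
    }

    return true;
}
```

In this sketch, calling sketch_cpu_callback() with SKETCH_STATE_resume and
SKETCH_CPU_DEAD leaves the counters untouched, which corresponds to the
re-offlining case (cpu_down() called from enable_nonboot_cpus()) in the
first call trace.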
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?