From: Simon Gaiser <simon@invisiblethingslab.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
Juergen Gross <jgross@suse.com>,
xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH] x86/XPTI: fix S3 resume (and CPU offlining in general)
Date: Thu, 24 May 2018 16:12:00 +0000
Message-ID: <7323393c-9ade-3ab9-3d6e-a38be2f68ed0@invisiblethingslab.com>
In-Reply-To: <5B06DE6602000078001C5ACC@prv1-mh.provo.novell.com>
Jan Beulich:
>>>> On 24.05.18 at 17:10, <simon@invisiblethingslab.com> wrote:
>> Jan Beulich:
>>>>>> On 24.05.18 at 16:14, <simon@invisiblethingslab.com> wrote:
>>>> Jan Beulich:
>>>>>>>> On 24.05.18 at 16:00, <simon@invisiblethingslab.com> wrote:
>>>>>> Jan Beulich:
>>>>>>> In commit d1d6fc97d6 ("x86/xpti: really hide almost all of Xen image")
>>>>>>> I've failed to remember the fact that multiple CPUs share a stub
>>>>>>> mapping page. Therefore it is wrong to unconditionally zap the mapping
>>>>>>> when bringing down a CPU; it may only be unmapped when no other online
>>>>>>> CPU uses that same page.
>>>>>>>
>>>>>>> Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
>>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>>>>
>>>>>>> --- a/xen/arch/x86/smpboot.c
>>>>>>> +++ b/xen/arch/x86/smpboot.c
>>>>>>> @@ -876,7 +876,21 @@ static void cleanup_cpu_root_pgt(unsigne
>>>>>>>
>>>>>>> free_xen_pagetable(rpt);
>>>>>>>
>>>>>>> - /* Also zap the stub mapping for this CPU. */
>>>>>>> + /*
>>>>>>> + * Also zap the stub mapping for this CPU, if no other online one uses
>>>>>>> + * the same page.
>>>>>>> + */
>>>>>>> + if ( stub_linear )
>>>>>>> + {
>>>>>>> + unsigned int other;
>>>>>>> +
>>>>>>> + for_each_online_cpu(other)
>>>>>>> + if ( !((per_cpu(stubs.addr, other) ^ stub_linear) >> PAGE_SHIFT) )
>>>>>>> + {
>>>>>>> + stub_linear = 0;
>>>>>>> + break;
>>>>>>> + }
>>>>>>> + }
>>>>>>> if ( stub_linear )
>>>>>>> {
>>>>>>> l3_pgentry_t *l3t = l4e_to_l3e(common_pgt);
>>>>>>
>>>>>> Tried this on top of staging (fc5805daef) and I still get the same
>>>>>> double fault.
>>>>>
>>>>> Hmm, it worked for me offlining (and later re-onlining) several pCPU-s. What
>>>>> size a system are you testing on? Mine has got only 12 CPUs, i.e. all stubs
>>>>> are in the same page (and I'd never unmap anything here at all).
>>>>
>>>> 4 cores + HT, so 8 CPUs from Xen's PoV.
>>>
>>> May I ask you to do two things:
>>> 1) confirm that you can offline CPUs successfully using xen-hptool,
>>> 2) add a printk() to the code above making clear whether/when any
>>> of the mappings actually get zapped?
>>
>> There seem to be two failure modes now. It seems that both can be
>> triggered either by offlining a cpu or by suspend. Using cpu offlining
>> below since during suspend I often lose part of the serial output.
>>
>> Failure mode 1, the double fault as before:
>>
>> root@localhost:~# xen-hptool cpu-offline 3
>> Prepare to offline CPU 3
>> (XEN) Broke affinity for irq 9
>> (XEN) Broke affinity for irq 29
>> (XEN) dbg: stub_linear't1 = 18446606431818858880
>> (XEN) dbg: first stub_linear if
>> (XEN) dbg: stub_linear't2 = 18446606431818858880
>> (XEN) dbg: second stub_linear if
>> CPU 3 offlined successfully
>> root@localhost:~# (XEN) *** DOUBLE FAULT ***
>> (XEN) ----[ Xen-4.11-rc x86_64 debug=y Not tainted ]----
>> (XEN) CPU: 0
>> (XEN) RIP: e008:[<ffff82d08037b964>] handle_exception+0x9c/0xff
>> (XEN) RFLAGS: 0000000000010006 CONTEXT: hypervisor
>> (XEN) rax: ffffc90040cdc0a8 rbx: 0000000000000000 rcx: 0000000000000006
>> (XEN) rdx: 0000000000000000 rsi: 0000000000000000 rdi: 0000000000000000
>> (XEN) rbp: 000036ffbf323f37 rsp: ffffc90040cdc000 r8: 0000000000000000
>> (XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
>> (XEN) r12: 0000000000000000 r13: 0000000000000000 r14: ffffc90040cdffff
>> (XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 0000000000042660
>> (XEN) cr3: 0000000128109000 cr2: ffffc90040cdbff8
>> (XEN) fsb: 00007fc01c3c6dc0 gsb: ffff88021e700000 gss: 0000000000000000
>> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
>> (XEN) Xen code around <ffff82d08037b964> (handle_exception+0x9c/0xff):
>> (XEN) 00 f3 90 0f ae e8 eb f9 <e8> 07 00 00 00 f3 90 0f ae e8 eb f9 83 e9 01 75
>> (XEN) Current stack base ffffc90040cd8000 differs from expected ffff8300cec88000
>> (XEN) Valid stack range: ffffc90040cde000-ffffc90040ce0000, sp=ffffc90040cdc000, tss.rsp0=ffff8300cec8ffa0
>> (XEN) No stack overflow detected. Skipping stack trace.
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) DOUBLE FAULT -- system shutdown
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>
> Oh, so CPU 0 gets screwed by offlining CPU 3. How about this alternative
> (but so far untested) patch:
>
> --- unstable.orig/xen/arch/x86/smpboot.c
> +++ unstable/xen/arch/x86/smpboot.c
> @@ -874,7 +874,7 @@ static void cleanup_cpu_root_pgt(unsigne
> l2_pgentry_t *l2t = l3e_to_l2e(l3t[l3_table_offset(stub_linear)]);
> l1_pgentry_t *l1t = l2e_to_l1e(l2t[l2_table_offset(stub_linear)]);
>
> - l1t[l2_table_offset(stub_linear)] = l1e_empty();
> + l1t[l1_table_offset(stub_linear)] = l1e_empty();
> }
> }
>
Yes, this fixes cpu on-/offlining and suspend for me on staging.
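
For anyone following along, here is a small standalone sketch of why the one-character change matters. It is not Xen code: the constants assume ordinary x86-64 4-level paging (4 KiB pages, 512 entries per table), and l1_index(), l2_index(), same_page() and show_stub_indices() are hypothetical stand-ins I made up for illustration. The idea is only that an L1 table has to be indexed with address bits 12..20, while the pre-fix line used bits 21..29, so it could clear a slot belonging to a different page. same_page() mirrors the XOR-and-shift test from the first patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed x86-64 4-level paging layout: 4 KiB pages, 512 entries per table. */
#define PAGE_SHIFT   12
#define TABLE_ORDER  9
#define TABLE_MASK   ((1u << TABLE_ORDER) - 1)

/* Hypothetical stand-in for l1_table_offset(): address bits 12..20. */
static unsigned int l1_index(uint64_t va)
{
    return (unsigned int)(va >> PAGE_SHIFT) & TABLE_MASK;
}

/* Hypothetical stand-in for l2_table_offset(): address bits 21..29. */
static unsigned int l2_index(uint64_t va)
{
    return (unsigned int)(va >> (PAGE_SHIFT + TABLE_ORDER)) & TABLE_MASK;
}

/* Two addresses lie in the same page iff they differ only below PAGE_SHIFT. */
static int same_page(uint64_t a, uint64_t b)
{
    return !((a ^ b) >> PAGE_SHIFT);
}

/* Print which L1 slot should be cleared and which one the L2 index selects. */
static void show_stub_indices(uint64_t stub_linear)
{
    printf("stub_linear          = %#llx\n", (unsigned long long)stub_linear);
    printf("L1 index (intended)  = %u\n", l1_index(stub_linear));
    printf("L2 index (pre-fix)   = %u\n", l2_index(stub_linear));
}
```

Calling show_stub_indices(18446606431818858880ULL) from a main(), i.e. with the stub_linear value from the debug output above, gives L1 index 508 and L2 index 511 under these assumptions. So the pre-fix line would have cleared slot 511 of the L1 table instead of slot 508, which would fit with a CPU other than the one being offlined losing its stub mapping.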