From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Jan Beulich <JBeulich@suse.com>,
xen-devel <xen-devel@lists.xenproject.org>
Cc: osstest-admin@xenproject.org
Subject: Re: [xen-unstable test] 106580: regressions - trouble: blocked/broken/fail/pass
Date: Fri, 10 Mar 2017 11:12:35 +0000
Message-ID: <35162274-6bb9-f2d8-9f1c-a454a7f7c74e@citrix.com>
In-Reply-To: <58C273D20200007800141E94@prv-mh.provo.novell.com>
On 10/03/17 08:37, Jan Beulich wrote:
>>>> On 10.03.17 at 08:20, <osstest-admin@xenproject.org> wrote:
>> flight 106580 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/106580/
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>> test-armhf-armhf-xl-arndale 3 host-install(3) broken REGR. vs. 106534
>> test-amd64-amd64-migrupgrade 10 xen-boot/dst_host fail REGR. vs. 106534
> The NMI watchdog has hit the EOI timer waiting to be able to send
> an IPI on CPU1:
>
> Mar 10 00:09:32.745677 (XEN) Xen call trace:
> Mar 10 00:09:32.745727 (XEN) [<ffff82d080134083>] _spin_lock+0x2c/0x4f
> Mar 10 00:09:32.745779 (XEN) [<ffff82d080133e34>] on_selected_cpus+0x2c/0xc6
> Mar 10 00:09:32.753699 (XEN) [<ffff82d080177101>] irq.c#irq_guest_eoi_timer_fn+0x142/0x165
> Mar 10 00:09:32.761711 (XEN) [<ffff82d080136ddc>] timer.c#execute_timer+0x47/0x62
> Mar 10 00:09:32.769683 (XEN) [<ffff82d080136ed2>] timer.c#timer_softirq_action+0xdb/0x22c
> Mar 10 00:09:32.769744 (XEN) [<ffff82d0801337e1>] softirq.c#__do_softirq+0x7f/0x8a
> Mar 10 00:09:32.777697 (XEN) [<ffff82d080133836>] do_softirq+0x13/0x15
> Mar 10 00:09:32.785792 (XEN) [<ffff82d080255081>] entry.o#process_softirqs+0x21/0x30
>
> That lock is being held by CPU2:
>
> Mar 10 00:15:25.133639 (XEN) Xen call trace:
> Mar 10 00:15:25.133655 (XEN) [<ffff82d080102389>] __bitmap_empty+0x54/0x96
> Mar 10 00:15:25.141636 (XEN) [<ffff82d080133eb5>] on_selected_cpus+0xad/0xc6
> Mar 10 00:15:25.149635 (XEN) [<ffff82d0801ca640>] powernow.c#powernow_cpufreq_cpu_init+0x20d/0x372
> Mar 10 00:15:25.157633 (XEN) [<ffff82d08014c476>] cpufreq_add_cpu+0x1d6/0x5d3
> Mar 10 00:15:25.157654 (XEN) [<ffff82d0801ca173>] cpufreq_cpu_init+0x17/0x1a
> Mar 10 00:15:25.165658 (XEN) [<ffff82d08014cd8d>] set_px_pminfo+0x2b6/0x2f7
> Mar 10 00:15:25.165679 (XEN) [<ffff82d0801956dd>] do_platform_op+0xe69/0x1959
> Mar 10 00:15:25.173667 (XEN) [<ffff82d080251485>] pv_hypercall+0x1ef/0x42d
> Mar 10 00:15:25.181678 (XEN) [<ffff82d080254ff6>] entry.o#test_all_events+0/0x30
>
> Register state tells us that it's CPU5 not responding. The only piece
> of information we have about CPU5 is
>
> Mar 10 00:09:32.809709 (XEN) CPU5 @ e008:ffff82d080134083 (0000000000000000)
>
> which is also in _spin_lock(), but which I'm afraid is too little to
> diagnose the issue. I'm therefore wondering whether we wouldn't
> better default "async-show-all" to true in debug builds.
>
> What I'm also puzzled by is that the system is still partly alive after
> the panic: There's Dom0 output, and it is also reacting to debug
> key input. I would have expected a panic to bring down the system
> right away...
Not very surprising. We crashed because the IPI lock was unavailable,
then disabled the watchdog in machine_halt() and tried to IPI again. CPU1
is almost certainly stuck waiting to broadcast __machine_halt().

This is the second odd corner case we have seen around machine_halt().
The last one arose because it is unsafe to use if you panic() from the
middle of context_switch(): interrupts are re-enabled, and a guest
irq hits an assertion. The solution in both cases, to make it more
reliable, is to use an NMI broadcast and leave interrupts disabled.

IMO, noreboot isn't a clever thing to be using at all. OSSTest should
be installing a crash kernel and collecting crash logs, which will be
far more useful to aid diagnosis.

~Andrew