From: "Wei, Jiangang" <weijg.fnst@cn.fujitsu.com>
To: "mingo@kernel.org" <mingo@kernel.org>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"hpa@zytor.com" <hpa@zytor.com>,
"mingo@redhat.com" <mingo@redhat.com>,
"x86@kernel.org" <x86@kernel.org>,
"fenghua.yu@intel.com" <fenghua.yu@intel.com>
Subject: Re: [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash
Date: Mon, 4 Jul 2016 09:44:44 +0000 [thread overview]
Message-ID: <1467625358.2119.55.camel@localhost> (raw)
In-Reply-To: <20160701103620.GA12394@gmail.com>
Hi, Ingo
Thanks for your comments firstly.
On Fri, 2016-07-01 at 12:36 +0200, Ingo Molnar wrote:
> * Wei Jiangang <weijg.fnst@cn.fujitsu.com> wrote:
>
> > commit <522e66464467> disables I/O APIC before shutdown of
> > the local APIC for both reboot and crash path.
> > and commit <2885432aaf15> declares that 'it still makes sense to
> > quiet IO APIC before disabling Local APIC'.
>
> That's not how we refer to commits in changelogs.
>
OK, I will fix it and pay attention to it in the following patch.
> > However, the former introduced a bug for crashdown.
>
> What is 'crashdown'? It's not referred to in the kernel source even once.
well, I mean ...
If we trigger kernel panic with the following commands, the capture
kernel should boot normally and captures the dump image.
#echo 1 > /proc/sys/kernel/sysrq
#echo c > /proc/sysrq-trigger
But due to commit 522e66464467 changes the APIC shutdown sequence in
native_machine_crash_shutdown(), the capture kernel doesn't boot
normally and
hang in calibrate_delay_converge(), waiting for the jiffies to be
updated.
BTW, without commit 522e66464467, the capture kernel works well.
>
> > If specify 'notsc' for capture-kernel, and then trigger crashdown.
> > The capture-kernel will be blocked at calibrate_delay_converge().
>
> This is a more readable way of saying the same:
>
> If we specify the 'notsc' boot parameter for the dump-capture kernel,
> and then trigger a crash-down, then the dump-capture kernel will hang
> in calibrate_delay_converge():
>
> (Assuming the changelog first explains what a 'crash-down' is.)
>
> > /* wait for "start of" clock tick */
> > ticks = jiffies;
> > while (ticks == jiffies)
> > ; /* nothing */
>
> Plase align quoted code to the right with at least a single tab.
>
OK
> > serial console log as following,
>
> serial log of the hang is as follows:
>
> > ............
> > [ 0.000000] Linux version 4.7.0-rc2+ (root@localhost.localdomain)
> > (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #2 SMP Wed Jun
> > 156
> > [ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.7.0-rc2+
> > root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap
> > vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=256M
> > vconsole.keymap=us console=tty0 console=ttyS0,115200n8 LANG=en_US.UTF-8
> > irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off
> > panic=10 rootflags=nofail acpi_no_memhotplug notsc
> > ............
> > [ 0.000000] tsc: Kernel compiled with CONFIG_X86_TSC, cannot disable
> > TSC completely
> > ............
> > [ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles:
> > 0xffffffff, max_idle_ns: 133484882848 ns
> > [ 0.000000] tsc: Fast TSC calibration using PIT
> > [ 0.000000] tsc: Detected 3192.714 MHz processor
> > [ 0.000000] Calibrating delay loop...
>
> Just quote the last few lines and skip the useless timestamp column. Also, please
> right-align this too.
OK
>
> > The bug remains and unsolved for a long time, since 2013.
> > I find the arch-criminal by bisect.
>
> What is an arch-criminal? Did you want to say:
>
> The bug has been introduced in 2013. I found the buggy commit via bisection.
>
> ?
Yes, That's what i want to say.
>
> > The commit <522e66464467> used to fix erratum AVR31 for "Intel Atom
> > Processor C2000 Product Family Specification Update".
> > You can find the doc at http://www.intel.com/content/dam/www/public/us
> > /en/documents/specification-updates/atom-c2000-family-spec-update.pdf.
> >
> > IMO,
> > It doesn't make sense that change the order of disabling between
> > I/O APIC and local APIC just for a certain model C2000.
> > And I couldn't find any related descriptions for Intel 64 and IA-32 Arch.
> >
> > so, I want to revert the crash part of commit <522e66464467>.
>
> So why does the crashdump kernel hang in calibrate_delay_converge()?
The jiffies value doesn't increase, which causes the capture kernel hang
in calibrate_delay_converge().
It seems that there's a relationship with the shutdown(disable) order
between IO APIC and local APIC. I'm not sure of this point ....
One thing for sure by debugging is that do_timer() is not called while
capture kernel boots up. I suspect the timer interrupts (irq0) is not
passed to cpu by APIC.
>
> To me it appears this is a weakness in the crashdump kernel: it is unable to boot
> if we crash the original host system in a particular hardware state, right?
Maybe you're right ...
I specify 'notsc' only for capture-kernel, not the original host
system(first kernel).
And I suspect the APIC shutdown sequence in first kernel maybe bring
some bad influence on capture kernel.
I need to do more investigation.
Do you have any advice?
Thanks in advance.
Wei
> By reverting this change we'll just paper over the bug and re-introduce the bug
> that can result in certain CPUs hanging if the IO-APIC sends an APIC message if
> the lapic is disabled prematurely.
> Thanks,
>
> Ingo
>
>
prev parent reply other threads:[~2016-07-04 9:44 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-29 4:51 [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash Wei Jiangang
2016-06-29 4:51 ` [PATCH 2/2] time/tick-schede: fix typos Wei Jiangang
2016-07-01 10:49 ` [tip:timers/core] timers/nohz: Fix several typos tip-bot for Wei Jiangang
2016-07-01 10:36 ` [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash Ingo Molnar
2016-07-04 9:44 ` Wei, Jiangang [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1467625358.2119.55.camel@localhost \
--to=weijg.fnst@cn.fujitsu.com \
--cc=fenghua.yu@intel.com \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=mingo@redhat.com \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox