All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Wei, Jiangang" <weijg.fnst@cn.fujitsu.com>
To: "mingo@kernel.org" <mingo@kernel.org>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"fenghua.yu@intel.com" <fenghua.yu@intel.com>
Subject: Re: [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash
Date: Mon, 4 Jul 2016 09:44:44 +0000	[thread overview]
Message-ID: <1467625358.2119.55.camel@localhost> (raw)
In-Reply-To: <20160701103620.GA12394@gmail.com>

Hi, Ingo

Thanks for your comments firstly.

On Fri, 2016-07-01 at 12:36 +0200, Ingo Molnar wrote:
> * Wei Jiangang <weijg.fnst@cn.fujitsu.com> wrote:
> 
> > commit <522e66464467> disables I/O APIC before shutdown of
> > the local APIC for both reboot and crash path.
> > and commit <2885432aaf15> declares that 'it still makes sense to
> > quiet IO APIC before disabling Local APIC'.
> 
> That's not how we refer to commits in changelogs.
> 
OK, I will fix it and pay attention to it in the following patch.

> > However, the former introduced a bug for crashdown.
> 
> What is 'crashdown'? It's not referred to in the kernel source even once.

well, I mean ...
 
If we trigger kernel panic with the following commands, the capture
kernel should boot normally and captures the dump image. 

#echo 1 > /proc/sys/kernel/sysrq
#echo c > /proc/sysrq-trigger

But due to commit 522e66464467 changes the APIC shutdown sequence in
native_machine_crash_shutdown(), the capture kernel doesn't boot
normally and
hang in calibrate_delay_converge(), waiting for the jiffies to be
updated.

BTW, without commit 522e66464467, the capture kernel works well.

> 
> > If specify 'notsc' for capture-kernel, and then trigger crashdown.
> > The capture-kernel will be blocked at calibrate_delay_converge().
> 
> This is a more readable way of saying the same:
> 
>   If we specify the 'notsc' boot parameter for the dump-capture kernel,
>   and then trigger a crash-down, then the dump-capture kernel will hang
>   in calibrate_delay_converge():
> 
> (Assuming the changelog first explains what a 'crash-down' is.)
> 
> > /* wait for "start of" clock tick */
> > ticks = jiffies;
> > while (ticks == jiffies)
> >     ; /* nothing */
> 
> Plase align quoted code to the right with at least a single tab.
> 
OK
> > serial console log as following,
> 
>   serial log of the hang is as follows:
> 
> > ............
> > [    0.000000] Linux version 4.7.0-rc2+ (root@localhost.localdomain)
> > (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #2 SMP Wed Jun
> > 156
> > [    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.7.0-rc2+
> > root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap
> > vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=256M
> > vconsole.keymap=us console=tty0 console=ttyS0,115200n8 LANG=en_US.UTF-8
> > irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off
> > panic=10 rootflags=nofail acpi_no_memhotplug notsc
> > ............
> > [    0.000000] tsc: Kernel compiled with CONFIG_X86_TSC, cannot disable
> > TSC completely
> > ............
> > [    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles:
> > 0xffffffff, max_idle_ns: 133484882848 ns
> > [    0.000000] tsc: Fast TSC calibration using PIT
> > [    0.000000] tsc: Detected 3192.714 MHz processor
> > [    0.000000] Calibrating delay loop...
> 
> Just quote the last few lines and skip the useless timestamp column. Also, please 
> right-align this too.
OK
> 
> > The bug remains and unsolved for a long time, since 2013.
> > I find the arch-criminal by bisect.
> 
> What is an arch-criminal? Did you want to say:
> 
>   The bug has been introduced in 2013. I found the buggy commit via bisection.
> 
> ?
Yes, That's what i want to say.

> 
> > The commit <522e66464467> used to fix erratum AVR31 for "Intel Atom
> > Processor C2000 Product Family Specification Update".
> > You can find the doc at http://www.intel.com/content/dam/www/public/us
> > /en/documents/specification-updates/atom-c2000-family-spec-update.pdf.
> >
> > IMO,
> > It doesn't make sense that change the order of disabling between
> > I/O APIC and local APIC just for a certain model C2000.
> > And I couldn't find any related descriptions for Intel 64 and IA-32 Arch.
> > 
> > so, I want to revert the crash part of commit <522e66464467>.
> 
> So why does the crashdump kernel hang in calibrate_delay_converge()?

The jiffies value doesn't increase, which causes the capture kernel hang
in calibrate_delay_converge().

It seems that there's a relationship with the shutdown(disable) order
between IO APIC and local APIC.   I'm not sure of this point ....

One thing for sure by debugging is that do_timer() is not called while
capture kernel boots up. I suspect the timer interrupts (irq0) is not
passed to cpu by APIC.

> 
> To me it appears this is a weakness in the crashdump kernel: it is unable to boot 
> if we crash the original host system in a particular hardware state, right?

Maybe you're right ...
I specify 'notsc' only for capture-kernel, not the original host
system(first kernel).

And I suspect the APIC shutdown sequence in first kernel maybe bring
some bad influence on capture kernel.
I need to do more investigation.
Do you have any advice?
Thanks in advance.

Wei
> By reverting this change we'll just paper over the bug and re-introduce the bug 
> that can result in certain CPUs hanging if the IO-APIC sends an APIC message if 
> the lapic is disabled prematurely.
> Thanks,
> 
> 	Ingo
> 
> 

      reply	other threads:[~2016-07-04  9:44 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-29  4:51 [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash Wei Jiangang
2016-06-29  4:51 ` [PATCH 2/2] time/tick-schede: fix typos Wei Jiangang
2016-07-01 10:49   ` [tip:timers/core] timers/nohz: Fix several typos tip-bot for Wei Jiangang
2016-07-01 10:36 ` [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash Ingo Molnar
2016-07-04  9:44   ` Wei, Jiangang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1467625358.2119.55.camel@localhost \
    --to=weijg.fnst@cn.fujitsu.com \
    --cc=fenghua.yu@intel.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.