From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932588AbcGDJoy (ORCPT ); Mon, 4 Jul 2016 05:44:54 -0400 Received: from cn.fujitsu.com ([59.151.112.132]:14430 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S932315AbcGDJox (ORCPT ); Mon, 4 Jul 2016 05:44:53 -0400 X-IronPort-AV: E=Sophos;i="5.22,518,1449504000"; d="scan'208";a="8331637" From: "Wei, Jiangang" To: "mingo@kernel.org" CC: "tglx@linutronix.de" , "linux-kernel@vger.kernel.org" , "hpa@zytor.com" , "mingo@redhat.com" , "x86@kernel.org" , "fenghua.yu@intel.com" Subject: Re: [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash Thread-Topic: [PATCH 1/2] x86/apic: shutdown local APIC before I/O APIC during crash Thread-Index: AQHR0cJIxNM8FI3iokesr2LMGOyeoqAC3y4AgASn/gA= Date: Mon, 4 Jul 2016 09:44:44 +0000 Message-ID: <1467625358.2119.55.camel@localhost> References: <1467175910-2966-1-git-send-email-weijg.fnst@cn.fujitsu.com> <20160701103620.GA12394@gmail.com> In-Reply-To: <20160701103620.GA12394@gmail.com> Accept-Language: zh-CN, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.167.226.50] Content-Type: text/plain; charset="utf-8" Content-ID: <7FBD54938F0C0F4FBF0051F0710AC41A@fujitsu.local> MIME-Version: 1.0 X-yoursite-MailScanner-ID: 186F741C0BBC.AF844 X-yoursite-MailScanner: Found to be clean X-yoursite-MailScanner-From: weijg.fnst@cn.fujitsu.com Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id u649ixuB011628 Hi, Ingo Thanks for your comments firstly. On Fri, 2016-07-01 at 12:36 +0200, Ingo Molnar wrote: > * Wei Jiangang wrote: > > > commit <522e66464467> disables I/O APIC before shutdown of > > the local APIC for both reboot and crash path. > > and commit <2885432aaf15> declares that 'it still makes sense to > > quiet IO APIC before disabling Local APIC'. > > That's not how we refer to commits in changelogs. > OK, I will fix it and pay attention to it in the following patch. > > However, the former introduced a bug for crashdown. > > What is 'crashdown'? It's not referred to in the kernel source even once. well, I mean ... If we trigger kernel panic with the following commands, the capture kernel should boot normally and captures the dump image. #echo 1 > /proc/sys/kernel/sysrq #echo c > /proc/sysrq-trigger But due to commit 522e66464467 changes the APIC shutdown sequence in native_machine_crash_shutdown(), the capture kernel doesn't boot normally and hang in calibrate_delay_converge(), waiting for the jiffies to be updated. BTW, without commit 522e66464467, the capture kernel works well. > > > If specify 'notsc' for capture-kernel, and then trigger crashdown. > > The capture-kernel will be blocked at calibrate_delay_converge(). > > This is a more readable way of saying the same: > > If we specify the 'notsc' boot parameter for the dump-capture kernel, > and then trigger a crash-down, then the dump-capture kernel will hang > in calibrate_delay_converge(): > > (Assuming the changelog first explains what a 'crash-down' is.) > > > /* wait for "start of" clock tick */ > > ticks = jiffies; > > while (ticks == jiffies) > > ; /* nothing */ > > Plase align quoted code to the right with at least a single tab. > OK > > serial console log as following, > > serial log of the hang is as follows: > > > ............ > > [ 0.000000] Linux version 4.7.0-rc2+ (root@localhost.localdomain) > > (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #2 SMP Wed Jun > > 156 > > [ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.7.0-rc2+ > > root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap > > vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=256M > > vconsole.keymap=us console=tty0 console=ttyS0,115200n8 LANG=en_US.UTF-8 > > irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off > > panic=10 rootflags=nofail acpi_no_memhotplug notsc > > ............ > > [ 0.000000] tsc: Kernel compiled with CONFIG_X86_TSC, cannot disable > > TSC completely > > ............ > > [ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: > > 0xffffffff, max_idle_ns: 133484882848 ns > > [ 0.000000] tsc: Fast TSC calibration using PIT > > [ 0.000000] tsc: Detected 3192.714 MHz processor > > [ 0.000000] Calibrating delay loop... > > Just quote the last few lines and skip the useless timestamp column. Also, please > right-align this too. OK > > > The bug remains and unsolved for a long time, since 2013. > > I find the arch-criminal by bisect. > > What is an arch-criminal? Did you want to say: > > The bug has been introduced in 2013. I found the buggy commit via bisection. > > ? Yes, That's what i want to say. > > > The commit <522e66464467> used to fix erratum AVR31 for "Intel Atom > > Processor C2000 Product Family Specification Update". > > You can find the doc at http://www.intel.com/content/dam/www/public/us > > /en/documents/specification-updates/atom-c2000-family-spec-update.pdf. > > > > IMO, > > It doesn't make sense that change the order of disabling between > > I/O APIC and local APIC just for a certain model C2000. > > And I couldn't find any related descriptions for Intel 64 and IA-32 Arch. > > > > so, I want to revert the crash part of commit <522e66464467>. > > So why does the crashdump kernel hang in calibrate_delay_converge()? The jiffies value doesn't increase, which causes the capture kernel hang in calibrate_delay_converge(). It seems that there's a relationship with the shutdown(disable) order between IO APIC and local APIC. I'm not sure of this point .... One thing for sure by debugging is that do_timer() is not called while capture kernel boots up. I suspect the timer interrupts (irq0) is not passed to cpu by APIC. > > To me it appears this is a weakness in the crashdump kernel: it is unable to boot > if we crash the original host system in a particular hardware state, right? Maybe you're right ... I specify 'notsc' only for capture-kernel, not the original host system(first kernel). And I suspect the APIC shutdown sequence in first kernel maybe bring some bad influence on capture kernel. I need to do more investigation. Do you have any advice? Thanks in advance. Wei > By reverting this change we'll just paper over the bug and re-introduce the bug > that can result in certain CPUs hanging if the IO-APIC sends an APIC message if > the lapic is disabled prematurely. > Thanks, > > Ingo > >