From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1752819AbbGNSHl (ORCPT ); Tue, 14 Jul 2015 14:07:41 -0400
Received: from out02.mta.xmission.com ([166.70.13.232]:53487 "EHLO
    out02.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1751811AbbGNSHi (ORCPT ); Tue, 14 Jul 2015 14:07:38 -0400
From: ebiederm@xmission.com (Eric W. Biederman)
To: Vivek Goyal
Cc: dwalker@fifo99.com, Hidehiro Kawai, Andrew Morton,
    linux-mips@linux-mips.org, Baoquan He, linux-sh@vger.kernel.org,
    linux-s390@vger.kernel.org, kexec@lists.infradead.org,
    linux-kernel@vger.kernel.org, Ingo Molnar, HATAYAMA Daisuke,
    Masami Hiramatsu, linuxppc-dev@lists.ozlabs.org,
    linux-metag@vger.kernel.org, linux-arm-kernel@lists.infradead.org
References: <20150713202611.GA16525@fifo99.com>
    <87h9p7r0we.fsf@x220.int.ebiederm.org>
    <20150714135919.GA18333@fifo99.com>
    <20150714150208.GD10792@redhat.com>
    <20150714153430.GA18766@fifo99.com>
    <20150714154040.GA3912@redhat.com>
    <20150714154833.GA18883@fifo99.com>
    <20150714161612.GH10792@redhat.com>
    <87a8uyoeig.fsf@x220.int.ebiederm.org>
    <20150714172953.GA19135@fifo99.com>
    <20150714175527.GI10792@redhat.com>
Date: Tue, 14 Jul 2015 13:01:12 -0500
In-Reply-To: <20150714175527.GI10792@redhat.com> (Vivek Goyal's message of
    "Tue, 14 Jul 2015 13:55:27 -0400")
Message-ID: <87si8qmxef.fsf@x220.int.ebiederm.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-XM-AID: U2FsdGVkX1/O22/RKhEvkpEBGvhwo+e9R2lPzBkhfjs=
X-SA-Exim-Connect-IP: 67.3.205.90
X-SA-Exim-Mail-From: ebiederm@xmission.com
X-Spam-Report:
    * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP
    * 0.0 TVD_RCVD_IP Message was received from an IP address
    * 1.5 XMNoVowels Alpha-numberic number with no vowels
    * 0.7 XMSubLong Long Subject
    * 0.5 XMGappySubj_01 Very gappy subject
    * 0.0 T_TM2_M_HEADER_IN_MSG BODY: No description available.
    * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
    *      [score: 0.5000]
    * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC
    *      [sa04 1397; Body=1 Fuz1=1 Fuz2=1]
    * 0.0 T_TooManySym_01 4+ unique symbols in subject
X-Spam-DCC: XMission; sa04 1397; Body=1 Fuz1=1 Fuz2=1
X-Spam-Combo: **;Vivek Goyal
X-Spam-Relay-Country:
X-Spam-Timing: total 401 ms - load_scoreonly_sql: 0.05 (0.0%),
    signal_user_changed: 3.2 (0.8%), b_tie_ro: 2.3 (0.6%), parse: 0.72
    (0.2%), extract_message_metadata: 11 (2.8%), get_uri_detail_list: 1.69
    (0.4%), tests_pri_-1000: 4.9 (1.2%), tests_pri_-950: 1.12 (0.3%),
    tests_pri_-900: 0.98 (0.2%), tests_pri_-400: 23 (5.8%), check_bayes: 22
    (5.6%), b_tokenize: 7 (1.8%), b_tok_get_all: 8 (2.1%), b_comp_prob: 2.2
    (0.6%), b_tok_touch_all: 2.5 (0.6%), b_finish: 0.61 (0.2%),
    tests_pri_0: 348 (86.7%), tests_pri_500: 4.3 (1.1%), rewrite_mail: 0.00
    (0.0%)
Subject: Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available
X-Spam-Flag: No
X-SA-Exim-Version: 4.2.1 (built Wed, 24 Sep 2014 11:00:52 -0600)
X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Vivek Goyal writes:

> On Tue, Jul 14, 2015 at 05:29:53PM +0000, dwalker@fifo99.com wrote:
>
> [..]
>> > >> > If a machine is failing, there are high chance it can't deliver you the
>> > >> > notification. Detecting that failure using some kind of polling mechanism
>> > >> > might be more reliable. And it will make even kdump mechanism more
>> > >> > reliable so that it does not have to run panic notifiers after the crash.
>> > >>
>> > >> I think what you're suggesting is that my company should change how its hardware works
>> > >> and that's not really an option for me. This isn't a simple thing like checking over the
>> > >> network if the machine is down or not, this is way more complex hardware design.
>> > >
>> > > That means you are ready to live with an unreliable design.
>> > > There might be
>> > > cases where notifier does not get run properly and you will not do switch
>> > > despite the fact that OS has failed. I was just trying to nudge you in
>> > > a direction which could be more reliable mechanism.
>> >
>> > Sigh I see some deep confusion going on here.
>> >
>> > The panic notifiers are just that panic notifiers. They have not been
>> > nor should they be tied to kexec. If those notifiers force a switch
>> > over of between machines I fail to see why you would care if it was
>> > kexec or another panic situation that is forcing that switchover.
>>
>> Hidehiro isn't fixing the failover situation on my side, he's fixing register
>> information collection when crash_kexec_post_notifiers is used.
>
> Sure. Given that we have created this new parameter, let us fix it so that
> we can capture the other cpu register state in crash dump.
>
> I am little disappointed that it was not tested well when this parameter was
> introduced. We should have at least tested it to the extent to see if there
> is proper cpu state present for all cpus in the crash dump.
>
> At that point of time it looked like a simple modification
> to allow panic notifiers before crash_kexec().

Either that or we say no one cares enough, and it is known broken so
let's just revert the fool thing.

I honestly can't see how to support panic notifiers, before kexec.
There is no way to tell what is being done and all of the pieces
including smp_send_stop are known to be buggy.

It isn't like this latest set of patches was reviewed/tested much
better, as the first patch was wrong.

Eric