From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail.lixom.net (lixom.net [66.141.50.11])
	by ozlabs.org (Postfix) with ESMTP id 401FB67A6B;
	Fri, 24 Mar 2006 10:12:31 +1100 (EST)
Date: Thu, 23 Mar 2006 17:11:29 -0600
To: Haren Myneni
Subject: Re: [PATCH] kdump: Fix for machine checkstop on DMA fault
Message-ID: <20060323231129.GD5538@pb15.lixom.net>
References: <20060323201258.GA5538@pb15.lixom.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
From: Olof Johansson
Cc: Milton Miller, Michael Ellerman, linuxppc-dev@ozlabs.org,
	Paul Mackerras, Olaf Hering, ellerman@au1.ibm.com
List-Id: Linux on PowerPC Developers Mail List

Hi,

On Thu, Mar 23, 2006 at 03:06:22PM -0800, Haren Myneni wrote:

> On JS21, immediately after the tce entries are initialized, the machine
> checkstops with an error "Internal CPU 1 Fault Error" on bladecenter MM.
> If we do not initialize tce entries for crash kernel, allows the ongoing
> DMA continue to the old kernel memory. I though that, ongoing DMA will be

The problem isn't when DMA is going to the old kernel memory. The
problem is when that TCE entry gets reused by the crashdump kernel, and
some other memory gets overwritten instead.

> stopped when the device reset happens later by the drivers. I think, some
> hardening is already included in some drivers to take care of this
> behavior. I might be wrong. So far, I had e100 issue after testing on p5,

What assures that the crash kernel has drivers for all hardware in the
system? If there's no driver, what will then be used to quiesce the
device?

> p4, js20 and js21. Probably, it could be lucky scenario.
> So, will be keeping the same change (posted here) plus your suggestion.
> Right? Can we apply same approach even for power-4?

What you have now might be a 99%-of-the-time-it-works solution, but is
that really good enough?
The last things you want from a crash kernel are:

1. Have it crash on its own because of something getting overwritten
   (small chance, since most mappings are probably for writing out data
   for later analysis)

or:

2. Have it write corrupted data to the crash dump. This makes it more or
   less useless, since you can't trust what it wrote out: Did the machine
   go down because of the memory corruption you're spotting, or did that
   happen after the crash, while dumping it, etc?

Either way, a proper solution is needed, not a 99% one.


-Olof
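
For illustration only, the TCE reuse hazard described above can be modelled in a few lines of user-space Python. This is a toy sketch, not kernel code: the names `TceTable`, `device_dma_write` and `simulate`, the buffer labels, and the choice of entry 0 are all hypothetical, and the model assumes a device that keeps issuing DMA through the same table index after the crash kernel has taken over.

```python
# Toy model of an IOMMU translation (TCE) table. All names are hypothetical.

class TceTable:
    """Maps DMA (bus-side) entry indexes to pages of system memory."""

    def __init__(self, entries):
        self.entries = [None] * entries  # None means "no valid translation"

    def map(self, index, page):
        self.entries[index] = page

    def clear_all(self):
        self.entries = [None] * len(self.entries)


def device_dma_write(table, memory, index, data):
    """The device writes through whatever translation is installed right now."""
    page = table.entries[index]
    if page is None:
        return False  # DMA fault; on real hardware this may be fatal
    memory[page] = data
    return True


def simulate():
    memory = {"old_kernel_buf": b"", "dump_buf": b""}
    table = TceTable(4)

    # The old kernel maps entry 0 for a device buffer; the device uses it.
    table.map(0, "old_kernel_buf")
    device_dma_write(table, memory, 0, b"packet-1")

    # Crash: the crash kernel re-initializes the table without the device
    # having been quiesced, then reuses entry 0 for one of its own buffers.
    table.clear_all()
    table.map(0, "dump_buf")
    memory["dump_buf"] = b"crash dump data"

    # The still-running device writes through entry 0 again and silently
    # overwrites the crash kernel's buffer instead of old kernel memory.
    device_dma_write(table, memory, 0, b"packet-2")
    return memory


if __name__ == "__main__":
    print(simulate())
```

In this model the device's second write lands in `dump_buf`, which corresponds to case 2 above: the dump contents can no longer be trusted. A write issued between `clear_all()` and the new `map()` call would fault instead, which is the other failure mode discussed in this thread.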