From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Andrew Kinney"
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
Date: Tue, 23 Nov 2004 17:00:39 -0800
Message-ID: <41A36CB7.16204.D3BCF639@localhost>
References: <20041028005302.753a2d52.akpm@osdl.org>
Reply-To: andykinney@advantagecom.net
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
Return-path:
Received: from mail.advantagecom.net ([65.103.151.155]:51082 "EHLO mail.advantagecom.net") by vger.kernel.org with ESMTP id S261406AbUKXBAZ (ORCPT ); Tue, 23 Nov 2004 20:00:25 -0500
In-reply-to: <1101246101.26294.76.camel@ryan2.internal.autoweb.net>
Content-description: Mail message body
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Ryan Anderson
Cc: linux-scsi@vger.kernel.org

I'll just add my two cents, having gone through the exact same thing with a pair of PowerEdge 2500 systems with PERC 3/Di RAID controllers. They both had the problem.

Our problem could also only be reproduced with the machine in production, under heavy RAM usage, moderate CPU usage, and many simultaneous small I/Os mixed with large file writes. PostgreSQL dumps seemed the most likely trigger, since for a while the machine crashed on schedule while a cron job was running a dump. That didn't cause the crash every time, though; only about 10% of the time, from what I could tell.

I was unable to reproduce the problem when the machine was out of production. It took some hocus-pocus mix of system activity that only the collective consciousness of our customers could seem to cause. I spent weeks with stress-testing programs of all sorts, but I just couldn't replicate it.

First, what enabled us to locate the problem was the following command, run in the boot immediately following a crash. It needs to be run from afacli after opening the controller.
  diagnostic show history /old

That will display the series of events leading up to the container going offline. Pay particular attention to the beginning of the problem. It will look similar to this:

  [13]: ID(0:01:0) Timeout detected on cmd[0x28]
  [14]: ID(0:01:0): Timeout detected on cmd[0x2a]
  [15]: ID(0:01:0): Timeout detected on cmd[0x28]
  [16]: <...repeats 2 more times>
  [50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
  [51]: 509184 sec
  [78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec
  [79]: 2 can't read mbr dev_t:1
  [80]: <...repeats 1 more times>
  [81]: can't read config from slice #[1]

Note the timestamp on line 51 (continued from line 50). Yours will have a different timestamp and a different line number, but make note of the earliest timestamp visible during or just before the problem. Ours was almost 6 days (509184 seconds) from boot. Then note the latest timestamp near the end of the problem (line 78 in our log). In our case, 78 seconds had elapsed between the first and last failure. I'm pretty sure it was coincidence that it was line 78 and 78 seconds elapsed.

That is, of course, far longer than it should take for the firmware to offline a dead drive, but that's what happened. The result was the Linux SCSI layer timing out and hosing everything, since we ran everything (including swap, boot, and root) from this container.

Our solution was UGLY. We replaced everything in the entire disk subsystem, in addition to getting the latest BIOS, firmware, and drivers. Fortunately, both systems were under warranty, so Dell provided the parts and labor to replace the errant hard drive (ID 0:1 in our case), the backplane, the cables, and the mainboard (since the controller is embedded).

The good news is that we've been up 47 days since implementing this solution. Previous uptimes were typically less than a week.
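For anyone comparing their own afacli history against ours, the elapsed-time bookkeeping above can be sketched in a few lines of Python. This is a hedged sketch: the log format is assumed from the excerpt above, and real "diagnostic show history /old" output may differ.

```python
import re

# Assumed afacli history excerpt; note the "at" / "NNN sec"
# continuation split across lines [50] and [51].
history = """\
[50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
[51]: 509184 sec
[78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec
"""

# Rejoin continuation lines ("... at" followed by "[NN]: NNN sec"),
# then pull every "at NNN sec" timestamp (seconds since boot).
joined = re.sub(r"at\s*\n\[\d+\]:\s*", "at ", history)
stamps = [int(s) for s in re.findall(r"at (\d+) sec", joined)]

first, last = min(stamps), max(stamps)
print(f"first failure at {first} sec (~{first / 86400:.1f} days after boot)")
print(f"failure window: {last - first} seconds")
```

Run on the excerpt above, this reports the ~6 days of uptime before the first failure and the 78-second window in which the container went down.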
I've had many theories about the root cause (some better than others), but my latest iteration is that when the system was under normal load, a power component on the SCSI backplane couldn't supply the proper voltage to one drive for a short period. The drive became unresponsive upon executing a command that worked the servo motor enough for the power fluctuation to affect the drive's command processor, causing the drive to lock up.

The problem was compounded by the firmware not marking the drive as bad in a reasonable time frame, so that the container could have continued processing commands in degraded mode before the upper layer wigged out. The upper-layer Linux SCSI drivers can't do anything about this, unfortunately, and just blindly take down your only available disk storage before the controller comes back from marking the disk as bad.

Maybe our experience will benefit someone when they go to modify the controller firmware to properly mark a drive as bad in a reasonable time, so the container can continue to operate in degraded mode rather than being taken offline entirely by the OS. Hint. Hint.

Andrew

On 23 Nov 2004 at 16:41, Ryan Anderson wrote:

> On Thu, 2004-10-28 at 00:53 -0700, Andrew Morton wrote:
> > Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid
> > PERC 3/Di Container goes offline
> >
> > http://bugme.osdl.org/show_bug.cgi?id=3651
> >
> >     Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
> >              offline
> >     Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
> >     Status: NEW
> >     Severity: high
> >     Owner: andmike@us.ibm.com
> >     Submitter: oliver.polterauer@ewave.at
> >     CC: oliver.polterauer@ewave.at
>
> Is there any update on this problem?
> To reiterate my particular hardware involved that can trigger this
> problem:
>
> Dell 2650, Dual 2.4Ghz Xeon processors (hyperthreading no, though the
> problem occured in 2.4.20 without hyperthreading disabled via "noht")
>
> 4 GB of ram
> Only load is PostgreSQL related (i.e, network queries, plus twice
> daily dumps of the database to a NFS store, and a rsync back to the
> server for a second copy)
>
> Under load, I repeatedly saw containers go offline.
>
> Dell's recommended hardware diagnostics do not turn up anything (at
> all!)
>
> The harddrive are Fujitsu drives, so the Seagate Firmware issue should
> not affect them.
>
> I have since taken this server out of production. Unfortunately, this
> makes the error much harder to trigger (i.e, I have failed so far to
> trigger it, even with multiple bonnie++ runs)
>
> Suggestions, diagnostics, etc, would be greatly appreciated.
>
> --
> Ryan Anderson
> AutoWeb Communications, Inc.
> email: ryan@autoweb.net

Sincerely,
Andrew Kinney
President and Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net