From: "Brett G. Durrett" <brett@imvu.com>
To: Joe Malicki <jmalicki@metacarta.com>
Cc: linux-scsi@vger.kernel.org,
"David N. Welton" <d.welton@webster.it>,
linux-poweredge@dell.com, Sumant Patro <Sumant.Patro@lsil.com>,
"Marc A. Meadows" <mam+megaraid@metacarta.com>,
"Keith R. Baker" <krbaker+megaraid@metacarta.com>
Subject: Re: megaraid_sas waiting for command and then offline
Date: Mon, 11 Dec 2006 21:24:29 -0800
Message-ID: <457E3D0D.7070707@imvu.com>
In-Reply-To: <457E1C59.80208@metacarta.com>

I am still seeing this, and we have between 2 and 5 failures per week
(across almost 20 machines). I am seeing it on ext3 (we migrated all of
the machines from XFS) and with ReadAhead disabled.

You mention a firmware update, but I don't see any new PERC 5 firmware
packages on Dell's site... can you give me a pointer to the firmware update?

Also, has anybody had this problem on RHEL? Dell does not support Linux
unless it is RHEL... I would be surprised if somehow RHEL did not have
this problem.

B-

Joe Malicki wrote:
>> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
>> megaraid_sas driver hangs waiting for commands and then the filesystem
>> unmounts, leaving the machine in an unusable state until there is a hard
>> reboot (the machine is responsive but any access, shell or otherwise, is
>> impossible without the filesystem). While I do not have much debugging
>> information available, this happens to me about once every 6-7 days in
>> my pool of seven machines, so I can probably get debugging info. Since
>> the disk is offline and I can't get remote console, I don't have any
>> details except something similar to Dave Lloyd's post, below.
>>
>
> Brett, is this still happening to you? We're seeing this very
> sporadically, but it does concern us. We've seen driver updates in
> 2.6.19 (v00.00.03.05) and a new Dell PERC 5/i firmware:
>
> Package Version - 5.0.2-0003
> Firmware Version - 1.00.01-0157
> SASBIOS Version - MT23
> Ctrl-R Version - 1.02-007
> MPT Version - 00.06.71.00-IT
>
> and haven't been able to reproduce it, but we can't find a test case to
> reliably reproduce the problem to know that anything was fixed. (This is
> across 31 identically configured Dell 2950s with the PERC 5/i RAID
> controller, each configured with six 300GB SAS drives in a RAID 5, most
> (all?) of them Maxtor Atlas 10k, no hot spare.) Our 2950s do have 16GB of RAM each,
> so the firmware update (which mentions that it fixes DMA beyond 8GB)
> sounds promising, but I would think that if that was the problem we were
> experiencing, we would reproduce this much more often? We are certainly
> using the RAM for cache and memory, it's not like we've never touched
> beyond 8GB.
>
> Does anyone have a test case to reproduce this problem reliably, or a
> detailed description of what actually happens (on low levels) when this
> problem occurs that can help to make a test? We are more interested in
> making this reproducible now than in finding a workaround... if anyone
> has any tips on how to make this *more* likely to happen we'd like to
> know (so far, I know to try to use XFS and enable ReadAhead).
>
> We have seen this correlated with Patrol Reads going on at the same
> time, but aren't sure if this is a red herring, and haven't been able to
> force the issue to happen by enabling Patrol Reads.
>
> We've only ever seen these on two machines - one machine reproduces the
> problem in a little over a week, and the other has reproduced it a small
> number of times. The machines that reproduce it run an experimental
> demo workload, but we have not found a test case so far to reproduce the
> problem on demand to find or verify solutions. We're currently swapping
> out machines to rule out hardware problems, but the machines pass their
> own diagnostics cleanly, and the workload they run is different enough
> that we suspect the cause is something about the workload that we can't
> yet synthesize into a test case.
>
> Thank you!
> Joe Malicki
> Software Engineer
> Metacarta, Inc.
> email: jmalicki@metacarta.com
>
>
>> The only thing that the machines with these failures seem to have in
>> common is that their workload is almost exclusively writes - they are
>> slave database machines with large memory and pretty much just
>> replicate. The read/write machines seem to have fewer failures.
>>
>> I am happy to help provide debugging information in any reasonable way.
>> In the meantime, if there are any known suggestions or workarounds for
>> the problem, I would be grateful for the guidance.
>>
>> Here are the details on the controller. If you want additional info,
>> let me know exactly what you need and I will do what I can to get it to
>> you:
>>
>> Product Name     : PERC 5/i Integrated
>> Serial No        : 12345
>> FW Package Build : 5.0.1-0030
>> FW Version       : 1.00.01-0088
>> BIOS Version     : MT23
>> Ctrl-R Version   : 1.02-007
>>
>> B-
>>
>
>
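
Since the thread is asking for a test case, here is a minimal sketch of a
write-heavy load generator, loosely modeled on the "almost exclusively
writes" workload described above. It is only an assumption that sustained
synchronous writes are relevant to the hang; nothing in this thread confirms
that this reproduces it. The function name, file names, and sizes are all
hypothetical and should be scaled up (and pointed at a filesystem on the
PERC 5/i volume) for any real attempt.

```python
import os
import tempfile


def write_stress(directory, num_files=4, block_size=1 << 20,
                 blocks_per_file=8, passes=1):
    """Write random blocks to several files, fsync-ing after every block.

    Hypothetical helper: the sizes are placeholders, not values known to
    trigger the megaraid_sas hang. Returns the total bytes written.
    """
    total = 0
    block = os.urandom(block_size)  # one random block, reused for every write
    for _ in range(passes):
        for i in range(num_files):
            path = os.path.join(directory, f"stress_{i}.dat")
            # "wb" truncates, so each pass rewrites the file from scratch.
            with open(path, "wb") as f:
                for _ in range(blocks_per_file):
                    f.write(block)
                    f.flush()
                    os.fsync(f.fileno())  # push the data past the page cache
                    total += block_size
    return total


if __name__ == "__main__":
    # Tiny smoke run in a temporary directory; replace with a directory on
    # the RAID volume and much larger sizes for an actual stress attempt.
    with tempfile.TemporaryDirectory() as d:
        print(write_stress(d, num_files=2, block_size=4096, blocks_per_file=4))
```

The fsync after every block is a deliberate choice to keep commands
outstanding on the controller rather than letting the page cache absorb the
writes; whether that matters for this problem is unverified.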