From: Larkin Lowrey <llowrey@nuclearwinter.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Raid5 device hangs in active state
Date: Tue, 28 Feb 2012 15:33:56 -0600
Message-ID: <4F4D4844.3010805@nuclearwinter.com>
In-Reply-To: <20120229065206.60d1e2ea@notabene.brown>
Thank you for taking a look.
I should be able to move the drives to a completely different controller
(and driver) so that will be a good test.
Could NCQ be an issue? IOW, do you think it would be worth disabling NCQ
and re-running this scenario?
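
For reference, a minimal sketch of how NCQ could be turned off for such a test. It assumes the member disks are libata SATA devices that expose a writable device/queue_depth file in sysfs, where a depth of 1 effectively disables NCQ. The function name disable_ncq, the SYSFS_ROOT variable and the drive names are placeholders for illustration, not something taken from this thread.

```shell
#!/bin/sh
# Assumption: sysfs is mounted at /sys; SYSFS_ROOT is a hypothetical
# override so the function can be tried against a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# disable_ncq DEV...
# Set the queue depth of each named block device to 1.
# Must be run as root on a real system; the change is lost on reboot.
disable_ncq() {
    for dev in "$@"; do
        qd="$SYSFS_ROOT/block/$dev/device/queue_depth"
        if [ -w "$qd" ]; then
            # Report the previous depth so it can be restored later.
            printf '%s: queue_depth %s -> 1\n' "$dev" "$(cat "$qd")"
            echo 1 > "$qd"
        else
            printf '%s: no writable queue_depth, skipped\n' "$dev" >&2
        fi
    done
}
```

It would be called with the array's member drives, for example `disable_ncq sda sdb sdc` (placeholder names). Booting with libata.force=noncq on the kernel command line is another way to get the same effect for all libata devices.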
--Larkin
On 2/28/2012 1:52 PM, NeilBrown wrote:
> On Tue, 28 Feb 2012 12:21:39 -0600 Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>
>> I did another sysrq dump and have attached the output.
>
> Thanks. Unfortunately it contains nothing of value - too much has been
> lost. It seems that 'Show State' contains a lot more noise than it used to.
>
> You will need to boot with
> log_buf_len=4M
>
> or something like that.
>
>>
>> Again, 'iostat -dx 1' showed 100% utilization on the LVM which uses
>> /dev/md0 as a pv and /sys/block/md0/md/stripe_cache_active was 29 and
>> that value did not change. There were no error messages in
>> /var/log/messages or 'dmesg'.
>
> The '29' could simply mean that md/raid5 has sent 29 requests down to lower
> levels which have not yet completed.
>>
>> My suspicions lie with md0 since the stripe_cache_active value remains
>> at a fixed non-zero value even though all disks are (or appear to be)
>> idle. Should I be looking elsewhere? This hardware did not exhibit this
>> problem before "upgrading" from Fedora 15 to Fedora 16.
>
> My guess is a problem with one of the drive controllers. Your monthly 'sync'
> puts a much heavier load on them than normal IO does. It is consistently
> sending a bunch of requests to all devices at exactly the same time. This
> could trigger race conditions that normal IO does not.
>
> But that is just a guess. Unfortunately it is very hard to track exactly
> what is going wrong in this sort of case.
>
> I'd suggest shuffling devices so they are on different controllers, or maybe
> replace a controller. See if you can get the problem to move, and then see
> which controller it stayed with.
>
> NeilBrown
>
>
>>
>> Thank you,
>>
>> --Larkin
>>
>> On 1/8/2012 6:26 PM, NeilBrown wrote:
>>> On Sun, 08 Jan 2012 16:03:10 -0600 Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>>>
>>>> Suggestions?
>>>
>>> # echo t > /proc/sysrq-trigger
>>>
>>> and capture the messages that go to 'dmesg'. Post them.
>>>
>>> Hopefully your message ring buffer is big enough to collect the entire
>>> output. If it isn't you might need to boot with
>>> log_buf_len=1M
>>> or similar.
>>>
>>> That should show what process is blocking on what.
>>>
>>> NeilBrown
>>
>
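
The symptom described in the quoted text, a stripe_cache_active value that stays at a fixed non-zero number, can be checked with a few timed samples. A minimal sketch, assuming the array is md0 as in this thread; sample_stripe_cache and MD_SYSFS are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Assumption: the array is md0; MD_SYSFS is a hypothetical override
# pointing at the directory that holds stripe_cache_active.
MD_SYSFS="${MD_SYSFS:-/sys/block/md0/md}"

# sample_stripe_cache COUNT INTERVAL
# Print stripe_cache_active COUNT times, INTERVAL seconds apart.
sample_stripe_cache() {
    count="$1"
    interval="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        printf 'sample %s: stripe_cache_active=%s\n' \
            "$i" "$(cat "$MD_SYSFS/stripe_cache_active")"
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Running `sample_stripe_cache 10 1` while the array appears hung shows whether the count is changing. As noted in the quoted reply, an unchanging value may simply reflect requests that the lower layers have not yet completed.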