linux-raid.vger.kernel.org archive mirror
* RAID5 hard freeze
From: Denis Golovan @ 2014-02-24 22:01 UTC
  To: linux-raid

Hi all

I am struggling to diagnose a strange freeze of a software RAID5 array.
My RAID5 consists of 4 Toshiba SATA drives and has an ext4 filesystem on top of it.

It works fine unless I start several processes writing intensively to it.
At first, it looks like the system is under high pressure, then the
system starts lagging a lot and a hard freeze always follows after
several minutes.
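
The thread does not say how the write load was generated. As a minimal sketch only, a load of this kind could be produced with several parallel dd writers; the helper name, the file names, the sizes and the mount point below are all assumptions, not details from the report.

```sh
# Hypothetical helper: start N background writers, each writing MIB mebibytes
# of zeros to its own file under DIR, then wait for all of them.
start_writers() (
    dir=$1; n=$2; mib=$3
    i=1
    while [ "$i" -le "$n" ]; do
        # conv=fsync makes each dd flush its file to disk before exiting
        dd if=/dev/zero of="$dir/stress.$i" bs=1M count="$mib" conv=fsync 2>/dev/null &
        i=$((i + 1))
    done
    wait   # block until every writer has finished
)

# Example (assumed mount point of the array's filesystem):
#   start_writers /mnt/raid 8 4096
```

Running the same helper with a small count against a scratch directory first is a cheap way to confirm it behaves as expected before pointing it at the array.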

No errors in the system log, nothing is emitted to the console. Just a hard
freeze with the HDD light always on. I tried enabling kernel network
logging to another machine, and again got no information at the time of the hang.
After reboot, my array starts reconstruction and finishes without
errors.

I tried disabling quotas and barriers for ext4.
After disabling barriers, it almost seemed to work, but after some
time the same hard freeze happens.

I tested the same hardware configuration under Linux v3.10, 3.11 and 3.12,
and now 3.13.5 (all x86 arch); all behave the same way. The issue can
be reproduced easily.

By now I have tried everything Google suggests on the matter.
Could you give a hint on how to debug this issue?

BR,
Denis


* Re: RAID5 hard freeze
From: NeilBrown @ 2014-02-25  2:58 UTC
  To: Denis Golovan; +Cc: linux-raid

The most useful thing for debugging a hard freeze is the alt-sysrq-T output
when it is frozen.  Typing that magic sequence should always produce some
output unless it is hard-frozen with interrupts disabled.

So make sure you can produce the output when the system is working properly
(to a log file via the network console would be ideal), then when it hangs,
produce the output again.
You probably need to have a text console rather than a graphic console for it
to work.
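
As a rough sketch of the setup described above: the task dump that alt-sysrq-T produces can also be requested by writing to /proc/sysrq-trigger, and netconsole can forward the kernel log to another machine. The helper names, IP addresses, interface and MAC address are placeholders assumed for illustration; 6665 and 6666 are netconsole's usual default ports.

```sh
# Build the netconsole= module parameter in the form
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
netconsole_param() (
    src_ip=$1; dev=$2; tgt_ip=$3; tgt_mac=$4
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
)

# Enable all SysRq functions and request a task dump (same as alt-sysrq-T).
# ROOT is empty on a real system; it exists only so the function can be
# tried against a scratch directory.
sysrq_task_dump() (
    root=${1:-}
    echo 1 > "$root/proc/sys/kernel/sysrq" &&
    echo t > "$root/proc/sysrq-trigger"
)

# Example, as root (all addresses are placeholders):
#   modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
#   sysrq_task_dump
```

If the dump shows up in dmesg but not on the remote machine, the console log level may be filtering it; raising it (for example with `dmesg -n 8`) is worth trying.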


If it is hard-hanging with interrupts disabled, then it gets tricky.  I
thought there was some NMI-based lockup detector which would warn if that
happened, but I cannot find it just now.
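
The detector meant here is probably the kernel's hard-lockup detector (the NMI watchdog); that identification is an assumption, since the message does not name it. A minimal sketch for checking whether it is active reads the related sysctl files. Which of these files exist depends on the kernel version and configuration, so missing ones are simply skipped. The function name is hypothetical.

```sh
# Print the lockup-detector sysctls that exist on this kernel.
# ROOT is empty on a real system; it exists only for dry runs.
lockup_detector_status() (
    root=${1:-}
    for name in nmi_watchdog watchdog_thresh softlockup_panic hardlockup_panic; do
        f="$root/proc/sys/kernel/$name"
        if [ -r "$f" ]; then
            printf '%s=%s\n' "$name" "$(cat "$f")"
        fi
    done
)

# To switch the NMI watchdog on, as root:
#   echo 1 > /proc/sys/kernel/nmi_watchdog
```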

NeilBrown


* Re: RAID5 hard freeze
From: Denis Golovan @ 2014-02-26 20:52 UTC
  To: NeilBrown; +Cc: linux-raid

Ok.

I tried using alt-sysrq-T to produce a log in netconsole after the
hang, but could not.
The system just didn't respond.
When operating normally it worked fine (see attached file).

One new interesting observation though.

After the RAID hang and server reboot, the reconstruction process started.
Everything was as usual. However, one interesting thing happened - I
could not reproduce the crash/hang while the array was reconstructing! I
even put extra pressure on the array (I ran an extra process
writing to it).

At first I could not understand that. But then I realized that my
reconstruction process was using so much bandwidth that the write load
could not trigger the crash/hang. I used the following commands to force
quicker reconstruction:

   echo 100000 > /proc/sys/dev/raid/speed_limit_max
   echo 50000 > /proc/sys/dev/raid/speed_limit_min
   echo "idle" > /sys/block/md127/md/sync_action

Thus the reconstruction ran at about 100 MB/s (the speed_limit_* values are in KiB/s).
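
One way to check the rate the resync actually achieves, rather than the configured limit, is to read the `speed=` field that /proc/mdstat prints while a resync or recovery is running. This is a minimal sketch; the function name is hypothetical.

```sh
# Print the current resync/recovery speed (in K/sec) for each array that
# is syncing, taken from the "speed=NNNNK/sec" field of an mdstat file.
mdstat_speed() (
    file=${1:-/proc/mdstat}
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p' "$file"
)

# Example:
#   mdstat_speed
```

The function prints nothing when no array is currently resyncing.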

Then I decided to check this assumption and, while intensively writing
to the array and reconstructing simultaneously, I tried issuing the
following commands:

  echo 10000 > /proc/sys/dev/raid/speed_limit_max
  echo 10000 > /proc/sys/dev/raid/speed_limit_min
  echo "idle" > /sys/block/md127/md/sync_action


Guess what? After those were executed (see the last lines in the attached log),
just several seconds later - the crash happened.
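
The two sets of commands above differ only in the limits they write, so the experiment can be sketched as one small helper. The function name and the optional ROOT argument are assumptions added for illustration; the paths and the md127 default are the ones used in this message.

```sh
# Apply md resync speed limits (in KiB/s) and write "idle" to sync_action,
# mirroring the three echo commands above. ROOT is empty on a real system;
# it exists only so the function can be tried against a scratch tree.
set_md_sync_limits() (
    min=$1; max=$2; md=${3:-md127}; root=${4:-}
    [ "$min" -le "$max" ] || { echo "min must not exceed max" >&2; exit 1; }
    echo "$max" > "$root/proc/sys/dev/raid/speed_limit_max" &&
    echo "$min" > "$root/proc/sys/dev/raid/speed_limit_min" &&
    echo idle > "$root/sys/block/$md/md/sync_action"
)

# Fast reconstruction, as root:   set_md_sync_limits 50000 100000
# Throttled reconstruction:       set_md_sync_limits 10000 10000
```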

So, I think there is something wrong with filesystem/RAID co-operation.
You can see that while reconstructing at full speed, the pressure on the
filesystem is not enough to reproduce the crash.

Could you suggest how to move further in debugging the issue?

BR,
Denis

[-- Attachment #2: netconsole.txt.gz --]
[-- Type: application/x-gzip, Size: 62663 bytes --]


* Re: RAID5 hard freeze
From: Denis Golovan @ 2014-03-01 14:54 UTC
  To: NeilBrown; +Cc: linux-raid

Hi again

I was contacted by a person who suggested double-checking the
vfs_cache_pressure setting.
It turned out that I had this setting set to 10000. That was a
left-over from previously debugging an OOM-killer case.

When I removed this setting from sysctl.conf, I was able to greatly
increase the time to crash/freeze.
My server was able to withstand about a day of continuous write test.

Nevertheless, it froze after that.

Still it looks like something is wrong with RAID/filesystem co-operation.
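
For anyone repeating the vfs_cache_pressure check described above, a minimal sketch is to look for overrides in the sysctl configuration files and compare the running value against the kernel default of 100. The function name is hypothetical, and scanning only /etc/sysctl.conf by default is an assumption; distributions may also read files under /etc/sysctl.d/.

```sh
# List lines that set vm.vfs_cache_pressure in sysctl-style config files.
# Arguments: files to scan (defaults to /etc/sysctl.conf).
# Commented-out lines are ignored.
find_vfs_cache_pressure_overrides() (
    [ "$#" -gt 0 ] || set -- /etc/sysctl.conf
    grep -HnE '^[[:space:]]*vm[./]vfs_cache_pressure[[:space:]]*=' "$@"
)

# Running value:            cat /proc/sys/vm/vfs_cache_pressure
# Restore default, as root: sysctl -w vm.vfs_cache_pressure=100
```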

I would still like to debug the problem.
Please help.

BR,
Denis


* Re: RAID5 hard freeze
From: Bernd Schubert @ 2014-03-04 11:57 UTC
  To: Denis Golovan, NeilBrown; +Cc: linux-raid

On 03/01/2014 03:54 PM, Denis Golovan wrote:
> Hi again
>
> I was contacted by a person who suggested double-checking the
> vfs_cache_pressure setting.
> It turned out that I had this setting set to 10000. That was a
> left-over from previously debugging an OOM-killer case.
>
> When I removed this setting from sysctl.conf, I was able to greatly
> increase the time to crash/freeze.
> My server was able to withstand about a day of continuous write test.
>
> Nevertheless, it froze after that.
>
> Still it looks like something is wrong with RAID/filesystem co-operation.

But why should it be RAID/filesystem and not hardware related or another 
linux subsystem?

>
> I would still like to debug the problem.
> Please help.

The only way to get more help and information is probably to attach a 
real serial console (i.e. ipmi-sol).
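
A minimal sketch of what a serial console setup involves: the kernel is told on its command line to send console output to a serial port, and that port is then watched from another machine, directly or through IPMI serial-over-LAN. The function name, the port and the speed are assumptions; the port used for SOL varies by board.

```sh
# Build kernel command-line arguments that send console output to a serial
# port while keeping the local text console. Port and speed are placeholders.
serial_console_args() (
    port=${1:-ttyS0}; baud=${2:-115200}
    printf 'console=tty0 console=%s,%sn8\n' "$port" "$baud"
)

# Example: append the output of `serial_console_args ttyS1 115200` to the
# kernel command line in the boot loader configuration, then watch the port
# remotely with something like:
#   ipmitool -I lanplus -H <bmc-host> -U <user> sol activate
```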


Bernd



* Re: RAID5 hard freeze
From: Denis Golovan @ 2014-03-05 21:12 UTC
  To: Bernd Schubert; +Cc: NeilBrown, linux-raid

2014-03-04 13:57 GMT+02:00 Bernd Schubert <bernd.schubert@fastmail.fm>:

> But why should it be RAID/filesystem and not hardware related or another
> linux subsystem?
>

No idea. It's just my impression.
Direct access to the RAID array (using the dd utility) does not lead to freezes.
Reconstruction never fails either.
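
The message does not say whether the dd test read from or wrote to the array. As a sketch under that caveat, a read-only version is shown here, because writing to the raw device would destroy the filesystem on it. The function name and sizes are assumptions.

```sh
# Sequentially read MIB mebibytes from a block device (or file) and discard
# them. Read-only: nothing is written to the device.
raw_read_test() (
    dev=$1; mib=${2:-1024}
    dd if="$dev" of=/dev/null bs=1M count="$mib" 2>/dev/null
)

# Example, as root:
#   raw_read_test /dev/md127 4096
```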

>>
>> I would still like to debug the problem.
>> Please help.
>
>
> The only way to get more help and information is probably to attach a real
> serial console (i.e. ipmi-sol).

Does it work when a hard freeze happens?
I'd like to read a howto on debugging issues like that.

BR,
Denis
