mdadm freezes the system

* mdadm freezes the system
@ 2010-06-08  8:59 Roman Mamedov
  2010-06-08 16:24 ` Roman Mamedov
  0 siblings, 1 reply; 6+ messages in thread
From: Roman Mamedov @ 2010-06-08  8:59 UTC (permalink / raw)
  To: linux-raid

Hello.

I am having a strange issue with md RAID on the 2.6.34 kernel. To be
specific, it sometimes locks up the system completely, with the following
symptoms:
- any attempt to read from an array seems to never return
- no errors at all on the server console
- in one lock-up episode I had "top" running, which displayed zero CPU
  load (no mdX_raidX in sight on top of the CPU-load sorted list)
- Alt-SysRq-B works, and allows me to reboot the system

Now, regarding when this happens. I had two such lock-ups shortly after moving
my root FS to RAID5; after the first one I changed the FS from XFS to Ext4
(this did not help), and after the second one I disabled NCQ on all drives and
the write intent bitmap on the array. After that, it worked for maybe a week of
intense reads/writes onto the arrays with no more hangs.
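
For readers who want to try the same two mitigations, here is a minimal
dry-run sketch. It only prints the commands and does not execute them. The
device names (sda, sdb, sdc, /dev/md0) are placeholders and are not taken from
the message. Setting queue_depth to 1 through sysfs is the usual way to turn
off NCQ on libata drives, and mdadm --grow with --bitmap=none removes a
write-intent bitmap.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands that disable NCQ
# and remove the write-intent bitmap. All device names are placeholders.

# Print the command that limits a drive's queue depth to 1 (NCQ off).
ncq_off_cmd() {
    # $1: block device name without /dev/, e.g. sda (hypothetical)
    printf 'echo 1 > /sys/block/%s/device/queue_depth\n' "$1"
}

# Print the command that removes the write-intent bitmap from an array.
bitmap_off_cmd() {
    # $1: md device, e.g. /dev/md0 (hypothetical)
    printf 'mdadm --grow %s --bitmap=none\n' "$1"
}

for disk in sda sdb sdc; do
    ncq_off_cmd "$disk"
done
bitmap_off_cmd /dev/md0
```

The printed lines would have to be run as root. The queue_depth setting does
not survive a reboot, so it would need to be reapplied at boot.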

Today, I have decided to convert a three-member RAID5 into a four-member
RAID6. mdadm segfaulted(!) right after the --grow command, and dmesg had
an error about md being unable to overwrite the /sys/.....stripe_cache_size
file. (As I understand, this is already fixed in the latest kernel).

The array then started rebuilding as 4-member RAID6 seemingly fine, but
shortly after, the system locked up in the same manner as described above.

Several attempts to do the rebuild after reboots consistently caused the same
lock-ups early in the rebuild (at less than 1% done). So for now, I decided to
give up and returned the array to its previous RAID5 three-member
configuration, which went fine.

The configuration:
md0 is 3* 1990GB RAID5
md1 is 3* 10GB RAID5 (root FS)
The three drives are 2* WD20EADS and 1* Hitachi 2TB. The fourth array member I
was trying to add to md0 is a RAID0 of two 1TB drives (Seagate and Hitachi).
The SATA controllers are the nForce4 chipset and a PCI-E JMicron JMB363. I am
using mdadm 3.1.2 now, and am going to try the 2.6.35-rc2 kernel.
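
As a side note on the sizes involved, the sketch below works out the usable
capacity before and after the attempted conversion. It assumes "3* 1990GB"
means three members of 1990GB each, and it uses the standard rule that RAID5
spends one member on parity and RAID6 spends two. The function name is made up
for this illustration.

```shell
#!/bin/sh
# Usable capacity in GB for an md array of equal-sized members.
# RAID5 spends one member on parity, RAID6 spends two.
usable_gb() {
    # $1: RAID level (5 or 6), $2: member count, $3: member size in GB
    case "$1" in
        5) parity=1 ;;
        6) parity=2 ;;
        *) echo "unsupported level: $1" >&2; return 1 ;;
    esac
    echo $(( ($2 - parity) * $3 ))
}

usable_gb 5 3 1990   # md0 as described: three 1990GB members
usable_gb 6 4 1990   # the attempted four-member RAID6
```

Both calls give the same figure, which is why going from a three-member RAID5
to a four-member RAID6 adds redundancy without changing the array size.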

So, my question is, does anyone have an idea on what could cause this, and what
would be the best way to diagnose/fix the lockup problem?  Thanks in advance.

-- 
With respect,
Roman

* Re: mdadm freezes the system
  2010-06-08  8:59 mdadm freezes the system Roman Mamedov
@ 2010-06-08 16:24 ` Roman Mamedov
  2010-06-10 18:43   ` Roman Mamedov
  0 siblings, 1 reply; 6+ messages in thread
From: Roman Mamedov @ 2010-06-08 16:24 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid

On Tue, 8 Jun 2010 14:59:13 +0600
Roman Mamedov <roman@rm.pp.ru> wrote:

> Today, I have decided to convert a three-member RAID5 into a four-member
> RAID6. mdadm segfaulted(!) right after the --grow command, and dmesg had
> an error about md being unable to overwrite the /sys/.....stripe_cache_size
> file. (As I understand, this is already fixed in the latest kernel).
> 
> The array then started rebuilding as 4-member RAID6 seemingly fine, but
> shortly after, the system locked up in the same manner as described above.

Interestingly though, when I attempted that reshape in 2.6.34 (complete with
the described segfault), the array _instantly_ became a 4-disk RAID6 with a
rebuilding spare, and the process was running at about 50 MB/sec. And I was
able to then remove that spare and shrink the array back to --level=5 and
--raid-devices=3, instantly too.

But when I rebooted to 2.6.35-rc2, the same --grow command I used initially
(--level=6 --raid-devices=4) did not produce a segfault but failed, asking
for the "backup file" to be specified. After I added the --backup-file
switch, it started a slow "Reshape" process, going at about 6 MBytes per
second. (This too caused a lockup of the kind I described earlier.)
Apparently there is no way to abort this process now, so I paused it using
echo idle > /sys/.....sync_action, and am copying data away from the array to
recreate it from scratch.
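
A minimal sketch of checking what the kernel reports for a paused or running
reshape follows. The sysfs path is elided in the message. The sketch assumes
the standard md layout, /sys/block/<array>/md, and uses md0 as a placeholder
array name. The helper name is invented for this example.

```shell
#!/bin/sh
# Report the sync state of an md array from its sysfs directory.
# The directory is passed in so the function can be pointed at a test tree.
md_sync_state() {
    # $1: md sysfs directory, assumed to be /sys/block/<array>/md
    dir=$1
    if [ ! -r "$dir/sync_action" ]; then
        echo "no sync_action under $dir" >&2
        return 1
    fi
    action=$(cat "$dir/sync_action")
    # sync_completed shows "done / total" in sectors while a sync is running
    completed=$(cat "$dir/sync_completed" 2>/dev/null || echo unknown)
    echo "action=$action completed=$completed"
}

# Pausing, as done in the message above, would need root:
#   echo idle > /sys/block/md0/md/sync_action
# Then check what the kernel reports (md0 is a placeholder name):
md_sync_state /sys/block/md0/md || true
```

On a machine without an md0 array the last line just prints a notice to
stderr and carries on.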

So why did the same RAID5 to RAID6 conversion start so differently in these
two cases? And is it even possible to reshape RAID5 to RAID6 while
simultaneously adding a disk, without overwriting all the other disks'
contents (it surely looked like this is what was happening in the first case)?

-- 
With respect,
Roman

* Re: mdadm freezes the system
  2010-06-08 16:24 ` Roman Mamedov
@ 2010-06-10 18:43   ` Roman Mamedov
  2010-06-16  7:03     ` Michael Evans
  0 siblings, 1 reply; 6+ messages in thread
From: Roman Mamedov @ 2010-06-10 18:43 UTC (permalink / raw)
  To: linux-raid

Hello.

To provide some updates on my earlier questions (for future generations
googling through the archives).

> So, my question is, does anyone have an idea on what could cause this, and
> what would be the best way to diagnose/fix the lockup problem?  Thanks in
> advance.

Since then I have re-created the original 3-device RAID5, and it synced
without lockups. Perhaps I should write the lockups off as some hardware bug,
especially considering that the nForce chipsets aren't exactly known as the
gold standard of bug-free design.

> So why did the same RAID5 to RAID6 conversion start so differently in these
> two cases? And is it even possible to reshape RAID5 to RAID6 while
> simultaneously adding a disk, without overwriting all the other disks'
> contents (it surely looked like this is what was happening in the first case)?

Yes it is possible; I should have read about the --layout=preserve option in
the mdadm man page.
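
To make the difference between the two conversions concrete, here is a
dry-run sketch that builds both command lines without running them. As far as
the mdadm man page describes it, --layout=preserve leaves the array in the
intermediate RAID6 layout, with the second parity block on the last device, so
the existing members are not re-striped. Without it, a full reshape is done,
which is where --backup-file comes in. The array name, the backup path, and
the helper and mode names are placeholders for illustration. Actual behaviour
depends on the mdadm and kernel versions.

```shell
#!/bin/sh
# Dry-run sketch: build (but do not run) an mdadm --grow command line for a
# RAID5 -> RAID6 conversion. Array name and backup path are placeholders.
grow_to_raid6_cmd() {
    # $1: array, $2: new member count, $3: "preserve" or "restripe",
    # $4: backup file (only used for "restripe")
    cmd="mdadm --grow $1 --level=6 --raid-devices=$2"
    case "$3" in
        preserve) cmd="$cmd --layout=preserve" ;;
        restripe) cmd="$cmd --backup-file=$4" ;;
        *) echo "unknown mode: $3" >&2; return 1 ;;
    esac
    echo "$cmd"
}

grow_to_raid6_cmd /dev/md0 4 preserve
grow_to_raid6_cmd /dev/md0 4 restripe /root/md0-grow.backup
```

In either case the new member would have to be added to the array as a spare
before the --grow command is issued.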

-- 
With respect,
Roman

* Re: mdadm freezes the system
  2010-06-10 18:43   ` Roman Mamedov
@ 2010-06-16  7:03     ` Michael Evans
  2010-06-16  7:16       ` Roman Mamedov
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Evans @ 2010-06-16  7:03 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid

Have you exhaustively tested your drives as well?

The smartctl -t long test is a good one to start with.  If a drive fails it,
sometimes re-writing the sector(s) in the area of the failure allows
the drive to resolve the problem by using one of its spare sectors
(the drive hides a handful of them from normal use and keeps them in
reserve for just such occasions).  However, I recently had a drive that I
had to pull out and 'zero' (erase/wipe) completely with the manufacturer's
utility; it then finally passed a similar test within
the drive utility.  I expect it to resync without further issue.

* Re: mdadm freezes the system
  2010-06-16  7:03     ` Michael Evans
@ 2010-06-16  7:16       ` Roman Mamedov
  2010-06-16 11:47         ` Billy Crook
  0 siblings, 1 reply; 6+ messages in thread
From: Roman Mamedov @ 2010-06-16  7:16 UTC (permalink / raw)
  To: Michael Evans; +Cc: linux-raid

On Wed, 16 Jun 2010 00:03:54 -0700
Michael Evans <mjevans1983@gmail.com> wrote:

> Have you exhaustively tested your drives as well?
> 
> The smartctl -t long test is a good one to start with.

I run that first thing after buying every new disk. The drives are okay, and
even if they weren't, they should absolutely not cause a system lock-up of
the kind I was experiencing.

I believe what I have is either a controller bug, or this:
https://bugzilla.redhat.com/show_bug.cgi?id=602457

-- 
With respect,
Roman

* Re: mdadm freezes the system
  2010-06-16  7:16       ` Roman Mamedov
@ 2010-06-16 11:47         ` Billy Crook
  0 siblings, 0 replies; 6+ messages in thread
From: Billy Crook @ 2010-06-16 11:47 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Michael Evans, linux-raid

On Wed, Jun 16, 2010 at 02:16, Roman Mamedov <roman@rm.pp.ru> wrote:
> On Wed, 16 Jun 2010 00:03:54 -0700
> Michael Evans <mjevans1983@gmail.com> wrote:
>
>> Have you exhaustively tested your drives as well?
>>
>> The smartctl -t long test is a good one to start with.

smartctl -t short /dev/sdX ; sleep 120; smartctl -t conveyance /dev/sdX

Most SMART failures I've seen fail the short test, and it only takes
two minutes.  Conveyance isn't much longer.  If either test actually
fails (not just gets interrupted), RMA it.
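
A small dry-run sketch of the test sequence discussed here follows. It prints
the smartctl command that starts a self-test and the one that reads back the
self-test log afterwards. /dev/sdX is a placeholder, the helper name is made
up for this example, and nothing is executed against a real drive.

```shell
#!/bin/sh
# Dry-run sketch: print the smartctl commands for a self-test and for
# reading back its result. /dev/sdX is a placeholder device.
selftest_cmds() {
    # $1: device, $2: test type (short, long or conveyance)
    case "$2" in
        short|long|conveyance) ;;
        *) echo "unknown test type: $2" >&2; return 1 ;;
    esac
    echo "smartctl -t $2 $1"          # start the self-test
    echo "smartctl -l selftest $1"    # later: show the self-test log
}

selftest_cmds /dev/sdX short
selftest_cmds /dev/sdX conveyance
```

The log command only shows a result once the drive has finished the test, so
it would be run after the wait the drive reports when the test is started.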

> I run that first thing after buying every new disk. The drives are okay, and
> even if they weren't, they should absolutely not cause a system lock-up of
> the kind I was experiencing.

Slightly off topic, but I wrote a script I thought I'd share, which I
use on all new drives.  How long it runs depends on the drive's
sustained read/write speed, but it only takes a couple of
seconds to start.
http://www.kclug.org/wiki/index.php/Hard_Drive_Hell
