linux-raid.vger.kernel.org archive mirror
* Spontaneous rebuild
@ 2007-12-02  4:12 Oliver Martin
  2007-12-02  9:45 ` Justin Piszcz
  2007-12-02 23:14 ` Neil Brown
  0 siblings, 2 replies; 8+ messages in thread
From: Oliver Martin @ 2007-12-02  4:12 UTC (permalink / raw)
  To: linux-raid; +Cc: Oliver Martin

[Please CC me on replies as I'm not subscribed]

Hello!

I've been experimenting with software RAID a bit lately, using two
external 500GB drives. One is connected via USB, one via Firewire. They
are set up as a RAID5 array with LVM on top so that I can easily add
more drives when I run out of space.
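
For reference, I set things up roughly like this (from memory, so
device names and sizes are approximate):

   mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/sdb1 /dev/sdc1
   pvcreate /dev/md0
   vgcreate vg0 /dev/md0
   lvcreate -n data -L 450G vg0
   # more drives can be added later with mdadm --add and mdadm --grow
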
About a day after the initial setup, things went belly up. First, EXT3
reported strange errors:
EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system
zone - blocks from 106561536, length 1
EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system
zone - blocks from 106561537, length 1
...

There were literally hundreds of these, and they came back immediately
when I reformatted the array. So I tried ReiserFS, which worked fine for
about a day. Then I got errors like these:
ReiserFS: warning: is_tree_node: node level 0 does not match to the
expected one 2
ReiserFS: dm-0: warning: vs-5150: search_by_key: invalid format found in
block 69839092. Fsck?
ReiserFS: dm-0: warning: vs-13070: reiserfs_read_locked_inode: i/o
failure occurred trying to find stat data of [6 10 0x0 SD]

Again, hundreds. So I ran badblocks on the LVM volume, and it reported
some bad blocks near the end. Running badblocks directly on the md
array turned up nothing, so I recreated the LVM stuff and attributed
the failures to undervolting experiments I had been doing (this is my
old laptop running as a server).
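
Concretely, that was something like this (assuming the volume group is
called vg0):

   badblocks -sv /dev/mapper/vg0-data   # the LVM volume: found bad blocks
   badblocks -sv /dev/md0               # the raw md array: came back clean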

Anyway, the problems are back: To test my theory that everything is
alright with the CPU running within its specs, I removed one of the
drives while copying some large files yesterday. Initially, everything
seemed to work out nicely, and by the morning, the rebuild had finished.
Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It
ran without gripes for some hours, but just now I saw md had started to
rebuild the array again out of the blue:

Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
using ehci_hcd and address 4
Dec  2 01:06:02 quassel kernel: md: data-check of RAID array md0
Dec  2 01:06:02 quassel kernel: md: minimum _guaranteed_  speed: 1000
KB/sec/disk.
Dec  2 01:06:02 quassel kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for data-check.
Dec  2 01:06:02 quassel kernel: md: using 128k window, over a total of
488383936 blocks.
Dec  2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device
using ehci_hcd and address 4

I'm not sure the USB resets are related to the problem - device 4-5.2 is
part of the array, but I get these sometimes at random intervals and
they don't seem to hurt normally. Besides, the first one was long before
the rebuild started, and the second one long afterwards.

Any ideas why md is rebuilding the array? And could this be related to
the bad blocks problem I had first? badblocks is still running, I'll
post an update when it is finished.
In the meantime, mdadm --detail /dev/md0 and mdadm --examine
/dev/sd[bc]1 don't give me any clues as to what went wrong: both disks
are marked as "active sync", and the whole array is "active, recovering".
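
For completeness, I'm watching the state with something like:

   cat /proc/mdstat
   cat /sys/block/md0/md/sync_action
   mdadm --detail /dev/md0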

Before I forget, I'm running 2.6.23.1 with this config:
http://stud4.tuwien.ac.at/~e0626486/config-2.6.23.1-hrt3-fw

Thanks,
Oliver


* Re: Spontaneous rebuild
  2007-12-02  4:12 Spontaneous rebuild Oliver Martin
@ 2007-12-02  9:45 ` Justin Piszcz
  2007-12-02 13:07   ` Oliver Martin
  2007-12-02 23:14 ` Neil Brown
  1 sibling, 1 reply; 8+ messages in thread
From: Justin Piszcz @ 2007-12-02  9:45 UTC (permalink / raw)
  To: Oliver Martin; +Cc: linux-raid



On Sun, 2 Dec 2007, Oliver Martin wrote:

> [Please CC me on replies as I'm not subscribed]
>
> Hello!
>
> I've been experimenting with software RAID a bit lately, using two
> external 500GB drives. One is connected via USB, one via Firewire. They
> are set up as a RAID5 array with LVM on top so that I can easily add
> more drives when I run out of space.
> About a day after the initial setup, things went belly up. First, EXT3
> reported strange errors:
> EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system
> zone - blocks from 106561536, length 1
> EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system
> zone - blocks from 106561537, length 1
> ...
>
> There were literally hundreds of these, and they came back immediately
> when I reformatted the array. So I tried ReiserFS, which worked fine for
> about a day. Then I got errors like these:
> ReiserFS: warning: is_tree_node: node level 0 does not match to the
> expected one 2
> ReiserFS: dm-0: warning: vs-5150: search_by_key: invalid format found in
> block 69839092. Fsck?
> ReiserFS: dm-0: warning: vs-13070: reiserfs_read_locked_inode: i/o
> failure occurred trying to find stat data of [6 10 0x0 SD]
>
> Again, hundreds. So I ran badblocks on the LVM volume, and it reported
> some bad blocks near the end. Running badblocks directly on the md
> array turned up nothing, so I recreated the LVM stuff and attributed
> the failures to undervolting experiments I had been doing (this is my
> old laptop running as a server).
>
> Anyway, the problems are back: To test my theory that everything is
> alright with the CPU running within its specs, I removed one of the
> drives while copying some large files yesterday. Initially, everything
> seemed to work out nicely, and by the morning, the rebuild had finished.
> Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It
> ran without gripes for some hours, but just now I saw md had started to
> rebuild the array again out of the blue:
>
> Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
> using ehci_hcd and address 4
> Dec  2 01:06:02 quassel kernel: md: data-check of RAID array md0
> Dec  2 01:06:02 quassel kernel: md: minimum _guaranteed_  speed: 1000
> KB/sec/disk.
> Dec  2 01:06:02 quassel kernel: md: using maximum available idle IO
> bandwidth (but not more than 200000 KB/sec) for data-check.
> Dec  2 01:06:02 quassel kernel: md: using 128k window, over a total of
> 488383936 blocks.
> Dec  2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device
> using ehci_hcd and address 4
>
> I'm not sure the USB resets are related to the problem - device 4-5.2 is
> part of the array, but I get these sometimes at random intervals and
> they don't seem to hurt normally. Besides, the first one was long before
> the rebuild started, and the second one long afterwards.
>
> Any ideas why md is rebuilding the array? And could this be related to
> the bad blocks problem I had first? badblocks is still running, I'll
> post an update when it is finished.
> In the meantime, mdadm --detail /dev/md0 and mdadm --examine
> /dev/sd[bc]1 don't give me any clues as to what went wrong: both disks
> are marked as "active sync", and the whole array is "active, recovering".
>
> Before I forget, I'm running 2.6.23.1 with this config:
> http://stud4.tuwien.ac.at/~e0626486/config-2.6.23.1-hrt3-fw
>
> Thanks,
> Oliver

It rebuilds the array because 'something' is causing device 
resets/timeouts on your USB device:

Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
using ehci_hcd and address 4

Naturally, when it is reset, the device is disconnected and then
reappears; when MD sees this, it rebuilds the array.

Why the device is timing out/resetting is what you need to find out.
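
To track it down, you could watch the kernel log while the array is
under load; something like (assuming Debian-style logging to
/var/log/kern.log):

   dmesg | grep 'usb 4-5.2'
   grep -E 'usb 4-5|md0' /var/log/kern.log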

Justin.


* Re: Spontaneous rebuild
  2007-12-02  9:45 ` Justin Piszcz
@ 2007-12-02 13:07   ` Oliver Martin
  2007-12-02 21:16     ` Janek Kozicki
  0 siblings, 1 reply; 8+ messages in thread
From: Oliver Martin @ 2007-12-02 13:07 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid

Justin Piszcz schrieb:
> 
> It rebuilds the array because 'something' is causing device
> resets/timeouts on your USB device:
> 
> Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
> using ehci_hcd and address 4
> 
> Naturally, when it is reset, the device is disconnected and then
> reappears; when MD sees this, it rebuilds the array.
> 
> Why the device is timing out/resetting is what you need to find out.
> 
> Justin.
> 

Thanks for your answer, I'll investigate the USB resets. Still, it seems
strange that the rebuild only started five hours after the reset. Is
this normal?
The reason I said the resets don't seem to hurt is that I also get them
for a second disk (not in a RAID): file transfers aren't interrupted, I
haven't (yet?) seen any data corruption, and other than the message, the
kernel doesn't seem to mind at all.

BTW, this time, badblocks ran through without any errors. The only
strange thing remaining is the rebuild.


Oliver


* Re: Spontaneous rebuild
  2007-12-02 13:07   ` Oliver Martin
@ 2007-12-02 21:16     ` Janek Kozicki
  0 siblings, 0 replies; 8+ messages in thread
From: Janek Kozicki @ 2007-12-02 21:16 UTC (permalink / raw)
  To: linux-raid

> Justin Piszcz schrieb:
> >
> > Naturally, when it is reset, the device is disconnected and then
> > reappears; when MD sees this, it rebuilds the array.

The least you can do is add an internal write-intent bitmap to your
RAID; it will make rebuilds after a transient disconnect much
faster.... :-/
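
Assuming the array is /dev/md0, it is a one-liner:

   mdadm --grow --bitmap=internal /dev/md0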

-- 
Janek Kozicki                                                         |


* Re: Spontaneous rebuild
  2007-12-02  4:12 Spontaneous rebuild Oliver Martin
  2007-12-02  9:45 ` Justin Piszcz
@ 2007-12-02 23:14 ` Neil Brown
  2007-12-02 23:28   ` Justin Piszcz
  2007-12-03  2:35   ` Oliver Martin
  1 sibling, 2 replies; 8+ messages in thread
From: Neil Brown @ 2007-12-02 23:14 UTC (permalink / raw)
  To: Oliver Martin; +Cc: linux-raid

On Sunday December 2, oliver.martin@student.tuwien.ac.at wrote:
> 
> Anyway, the problems are back: To test my theory that everything is
> alright with the CPU running within its specs, I removed one of the
> drives while copying some large files yesterday. Initially, everything
> seemed to work out nicely, and by the morning, the rebuild had finished.
> Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It
> ran without gripes for some hours, but just now I saw md had started to
> rebuild the array again out of the blue:
> 
> Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
> using ehci_hcd and address 4
> Dec  2 01:06:02 quassel kernel: md: data-check of RAID array md0
                                      ^^^^^^^^^^
> Dec  2 01:06:02 quassel kernel: md: minimum _guaranteed_  speed: 1000
> KB/sec/disk.
> Dec  2 01:06:02 quassel kernel: md: using maximum available idle IO
> bandwidth (but not more than 200000 KB/sec) for data-check.
                                                  ^^^^^^^^^^
> Dec  2 01:06:02 quassel kernel: md: using 128k window, over a total of
> 488383936 blocks.
> Dec  2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device
> using ehci_hcd and address 4
> 

This isn't a resync, it is a data check.  "Dec  2" is the first Sunday
of the month.  You probably have a crontab entry that does
   echo check > /sys/block/mdX/md/sync_action

early on the first Sunday of the month.  I know that Debian does this.

It is good to do this occasionally to catch sleeping bad blocks.
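
(I believe the Debian entry lives in /etc/cron.d/mdadm and calls the
mdadm package's checkarray script.  After a check finishes, you can
read the number of inconsistencies it found with:

   cat /sys/block/mdX/md/mismatch_cnt
)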

NeilBrown


* Re: Spontaneous rebuild
  2007-12-02 23:14 ` Neil Brown
@ 2007-12-02 23:28   ` Justin Piszcz
  2007-12-03  0:15     ` Richard Scobie
  2007-12-03  2:35   ` Oliver Martin
  1 sibling, 1 reply; 8+ messages in thread
From: Justin Piszcz @ 2007-12-02 23:28 UTC (permalink / raw)
  To: Neil Brown; +Cc: Oliver Martin, linux-raid



On Mon, 3 Dec 2007, Neil Brown wrote:

> On Sunday December 2, oliver.martin@student.tuwien.ac.at wrote:
>>
>> Anyway, the problems are back: To test my theory that everything is
>> alright with the CPU running within its specs, I removed one of the
>> drives while copying some large files yesterday. Initially, everything
>> seemed to work out nicely, and by the morning, the rebuild had finished.
>> Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It
>> ran without gripes for some hours, but just now I saw md had started to
>> rebuild the array again out of the blue:
>>
>> Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
>> using ehci_hcd and address 4
>> Dec  2 01:06:02 quassel kernel: md: data-check of RAID array md0
>                                      ^^^^^^^^^^
>> Dec  2 01:06:02 quassel kernel: md: minimum _guaranteed_  speed: 1000
>> KB/sec/disk.
>> Dec  2 01:06:02 quassel kernel: md: using maximum available idle IO
>> bandwidth (but not more than 200000 KB/sec) for data-check.
>                                                  ^^^^^^^^^^
>> Dec  2 01:06:02 quassel kernel: md: using 128k window, over a total of
>> 488383936 blocks.
>> Dec  2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device
>> using ehci_hcd and address 4
>>
>
> This isn't a resync, it is a data check.  "Dec  2" is the first Sunday
> of the month.  You probably have a crontab entry that does
>   echo check > /sys/block/mdX/md/sync_action
>
> early on the first Sunday of the month.  I know that Debian does this.
>
> It is good to do this occasionally to catch sleeping bad blocks.

While we are on the subject of bad blocks, is it possible to do what 3ware 
raid controllers do without an external card?

They know when a block is bad and remap it to another part of the
array, whereas with software RAID you never know this is happening
until the disk is dead.

For example, with 3dm2, if you have e-mail alerts set to 2 (warn), it
will e-mail you every time there is a sector reallocation.  Is this
possible with software RAID, or does it *require* HW RAID/an external
controller?

Justin.


* Re: Spontaneous rebuild
  2007-12-02 23:28   ` Justin Piszcz
@ 2007-12-03  0:15     ` Richard Scobie
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Scobie @ 2007-12-03  0:15 UTC (permalink / raw)
  To: linux-raid

Justin Piszcz wrote:

> While we are on the subject of bad blocks, is it possible to do what 
> 3ware raid controllers do without an external card?
> 
> They know when a block is bad and remap it to another part of the
> array, whereas with software RAID you never know this is happening
> until the disk is dead.

Are you sure the 3ware software is remapping the bad blocks, or is it 
just reporting the bad blocks were remapped?

As I understand it, bad block remapping (reallocated sectors) is done
internally at the drive level.

Perhaps all 3ware is doing is periodically running the SMART query for
reallocated sectors on all drives and reporting any changes?
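
With software RAID, smartmontools can give you much the same alert.  A
minimal sketch, assuming the drive is /dev/sdb and substituting your
own address, would be an /etc/smartd.conf entry like:

   # monitor all SMART attributes, mail warnings to this address
   /dev/sdb -a -m admin@example.com

or a one-off query:

   smartctl -A /dev/sdb | grep -i reallocated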

Regards,

Richard


* Re: Spontaneous rebuild
  2007-12-02 23:14 ` Neil Brown
  2007-12-02 23:28   ` Justin Piszcz
@ 2007-12-03  2:35   ` Oliver Martin
  1 sibling, 0 replies; 8+ messages in thread
From: Oliver Martin @ 2007-12-03  2:35 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown schrieb:
> 
> This isn't a resync, it is a data check.  "Dec  2" is the first Sunday
> of the month.  You probably have a crontab entry that does
>    echo check > /sys/block/mdX/md/sync_action
> 
> early on the first Sunday of the month.  I know that Debian does this.
> 
> It is good to do this occasionally to catch sleeping bad blocks.
> 
Duh, thanks for clearing this up. I guess what set off the alarm was
seeing what looked like a rebuild while I was stress testing. Yes, I'm
running Debian and I have exactly this entry in my crontab... Perhaps
they should add a short log entry like "starting periodic RAID check" so
that people know there is nothing to worry about.

Or maybe I should just RTFC (read the fine crontab) ;-)
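
(It is easy enough to find, with something like:

   grep -r mdadm /etc/cron.d/
)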

Oliver


Thread overview: 8 messages
2007-12-02  4:12 Spontaneous rebuild Oliver Martin
2007-12-02  9:45 ` Justin Piszcz
2007-12-02 13:07   ` Oliver Martin
2007-12-02 21:16     ` Janek Kozicki
2007-12-02 23:14 ` Neil Brown
2007-12-02 23:28   ` Justin Piszcz
2007-12-03  0:15     ` Richard Scobie
2007-12-03  2:35   ` Oliver Martin
