* Raid 5 - not clean and then a failure.
@ 2009-08-25 7:54 Jon Hardcastle
2009-08-25 8:16 ` Robin Hill
2009-08-26 11:18 ` Goswin von Brederlow
0 siblings, 2 replies; 16+ messages in thread
From: Jon Hardcastle @ 2009-08-25 7:54 UTC (permalink / raw)
To: linux-raid
Guys,
I have been having some problems with my arrays that I think I have nailed down to a PCI controller (well, I say that - it is always the drives connected to *a* controller, but I have tried two!). Anyway, the latest saga: I was trying some new kernel options last night, which didn't work.
But when I booted up again this morning, it said one of the drives was in an inconsistent state (I'm not sure of the *exact* error message). I then kicked off an add of the drive and it started syncing. It got about 5% in, and then the second drive on that controller complained and the array failed.
Is there any hope for my data? If I get a good controller in there, will the resync continue? Can I tell it to assume the drives are good (which they ought to be)?
Please help!
-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------
* Re: Raid 5 - not clean and then a failure.
2009-08-25 7:54 Raid 5 - not clean and then a failure Jon Hardcastle
@ 2009-08-25 8:16 ` Robin Hill
2009-08-25 8:40 ` Jon Hardcastle
2009-08-26 11:02 ` Jon Hardcastle
2009-08-26 11:18 ` Goswin von Brederlow
1 sibling, 2 replies; 16+ messages in thread
From: Robin Hill @ 2009-08-25 8:16 UTC (permalink / raw)
To: linux-raid
On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
> Guys,
>
> I have been having some problems with my arrays that I think I have
> nailed down to a PCI controller (well, I say that - it is always the
> drives connected to *a* controller, but I have tried two!). Anyway,
> the latest saga: I was trying some new kernel options last night,
> which didn't work.
>
Did they have the same chipset? I had problems with PCI controllers on
one of my systems, which turned out to be some sort of conflict between
the onboard chipset and the chipset on the controllers. I found a PCI
card with a different chipset and have had no issues since.
> But when I booted up again this morning, it said one of the drives
> was in an inconsistent state (I'm not sure of the *exact* error
> message). I then kicked off an add of the drive and it started
> syncing. It got about 5% in, and then the second drive on that
> controller complained and the array failed.
>
> Is there any hope for my data? If I get a good controller in there,
> will the resync continue? Can I tell it to assume the drives are
> good (which they ought to be)?
>
There's definitely hope. You can assemble the array (using the good
drives and the last drive to fail) using the --force option, then re-add
(and sync) the other drive (I'd recommend doing a fsck on the filesystem
as well). I've just had to do a similar thing myself after two drives
failed (overheated after a fan failure).
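Robin's sequence can be sketched as mdadm commands. All device names here (/dev/md0, /dev/sd[a-f]1) are placeholders for illustration only - check `mdadm --examine` output on your own drives before running anything like this:

```
# Stop whatever is left of the failed array, then force-assemble from the
# good drives plus the LAST drive to fail, leaving out the first drive
# that dropped (here, hypothetically, sdf1).
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# Check the filesystem before trusting the data (read-only first pass).
fsck -n /dev/md0

# Re-add the remaining drive and let it resync.
mdadm /dev/md0 --add /dev/sdf1
cat /proc/mdstat    # watch recovery progress
```

The key point is --force: it tells mdadm to assemble even though the event counts on the members disagree, trusting the freshest drives.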
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: Raid 5 - not clean and then a failure.
2009-08-25 8:16 ` Robin Hill
@ 2009-08-25 8:40 ` Jon Hardcastle
2009-08-25 9:34 ` Robin Hill
2009-08-25 13:47 ` John Robinson
2009-08-26 11:02 ` Jon Hardcastle
1 sibling, 2 replies; 16+ messages in thread
From: Jon Hardcastle @ 2009-08-25 8:40 UTC (permalink / raw)
To: linux-raid, Robin Hill
--- On Tue, 25/8/09, Robin Hill <robin@robinhill.me.uk> wrote:
> From: Robin Hill <robin@robinhill.me.uk>
> Subject: Re: Raid 5 - not clean and then a failure.
> To: linux-raid@vger.kernel.org
> Date: Tuesday, 25 August, 2009, 9:16 AM
> On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
>
> > Guys,
> >
> > I have been having some problems with my arrays that I think I have
> > nailed down to a PCI controller (well, I say that - it is always the
> > drives connected to *a* controller, but I have tried two!). Anyway,
> > the latest saga: I was trying some new kernel options last night,
> > which didn't work.
> >
> Did they have the same chipset? I had problems with PCI controllers on
> one of my systems, which turned out to be some sort of conflict between
> the onboard chipset and the chipset on the controllers. I found a PCI
> card with a different chipset and have had no issues since.
They are/were cheap little VIA ones from 'Aria'. I got a new one and installed it alongside the old one a week ago, with one drive on it to test; when my array came to do a scrub a week later, I got a whole host of issues. I am not sure what the cause was, but now neither of the controllers seems to work reliably. I have a PCI Express controller, but my kernel doesn't (yet!) support PCI Express. Do you know if you can get SATA 3 on PCI? Or is it too slow?
> > But when I booted up again this morning, it said one of the drives
> > was in an inconsistent state (I'm not sure of the *exact* error
> > message). I then kicked off an add of the drive and it started
> > syncing. It got about 5% in, and then the second drive on that
> > controller complained and the array failed.
> >
> > Is there any hope for my data? If I get a good controller in there,
> > will the resync continue? Can I tell it to assume the drives are
> > good (which they ought to be)?
> >
> There's definitely hope. You can assemble the array (using the good
> drives and the last drive to fail) using the --force option, then
> re-add (and sync) the other drive (I'd recommend doing a fsck on the
> filesystem as well). I've just had to do a similar thing myself after
> two drives failed (overheated after a fan failure).
>
> Cheers,
> Robin
> --
Thank you, thank you, thank you. I probably won't look at this for a few days now - I find that sitting down without enough time to really see it through is when I get problems, and I am a bit of a busy bee at the moment!
-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Raid 5 - not clean and then a failure.
2009-08-25 8:40 ` Jon Hardcastle
@ 2009-08-25 9:34 ` Robin Hill
2009-08-25 13:47 ` John Robinson
1 sibling, 0 replies; 16+ messages in thread
From: Robin Hill @ 2009-08-25 9:34 UTC (permalink / raw)
To: linux-raid
On Tue Aug 25, 2009 at 01:40:31AM -0700, Jon Hardcastle wrote:
> --- On Tue, 25/8/09, Robin Hill <robin@robinhill.me.uk> wrote:
>
> > From: Robin Hill <robin@robinhill.me.uk>
> > Subject: Re: Raid 5 - not clean and then a failure.
> > To: linux-raid@vger.kernel.org
> > Date: Tuesday, 25 August, 2009, 9:16 AM
> > On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
> >
> > > Guys,
> > >
> > > I have been having some problems with my arrays that I think I have
> > > nailed down to a PCI controller (well, I say that - it is always
> > > the drives connected to *a* controller, but I have tried two!).
> > > Anyway, the latest saga: I was trying some new kernel options last
> > > night, which didn't work.
> > >
> > Did they have the same chipset? I had problems with PCI controllers
> > on one of my systems, which turned out to be some sort of conflict
> > between the onboard chipset and the chipset on the controllers. I
> > found a PCI card with a different chipset and have had no issues
> > since.
>
> They are/were cheap little VIA ones from 'Aria'. I got a new one and
> installed it alongside the old one a week ago, with one drive on it to
> test; when my array came to do a scrub a week later, I got a whole host
> of issues. I am not sure what the cause was, but now neither of the
> controllers seems to work reliably. I have a PCI Express controller,
> but my kernel doesn't (yet!) support PCI Express. Do you know if you
> can get SATA 3 on PCI? Or is it too slow?
>
By SATA 3, I assume you're actually referring to SATA 3GBit/s (as SATA 3
has only just been ratified, and I doubt you can get it at all yet)?
You'll probably be able to find PCI cards that support it, but the
standard PCI bus (32-bit, 33MHz) only has a bandwidth of 1 GBit/s, so
can't even keep up with 1.5GBit/s SATA, let alone 3GBit/s. A 64-bit or
66MHz bus would do better though.
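The arithmetic behind those numbers can be sketched in a few lines (a back-of-envelope check: figures are peak rates, real PCI throughput is lower still due to protocol overhead, and the 80% payload factor reflects SATA's 8b/10b line encoding):

```python
# Back-of-envelope comparison of PCI bus bandwidth vs SATA payload rates.

def bus_bandwidth_mbit(width_bits, clock_mhz):
    """Peak bandwidth of a parallel bus in Mbit/s: width times clock."""
    return width_bits * clock_mhz

pci_32_33 = bus_bandwidth_mbit(32, 33)   # standard PCI slot: 1056 Mbit/s
pci_64_66 = bus_bandwidth_mbit(64, 66)   # 64-bit/66MHz slot: 4224 Mbit/s

# SATA uses 8b/10b encoding, so only 80% of the line rate carries payload.
sata_150_payload = 1500 * 0.8            # "SATA 1.5Gbit/s": 1200 Mbit/s
sata_300_payload = 3000 * 0.8            # "SATA 3Gbit/s":   2400 Mbit/s

print(f"PCI 32-bit/33MHz : {pci_32_33} Mbit/s peak")
print(f"PCI 64-bit/66MHz : {pci_64_66} Mbit/s peak")
print(f"SATA 1.5G payload: {sata_150_payload:.0f} Mbit/s")
print(f"SATA 3G payload  : {sata_300_payload:.0f} Mbit/s")
```

So a plain 32-bit/33MHz PCI slot (~1 Gbit/s) is the bottleneck even for a single 1.5 Gbit/s SATA link, while a 64-bit/66MHz slot has headroom for 3 Gbit/s.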
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: Raid 5 - not clean and then a failure.
2009-08-25 8:40 ` Jon Hardcastle
2009-08-25 9:34 ` Robin Hill
@ 2009-08-25 13:47 ` John Robinson
2009-08-25 14:11 ` Jon Hardcastle
1 sibling, 1 reply; 16+ messages in thread
From: John Robinson @ 2009-08-25 13:47 UTC (permalink / raw)
To: Jon; +Cc: linux-raid
On 25/08/2009 09:40, Jon Hardcastle wrote:
> --- On Tue, 25/8/09, Robin Hill <robin@robinhill.me.uk> wrote:
>> On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
>>> I have been having some problems with my arrays that I think I have
>>> nailed down to a PCI controller (well, I say that - it is always the
>>> drives connected to *a* controller, but I have tried two!). Anyway,
>>> the latest saga: I was trying some new kernel options last night,
>>> which didn't work.
>>
>> Did they have the same chipset? I had problems with PCI controllers on
>> one of my systems, which turned out to be some sort of conflict between
>> the onboard chipset and the chipset on the controllers. I found a PCI
>> card with a different chipset and have had no issues since.
>
> They are/were cheap little VIA ones
That's your problem right there. Well, in my experience and therefore
opinion, VIA stuff is all too often junk, or at least iffy enough never
to be trusted with anything professional or important.
[...]
> I have a PCI Express controller but my kernel doesn't (yet!) support PCI Express.
*How old* is your kernel?
Cheers,
John.
* Re: Raid 5 - not clean and then a failure.
2009-08-25 13:47 ` John Robinson
@ 2009-08-25 14:11 ` Jon Hardcastle
0 siblings, 0 replies; 16+ messages in thread
From: Jon Hardcastle @ 2009-08-25 14:11 UTC (permalink / raw)
To: John Robinson; +Cc: linux-raid
--- On Tue, 25/8/09, John Robinson <john.robinson@anonymous.org.uk> wrote:
> From: John Robinson <john.robinson@anonymous.org.uk>
> Subject: Re: Raid 5 - not clean and then a failure.
> To: Jon@eHardcastle.com
> Cc: linux-raid@vger.kernel.org
> Date: Tuesday, 25 August, 2009, 2:47 PM
> On 25/08/2009 09:40, Jon Hardcastle wrote:
> > --- On Tue, 25/8/09, Robin Hill <robin@robinhill.me.uk> wrote:
> >> On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
> >>> I have been having some problems with my arrays that I think I
> >>> have nailed down to a PCI controller (well, I say that - it is
> >>> always the drives connected to *a* controller, but I have tried
> >>> two!). Anyway, the latest saga: I was trying some new kernel
> >>> options last night, which didn't work.
> >>
> >> Did they have the same chipset? I had problems with PCI controllers
> >> on one of my systems, which turned out to be some sort of conflict
> >> between the onboard chipset and the chipset on the controllers. I
> >> found a PCI card with a different chipset and have had no issues
> >> since.
> >
> > They are/were cheap little VIA ones
>
> That's your problem right there. Well, in my experience and therefore
> opinion, VIA stuff is all too often junk, or at least iffy enough never
> to be trusted with anything professional or important.
>
> [...]
> > I have a PCI Express controller but my kernel doesn't (yet!) support
> > PCI Express.
>
> *How old* is your kernel?
>
> Cheers,
>
> John.
This is what I am finding. I plugged in a second controller, and it seems to have knackered the first one, such that I am now getting 'port too slow to respond' errors from the drives connected to it.
I am looking at my options. Once I get PCI Express working I have options; I am also looking at port multipliers.
(PS: support is IN my kernel source - I just run a trimmed-down kernel. I didn't need PCI Express, so I didn't enable it. Now I do. :) )
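For reference, the rebuild Jon describes is a kernel reconfiguration rather than new hardware support; on a 2.6-era kernel the relevant options are roughly these (a sketch - the exact driver symbol depends on the card's chipset):

```
CONFIG_PCI=y            # base PCI support (already enabled)
CONFIG_PCIEPORTBUS=y    # PCI Express port services (AER, hotplug, etc.)
CONFIG_SATA_AHCI=y      # a common driver for PCIe SATA controllers
```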
-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------
* Re: Raid 5 - not clean and then a failure.
2009-08-25 8:16 ` Robin Hill
2009-08-25 8:40 ` Jon Hardcastle
@ 2009-08-26 11:02 ` Jon Hardcastle
1 sibling, 0 replies; 16+ messages in thread
From: Jon Hardcastle @ 2009-08-26 11:02 UTC (permalink / raw)
To: linux-raid, Robin Hill
--- On Tue, 25/8/09, Robin Hill <robin@robinhill.me.uk> wrote:
> From: Robin Hill <robin@robinhill.me.uk>
> Subject: Re: Raid 5 - not clean and then a failure.
> To: linux-raid@vger.kernel.org
> Date: Tuesday, 25 August, 2009, 9:16 AM
> On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
>
> > Guys,
> >
> > I have been having some problems with my arrays that I think I have
> > nailed down to a PCI controller (well, I say that - it is always the
> > drives connected to *a* controller, but I have tried two!). Anyway,
> > the latest saga: I was trying some new kernel options last night,
> > which didn't work.
> >
> Did they have the same chipset? I had problems with PCI controllers on
> one of my systems, which turned out to be some sort of conflict between
> the onboard chipset and the chipset on the controllers. I found a PCI
> card with a different chipset and have had no issues since.
>
> > But when I booted up again this morning, it said one of the drives
> > was in an inconsistent state (I'm not sure of the *exact* error
> > message). I then kicked off an add of the drive and it started
> > syncing. It got about 5% in, and then the second drive on that
> > controller complained and the array failed.
> >
> > Is there any hope for my data? If I get a good controller in there,
> > will the resync continue? Can I tell it to assume the drives are
> > good (which they ought to be)?
> >
> There's definitely hope. You can assemble the array (using the good
> drives and the last drive to fail) using the --force option, then
> re-add (and sync) the other drive (I'd recommend doing a fsck on the
> filesystem as well). I've just had to do a similar thing myself after
> two drives failed (overheated after a fan failure).
>
> Cheers,
> Robin
It worked! I had to force the array to assemble, but it did. I had some more problems with the controller that I think were ultimately caused by the two VIA controllers conflicting. I think removing them *both* and booting up helped the computer work out what was going on (I don't know how). I also took the 'minimum guaranteed' speed of the rebuild down to 50MB/s, as the two drives on the PCI/150 card were struggling, I think - I'm not sure about this, as the array does a 'check' once a week and has only ever failed last weekend. So basically I am not really 100% sure what caused this problem, but I do know I need a more stable way of connecting these additional drives!
On a side note: if a 'repair' does everything a 'check' does but also repairs, is there any merit in just doing repairs?
Finally, has anyone here got a port multiplier working?
-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------
* Re: Raid 5 - not clean and then a failure.
2009-08-25 7:54 Raid 5 - not clean and then a failure Jon Hardcastle
2009-08-25 8:16 ` Robin Hill
@ 2009-08-26 11:18 ` Goswin von Brederlow
2009-08-26 11:29 ` Jon Hardcastle
2009-08-26 14:14 ` Ryan Wagoner
1 sibling, 2 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2009-08-26 11:18 UTC (permalink / raw)
To: Jon; +Cc: linux-raid
Jon Hardcastle <jd_hardcastle@yahoo.com> writes:
> Guys,
>
> I have been having some problems with my arrays that I think I have nailed down to a PCI controller (well, I say that - it is always the drives connected to *a* controller, but I have tried two!). Anyway, the latest saga: I was trying some new kernel options last night, which didn't work.
>
> But when I booted up again this morning, it said one of the drives was in an inconsistent state (I'm not sure of the *exact* error message). I then kicked off an add of the drive and it started syncing. It got about 5% in, and then the second drive on that controller complained and the array failed.
>
> Is there any hope for my data? If I get a good controller in there, will the resync continue? Can I tell it to assume the drives are good (which they ought to be)?
>
> Please help!
The inconsistency is probably just a block here or there, and I'm
assuming none of your drives actually failed, so 99.9999% of your data
should be there. Just rebooting might actually get your RAID back (to
syncing). If not, then you have to force reassembly from the drives
with the newest event counts. That will give you some data corruption -
whatever was being written when the controller gave errors. Worst case,
you have to recreate the array with --assume-clean.
I recommend adding a bitmap to the RAID. That way a wrongfully failed
drive can be resynced in a matter of minutes instead of hours or days,
which makes it much less likely that another error occurs during the resync.
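The bitmap suggestion is a one-liner with mdadm (device names are placeholders; a bitmap can be added to or removed from a live, non-degraded array):

```
# Add an internal write-intent bitmap to an existing array.
mdadm --grow /dev/md0 --bitmap=internal

# After a drive is wrongly kicked out, re-adding it consults the bitmap
# and resyncs only the chunks written since the failure.
mdadm /dev/md0 --re-add /dev/sdf1

# Remove the bitmap again if the write overhead proves too costly.
mdadm --grow /dev/md0 --bitmap=none
```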
MfG
Goswin
* Re: Raid 5 - not clean and then a failure.
2009-08-26 11:18 ` Goswin von Brederlow
@ 2009-08-26 11:29 ` Jon Hardcastle
2009-08-26 12:47 ` John Robinson
2009-08-26 20:34 ` Goswin von Brederlow
2009-08-26 14:14 ` Ryan Wagoner
1 sibling, 2 replies; 16+ messages in thread
From: Jon Hardcastle @ 2009-08-26 11:29 UTC (permalink / raw)
To: Jon, Goswin von Brederlow; +Cc: linux-raid
--- On Wed, 26/8/09, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> From: Goswin von Brederlow <goswin-v-b@web.de>
> Subject: Re: Raid 5 - not clean and then a failure.
> To: Jon@eHardcastle.com
> Cc: linux-raid@vger.kernel.org
> Date: Wednesday, 26 August, 2009, 12:18 PM
> Jon Hardcastle <jd_hardcastle@yahoo.com> writes:
>
> > Guys,
> >
> > I have been having some problems with my arrays that I think I have
> > nailed down to a PCI controller (well, I say that - it is always the
> > drives connected to *a* controller, but I have tried two!). Anyway,
> > the latest saga: I was trying some new kernel options last night,
> > which didn't work.
> >
> > But when I booted up again this morning, it said one of the drives
> > was in an inconsistent state (I'm not sure of the *exact* error
> > message). I then kicked off an add of the drive and it started
> > syncing. It got about 5% in, and then the second drive on that
> > controller complained and the array failed.
> >
> > Is there any hope for my data? If I get a good controller in there,
> > will the resync continue? Can I tell it to assume the drives are
> > good (which they ought to be)?
> >
> > Please help!
>
> The inconsistency is probably just a block here or there, and I'm
> assuming none of your drives actually failed, so 99.9999% of your data
> should be there. Just rebooting might actually get your RAID back (to
> syncing). If not, then you have to force reassembly from the drives
> with the newest event counts. That will give you some data corruption -
> whatever was being written when the controller gave errors. Worst case,
> you have to recreate the array with --assume-clean.
>
> I recommend adding a bitmap to the RAID. That way a wrongfully failed
> drive can be resynced in a matter of minutes instead of hours or days,
> which makes it much less likely that another error occurs during the
> resync.
>
> MfG
> Goswin
I did look into bitmaps *a bit*. I could easily have the bitmap for my 6-drive RAID 5 stored on the RAID 1 I have in the same system. The googling I did, though, did not paint a pretty picture - it talked about huge performance hits?
-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------
* Re: Raid 5 - not clean and then a failure.
2009-08-26 11:29 ` Jon Hardcastle
@ 2009-08-26 12:47 ` John Robinson
2009-08-26 20:34 ` Goswin von Brederlow
1 sibling, 0 replies; 16+ messages in thread
From: John Robinson @ 2009-08-26 12:47 UTC (permalink / raw)
To: Jon; +Cc: Linux RAID
On 26/08/2009 12:29, Jon Hardcastle wrote:
[...]
> I did look into bitmaps *a bit*. I could easily have the bitmap for my 6-drive RAID 5 stored on the RAID 1 I have in the same system. The googling I did, though, did not paint a pretty picture - it talked about huge performance hits?
There is a performance hit but it can be minimised by picking a bitmap
chunk size to suit; I ended up getting about 80% of bitmap-less write
performance using a 16MB bitmap chunk size instead of about 40% with the
default size, on my 3-drive RAID-5 array.
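That tuning amounts to recreating the bitmap with a larger chunk (a sketch; /dev/md0 is a placeholder, and --bitmap-chunk takes a size in KiB on mdadm of this era):

```
# Remove the existing default-chunk bitmap, then re-add it with 16MB
# chunks: fewer bitmap updates per write means less write overhead.
mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=16384   # KiB = 16MB
```

The trade-off is granularity: larger chunks mean slightly more data to resync after an unclean event, but far fewer bitmap writes during normal operation.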
Cheers,
John.
* Re: Raid 5 - not clean and then a failure.
2009-08-26 11:18 ` Goswin von Brederlow
2009-08-26 11:29 ` Jon Hardcastle
@ 2009-08-26 14:14 ` Ryan Wagoner
2009-08-26 14:19 ` Jon Hardcastle
` (2 more replies)
1 sibling, 3 replies; 16+ messages in thread
From: Ryan Wagoner @ 2009-08-26 14:14 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: Jon, linux-raid
Wouldn't weekly RAID consistency checks reveal a bad block before you
had a failure that required a full resync? It only takes 3 hours to
resync my 3 x 1TB drives, and having a bitmap would reduce performance.
I've never had to do a resync in the year I've had the array up. I just
wonder if the performance drawback is worth having the bitmap to save a
possible resync once every couple of years. Or are the RAID consistency
checks not reliable enough to prevent more errors during a resync?
Ryan
On Wed, Aug 26, 2009 at 7:18 AM, Goswin von Brederlow<goswin-v-b@web.de> wrote:
> Jon Hardcastle <jd_hardcastle@yahoo.com> writes:
>
>> Guys,
>>
>> I have been having some problems with my arrays that I think I have nailed down to a PCI controller (well, I say that - it is always the drives connected to *a* controller, but I have tried two!). Anyway, the latest saga: I was trying some new kernel options last night, which didn't work.
>>
>> But when I booted up again this morning, it said one of the drives was in an inconsistent state (I'm not sure of the *exact* error message). I then kicked off an add of the drive and it started syncing. It got about 5% in, and then the second drive on that controller complained and the array failed.
>>
>> Is there any hope for my data? If I get a good controller in there, will the resync continue? Can I tell it to assume the drives are good (which they ought to be)?
>>
>> Please help!
>
> The inconsistency is probably just a block here or there, and I'm
> assuming none of your drives actually failed, so 99.9999% of your data
> should be there. Just rebooting might actually get your RAID back (to
> syncing). If not, then you have to force reassembly from the drives
> with the newest event counts. That will give you some data corruption -
> whatever was being written when the controller gave errors. Worst case,
> you have to recreate the array with --assume-clean.
>
> I recommend adding a bitmap to the RAID. That way a wrongfully failed
> drive can be resynced in a matter of minutes instead of hours or days,
> which makes it much less likely that another error occurs during the
> resync.
>
> MfG
> Goswin
* Re: Raid 5 - not clean and then a failure.
2009-08-26 14:14 ` Ryan Wagoner
@ 2009-08-26 14:19 ` Jon Hardcastle
2009-08-26 14:50 ` Robin Hill
2009-08-26 14:33 ` Robin Hill
2009-08-26 20:35 ` Goswin von Brederlow
2 siblings, 1 reply; 16+ messages in thread
From: Jon Hardcastle @ 2009-08-26 14:19 UTC (permalink / raw)
To: Goswin von Brederlow, Ryan Wagoner; +Cc: Jon, linux-raid
Can a bitmap be easily removed? I might give it a go if it can.
I am never sure how thorough these checks are. Are they read/write or just read, for example? I make a point of doing read/write badblocks checks with e2fsck -cc when I do run them (not the automatic ones, though - I don't know how), but that only checks that partition, which is on LVM, which is on RAID, so WHO KNOWS what underlying drives are being checked.
I have before now dismantled the array and run read/write badblocks directly on the constituent drives, so at least SMART is aware of them, and although I aim to do this once every six months, I think I have actually done it only once in the two-year life of the array.
-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------
--- On Wed, 26/8/09, Ryan Wagoner <rswagoner@gmail.com> wrote:
> From: Ryan Wagoner <rswagoner@gmail.com>
> Subject: Re: Raid 5 - not clean and then a failure.
> To: "Goswin von Brederlow" <goswin-v-b@web.de>
> Cc: Jon@ehardcastle.com, linux-raid@vger.kernel.org
> Date: Wednesday, 26 August, 2009, 3:14 PM
> Wouldn't weekly RAID consistency checks reveal a bad block before you
> had a failure that required a full resync? It only takes 3 hours to
> resync my 3 x 1TB drives, and having a bitmap would reduce performance.
> I've never had to do a resync in the year I've had the array up. I just
> wonder if the performance drawback is worth having the bitmap to save a
> possible resync once every couple of years. Or are the RAID consistency
> checks not reliable enough to prevent more errors during a resync?
>
> Ryan
>
> On Wed, Aug 26, 2009 at 7:18 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> > Jon Hardcastle <jd_hardcastle@yahoo.com> writes:
> >
> >> Guys,
> >>
> >> I have been having some problems with my arrays that I think I have
> >> nailed down to a PCI controller (well, I say that - it is always
> >> the drives connected to *a* controller, but I have tried two!).
> >> Anyway, the latest saga: I was trying some new kernel options last
> >> night, which didn't work.
> >>
> >> But when I booted up again this morning, it said one of the drives
> >> was in an inconsistent state (I'm not sure of the *exact* error
> >> message). I then kicked off an add of the drive and it started
> >> syncing. It got about 5% in, and then the second drive on that
> >> controller complained and the array failed.
> >>
> >> Is there any hope for my data? If I get a good controller in there,
> >> will the resync continue? Can I tell it to assume the drives are
> >> good (which they ought to be)?
> >>
> >> Please help!
> >
> > The inconsistency is probably just a block here or there, and I'm
> > assuming none of your drives actually failed, so 99.9999% of your
> > data should be there. Just rebooting might actually get your RAID
> > back (to syncing). If not, then you have to force reassembly from
> > the drives with the newest event counts. That will give you some
> > data corruption - whatever was being written when the controller
> > gave errors. Worst case, you have to recreate the array with
> > --assume-clean.
> >
> > I recommend adding a bitmap to the RAID. That way a wrongfully
> > failed drive can be resynced in a matter of minutes instead of hours
> > or days, which makes it much less likely that another error occurs
> > during the resync.
> >
> > MfG
> > Goswin
>
* Re: Raid 5 - not clean and then a failure.
2009-08-26 14:14 ` Ryan Wagoner
2009-08-26 14:19 ` Jon Hardcastle
@ 2009-08-26 14:33 ` Robin Hill
2009-08-26 20:35 ` Goswin von Brederlow
2 siblings, 0 replies; 16+ messages in thread
From: Robin Hill @ 2009-08-26 14:33 UTC (permalink / raw)
To: linux-raid
On Wed Aug 26, 2009 at 10:14:31AM -0400, Ryan Wagoner wrote:
> Wouldn't weekly RAID consistency checks reveal a bad block before you
> had a failure that required a full resync? It only takes 3 hours to
> resync my 3 x 1TB drives, and having a bitmap would reduce performance.
> I've never had to do a resync in the year I've had the array up. I just
> wonder if the performance drawback is worth having the bitmap to save a
> possible resync once every couple of years. Or are the RAID consistency
> checks not reliable enough to prevent more errors during a resync?
>
If your system is that stable, then bitmaps will be a waste of time for
you. A lot of people have hardware/software issues which cause drives
to be kicked out of arrays occasionally, or arrays to fail to shut down
cleanly. A bitmap will save time when adding the drive back into the
array in these cases.
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: Raid 5 - not clean and then a failure.
2009-08-26 14:19 ` Jon Hardcastle
@ 2009-08-26 14:50 ` Robin Hill
0 siblings, 0 replies; 16+ messages in thread
From: Robin Hill @ 2009-08-26 14:50 UTC (permalink / raw)
To: linux-raid
On Wed Aug 26, 2009 at 07:19:51AM -0700, Jon Hardcastle wrote:
> Can a bitmap be easily removed? I might give it ago if it can.
>
Yes - you can add/remove a bitmap at any time (on a non-degraded array).
> I am never sure how thorough these checks are. Are they read/write or
> just read, for example? I make a point of doing read/write badblocks
> checks with e2fsck -cc when I do run them (not the automatic ones,
> though - I don't know how), but that only checks that partition, which
> is on LVM, which is on RAID, so WHO KNOWS what underlying drives are
> being checked.
>
My understanding is that the md "check" action does a read-only check,
verifying the parity is valid for the data. The "repair" action
will rewrite the parity if it's not valid. Neither of these will
write to the data blocks, or to any valid parity blocks.
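
The check/repair actions described here are driven through sysfs (md0 is an
example device; the weekly cron scripts shipped by most distros do much the
same thing):

```shell
# Kick off a read-only consistency check of the whole array
echo check > /sys/block/md0/md/sync_action

# Watch progress, then see how many mismatches were found
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

# If mismatches were reported, rewrite parity to match the data
echo repair > /sys/block/md0/md/sync_action
```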
Running e2fsck -cc should do a read/write check. This will only check
the filesystem data blocks though (and only on ext2/ext3 filesystems of
course), so will miss the LVM metadata and RAID checksums and metadata.
> I have, before now, dismantled the array and run read/write badblocks
> directly on the constituent drives so at least SMART is aware of them,
> and although I aim to do this once every six months, I think I have
> actually done it only once in the two-year life of the array.
>
If you mean running badblocks in read/write mode, that'll be a
destructive test then. In this case, you're trading the risk of a
failure on one disk for the risk of a failure on one of the others
during rebuild.
You could also run background SMART tests (though this has caused drives
to be kicked out of the array on some occasions for me) - these look to
be mostly read-only tests again (though I'm not 100% sure on that).
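
A background SMART self-test as mentioned above can be started with smartctl
along these lines (/dev/sda is an example device; the long self-test runs
inside the drive and is read-only):

```shell
# Start a long self-test; the drive stays usable while it runs
smartctl -t long /dev/sda

# Later, check the outcome in the drive's self-test log
smartctl -l selftest /dev/sda
```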
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Raid 5 - not clean and then a failure.
2009-08-26 11:29 ` Jon Hardcastle
2009-08-26 12:47 ` John Robinson
@ 2009-08-26 20:34 ` Goswin von Brederlow
1 sibling, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2009-08-26 20:34 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid
Jon Hardcastle <jd_hardcastle@yahoo.com> writes:
> --- On Wed, 26/8/09, Goswin von Brederlow <goswin-v-b@web.de> wrote:
>
>> From: Goswin von Brederlow <goswin-v-b@web.de>
>> Subject: Re: Raid 5 - not clean and then a failure.
>> To: Jon@eHardcastle.com
>> Cc: linux-raid@vger.kernel.org
>> Date: Wednesday, 26 August, 2009, 12:18 PM
>> Jon Hardcastle <jd_hardcastle@yahoo.com>
>> writes:
>>
>> > Guys,
>> >
>> > I have been having some problems with my arrays that I
>> think i have nailed down to a pci controller (well I say
>> that - it is always the drives connected to *a* controller
>> but I have tried 2!) anyway the latest saga is i was trying
>> some new kernel options last night - which didn't work.
>> >
>> > But when i booted up again this morning it said one of
>> the drives was in an inconsistent state (not sure of the
>> *exact* error message). I then kicked off an add of the
>> drive and it started syncing. It got about 5% in and then
>> the second drive in on that controller complained and the
>> array failed.
>> >
>> > Is there any hope for my data? If i get a good
>> controller in there will the resync continue? can I try and
>> tell it to assume the drives are good (which they ought to
>> be)?
>> >
>> > Please help!
>>
>> The inconsistency is probably just a block here or there, and I'm
>> assuming none of your drives actually failed. So 99.9999% of your
>> data should be there. Just rebooting might actually get your raid
>> back (to syncing). If not, then you have to force reassembly from
>> the drives with the newest event counts. That will give you some
>> data corruption, whatever was being written when the controller gave
>> errors. Worst case, you have to recreate the raid with
>> --assume-clean.
>>
>> I recommend adding a bitmap to the raid. That way a wrongfully
>> failed drive can be resynced in a matter of minutes instead of hours
>> or days. That makes it much less likely another error occurs during
>> resync.
>>
>> MfG
>> Goswin
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
> I did look into bitmaps a bit - I could easily have the bitmap file for my 6-drive raid 5 stored on the raid1 I have in the same system. The googling I did, though, did not paint a pretty picture - it talked about huge performance hits?
That depends a lot on the bitmap chunk size.
It also depends on the frequency of errors. If your controller has a
hiccup once a week causing a drive to fail, and you need a day to
rebuild the array, you will be left with a double disk failure pretty
quickly without bitmaps.
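
The forced-reassembly route quoted above would look roughly like this
(device names are examples; --create with --assume-clean is a last resort
and every parameter must match the original array exactly):

```shell
# First choice: force assembly from the members with the newest event counts
mdadm --assemble --force /dev/md0 /dev/sd[abcdef]1

# Last resort only - recreate over the same drives without resyncing.
# The level, drive order, chunk size and metadata version must all match
# the original array exactly, or the data will be scrambled.
mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=6 \
      --chunk=64 /dev/sd[abcdef]1
```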
MfG
Goswin
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Raid 5 - not clean and then a failure.
2009-08-26 14:14 ` Ryan Wagoner
2009-08-26 14:19 ` Jon Hardcastle
2009-08-26 14:33 ` Robin Hill
@ 2009-08-26 20:35 ` Goswin von Brederlow
2 siblings, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2009-08-26 20:35 UTC (permalink / raw)
To: Ryan Wagoner; +Cc: Goswin von Brederlow, Jon, linux-raid
Ryan Wagoner <rswagoner@gmail.com> writes:
> Wouldn't weekly RAID consistency checks reveal a bad block before you
> had a failure that required a full resync? It only
> takes 3 hours to resync my 3 x 1TB drives and having a bitmap would
> reduce the performance. I've never had to have a resync in the year
> I've had the array up. I just wonder if the performance drawback is
> worth having the bitmap to save a possible resync once every couple
> years. Or are the RAID consistency checks not reliable enough to
> prevent more errors during a resync?
>
> Ryan
Bitmaps don't protect against disk failures. They help with
intermittent failures, usually caused by the controller. If you don't
have intermittent failures, then bitmaps will only cost you performance
for no benefit.
MfG
Goswin
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-08-26 20:35 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-25 7:54 Raid 5 - not clean and then a failure Jon Hardcastle
2009-08-25 8:16 ` Robin Hill
2009-08-25 8:40 ` Jon Hardcastle
2009-08-25 9:34 ` Robin Hill
2009-08-25 13:47 ` John Robinson
2009-08-25 14:11 ` Jon Hardcastle
2009-08-26 11:02 ` Jon Hardcastle
2009-08-26 11:18 ` Goswin von Brederlow
2009-08-26 11:29 ` Jon Hardcastle
2009-08-26 12:47 ` John Robinson
2009-08-26 20:34 ` Goswin von Brederlow
2009-08-26 14:14 ` Ryan Wagoner
2009-08-26 14:19 ` Jon Hardcastle
2009-08-26 14:50 ` Robin Hill
2009-08-26 14:33 ` Robin Hill
2009-08-26 20:35 ` Goswin von Brederlow
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).