linux-raid.vger.kernel.org archive mirror
* Raid5 assemble after dual sata port failure
@ 2007-11-07 20:23 chrise
  0 siblings, 0 replies; 13+ messages in thread
From: chrise @ 2007-11-07 20:23 UTC (permalink / raw)
  To: linux-raid

Hi,

While on vacation I had one SATA port/cable fail, and then four hours later a second one fail.  After fixing/moving the SATA ports, I can reboot and all drives seem to be OK now, but when the array is assembled it won't recognize the filesystem.  After futzing around with assemble options like --force and disk order, I couldn't get it to work.

I'd appreciate it if someone could provide pointers on how to proceed.  Here's the latest state of the array.  My log shows that /dev/sdc1 failed first, and I'm using mdadm v2.5.6 on Ubuntu Linux.

Thanks much,
Chris 

mdadm -Av /dev/md0
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 2.
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: added /dev/sdd1 to /dev/md0 as 2
mdadm: added /dev/sdc1 to /dev/md0 as 3
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 3 drives (out of 4).

  mdadm -E /dev/sd[a-d]1
/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  7 12:02:50 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 401c33e0 - correct
         Events : 0.4880374

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0     254        0        0      active sync   /dev/mapper/sda1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3       0        0        3      faulty removed
/dev/sdb1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  7 12:02:50 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 401c33e3 - correct
         Events : 0.4880374

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1     254        1        1      active sync   /dev/mapper/sdb1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3       0        0        3      faulty removed
/dev/sdc1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Mon Nov  5 13:31:00 2007
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 3fced5a3 - correct
         Events : 0.4857597

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3     254        3        3      active sync   /dev/mapper/sdc1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3     254        3        3      active sync   /dev/mapper/sdc1
/dev/sdd1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  7 12:02:50 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 401c33e6 - correct
         Events : 0.4880374

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2     254        2        2      active sync   /dev/mapper/sdd1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3       0        0        3      faulty removed





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Raid5 assemble after dual sata port failure
@ 2007-11-07 20:28 Chris Eddington
  2007-11-08 10:33 ` David Greaves
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Eddington @ 2007-11-07 20:28 UTC (permalink / raw)
  To: linux-raid


Hi,

While on vacation I had one SATA port/cable fail, and then four hours
later a second one fail.  After fixing/moving the SATA ports, I can
reboot and all drives seem to be OK now, but when the array is assembled
it won't recognize the filesystem.  After futzing around with assemble
options like --force and disk order, I couldn't get it to work.

I'd appreciate it if someone could provide pointers on how to proceed.
Here's the latest state of the array.  My log shows that /dev/sdc1
failed first, and I'm using mdadm v2.5.6 on Ubuntu Linux.

Thanks much,
Chris

mdadm -Av /dev/md0
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 2.
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: added /dev/sdd1 to /dev/md0 as 2
mdadm: added /dev/sdc1 to /dev/md0 as 3
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 3 drives (out of 4).

  mdadm -E /dev/sd[a-d]1
/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  7 12:02:50 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 401c33e0 - correct
         Events : 0.4880374

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0     254        0        0      active sync   /dev/mapper/sda1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3       0        0        3      faulty removed
/dev/sdb1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  7 12:02:50 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 401c33e3 - correct
         Events : 0.4880374

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1     254        1        1      active sync   /dev/mapper/sdb1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3       0        0        3      faulty removed
/dev/sdc1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Mon Nov  5 13:31:00 2007
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 3fced5a3 - correct
         Events : 0.4857597

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3     254        3        3      active sync   /dev/mapper/sdc1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3     254        3        3      active sync   /dev/mapper/sdc1
/dev/sdd1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
    Device Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  7 12:02:50 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 401c33e6 - correct
         Events : 0.4880374

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2     254        2        2      active sync   /dev/mapper/sdd1

   0     0     254        0        0      active sync   /dev/mapper/sda1
   1     1     254        1        1      active sync   /dev/mapper/sdb1
   2     2     254        2        2      active sync   /dev/mapper/sdd1
   3     3       0        0        3      faulty removed



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-07 20:28 Chris Eddington
@ 2007-11-08 10:33 ` David Greaves
  2007-11-09 21:23   ` Chris Eddington
  0 siblings, 1 reply; 13+ messages in thread
From: David Greaves @ 2007-11-08 10:33 UTC (permalink / raw)
  To: Chris Eddington; +Cc: linux-raid

Chris Eddington wrote:
> 
> Hi,
Hi
> 
> While on vacation I had one SATA port/cable fail, and then four hours
> later a second one fail.  After fixing/moving the SATA ports, I can
> reboot and all drives seem to be OK now, but when assembled it won't
> recognize the filesystem.

That's unusual - if the array comes back then you should be OK.
In general if two devices fail then there is a real data loss risk.
However if the drives are good and there was just a cable glitch, then unless
you're unlucky it's usually fsck fixable.

I see
mdadm: /dev/md0 has been started with 3 drives (out of 4).

which means it's now up and running.

And:
sda1        Events : 0.4880374
sdb1        Events : 0.4880374
sdc1        Events : 0.4857597
sdd1        Events : 0.4880374

so sdc1 is way out of date... we'll add/resync that when everything else is working.
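When we get to that point, it's roughly this (device name taken from your listing):
  mdadm /dev/md0 --add /dev/sdc1   # re-add the stale member; md will resync it
  cat /proc/mdstat                 # watch the rebuild progress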

but:
>  After futzing around with assemble options
> like --force and disk order I couldn't get it to work.

Let me check... what commands did you use? Just 'assemble' - which doesn't care
about disk order - or did you try to re-'create' the array - which does care
about disk order and leads us down a different path...
err, scratch that:
>  Creation Time : Sun Nov  5 14:25:01 2006
OK, it was created a year ago... so you did use assemble.
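To spell out the difference: with 'assemble' the slot each member occupies is
read from its superblock, so the order you list devices in doesn't matter, e.g.
(device names purely illustrative):
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1   # listing order is irrelevant
Only re-'create' writes fresh superblocks, and there the device order is
critical - best avoided unless nothing else works.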


It is slightly odd to see that the drive order is:
/dev/mapper/sda1
/dev/mapper/sdb1
/dev/mapper/sdd1
/dev/mapper/sdc1
Usually people just create them in order.


Have you done any fsck's that involve a write?

What filesystem are you running? What does your 'fsck -n' (readonly) report?

Also, please report the results of:
 cat /proc/mdstat
 mdadm -D /dev/md0
 cat /etc/mdadm/mdadm.conf


David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-08 10:33 ` David Greaves
@ 2007-11-09 21:23   ` Chris Eddington
  2007-11-10  0:28     ` Chris Eddington
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Eddington @ 2007-11-09 21:23 UTC (permalink / raw)
  To: David Greaves; +Cc: linux-raid

Thanks David.

I've had cable/port failures in the past, and after re-adding the drive
the order changed - I'm not sure why.  I noticed it some time ago but
don't remember the exact order.

On my initial attempt to assemble, it came up with only two drives in
the array.  Then I tried assembling with --force and that brought up 3
of the drives.  At that point I thought I was good, so I tried to mount
/dev/md0 and it failed.  Would that have written to the disk?  I'm
using XFS.

After that, I tried assembling with different drive orders on the 
command line, i.e. mdadm -Av --force /dev/md0 /dev/sda1, ... thinking 
that the order might not be right.

At the moment I can't access the machine, but I'll try fsck -n and send 
you the other info later this evening.

Many thanks,
Chris

David Greaves wrote:
> Chris Eddington wrote:
>   
>> Hi,
>>     
> Hi
>   
>> While on vacation I had one SATA port/cable fail, and then four hours
>> later a second one fail.  After fixing/moving the SATA ports, I can
>> reboot and all drives seem to be OK now, but when assembled it won't
>> recognize the filesystem.
>>     
>
> That's unusual - if the array comes back then you should be OK.
> In general if two devices fail then there is a real data loss risk.
> However if the drives are good and there was just a cable glitch, then unless
> you're unlucky it's usually fsck fixable.
>
> I see
> mdadm: /dev/md0 has been started with 3 drives (out of 4).
>
> which means it's now up and running.
>
> And:
> sda1        Events : 0.4880374
> sdb1        Events : 0.4880374
> sdc1        Events : 0.4857597
> sdd1        Events : 0.4880374
>
> so sdc1 is way out of date... we'll add/resync that when everything else is working.
>
> but:
>   
>>  After futzing around with assemble options
>> like --force and disk order I couldn't get it to work.
>>     
>
> Let me check... what commands did you use? Just 'assemble' - which doesn't care
> about disk order - or did you try to re-'create' the array - which does care
> about disk order and leads us down a different path...
> err, scratch that:
>   
>>  Creation Time : Sun Nov  5 14:25:01 2006
>>     
> OK, it was created a year ago... so you did use assemble.
>
>
> It is slightly odd to see that the drive order is:
> /dev/mapper/sda1
> /dev/mapper/sdb1
> /dev/mapper/sdd1
> /dev/mapper/sdc1
> Usually people just create them in order.
>
>
> Have you done any fsck's that involve a write?
>
> What filesystem are you running? What does your 'fsck -n' (readonly) report?
>
> Also, please report the results of:
>  cat /proc/mdstat
>  mdadm -D /dev/md0
>  cat /etc/mdadm/mdadm.conf
>
>
> David
>
>   


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-09 21:23   ` Chris Eddington
@ 2007-11-10  0:28     ` Chris Eddington
  2007-11-10  9:16       ` David Greaves
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Eddington @ 2007-11-10  0:28 UTC (permalink / raw)
  To: Chris Eddington; +Cc: David Greaves, linux-raid

Hi David,

I ran xfs_check and get this:
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_check.  If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

After trying to mount (which fails) and re-running xfs_check, it gives
the same message.

The array info details are below, and the array seems to be running
correctly(?).  I interpret the message above as actually a good sign -
it seems that xfs_check sees the filesystem, but the log and maybe the
most recently written data are corrupted or will be lost.  But I'd like
to hear some advice/guidance before doing anything permanent with
xfs_repair.  I would also like to confirm somehow that the array is in
the right order, etc.  I'd appreciate your feedback.

Thks,
Chris



--------------------
cat /etc/mdadm/mdadm.conf
DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=bc74c21c:9655c1c6:ba6cc37a:df870496
MAILADDR root

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sda1[0] sdd1[2] sdb1[1]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
     
unused devices: <none>

mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
    Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Nov  9 16:26:31 2007
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
         Events : 0.4880384

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       49        2      active sync   /dev/sdd1
       3       0        0        3      removed



Chris Eddington wrote:
> Thanks David.
>
> I've had cable/port failures in the past and after re-adding the 
> drive, the order changed - I'm not sure why, but I noticed it sometime 
> ago but don't remember the exact order.
>
> My initial attempt to assemble, it came up with only two drives in the 
> array.  Then I tried assembling with --force and that brought up 3 of 
> the drives.  At that point I thought I was good, so I tried mount 
> /dev/md0 and it failed.  Would that have written to the disk?  I'm 
> using XFS.
>
> After that, I tried assembling with different drive orders on the 
> command line, i.e. mdadm -Av --force /dev/md0 /dev/sda1, ... thinking 
> that the order might not be right.
>
> At the moment I can't access the machine, but I'll try fsck -n and 
> send you the other info later this evening.
>
> Many thanks,
> Chris
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-10  0:28     ` Chris Eddington
@ 2007-11-10  9:16       ` David Greaves
  2007-11-10 18:46         ` Chris Eddington
  0 siblings, 1 reply; 13+ messages in thread
From: David Greaves @ 2007-11-10  9:16 UTC (permalink / raw)
  To: Chris Eddington; +Cc: linux-raid

Ok - it looks like the raid array is up. There will have been an event count
mismatch which is why you needed --force. This may well have caused some
(hopefully minor) corruption.

FWIW, xfs_check is almost never worth running :) (It runs out of memory easily).
xfs_repair -n is much better.

What does the end of dmesg say after trying to mount the fs?

Also try:
xfs_repair -n -L

I think you then have 2 options:
* xfs_repair -L
This may well lose data that was being written as the drives crashed.
* contact the xfs mailing list
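Roughly the sequence I'd try, with the mount point and the read-only peek just
suggestions rather than requirements:
  dmesg | tail -50                        # errors logged by the failed mount attempt
  mount -o ro,norecovery /dev/md0 /mnt    # XFS: look at the data without replaying the log
  umount /mnt
  xfs_repair -n /dev/md0                  # dry run, makes no changes
  xfs_repair -L /dev/md0                  # last resort: zeroes the log, may lose in-flight data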

David

Chris Eddington wrote:
> Hi David,
> 
> I ran xfs_check and get this:
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> be replayed.  Mount the filesystem to replay the log, and unmount it before
> re-running xfs_check.  If you are unable to mount the filesystem, then use
> the xfs_repair -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a mount
> of the filesystem before doing this.
> 
> After mounting (which fails) and re-running xfs_check it gives the same
> message.
> 
> The array info details are below and seems it is running correctly ??  I
> interpret the message above as actually a good sign - seems that
> xfs_check sees the filesystem but the log file and maybe the most
> currently written data is corrupted or will be lost.  But I'd like to
> hear some advice/guidance before doing anything permanent with
> xfs_repair.  I also would like to confirm somehow that the array is in
> the right order, etc.  Appreciate your feedback.
> 
> Thks,
> Chris
> 
> 
> 
> --------------------
> cat /etc/mdadm/mdadm.conf
> DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> ARRAY /dev/md0 level=raid5 num-devices=4
> UUID=bc74c21c:9655c1c6:ba6cc37a:df870496
> MAILADDR root
> 
> cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sda1[0] sdd1[2] sdb1[1]
>      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
>     unused devices: <none>
> 
> mdadm -D /dev/md0
> /dev/md0:
>        Version : 00.90.03
>  Creation Time : Sun Nov  5 14:25:01 2006
>     Raid Level : raid5
>     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
>    Device Size : 488383936 (465.76 GiB 500.11 GB)
>   Raid Devices : 4
>  Total Devices : 3
> Preferred Minor : 0
>    Persistence : Superblock is persistent
> 
>    Update Time : Fri Nov  9 16:26:31 2007
>          State : clean, degraded
> Active Devices : 3
> Working Devices : 3
> Failed Devices : 0
>  Spare Devices : 0
> 
>         Layout : left-symmetric
>     Chunk Size : 64K
> 
>           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
>         Events : 0.4880384
> 
>    Number   Major   Minor   RaidDevice State
>       0       8        1        0      active sync   /dev/sda1
>       1       8       17        1      active sync   /dev/sdb1
>       2       8       49        2      active sync   /dev/sdd1
>       3       0        0        3      removed
> 
> 
> 
> Chris Eddington wrote:
>> Thanks David.
>>
>> I've had cable/port failures in the past and after re-adding the
>> drive, the order changed - I'm not sure why, but I noticed it sometime
>> ago but don't remember the exact order.
>>
>> My initial attempt to assemble, it came up with only two drives in the
>> array.  Then I tried assembling with --force and that brought up 3 of
>> the drives.  At that point I thought I was good, so I tried mount
>> /dev/md0 and it failed.  Would that have written to the disk?  I'm
>> using XFS.
>>
>> After that, I tried assembling with different drive orders on the
>> command line, i.e. mdadm -Av --force /dev/md0 /dev/sda1, ... thinking
>> that the order might not be right.
>>
>> At the moment I can't access the machine, but I'll try fsck -n and
>> send you the other info later this evening.
>>
>> Many thanks,
>> Chris
>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-10  9:16       ` David Greaves
@ 2007-11-10 18:46         ` Chris Eddington
  2007-11-11 17:09           ` David Greaves
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Eddington @ 2007-11-10 18:46 UTC (permalink / raw)
  To: David Greaves; +Cc: linux-raid

Hi,

Thanks for the pointer on xfs_repair -n; it actually tells me something
(some of it listed below), but I'm not sure what it means, and there
seems to be a lot of data loss.  One complication is that I see an error
message on ata6, so I moved the disks around thinking it was a flaky
SATA port, but now I see the error on ata4, so it seems to follow the
disk.  It happens at exactly the same point in the xfs_repair sequence
each time, though, so I don't think it is a flaky disk.  I'll go to the
xfs mailing list on this.

Is there a way to be sure the disk order is right?  What I mean is:
when using --force, does mdadm try to figure out the right order based
on what it can recognize on the disks, or does it just take the existing
disk order and assemble them?  I want to be sure that this is not way
out of whack, since I'm seeing so much from xfs_repair.  Also, since
I've been moving the disks around, I want to be sure I have the right
order.

Is there a way to try restoring using the other disk?

Thks,
Chris



        - creating 4 worker thread(s)
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
bad on-disk superblock 2 - inconsistent filesystem geometry in realtime 
filesystem component
primary/secondary superblock 2 conflict - AG superblock geometry info 
conflicts with filesystem geometry
would reset bad sb for ag 2
bad uncorrected agheader 2, skipping ag...
bad on-disk superblock 24 - bad magic number
primary/secondary superblock 24 conflict - AG superblock geometry info 
conflicts with filesystem geometry
bad flags field in superblock 24
bad shared version number in superblock 24
bad inode alignment field in superblock 24
bad stripe unit/width fields in superblock 24
bad log/data device sector size fields in superblock 24
bad magic # 0xc486a1e7 for agi 24
bad version # 127171049 for agi 24
bad sequence # 606867126 for agi 24
bad length # -48052605 for agi 24, should be 11446496
would reset bad sb for ag 24
would reset bad agi for ag 24
bad uncorrected agheader 24, skipping ag...
        - 10:49:34: scanning filesystem freespace - 30 of 32 allocation 
groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
error following ag 24 unlinked list
        - 10:49:34: scanning agi unlinked lists - 32 of 32 allocation 
groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
imap claims a free inode 268435719 is in use, would correct imap and 
clear inode
bad nblocks 23 for inode 268435723, would reset to 13
corrupt block 0 in directory inode 259
    would junk block
no . entry for directory 259
no .. entry for directory 259
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
attribute entry 0 in attr block 0, inode 2147610149 has bad name 
(namelen = 0)
problem with attribute contents in inode 2147610149
would clear attr fork
bad nblocks 11 for inode 2147610149, would reset to 10
bad anextents 1 for inode 2147610149, would reset to 0
attribute entry 0 in attr block 0, inode 2147610376 has bad name 
(namelen = 0)
problem with attribute contents in inode 2147610376
would clear attr fork
bad nblocks 13 for inode 2147610376, would reset to 12
bad anextents 1 for inode 2147610376, would reset to 0
        - agno = 9
        - agno = 10
        - agno = 11
imap claims in-use inode 2173744652 is free, would correct imap
data fork in ino 2423071372 claims free block 201330859
data fork in ino 2423071372 claims free block 201330860
 .....
would have reset inode 4090071559 nlinks from 5 to 3
would have reset inode 4130446080 nlinks from 6 to 4
would have reset inode 4130446132 nlinks from 5 to 4
would have reset inode 4130509338 nlinks from 21 to 19
would have reset inode 4136546816 nlinks from 5 to 4
would have reset inode 4136546819 nlinks from 5 to 4
would have reset inode 4136546822 nlinks from 5 to 4
would have reset inode 4136546825 nlinks from 5 to 4
would have reset inode 4168420144 nlinks from 7 to 4
        - 10:54:24: verify link counts - 191040 of 202304 inodes done
No modify flag set, skipping filesystem flush and exiting.






David Greaves wrote:
> Ok - it looks like the raid array is up. There will have been an event count
> mismatch which is why you needed --force. This may well have caused some
> (hopefully minor) corruption.
>
> FWIW, xfs_check is almost never worth running :) (It runs out of memory easily).
> xfs_repair -n is much better.
>
> What does the end of dmesg say after trying to mount the fs?
>
> Also try:
> xfs_repair -n -L
>
> I think you then have 2 options:
> * xfs_repair -L
> This may well lose data that was being written as the drives crashed.
> * contact the xfs mailing list
>
> David
>
> Chris Eddington wrote:
>   
>> Hi David,
>>
>> I ran xfs_check and get this:
>> ERROR: The filesystem has valuable metadata changes in a log which needs to
>> be replayed.  Mount the filesystem to replay the log, and unmount it before
>> re-running xfs_check.  If you are unable to mount the filesystem, then use
>> the xfs_repair -L option to destroy the log and attempt a repair.
>> Note that destroying the log may cause corruption -- please attempt a mount
>> of the filesystem before doing this.
>>
>> After mounting (which fails) and re-running xfs_check it gives the same
>> message.
>>
>> The array info details are below and seems it is running correctly ??  I
>> interpret the message above as actually a good sign - seems that
>> xfs_check sees the filesystem but the log file and maybe the most
>> currently written data is corrupted or will be lost.  But I'd like to
>> hear some advice/guidance before doing anything permanent with
>> xfs_repair.  I also would like to confirm somehow that the array is in
>> the right order, etc.  Appreciate your feedback.
>>
>> Thks,
>> Chris
>>
>>
>>
>> --------------------
>> cat /etc/mdadm/mdadm.conf
>> DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
>> ARRAY /dev/md0 level=raid5 num-devices=4
>> UUID=bc74c21c:9655c1c6:ba6cc37a:df870496
>> MAILADDR root
>>
>> cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid5 sda1[0] sdd1[2] sdb1[1]
>>      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
>>     unused devices: <none>
>>
>> mdadm -D /dev/md0
>> /dev/md0:
>>        Version : 00.90.03
>>  Creation Time : Sun Nov  5 14:25:01 2006
>>     Raid Level : raid5
>>     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
>>    Device Size : 488383936 (465.76 GiB 500.11 GB)
>>   Raid Devices : 4
>>  Total Devices : 3
>> Preferred Minor : 0
>>    Persistence : Superblock is persistent
>>
>>    Update Time : Fri Nov  9 16:26:31 2007
>>          State : clean, degraded
>> Active Devices : 3
>> Working Devices : 3
>> Failed Devices : 0
>>  Spare Devices : 0
>>
>>         Layout : left-symmetric
>>     Chunk Size : 64K
>>
>>           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
>>         Events : 0.4880384
>>
>>    Number   Major   Minor   RaidDevice State
>>       0       8        1        0      active sync   /dev/sda1
>>       1       8       17        1      active sync   /dev/sdb1
>>       2       8       49        2      active sync   /dev/sdd1
>>       3       0        0        3      removed
>>
>>
>>
>> Chris Eddington wrote:
>>     
>>> Thanks David.
>>>
>>> I've had cable/port failures in the past and after re-adding the
>>> drive, the order changed - I'm not sure why, but I noticed it sometime
>>> ago but don't remember the exact order.
>>>
>>> My initial attempt to assemble, it came up with only two drives in the
>>> array.  Then I tried assembling with --force and that brought up 3 of
>>> the drives.  At that point I thought I was good, so I tried mount
>>> /dev/md0 and it failed.  Would that have written to the disk?  I'm
>>> using XFS.
>>>
>>> After that, I tried assembling with different drive orders on the
>>> command line, i.e. mdadm -Av --force /dev/md0 /dev/sda1, ... thinking
>>> that the order might not be right.
>>>
>>> At the moment I can't access the machine, but I'll try fsck -n and
>>> send you the other info later this evening.
>>>
>>> Many thanks,
>>> Chris
>>>
>>>       
>
>
>   


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-10 18:46         ` Chris Eddington
@ 2007-11-11 17:09           ` David Greaves
  2007-11-11 17:41             ` Chris Eddington
  0 siblings, 1 reply; 13+ messages in thread
From: David Greaves @ 2007-11-11 17:09 UTC (permalink / raw)
  To: Chris Eddington; +Cc: linux-raid

Chris Eddington wrote:
> Hi,
> 
> Thanks for the pointer on xfs_repair -n , it actually tells me something
> (some listed below) but I'm not sure what it means but there seems to be
> a lot of data loss.  One complication is I see an error message in ata6,
> so I moved the disks around thinking it was a flaky sata port, but I see
> the error again on ata4 so it seems to follow the disk.  But it happens
> exactly at the same time during xfs_repair sequence, so I don't think it
> is a flaky disk.
Does dmesg have any info/sata errors?

xfs_repair will have problems if the disk is bad. You may want to image the disk
(possibly onto the 'spare'?) if it is bad.

>  I'll go to the xfs mailing list on this.
Very good idea :)

> Is there a way to be sure the disk order is right? 
The order looks right to me.
xfs_repair wouldn't recognise it as well as it does if the order was wrong.
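If you want to double-check from the superblocks themselves, something like
this (read-only; pattern and device names illustrative) shows the slot and
event count each member thinks it has:
  mdadm -E /dev/sd[abcd]1 | grep -E 'this|Events|UUID'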

> not way out of wack since I'm seeing so much from xfs_repair.  Also
> since I've been moving the disks around, I want to be sure I have the
> right order.

Bear in mind that -n stops the repair from fixing problems. Then, as the
'repair' proceeds, it becomes increasingly confused by problems that should
already have been fixed.

This is evident in the superblock issue (which also probably explains the failed
mount).


> 
> Is there a way to try restoring using the other disk?
No, the event count on that disk was very out of date.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-11 17:09           ` David Greaves
@ 2007-11-11 17:41             ` Chris Eddington
  2007-11-11 22:49               ` David Greaves
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Eddington @ 2007-11-11 17:41 UTC (permalink / raw)
  To: David Greaves; +Cc: linux-raid

Yes, there is some kind of media error message in dmesg, below.  It is 
not random, it happens at exactly the same moments in each xfs_repair -n 
run. 

Nov 11 09:48:25 altair kernel: [37043.300691]          res 
51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error)
Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:48:25 altair kernel: [37043.307676] ata4.00: configured for 
UDMA/133
Nov 11 09:48:25 altair kernel: [37043.307684] ata4: EH complete
Nov 11 09:48:27 altair kernel: [37043.747838] SCSI device sdd: 976773168 
512-byte hdwr sectors (500108 MB)
Nov 11 09:48:27 altair kernel: [37043.747861] sdd: Write Protect is off
Nov 11 09:48:27 altair kernel: [37043.747878] SCSI device sdd: write 
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 11 09:49:19 altair kernel: [37065.709216]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:19 altair kernel: [37065.720197] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:19 altair kernel: [37065.732188] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:19 altair kernel: [37065.732192] ata4.00: configured for 
UDMA/133
Nov 11 09:49:19 altair kernel: [37065.732199] ata4: EH complete
Nov 11 09:49:21 altair kernel: [37067.206243]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:21 altair kernel: [37067.210721] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:21 altair kernel: [37067.215727] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:21 altair kernel: [37067.215731] ata4.00: configured for 
UDMA/133
Nov 11 09:49:21 altair kernel: [37067.215738] ata4: EH complete
Nov 11 09:49:24 altair kernel: [37068.107825]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:24 altair kernel: [37068.112730] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:24 altair kernel: [37068.117732] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:24 altair kernel: [37068.117736] ata4.00: configured for 
UDMA/133
Nov 11 09:49:24 altair kernel: [37068.117740] ata4: EH complete
Nov 11 09:49:26 altair kernel: [37069.095665]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:26 altair kernel: [37069.100156] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:26 altair kernel: [37069.105148] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:26 altair kernel: [37069.105152] ata4.00: configured for 
UDMA/133
Nov 11 09:49:26 altair kernel: [37069.105159] ata4: EH complete
Nov 11 09:49:28 altair kernel: [37069.996842]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:28 altair kernel: [37070.000912] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:28 altair kernel: [37070.005916] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:28 altair kernel: [37070.005919] ata4.00: configured for 
UDMA/133
Nov 11 09:49:28 altair kernel: [37070.005924] ata4: EH complete
Nov 11 09:49:31 altair kernel: [37070.983850]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:31 altair kernel: [37070.987914] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:31 altair kernel: [37070.992917] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:31 altair kernel: [37070.992920] ata4.00: configured for 
UDMA/133
Nov 11 09:49:31 altair kernel: [37070.992935] ata4: EH complete
Nov 11 09:49:31 altair kernel: [37071.000639] SCSI device sdd: 976773168 
512-byte hdwr sectors (500108 MB)
Nov 11 09:49:31 altair kernel: [37071.000719] sdd: Write Protect is off
Nov 11 09:49:31 altair kernel: [37071.000745] SCSI device sdd: write 
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 11 09:49:31 altair kernel: [37071.000762] SCSI device sdd: 976773168 
512-byte hdwr sectors (500108 MB)
Nov 11 09:49:31 altair kernel: [37071.000770] sdd: Write Protect is off
Nov 11 09:49:31 altair kernel: [37071.000788] SCSI device sdd: write 
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 11 09:49:33 altair kernel: [37072.213749]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:33 altair kernel: [37072.218227] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:33 altair kernel: [37072.223231] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:33 altair kernel: [37072.223235] ata4.00: configured for 
UDMA/133
Nov 11 09:49:33 altair kernel: [37072.223242] ata4: EH complete
Nov 11 09:49:36 altair kernel: [37073.283239]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:36 altair kernel: [37073.286894] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:36 altair kernel: [37073.290220] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:36 altair kernel: [37073.290224] ata4.00: configured for 
UDMA/133
Nov 11 09:49:36 altair kernel: [37073.290231] ata4: EH complete
Nov 11 09:49:38 altair kernel: [37074.094417]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:38 altair kernel: [37074.097652] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:38 altair kernel: [37074.100988] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:38 altair kernel: [37074.100992] ata4.00: configured for 
UDMA/133
Nov 11 09:49:38 altair kernel: [37074.100997] ata4: EH complete
Nov 11 09:49:40 altair kernel: [37074.992267]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:40 altair kernel: [37074.996747] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:40 altair kernel: [37075.000074] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:40 altair kernel: [37075.000078] ata4.00: configured for 
UDMA/133
Nov 11 09:49:40 altair kernel: [37075.000083] ata4: EH complete
Nov 11 09:49:42 altair kernel: [37075.803457]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:42 altair kernel: [37075.807516] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:42 altair kernel: [37075.810842] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:42 altair kernel: [37075.810846] ata4.00: configured for 
UDMA/133
Nov 11 09:49:42 altair kernel: [37075.810853] ata4: EH complete
Nov 11 09:49:44 altair kernel: [37076.700452]          res 
51/40:00:0f:00:00/00:00:00:00:00/ef Emask 0x9 (media error)
Nov 11 09:49:44 altair kernel: [37076.704947] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:44 altair kernel: [37076.708272] ata4.00: ata_hpa_resize 1: 
sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:49:44 altair kernel: [37076.708275] ata4.00: configured for 
UDMA/133
Nov 11 09:49:44 altair kernel: [37076.708290] ata4: EH complete
Nov 11 09:49:44 altair kernel: [37076.709550] SCSI device sdd: 976773168 
512-byte hdwr sectors (500108 MB)
Nov 11 09:49:44 altair kernel: [37076.709572] sdd: Write Protect is off
Nov 11 09:49:44 altair kernel: [37076.709594] SCSI device sdd: write 
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 11 09:49:44 altair kernel: [37076.709611] SCSI device sdd: 976773168 
512-byte hdwr sectors (500108 MB)
Nov 11 09:49:44 altair kernel: [37076.709623] sdd: Write Protect is off
Nov 11 09:49:44 altair kernel: [37076.709705] SCSI device sdd: write 
cache: enabled, read cache: enabled, doesn't support DPO or FUA


David Greaves wrote:
> Chris Eddington wrote:
>   
>> Hi,
>>
>> Thanks for the pointer on xfs_repair -n , it actually tells me something
>> (some listed below) but I'm not sure what it means but there seems to be
>> a lot of data loss.  One complication is I see an error message in ata6,
>> so I moved the disks around thinking it was a flaky sata port, but I see
>> the error again on ata4 so it seems to follow the disk.  But it happens
>> exactly at the same time during xfs_repair sequence, so I don't think it
>> is a flaky disk.
>>     
> Does dmesg have any info/sata errors?
>
> xfs_repair will have problems if the disk is bad. You may want to image the disk
> (possibly onto the 'spare'?) if it is bad.
>
>   
>>  I'll go to the xfs mailing list on this.
>>     
> Very good idea :)
>
>   
>> Is there a way to be sure the disk order is right? 
>>     
> The order looks right to me.
> xfs_repair wouldn't recognise it as well as it does if the order was wrong.
>
>   
>> not way out of wack since I'm seeing so much from xfs_repair.  Also
>> since I've been moving the disks around, I want to be sure I have the
>> right order.
>>     
>
> Bear in mind that -n stops the repair fixing a problem. Then as the 'repair'
> proceeds it becomes very confused by problems that should have been fixed.
>
> This is evident in the superblock issue (which also probably explains the failed
> mount).
>
>
>   
>> Is there a way to try restoring using the other disk?
>>     
> No the event count was very out of date.
>
>
>
>   


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-11 17:41             ` Chris Eddington
@ 2007-11-11 22:49               ` David Greaves
  2007-11-12  1:01                 ` Bill Davidsen
  0 siblings, 1 reply; 13+ messages in thread
From: David Greaves @ 2007-11-11 22:49 UTC (permalink / raw)
  To: Chris Eddington; +Cc: linux-raid

Chris Eddington wrote:
> Yes, there is some kind of media error message in dmesg, below.  It is
> not random, it happens at exactly the same moments in each xfs_repair -n
> run.
> Nov 11 09:48:25 altair kernel: [37043.300691]          res
> 51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error)
> Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: ata_hpa_resize 1:
> sectors = 976773168, hpa_sectors = 976773168
> Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: ata_hpa_resize 1:
> sectors = 976773168, hpa_sectors = 976773168

I'm not sure what an ata_hpa_resize error is...

It probably explains the problems you've been having with the raid not 'just
recovering' though.

I saw this:
http://www.linuxquestions.org/questions/linux-kernel-70/sata-issues-568894/


What does smartctl say about your drive?

IMO the spare drive is no longer useful for data recovery - you may want to use
ddrescue to try and copy this drive to the spare drive.

David
PS Don't get the ddrescue parameters the wrong way round if you go that route...
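For example, roughly (GNU ddrescue; sdd = failing source, sde = spare target -
both device names hypothetical here, double-check yours):
  smartctl -a /dev/sdd                               # check reallocated/pending sector counts first
  ddrescue -f -r3 /dev/sdd /dev/sde sdd-rescue.log   # source first, destination second, plus a log file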

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-11 22:49               ` David Greaves
@ 2007-11-12  1:01                 ` Bill Davidsen
  2007-11-17  6:31                   ` Chris Eddington
  0 siblings, 1 reply; 13+ messages in thread
From: Bill Davidsen @ 2007-11-12  1:01 UTC (permalink / raw)
  To: David Greaves; +Cc: Chris Eddington, linux-raid

David Greaves wrote:
> Chris Eddington wrote:
>   
>> Yes, there is some kind of media error message in dmesg, below.  It is
>> not random, it happens at exactly the same moments in each xfs_repair -n
>> run.
>> Nov 11 09:48:25 altair kernel: [37043.300691]          res
>> 51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error)
>> Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: ata_hpa_resize 1:
>> sectors = 976773168, hpa_sectors = 976773168
>> Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: ata_hpa_resize 1:
>> sectors = 976773168, hpa_sectors = 976773168
>>     
>
> I'm not sure what an ata_hpa_resize error is...
>   

HPA = Host Protected Area.

By any chance is this disk partitioned such that the partition size
includes the HPA?  If it is, this sounds at least familiar; this mailing
list post may get you started:
http://osdir.com/ml/linux.ataraid/2005-09/msg00002.html

In any case, run "fdisk -l" and compare the claimed total disk size with
the end point of the last partition.  The HPA is not included in the
reported disk size, so no partition should be reaching into it.
> It probably explains the problems you've been having with the raid not 'just
> recovering' though.
>
> I saw this:
> http://www.linuxquestions.org/questions/linux-kernel-70/sata-issues-568894/
>   

May be the same thing. Let us know what fdisk reports.
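For example (device name hypothetical, and hdparm -N needs a reasonably
recent hdparm):
  fdisk -l /dev/sdd     # compare the end of the last partition with the reported disk size
  hdparm -N /dev/sdd    # reports max sectors and whether an HPA is set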
>
> What does smartctl say about your drive?
>
> IMO the spare drive is no longer useful for data recovery - you may want to use
> ddrescue to try and copy this drive to the spare drive.
>
> David
> PS Don't get the ddrescue parameters the wrong way round if you go that route...
>
>   


-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-12  1:01                 ` Bill Davidsen
@ 2007-11-17  6:31                   ` Chris Eddington
  2007-11-18 12:25                     ` David Greaves
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Eddington @ 2007-11-17  6:31 UTC (permalink / raw)
  To: Bill Davidsen, David Greaves; +Cc: linux-raid

Yes, these are exactly the kind of symptoms I've experienced.  I was
losing a drive here and there every couple of months (mostly the last
two drives, sdc and sdd), which I thought were cable problems (shut
down, re-plug the cables, restart, and it would always work after
adding/rebuilding the 4th disk).  But now my guess is that the
motherboard chipset is overheating (or maybe the drives).  I have an MSI
K9N Platinum with an AMD/Nvidia chipset that has 4 RAID ports plus 2
more from a separate chip.  The motherboard chipset comes with a wimpy
heatsink on it and is very hot to the touch.  I had been planning to
replace it but never got around to it.

I've been out of town this week, so I had someone image all three disks.
He used the Ghost disk-imaging application.  He said the third disk
reported media problems, and about 5% of the data was not recoverable
(sector errors).  Using these three copied drives, the array comes up
and xfs_repair still reports a bunch of inode repairs as before, though
the output is a bit different, maybe even a reduction in losses.  Most
importantly, the hpa_resize/media errors no longer occur.

Key questions:
- I assume ddrescue will do a much better job of dealing with read
errors when imaging a disk?  My colleague used Ghost, which is just a
copy tool.  I don't understand the capabilities of ddrescue on RAID
partitions that well.
- fdisk -l reports that all the drives are exactly the same size, with
exactly the same number of sectors, shown below.  I don't quite follow
the hpa_resize issue, but it appears the drives don't have hidden HPA
sectors - I guess?  Note that sdc is the original drive, while sda, sdb,
and sdd are the imaged drives.

So what do you recommend doing first?  Should I try xfs_repair on the
Ghost copy, or just re-copy myself using ddrescue?  Are there special
settings to ddrescue I should consider to verify/correct potential HPA
changes?

Thks,
Chris

Disk /dev/sda: 500.1 GB, 500107862016 bytes
/dev/sda1               1       60801   488384001   fd  Linux raid autodetect
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
/dev/sdb1               1       60801   488384001   fd  Linux raid autodetect
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
/dev/sdc1               1       60801   488384001   fd  Linux raid autodetect
Disk /dev/sdd: 500.1 GB, 500107862016 bytes
/dev/sdd1               1       60801   488384001   fd  Linux raid autodetect

Bill Davidsen wrote:
> David Greaves wrote:
>> Chris Eddington wrote:
>>  
>>> Yes, there is some kind of media error message in dmesg, below.  It is
>>> not random, it happens at exactly the same moments in each 
>>> xfs_repair -n
>>> run.
>>> Nov 11 09:48:25 altair kernel: [37043.300691]          res
>>> 51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error)
>>> Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: 
>>> ata_hpa_resize 1:
>>> sectors = 976773168, hpa_sectors = 976773168
>>> Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: 
>>> ata_hpa_resize 1:
>>> sectors = 976773168, hpa_sectors = 976773168
>>>     
>>
>> I'm not sure what an ata_hpa_resize error is...
>>   
>
> HPA = Hardware Protected Area.
>
> By any chance is this disk partitioned such that the partition size 
> includes the HPA? If it does, this sounds at least familiar, this 
> mailing list post may get you started: 
> http://osdir.com/ml/linux.ataraid/2005-09/msg00002.html
>
> In any case, run "fdisk -l" and look at the claimed total disk size 
> and the end point of the last partition. The HPA is not included in 
> the "disk size" so nothing should be trying to do so.
>> It probably explains the problems you've been having with the raid 
>> not 'just
>> recovering' though.
>>
>> I saw this:
>> http://www.linuxquestions.org/questions/linux-kernel-70/sata-issues-568894/ 
>>
>>   
>
> May be the same thing. Let us know what fdisk reports.
>>
>> What does smartctl say about your drive?
>>
>> IMO the spare drive is no longer useful for data recovery - you may 
>> want to use
>> ddrescue to try and copy this drive to the spare drive.
>>
>> David
>> PS Don't get the ddrescue parameters the wrong way round if you go 
>> that route...
>>
>>   
>
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid5 assemble after dual sata port failure
  2007-11-17  6:31                   ` Chris Eddington
@ 2007-11-18 12:25                     ` David Greaves
  0 siblings, 0 replies; 13+ messages in thread
From: David Greaves @ 2007-11-18 12:25 UTC (permalink / raw)
  To: Chris Eddington; +Cc: Bill Davidsen, linux-raid

Chris Eddington wrote:
> Key questions:
> - I assume ddrescue will do a much better job of correcting errors when
> imaging a disk?  My colleague used ghost which is just a copy tool.  I
> don't understand the capabilities of ddrescue on raid partitions that well.
ddrescue should do a *much* better job.
It knows nothing about raid and operates on the underlying device. It retries
bad sectors in a clever manner.


> - fdisk -l reports that all the drives are exactly the same size with
> exactly the same # sectors shown below.  I don't quite follow the
> hpa_resize issue, but it appears the drives don't have hidden HPA
> sectors - I guess?  Note that sdc is the original drive, where sda, sdb,
> and sdd are the imaged drives.
> 
> So what do you recommend to do first?  Should I try xfs_repair on the
> ghost copy,
No
> or just re-copy myself using ddrescue?
Yes.
> Are there special
> settings to ddrescue I should consider to verify/correct potential HPA
> changes?
Ideally just ddrescue the entire device to a file and use loopback.

For the faulty disk, if you have space, make a second copy and run
xfs_repair against that.  If the repair fails you can easily re-image the
good disks, but it may not be so easy to re-image the bad one.
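
A rough sketch of one way to do this - every path, device name, and loop
assignment below is illustrative (here imaging the member partitions rather
than the whole disks), so adjust to your layout:
  # image each member partition to a file, keeping a ddrescue log per disk
  ddrescue -r3 /dev/sda1 /backup/sda1.img /backup/sda1.log
  ddrescue -r3 /dev/sdb1 /backup/sdb1.img /backup/sdb1.log
  ddrescue -r3 /dev/sdd1 /backup/sdd1.img /backup/sdd1.log
  # attach the images as loop devices and assemble a copy of the array from them
  losetup /dev/loop0 /backup/sda1.img
  losetup /dev/loop1 /backup/sdb1.img
  losetup /dev/loop2 /backup/sdd1.img
  mdadm --assemble --run /dev/md1 /dev/loop0 /dev/loop1 /dev/loop2
  xfs_repair -n /dev/md1      # dry run against the copies before any real repair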

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-11-18 12:25 UTC | newest]

Thread overview: 13+ messages
-- links below jump to the message on this page --
2007-11-07 20:23 Raid5 assemble after dual sata port failure chrise
  -- strict thread matches above, loose matches on Subject: below --
2007-11-07 20:28 Chris Eddington
2007-11-08 10:33 ` David Greaves
2007-11-09 21:23   ` Chris Eddington
2007-11-10  0:28     ` Chris Eddington
2007-11-10  9:16       ` David Greaves
2007-11-10 18:46         ` Chris Eddington
2007-11-11 17:09           ` David Greaves
2007-11-11 17:41             ` Chris Eddington
2007-11-11 22:49               ` David Greaves
2007-11-12  1:01                 ` Bill Davidsen
2007-11-17  6:31                   ` Chris Eddington
2007-11-18 12:25                     ` David Greaves
