* RAID5 with 2 drive failure at the same time
@ 2013-01-31 10:42 Christoph Nelles
  2013-01-31 11:38 ` Robin Hill
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Nelles @ 2013-01-31 10:42 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2369 bytes --]

Hi,

I hope somebody on this ML can help me.

My RAID5 died last night during a rebuild when two drives failed (looks
like a sata_mv problem). The RAID5 was rebuilding because one of the two
drives had failed earlier; after running badblocks for 2 days, I re-added
it to the RAID.

The drives used are /dev/sdb1 through /dev/sdj1 (9 drives, RAID5); the
failed drives are sdj1 and sdg1.
The current situation is that I cannot start the RAID. I wanted to try
re-adding one of the drives, so I removed it beforehand, making it a
spare :\ The layout is as follows:

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       0        0        1      removed
       2       8      113        2      active sync   /dev/sdh1
       3       8       49        3      active sync   /dev/sdd1
       4       8      129        4      active sync   /dev/sdi1
       5       0        0        5      removed
       6       8       17        6      active sync   /dev/sdb1
       7       8       81        7      active sync   /dev/sdf1
       8       8       65        8      active sync   /dev/sde1

Re-adding fails with a simple message:
# mdadm -v /dev/md0 --re-add /dev/sdg1
mdadm: --re-add for /dev/sdg1 to /dev/md0 is not possible

I tried re-adding both failed drives at the same time, with the same result.

When examining the drives, sdj1 has the information from before the crash:
   Device Role : Active device 5
   Array State : AAAAAAAAA ('A' == active, '.' == missing)

sdg1 looks like this:
   Device Role : spare
   Array State : A.AAA.AAA ('A' == active, '.' == missing)

The others look like:
   Device Role : Active device 6
   Array State : A.AAA.AAA ('A' == active, '.' == missing)

So it looks like my repair attempts made sdg1 a spare :\ I attached the full
output to this mail.

Is there any way to restart the RAID from the information contained on
drive sdj1? Perhaps via an incremental build starting from one drive? Could
that work? If the RAID hadn't been rebuilding before the crash, I would
just recreate it with --assume-clean.

Thanks in advance for any help

Regards

Christoph Nelles
-- 
Christoph Nelles

E-Mail    : evilazrael@evilazrael.de
Jabber    : eazrael@evilazrael.net      ICQ       : 78819723

PGP-Key   : ID 0x424FB55B on subkeys.pgp.net
            or http://evilazrael.net/pgp.txt


[-- Attachment #2: mdadm_examine_sdg1.txt --]
[-- Type: text/plain, Size: 849 bytes --]

# mdadm --examine /dev/sdg1
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a1b16284:321fcdd0:93993ff5:832eee3a

    Update Time : Thu Jan 31 00:50:44 2013
       Checksum : 2391e873 - correct
         Events : 27697

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : spare
   Array State : A.AAA.AAA ('A' == active, '.' == missing)

[-- Attachment #3: mdadm_examine_sdj1.txt --]
[-- Type: text/plain, Size: 857 bytes --]

mdadm --examine /dev/sdj1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 7023df83:d890ce04:fc28652e:094adffe

    Update Time : Thu Jan 31 00:24:56 2013
       Checksum : 542f70be - correct
         Events : 27691

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 5
   Array State : AAAAAAAAA ('A' == active, '.' == missing)

[-- Attachment #4: mdadm_detail.txt --]
[-- Type: text/plain, Size: 1207 bytes --]

 mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
  Used Dev Size : -1
   Raid Devices : 9
  Total Devices : 7
    Persistence : Superblock is persistent

    Update Time : Thu Jan 31 10:36:28 2013
          State : active, FAILED, Not Started
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : router:0  (local to host router)
           UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
         Events : 27699

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       0        0        1      removed
       2       8      113        2      active sync   /dev/sdh1
       3       8       49        3      active sync   /dev/sdd1
       4       8      129        4      active sync   /dev/sdi1
       5       0        0        5      removed
       6       8       17        6      active sync   /dev/sdb1
       7       8       81        7      active sync   /dev/sdf1
       8       8       65        8      active sync   /dev/sde1

[-- Attachment #5: mdadm_examine_sdb1.txt --]
[-- Type: text/plain, Size: 857 bytes --]

/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 29c62776:e9c58ce6:1c6e9ab1:046ac411

    Update Time : Thu Jan 31 10:36:28 2013
       Checksum : be473d02 - correct
         Events : 27699

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 6
   Array State : A.AAA.AAA ('A' == active, '.' == missing)


* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 10:42 RAID5 with 2 drive failure at the same time Christoph Nelles
@ 2013-01-31 11:38 ` Robin Hill
  2013-01-31 13:15   ` Christoph Nelles
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Hill @ 2013-01-31 11:38 UTC (permalink / raw)
  To: Christoph Nelles; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3846 bytes --]

On Thu Jan 31, 2013 at 11:42:54 +0100, Christoph Nelles wrote:

> Hi,
> 
> I hope somebody on this ML can help me.
> 
> My RAID5 died last night during a rebuild when two drives failed (looks
> like a sata_mv problem). The RAID5 was rebuilding because one of the two
> drives had failed earlier; after running badblocks for 2 days, I re-added
> it to the RAID.
> 
Probably only one drive failed. If the rebuild was incomplete then a
single drive failure would cause the array to fail. Can you post the
errors? If the issue was a read failure then you'll need to fix that
before the array can be recovered properly.
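
To save digging, something along these lines usually pulls the relevant
entries out of the kernel log (or grep your syslog the same way if dmesg
has already wrapped):
# dmesg | grep -E 'ata|md/raid|sd[gj]'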

> The drives used are /dev/sdb1 through /dev/sdj1 (9 drives, RAID5); the
> failed drives are sdj1 and sdg1.
>
You also seriously need to look at moving to RAID6. Using RAID5 for a
9-drive array is not a good idea, and with 3TB drives it's absolutely
crazy. The odds of a single read error out of the 24TB that needs to be
read to recover a drive are not insignificant.
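
Once the array is healthy again, the conversion can be done online with a
grow operation, roughly along these lines (sdX1 is just a placeholder for a
tenth drive; check the mdadm man page for your version before running it):
# mdadm /dev/md0 --add /dev/sdX1
# mdadm --grow /dev/md0 --level=6 --raid-devices=10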

> The current situation is that I cannot start the RAID. I wanted to try
> re-adding one of the drives, so I removed it beforehand, making it a
> spare :\ The layout is as follows:
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       33        0      active sync   /dev/sdc1
>        1       0        0        1      removed
>        2       8      113        2      active sync   /dev/sdh1
>        3       8       49        3      active sync   /dev/sdd1
>        4       8      129        4      active sync   /dev/sdi1
>        5       0        0        5      removed
>        6       8       17        6      active sync   /dev/sdb1
>        7       8       81        7      active sync   /dev/sdf1
>        8       8       65        8      active sync   /dev/sde1
> 
> Re-adding fails with a simple message:
> # mdadm -v /dev/md0 --re-add /dev/sdg1
> mdadm: --re-add for /dev/sdg1 to /dev/md0 is not possible
> 
> I tried re-adding both failed drives at the same time, with the same result.
> 
That's good anyway - it prevented the loss of the existing metadata
which would definitely have reduced your chances of recovery.

> When examining the drives, sdj1 has the information from before the crash:
>    Device Role : Active device 5
>    Array State : AAAAAAAAA ('A' == active, '.' == missing)
> 
> sdg1 looks like this:
>    Device Role : spare
>    Array State : A.AAA.AAA ('A' == active, '.' == missing)
> 
> The others look like:
>    Device Role : Active device 6
>    Array State : A.AAA.AAA ('A' == active, '.' == missing)
> 
From the looks of it, sdg1 was the drive you were originally adding back
into the array, and sdj1 is the drive that failed part-way through the
rebuild?

> So it looks like my repair attempts made sdg1 a spare :\ I attached the full
> output to this mail.
> 
> Is there any way to restart the RAID from the information contained on
> drive sdj1? Perhaps via an incremental build starting from one drive? Could
> that work? If the RAID hadn't been rebuilding before the crash, I would
> just recreate it with --assume-clean.
> 
The first thing to try should _always_ be a forced assemble. Recreating
the array is very much a last-ditch move and should never be attempted
before asking the list for help (any mismatch in your create command, or
in the mdadm/kernel versions could cause data corruption). Stop the
array, then reassemble with the --force flag. It'll probably restart
with sdj1 added back into the array, and you can then add sdg1 back in
again and restart the rebuild.
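
Roughly (adjust the device list if any names have changed since your
--examine run):
# mdadm --stop /dev/md0
# mdadm --assemble --force /dev/md0 /dev/sd[b-j]1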

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]


* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 11:38 ` Robin Hill
@ 2013-01-31 13:15   ` Christoph Nelles
  2013-01-31 13:45     ` Robin Hill
  2013-01-31 17:46     ` Chris Murphy
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Nelles @ 2013-01-31 13:15 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4430 bytes --]

Hello Robin,

Thanks for the answers :)

Am 31.01.2013 12:38, schrieb Robin Hill:
> Probably only one drive failed. If the rebuild was incomplete then a
> single drive failure would cause the array to fail. Can you post the
> errors? If the issue was a read failure then you'll need to fix that
> before the array can be recovered properly.

All drives are available again, and the second failed device reports
UREs. I will run badblocks on that device before continuing.
I attached the kernel logs of the first error and of the second error; I
hope I filtered them reasonably.
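
For reference, this is roughly what I intend to run on it, read-only so the
data on the member partition stays untouched:
# smartctl -A /dev/sdj
# badblocks -sv /dev/sdj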

>> The drives used are /dev/sdb1 through /dev/sdj1 (9 drives, RAID5); the
>> failed drives are sdj1 and sdg1.
>>
> You also seriously need to look at moving to RAID6. Using RAID5 for a
> 9-drive array is not a good idea, and with 3TB drives it's absolutely
> crazy. The odds of a single read error out of the 24TB that needs to be
> read to recover a drive are not insignificant.

I already thought of switching to RAID6. If I can recover from the
current situation, I will do that.

>> The current situation is that I cannot start the RAID. I wanted to try
>> re-adding one of the drives, so I removed it beforehand, making it a
>> spare :\ The layout is as follows:
>>
>>     Number   Major   Minor   RaidDevice State
>>        0       8       33        0      active sync   /dev/sdc1
>>        1       0        0        1      removed
>>        2       8      113        2      active sync   /dev/sdh1
>>        3       8       49        3      active sync   /dev/sdd1
>>        4       8      129        4      active sync   /dev/sdi1
>>        5       0        0        5      removed
>>        6       8       17        6      active sync   /dev/sdb1
>>        7       8       81        7      active sync   /dev/sdf1
>>        8       8       65        8      active sync   /dev/sde1
>>
>> Re-adding fails with a simple message:
>> # mdadm -v /dev/md0 --re-add /dev/sdg1
>> mdadm: --re-add for /dev/sdg1 to /dev/md0 is not possible
>>
>> I tried re-adding both failed drives at the same time, with the same result.
>>
> That's good anyway - it prevented the loss of the existing metadata
> which would definitely have reduced your chances of recovery.
> 
>> When examining the drives, sdj1 has the information from before the crash:
>>    Device Role : Active device 5
>>    Array State : AAAAAAAAA ('A' == active, '.' == missing)
>>
>> sdg1 looks like this:
>>    Device Role : spare
>>    Array State : A.AAA.AAA ('A' == active, '.' == missing)
>>
>> The others look like:
>>    Device Role : Active device 6
>>    Array State : A.AAA.AAA ('A' == active, '.' == missing)
>>
> From the looks of it, sdg1 was the drive you were originally adding back
> into the array, and sdj1 is the drive that failed part-way through the
> rebuild?

Exactly. I am running badblocks on that device. SMART reports a
pending sector count of one :(

>> So it looks like my repair attempts made sdg1 a spare :\ I attached the full
>> output to this mail.
>>
>> Is there any way to restart the RAID from the information contained on
>> drive sdj1? Perhaps via an incremental build starting from one drive? Could
>> that work? If the RAID hadn't been rebuilding before the crash, I would
>> just recreate it with --assume-clean.
>>
> The first thing to try should _always_ be a forced assemble. Recreating
> the array is very much a last-ditch move and should never be attempted
> before asking the list for help (any mismatch in your create command, or
> in the mdadm/kernel versions could cause data corruption). Stop the
> array, then reassemble with the --force flag. It'll probably restart
> with sdj1 added back into the array, and you can then add sdg1 back in
> again and restart the rebuild.

So
# mdadm -A /dev/md0 -f /dev/sdc1 /dev/sdg1 /dev/sdh1 /dev/sdd1 \
/dev/sdi1 /dev/sdj1 /dev/sdb1 /dev/sdf1 /dev/sde1

should work? That would be a really simple solution :)


On sdj1 there is still a superblock from before the crash, while the
others have newer, updated superblocks. Is there any way to tell mdadm
that the RAID should be assembled with the older information from this
particular superblock?
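
For comparison, I have been looking at the event counters with something
like:
# mdadm --examine /dev/sd[b-j]1 | grep -E '/dev/|Events|Update Time'
sdj1 is at events 27691, sdg1 at 27697 and the active members at 27699.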

>
> Cheers,
>     Robin


Kind regards

Christoph Nelles


-- 
Christoph Nelles

E-Mail    : evilazrael@evilazrael.de
Jabber    : eazrael@evilazrael.net      ICQ       : 78819723

PGP-Key   : ID 0x424FB55B on subkeys.pgp.net
            or http://evilazrael.net/pgp.txt


[-- Attachment #2: error2.txt --]
[-- Type: text/plain, Size: 43030 bytes --]

Jan 28 19:50:44 router kernel: md: unbind<sdg1>
Jan 28 19:50:44 router kernel: md: export_rdev(sdg1)
Jan 30 18:17:57 router kernel:  sdg: sdg1
Jan 30 18:19:17 router kernel: md: bind<sdg1>
Jan 30 18:19:18 router kernel: RAID conf printout:
Jan 30 18:19:18 router kernel:  --- level:5 rd:9 wd:8
Jan 30 18:19:18 router kernel:  disk 0, o:1, dev:sdc1
Jan 30 18:19:18 router kernel:  disk 1, o:1, dev:sdg1
Jan 30 18:19:18 router kernel:  disk 2, o:1, dev:sdh1
Jan 30 18:19:18 router kernel:  disk 3, o:1, dev:sdd1
Jan 30 18:19:18 router kernel:  disk 4, o:1, dev:sdi1
Jan 30 18:19:18 router kernel:  disk 5, o:1, dev:sdj1
Jan 30 18:19:18 router kernel:  disk 6, o:1, dev:sdb1
Jan 30 18:19:18 router kernel:  disk 7, o:1, dev:sdf1
Jan 30 18:19:18 router kernel:  disk 8, o:1, dev:sde1
Jan 30 18:19:18 router kernel: md: recovery of RAID array md0
Jan 30 18:19:18 router kernel: md: minimum _guaranteed_  speed: 500000 KB/sec/disk.
Jan 30 18:19:18 router kernel: md: using maximum available idle IO bandwidth (but not more than 5000000 KB/sec) for recovery.
Jan 30 18:19:18 router kernel: md: using 128k window, over a total of 2930264320k.
Jan 31 00:34:12 router kernel: ata14.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6
Jan 31 00:34:12 router kernel: ata14.00: edma_err_cause=02000084 pp_flags=00000003, dev error, EDMA self-disable
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:00:08:d3:45/04:00:21:01:00/40 tag 0 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:08:08:d7:45/04:00:21:01:00/40 tag 1 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:10:08:db:45/04:00:21:01:00/40 tag 2 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:18:08:df:45/04:00:21:01:00/40 tag 3 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:20:08:e3:45/04:00:21:01:00/40 tag 4 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:28:08:13:46/04:00:21:01:00/40 tag 5 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:30:08:17:46/04:00:21:01:00/40 tag 6 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:38:08:2b:46/04:00:21:01:00/40 tag 7 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:40:08:2f:46/04:00:21:01:00/40 tag 8 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:48:08:33:46/04:00:21:01:00/40 tag 9 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:12 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:12 router kernel: ata14.00: cmd 60/00:50:08:bb:45/04:00:21:01:00/40 tag 10 ncq 524288 in
Jan 31 00:34:12 router kernel:          res 41/40:00:a0:bd:45/00:00:21:01:00/40 Emask 0x409 (media error) <F>
Jan 31 00:34:12 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:12 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:58:08:bf:45/04:00:21:01:00/40 tag 11 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:60:08:c3:45/04:00:21:01:00/40 tag 12 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:68:08:c7:45/04:00:21:01:00/40 tag 13 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:70:08:cb:45/04:00:21:01:00/40 tag 14 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:78:08:cf:45/04:00:21:01:00/40 tag 15 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:80:08:e7:45/04:00:21:01:00/40 tag 16 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:88:08:eb:45/04:00:21:01:00/40 tag 17 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:90:08:ef:45/04:00:21:01:00/40 tag 18 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:98:08:f3:45/04:00:21:01:00/40 tag 19 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:a0:08:f7:45/04:00:21:01:00/40 tag 20 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:a8:08:fb:45/04:00:21:01:00/40 tag 21 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:b0:08:ff:45/04:00:21:01:00/40 tag 22 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:b8:08:03:46/04:00:21:01:00/40 tag 23 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:c0:08:07:46/04:00:21:01:00/40 tag 24 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:c8:08:0b:46/04:00:21:01:00/40 tag 25 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:d0:08:0f:46/04:00:21:01:00/40 tag 26 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:d8:08:1b:46/04:00:21:01:00/40 tag 27 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:e0:08:1f:46/04:00:21:01:00/40 tag 28 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:e8:08:23:46/04:00:21:01:00/40 tag 29 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14.00: failed command: READ FPDMA QUEUED
Jan 31 00:34:13 router kernel: ata14.00: cmd 60/00:f0:08:27:46/04:00:21:01:00/40 tag 30 ncq 524288 in
Jan 31 00:34:13 router kernel:          res 41/40:44:08:2f:46/40:00:21:01:00/40 Emask 0x9 (media error)
Jan 31 00:34:13 router kernel: ata14.00: status: { DRDY ERR }
Jan 31 00:34:13 router kernel: ata14.00: error: { UNC }
Jan 31 00:34:13 router kernel: ata14: hard resetting link
Jan 31 00:34:13 router kernel: ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 31 00:34:13 router kernel: ata14.00: failed to get Identify Device Data, Emask 0x1
Jan 31 00:34:13 router kernel: ata14.00: failed to get Identify Device Data, Emask 0x1
Jan 31 00:34:13 router kernel: ata14.00: configured for UDMA/133
Jan 31 00:34:13 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:13 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:13 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:13 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:13 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:13 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:13 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:13 router kernel:         21 46 2f 08 
Jan 31 00:34:13 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:13 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:13 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:13 router kernel: Read(16): 88 00 00 00 00 01 21 45 d3 08 00 00 04 00 00 00
Jan 31 00:34:13 router kernel: end_request: I/O error, dev sdj, sector 4853191432
Jan 31 00:34:13 router kernel: md/raid:md0: read error not correctable (sector 4853189384 on sdj1).
Jan 31 00:34:13 router kernel: md/raid:md0: Disk failure on sdj1, disabling device.
Jan 31 00:34:13 router kernel: md/raid:md0: Operation continuing on 7 devices.
Jan 31 00:34:13 router kernel: md/raid:md0: read error not correctable (sector 4853189392 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189400 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189408 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189416 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189424 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189432 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189440 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189448 on sdj1).
Jan 31 00:34:14 router kernel: md/raid:md0: read error not correctable (sector 4853189456 on sdj1).
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 45 d7 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853192456
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 45 db 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853193480
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 45 df 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853194504
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 45 e3 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853195528
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 46 13 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853207816
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 46 17 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853208840
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 46 2b 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853213960
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:14 router kernel: Read(16): 88 00 00 00 00 01 21 46 2f 08 00 00 04 00 00 00
Jan 31 00:34:14 router kernel: end_request: I/O error, dev sdj, sector 4853214984
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:14 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:14 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:14 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:14 router kernel:         21 46 2f 08 
Jan 31 00:34:14 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 46 33 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: end_request: I/O error, dev sdj, sector 4853216008
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 45 bd a0 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 bb 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 bf 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 c3 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 c7 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 cb 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 cf 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 e7 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 eb 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:15 router kernel: Read(16): 88 00 00 00 00 01 21 45 ef 08 00 00 04 00 00 00
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:15 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:15 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:15 router kernel:         21 46 2f 08 
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:15 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:15 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 45 f3 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 45 f7 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 45 fb 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 45 ff 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 46 03 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 46 07 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 46 0b 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 46 0f 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 46 1b 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 46 1f 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:16 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:16 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:16 router kernel:         21 46 2f 08 
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:16 router kernel: Read(16): 88 00 00 00 00 01 21 46 23 08 00 00 04 00 00 00
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj] Unhandled sense code
Jan 31 00:34:16 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:16 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 00:34:17 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:17 router kernel: Sense Key : Medium Error [current] [descriptor]
Jan 31 00:34:17 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 31 00:34:17 router kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 01 
Jan 31 00:34:17 router kernel:         21 46 2f 08 
Jan 31 00:34:17 router kernel: sd 13:0:0:0: [sdj]  
Jan 31 00:34:17 router kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Jan 31 00:34:17 router kernel: sd 13:0:0:0: [sdj] CDB: 
Jan 31 00:34:17 router kernel: Read(16): 88 00 00 00 00 01 21 46 27 08 00 00 04 00 00 00
Jan 31 00:34:17 router kernel: ata14: EH complete
Jan 31 00:34:17 router kernel: md: md0: recovery done.
Jan 31 00:34:17 router kernel: RAID conf printout:
Jan 31 00:34:17 router kernel:  --- level:5 rd:9 wd:7
Jan 31 00:34:17 router kernel:  disk 0, o:1, dev:sdc1
Jan 31 00:34:17 router kernel:  disk 1, o:1, dev:sdg1
Jan 31 00:34:17 router kernel:  disk 2, o:1, dev:sdh1
Jan 31 00:34:17 router kernel:  disk 3, o:1, dev:sdd1
Jan 31 00:34:17 router kernel:  disk 4, o:1, dev:sdi1
Jan 31 00:34:17 router kernel:  disk 5, o:0, dev:sdj1
Jan 31 00:34:17 router kernel:  disk 6, o:1, dev:sdb1
Jan 31 00:34:17 router kernel:  disk 7, o:1, dev:sdf1
Jan 31 00:34:17 router kernel:  disk 8, o:1, dev:sde1

[-- Attachment #3: error1.txt --]
[-- Type: text/plain, Size: 3169 bytes --]

Jan 28 00:23:05 router kernel: ata11: illegal qc_active transition (00000001->04000001)
Jan 28 00:23:36 router kernel: ata11.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Jan 28 00:23:36 router kernel: ata11.00: failed command: WRITE FPDMA QUEUED
Jan 28 00:23:36 router kernel: ata11.00: cmd 61/30:00:50:55:b2/00:00:36:01:00/40 tag 0 ncq 24576 out
Jan 28 00:23:36 router kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 28 00:23:36 router kernel: ata11.00: status: { DRDY }
Jan 28 00:23:36 router kernel: ata11: hard resetting link
Jan 28 00:23:36 router kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 28 00:23:36 router kernel: ata11.00: failed to get Identify Device Data, Emask 0x1
Jan 28 00:23:36 router kernel: ata11.00: failed to get Identify Device Data, Emask 0x1
Jan 28 00:23:36 router kernel: ata11.00: configured for UDMA/133
Jan 28 00:23:36 router kernel: ata11.00: device reported invalid CHS sector 0
Jan 28 00:23:36 router kernel: sd 10:0:0:0: [sdg]  
Jan 28 00:23:36 router kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 28 00:23:36 router kernel: sd 10:0:0:0: [sdg]  
Jan 28 00:23:36 router kernel: Sense Key : Aborted Command [current] [descriptor]
Jan 28 00:23:36 router kernel: Descriptor sense data with sense descriptors (in hex):
Jan 28 00:23:36 router kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Jan 28 00:23:36 router kernel:         00 00 00 00 
Jan 28 00:23:36 router kernel: sd 10:0:0:0: [sdg]  
Jan 28 00:23:36 router kernel: Add. Sense: No additional sense information
Jan 28 00:23:36 router kernel: sd 10:0:0:0: [sdg] CDB: 
Jan 28 00:23:36 router kernel: Write(16): 8a 00 00 00 00 01 36 b2 55 50 00 00 00 30 00 00
Jan 28 00:23:36 router kernel: end_request: I/O error, dev sdg, sector 5212624208
Jan 28 00:23:36 router kernel: ata11: EH complete
Jan 28 00:23:36 router kernel: md/raid:md0: Disk failure on sdg1, disabling device.
Jan 28 00:23:36 router kernel: md/raid:md0: Operation continuing on 8 devices.
Jan 28 00:23:36 router kernel: RAID conf printout:
Jan 28 00:23:36 router kernel:  --- level:5 rd:9 wd:8
Jan 28 00:23:36 router kernel:  disk 0, o:1, dev:sdc1
Jan 28 00:23:36 router kernel:  disk 1, o:0, dev:sdg1
Jan 28 00:23:36 router kernel:  disk 2, o:1, dev:sdh1
Jan 28 00:23:36 router kernel:  disk 3, o:1, dev:sdd1
Jan 28 00:23:36 router kernel:  disk 4, o:1, dev:sdi1
Jan 28 00:23:36 router kernel:  disk 5, o:1, dev:sdj1
Jan 28 00:23:36 router kernel:  disk 6, o:1, dev:sdb1
Jan 28 00:23:36 router kernel:  disk 7, o:1, dev:sdf1
Jan 28 00:23:36 router kernel:  disk 8, o:1, dev:sde1
Jan 28 00:23:36 router kernel: RAID conf printout:
Jan 28 00:23:36 router kernel:  --- level:5 rd:9 wd:8
Jan 28 00:23:36 router kernel:  disk 0, o:1, dev:sdc1
Jan 28 00:23:36 router kernel:  disk 2, o:1, dev:sdh1
Jan 28 00:23:36 router kernel:  disk 3, o:1, dev:sdd1
Jan 28 00:23:36 router kernel:  disk 4, o:1, dev:sdi1
Jan 28 00:23:36 router kernel:  disk 5, o:1, dev:sdj1
Jan 28 00:23:36 router kernel:  disk 6, o:1, dev:sdb1
Jan 28 00:23:36 router kernel:  disk 7, o:1, dev:sdf1
Jan 28 00:23:36 router kernel:  disk 8, o:1, dev:sde1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 13:15   ` Christoph Nelles
@ 2013-01-31 13:45     ` Robin Hill
  2013-01-31 17:46     ` Chris Murphy
  1 sibling, 0 replies; 21+ messages in thread
From: Robin Hill @ 2013-01-31 13:45 UTC (permalink / raw)
  To: Christoph Nelles; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4850 bytes --]

On Thu Jan 31, 2013 at 02:15:00PM +0100, Christoph Nelles wrote:

> Hello Robin,
> 
> thanks for the answers :)
> 
> Am 31.01.2013 12:38, schrieb Robin Hill:
> > Probably only one drive failed. If the rebuild was incomplete then a
> > single drive failure would cause the array to fail. Can you post the
> > errors? If the issue was a read failure then you'll need to fix that
> > before the array can be recovered properly.
> 
> All drives are available again. And the second failed device reports
> UREs. I will run badblocks on that device before continuing.
> I attached the kernel logs of the first error and of the second error. I
> hope i filtered them reasonably.
> 
Okay, those show that sdj had a read error during the rebuild. That
would have kicked the drive and failed the rebuild (and the array).

Your earlier error with sdg is a different issue. It looks to have timed
out on a write and then errored again when resetting the drive.

If you're using standard desktop drives then you may be running into
issues with the drive timeout being longer than the kernel's. You need
to reset one or the other to ensure that the drive times out (and is
available for subsequent commands) before the kernel does. Most current
consumer drives don't allow resetting the timeout, but it's worth trying
that first before changing the kernel timeout. For each
drive, do:
    smartctl -l scterc,70,70 /dev/sdX
        || echo 180 > /sys/block/sdX/device/timeout

That'll need to be run on every boot (or whenever a drive is
hot-plugged).

> >> When examining the drives, sdj1 has the information from before the crash:
> >>    Device Role : Active device 5
> >>    Array State : AAAAAAAAA ('A' == active, '.' == missing)
> >>
> >> sdg1 looks like this
> >>    Device Role : spare
> >>    Array State : A.AAA.AAA ('A' == active, '.' == missing)
> >>
> >> The other look like
> >>    Device Role : Active device 6
> >>    Array State : A.AAA.AAA ('A' == active, '.' == missing)
> >>
> > From the looks of it, sdg1 was the drive you were originally adding back
> > into the array, and sdj1 is the drive that failed part-way through the
> > rebuild?
> 
> Exactly. I am running badblocks on that device. SMART reports one
> "Pending Sector Count" :(
> 
That means you'll end up with some corruption. Whether that affects any
data or not will depend on exactly where it is.

> >> So looks that my repair tries made sdg1 a spare :\ I attached the full
> >> output to this mail.
> >>
> >> Is there anyway to restart the RAID from the information contained in
> >> drive sdj1? Perhaps via Incremental Build starting from one drive? Could
> >> that work? If the RAID wouldn't have been rebuilding before the crash, i
> >> would just recreate it with --assume-clean.
> >>
> > The first thing to try should _always_ be a forced assemble. Recreating
> > the array is very much a last-ditch move and should never be attempted
> > before asking the list for help (any mismatch in your create command, or
> > in the mdadm/kernel versions could cause data corruption). Stop the
> > array, then reassemble with the --force flag. It'll probably restart
> > with sdj1 added back into the array, and you can then add sdg1 back in
> > again and restart the rebuild.
> 
> So
> # mdadm -A /dev/md0 -f /dev/sdc1 /dev/sdg1 /dev/sdh1 /dev/sdd1 \
> /dev/sdi1 /dev/sdj1 /dev/sdb1 /dev/sdf1 /dev/sde1
> 
> should work? That would be a really simple solution :)
> 
> 
> On sdj1 there is still a superblock from before the crash, while the
> others have newer updated superblocks. are there any means to say that
> the RAID should be assembled with the older information from this
> particular superblock?
> 
That'll be done automatically - mdadm looks at the event counters for
all the disks and assembles the array using the best set (if possible).
As sdj failed during the rebuild, taking down the array, there shouldn't
be any issues with doing this.

However, given that you have unreadable blocks on sdj, you'll need to
sort that out first (or you'll never be able to complete the rebuild).
Use ddrescue to copy the whole of sdj onto sdg (barring the unreadable
blocks). You can then force assemble the array using the other drives:

    mdadm -A /dev/md0 -f /dev/sdc1 /dev/sdg1 /dev/sdh1 /dev/sdd1 \
        /dev/sdi1 /dev/sdb1 /dev/sdf1 /dev/sde1

If that starts up okay then you can add sdj1 back into the array. You'll
need to run a fsck on the array afterwards to pick up what corruption
there's been (fsck -f /dev/md0).
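
For the ddrescue step, something like this should do (a sketch assuming
GNU ddrescue; the map file name is arbitrary):

    # copy what's readable from sdj onto sdg, noting bad areas in the map
    ddrescue -f /dev/sdj /dev/sdg /root/sdj-to-sdg.map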

Good luck,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 13:15   ` Christoph Nelles
  2013-01-31 13:45     ` Robin Hill
@ 2013-01-31 17:46     ` Chris Murphy
       [not found]       ` <510ABC1E.6060308@evilazrael.de>
  2013-01-31 22:10       ` Robin Hill
  1 sibling, 2 replies; 21+ messages in thread
From: Chris Murphy @ 2013-01-31 17:46 UTC (permalink / raw)
  To: Christoph Nelles; +Cc: linux-raid


On Jan 31, 2013, at 6:15 AM, Christoph Nelles <evilazrael@evilazrael.de> wrote:

> All drives are available again. And the second failed device reports
> UREs. I will run badblocks on that device before continuing.
> I attached the kernel logs of the first error and of the second error. I
> hope i filtered them reasonably.

This looks like a write error, resulting in md immediately booting the drive. There's little point in using this drive again.

Jan 28 00:23:36 router kernel: Write(16): 8a 00 00 00 00 01 36 b2 55 50 00 00 00 30 00 00
Jan 28 00:23:36 router kernel: end_request: I/O error, dev sdg, sector 5212624208

What does smartctl -a return for this drive?


> Exactly. I am running badblocks on that device. SMART reports one
> "Pending Sector Count" :(

I'm unclear on the efficacy of badblocks for testing. I'd use smartctl -t long and then -a to see if there are sector problems and at what LBA; and for removing bad blocks (force a remap) I'd use either dd zeros with e.g. bs=1M, or I'd use ATA Secure Erase which is faster.
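
Roughly, that would be something like this (a sketch; /dev/sdX and the throwaway password "p" are placeholders, and the secure erase assumes the drive isn't security-frozen):

    smartctl -t long /dev/sdX             # extended self-test, takes hours
    smartctl -a /dev/sdX                  # afterwards: self-test log and pending sector count
    dd if=/dev/zero of=/dev/sdX bs=1M     # zero the whole drive to force remaps
    # or, faster, ATA Secure Erase via hdparm:
    hdparm --user-master u --security-set-pass p /dev/sdX
    hdparm --user-master u --security-erase p /dev/sdX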

If you use the badblocks map when formatting a drive, e.g. using mkfs.ext4 -c, then it would allow you to use this disk but not in RAID. On top of raid, md gets the write error before the file system does, and boots the drive out of the array. Or on read error attempts to correct it. And even as a standalone drive do you really want to use a drive that can't remap future bad sectors?

Chris Murphy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
       [not found]       ` <510ABC1E.6060308@evilazrael.de>
@ 2013-01-31 21:19         ` Chris Murphy
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Murphy @ 2013-01-31 21:19 UTC (permalink / raw)
  To: Christoph Nelles; +Cc: linux-raid@vger.kernel.org Raid


Please reply to the list, not just to me personally.


On Jan 31, 2013, at 11:46 AM, Christoph Nelles <evilazrael@evilazrael.de> wrote:
> 
> Am 31.01.2013 18:46, schrieb Chris Murphy:
>> 
>> This looks like a write error, resulting in md immediately booting
>> the drive. There's little point in using this drive again.
> 
> Sure, but i think in the current half-rebuilt state, i will need the
> striped data on that disk, am i correct about that?

Only if you have no backup, and you want to extract what you can.

> Okay, a badblocks running in read-only mode may not be better than just
> dd'ing the drive to /dev/null, but at least I can see if there are more
> read errors.

I don't see the point at all of learning about more read errors. This changes your strategy how? The array has failed. Either you want to backup what you can, or you don't. In either case, ultimately you're going to be creating a new array from scratch after you've tested and replaced the underlying drives.

> I am currently checking all drives and if that succeeds, i
> will try this safe write-read test badblocks offers. Just want to be
> safe that won't do any additional damage.

I think this helps you exactly zero.

> My current goal is to get access to all of my data. Or at least to as
> much data as possible. Then I can replace HDDs and upgrade to RAID6.

I'd stop wasting time with badblocks. Either the array will assemble, or it won't. Maybe someone will respond on whether assuming clean is a good idea to avoid any writes to the disks (writes have a decent chance of causing sdg to be bounced again); then mount the file system read only and extract what you can.
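
If it does come up, the read-only mount would be something like:

    mount -o ro /dev/md0 /mnt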

It seems to me data extraction is the priority. Once you have the backup, you can test your individual array components, not now.

Chris Murphy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 17:46     ` Chris Murphy
       [not found]       ` <510ABC1E.6060308@evilazrael.de>
@ 2013-01-31 22:10       ` Robin Hill
  2013-01-31 22:40         ` Chris Murphy
  1 sibling, 1 reply; 21+ messages in thread
From: Robin Hill @ 2013-01-31 22:10 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Christoph Nelles, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2842 bytes --]

On Thu Jan 31, 2013 at 10:46:17 -0700, Chris Murphy wrote:

> 
> On Jan 31, 2013, at 6:15 AM, Christoph Nelles <evilazrael@evilazrael.de> wrote:
> 
> > All drives are available again. And the second failed device reports
> > UREs. I will run badblocks on that device before continuing.
> > I attached the kernel logs of the first error and of the second error. I
> > hope i filtered them reasonably.
> 
> This looks like a write error, resulting in md immediately booting the
> drive. There's little point in using this drive again.
> 
> Jan 28 00:23:36 router kernel: Write(16): 8a 00 00 00 00 01 36 b2 55 50 00 00 00 30 00 00
> Jan 28 00:23:36 router kernel: end_request: I/O error, dev sdg, sector 5212624208
> 
It's definitely a write error, yes. If there's nothing further back in
the log (e.g. a read error that's caused a rewrite to take place) then
this would definitely warn against the drive, but could just be a
transient error (or a controller problem). If there is a read error
further back then I'd blame it on timeout issues, with the drive still
trying to complete the read operation while the kernel's timed out and
trying to send a write.

> What does smartctl -a return for this drive?
> 
> 
> > Exactly. I am running badblocks on that device. SMART reports one
> > "Pending Sector Count" :(
> 
> I'm unclear on the efficacy of badblocks for testing. I'd use smartctl
> -t long and then -a to see if there are sector problems and at what
> LBA; and for removing bad blocks (force a remap) I'd use either dd
> zeros with e.g. bs=1M, or I'd use ATA Secure Erase which is faster.
> 
I don't usually bother with read tests - as you say, they're not
terribly useful. If the data's useful then just use ddrescue to get what
you can, otherwise just write-test it. I usually do a full destructive
badblocks test (I've found cases where zeros write fine but other
patterns fail), followed by a long SMART test.
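
In other words, something along these lines (destroys everything on the
disk, so only for a drive with no data you care about):

    badblocks -svw /dev/sdX    # destructive write/read test, all four patterns
    smartctl -t long /dev/sdX  # then an extended SMART self-test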

> If you use the badblocks map when formatting a drive, e.g. using
> mkfs.ext4 -c, then it would allow you to use this disk but not in
> RAID. On top of raid, md gets the write error before the file system
> does, and boots the drive out of the array. Or on read error attempts
> to correct it. And even as a standalone drive do you really want to
> use a drive that can't remap future bad sectors?
> 
Not a chance I'd use it if it's actually failing to remap bad sectors,
no. Only had that with one drive so far though (out of several hundred),
most get failed out after getting more than a handful of remapped
sectors.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 22:10       ` Robin Hill
@ 2013-01-31 22:40         ` Chris Murphy
  2013-01-31 22:48           ` Chris Murphy
  2013-02-01 13:34           ` Robin Hill
  0 siblings, 2 replies; 21+ messages in thread
From: Chris Murphy @ 2013-01-31 22:40 UTC (permalink / raw)
  To: Robin Hill; +Cc: Christoph Nelles, linux-raid


On Jan 31, 2013, at 3:10 PM, Robin Hill <robin@robinhill.me.uk> wrote:

> If there is a read error
> further back then I'd blame it on timeout issues, with the drive still
> trying to complete the read operation while the kernel's timed out and
> trying to send a write.

I think we need the whole log for the time before the start of the error1.txt file provided previously. And also I'd like to know which /dev/ device was the first to have a problem, the one that instigated the rebuild. And whether, during the rebuild, the file system was mounted rw, and if any writes were done at all. If so, that probably nixes --assume-clean. If it was rebuilding and not written to from the file system, the disk being rebuilt shouldn't actually be out of sync with the array state.

The disk that needs spot sector repairs is the one with UREs, I think that's sdj1. If that disk is dd'd to another disk, the new disk won't produce UREs for sectors missing data, and the chunks comprised of those sectors won't get rebuilt by md.

So the disk to possibly dd to another is the one with the write error, sdg1. But only if the idea is to not use --assume-clean. That way a reassemble can rebuild, and not encounter another write error on that drive.

> Not a chance I'd use it if it's actually failing to remap bad sectors,
> no. Only had that with one drive so far though (out of several hundred),
> most get failed out after getting more than a handful of remapped
> sectors.

I think I see a use case for badblocks destructive writes if the disk doesn't support enhanced secure erase (which writes a pattern, not just zeros). Or on laptops where it's not possible to get a disk to reset on sleep, allowing it to be unfrozen for the purposes of using secure erase. But if available, secure erase is faster and wipes all sectors, even those without LBAs. For sure with SSDs it's what should be used.


Chris Murphy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 22:40         ` Chris Murphy
@ 2013-01-31 22:48           ` Chris Murphy
  2013-02-01 13:34           ` Robin Hill
  1 sibling, 0 replies; 21+ messages in thread
From: Chris Murphy @ 2013-01-31 22:48 UTC (permalink / raw)
  To: Robin Hill; +Cc: Christoph Nelles, linux-raid


On Jan 31, 2013, at 3:40 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> So the disk to possibly dd to another is the one with the write error, sdg1. But only if the idea is to not use --assume-clean. That way a reassemble can rebuild, and not encounter another write error on that drive.

I meant resync.

But isn't a resync a state where md assumes parity is dirty and must be recreated? If so, what happens when sdj1 encounters UREs? If there's untrusted parity, does md just use what it has, or does it stop the array?



Chris Murphy


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-01-31 22:40         ` Chris Murphy
  2013-01-31 22:48           ` Chris Murphy
@ 2013-02-01 13:34           ` Robin Hill
  2013-02-01 17:27             ` Chris Murphy
  1 sibling, 1 reply; 21+ messages in thread
From: Robin Hill @ 2013-02-01 13:34 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Robin Hill, Christoph Nelles, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3203 bytes --]

On Thu Jan 31, 2013 at 03:40:00PM -0700, Chris Murphy wrote:

> 
> On Jan 31, 2013, at 3:10 PM, Robin Hill <robin@robinhill.me.uk> wrote:
> 
> > If there is a read error
> > further back then I'd blame it on timeout issues, with the drive still
> > trying to complete the read operation while the kernel's timed out and
> > trying to send a write.
> 
> I think we need the whole log for the time before the start of the
> error1.txt file provided previously. And also I'd like to know which
> /dev/ device was the first to have a problem, that instigated the
> rebuild. And if during the rebuild if the file system was mounted rw,
> and if any writes were done at all. If so, that probably nixes
> --assume-clean. If it was rebuilding and not written to from the file
> system, the disk being rebuilt shouldn't actually be out of sync with
> the array state.
> 
The timestamps on the logs show that sdg was the first to have a
problem. It'd also be useful to know whether sdg has been rewritten at
all since then (i.e. whether the testing was destructive or not), and
whether or not the array was written to at all since the failure of sdg.

> The disk that needs spot sector repairs is the one with UREs, I think
> that's sdj1. If that disk is dd'd to another disk, the new disk won't
> produce UREs for sectors missing data, and the chunks comprised of
> those sectors won't get rebuilt by md.
> 
> So the disk to possibly dd to another is the one with the write error,
> sdg1. But only if the idea is to not use --assume-clean. That way a
> reassemble can rebuild, and not encounter another write error on that
> drive.
> 
Yes, if sdg still contains valid array data (and the array wasn't
written since then) then it would definitely make more sense to recreate
the array using it, leaving sdj out for now. That'll require more work
checking mdadm versions and data offset values though. That'll avoid the
issues with the unreadable blocks on sdj.

> > Not a chance I'd use it if it's actually failing to remap bad sectors,
> > no. Only had that with one drive so far though (out of several hundred),
> > most get failed out after getting more than a handful of remapped
> > sectors.
> 
> I think I see a use case for badblocks destructive writes if the disk
> doesn't support enhanced secure erase (which writes a pattern not just
> zeros). Of on laptops where it's not possible to get a disk to reset
> on sleep, allowing it to be unfrozen for the purposes of using secure
> erase. But if available, secure erase is faster and wipes all sectors
> even those without LBAs. For sure with SSDs it's what should be used.
> 
I prefer badblocks myself - I can see exactly what it's doing and what
errors are seen. With secure erase you're dependent on the firmware
internals to tell you what's actually going on (and, depending on the
nature of the errors you're getting, this may already be suspect).

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-01 13:34           ` Robin Hill
@ 2013-02-01 17:27             ` Chris Murphy
  2013-02-01 19:57               ` Robin Hill
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Murphy @ 2013-02-01 17:27 UTC (permalink / raw)
  To: Robin Hill; +Cc: Christoph Nelles, linux-raid


On Feb 1, 2013, at 6:34 AM, Robin Hill <robin@robinhill.me.uk> wrote:
> It'd also be useful to know whether sdg has been rewritten at
> all since then (i.e. whether the testing was destructive or not), and
> whether or not the array was written to at all since the failure of sdg.

OP needs to reply back.

Also I'd like to know what model disks these are, if they're AF or not.

>> Yes, if sdg still contains valid array data (and the array wasn't
> written since then) then it would definitely make more sense to recreate
> the array using it, leaving sdj out for now. That'll require more work
> checking mdadm versions and data offset values though. That'll avoid the
> issues with the unreadable blocks on sdj.

Here's an idea. One possibility is to use dd to read the sector on sdg1 that error1.txt reported with the write error, to a file, and see if there's a read error. If not, rewrite that data back to the same sector and see if there's a write error. If not, attempt to force assemble assume clean, get the array up in degraded mode, and do a non-destructive fsck. If that's OK, just take a backup immediately. Then sdj can be destructively written to, to force bad sectors there to be removed for reserves, but still needs a smart extended offline test to confirm; and then possibly reused and rebuilt.
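
For example (a sketch, using the whole-disk sector number from error1.txt and assuming 512-byte logical sectors):

    # read the suspect sector into a file
    dd if=/dev/sdg of=/tmp/sector.bin bs=512 skip=5212624208 count=1
    # if that read cleanly, write the same data back to the same sector
    dd if=/tmp/sector.bin of=/dev/sdg bs=512 seek=5212624208 count=1 conv=fsync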

> I prefer badblocks myself - I can see exactly what it's doing and what
> errors are seen. With secure erase you're dependent on the firmware
> internals to tell you what's actually going on (and, depending on the
> nature of the errors you're getting, this may already be suspect).

The firmware is always a go-between; you can't actually get around it. Bad sectors are entirely the domain of the drive firmware, so for that purpose I don't see an advantage of an external program over secure erase and SMART testing. If it lies on either of those, it'll lie to badblocks.

Where I can see the usefulness of badblocks, though maybe not more so than other tools, is that it would show non-disk-related errors like UDMA/CRC errors from controller or cable problems, whereas the entire duration of secure erase and SMART testing is strictly internal to the drive.


Chris Murphy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-01 17:27             ` Chris Murphy
@ 2013-02-01 19:57               ` Robin Hill
  2013-02-02  0:30                 ` Christoph Nelles
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Hill @ 2013-02-01 19:57 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Christoph Nelles, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1996 bytes --]

On Fri Feb 01, 2013 at 10:27:57 -0700, Chris Murphy wrote:

> 
> On Feb 1, 2013, at 6:34 AM, Robin Hill <robin@robinhill.me.uk> wrote:
> > It'd also be useful to know whether sdg has been rewritten at
> > all since then (i.e. whether the testing was destructive or not), and
> > whether or not the array was written to at all since the failure of sdg.
> 
> OP needs to reply back.
> 
> Also I'd like to know what model disks these are, if they're AF or not.
> 
> >> Yes, if sdg still contains valid array data (and the array wasn't
> > written since then) then it would definitely make more sense to recreate
> > the array using it, leaving sdj out for now. That'll require more work
> > checking mdadm versions and data offset values though. That'll avoid the
> > issues with the unreadable blocks on sdj.
> 
> Here's an idea. One possibility is to use dd to read the sector on
> sdg1 that error1.txt reported with the write error, to a file, and see
> if there's a read error. If not, rewrite that data back to the same
> sector and see if there's a write error. If not, attempt to force
> assemble assume clean, get the array up in degraded mode, and do a
> non-destructive fsck. If that's OK, just take a backup immediately.
> Then sdj can be destructively written to, to force bad sectors there
> to be removed for reserves, but still needs a smart extended offline
> test to confirm; and then possibly reused and rebuilt.
> 
That won't work. He's already lost the metadata on sdg1 by trying to
rebuild it in the first place, so a force assemble won't work. He'd need
to recreate the array instead. Otherwise yes, that would sound to be the
best option (assuming there's no other read errors on the other disks).

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-01 19:57               ` Robin Hill
@ 2013-02-02  0:30                 ` Christoph Nelles
  2013-02-02  1:24                   ` Phil Turmel
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Nelles @ 2013-02-02  0:30 UTC (permalink / raw)
  To: linux-raid

Hello Robin,
hello Chris,

thanks for the help, the ideas and the discussion so far. And sorry for
the late response, i am currently down with a cold.

Let me roll up the discussion so far with a little background:
One month ago, the RAID was expanded with the ninth HDD without any
rebuild problems. Last weekend I upgraded the server and kept only the
HDDs and a Marvell-based 4 Port SATA Controller. sda to sdf are
connected to the onboard AMD SB950, sdg-sdj to the Marvell controller
which has always been a little troublesome, especially with Western Digitals.
Before adding or replacing a new HDD to the array, i always run a
badblocks write-read test on it, but of course this doesn't help when
blocks become bad over time.
I posted the kernel logs since the last reboot before the RAID failed at
http://evilazrael.net/bilder2/logs/kernel_20130202.log (8k lines, 600kb)
or http://evilazrael.net/bilder2/logs/kernel_20130202.log.gz (44kb)
The SMART logs are
http://evilazrael.net/bilder2/logs/smart_20130202.tar.gz if somebody is
curious. Yes, i roasted the Hitachis when i forgot to plug in the cage fan.


After sdg was expelled the first time (Jan 28 00:23), I ran an extended
SMART test and then a read-write badblocks on it for almost 48 hours. After
both found no errors I tried to re-add it (Jan 30 18:19). And on Jan 31
00:34 the UREs broke the rebuild and kicked both drives :\

In the last two days I did a non-destructive badblocks on all devices,
only sdj reports some UREs consistently. After that i tried two
force-assembles. The first broke on a read error on sdh. Then i retried and
this time the error on sdh didn't occur, but the later UREs on sdj
killed the rebuild.
At the beginning of the second try some automounter kicked in and
mounted the FS and I saw the contents of the FS, so at least the first
try didn't do additional damage :-)

Tomorrow I will buy a new drive and dd_rescue sdj to the new drive.

And if that works then I will switch to RAID6 ASAP and check/replace all
other drives. If not, I won't need the drives anymore.

>> Also I'd like to know what model disks these are, if they're AF or
>> not.

/dev/sdb ST3000DM001-9YN1 CC4B (Seagate Barracuda 7200)
/dev/sdc WDC WD30EZRX-00M 80.0 (WDC Green SATA 3)
/dev/sdd WDC WD30EZRS-00J 80.0 (WDC Green SATA 2)
/dev/sde WDC WD30EFRX-68A 80.0 (WDC Red)
/dev/sdf WDC WD30EURS-63R 80.0 (WDC AV-GP)
/dev/sdg Hitachi HDS72303 MKAO (Deskstar 7k3000)
/dev/sdh Hitachi HDS72303 MKAO (Deskstar 7k3000)
/dev/sdi Hitachi HDS72303 MKAO (Deskstar 7k3000)
/dev/sdj WDC WD30EZRX-00M 80.0 (WDC Green SATA 3)

AV-GP and Red are marketed as 24/7 and RAID-capable, but the
availability was bad.

> If you're using standard desktop drives then you may be running into
> issues with the drive timeout being longer than the kernel's. You need
> to reset one or the other to ensure that the drive times out (and is
> available for subsequent commands) before the kernel does. Most current
> consumer drives don't allow resetting the timeout, but it's worth trying
> that first before changing the kernel timeout. For each
> drive, do:
>     smartctl -l scterc,70,70 /dev/sdX
>         || echo 180 > /sys/block/sdX/device/timeout
> 

Only the WDC Red supports that. The drives on the Marvell Controller all
report
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled


To be honest, I don't trust SMART much and prefer a write/read badblocks
over SMART tests. But of course i won't do that on a disk which has data
on it.

>>>> Yes, if sdg still contains valid array data (and the array
>>>> wasn't
>>> written since then) then it would definitely make more sense to
>>> recreate the array using it, leaving sdj out for now. That'll
>>> require more work checking mdadm versions and data offset values
>>> though. That'll avoid the issues with the unreadable blocks on
>>> sdj.
>> 
>> Here's an idea. One possibility is to use dd to read the sector on 
>> sdg1 that error1.txt reported with the write error, to a file, and
>> see if there's a read error. If not, rewrite that data back to the
>> same sector and see if there's a write error. If not, attempt to
>> force assemble assume clean, get the array up in degraded mode, and
>> do a non-destructive fsck. If that's OK, just take a backup
>> immediately. Then sdj can be destructively written to, to force bad
>> sectors there to be removed for reserves, but still needs a smart
>> extended offline test to confirm; and then possibly reused and
>> rebuilt.
>> 
> That won't work. He's already lost the metadata on sdg1 by trying to 
> rebuild it in the first place, so a force assemble won't work. He'd
> need to recreate the array instead. Otherwise yes, that would sound
> to be the best option (assuming there's no other read errors on the
> other disks).


I think I don't like this part of the discussion ("That won't work").


I hope no question is left open



Kind regards and thanks for all the help so far

Christoph


Am 01.02.2013 20:57, schrieb Robin Hill:
> On Fri Feb 01, 2013 at 10:27:57 -0700, Chris Murphy wrote:
> 
>>
>> On Feb 1, 2013, at 6:34 AM, Robin Hill <robin@robinhill.me.uk> wrote:
>>> It'd also be useful to know whether sdg has been rewritten at
>>> all since then (i.e. whether the testing was destructive or not), and
>>> whether or not the array was written to at all since the failure of sdg.
>>
>> OP needs to reply back.
>>
>>
> 
> Cheers,
>     Robin


-- 
Christoph Nelles

E-Mail    : evilazrael@evilazrael.de
Jabber    : eazrael@evilazrael.net      ICQ       : 78819723

PGP-Key   : ID 0x424FB55B on subkeys.pgp.net
            or http://evilazrael.net/pgp.txt


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-02  0:30                 ` Christoph Nelles
@ 2013-02-02  1:24                   ` Phil Turmel
  2013-02-02 15:55                     ` Christoph Nelles
  0 siblings, 1 reply; 21+ messages in thread
From: Phil Turmel @ 2013-02-02  1:24 UTC (permalink / raw)
  To: Christoph Nelles; +Cc: linux-raid

On 02/01/2013 07:30 PM, Christoph Nelles wrote:
[trim /]

>> If you're using standard desktop drives then you may be running into
>> issues with the drive timeout being longer than the kernel's. You need
>> to reset one or the other to ensure that the drive times out (and is
>> available for subsequent commands) before the kernel does. Most current
>> consumer drives don't allow resetting the timeout, but it's worth trying
>> that first before changing the kernel timeout. For each
>> drive, do:
>>     smartctl -l scterc,70,70 /dev/sdX
>>         || echo 180 > /sys/block/sdX/device/timeout
>>
> 
> Only the WDC Red supports that. The drives on the Marvell Controller all
> report
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled

First, the syntax should have had a backslash on the first line, so that
a failure on setting SCTERC would fall back to setting a 180 second
timeout in the driver.
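
I.e., it should presumably have read:

    smartctl -l scterc,70,70 /dev/sdX \
        || echo 180 > /sys/block/sdX/device/timeout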

Second, you list three Hitachi Deskstar 7k3000 drives as being on that
controller.  These have supported SCTERC in the past (I have some of
them) and this is the first I've seen where they don't.  Could you
repeat your smart logs, but with "-x" to get a full report?

> To be honest, I don't trust SMART much and prefer a write/read badblocks
> over SMART tests. But of course i won't do that on a disk which has data
> on it.

I've never found badblocks to be of use, but smart monitoring for
relocations is vital information.

Neither SMART nor badblocks will save you if you have a timeout
mismatch.  Enterprise drives work "out-of-the-box" as they have a
default timeout of 7.0 seconds.  Any other drives must have a timeout
set, or the driver adjusted.  Linux drivers default to 30 seconds--not
enough.
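
You can check both quickly with something like:

    smartctl -l scterc /dev/sdX          # current ERC setting, if supported
    cat /sys/block/sdX/device/timeout    # current driver timeout in seconds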

[trim /]

> I think I don't like this part of the  discussion ("That won't work").

I've gone back through your data, and part of the story is muddled by
the timeout mismatch.  Your kernel logs show "DRDY" status problems
before the drives are kicked out.  That suggests a drive still in error
recovery when the kernel driver times out, then not being able to talk
to the drive to reset the link.  Classic no-win situation with desktop
drives.

> I hope no question is left open

I didn't see anywhere in your reports whether you've tried "--assemble
--force".  That is always the first tool to revive an array that has
kicked out drives on such problems.

When you ran badblocks for 2 days, what mode did you use?

Your descriptions and kernel logs suggest that is /dev/sdg, but the
"mdadm --examine" reports show /dev/sdg was in the array longer than
/dev/sdj.  Please elaborate.

If you didn't destroy its contents, you should include it in the
"--assemble --force" attempt.  Then, with proper drive timeouts, run a
"check" scrub.  That should fix your UREs.

If you did destroy that drive's contents, you need to clean up the UREs
on the other drives with dd_rescue, then "--assemble --force" with the
remaining drives.

> Kind regards and thanks for all the help so far

I think it would be useful to provide a fresh set of "mdadm --examine"
reports for all member disks, along with a partial listing of
/dev/disk/by-id/ that shows what serial numbers are assigned to what
device names.
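
E.g. (the sd[b-k]1 glob is just a guess at your current device naming):

    mdadm --examine /dev/sd[b-k]1
    ls -l /dev/disk/by-id/ | grep -v part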

I don't think your situation is hopeless.

Phil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-02  1:24                   ` Phil Turmel
@ 2013-02-02 15:55                     ` Christoph Nelles
  2013-02-02 20:34                       ` Chris Murphy
  2013-02-03  1:22                       ` Phil Turmel
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Nelles @ 2013-02-02 15:55 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4888 bytes --]

Hello Phil,

Am 02.02.2013 02:24, schrieb Phil Turmel:
> On 02/01/2013 07:30 PM, Christoph Nelles wrote:
> [trim /]
> 
>>> If you're using standard desktop drives then you may be running into
>>> issues with the drive timeout being longer than the kernel's. You need
>>> to reset one or the other to ensure that the drive times out (and is
>>> available for subsequent commands) before the kernel does. Most current
>>> consumer drives don't allow resetting the timeout, but it's worth trying
>>> that first before changing the kernel timeout. For each
>>> drive, do:
>>>     smartctl -l scterc,70,70 /dev/sdX
>>>         || echo 180 > /sys/block/sdX/device/timeout
>>>
>>
>> Only the WDC Red supports that. The drives on the Marvell Controller all
>> report
>> SCT Error Recovery Control:
>>            Read: Disabled
>>           Write: Disabled
> 
> First, the syntax should have had a backslash on the first line, so that
> a failure on setting SCTERC would fall back to setting a 180 second
> timeout in the driver.
> 
> Second, you list three Hitachi Deskstar 7k3000 drives as being on that
> controller.  These have supported SCTERC in the past (I have some of
> them) and this is the first I've seen where they don't.  Could you
> repeat your smart logs, but with "-x" to get a full report?

You are right, the Hitachis support that. I thought disabled meant not
possible. My fault.
Nevertheless I put the smartctl -x -a logs at
http://evilazrael.net/bilder2/logs/smart_xa_20130202.tar.gz

I am currently reading about TLER, and i am wondering why I haven't
heard of that before. Looks like the lower power consumption is not the
only advantage of the WDC Red Edition. Most reviews do not go so deep
into detail.


sdg is a new WDC Red I bought today, so all drives from the old sdg
onwards moved one letter (they are now sdh-sdk).

Spent the last three hours analysing why the second onboard controller
does not detect the new HDD. In the end it's a Marvell, IOMMU and linux
driver problem:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1005226
https://bugzilla.kernel.org/show_bug.cgi?id=42679

Marvell = PITA :(

>> To be honest, I don't trust SMART much and prefer a write/read badblocks
>> over SMART tests. But of course i won't do that on a disk which has data
>> on it.
> 
> I've never found badblocks to be of use, but smart monitoring for
> relocations is vital information.
> 
> Neither SMART nor badblocks will save you if you have a timeout
> mismatch.  Enterprise drives work "out-of-the-box" as they have a
> default timeout of 7.0 seconds.  Any other drives must have a timeout
> set, or the driver adjusted.  Linux drivers default to 30 seconds--not
> enough.
> 
> [trim /]
> 
>> I think I don't like this part of the  discussion ("That won't work").
> 
> I've gone back through your data, and part of the story is muddled by
> the timeout mismatch.  Your kernel logs show "DRDY" status problems
> before the drives are kicked out.  That suggests a drive still in error
> recovery when the kernel driver times out, then not being able to talk
> to the drive to reset the link.  Classic no-win situation with desktop
> drives.

I will retry with the changed timeouts and the new disk.

> I didn't see anywhere in your reports whether you've tried "--assemble
> --force".  That is always the first tool to revive an array that has
> kicked out drives on such problems.
> 

I tried the force-assemble after the first answer from Robin on the ML.
Before that i simply removed and re-added the failed drive after
running badblocks.

> When you ran badblocks for 2 days, what mode did you use?

destructive write-read (-svw). I canceled after 48 hours as the first
three patterns did not find an error.

> Your descriptions and kernel logs suggest that is /dev/sdg, but the
> "mdadm --examine" reports show /dev/sdg was in the array longer than
> /dev/sdj.  Please elaborate.

Where do you read that?

> If you didn't destroy its contents, you should include it in the
> "--assemble --force" attempt.  Then, with proper drive timeouts, run a
> "check" scrub.  That should fix your UREs.
> 
> If you did destroy that drive's contents, you need to clean up the UREs
> on the other drives with dd_rescue, then "--assemble --force" with the
> remaining drives.

ddrescue is running, this will take some hours.


> I think it would be useful to provide a fresh set of "mdadm --examine"
> reports for all member disks, along with a partial listing of
> /dev/disk/by-id/ that shows what serial numbers are assigned to what
> device names.

How do the serial numbers help?

I attached both to this mail.

> I don't think your situation is hopeless.

Thanks :)



Kind Regards

Christoph

-- 
Christoph Nelles

E-Mail    : evilazrael@evilazrael.de
Jabber    : eazrael@evilazrael.net      ICQ       : 78819723

PGP-Key   : ID 0x424FB55B on subkeys.pgp.net
            or http://evilazrael.net/pgp.txt


[-- Attachment #2: disk_by-id.txt --]
[-- Type: text/plain, Size: 7726 bytes --]

total 0
drwxr-xr-x 2 root root 1680 Feb  2 14:40 ./
drwxr-xr-x 5 root root  100 Feb  2 15:40 ../
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-Hitachi_HDS723030ALA640_MK0311YHG248EA -> ../../sdj
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-Hitachi_HDS723030ALA640_MK0311YHG248EA-part1 -> ../../sdj1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-Hitachi_HDS723030ALA640_MK0311YHG32VNA -> ../../sdi
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-Hitachi_HDS723030ALA640_MK0311YHG32VNA-part1 -> ../../sdi1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-Hitachi_HDS723030ALA640_MK0311YHG6DS3A -> ../../sdh
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-Hitachi_HDS723030ALA640_MK0311YHG6DS3A-part1 -> ../../sdh1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407 -> ../../sda
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407-part1 -> ../../sda1
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407-part2 -> ../../sda2
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407-part5 -> ../../sda5
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407-part6 -> ../../sda6
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407-part7 -> ../../sda7
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407-part8 -> ../../sda8
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-ST3000DM001-9YN166_Z1F0D9AW -> ../../sdb
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-ST3000DM001-9YN166_Z1F0D9AW-part1 -> ../../sdb1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1267036 -> ../../sde
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1267036-part1 -> ../../sde1
lrwxrwxrwx 1 root root    9 Feb  2 14:53 ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2001070 -> ../../sdg
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-WDC_WD30EURS-63R8UY0_WD-WCAWZ2236938 -> ../../sdf
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-WDC_WD30EURS-63R8UY0_WD-WCAWZ2236938-part1 -> ../../sdf1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-WDC_WD30EZRS-00J99B0_WD-WCAWZ0319650 -> ../../sdd
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-WDC_WD30EZRS-00J99B0_WD-WCAWZ0319650-part1 -> ../../sdd1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ1394037 -> ../../sdk
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ1394037-part1 -> ../../sdk1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0236402 -> ../../sdc
lrwxrwxrwx 1 root root   10 Feb  2 15:40 ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0236402-part1 -> ../../sdc1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_Hitachi_HDS7230_MK0311YHG248EA -> ../../sdj
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_Hitachi_HDS7230_MK0311YHG248EA-part1 -> ../../sdj1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_Hitachi_HDS7230_MK0311YHG32VNA -> ../../sdi
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_Hitachi_HDS7230_MK0311YHG32VNA-part1 -> ../../sdi1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_Hitachi_HDS7230_MK0311YHG6DS3A -> ../../sdh
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_Hitachi_HDS7230_MK0311YHG6DS3A-part1 -> ../../sdh1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_SAMSUNG_SSD_830S0XYNEAC504407 -> ../../sda
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_SAMSUNG_SSD_830S0XYNEAC504407-part1 -> ../../sda1
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_SAMSUNG_SSD_830S0XYNEAC504407-part2 -> ../../sda2
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_SAMSUNG_SSD_830S0XYNEAC504407-part5 -> ../../sda5
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_SAMSUNG_SSD_830S0XYNEAC504407-part6 -> ../../sda6
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_SAMSUNG_SSD_830S0XYNEAC504407-part7 -> ../../sda7
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_SAMSUNG_SSD_830S0XYNEAC504407-part8 -> ../../sda8
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_ST3000DM001-9YN_Z1F0D9AW -> ../../sdb
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_ST3000DM001-9YN_Z1F0D9AW-part1 -> ../../sdb1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T1267036 -> ../../sde
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T1267036-part1 -> ../../sde1
lrwxrwxrwx 1 root root    9 Feb  2 14:53 scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T2001070 -> ../../sdg
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_WDC_WD30EURS-63_WD-WCAWZ2236938 -> ../../sdf
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_WDC_WD30EURS-63_WD-WCAWZ2236938-part1 -> ../../sdf1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_WDC_WD30EZRS-00_WD-WCAWZ0319650 -> ../../sdd
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_WDC_WD30EZRS-00_WD-WCAWZ0319650-part1 -> ../../sdd1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_WDC_WD30EZRX-00_WD-WCAWZ1394037 -> ../../sdk
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_WDC_WD30EZRX-00_WD-WCAWZ1394037-part1 -> ../../sdk1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0236402 -> ../../sdc
lrwxrwxrwx 1 root root   10 Feb  2 15:40 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0236402-part1 -> ../../sdc1
lrwxrwxrwx 1 root root    9 Feb  2 14:40 usb-Generic_2.0_Reader_-0_00000001-0:0 -> ../../sdl
lrwxrwxrwx 1 root root    9 Feb  2 14:40 usb-Generic_2.0_Reader_-1_00000001-0:1 -> ../../sdm
lrwxrwxrwx 1 root root   10 Feb  2 14:40 usb-Generic_2.0_Reader_-1_00000001-0:1-part1 -> ../../sdm1
lrwxrwxrwx 1 root root    9 Feb  2 14:40 usb-Generic_2.0_Reader_-2_00000001-0:2 -> ../../sdn
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x5000c5003f73d3f3 -> ../../sdb
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5000c5003f73d3f3-part1 -> ../../sdb1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x5000cca225c0f8c7 -> ../../sdj
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5000cca225c0f8c7-part1 -> ../../sdj1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x5000cca225c167d9 -> ../../sdi
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5000cca225c167d9-part1 -> ../../sdi1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x5000cca225c2ea12 -> ../../sdh
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5000cca225c2ea12-part1 -> ../../sdh1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x50014ee0adb3a14b -> ../../sdc
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x50014ee0adb3a14b-part1 -> ../../sdc1
lrwxrwxrwx 1 root root    9 Feb  2 14:53 wwn-0x50014ee0ae2d738f -> ../../sdg
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x50014ee20645bfea -> ../../sdk
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x50014ee20645bfea-part1 -> ../../sdk1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x50014ee2b04244d4 -> ../../sdd
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x50014ee2b04244d4-part1 -> ../../sdd1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x50014ee2b1b314a2 -> ../../sdf
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x50014ee2b1b314a2-part1 -> ../../sdf1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x50014ee6ad9ea39b -> ../../sde
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x50014ee6ad9ea39b-part1 -> ../../sde1
lrwxrwxrwx 1 root root    9 Feb  2 15:40 wwn-0x5002538043584d30 -> ../../sda
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5002538043584d30-part1 -> ../../sda1
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5002538043584d30-part2 -> ../../sda2
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5002538043584d30-part5 -> ../../sda5
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5002538043584d30-part6 -> ../../sda6
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5002538043584d30-part7 -> ../../sda7
lrwxrwxrwx 1 root root   10 Feb  2 15:40 wwn-0x5002538043584d30-part8 -> ../../sda8

[-- Attachment #3: mdadm_examine.txt --]
[-- Type: text/plain, Size: 7468 bytes --]

/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 29c62776:e9c58ce6:1c6e9ab1:046ac411

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : be4a3d6b - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 6
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 6e591e9f:7a5c6fc6:33750743:c7d05b5e

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : a527d515 - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 4d229bb4:053902db:c4b278ef:d56b567f

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : 9ca35322 - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 3
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 7acacfa5:a805babf:60bb525c:ef450aa4

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : 41daaad - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 8
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 68047273:c10a8f48:e447728f:1f8a7b32

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : 1c25ba66 - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 7
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a1b16284:321fcdd0:93993ff5:832eee3a

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : 23947226 - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : spare
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 4e6931d9:13edc311:8e4bd316:e8c3379d

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : 3d373f0c - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 09c3515e:f6c0931d:55bcd591:3d9d58b6

    Update Time : Fri Feb  1 23:02:02 2013
       Checksum : 624ab6c8 - correct
         Events : 27742

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 4
   Array State : A.AAA.AAA ('A' == active, '.' == missing)
/dev/sdk1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
           Name : router:0  (local to host router)
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 5860529039 (2794.52 GiB 3000.59 GB)
     Array Size : 46884229120 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 5860528640 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 7023df83:d890ce04:fc28652e:094adffe

    Update Time : Fri Feb  1 22:48:11 2013
       Checksum : 5431fd40 - correct
         Events : 27738

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 5
   Array State : AAAAAAAAA ('A' == active, '.' == missing)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-02 15:55                     ` Christoph Nelles
@ 2013-02-02 20:34                       ` Chris Murphy
  2013-02-02 23:56                         ` Phil Turmel
  2013-02-03  1:22                       ` Phil Turmel
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Murphy @ 2013-02-02 20:34 UTC (permalink / raw)
  To: Christoph Nelles, Phil Turmel; +Cc: linux-raid


On Feb 2, 2013, at 8:55 AM, Christoph Nelles <evilazrael@evilazrael.de> wrote:
> 
> On 02.02.2013 02:24, Phil Turmel wrote:
>> 
> 
>> Your descriptions and kernel logs suggest that is /dev/sdg, but the
>> "mdadm --examine" reports show /dev/sdg was in the array longer than
>> /dev/sdj.  Please elaborate.
> 
> Where do you read that?

From the error1.txt and error2.txt attachments. A write error for sdg occurred three days before the read error for sdj.

From error1.txt:

Jan 28 00:23:36 router kernel: Write(16): 8a 00 00 00 00 01 36 b2 55 50 00 00 00 30 00 00
Jan 28 00:23:36 router kernel: end_request: I/O error, dev sdg, sector 5212624208

From error2.txt
Jan 31 00:34:13 router kernel: Read(16): 88 00 00 00 00 01 21 45 d3 08 00 00 04 00 00 00
Jan 31 00:34:13 router kernel: end_request: I/O error, dev sdj, sector 4853191432


Chris Murphy


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-02 20:34                       ` Chris Murphy
@ 2013-02-02 23:56                         ` Phil Turmel
  0 siblings, 0 replies; 21+ messages in thread
From: Phil Turmel @ 2013-02-02 23:56 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Christoph Nelles, linux-raid

On 02/02/2013 03:34 PM, Chris Murphy wrote:
> 
> On Feb 2, 2013, at 8:55 AM, Christoph Nelles <evilazrael@evilazrael.de> wrote:
>>
>> On 02.02.2013 02:24, Phil Turmel wrote:
>>>
>>
>>> Your descriptions and kernel logs suggest that is /dev/sdg, but the
>>> "mdadm --examine" reports show /dev/sdg was in the array longer than
>>> /dev/sdj.  Please elaborate.
>>
>> Where do you read that?

Event Count
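
That is, the Events field in the --examine output. A quick way to compare,
with the device glob only an assumption for this box:

# mdadm --examine /dev/sd[b-k]1 | egrep '^/dev|Events'

The member whose superblock stopped being updated first ends up with the
lowest Events count.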

Phil


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-02 15:55                     ` Christoph Nelles
  2013-02-02 20:34                       ` Chris Murphy
@ 2013-02-03  1:22                       ` Phil Turmel
  2013-02-03 15:56                         ` Christoph Nelles
  1 sibling, 1 reply; 21+ messages in thread
From: Phil Turmel @ 2013-02-03  1:22 UTC (permalink / raw)
  To: Christoph Nelles; +Cc: linux-raid

On 02/02/2013 10:55 AM, Christoph Nelles wrote:

[trim /]

> You are right, the Hitachis support that. I thought disabled means not
> possible. My fault.
> Nevertheless I put the smartctl -x -a logs at
> http://evilazrael.net/bilder2/logs/smart_xa_20130202.tar.gz

Very good.

> I am currently reading about TLER, and I am wondering why I haven't
> heard of that before. Looks like the lower power consumption is not the
> only advantage of the WDC Red Edition. Most reviews do not go so deep
> into detail.

"TLER" == "Time Limited Error Recovery", which is WD's name for "SCTERC"
== "Sata Command Transport, Error Recovery Control".  Same purpose.

> sdg is a new WDC Red I bought today so all drives from sdg moved one
> letter down.
> 
> Spent the last three hours analysing why the second onboard controller
> does not detect the new HDD. In the end it's a Marvell, IOMMU and Linux
> driver problem:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1005226
> https://bugzilla.kernel.org/show_bug.cgi?id=42679

That sucks.

> Marvell = PITA :(

Indeed.

[trim /]

>> If you did destroy that drive's contents, you need to clean up the UREs
>> on the other drives with dd_rescue, then "--assemble --force" with the
>> remaining drives.
> 
> ddrescue is running, this will take some hours.

Ok.

>> I think it would be useful to provide a fresh set of "mdadm --examine"
>> reports for all member disks, along with a partial listing of
>> /dev/disk/by-id/ that shows what serial numbers are assigned to what
>> device names.
> 
> How do the serial numbers help?

It is vital to keep track of raid device number (logical position in the
array) versus drive serial numbers, as device names are not guaranteed
to be consistent between boots (and certainly not when mucking around
with cables and connectors).
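
A rough way to build that mapping by hand (the device glob is only an
assumption for your box): the first command maps serial numbers to the
current sdX names, the second maps sdX names to positions in the array.

# ls -l /dev/disk/by-id/ata-* | grep -v part
# mdadm --examine /dev/sd[b-k]1 | egrep '^/dev|Device Role'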

> I attached both to this mail.

Ok.

Summarizing:

ata-SAMSUNG_SSD_830_Series_S0XYNEAC504407 -> ../../sda
ata-ST3000DM001-9YN166_Z1F0D9AW -> ../../sdb
ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0236402 -> ../../sdc
ata-WDC_WD30EZRS-00J99B0_WD-WCAWZ0319650 -> ../../sdd
ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1267036 -> ../../sde
ata-WDC_WD30EURS-63R8UY0_WD-WCAWZ2236938 -> ../../sdf
ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2001070 -> ../../sdg
ata-Hitachi_HDS723030ALA640_MK0311YHG6DS3A -> ../../sdh
ata-Hitachi_HDS723030ALA640_MK0311YHG32VNA -> ../../sdi
ata-Hitachi_HDS723030ALA640_MK0311YHG248EA -> ../../sdj
ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ1394037 -> ../../sdk

and

/dev/sdb1:
   Device Role : Active device 6
/dev/sdc1:
   Device Role : Active device 0
/dev/sdd1:
   Device Role : Active device 3
/dev/sde1:
   Device Role : Active device 8
/dev/sdf1:
   Device Role : Active device 7
/dev/sdh1:
   Device Role : spare
/dev/sdi1:
   Device Role : Active device 2
/dev/sdj1:
   Device Role : Active device 4
/dev/sdk1:
   Device Role : Active device 5

When you are done with dd_rescue, verify the mapping again. lsdrv[1]
gives you both pieces of information in one utility; you might find it
easier than mapping by hand.

Phil

[1] http://github.com/pturmel/lsdrv



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-03  1:22                       ` Phil Turmel
@ 2013-02-03 15:56                         ` Christoph Nelles
  2013-02-03 21:59                           ` Robin Hill
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Nelles @ 2013-02-03 15:56 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2470 bytes --]

Hi folks,

the dd_rescue to the new HDD took 14 hours. It looks like ddrescue is not
reading and writing in parallel. In the end 8 kB couldn't be read after
10 retries.

I just force-assembled the RAID with the new drive, but it failed almost
immediately with a WRITE FPDMA QUEUED error on one of the other drives
(sdj, formerly sdi). I tried again immediately, and this time one disk
was rejected but the RAID started on 8 devices. Then xfs_repair failed
when one of the disks threw a READ FPDMA QUEUED error :( and md
expelled the disk from the RAID.
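
For the record, the force assembly was roughly of the form below; the
device list is a placeholder here, not the exact set I used:

# mdadm --stop /dev/md0
# mdadm --assemble --force /dev/md0 /dev/sdX1 /dev/sdY1 ...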



It looks more like a controller problem, as the messages coming from
the drives on the PCIe Marvell controller all contain the line
ataXX: illegal qc_active transition (00000002->00000003)
I found only one similar report about that problem:
http://marc.info/?l=linux-ide&m=131475722021117

Any recommendations for a decent and affordable SATA Controller with at
least 4 ports and faster than PCIe x1? Looks like there are only
Marvells and more expensive Enterprise RAID controllers.



Currently the RAID is running clean, but degraded. The filesystem is
mounted ro and looks healthy. I attached an mdadm --detail and put the
kernel logs since yesterday at
http://evilazrael.net/bilder2/logs/kernel_20130203.log and
http://evilazrael.net/bilder2/logs/kernel_20130203.log.gz

I think my action plan is (rough mdadm commands sketched below):
- Get a reliable controller ASAP
- Re-add the missing disk
- Upgrade to RAID 6
- Schedule regular scrubbing
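
Roughly, the mdadm side of that would look something like the commands
below (not yet tested here; the device name and backup path are
placeholders, and keeping the same usable size on RAID 6 needs one
additional disk):

# mdadm /dev/md0 --add /dev/sdX1
# mdadm --grow /dev/md0 --level=6 --raid-devices=10 --backup-file=/root/md0-grow.backup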

Thanks for all the help so far, I think I can see the light at the end
of the tunnel :)


On 03.02.2013 02:22, Phil Turmel wrote:
>> How do the serial numbers help?
> 
> It is vital to keep track of raid device number (logical position in the
> array) versus drive serial numbers, as device names are not guaranteed
> to be consistent between boots (and certainly not when mucking around
> with cables and connectors).
> 

I am aware of that problem when plugging drives around or adding new
ones at runtime.

> When you are done with dd_rescue, make sure of the mapping again.
> lsdrv[1] gives you both pieces of information in one utility, you might
> find it easier than mapping by hand.

The owner's name sounds familiar ;) Will send you a mail later.



Kind regards

Christoph Nelles


-- 
Christoph Nelles

E-Mail    : evilazrael@evilazrael.de
Jabber    : eazrael@evilazrael.net      ICQ       : 78819723

PGP-Key   : ID 0x424FB55B on subkeys.pgp.net
            or http://evilazrael.net/pgp.txt


[-- Attachment #2: mdadm_detail.txt --]
[-- Type: text/plain, Size: 1244 bytes --]

/dev/md0:
        Version : 1.2
  Creation Time : Fri Apr 27 20:25:04 2012
     Raid Level : raid5
     Array Size : 23442114560 (22356.14 GiB 24004.73 GB)
  Used Dev Size : 2930264320 (2794.52 GiB 3000.59 GB)
   Raid Devices : 9
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sun Feb  3 16:30:02 2013
          State : clean, degraded
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : router:0  (local to host router)
           UUID : 6b21b3ed:d39d5a54:d4939113:77851cb6
         Events : 27770

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       0        0        1      removed
       2       8      129        2      active sync   /dev/sdi1
       3       8       49        3      active sync   /dev/sdd1
       4       8      145        4      active sync   /dev/sdj1
       5       8       97        5      active sync   /dev/sdg1
       6       8       17        6      active sync   /dev/sdb1
       7       8       81        7      active sync   /dev/sdf1
       8       8       65        8      active sync   /dev/sde1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-03 15:56                         ` Christoph Nelles
@ 2013-02-03 21:59                           ` Robin Hill
  2013-02-10 20:48                             ` Christoph Nelles
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Hill @ 2013-02-03 21:59 UTC (permalink / raw)
  To: Christoph Nelles; +Cc: Phil Turmel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1797 bytes --]

On Sun Feb 03, 2013 at 04:56:35 +0100, Christoph Nelles wrote:

> Hi folks,
> 
> the dd_rescue to the new HDD took 14 hours. It looks like ddrescue is not
> reading and writing in parallel. In the end 8 kB couldn't be read after
> 10 retries.
> 
Note that there's a difference between dd_rescue and ddrescue. GNU
ddrescue seems to be the better option nowadays.
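
For a failing member, a typical GNU ddrescue run would be something along
these lines (source, target and mapfile paths are placeholders):

# ddrescue -f /dev/OLD /dev/NEW /root/rescue.map
# ddrescue -f -r3 /dev/OLD /dev/NEW /root/rescue.map

The first pass skips bad areas quickly; the second retries the remaining
bad sectors a few times, resuming from the mapfile.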

> I just force-assembled the RAID with the new drive, but it failed almost
> immediately with a WRITE FPDMA QUEUED error on one of the other drives
> (sdj, formerly sdi). I tried again immediately, and this time one disk
> was rejected but the RAID started on 8 devices. Then xfs_repair failed
> when one of the disks threw a READ FPDMA QUEUED error :( and md
> expelled the disk from the RAID.
> 
> It looks more like a controller problem, as the messages coming from
> the drives on the PCIe Marvell controller all contain the line
> ataXX: illegal qc_active transition (00000002->00000003)
> I found only one similar report about that problem:
> http://marc.info/?l=linux-ide&m=131475722021117
> 
> Any recommendations for a decent and affordable SATA Controller with at
> least 4 ports and faster than PCIe x1? Looks like there are only
> Marvells and more expensive Enterprise RAID controllers.
> 

I can recommend the Intel RS2WC080 (or any other LSI SAS2008 based
controller). Quite frankly, any SAS controller is almost certainly
going to be better than the SATA equivalent (and for not a huge amount
more), while still supporting standard SATA drives.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RAID5 with 2 drive failure at the same time
  2013-02-03 21:59                           ` Robin Hill
@ 2013-02-10 20:48                             ` Christoph Nelles
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Nelles @ 2013-02-10 20:48 UTC (permalink / raw)
  To: linux-raid

Hello ML,

Thanks Chris, Phil & Robin. You helped me a lot.

After replacing the Marvell controller with an LSI SAS2008-based
controller (IBM M1015 flashed to 9211-IT), the RAID was rebuilt
successfully and is running clean and stable. So the cause of the
problems was one HDD with UREs plus the unstable Marvell controller. My
next steps are migrating to RAID6 with a bigger chunk size and scrubbing
the RAID periodically.
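
Sketching those next steps (the chunk size and backup path here are just
what I have in mind, not something I have run yet):

# mdadm --grow /dev/md0 --chunk=512 --backup-file=/root/md0-chunk.backup
# echo check > /sys/block/md0/md/sync_action

The first reshapes to a 512 kB chunk and will take a long time; the second
kicks off a scrub and is what I would put into a monthly cron job.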

I have one last question. I am wondering why reading a huge file from the
XFS filesystem on the array is faster than reading the raw md0 device.
Does anybody have an explanation for that?

9-drive RAID5, chunk size 64 kB, filesystem XFS, not tuned:
# echo 3 > /proc/sys/vm/drop_caches
# dd if=dummy.file of=/dev/null bs=1M count=100k
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 211.467 s, 508 MB/s

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/md0 of=/dev/null bs=1M count=100k
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 263.738 s, 407 MB/s

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/md0 of=/dev/null bs=64k count=1600k
1638400+0 records in
1638400+0 records out
107374182400 bytes (107 GB) copied, 253.76 s, 423 MB/s

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/md0 of=/dev/null bs=512k count=200k
204800+0 records in
204800+0 records out
107374182400 bytes (107 GB) copied, 260.837 s, 412 MB/s

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/md0 of=/dev/null bs=576k count=200k
204800+0 records in
204800+0 records out
120795955200 bytes (121 GB) copied, 296.567 s, 407 MB/s

Once again, thanks for all the help.

Kind Regards

Christoph

On 03.02.2013 22:59, Robin Hill wrote:
> On Sun Feb 03, 2013 at 04:56:35 +0100, Christoph Nelles wrote:
> 
>> Hi folks,
>>
>> the dd_rescue to the new HDD took 14 hours. It looks like ddrescue is not
>> reading and writing in parallel. In the end 8 kB couldn't be read after
>> 10 retries.
>>
> Note that there's a difference between dd_rescue and ddrescue. GNU
> ddrescue seems to be the better option nowadays.
> 
>> I just force-assembled the RAID with the new drive, but it failed almost
>> immediately with a WRITE FPDMA QUEUED error on one of the other drives
>> (sdj, formerly sdi). I tried again immediately, and this time one disk
>> was rejected but the RAID started on 8 devices. Then xfs_repair failed
>> when one of the disks threw a READ FPDMA QUEUED error :( and md
>> expelled the disk from the RAID.
>>
>> It looks more like a controller problem, as the messages coming from
>> the drives on the PCIe Marvell controller all contain the line
>> ataXX: illegal qc_active transition (00000002->00000003)
>> I found only one similar report about that problem:
>> http://marc.info/?l=linux-ide&m=131475722021117
>>
>> Any recommendations for a decent and affordable SATA Controller with at
>> least 4 ports and faster than PCIe x1? Looks like there are only
>> Marvells and more expensive Enterprise RAID controllers.
>>
> 
> I can recommend the Intel RS2WC080 (or any other LSI SAS2008 based
> controller). Quite frankly, any SAS controller is almost certainly
> going to be better than the SATA equivalent (and for not a huge amount
> more), while still supporting standard SATA drives.
> 
> Cheers,
>     Robin


-- 
Christoph Nelles

E-Mail    : evilazrael@evilazrael.de
Jabber    : eazrael@evilazrael.net      ICQ       : 78819723

PGP-Key   : ID 0x424FB55B on subkeys.pgp.net
            or http://evilazrael.net/pgp.txt


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-02-10 20:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-31 10:42 RAID5 with 2 drive failure at the same time Christoph Nelles
2013-01-31 11:38 ` Robin Hill
2013-01-31 13:15   ` Christoph Nelles
2013-01-31 13:45     ` Robin Hill
2013-01-31 17:46     ` Chris Murphy
     [not found]       ` <510ABC1E.6060308@evilazrael.de>
2013-01-31 21:19         ` Chris Murphy
2013-01-31 22:10       ` Robin Hill
2013-01-31 22:40         ` Chris Murphy
2013-01-31 22:48           ` Chris Murphy
2013-02-01 13:34           ` Robin Hill
2013-02-01 17:27             ` Chris Murphy
2013-02-01 19:57               ` Robin Hill
2013-02-02  0:30                 ` Christoph Nelles
2013-02-02  1:24                   ` Phil Turmel
2013-02-02 15:55                     ` Christoph Nelles
2013-02-02 20:34                       ` Chris Murphy
2013-02-02 23:56                         ` Phil Turmel
2013-02-03  1:22                       ` Phil Turmel
2013-02-03 15:56                         ` Christoph Nelles
2013-02-03 21:59                           ` Robin Hill
2013-02-10 20:48                             ` Christoph Nelles

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).