From: Neil Brown <neilb@suse.de>
To: Michael Sallaway <michael@sallaway.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: 3-way mirrors
Date: Wed, 8 Sep 2010 16:40:38 +1000
Message-ID: <20100908164038.3067cc6f@notabene>
In-Reply-To: <20100908061616.31334.qmail@s217.sureserver.com>

On Wed, 08 Sep 2010 06:16:16 +0000
"Michael Sallaway" <michael@sallaway.com> wrote:

> 
> >  -------Original Message-------
> >  From: Neil Brown <neilb@suse.de>
> >  To: Michael Sallaway <michael@sallaway.com>
> >  Cc: linux-raid@vger.kernel.org
> >  Subject: Re: 3-way mirrors
> >  Sent: 08 Sep '10 06:02
> >  
> >  Hmm.... Drive B shouldn't be ejected from the array for a read error.  md
> >  should calculate the data for both A and B from the other devices and then
> >  write that to A and B.
> >  If the write fails, only then should it kick B from the array.  Is that what
> >  is happening?
> >  
> >  i.e. do you see messages like:
> >     read error corrected
> >     read error not correctable
> >     read error NOT corrected
> >  
> >  in the kernel logs??
> 
> 
> The logs for the relevant section are below, at the bottom -- it's a "read error not correctable". So I'm guessing it's also failing a write, although I can't see the ATA error handling mentioning any writes -- it all looks like reads??
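Neil's question comes down to searching the kernel log for md's read-error messages. Below is a minimal Python sketch. It assumes the wording quoted above, and other kernel versions may phrase these messages differently. The helper `classify_md_read_errors` is hypothetical.

```python
import re
from collections import Counter

# Wording taken from the messages quoted in this thread; the match is
# case-sensitive so "not correctable" and "NOT corrected" stay distinct.
PATTERNS = {
    "corrected": re.compile(r"read error corrected"),
    "not_correctable": re.compile(r"read error not correctable"),
    "NOT_corrected": re.compile(r"read error NOT corrected"),
}

# Captures "(sector N on DEV)", the form used by the messages in this log.
SECTOR_RE = re.compile(r"\(sector (\d+) on (\w+)\)")


def classify_md_read_errors(lines):
    """Count md read-error messages and collect (device, sector) pairs."""
    counts = Counter()
    locations = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[kind] += 1
                m = SECTOR_RE.search(line)
                if m:
                    locations.append((m.group(2), int(m.group(1))))
                break  # each line is counted under at most one category
    return counts, locations


if __name__ == "__main__":
    sample = [
        "raid5:md10: read error not correctable (sector 1574510752 on sdm).",
        "raid5: Disk failure on sdm, disabling device.",
        "raid5:md10: read error not correctable (sector 1574510760 on sdm).",
    ]
    counts, locations = classify_md_read_errors(sample)
    print(dict(counts))  # {'not_correctable': 2}
    print(locations)     # [('sdm', 1574510752), ('sdm', 1574510760)]
```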

Yes, it is just reads.
It looks like you have an ancient kernel - older than April 2010 :-)
A patch that went into 2.6.35 (and, I think, some 2.6.34.y release) fixed a
bug that caused md to drop devices from a degraded RAID6 when it could have
fixed the read error: commit 7b0bb5368a719.

So a newer kernel might fix your problem for you.
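The decision at stake can be shown with a simplified Python model. This is not kernel code, and the function and parameter names are hypothetical. The model assumes the pre-fix check treated any degraded array as unable to reconstruct. The fixed check, by contrast, compares the number of missing devices with the level's redundancy (1 for RAID5, 2 for RAID6). The model also encodes the policy from the earlier message: rewrite the reconstructed data, and fail the device only if that rewrite fails.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    REWRITE = "reconstruct the block from the other devices and write it back"
    FAIL_DEVICE = "kick the device out of the array"


def on_read_error(missing: int, redundancy: int, pre_fix: bool) -> Action:
    """Decide what md should do when a member device returns a read error.

    missing    -- devices already absent from the array (its degraded count)
    redundancy -- devices the level can lose: 1 for RAID5, 2 for RAID6
    pre_fix    -- model the assumed buggy check, where any degradation
                  was treated as "cannot reconstruct"
    """
    if pre_fix:
        can_reconstruct = missing == 0
    else:
        can_reconstruct = missing < redundancy
    return Action.REWRITE if can_reconstruct else Action.FAIL_DEVICE


def after_rewrite(write_ok: bool) -> Optional[Action]:
    """Only a failed rewrite of the reconstructed data should fail the device."""
    return None if write_ok else Action.FAIL_DEVICE


if __name__ == "__main__":
    # A RAID6 that is already missing one device, as described above.
    print(on_read_error(missing=1, redundancy=2, pre_fix=True).name)   # FAIL_DEVICE
    print(on_read_error(missing=1, redundancy=2, pre_fix=False).name)  # REWRITE
    print(after_rewrite(write_ok=True))                                # None
```

Under this model, a RAID5 missing one device still has to fail the erroring drive, because it has no redundancy left to rebuild the unreadable block.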

> 
> 
> >  If the write is failing, then you want my bad-block-log patches - only they
> >  aren't really finished yet and certainly aren't tested very well.  I really
> >  should get back to those.
> 
> Interesting -- I'm not familiar with them, where would I find these patches? And what would they do -- just allow the bad blocks (even on writes), and keep the drive in the array? That's all I'm really after, in this case, I think.

I posted them to the list for review a few months ago and haven't got back to
them.

http://www.spinics.net/lists/raid/msg28813.html

I wouldn't recommend using them until they've seen more review and testing.
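Michael's idea is to record unusable sectors instead of failing the whole drive. One way to illustrate it is a sorted list of bad sector ranges kept for each device. This is a hypothetical Python sketch of the concept only. `BadBlockList` and its methods are invented here and do not reflect the design or interface of the posted patches.

```python
import bisect


class BadBlockList:
    """Hypothetical per-device record of known-bad sector ranges.

    Ranges are half-open [start, end), kept sorted and non-overlapping;
    adjacent ranges are merged, so the end values are sorted too.
    """

    def __init__(self):
        self._starts = []
        self._ends = []

    def add(self, start: int, length: int) -> None:
        """Record sectors [start, start + length), merging with neighbours."""
        end = start + length
        # First existing range that ends at or after the new start.
        i = bisect.bisect_left(self._ends, start)
        j = i
        # Absorb every existing range that overlaps or touches the new one.
        while j < len(self._starts) and self._starts[j] <= end:
            start = min(start, self._starts[j])
            end = max(end, self._ends[j])
            j += 1
        self._starts[i:j] = [start]
        self._ends[i:j] = [end]

    def is_bad(self, start: int, length: int) -> bool:
        """True if any sector in [start, start + length) is recorded as bad."""
        # First range that ends after `start`; only it can overlap first.
        i = bisect.bisect_right(self._ends, start)
        return i < len(self._starts) and self._starts[i] < start + length


if __name__ == "__main__":
    bbl = BadBlockList()
    bbl.add(1574510752, 8)            # e.g. one 8-sector block that failed
    print(bbl.is_bad(1574510755, 1))  # True
    print(bbl.is_bad(1574510760, 8))  # False
```

Under this model, a failed write would add its range to the list rather than failing the device. Reads that overlap a listed range would then be reconstructed from the other devices.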

NeilBrown



> 
> Thanks!
> Michael
> 
> 
> 
> Syslog from the failure of the first drive:
> 
> Sep  7 09:31:24 lechuck kernel: [51912.039892] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep  7 09:31:24 lechuck kernel: [51912.048227] ata13.00: irq_stat 0x40000008
> Sep  7 09:31:24 lechuck kernel: [51912.056685] ata13.00: failed command: READ FPDMA QUEUED
> Sep  7 09:31:24 lechuck kernel: [51912.065055] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
> Sep  7 09:31:24 lechuck kernel: [51912.065061]          res 51/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep  7 09:31:25 lechuck kernel: [51912.098113] ata13.00: status: { DRDY ERR }
> Sep  7 09:31:25 lechuck kernel: [51912.106705] ata13.00: error: { UNC }
> Sep  7 09:31:25 lechuck kernel: [51912.128027] ata13.00: configured for UDMA/133
> Sep  7 09:31:25 lechuck kernel: [51912.128054] ata13: EH complete
> Sep  7 09:31:28 lechuck kernel: [51915.216232] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep  7 09:31:28 lechuck kernel: [51915.224757] ata13.00: irq_stat 0x40000008
> Sep  7 09:31:28 lechuck kernel: [51915.233283] ata13.00: failed command: READ FPDMA QUEUED
> Sep  7 09:31:28 lechuck kernel: [51915.241660] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
> Sep  7 09:31:28 lechuck kernel: [51915.241662]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep  7 09:31:28 lechuck kernel: [51915.275603] ata13.00: status: { DRDY ERR }
> Sep  7 09:31:28 lechuck kernel: [51915.284267] ata13.00: error: { UNC }
> Sep  7 09:31:28 lechuck kernel: [51915.305722] ata13.00: configured for UDMA/133
> Sep  7 09:31:28 lechuck kernel: [51915.305746] ata13: EH complete
> Sep  7 09:31:30 lechuck kernel: [51917.992164] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep  7 09:31:30 lechuck kernel: [51918.000791] ata13.00: irq_stat 0x40000008
> Sep  7 09:31:30 lechuck kernel: [51918.009631] ata13.00: failed command: READ FPDMA QUEUED
> Sep  7 09:31:30 lechuck kernel: [51918.018303] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
> Sep  7 09:31:30 lechuck kernel: [51918.018305]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep  7 09:31:30 lechuck kernel: [51918.054117] ata13.00: status: { DRDY ERR }
> Sep  7 09:31:30 lechuck kernel: [51918.062808] ata13.00: error: { UNC }
> Sep  7 09:31:30 lechuck kernel: [51918.084521] ata13.00: configured for UDMA/133
> Sep  7 09:31:30 lechuck kernel: [51918.084547] ata13: EH complete
> Sep  7 09:31:33 lechuck kernel: [51920.956122] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep  7 09:31:33 lechuck kernel: [51920.964858] ata13.00: irq_stat 0x40000008
> Sep  7 09:31:33 lechuck kernel: [51920.973829] ata13.00: failed command: READ FPDMA QUEUED
> Sep  7 09:31:33 lechuck kernel: [51920.982587] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
> Sep  7 09:31:33 lechuck kernel: [51920.982589]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep  7 09:31:33 lechuck kernel: [51921.017401] ata13.00: status: { DRDY ERR }
> Sep  7 09:31:33 lechuck kernel: [51921.026134] ata13.00: error: { UNC }
> Sep  7 09:31:33 lechuck kernel: [51921.048656] ata13.00: configured for UDMA/133
> Sep  7 09:31:33 lechuck kernel: [51921.048680] ata13: EH complete
> Sep  7 09:31:37 lechuck kernel: [51924.153414] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep  7 09:31:37 lechuck kernel: [51924.162178] ata13.00: irq_stat 0x40000008
> Sep  7 09:31:37 lechuck kernel: [51924.162182] ata13.00: failed command: READ FPDMA QUEUED
> Sep  7 09:31:37 lechuck kernel: [51924.162189] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
> Sep  7 09:31:37 lechuck kernel: [51924.162190]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep  7 09:31:37 lechuck kernel: [51924.162193] ata13.00: status: { DRDY ERR }
> Sep  7 09:31:37 lechuck kernel: [51924.162195] ata13.00: error: { UNC }
> Sep  7 09:31:37 lechuck kernel: [51924.175348] ata13.00: configured for UDMA/133
> Sep  7 09:31:37 lechuck kernel: [51924.175374] ata13: EH complete
> Sep  7 09:31:39 lechuck kernel: [51927.005666] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep  7 09:31:39 lechuck kernel: [51927.014384] ata13.00: irq_stat 0x40000008
> Sep  7 09:31:39 lechuck kernel: [51927.023299] ata13.00: failed command: READ FPDMA QUEUED
> Sep  7 09:31:39 lechuck kernel: [51927.031949] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
> Sep  7 09:31:39 lechuck kernel: [51927.031951]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep  7 09:31:39 lechuck kernel: [51927.066322] ata13.00: status: { DRDY ERR }
> Sep  7 09:31:39 lechuck kernel: [51927.074946] ata13.00: error: { UNC }
> Sep  7 09:31:40 lechuck kernel: [51927.096349] ata13.00: configured for UDMA/133
> Sep  7 09:31:40 lechuck kernel: [51927.096393] sd 12:0:0:0: [sdm] Unhandled sense code
> Sep  7 09:31:40 lechuck kernel: [51927.096396] sd 12:0:0:0: [sdm] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Sep  7 09:31:40 lechuck kernel: [51927.096401] sd 12:0:0:0: [sdm] Sense Key : Medium Error [current] [descriptor]
> Sep  7 09:31:40 lechuck kernel: [51927.096406] Descriptor sense data with sense descriptors (in hex):
> Sep  7 09:31:40 lechuck kernel: [51927.096409]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> Sep  7 09:31:40 lechuck kernel: [51927.096420]         5d d9 20 a3
> Sep  7 09:31:40 lechuck kernel: [51927.096425] sd 12:0:0:0: [sdm] Add. Sense: Unrecovered read error - auto reallocate failed
> Sep  7 09:31:40 lechuck kernel: [51927.096431] sd 12:0:0:0: [sdm] CDB: Read(10): 28 00 5d d9 20 00 00 00 d8 00
> Sep  7 09:31:40 lechuck kernel: [51927.096442] end_request: I/O error, dev sdm, sector 1574510755
> Sep  7 09:31:40 lechuck kernel: [51927.104975] raid5:md10: read error not correctable (sector 1574510752 on sdm).
> Sep  7 09:31:40 lechuck kernel: [51927.104985] raid5: Disk failure on sdm, disabling device.
> Sep  7 09:31:40 lechuck kernel: [51927.104989] raid5: Operation continuing on 10 devices.
> Sep  7 09:31:40 lechuck kernel: [51927.122210] raid5:md10: read error not correctable (sector 1574510760 on sdm).
> Sep  7 09:31:40 lechuck kernel: [51927.122214] raid5:md10: read error not correctable (sector 1574510768 on sdm).
> Sep  7 09:31:40 lechuck kernel: [51927.122218] raid5:md10: read error not correctable (sector 1574510776 on sdm).
> Sep  7 09:31:40 lechuck kernel: [51927.122222] raid5:md10: read error not correctable (sector 1574510784 on sdm).
> Sep  7 09:31:40 lechuck kernel: [51927.122225] raid5:md10: read error not correctable (sector 1574510792 on sdm).
> Sep  7 09:31:40 lechuck kernel: [51927.122229] raid5:md10: read error not correctable (sector 1574510800 on sdm).
> Sep  7 09:31:40 lechuck kernel: [51927.122242] ata13: EH complete
> Sep  7 09:31:40 lechuck kernel: [51927.142926] md: md10: recovery done.
> Sep  7 09:31:40 lechuck mdadm[3840]: Fail event detected on md device /dev/md10, component device /dev/sdm
> Sep  7 09:31:40 lechuck kernel: [51927.344026] RAID5 conf printout:
> Sep  7 09:31:40 lechuck kernel: [51927.344031]  --- rd:12 wd:10
> Sep  7 09:31:40 lechuck kernel: [51927.344034]  disk 0, o:1, dev:sdf
> Sep  7 09:31:40 lechuck kernel: [51927.344037]  disk 1, o:1, dev:sdb
> Sep  7 09:31:40 lechuck kernel: [51927.344039]  disk 2, o:1, dev:sda
> Sep  7 09:31:40 lechuck kernel: [51927.344042]  disk 3, o:1, dev:sdc
> Sep  7 09:31:40 lechuck kernel: [51927.344044]  disk 4, o:1, dev:sdj
> Sep  7 09:31:40 lechuck kernel: [51927.344047]  disk 5, o:1, dev:sdi
> Sep  7 09:31:40 lechuck kernel: [51927.344049]  disk 6, o:1, dev:sdp
> Sep  7 09:31:40 lechuck kernel: [51927.344052]  disk 7, o:1, dev:sdn
> Sep  7 09:31:40 lechuck kernel: [51927.344054]  disk 8, o:1, dev:sdo
> Sep  7 09:31:40 lechuck kernel: [51927.344057]  disk 9, o:0, dev:sdm
> Sep  7 09:31:40 lechuck kernel: [51927.344059]  disk 10, o:1, dev:sdk
> Sep  7 09:31:40 lechuck kernel: [51927.344062]  disk 11, o:1, dev:sdl
> Sep  7 09:31:40 lechuck kernel: [51927.344064] RAID5 conf printout:
> Sep  7 09:31:40 lechuck kernel: [51927.344066]  --- rd:12 wd:10
> Sep  7 09:31:40 lechuck kernel: [51927.344068]  disk 0, o:1, dev:sdf
> Sep  7 09:31:40 lechuck kernel: [51927.344070]  disk 1, o:1, dev:sdb
> Sep  7 09:31:40 lechuck kernel: [51927.344073]  disk 2, o:1, dev:sda
> Sep  7 09:31:40 lechuck kernel: [51927.344075]  disk 3, o:1, dev:sdc
> Sep  7 09:31:40 lechuck kernel: [51927.344077]  disk 4, o:1, dev:sdj
> Sep  7 09:31:40 lechuck kernel: [51927.344080]  disk 5, o:1, dev:sdi
> Sep  7 09:31:40 lechuck kernel: [51927.344082]  disk 6, o:1, dev:sdp
> Sep  7 09:31:40 lechuck kernel: [51927.344084]  disk 7, o:1, dev:sdn
> Sep  7 09:31:40 lechuck kernel: [51927.344087]  disk 8, o:1, dev:sdo
> Sep  7 09:31:40 lechuck kernel: [51927.344089]  disk 9, o:0, dev:sdm
> Sep  7 09:31:40 lechuck kernel: [51927.344091]  disk 10, o:1, dev:sdk
> Sep  7 09:31:40 lechuck kernel: [51927.344093]  disk 11, o:1, dev:sdl
> Sep  7 09:31:40 lechuck kernel: [51927.344095] RAID5 conf printout:
> Sep  7 09:31:40 lechuck kernel: [51927.344097]  --- rd:12 wd:10
> Sep  7 09:31:40 lechuck kernel: [51927.344100]  disk 0, o:1, dev:sdf
> Sep  7 09:31:40 lechuck kernel: [51927.344102]  disk 1, o:1, dev:sdb
> Sep  7 09:31:40 lechuck kernel: [51927.344104]  disk 2, o:1, dev:sda
> Sep  7 09:31:40 lechuck kernel: [51927.344106]  disk 3, o:1, dev:sdc
> Sep  7 09:31:40 lechuck kernel: [51927.344109]  disk 4, o:1, dev:sdj
> Sep  7 09:31:40 lechuck kernel: [51927.344111]  disk 5, o:1, dev:sdi
> Sep  7 09:31:40 lechuck kernel: [51927.344113]  disk 6, o:1, dev:sdp
> Sep  7 09:31:40 lechuck kernel: [51927.344116]  disk 7, o:1, dev:sdn
> Sep  7 09:31:40 lechuck kernel: [51927.344118]  disk 8, o:1, dev:sdo
> Sep  7 09:31:40 lechuck kernel: [51927.344120]  disk 9, o:0, dev:sdm
> Sep  7 09:31:40 lechuck kernel: [51927.344122]  disk 10, o:1, dev:sdk
> Sep  7 09:31:40 lechuck kernel: [51927.344125]  disk 11, o:1, dev:sdl
> Sep  7 09:31:40 lechuck kernel: [51927.400014] RAID5 conf printout:
> Sep  7 09:31:40 lechuck kernel: [51927.400017]  --- rd:12 wd:10
> Sep  7 09:31:40 lechuck kernel: [51927.400020]  disk 0, o:1, dev:sdf
> Sep  7 09:31:40 lechuck kernel: [51927.400022]  disk 1, o:1, dev:sdb
> Sep  7 09:31:40 lechuck kernel: [51927.400025]  disk 2, o:1, dev:sda
> Sep  7 09:31:40 lechuck kernel: [51927.400027]  disk 3, o:1, dev:sdc
> Sep  7 09:31:40 lechuck kernel: [51927.400029]  disk 4, o:1, dev:sdj
> Sep  7 09:31:40 lechuck kernel: [51927.400032]  disk 5, o:1, dev:sdi
> Sep  7 09:31:40 lechuck kernel: [51927.400034]  disk 6, o:1, dev:sdp
> Sep  7 09:31:40 lechuck kernel: [51927.400036]  disk 7, o:1, dev:sdn
> Sep  7 09:31:40 lechuck kernel: [51927.400039]  disk 8, o:1, dev:sdo
> Sep  7 09:31:40 lechuck kernel: [51927.400041]  disk 10, o:1, dev:sdk
> Sep  7 09:31:40 lechuck kernel: [51927.400043]  disk 11, o:1, dev:sdl
> Sep  7 09:31:40 lechuck kernel: [51927.400138] md: recovery of RAID array md10
> Sep  7 09:31:40 lechuck kernel: [51927.400141] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> Sep  7 09:31:40 lechuck kernel: [51927.400145] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Sep  7 09:31:40 lechuck kernel: [51927.400155] md: using 128k window, over a total of 1465138496 blocks.
> Sep  7 09:31:40 lechuck kernel: [51927.400159] md: resuming recovery of md10 from checkpoint.
> Sep  7 09:31:40 lechuck mdadm[3840]: RebuildFinished event detected on md device /dev/md10, component device  mismatches found: 477544
> Sep  7 09:31:40 lechuck mdadm[3840]: RebuildStarted event detected on md device /dev/md10
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
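The log above identifies the failing sector three ways: in the libata taskfile, in the SCSI sense data, and in the `end_request` line. The Python sketch below decodes the first two and checks them against `sector 1574510755`. It makes two assumptions. First, libata prints each taskfile as `command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device`. Second, the sense data uses SPC descriptor format (response code 0x72) and carries an information descriptor (type 0x00). The function names are hypothetical.

```python
from typing import Optional, Tuple


def taskfile_lba(field: str) -> int:
    """Return the 48-bit LBA from a libata 'cmd'/'res' taskfile string.

    Assumed layout: cmd/feat:nsect:lbal:lbam:lbah/hfeat:hnsect:hlbal:hlbam:hlbah/dev
    """
    _, low, high, _ = field.split("/")
    _, _, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _, _, hlbal, hlbam, hlbah = (int(b, 16) for b in high.split(":"))
    lba = 0
    for byte in (hlbah, hlbam, hlbal, lbah, lbam, lbal):  # most significant first
        lba = (lba << 8) | byte
    return lba


def descriptor_sense(hex_bytes: str) -> Tuple[int, int, int, Optional[int]]:
    """Decode descriptor-format sense data into (key, asc, ascq, information)."""
    data = bytes.fromhex(hex_bytes)
    if (data[0] & 0x7F) not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    sense_key, asc, ascq = data[1] & 0x0F, data[2], data[3]
    info = None
    pos, end = 8, 8 + data[7]  # descriptors follow the 8-byte header
    while pos + 2 <= end:
        desc_type, desc_len = data[pos], data[pos + 1]
        if desc_type == 0x00 and desc_len == 0x0A:
            # Information descriptor: 8-byte big-endian field at offset 4.
            info = int.from_bytes(data[pos + 4:pos + 12], "big")
        pos += 2 + desc_len
    return sense_key, asc, ascq, info


if __name__ == "__main__":
    cmd = "60/d8:08:00:20:d9/00:00:5d:00:00/40"
    res = "41/40:35:a3:20:d9/00:00:5d:00:00/40"
    sense = "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 5d d9 20 a3"
    print(taskfile_lba(cmd))  # 1574510592, start of the queued read
    print(taskfile_lba(res))  # 1574510755, the sector the drive reported
    key, asc, ascq, info = descriptor_sense(sense)
    print(hex(key), hex(asc), hex(ascq), info)  # 0x3 0x11 0x4 1574510755
    print(info - info % 8)    # 1574510752, the sector in md's message
```

Sense key 0x3 is Medium Error, and ASC/ASCQ 0x11/0x04 is the "Unrecovered read error - auto reallocate failed" shown in the log. Rounding the failing sector down to a multiple of 8 reproduces the sector in md's first "not correctable" message. That result is consistent with md reporting the start of the 4 KiB block that contains the bad sector.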

