* sync_action repair not reading all sectors?
@ 2009-03-16 11:27 David Greaves
2009-03-16 15:04 ` David Lethe
2009-03-17 21:46 ` Dan Williams
0 siblings, 2 replies; 6+ messages in thread
From: David Greaves @ 2009-03-16 11:27 UTC
To: Neil Brown, linux-raid
I have a drive that has bad sectors. Lots of them.
smartctl shows
# 1  Short offline       Completed: read failure       20%       530         1953520877
A simple ddrescue to this part of the disk gets this:
Mar 16 10:41:28 elm kernel: [ 8643.123397] sd 3:0:0:0: [sdd] 1953525168 512-byte
hardware sectors (1000205 MB)
<snip<>51/40:00:f0:5c:70/00:00:74:00:00/e0 Emask 0x9 (media error)
Mar 16 10:41:29 elm kernel: [ 8644.190060] ata4.00: status: { DRDY ERR }
Mar 16 10:41:29 elm kernel: [ 8644.190099] ata4.00: error: { UNC }
and reports 30 or so errors.
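
The colon-separated hex string in the log excerpt above looks like a libata result taskfile. A minimal Python sketch follows that recovers the LBA the drive reported, assuming the usual field order (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device). That field order is an assumption, not something stated in this thread, and `decode_res` is a hypothetical helper name.

```python
def decode_res(res: str) -> dict:
    """Decode a libata-style result taskfile string.

    Assumed layout (not confirmed by the thread):
    status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    status_s, low_s, hob_s, device_s = res.split("/")
    error, nsect, lbal, lbam, lbah = (int(x, 16) for x in low_s.split(":"))
    _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in hob_s.split(":")
    )
    # LBA48: the three "hob" bytes are the high-order half of the address.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "status": int(status_s, 16),
        "error": error,
        "uncorrectable": bool(error & 0x40),  # UNC bit of the error register
        "lba": lba,
        "device": int(device_s, 16),
    }


decoded = decode_res("51/40:00:f0:5c:70/00:00:74:00:00/e0")
print(decoded["lba"], decoded["uncorrectable"])
```

Under that assumption the string decodes to LBA 1953520880 with the UNC bit set, which is three sectors past the 1953520877 that smartctl lists as the first failing LBA.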
mdstat tells me:
md0 : active raid5 sdd1[0] sdb1[2] sda1[1]
1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
So sdd1 is in there.
/dev/sdd1 is the full disk
Now this is an enterprise class disk so I thought re-writing the blocks would be
worthwhile as a first step. (It is being RMAed but if it succeeds then I'll stop
the array, mirror/replace the disk and start the array - less risky than a resync).
However (two runs of)
echo repair > /sys/block/md0/md/sync_action
ran to completion without *any* errors being reported in syslog (or anywhere)
Is this expected? It suggests that it isn't reading the bad parts of sdd. It
certainly hasn't repaired it and I'm none the wiser...
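
One way to see what md itself recorded after a repair pass is to read its sysfs attributes rather than rely on syslog alone. The sketch below is a hedged illustration: `sync_action` and `mismatch_cnt` are the attribute names used in the command above and in the kernel's md documentation, the per-member `dev-*/errors` counter is assumed to be present on this kernel, and `repair_summary` is a hypothetical helper name.

```python
from pathlib import Path


def read_attr(md_dir: Path, name: str) -> str:
    """Return the stripped contents of one md sysfs attribute."""
    return (md_dir / name).read_text().strip()


def repair_summary(md_dir="/sys/block/md0/md") -> dict:
    """Collect the state md exposes after a check/repair run.

    mismatch_cnt counts inconsistencies found between data and parity;
    it is not a count of unreadable sectors. The dev-*/errors files
    (assumed to exist on this kernel) hold per-member read-error counts.
    """
    md = Path(md_dir)
    summary = {
        "sync_action": read_attr(md, "sync_action"),
        "mismatch_cnt": int(read_attr(md, "mismatch_cnt")),
        "dev_errors": {},
    }
    for dev in sorted(md.glob("dev-*")):
        errors_file = dev / "errors"
        if errors_file.exists():
            summary["dev_errors"][dev.name] = int(errors_file.read_text().strip())
    return summary


# Usage on the affected machine (needs read access to sysfs):
#   print(repair_summary("/sys/block/md0/md"))
```

A clean summary only says that md saw no problems inside the region it manages; it says nothing about sectors on the physical disk that lie outside the member device.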
kernel is 2.6.26-1-xen-686
mdadm v2.6.7.2
PS
This is an excellent place where I'd love to add in a new 'spare' disk, mirror
sdd to the new disk (apart from the bad sectors which should come from the
array) and then swap new for old.
Instead I'm going to have to go degraded and sync - risking a sector read
failure on one of the other drives and a restore from backup :(
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
* RE: sync_action repair not reading all sectors?
2009-03-16 11:27 sync_action repair not reading all sectors? David Greaves
@ 2009-03-16 15:04 ` David Lethe
2009-03-16 15:20 ` Greg Freemyer
2009-03-17 21:46 ` Dan Williams
1 sibling, 1 reply; 6+ messages in thread
From: David Lethe @ 2009-03-16 15:04 UTC
To: David Greaves, Neil Brown, linux-raid
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of David Greaves
> Sent: Monday, March 16, 2009 6:27 AM
> To: Neil Brown; linux-raid@vger.kernel.org
> Subject: sync_action repair not reading all sectors?
>
> I have a drive that has bad sectors. Lots of them.
>
> smartctl shows
> # 1 Short offline Completed: read failure 20% 530
> 1953520877
>
> A simple ddrescue to this part of the disk gets this:
>
> Mar 16 10:41:28 elm kernel: [ 8643.123397] sd 3:0:0:0: [sdd] 1953525168 512-byte
> hardware sectors (1000205 MB)
> <snip<>51/40:00:f0:5c:70/00:00:74:00:00/e0 Emask 0x9 (media error)
> Mar 16 10:41:29 elm kernel: [ 8644.190060] ata4.00: status: { DRDY ERR }
> Mar 16 10:41:29 elm kernel: [ 8644.190099] ata4.00: error: { UNC }
>
> and reports 30 or so errors.
>
>
> mdstat tells me:
> md0 : active raid5 sdd1[0] sdb1[2] sda1[1]
> 1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
> So sdd1 is in there.
>
> /dev/sdd1 is the full disk
>
> Now this is an enterprise class disk so I thought re-writing the blocks would be
> worthwhile as a first step. (It is being RMAed but if it succeeds then I'll stop
> the array, mirror/replace the disk and start the array - less risky than a resync).
>
> However (two runs of)
> echo repair > /sys/block/md0/md/sync_action
> ran to completion without *any* errors being reported in syslog (or anywhere)
>
> Is this expected? It suggests that it isn't reading the bad parts of sdd. It
> certainly hasn't repaired it and I'm none the wiser...
>
> kernel is 2.6.26-1-xen-686
> mdadm v2.6.7.2
>
>
> PS
> This is an excellent place where I'd love to add in a new 'spare' disk, mirror
> sdd to the new disk (apart from the bad sectors which should come from the
> array) and then swap new for old.
> Instead I'm going to have to go degraded and sync - risking a sector read
> failure on one of the other drives and a restore from backup :(
>
> --
> "Don't worry, you'll be fine; I saw it work in a cartoon once..."
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Personally, if I had a disk that just got this many bad sectors, then I
wouldn't mess with it further. Just RMA it and get it out of your computer.
Every block of data you write to that disk is at risk, and since you are
running RAID5, then you have no room for error if this disk, or one of the
others, should die on you and you have a bad block on one of the surviving
disks.
David
* Re: sync_action repair not reading all sectors?
2009-03-16 15:04 ` David Lethe
@ 2009-03-16 15:20 ` Greg Freemyer
2009-03-17 10:49 ` David Greaves
0 siblings, 1 reply; 6+ messages in thread
From: Greg Freemyer @ 2009-03-16 15:20 UTC
To: David Lethe; +Cc: David Greaves, Neil Brown, linux-raid
On Mon, Mar 16, 2009 at 11:04 AM, David Lethe <david@santools.com> wrote:
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of David Greaves
>> Sent: Monday, March 16, 2009 6:27 AM
>> To: Neil Brown; linux-raid@vger.kernel.org
>> Subject: sync_action repair not reading all sectors?
>>
>> I have a drive that has bad sectors. Lots of them.
>>
>> smartctl shows
>> # 1 Short offline Completed: read failure 20% 530
>> 1953520877
>>
>> A simple ddrescue to this part of the disk gets this:
>>
>> Mar 16 10:41:28 elm kernel: [ 8643.123397] sd 3:0:0:0: [sdd] 1953525168 512-byte
>> hardware sectors (1000205 MB)
>> <snip<>51/40:00:f0:5c:70/00:00:74:00:00/e0 Emask 0x9 (media error)
>> Mar 16 10:41:29 elm kernel: [ 8644.190060] ata4.00: status: { DRDY ERR }
>> Mar 16 10:41:29 elm kernel: [ 8644.190099] ata4.00: error: { UNC }
>>
>> and reports 30 or so errors.
>>
>>
>> mdstat tells me:
>> md0 : active raid5 sdd1[0] sdb1[2] sda1[1]
>> 1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>>
>> So sdd1 is in there.
>>
>> /dev/sdd1 is the full disk
>>
>> Now this is an enterprise class disk so I thought re-writing the blocks would be
>> worthwhile as a first step. (It is being RMAed but if it succeeds then I'll stop
>> the array, mirror/replace the disk and start the array - less risky than a resync).
>>
>> However (two runs of)
>> echo repair > /sys/block/md0/md/sync_action
>> ran to completion without *any* errors being reported in syslog (or anywhere)
>>
>> Is this expected? It suggests that it isn't reading the bad parts of sdd. It
>> certainly hasn't repaired it and I'm none the wiser...
>>
>> kernel is 2.6.26-1-xen-686
>> mdadm v2.6.7.2
>>
>>
>> PS
>> This is an excellent place where I'd love to add in a new 'spare' disk, mirror
>> sdd to the new disk (apart from the bad sectors which should come from the
>> array) and then swap new for old.
>> Instead I'm going to have to go degraded and sync - risking a sector read
>> failure on one of the other drives and a restore from backup :(
>>
>> --
>> "Don't worry, you'll be fine; I saw it work in a cartoon once..."
>
> Personally, if I had a disk that just got this many bad sectors, then I
> wouldn't mess with it further. Just RMA it and get it out of your computer.
> Every block of data you write to that disk is at risk, and since you are
> running RAID5, then you have no room for error if this disk, or one of the
> others, should die on you and you have a bad block on one of the surviving
> disks.
>
> David
David,
I think you read too fast. That is exactly what he proposed. The
question was how to keep the raid-5 as fault tolerant as possible
during the drive swapout.
Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
* Re: sync_action repair not reading all sectors?
2009-03-16 15:20 ` Greg Freemyer
@ 2009-03-17 10:49 ` David Greaves
0 siblings, 0 replies; 6+ messages in thread
From: David Greaves @ 2009-03-17 10:49 UTC
To: Greg Freemyer; +Cc: David Lethe, Neil Brown, linux-raid
<snip suggestions to replace drive as I am doing that>
Greg Freemyer wrote:
> The question was how to keep the raid-5 as fault tolerant as possible
> during the drive swapout.
I think I'm resigned that that is an on-going feature request :)
Actually I'm more worried that a clearly failing drive is not causing any
problems for a sync_action = repair.
This drive is almost dead and md appears to think it is fine. How come?
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
* Re: sync_action repair not reading all sectors?
2009-03-16 11:27 sync_action repair not reading all sectors? David Greaves
2009-03-16 15:04 ` David Lethe
@ 2009-03-17 21:46 ` Dan Williams
2009-03-18 12:23 ` David Greaves
1 sibling, 1 reply; 6+ messages in thread
From: Dan Williams @ 2009-03-17 21:46 UTC
To: David Greaves; +Cc: Neil Brown, linux-raid
On Mon, Mar 16, 2009 at 4:27 AM, David Greaves <david@dgreaves.com> wrote:
> I have a drive that has bad sectors. Lots of them.
>
> smartctl shows
> # 1 Short offline Completed: read failure 20% 530
> 1953520877
>
> A simple ddrescue to this part of the disk gets this:
>
> Mar 16 10:41:28 elm kernel: [ 8643.123397] sd 3:0:0:0: [sdd] 1953525168 512-byte
> hardware sectors (1000205 MB)
> <snip<>51/40:00:f0:5c:70/00:00:74:00:00/e0 Emask 0x9 (media error)
> Mar 16 10:41:29 elm kernel: [ 8644.190060] ata4.00: status: { DRDY ERR }
> Mar 16 10:41:29 elm kernel: [ 8644.190099] ata4.00: error: { UNC }
>
> and reports 30 or so errors.
>
>
> mdstat tells me:
> md0 : active raid5 sdd1[0] sdb1[2] sda1[1]
> 1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
> So sdd1 is in there.
>
> /dev/sdd1 is the full disk
>
Are you sure? Maybe I did the following math wrong, but it seems
there is a chance this bad region is outside the raid array.
/proc/mdstat says the array is 1953519872 blocks large which is
3907039744 sectors. For a three disk raid5 that means we are using
1953519872 sectors per disk. The failing sector of 1953520877 is 1005
sectors outside the array, probably 942 assuming partition 1 starts at
sector 63??
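
The arithmetic above can be replayed in a few lines of Python. This is a sketch only: the numbers are the ones quoted in this thread, the start-at-sector-63 offset is the same guess made above, and `member_sectors` is a hypothetical helper name.

```python
SECTOR_BYTES = 512
MDSTAT_BLOCK_BYTES = 1024  # /proc/mdstat reports sizes in 1 KiB blocks


def member_sectors(array_blocks: int, raid_disks: int) -> int:
    """Sectors of each member that a RAID5 array of this size uses.

    RAID5 stores data on (raid_disks - 1) disks' worth of space, so the
    array size divided by that count is the per-member data extent.
    """
    array_sectors = array_blocks * MDSTAT_BLOCK_BYTES // SECTOR_BYTES
    return array_sectors // (raid_disks - 1)


array_blocks = 1953519872   # from /proc/mdstat
failing_lba = 1953520877    # from the smartctl self-test log
partition_start = 63        # assumption: classic DOS partition offset

per_disk = member_sectors(array_blocks, raid_disks=3)
past_end = failing_lba - per_disk
past_end_with_offset = past_end - partition_start

print(per_disk)              # 1953519872
print(past_end)              # 1005
print(past_end_with_offset)  # 942
```

Either way the failing sector lands beyond the region md reads during a repair, which would explain a clean run.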
--
Dan
* Re: sync_action repair not reading all sectors?
2009-03-17 21:46 ` Dan Williams
@ 2009-03-18 12:23 ` David Greaves
0 siblings, 0 replies; 6+ messages in thread
From: David Greaves @ 2009-03-18 12:23 UTC
To: Dan Williams; +Cc: Neil Brown, linux-raid
Dan Williams wrote:
> On Mon, Mar 16, 2009 at 4:27 AM, David Greaves <david@dgreaves.com> wrote:
>> I have a drive that has bad sectors. Lots of them.
>>
>> smartctl shows
>> # 1 Short offline Completed: read failure 20% 530
>> 1953520877
>>
>> A simple ddrescue to this part of the disk gets this:
>>
>> Mar 16 10:41:28 elm kernel: [ 8643.123397] sd 3:0:0:0: [sdd] 1953525168 512-byte
>> hardware sectors (1000205 MB)
>> <snip<>51/40:00:f0:5c:70/00:00:74:00:00/e0 Emask 0x9 (media error)
>> Mar 16 10:41:29 elm kernel: [ 8644.190060] ata4.00: status: { DRDY ERR }
>> Mar 16 10:41:29 elm kernel: [ 8644.190099] ata4.00: error: { UNC }
>>
>> and reports 30 or so errors.
>>
>>
>> mdstat tells me:
>> md0 : active raid5 sdd1[0] sdb1[2] sda1[1]
>> 1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>>
>> So sdd1 is in there.
>>
>> /dev/sdd1 is the full disk
>>
>
> Are you sure? Maybe I did the following math wrong, but it seems
> there is a chance this bad region is outside the raid array.
> /proc/mdstat says the array is 1953519872 blocks large which is
> 3907039744 sectors. For a three disk raid5 that means we are using
> 1953519872 sectors per disk. The failing sector of 1953520877 is 1005
> sectors outside the array, probably 942 assuming partition 1 starts at
> sector 63??
>
> --
> Dan
Thanks for taking the time to look, and for spotting this, Dan.
Well, you are right. The media error is occurring outside the partition.
But equally: yes, it's the full disk according to cfdisk and fdisk.
I *knew* that I'd allocated the full disk to the partition and checked at a
cursory level but not at a sector level :(
Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
/dev/sdd1 1 121601 976760001 83 Linux
1 Primary 0 1953520064 63 1953520065 Linux (83) None
but kernel.log says:
sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
So I humbly apologise for doubting md :)
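
The mismatch between "full disk" and the kernel's sector count can be checked from the figures above. The Python sketch below assumes the fdisk row is read as start cylinder, end cylinder and 1 KiB blocks, and that the partition starts at sector 63 (the offset column in the cfdisk row). The variable names are illustrative only.

```python
disk_sectors = 1953525168        # kernel.log: 512-byte hardware sectors
sectors_per_cylinder = 16065     # fdisk: 255 heads * 63 sectors/track
cylinders = 121601               # fdisk: cylinder count
partition_blocks_kib = 976760001 # fdisk: size of /dev/sdd1 in 1 KiB blocks
partition_start = 63             # cfdisk: offset column
failing_lba = 1953520877         # smartctl: first failing LBA

# Space that whole-cylinder (CHS) partitioning can address.
chs_sectors = cylinders * sectors_per_cylinder

# Last sector that belongs to /dev/sdd1.
partition_last = partition_start + partition_blocks_kib * 2 - 1

# Sectors at the end of the disk that no partition covers.
unpartitioned_tail = disk_sectors - chs_sectors

print(chs_sectors)                   # 1953520065
print(partition_last)                # 1953520064
print(unpartitioned_tail)            # 5103
print(failing_lba - partition_last)  # 813
```

So the partition is "full" only up to the last whole cylinder. On these figures the disk has 5103 further sectors past it, and the failing LBA sits 813 sectors beyond the end of sdd1.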
Pragmatically it looks like a genuine disk error but I should be OK to recover
by stopping the array and doing a fast ddrescue mirror on this device rather
than a more risky replace/resync now the advance replacement has arrived.
Shame we can't do that without stopping the array yet ;)
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
Thread overview: 6+ messages
2009-03-16 11:27 sync_action repair not reading all sectors? David Greaves
2009-03-16 15:04 ` David Lethe
2009-03-16 15:20 ` Greg Freemyer
2009-03-17 10:49 ` David Greaves
2009-03-17 21:46 ` Dan Williams
2009-03-18 12:23 ` David Greaves