read error: how to fix?

Linux Btrfs filesystem development

* read error: how to fix?
@ 2011-10-07 16:51 Helmut Hullen
  2011-10-10 11:48 ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: Helmut Hullen @ 2011-10-07 16:51 UTC
  To: linux-btrfs

Hello,

I'm currently copying about 1.5 TByte from a 3-disk btrfs filesystem (data:
raid0) to another disk. There seem to be 2 damaged files, and they stop
the copying process.

Oct  7 18:16:55 Arktur kernel: ata5.00: exception Emask 0x0 SAct 0x0  
SErr 0x0 action 0x0
Oct  7 18:16:55 Arktur kernel: ata5.00: BMDMA2 stat 0x80d2009
Oct  7 18:16:55 Arktur kernel: ata5.00: failed command: READ DMA
Oct  7 18:16:55 Arktur kernel: ata5.00: cmd c8/00:40:57:d0:34/00:00:00:00:00/ee tag 0 dma 32768 in
Oct  7 18:16:55 Arktur kernel:          res 51/40:40:57:d0:34/00:03:0e:00:00/fe Emask 0x9 (media error)
Oct  7 18:16:55 Arktur kernel: ata5.00: status: { DRDY ERR }
Oct  7 18:16:55 Arktur kernel: ata5.00: error: { UNC }
Oct  7 18:16:55 Arktur kernel: ata5.00: configured for UDMA/100
Oct  7 18:16:55 Arktur kernel: ata5: EH complete

(repeating every 3 seconds)
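
The "res" line in the log above can be decoded to find the failing sector. The sketch below assumes the usual libata taskfile layout (status/error:nsect:lbal:lbam:lbah/five "hob" bytes/device) and a 28-bit LBA, which is what READ DMA (0xc8) uses; the function name decode_lba28 is made up for illustration.

```shell
#!/bin/bash
# Decode the 28-bit LBA from a libata "cmd"/"res" taskfile string.
# Assumed field order (not stated in the thread):
#   status/error:nsect:lbal:lbam:lbah/hob(5 bytes)/device
decode_lba28() {
    local status error nsect lbal lbam lbah h1 h2 h3 h4 h5 dev
    # Split the string on '/' and ':' into its twelve hex bytes.
    IFS='/:' read -r status error nsect lbal lbam lbah h1 h2 h3 h4 h5 dev <<EOF
$1
EOF
    # LBA28 = low nibble of the device byte, then lbah, lbam, lbal.
    echo $(( ((0x$dev & 0x0f) << 24) | (0x$lbah << 16) | (0x$lbam << 8) | 0x$lbal ))
}

decode_lba28 "51/40:40:57:d0:34/00:03:0e:00:00/fe"   # prints 238342231
```

Under these assumptions the result lies in the same 4 KiB block as sector 238342224, which the kernel reports later in this thread.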

The files contain no valuable data (*.mpeg files, reproducible). But how
can I tell the disk not to use the damaged sector(s)?

On an ext2/3 system I used "badblocks" - is there some comparable tool  
for btrfs?

Best regards!
Helmut


* Re: read error: how to fix?
  2011-10-07 16:51 read error: how to fix? Helmut Hullen
@ 2011-10-10 11:48 ` David Sterba
  2011-10-10 13:28   ` Helmut Hullen
  0 siblings, 1 reply; 12+ messages in thread
From: David Sterba @ 2011-10-10 11:48 UTC
  To: helmut; +Cc: linux-btrfs

On Fri, Oct 07, 2011 at 06:51:00PM +0200, Helmut Hullen wrote:
> Oct  7 18:16:55 Arktur kernel: ata5.00: exception Emask 0x0 SAct 0x0  
> SErr 0x0 action 0x0
> Oct  7 18:16:55 Arktur kernel: ata5.00: BMDMA2 stat 0x80d2009
> Oct  7 18:16:55 Arktur kernel: ata5.00: failed command: READ DMA
> Oct  7 18:16:55 Arktur kernel: ata5.00: cmd c8/00:40:57:d0:34/00:00:00:00:00/ee tag 0 dma 32768 in
> Oct  7 18:16:55 Arktur kernel:          res 51/40:40:57:d0:34/00:03:0e:00:00/fe Emask 0x9 (media error)
> Oct  7 18:16:55 Arktur kernel: ata5.00: status: { DRDY ERR }
> Oct  7 18:16:55 Arktur kernel: ata5.00: error: { UNC }
> Oct  7 18:16:55 Arktur kernel: ata5.00: configured for UDMA/100
> Oct  7 18:16:55 Arktur kernel: ata5: EH complete
> 
> (repeating every 3 seconds)
> 
> The files contain no valuable data (*.mpeg files, reproducible). But how
> can I tell the disk not to use the damaged sector(s)?
> 
> On an ext2/3 system I used "badblocks" - is there some comparable tool  
> for btrfs?

No, there isn't, but it's a good topic for a btrfs project :)

(I see lots of interesting problems like relocating superblocks, damaged
allocator structures, ...)


david


* Re: read error: how to fix?
  2011-10-10 11:48 ` David Sterba
@ 2011-10-10 13:28   ` Helmut Hullen
  2011-10-10 14:07     ` Jeff Mahoney
  0 siblings, 1 reply; 12+ messages in thread
From: Helmut Hullen @ 2011-10-10 13:28 UTC
  To: linux-btrfs

Hello, David,

You wrote on 10.10.11:

>> Oct  7 18:16:55 Arktur kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>> Oct  7 18:16:55 Arktur kernel: ata5.00: BMDMA2 stat 0x80d2009
>> Oct  7 18:16:55 Arktur kernel: ata5.00: failed command: READ DMA
>> Oct  7 18:16:55 Arktur kernel: ata5.00: cmd c8/00:40:57:d0:34/00:00:00:00:00/ee tag 0 dma 32768 in
>> Oct  7 18:16:55 Arktur kernel:          res 51/40:40:57:d0:34/00:03:0e:00:00/fe Emask 0x9 (media error)
>> Oct  7 18:16:55 Arktur kernel: ata5.00: status: { DRDY ERR }
>> Oct  7 18:16:55 Arktur kernel: ata5.00: error: { UNC }
>> Oct  7 18:16:55 Arktur kernel: ata5.00: configured for UDMA/100
>> Oct  7 18:16:55 Arktur kernel: ata5: EH complete
>>
>> (repeating every 3 seconds)
>>
>> The files contain no valuable data (*.mpeg files, reproducible). But
>> how can I tell the disk not to use the damaged sector(s)?
>>
>> On an ext2/3 system I used "badblocks" - is there some comparable
>> tool for btrfs?

> No there isn't, but it's a good topic for a btrfs project :)

> (I see lots of interesting problems like relocating superblocks,
> damaged allocator structures, ...)

I've just worked with the 2 unreadable files again.

Copying them to another partition stalled partway through: one file at
about 98%, the other at about 2%.

I had to kill the "cp" command with "killall cp".

The same problem occurred when deleting: I had to use "killall rm". "I'm not
amused" ...

And I'm curious what the system will do with the 2 unreadable sectors.
In about 1 year I'll have to add the next 2 TByte disk, with "add" and
"balance". Maybe I'll have to copy the 3-disk cluster to a 4-disk cluster
...

Best regards!
Helmut


* Re: read error: how to fix?
  2011-10-10 13:28   ` Helmut Hullen
@ 2011-10-10 14:07     ` Jeff Mahoney
  2011-10-10 15:58       ` Helmut Hullen
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Mahoney @ 2011-10-10 14:07 UTC
  To: helmut; +Cc: Helmut Hullen, linux-btrfs


On 10/10/2011 09:28 AM, Helmut Hullen wrote:
> Hello, David,
> 
> You wrote on 10.10.11:
> 
>>> Oct  7 18:16:55 Arktur kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>>> Oct  7 18:16:55 Arktur kernel: ata5.00: BMDMA2 stat 0x80d2009
>>> Oct  7 18:16:55 Arktur kernel: ata5.00: failed command: READ DMA
>>> Oct  7 18:16:55 Arktur kernel: ata5.00: cmd c8/00:40:57:d0:34/00:00:00:00:00/ee tag 0 dma 32768 in
>>> Oct  7 18:16:55 Arktur kernel:          res 51/40:40:57:d0:34/00:03:0e:00:00/fe Emask 0x9 (media error)
>>> Oct  7 18:16:55 Arktur kernel: ata5.00: status: { DRDY ERR }
>>> Oct  7 18:16:55 Arktur kernel: ata5.00: error: { UNC }
>>> Oct  7 18:16:55 Arktur kernel: ata5.00: configured for UDMA/100
>>> Oct  7 18:16:55 Arktur kernel: ata5: EH complete
>>> 
>>> (repeating every 3 seconds)
>>> 
>>> The files contain no valuable data (*.mpeg files,
>>> reproducible). But how can I tell the disk not to use the
>>> damaged sector(s)?
>>> 
>>> On an ext2/3 system I used "badblocks" - is there some
>>> comparable tool for btrfs?
> 
>> No there isn't, but it's a good topic for a btrfs project :)
> 
>> (I see lots of interesting problems like relocating superblocks, 
>> damaged allocator structures, ...)
> 
> I've just worked again with the 2 unreadable files.
> 
> Copying them to another partition stopped somewhere, one time/file
> at about 98%, the other time at about 2%.
> 
> I had to kill the "cp" order with "killall cp".
> 
> The same problem with deleting: I had to use "killall rm". "I'm not
>  amused" ...
> 
> And I'm curious what the system will do with the 2 unreadable
> sectors. In about 1 year I have to add the next 2 TByte disk, with
> "add" and "balance". Maybe I have to copy the 3-disks cluster to a
> 4-disks-cluster ...

I'd try replacing the SATA cable and if that doesn't fix it up, you
may be out of luck. The thing is that marking sectors bad is a (pretty
poor) band-aid for a much bigger problem: If you're hitting persistent
read errors and re-writing the blocks doesn't fix it, your disk is
already close to being completely kaput and no amount of software is
going to help with that.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


* Re: read error: how to fix?
  2011-10-10 14:07     ` Jeff Mahoney
@ 2011-10-10 15:58       ` Helmut Hullen
  2011-10-14 19:47         ` Jeff Mahoney
  2011-10-15 18:47         ` Martin Steigerwald
  0 siblings, 2 replies; 12+ messages in thread
From: Helmut Hullen @ 2011-10-10 15:58 UTC
  To: linux-btrfs

Hello, Jeff,

You wrote on 10.10.11:

>>>> Oct  7 18:16:55 Arktur kernel: ata5.00: exception Emask 0x0
>>>> SAct 0x0 SErr 0x0 action 0x0

[...]

>> I've just worked again with the 2 unreadable files.
>>
>> Copying them to another partition stopped somewhere, one time/file
>> at about 98%, the other time at about 2%.

[...]

> I'd try replacing the SATA cable and if that doesn't fix it up, you
> may be out of luck.

There are 2 unreadable sectors (reproducible). Swapping or re-seating
the cables doesn't help.

> The thing is that marking sectors bad is a
> (pretty poor) band-aid for a much bigger problem: If you're hitting
> persistent read errors and re-writing the blocks doesn't fix it, your
> disk is already close to being completely kaput and no amount of
> software is going to help with that.

The next steps could be:

- adding a new 2-TByte disk (now there are 3 2-TByte disks)
- balancing
- removing the bad 2-TByte disk
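
The three steps above map onto btrfs commands roughly as sketched below. The function only prints the commands instead of running them; the mount point and device names are placeholders, and plan_disk_swap is a made-up helper name. Current btrfs-progs spell the balance step "btrfs balance start"; the tools of 2011 used "btrfs filesystem balance" instead.

```shell
#!/bin/bash
# Print (do not run) the add / balance / remove sequence.
# All three arguments are placeholders chosen by the caller.
plan_disk_swap() {
    local mnt=$1 new=$2 old=$3
    printf 'btrfs device add %s %s\n' "$new" "$mnt"
    printf 'btrfs balance start %s\n' "$mnt"
    printf 'btrfs device delete %s %s\n' "$old" "$mnt"
}

plan_disk_swap /mnt/cluster /dev/sdNEW /dev/sdOLD
```

With raid0 data there is no second copy of an unreadable extent, so the balance or delete step may well stop with I/O errors when it reaches the damaged sectors.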

But I'm afraid that when I run the balance, the bad sectors will damage big
parts of the contents. I had such bad luck about 1 year ago, losing
about 2 TByte of data (ok - I had a kind of backup in a neighbouring town).
I don't want to repeat this experience.

I'm afraid I have to buy 3 (or 4) 2-TByte disks, build them into a new
raid0-data cluster and copy the complete contents from the old cluster
to the new one. Doesn't sound good.

-----------------------

2 bad sectors out of a total of 4*10^9 sectors is (from another point of
view) not a bad error rate ...
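
The arithmetic behind that figure, as a small sketch. It assumes a 2 TByte disk means 2*10^12 bytes and that sectors are 512 bytes; both helper names are made up.

```shell
#!/bin/bash
# sector_count BYTES SECTOR_SIZE: number of sectors on a disk.
sector_count() {
    echo $(( $1 / $2 ))
}

# bad_fraction BAD TOTAL: fraction of sectors that are bad.
bad_fraction() {
    awk -v bad="$1" -v total="$2" 'BEGIN { printf "%.1e\n", bad / total }'
}

sector_count 2000000000000 512   # prints 3906250000
bad_fraction 2 3906250000        # prints 5.1e-10
```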

Best regards!
Helmut


* Re: read error: how to fix?
  2011-10-10 15:58       ` Helmut Hullen
@ 2011-10-14 19:47         ` Jeff Mahoney
  2011-10-15 18:47         ` Martin Steigerwald
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff Mahoney @ 2011-10-14 19:47 UTC
  To: helmut; +Cc: Helmut Hullen, linux-btrfs


On 10/10/2011 11:58 AM, Helmut Hullen wrote:
> Hello, Jeff,
> 
> You wrote on 10.10.11:
> 
>>>>> Oct  7 18:16:55 Arktur kernel: ata5.00: exception Emask
>>>>> 0x0 SAct 0x0 SErr 0x0 action 0x0
> 
> [...]
> 
>>> I've just worked again with the 2 unreadable files.
>>> 
>>> Copying them to another partition stopped somewhere, one
>>> time/file at about 98%, the other time at about 2%.
> 
> [...]
> 
>> I'd try replacing the SATA cable and if that doesn't fix it up,
>> you may be out of luck.
> 
> There are 2 unreadable sectors (reproducible). Changing or
> re-mounting the cables doesn't help.
> 
>> The thing is that marking sectors bad is a (pretty poor) band-aid
>> for a much bigger problem: If you're hitting persistent read
>> errors and re-writing the blocks doesn't fix it, your disk is
>> already close to being completely kaput and no amount of software
>> is going to help with that.
> 
> The next steps could be:
> 
> - adding a new 2-TByte disk (now there are 3 2-TByte disks)
> - balancing
> - removing the bad 2-TByte disk
> 
> But I'm afraid when I run balancing then the bad sectors damage big
> parts of the contents. I've had such bad luck about 1 year ago,
> losing about 2 TByte of data (ok - I had a kind of backup in a
> neighbouring town). I don't like to reproduce this experience.
> 
> I'm afraid I have to buy 3 (or 4) 2-TByte disks, building them as a
> new raid0-data cluster and copy the complete contents from the old
> cluster to the new one. Doesn't sound good.
> 
> -----------------------
> 
> 2 bad sectors from a total of 4*10^9 sectors is (in another point
> of view) no bad error rate ...

Well, it's worse than that. The disk will try to correct for bad
sectors itself internally and will remap them. By the time you start
to see bad sectors on disk (where writing and then reading fails), the
disk's internal remap table has been filled. That hides the true
defect rate but it also means that it's only a matter of time before
you get more bad sectors.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


* Re: read error: how to fix?
  2011-10-10 15:58       ` Helmut Hullen
  2011-10-14 19:47         ` Jeff Mahoney
@ 2011-10-15 18:47         ` Martin Steigerwald
  2011-10-15 19:59           ` Helmut Hullen
  1 sibling, 1 reply; 12+ messages in thread
From: Martin Steigerwald @ 2011-10-15 18:47 UTC
  To: linux-btrfs, helmut

Hi Helmut,

On Monday, 10 October 2011, Helmut Hullen wrote:
> > The thing is that marking sectors bad is a
> > (pretty poor) band-aid for a much bigger problem: If you're hitting
> > persistent read errors and re-writing the blocks doesn't fix it, your
> > disk is already close to being completely kaput and no amount of
> > software is going to help with that.
> 
> The next steps could be:
> 
> - adding a new 2-TByte disk (now there are 3 2-TByte disks)
> - balancing
> - removing the bad 2-TByte disk
> 
> But I'm afraid when I run balancing then the bad sectors damage big
> parts of the contents. I've had such bad luck about 1 year ago,
> losing about 2 TByte of data (ok - I had a kind of backup in a
> neighbouring town). I don't like to reproduce this experience.
> 
> I'm afraid I have to buy 3 (or 4) 2-TByte disks, building them as a
> new raid0-data cluster and copy the complete contents from the old
> cluster to the new one. Doesn't sound good.

RAID-0 and valuable (?) data do not go together. So if you go to 4
disks, consider a RAID 10 ;). Then you could mark the disk faulty, put in a
new one and let BTRFS resync/balance the RAID. But if everything is only
stored on one disk, that's not possible.

A RAID 5 might also be an alternative, but I am not sure whether RAID-5
is already working with BTRFS. I heard about plans to borrow some SoftRAID
code for that.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: read error: how to fix?
  2011-10-15 18:47         ` Martin Steigerwald
@ 2011-10-15 19:59           ` Helmut Hullen
  2011-10-16 19:32             ` Calvin Walton
  0 siblings, 1 reply; 12+ messages in thread
From: Helmut Hullen @ 2011-10-15 19:59 UTC
  To: linux-btrfs

Hello, Martin,

You wrote on 15.10.11:

> RAID-0 and valuable (?) data does not match together.

I know. The data isn't valuable. It's *.mpeg2 from DVB-T, repeated at  
least every two years. It's a kind of old LP or old VHS cassette.

But that doesn't solve the problem with errors on one of the disks. I
don't want to throw away a disk if it has (perhaps) repairable read
errors. I'd like to use a tool like "badblocks".

Best regards!
Helmut


* Re: read error: how to fix?
  2011-10-15 19:59           ` Helmut Hullen
@ 2011-10-16 19:32             ` Calvin Walton
  2011-10-17  3:35               ` Helmut Hullen
                                 ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Calvin Walton @ 2011-10-16 19:32 UTC
  To: helmut; +Cc: linux-btrfs

On Sat, 2011-10-15 at 21:59 +0200, Helmut Hullen wrote:
> Hello, Martin,
> 
> You wrote on 15.10.11:
> 
> > RAID-0 and valuable (?) data does not match together.
> 
> I know. The data isn't valuable. It's *.mpeg2 from DVB-T, repeated at  
> least every two years. It's a kind of old LP or old VHS cassette.
> 
> But that doesn't solve the problem with errors on one of the disks. I  
> don't like to throw away a disk if it has (perhaps) repairable read  
> errors. I'd like to use a tool like "badblocks".

Well, let's take a look at the state of your drive. Install
smartmontools, and run 'smartctl -A /dev/sdX'. On a properly
operational drive, you'll see these:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

First things first. If the VALUE of Reallocated_Sector_Ct is less than
or equal to THRESH, then your drive is garbage; all of the reallocation
space has been used. This means many errors have occurred, and more will
keep happening. Get it replaced ASAP.

If the RAW_VALUE of Reallocated_Sector_Ct is above 0, then the drive has
in the past dynamically reallocated some sectors - i.e. it has had
errors, but they have been repaired.

The Current_Pending_Sector value is interesting. It counts the number of
sectors which have had read errors, but have not been remapped
internally in the drive, because it couldn't recover the data using
error correction. These result in Read errors in the OS - this is
probably what you are seeing.
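
The checks described above can be sketched as a small filter over 'smartctl -A' output. The function name and the wording of its messages are made up; the column positions assume the ten-column attribute table shown above. The sample rows are the ones posted for the WD20EARS later in this thread.

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and apply the rules of thumb above.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_sector_report() {
    awk '
        $2 == "Reallocated_Sector_Ct" {
            if ($4 + 0 <= $6 + 0) {
                print "reallocated: VALUE at or below THRESH, replace the drive"
            } else if ($10 + 0 > 0) {
                print "reallocated: " $10 " sector(s) remapped so far"
            } else {
                print "reallocated: none"
            }
        }
        $2 == "Current_Pending_Sector" {
            if ($10 + 0 > 0) {
                print "pending: " $10 " unreadable sector(s) awaiting rewrite"
            } else {
                print "pending: none"
            }
        }
    '
}

smart_sector_report <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       26
EOF
```

On a live system the input would come from a pipe, e.g. smartctl -A /dev/sdX | smart_sector_report.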

If you have pending sectors, causing the drive to reallocate them is
very simple. Write data (any data) over the sector in question - the
drive will then remap it onto the spare area to do the write. (The
easiest way is to do something like dd if=/dev/zero of=/dev/sdX, but
that wipes the entire drive; if you know the exact sector number,
"hdparm --write-sector" can remap it quickly.)

Keep in mind, though - if you have a single reallocated sector on a
drive, it means that the drive medium is deteriorating. It's very likely
that you will have additional failures in the future, resulting in more
IO errors and lost data. For your sanity, I recommend replacing a drive
as soon as you see any one error on it.

-- 
Calvin Walton <calvin.walton@kepstin.ca>


* Re: read error: how to fix?
  2011-10-16 19:32             ` Calvin Walton
@ 2011-10-17  3:35               ` Helmut Hullen
  2011-10-18 15:33               ` Helmut Hullen
  2011-10-21  9:40               ` Helmut Hullen
  2 siblings, 0 replies; 12+ messages in thread
From: Helmut Hullen @ 2011-10-17  3:35 UTC
  To: linux-btrfs

Hello, Calvin,

You wrote on 16.10.11:

>> I don't like to throw away a disk if it has (perhaps) repairable
>> read errors. I'd like to use a tool like "badblocks".

> Well, let's take a look at the state of your drive. Install
> smartmontools, and run 'smartctl -A /dev/sdX'. On a properly
> operational drive, you'll see these:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Here (WDC WD20EARS):

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       26
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       25

-------------------

> First things first. If the VALUE of Reallocated_Sector_Ct is less
> than or equal to THRESH, then your drive is garbage; all of the
> reallocation space has been used. This means many errors have
> occurred, and more will keep happening. Get it replaced ASAP.

There may be hope ...

> The Current_Pending_Sector value is interesting. It counts the number
> of sectors which have had read errors, but have not been remapped
> internally in the drive, because it couldn't recover the data using
> error correction. These result in Read errors in the OS - this is
> probably what you are seeing.

> If you have pending sectors, causing the drive to reallocate them is
> very simple. Write data (any data) over the sector in question - the
> drive will then remap it onto the spare area to do the write. (The
> easiest way is to do something like dd if=/dev/zero of=/dev/sdX; but
> if you know the exact sector number, "hdparm --write-sector" can
> remap it quickly.)

Ok - I'll take a try.

> Keep in mind, though - if you have a single reallocated sector on a
> drive, it means that the drive medium is deteriorating. It's very
> likely that you will have additional failures in the future,
> resulting in more IO errors and lost data. For your sanity, I
> recommend replacing a drive as soon as you see any one error on it.

In the past most (nearly all) such problems came from a bad power supply
and/or bad cables; "dd if=/dev/zero" or "badblocks" fixed them ...

Best regards!
Helmut


* Re: read error: how to fix?
  2011-10-16 19:32             ` Calvin Walton
  2011-10-17  3:35               ` Helmut Hullen
@ 2011-10-18 15:33               ` Helmut Hullen
  2011-10-21  9:40               ` Helmut Hullen
  2 siblings, 0 replies; 12+ messages in thread
From: Helmut Hullen @ 2011-10-18 15:33 UTC
  To: linux-btrfs

Hello, Calvin,

You wrote on 16.10.11:


[...]

> If you have pending sectors, causing the drive to reallocate them is
> very simple. Write data (any data) over the sector in question - the
> drive will then remap it onto the spare area to do the write. (The
> easiest way is to do something like dd if=/dev/zero of=/dev/sdX; but
> if you know the exact sector number, "hdparm --write-sector" can
> remap it quickly.)

I have to try in the next days ...

> Keep in mind, though - if you have a single reallocated sector on a
> drive, it means that the drive medium is deteriorating. It's very
> likely that you will have additional failures in the future,
> resulting in more IO errors and lost data. For your sanity, I
> recommend replacing a drive as soon as you see any one error on it.

Currently "dd if=/dev/sdg of=/dev/zero " reports (in "/var/log/warn")
strange things like

Oct 18 14:42:48 Arktur kernel: Buffer I/O error on device sdg, logical block 29792786
Oct 18 14:42:48 Arktur kernel: Buffer I/O error on device sdg, logical block 29792787
Oct 18 14:43:04 Arktur kernel: end_request: I/O error, dev sdg, sector 238342224
Oct 18 14:43:04 Arktur kernel: Buffer I/O error on device sdg, logical block 29792778
Oct 18 14:43:20 Arktur kernel: end_request: I/O error, dev sdg, sector 238342224
Oct 18 14:43:20 Arktur kernel: Buffer I/O error on device sdg, logical block 29792778
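
The "logical block" and "sector" numbers in these messages describe the same spot in different units. A minimal sketch of the conversion, assuming 4096-byte buffer blocks and 512-byte sectors (the function name is made up):

```shell
#!/bin/bash
# block_to_sectors BLOCK: first and last 512-byte sector covered by
# one 4096-byte logical block (both sizes are assumptions).
block_to_sectors() {
    local block=$1
    local per_block=$(( 4096 / 512 ))
    local first=$(( block * per_block ))
    echo "$first $(( first + per_block - 1 ))"
}

block_to_sectors 29792778   # prints "238342224 238342231"
```

So logical block 29792778 starts exactly at sector 238342224, the sector named in the end_request lines.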

-------------------------

From yesterday to this morning the number of offline uncorrectable sectors has
grown from 25 to 26 - not a good omen.

Maybe some files are damaged - the disk holds about 1.4
TByte and is part of a btrfs cluster with more than 4 TByte of data.

What about "btrfsck" - can it help? Or might it lead to one more crash?

When I try to copy the whole cluster to another place (I had this
problem some days ago), the system crashes when it tries to access
the particular file that uses such a defective sector. If I first
find out the name of this file and then exclude it from copying, "cp"
works.

Nice problems ...

Best regards!
Helmut


* Re: read error: how to fix?
  2011-10-16 19:32             ` Calvin Walton
  2011-10-17  3:35               ` Helmut Hullen
  2011-10-18 15:33               ` Helmut Hullen
@ 2011-10-21  9:40               ` Helmut Hullen
  2 siblings, 0 replies; 12+ messages in thread
From: Helmut Hullen @ 2011-10-21  9:40 UTC
  To: linux-btrfs

Hello, Calvin,

You wrote on 16.10.11:

[...]

> If you have pending sectors, causing the drive to reallocate them is
> very simple. Write data (any data) over the sector in question - the
> drive will then remap it onto the spare area to do the write. (The
> easiest way is to do something like dd if=/dev/zero of=/dev/sdX; but
> if you know the exact sector number, "hdparm --write-sector" can
> remap it quickly.)

(instead of a blog ... and please excuse my Gerlish)

I've bought another 2-TByte disk (Samsung - seems to be bulletproof).

   dd if=/dev/baddisk of=/dev/gooddisk bs=8M conv=noerror

worked (about 30 hours for 2 TByte), it produced many error messages.
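
One point worth checking with a copy like this: as far as GNU dd's documented behaviour goes, conv=noerror alone carries on after a read error but writes nothing for the failed block, so data behind it may end up at a shifted offset in the copy. Adding "sync" pads each failed or short block with zeros instead, and a small block size limits how much is zeroed per error. The sketch below is an assumption-laden variant, not the command used above; copy_keep_offsets is a made-up name, and the demo runs on a scratch file, not on a disk.

```shell
#!/bin/bash
# copy_keep_offsets SRC DST: copy with dd, zero-padding unreadable or
# short input blocks so that later data keeps its offset.
copy_keep_offsets() {
    dd if="$1" of="$2" bs=4096 conv=noerror,sync status=none
}

# Demo on a scratch file: 10000 bytes in, padded to 3 full blocks out.
tmp=$(mktemp -d)
head -c 10000 /dev/urandom > "$tmp/src.img"
copy_keep_offsets "$tmp/src.img" "$tmp/dst.img"
wc -c < "$tmp/dst.img"
cmp -n 10000 "$tmp/src.img" "$tmp/dst.img" && echo "payload identical"
rm -rf "$tmp"
```

For real disks SRC and DST would be the device nodes. A purpose-built tool such as GNU ddrescue is another option for this job.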

Unplugged /dev/baddisk, plugged in /dev/gooddisk, mounted the 3-disk
cluster: worked.
Looking into the directories: showed no problem (with the bad disk even
that produced error messages).
Trying to play an *.mpg: nothing. Shit.
Some error messages.

Next adventure:
Removed the good disk, plugged in the bad disk.

Extracted the bad sectors (for baddisk = sdd) with

  grep 'I/O error' /var/log/warn | grep 'dev sdd' | \
    cut -d' ' -f11- | sort -u > /home/tmp/WDC-20111021.txt
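
The cut -f11- above counts space-separated fields, so it shifts when syslog pads a single-digit day with two spaces ("Oct  7"). A sketch that matches on the message text instead; bad_sectors is a made-up name, and the pattern assumes the "end_request: I/O error, dev X, sector N" format seen earlier in this thread.

```shell
#!/bin/bash
# bad_sectors DEV: read kernel log lines on stdin and print the unique
# sector numbers reported for DEV, one per line, in numeric order.
bad_sectors() {
    sed -n "s|.*I/O error, dev $1, sector \([0-9][0-9]*\).*|\1|p" | sort -un
}

bad_sectors sdg <<'EOF'
Oct 18 14:42:48 Arktur kernel: Buffer I/O error on device sdg, logical block 29792787
Oct 18 14:43:04 Arktur kernel: end_request: I/O error, dev sdg, sector 238342224
Oct 18 14:43:20 Arktur kernel: end_request: I/O error, dev sdg, sector 238342224
EOF
```

On the real system the input would be /var/log/warn, e.g. bad_sectors sdd < /var/log/warn.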

"repaired" them with

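#!/bin/bash
# Skeleton: Joerg Sommer, de.comp.os.unix.linux.hardware 18.10.2011

Platte=/dev/sdd   # the bad disk (WD 20EARS)

# bad sector 778550400; its neighbourhood is probed as well
for blk in $(seq 778550000 778551000)
do
    # try to read the sector; error messages stay visible
    hdparm --read-sector "$blk" "$Platte" > /dev/null
    # exit status 5 is taken to mean an I/O error; skip readable sectors
    test $? -eq 5 || continue
    # overwrite the unreadable sector with zeros (its old contents are lost)
    hdparm --write-sector "$blk" --yes-i-know-what-i-am-doing "$Platte"
done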

Seems to work as desired; it seems to be a good idea to "repair" not
only the sectors shown in "/var/log/warn" but probably their neighbourhood
too.
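
A sketch of how such a neighbourhood could be computed for any bad sector rather than typed in by hand. The radius is an arbitrary choice, the function name is made up, and the upper end is not checked against the size of the disk.

```shell
#!/bin/bash
# sector_window CENTER RADIUS: print the first and last sector of a
# window around CENTER, never going below sector 0.
sector_window() {
    local center=$1 radius=$2
    local first=$(( center > radius ? center - radius : 0 ))
    echo "$first $(( center + radius ))"
}

sector_window 778550400 500   # prints "778549900 778550900"
```

The two numbers can be fed to seq to drive a read-then-rewrite loop like the one above.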

Ok - the file that uses this sector may be badly damaged. But people who  
have worked with LPs or MCs know such a behaviour ... no real problem.

Ran btrfsck.
Many error messages.
When
        btrfs filesystem show

shows 3 disks in the cluster: do I have to run btrfsck for each disk, or
does btrfsck run over all disks of this cluster?

But now I can not only list the directories but can "open" the contents
too - much better than nothing!

Possible next adventure:

        for Datei in /Path/to/*.mpg
          do
            cat "$Datei" > /dev/null
          done

produces (if necessary) not only the error messages in "/var/log/warn"  
but the names of the damaged files too.
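
Since reads of the damaged files hung earlier in this thread until "killall" was used, a variant with a time limit per file may be safer. This is only a sketch: list_unreadable is a made-up name, the limit is arbitrary, and a reader stuck deep in the kernel may not stop promptly even when "timeout" (GNU coreutils) signals it.

```shell
#!/bin/bash
# list_unreadable SECONDS FILE...: read each file completely and print
# the names of those that fail or take longer than SECONDS.
list_unreadable() {
    local limit=$1 f
    shift
    for f in "$@"; do
        timeout "$limit" cat -- "$f" > /dev/null 2>&1 || printf '%s\n' "$f"
    done
}

# Example call (the path is a placeholder):
#   list_unreadable 600 /Path/to/*.mpg > damaged-files.txt
```

The limit has to be generous enough for the largest healthy file, otherwise good files get listed too.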

Best regards!
Helmut

