Disk errors

linux-scsi.vger.kernel.org archive mirror

* Disk errors
@ 2005-01-31 14:27 Kit Gerrits
  0 siblings, 0 replies; 20+ messages in thread
From: Kit Gerrits @ 2005-01-31 14:27 UTC (permalink / raw)
  To: linux-scsi

Exactly how many errors is a SCSI disk allowed to have?

I have a PE2400 with a PERC2/Si and 4x 9GB disks.

My disks show:
AFA0> disk show defects 0
Executing: disk show defects (ID=0)
Number of PRIMARY defects on drive: 1912
Number of GROWN defects on drive: 0

AFA0> disk show defects 1
Executing: disk show defects (ID=1)
Number of PRIMARY defects on drive: 952
Number of GROWN defects on drive: 1

AFA0> disk show defects 2
Executing: disk show defects (ID=2)
Number of PRIMARY defects on drive: 2457
Number of GROWN defects on drive: 0

AFA0> disk show defects 3
Executing: disk show defects (ID=3)
Number of PRIMARY defects on drive: 2794
Number of GROWN defects on drive: 0

The reason I ask is that my O/S (RedHat Enterprise Linux 3.0) has recently
hung with the error:

I/O Error Dev 08:05 Sector 529712

I would assume that this error is generated by the harddrive, but shouldn't
the controller catch SCSI errors (and relocate sectors automagically)?

Thanks in advance,

Kit Gerrits

kit@gerritsaa.nl



* RE: Disk errors
@ 2005-01-31 14:46 Cress, Andrew R
  2005-01-31 15:22 ` Kit Gerrits
  0 siblings, 1 reply; 20+ messages in thread
From: Cress, Andrew R @ 2005-01-31 14:46 UTC (permalink / raw)
  To: Kit Gerrits, linux-scsi

Kit,

With the growing size of disk drives, and more sectors allocated as
reserve sectors, the number of defects alone is not a big concern,
especially if they are PRIMARY defects (found at manufacture time).
What would be of concern is an increase in the number of GROWN defects
over a short period of time.  Unfortunately, it is quite common for a
single defect to cause a disk to be replaced when the defect could have
been remapped without the expense and trouble of a field replacement.

The automatic remapping of grown defects is a feature of SCSI disks, but
may not be configured in the disk's mode pages.  The mode pages can be
changed without affecting the content of the disk (with the exception of
size & sector mapping parameters).  There are several Linux tools to
read/set mode pages, among which is 'sgmode' from
http://scsirastools.sf.net.

As a guess, it appears that you had a grown defect occur on one of your
disks, but the remapping was not set to occur automatically on that
disk, so a write never finished.

Andy

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Disk errors
  2005-01-31 14:46 Cress, Andrew R
@ 2005-01-31 15:22 ` Kit Gerrits
  0 siblings, 0 replies; 20+ messages in thread
From: Kit Gerrits @ 2005-01-31 15:22 UTC (permalink / raw)
  To: 'Cress, Andrew R', linux-scsi

Andrew,

Thanks for explaining the initial vs grown error list.
 
Unfortunately, the tool monitors software RAID and SCSI devices as the
OS sees them. This means that sgmode sees only the containers on the
PERC, not the physical disks behind them.


Would you happen to know how to accomplish this in afacli?


AFA0> disk set ?
disk set default - Sets the various disk defaults for all subsequent CLI
commands.
disk set smart - Change a device's SMART configuration.

AFA0> disk show ?
disk show default - Shows the various defaults set for the CLI commands.
disk show defects - Shows the number of defects and/or defect list on a
particular disk drive.
disk show partition - Shows the partitions on the disks attached to this
controller.
disk show smart - Displays SMART values and settings for SMART enabled
devices.
disk show space - Shows space usage on the disks attached to the controller.

AFA0> disk show default
Executing: disk show default
No Default

AFA0>disk show smart
Executing: disk show smart
        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:00:0     Y            6             Y           N             0
0:01:0     Y            6             Y           N             0
0:02:0     Y            6             Y           N             0
0:03:0     Y            6             Y           N             0
0:06:0     N


Thanks for the info

Kit



* RE: Disk errors
       [not found] <60807403EABEB443939A5A7AA8A7458BB51FD1@otce2k01.adaptec.com>
@ 2005-01-31 16:43 ` Kit Gerrits
  0 siblings, 0 replies; 20+ messages in thread
From: Kit Gerrits @ 2005-01-31 16:43 UTC (permalink / raw)
  To: 'Salyzyn, Mark'; +Cc: linux-scsi

Indeed, I had an entire screenful of errors (a few each second) when I came
in in the morning...
The strange thing is that the drive with the grown error is part of the
DATA container (/home and /data), whilst the disk with the rest ( / ) was
fine.

You'd expect the error to show up in /var/log/messages, but it didn't.
I think the entire controller gave up as soon as the error popped up.

-----
Is there a way of having the controller detect / handle grown errors?
Will setting automatic remapping handle this?

Does anyone know how to read / write mode pages?
----

Thanks all!

Kit

> -----Original Message-----
> From: Salyzyn, Mark [mailto:mark_salyzyn@adaptec.com]
> Sent: Monday, 31 January 2005 17:03
> To: Kit Gerrits
> Subject: RE: Disk errors
>
> You get tons of I/O error messages from the filesystem
> driver once the device goes offline. You can check
> /var/log/messages to find the root cause.
>
> You will need to run the RAID management tools (afacli) to
> display the underlying components (container list). Dell has
> their own customized tools for this; I cannot comment on their usage.
>
> Sincerely -- Mark Salyzyn
>



* RE: Disk errors
@ 2005-01-31 17:11 Cress, Andrew R
  0 siblings, 0 replies; 20+ messages in thread
From: Cress, Andrew R @ 2005-01-31 17:11 UTC (permalink / raw)
  To: Kit Gerrits, linux-scsi


I don't know much about afacli.

The mode pages do have bits to enable SMART, but I don't think that's
what you are interested in.
However, SMART can generate info events that the OS may not be
recognizing.

What you are interested in is mode page 0x01, to see whether AWRE and
ARRE are turned on (bits 7 & 6, 0xC0).
The default setting for these may also be documented in the disk manual
for your drives, which can be obtained from the vendor web site.  Or,
the PERC vendor may be able to help get this info.
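
As a rough illustration of the bit test described above, here is a minimal
Python sketch. It assumes the raw bytes of mode page 0x01 have already been
fetched by some other means, and that the AWRE/ARRE flags sit in byte 2 of
the page, which is how the SCSI block command standards lay it out. The
function name and the sample buffer are made up for illustration.

```python
AWRE_MASK = 0x80  # bit 7: automatic write reallocation enabled
ARRE_MASK = 0x40  # bit 6: automatic read reallocation enabled


def decode_reallocation_bits(page: bytes) -> dict:
    """Return the AWRE/ARRE flags from a raw mode page 0x01 buffer.

    Assumption: page[0] is the page code byte (low 6 bits) and page[2]
    holds the flags byte.
    """
    if len(page) < 3:
        raise ValueError("mode page too short")
    if page[0] & 0x3F != 0x01:
        raise ValueError("not mode page 0x01")
    flags = page[2]
    return {"AWRE": bool(flags & AWRE_MASK), "ARRE": bool(flags & ARRE_MASK)}


# Hypothetical sample buffer: page code 0x01, length 0x0A, flags byte 0xC0.
sample = bytes([0x01, 0x0A, 0xC0]) + bytes(9)
print(decode_reallocation_bits(sample))
```

A flags byte whose top two bits are both set (mask 0xC0) means both kinds
of automatic reallocation are enabled on that drive.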

Andy


* RE: Disk errors
@ 2005-01-31 18:21 Salyzyn, Mark
  2005-01-31 23:41 ` Kit Gerrits
  0 siblings, 1 reply; 20+ messages in thread
From: Salyzyn, Mark @ 2005-01-31 18:21 UTC (permalink / raw)
  To: Kit Gerrits; +Cc: linux-scsi

The PERC controller looks after bad block reassignment.

Sincerely -- Mark Salyzyn


* RE: Disk errors
  2005-01-31 18:21 Disk errors Salyzyn, Mark
@ 2005-01-31 23:41 ` Kit Gerrits
  2005-01-31 23:55   ` Matt Domsch
  2005-02-01  2:05   ` Guy
  0 siblings, 2 replies; 20+ messages in thread
From: Kit Gerrits @ 2005-01-31 23:41 UTC (permalink / raw)
  To: 'Salyzyn, Mark'; +Cc: linux-scsi

But if the PERC (controller) handles disk errors, what could cause:

I/O Error Dev 08:05 Sector 529712

I would assume that this error is generated by the harddrive, but shouldn't
the controller catch SCSI errors (and relocate sectors automagically)?

Kit

SCSI-relevant dmesg output:
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
        <Adaptec aic7880 Ultra SCSI adapter>
        aic7880: Ultra Single Channel A, SCSI Id=7, 16/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
        <Adaptec 2940 Ultra2 SCSI adapter>
        aic7890/91: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs

blk: queue d7ab8814, I/O limit 4095Mb (mask 0xffffffff)
(scsi0:A:5): 20.000MB/s transfers (20.000MHz, offset 15)
  Vendor: NEC       Model: CD-ROM DRIVE:466  Rev: 1.06
  Type:   CD-ROM                             ANSI SCSI revision: 02
blk: queue c1fc1e14, I/O limit 4095Mb (mask 0xffffffff)
(scsi1:A:6): 20.000MB/s transfers (10.000MHz, offset 15, 16bit)
  Vendor: QUANTUM   Model: DLT7000           Rev: 2561
  Type:   Sequential-Access                  ANSI SCSI revision: 02
blk: queue c1fc1a14, I/O limit 4095Mb (mask 0xffffffff)
Red Hat/Adaptec aacraid driver (1.1.2 Jun 29 2004 18:26:27)
PCI: Found IRQ 14 for device 00:02.1
AAC0: kernel 2.1.4 build 2939
AAC0: monitor 2.1.4 build 2939
AAC0: bios 2.1.0 build 2939
AAC0: serial 410010d0fafaf001
spurious 8259A interrupt: IRQ7.
scsi2 : percraid
  Vendor: DELL      Model: PERCRAID Volume   Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
blk: queue c1fc1c14, I/O limit 4095Mb (mask 0xffffffff)
  Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
blk: queue d7ab9e14, I/O limit 4095Mb (mask 0xffffffff)
Attached scsi removable disk sda at scsi2, channel 0, id 0, lun 0
Attached scsi removable disk sdb at scsi2, channel 0, id 1, lun 0
SCSI device sda: 17771136 512-byte hdwr sectors (9099 MB)
sda: Write Protect is off
Partition check:
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
SCSI device sdb: 35542272 512-byte hdwr sectors (18198 MB)
sdb: Write Protect is off
 sdb: sdb1 sdb2 
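
As a cross-check of the sizes in the dmesg output above, the sector counts
can be converted by hand. This is only a sketch: it assumes the "MB" figure
in dmesg is decimal (10^6 bytes) and that the "GB" figures the controller
tools report are binary (2^30 bytes). The helper name is hypothetical.

```python
SECTOR_BYTES = 512  # "512-byte hdwr sectors" per the dmesg lines


def sizes(sectors: int) -> tuple:
    """Return (decimal megabytes, binary gigabytes) for a sector count."""
    total = sectors * SECTOR_BYTES
    return round(total / 10**6), round(total / 2**30, 2)


for name, sectors in (("sda", 17771136), ("sdb", 35542272)):
    mb, gib = sizes(sectors)
    print(f"{name}: {mb} MB (decimal), {gib} GiB (binary)")
```

Under those assumptions sda works out to about 8.47 binary gigabytes and
sdb to roughly twice that, which is in line with a single-disk volume and
a RAID-5 container with two disks' worth of usable space.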


* Re: Disk errors
  2005-01-31 23:41 ` Kit Gerrits
@ 2005-01-31 23:55   ` Matt Domsch
  2005-02-01  2:05   ` Guy
  1 sibling, 0 replies; 20+ messages in thread
From: Matt Domsch @ 2005-01-31 23:55 UTC (permalink / raw)
  To: Kit Gerrits; +Cc: 'Salyzyn, Mark', linux-scsi

On Tue, Feb 01, 2005 at 12:41:13AM +0100, Kit Gerrits wrote:
> But if the PERC (controller) handles disk errors, what could cause:
> 
> I/O Error Dev 08:05 Sector 529712
> 
> I would assume that this error is generated by the harddrive, but shouldn't
> the controller catch SCSI errors (and relocate sectors automagically)?

In this case, the RAID controller is reporting the I/O error.  It may
be that you've got bad sectors on more than one physical disk, in the
same stripe, and the RAID controller can't fix them.

Thanks,
Matt

-- 
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


* RE: Disk errors
  2005-01-31 23:41 ` Kit Gerrits
  2005-01-31 23:55   ` Matt Domsch
@ 2005-02-01  2:05   ` Guy
  1 sibling, 0 replies; 20+ messages in thread
From: Guy @ 2005-02-01  2:05 UTC (permalink / raw)
  To: 'Kit Gerrits', 'Salyzyn, Mark'; +Cc: linux-scsi

Maybe you have a failed disk, and another has bad blocks.  So, no good copy
of the data exists.  Just a guess!!!


* Re: Disk Errors
@ 2005-02-01  8:53 Kit Gerrits
  2005-02-01 12:43 ` Douglas Gilbert
  0 siblings, 1 reply; 20+ messages in thread
From: Kit Gerrits @ 2005-02-01  8:53 UTC (permalink / raw)
  To: linux-scsi

I have found 08:05 to correspond to /dev/sda5, mounted as /usr (thanks for
the pointer!).
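
For reference, the mapping from a "Dev 08:05" style identifier to a device
name can be sketched in a few lines of Python. This assumes the kernel
prints major:minor in hexadecimal, and relies on the standard Linux
numbering where major 8 covers the first 16 SCSI disks with 16 minors per
disk. The function name is made up for illustration.

```python
def decode_dev(dev: str) -> str:
    """Map a 'MM:mm' (hex major:minor) identifier to a /dev/sdXN name."""
    major_s, minor_s = dev.split(":")
    major, minor = int(major_s, 16), int(minor_s, 16)
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    # 16 minors per disk: minor 0 is the whole disk, 1..15 are partitions.
    disk_index, partition = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk_index)
    return f"/dev/{name}{partition}" if partition else f"/dev/{name}"


print(decode_dev("08:05"))  # /dev/sda5
```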

sda is the single-drive volume (non-RAID, as it is only for the O/S,
which needs to be speedy and can be pulled from tape easily).

This explains several things:
A/ Why a single error can take an entire volume offline
B/ Why the error is not logged
	If it only took the partition offline,
	it would still have been logged,
	as / is mounted from sda3

And leaves one question:
What caused the error?

There are no GROWN defects on the drive in this volume


---------------
Reference logs:
---------------

Executing: disk show defects (ID=0)
Number of PRIMARY defects on drive: 1912
Number of GROWN defects on drive: 0

Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    Volume 8.47GB            Open    0:00:0 64.0KB:8.47GB
 /dev/sda             NT
 1    RAID-5 16.9GB       32KB Open    0:01:0 64.0KB:8.47GB
 /dev/sdb             DATA             0:02:0 64.0KB:8.47GB
                                       ?:??:?  - Missing -

Mount points it to:
# /dev/sda5             5.3G  1.5G  3.6G  30% /usr


> -----Original Message-----
> From: Salyzyn, Mark [mailto:mark_salyzyn@adaptec.com]
> Sent: Tuesday, 1 February 2005 4:15
> To: Kit Gerrits
> Subject: RE: Disk errors
>
> The controller does not appear to be busted; you have a Volume and a
> RAID-5. Are you missing an Array?
>
> A two drive failure on a RAID-5 gives you an offline array.
>
> A single drive failure in a Volume gives you an offline array.
>
> You need to find who is 08:05, look through /dev for the major/minor
> number and relate it to the 'device'. Look through /proc/scsi/scsi and
> /var/log/messages to help correlate it.
>
> Sincerely -- Mark Salyzyn
>



* Re: Disk Errors
  2005-02-01  8:53 Kit Gerrits
@ 2005-02-01 12:43 ` Douglas Gilbert
  2005-02-01 18:01   ` Bryan Henderson
  0 siblings, 1 reply; 20+ messages in thread
From: Douglas Gilbert @ 2005-02-01 12:43 UTC (permalink / raw)
  To: Kit Gerrits; +Cc: linux-scsi

Kit Gerrits wrote:
> I have found 08:05 to correspond to /dev/sda5, mounted as /usr(Thanks for
> the pointer!).
> 
> Sda is the single-drive volume
> (non-RAID, as it is only for the O/S,
> which needs to be speedy and can be pulled from tape easily).
> 
> This explains several things:
> A/ Why a single error can take an entire volume offline
> B/ Why the error is not logged
> 	If it only took the partition offline, 
> 	it would still have been logged, 
> 	as / is mounted from sda3
> 
> And leaves one question:
> What caused the error?
> 
> There are no GROWN defects on the drive in this volume

Kit,
A block/sector is added to the grown defect list after it
has been reassigned. Reassignment occurs automatically for
recoverable (medium) errors if the AWRE and/or ARRE bits are
set (those bits are in the Read-Write Error Recovery mode page).

So there are two situations in which damaged blocks remain
accessible:
    1) unrecoverable medium errors
    2) recoverable medium errors when AWRE and/or ARRE
       are clear

Case 2) can be ignored ** or could be handled by setting
ARRE and then reading the whole disk (e.g. with dd). Both cases
can be handled with the REASSIGN BLOCKS SCSI command
once the defective logical block address (lba) or
addresses have been identified.
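
The two cases above can be written down as a small decision function. This
is a sketch only, not how any drive firmware is implemented. It assumes the
usual reading of the bit names: AWRE governs reallocation on writes and
ARRE on reads. The function name and parameters are hypothetical.

```python
def needs_manual_reassign(recoverable: bool, operation: str,
                          awre: bool, arre: bool) -> bool:
    """Return True if a damaged block stays in place until REASSIGN BLOCKS.

    Case 1: unrecoverable medium errors are never reassigned automatically.
    Case 2: recoverable errors are left alone when the relevant bit is clear.
    """
    if operation not in ("read", "write"):
        raise ValueError("operation must be 'read' or 'write'")
    if not recoverable:
        return True  # case 1
    auto_enabled = awre if operation == "write" else arre
    return not auto_enabled  # case 2 when the bit is clear


# A recoverable read error with ARRE clear is left in place (case 2).
print(needs_manual_reassign(True, "read", awre=True, arre=False))
```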

Using the sg3_utils package various things can be
done:
    - "sginfo -e /dev/sda" will show the AWRE and ARRE
      settings. Changing them with sginfo is a bit ugly
    - "sginfo -G /dev/sda" will show the grown defect list
      in "index" format (up to 3 other formats may be
      available)
    - "sg_dd if=/dev/sg0 of=/dev/null bs=512" will read the
      whole disk or fail at the first unrecoverable (medium)
      error. If a medium error is detected the "info"
      field is the lba of the defect. ***
    - "sg_reassign -a <lba> /dev/sda" will reassign the
      <lba> block. If this succeeds <lba> should appear
      in the grown defect list ("sginfo -G -Flogical /dev/sda").
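
The sequence above can be collected into a short Python helper that only
builds the command lines and does not run anything. The commands and
options are copied from the list above; the helper name, its parameters,
and the example LBA are made up for illustration. Note that sg_reassign
changes the drive's defect list, so review the output before running any
of it by hand.

```python
import shlex
from typing import List, Optional


def recovery_commands(sd_dev: str, sg_dev: str,
                      lba: Optional[int] = None) -> List[List[str]]:
    """Build (but do not execute) the sg3_utils commands listed above."""
    cmds = [
        ["sginfo", "-e", sd_dev],                            # AWRE/ARRE settings
        ["sginfo", "-G", sd_dev],                            # grown defect list
        ["sg_dd", f"if={sg_dev}", "of=/dev/null", "bs=512"],  # read whole disk
    ]
    if lba is not None:
        cmds.append(["sg_reassign", "-a", str(lba), sd_dev])  # reassign block
        cmds.append(["sginfo", "-G", "-Flogical", sd_dev])    # confirm it
    return cmds


# 12345 is a placeholder LBA, not a value from this thread.
for cmd in recovery_commands("/dev/sda", "/dev/sg0", lba=12345):
    print(shlex.join(cmd))
```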

When a logical block with unrecoverable errors is reassigned
then the new contents are vendor specific. I'm not sure how
file systems react to this.


** recoverable errors can be ignored. Assuming these
    recoverable errors occur on read operations then the
    "read error counter" log page's
    recovered error counter (one of them depending on the
    duration of the recovery process) will be incremented

*** due to error processing, it is still better to use /dev/sg0
     rather than /dev/sda with the sg_dd utility. Recent
     changes (lk 2.6.11-rc2-bk8) make the following work:
     "sg_dd if=/dev/sda blk_sgio=1 of=/dev/null bs=512"
     in the presence of errors

Doug Gilbert

> ---------------
> Reference logs:
> ---------------
> 
> Executing: disk show defects (ID=0)
> Number of PRIMARY defects on drive: 1912
> Number of GROWN defects on drive: 0
> 
> Executing: container list
> Num          Total  Oth Chunk          Scsi   Partition    
> Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
> ----- ------ ------ --- ------ ------- ------ -------------
>  0    Volume 8.47GB            Open    0:00:0 64.0KB:8.47GB
>  /dev/sda             NT
>  1    RAID-5 16.9GB       32KB Open    0:01:0 64.0KB:8.47GB
>  /dev/sdb             DATA             0:02:0 64.0KB:8.47GB
>                                        ?:??:?  - Missing -
> 
> Mount points it to:
> # /dev/sda5             5.3G  1.5G  3.6G  30% /usr
>  
> 
> 
>>-----Original Message-----
>>From: Salyzyn, Mark [mailto:mark_salyzyn@adaptec.com]
>>Sent: Tuesday 1 February 2005 4:15
>>To: Kit Gerrits
>>Subject: RE: Disk errors
>>
>>The controller does not appear to be busted; you have a Volume and a 
>>RAID-5. Are you missing an Array?
>>
>>A two drive failure on a RAID-5 gives you an offline array.
>>
>>A single drive failure in a Volume gives you an offline array.
>>
>>You need to find who is 08:05, look through /dev for the major/minor 
>>number and relate it to the 'device'. Look through /proc/scsi/scsi and 
>>/var/log/messages to help correlate it.
>>
>>Sincerely -- Mark Salyzyn
>>
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
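
The lookup of 08:05 suggested in the quoted advice can also be done by
arithmetic instead of searching /dev. The sketch below assumes the
kernel prints major:minor in hexadecimal and relies on the standard
numbering for SCSI disks (major 8, 16 minors per disk). The function
name is made up for illustration.

```python
def decode_sd_dev(dev: str) -> str:
    """Map a 'major:minor' pair such as '08:05' to a /dev/sdXN name.

    Only major 8 (the first 16 SCSI disks) is handled in this sketch.
    """
    major_s, minor_s = dev.split(":")
    major, minor = int(major_s, 16), int(minor_s, 16)
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled here")
    disk, part = divmod(minor, 16)      # 16 minors per disk
    name = "sd" + chr(ord("a") + disk)  # disk 0 -> sda, 1 -> sdb, ...
    return f"/dev/{name}{part}" if part else f"/dev/{name}"


if __name__ == "__main__":
    print(decode_sd_dev("08:05"))
```

For 08:05 this gives /dev/sda5, which matches what Kit found.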


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Disk Errors
@ 2005-02-01 12:50 Salyzyn, Mark
  0 siblings, 0 replies; 20+ messages in thread
From: Salyzyn, Mark @ 2005-02-01 12:50 UTC (permalink / raw)
  To: dougg, Kit Gerrits; +Cc: linux-scsi

Good information for a single drive on a simple SCSI card. This will not
work for drives that are part of an array (volume) as /dev/sda
references a pseudo device. Besides, the firmware in the RAID controller
takes the actions necessary to perform recoverable bad block remaps.

Sincerely -- Mark Salyzyn


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Disk Errors
@ 2005-02-01 15:56 Cress, Andrew R
  0 siblings, 0 replies; 20+ messages in thread
From: Cress, Andrew R @ 2005-02-01 15:56 UTC (permalink / raw)
  To: Salyzyn, Mark, dougg, Kit Gerrits; +Cc: linux-scsi

Kit,

If you have another (non-RAID) SCSI system, you could take the faulty
drive there to modify the mode pages to turn on AWRE and ARRE with
either sgmode (scsirastools.sf.net) or sginfo (sg3_utils).
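
For reference, what those tools change is two bits in one byte of the
Read-Write Error Recovery mode page (page code 01h). The sketch below
only manipulates that byte in memory; it does not issue MODE SENSE or
MODE SELECT. The bit positions (byte 2, bit 7 = AWRE, bit 6 = ARRE) are
taken from the SCSI block command layout of that page, and the function
names are made up for illustration.

```python
AWRE = 0x80  # bit 7 of byte 2: automatic write reallocation enabled
ARRE = 0x40  # bit 6 of byte 2: automatic read reallocation enabled


def describe(byte2: int) -> dict:
    """Report the AWRE and ARRE settings held in byte 2 of the page."""
    return {"AWRE": bool(byte2 & AWRE), "ARRE": bool(byte2 & ARRE)}


def enable_reallocation(byte2: int) -> int:
    """Return byte 2 with AWRE and ARRE set, other bits unchanged."""
    return (byte2 | AWRE | ARRE) & 0xFF


if __name__ == "__main__":
    before = 0x04  # arbitrary example value with both bits clear
    after = enable_reallocation(before)
    print(describe(before), describe(after), hex(after))
```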

Otherwise, you are dependent on the tools that are provided for the
PowerEdge RAID controller.

Andy


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Disk Errors
  2005-02-01 12:43 ` Douglas Gilbert
@ 2005-02-01 18:01   ` Bryan Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Bryan Henderson @ 2005-02-01 18:01 UTC (permalink / raw)
  To: dougg; +Cc: Kit Gerrits, linux-scsi

>So there are two situations in which damaged blocks remain
>accessible:
>    1) unrecoverable medium errors
> ...

What's the rationale behind leaving a damaged block accessible in the case 
of an unrecoverable medium error?  A possibility that someone might 
actually be able to recover the data?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Disk Errors
@ 2005-02-01 18:24 Salyzyn, Mark
  2005-02-02  3:55 ` Douglas Gilbert
  0 siblings, 1 reply; 20+ messages in thread
From: Salyzyn, Mark @ 2005-02-01 18:24 UTC (permalink / raw)
  To: Bryan Henderson, dougg; +Cc: Kit Gerrits, linux-scsi

An unrecoverable medium error is typically `corrected' when a write to
the block occurs. RAID cards will use the redundancy to calculate the
data and write it back to the offending drive for instance.
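
As an illustration of "use the redundancy to calculate the data", here
is a minimal sketch of RAID-5 style XOR parity. It is a toy model with
made-up function names and tiny blocks, not the controller firmware's
algorithm.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recompute the one unreadable block of a stripe.

    surviving_blocks holds every other block of the stripe, parity
    included; XORing them yields the missing block.
    """
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    d0, d1 = b"\x01\x02", b"\x0f\xf0"
    parity = xor_blocks([d0, d1])
    # Pretend d0 hit a medium error: rebuild it from d1 and parity.
    print(rebuild_missing([d1, parity]) == d0)
```

Writing the rebuilt block back to the offending drive is the write that
gives the drive the chance to `correct' the bad sector.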

Otherwise, for non-redundant stores, bad media is as good as anything
to remind one that the data is gone ;->

Sincerely -- Mark Salyzyn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Disk Errors
  2005-02-01 18:24 Salyzyn, Mark
@ 2005-02-02  3:55 ` Douglas Gilbert
  2005-02-03 18:50   ` Bryan Henderson
  0 siblings, 1 reply; 20+ messages in thread
From: Douglas Gilbert @ 2005-02-02  3:55 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: Bryan Henderson, Kit Gerrits, linux-scsi

Salyzyn, Mark wrote:
> An unrecoverable medium error is typically `corrected' when a write to
> the block occurs. RAID cards will use the redundancy to calculate the
> data and write it back to the offending drive for instance.
> 
> Otherwise, for non-redundant stores, bad media is as good as anything
> to remind one that the data is gone ;->
> 
> Sincerely -- Mark Salyzyn

All may not be lost. If a medium error occurs and the ASC and
ASCQ imply the sector could be read but
failed ECC then the READ LONG SCSI command should fetch the
block (plus ECC and other data). For example a Fujitsu MAM3184
returns 576 bytes. It is probably too much to expect that all
the damage will be in the last 64 bytes.
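
To make the arithmetic concrete: 576 bytes is a 512 byte block plus 64
extra bytes. The sketch below splits such a buffer. The layout of a
"long" block is vendor specific, so treating the first 512 bytes as
user data is an assumption, and the function name is made up for
illustration.

```python
def split_long_block(raw: bytes, data_len: int = 512):
    """Split a READ LONG style buffer into (user data, trailing bytes).

    Assumes the user data comes first; the trailing part would hold ECC
    and any other vendor-specific bytes.
    """
    if len(raw) < data_len:
        raise ValueError("buffer is shorter than one data block")
    return raw[:data_len], raw[data_len:]


if __name__ == "__main__":
    data, extra = split_long_block(bytes(576))
    print(len(data), len(extra))
```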

As Mark pointed out, if /dev/sda is a virtual disk then it is
unlikely that the READ LONG SCSI command will be supported.

sg3_utils has a sg_read_long utility. "Long" blocks can
be written to the media with the sg_write_long utility
which was introduced mainly for testing (e.g. creating
"artificial" medium errors).

BTW I noticed that the block layer reads "around" a medium
error. Say 8 KB is being read and a medium error occurs
(and the info field is set to the lba of the first failure)
then several small reads are done to reconstruct as much
of the original 8 KB as possible (probably with a block of
zeroes corresponding to the medium error).
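
The behaviour described above can be modelled in user space as follows.
This is a simulation with a fake device and made-up names, not the
kernel's code. It marks the unreadable block as None rather than
filling it with zeroes, since the zero fill is only a guess here.

```python
class MediumError(Exception):
    """Raised by the fake device; lba plays the role of the 'info' field."""

    def __init__(self, lba):
        super().__init__(f"medium error at lba {lba}")
        self.lba = lba


def read_blocks(start, count, bad_lbas):
    """Fake device read: fails as a whole at the first bad lba in range."""
    for lba in range(start, start + count):
        if lba in bad_lbas:
            raise MediumError(lba)
    return [f"data@{lba}" for lba in range(start, start + count)]


def read_around(start, count, bad_lbas):
    """Try one large read; on a medium error fall back to single blocks."""
    lbas = range(start, start + count)
    try:
        return dict(zip(lbas, read_blocks(start, count, bad_lbas)))
    except MediumError:
        result = {}
        for lba in lbas:
            try:
                result[lba] = read_blocks(lba, 1, bad_lbas)[0]
            except MediumError:
                result[lba] = None  # unreadable block
        return result


if __name__ == "__main__":
    # 8 KB = 16 blocks of 512 bytes; block 105 is bad in this example.
    blocks = read_around(100, 16, {105})
    print([lba for lba, data in blocks.items() if data is None])
```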

Doug Gilbert

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Disk Errors
@ 2005-02-02 14:12 Salyzyn, Mark
  2005-02-03  8:18 ` Andi Kleen
  2005-02-15  5:56 ` Douglas Gilbert
  0 siblings, 2 replies; 20+ messages in thread
From: Salyzyn, Mark @ 2005-02-02 14:12 UTC (permalink / raw)
  To: dougg; +Cc: Bryan Henderson, Kit Gerrits, linux-scsi

From: Douglas Gilbert [mailto:dougg@torque.net] writes:
> All may not be lost. If a medium error occurs and the ASC and
> ASCQ imply the sector could be read but
> failed ECC then the READ LONG SCSI command should fetch the
> block (plus ECC and other data). For example a Fujitsu MAM3184
> returns 576 bytes. It is probably too much to expect that all
> the damage will be in the last 64 bytes.

However, the drive has already taken whatever action it could to
reconstruct the data; its failure to return the block for a standard read
means that the data is in fact `lost'. The data+ECC combination must be in
a state where there are more bits of damage than the error correction can
deal with; 64 bytes of ECC deals with single bit errors, thus we know that
we have more than 1 bit of damage to the disk. We could have 4096 bits of
damage in the worst case :-) and never know that fact.

If I wanted in desperation to recover whatever data I could, this would
be grand, but as it stands, from the Linux File System Driver
perspective, it would be dangerous to accept this block as anything more
than it is.

If the data is of the form to permit some loss, for example video, audio
content or an error correcting stream of data, someone can make a case
where READ_LONG is an appropriate action to take to help fill in missing
content. 

A fun thought ...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Disk Errors
  2005-02-02 14:12 Disk Errors Salyzyn, Mark
@ 2005-02-03  8:18 ` Andi Kleen
  2005-02-15  5:56 ` Douglas Gilbert
  1 sibling, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2005-02-03  8:18 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: Bryan Henderson, Kit Gerrits, linux-scsi

"Salyzyn, Mark" <mark_salyzyn@adaptec.com> writes:
>
> If the data is of the form to permit some loss, for example video, audio
> content or an error correcting stream of data, someone can make a case
> where READ_LONG is an appropriate action to take to help fill in missing
> content. 
>
> A fun thought ...


It's an interesting idea. How about adding a sysfs attribute for
the device that says "tolerate some errors". Default to off of course.
I guess a lot of people would value such an option while recovering
their disks.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Disk Errors
  2005-02-02  3:55 ` Douglas Gilbert
@ 2005-02-03 18:50   ` Bryan Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Bryan Henderson @ 2005-02-03 18:50 UTC (permalink / raw)
  To: dougg; +Cc: Kit Gerrits, linux-scsi, Salyzyn, Mark

>BTW I noticed that the block layer reads "around" a medium
>error. Say 8 KB is being read and a medium error occurs
>(and the info field is set to the lba of the first failure)
>then several small reads are done to reconstruct as much
>of the original 8 KB as possible (probably with a block of
>zeroes corresponding to the medium error).

The only way that makes sense is if the 8K I/O was in service of multiple 
block requests and the block layer is separating them and retrying in 
order to fail the smallest possible set of them.  The block layer's upper 
interface doesn't provide a means to indicate that the middle of a request 
failed, and it certainly isn't going to substitute zeroes for the 
requested data and call it successful.
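
A sketch of that interpretation: one 8K I/O built from several block
requests, where only the requests covering the bad sector end up
failing. The (start, count) representation and the function name are
made up for illustration and do not mirror block layer structures.

```python
def failed_requests(requests, bad_lba):
    """Return the requests, as (start, count) pairs, that cover bad_lba.

    All other requests can complete once the merged I/O is split up and
    retried piece by piece.
    """
    return [(start, count) for (start, count) in requests
            if start <= bad_lba < start + count]


if __name__ == "__main__":
    # Two 4K requests (8 sectors each) merged into one 8K I/O.
    merged = [(0, 8), (8, 8)]
    print(failed_requests(merged, bad_lba=10))
```

Here only the second request fails; the first one completes with its
real data and nothing is replaced by zeroes.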

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Disk Errors
  2005-02-02 14:12 Disk Errors Salyzyn, Mark
  2005-02-03  8:18 ` Andi Kleen
@ 2005-02-15  5:56 ` Douglas Gilbert
  1 sibling, 0 replies; 20+ messages in thread
From: Douglas Gilbert @ 2005-02-15  5:56 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: Bryan Henderson, Kit Gerrits, linux-scsi

Salyzyn, Mark wrote:
> From: Douglas Gilbert [mailto:dougg@torque.net] writes:
> 
>>All may not be lost. If a medium error occurs and the ASC and
>>ASCQ imply the sector could be read but
>>failed ECC then the READ LONG SCSI command should fetch the
>>block (plus ECC and other data). For example a Fujitsu MAM3184
>>returns 576 bytes. It is probably too much to expect that all
>>the damage will be in the last 64 bytes.
> 
> 
> However, the drive has already taken whatever action it could to
> reconstruct the data; its failure to return the block for a standard read
> means that the data is in fact `lost'. The data+ECC combination must be in
> a state where there are more bits of damage than the error correction can
> deal with; 64 bytes of ECC deals with single bit errors, thus we know that
> we have more than 1 bit of damage to the disk. We could have 4096 bits of
> damage in the worst case :-) and never know that fact.
> 
> If I wanted in desperation to recover whatever data I could, this would
> be grand, but as it stands, from the Linux File System Driver
> perspective, it would be dangerous to accept this block as anything more
> than it is.
> 
> If the data is of the form to permit some loss, for example video, audio
> content or an error correcting stream of data, someone can make a case
> where READ_LONG is an appropriate action to take to help fill in missing
> content. 
> 
> A fun thought ...

Mark,
I will try extending sg_dd in sg3_utils to do this
when its "continue on error" flag is set. It could
return additional counts of dubious blocks as well as
completely lost ones.
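
Roughly what is meant by the extra counts, as a sketch only: this is
not sg_dd code, and the category names and function names are made up
for illustration. A block is "good" if the normal read worked,
"dubious" if only a READ LONG style fetch returned something, and
"lost" if neither did.

```python
from collections import Counter


def classify(normal_read_ok: bool, long_read_ok: bool) -> str:
    """Classify one block by which kind of read managed to return it."""
    if normal_read_ok:
        return "good"
    return "dubious" if long_read_ok else "lost"


def tally(outcomes):
    """Count blocks per category for an end-of-copy summary."""
    counts = Counter(outcomes)
    return {key: counts[key] for key in ("good", "dubious", "lost")}


if __name__ == "__main__":
    outcomes = [classify(True, False),
                classify(False, True),
                classify(False, False)]
    print(tally(outcomes))
```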

If that is useful then perhaps sd could be extended.

Doug Gilbert


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2005-02-15  5:56 UTC | newest]

Thread overview: 20+ messages
2005-02-02 14:12 Disk Errors Salyzyn, Mark
2005-02-03  8:18 ` Andi Kleen
2005-02-15  5:56 ` Douglas Gilbert
  -- strict thread matches above, loose matches on Subject: below --
2005-02-01 18:24 Salyzyn, Mark
2005-02-02  3:55 ` Douglas Gilbert
2005-02-03 18:50   ` Bryan Henderson
2005-02-01 15:56 Cress, Andrew R
2005-02-01 12:50 Salyzyn, Mark
2005-02-01  8:53 Kit Gerrits
2005-02-01 12:43 ` Douglas Gilbert
2005-02-01 18:01   ` Bryan Henderson
2005-01-31 18:21 Disk errors Salyzyn, Mark
2005-01-31 23:41 ` Kit Gerrits
2005-01-31 23:55   ` Matt Domsch
2005-02-01  2:05   ` Guy
2005-01-31 17:11 Cress, Andrew R
     [not found] <60807403EABEB443939A5A7AA8A7458BB51FD1@otce2k01.adaptec.com>
2005-01-31 16:43 ` Kit Gerrits
2005-01-31 14:46 Cress, Andrew R
2005-01-31 15:22 ` Kit Gerrits
2005-01-31 14:27 Kit Gerrits
