linux-ide.vger.kernel.org archive mirror
* [RFC] ATA/ATAPI exceptions doc
@ 2005-08-27  5:35 Tejun Heo
  2005-09-07  8:02 ` Jeff Garzik
  0 siblings, 1 reply; 3+ messages in thread
From: Tejun Heo @ 2005-08-27  5:35 UTC (permalink / raw)
  To: Jeff Garzik, albertcc, bzolnier; +Cc: linux-ide

 Hello, ATA people.

 This is the first section of the libata EH doc.  This section tries to
describe ATA/ATAPI errors and exceptions in a driver-neutral way and is
intended to be used as a reference when implementing the new libata EH.

 The second section will be about the current libata EH implementation
and the last will be about how to implement the new libata EH.

 Thanks.

libata EH
======================================

 This document first discusses what ATA/ATAPI error conditions exist
and how they should be handled.  Then, we move on to how libata
currently handles them and how it can be improved.  Here, 'current'
means the ALL head of the libata-dev-2.6 git tree as of 2005-08-26,
commit ab9b494f6aeab24eda2e6462e2fe73789c288e73.  References are made
to the SCSI EH document.  Please read the SCSI EH document first.

 A lot of EH ideas are from Jeff Garzik and others in the following
and other discussion threads on linux-ide.

 http://marc.theaimsgroup.com/?l=linux-ide&m=112451335416913&w=2


[1] ATA/ATAPI errors and exceptions

 This section tries to identify what error/exception conditions exist
for ATA/ATAPI devices and describe how they should be handled in an
implementation-neutral way.

 The term 'error' is used to describe conditions where either an
explicit error condition is reported from device or a command has
timed out.

 The term 'exception' is either used to describe exceptional
conditions which are not errors (say, power or hotplug events), or to
describe both errors and non-error exceptional conditions.  Where
explicit distinction between error and exception is necessary, the
term 'non-error exception' is used.

 The following categories of exceptions exist for ATA/ATAPI devices.

 - HSM violation error
 - ATA command error (non-NCQ)
 - ATA command timeout (non-NCQ)
 - ATAPI command error
 - ATAPI command timeout
 - NCQ command error
 - NCQ command timeout
 - other errors
 - non-error exceptions


[1-1] Exception categories

 All error indications are described according to legacy taskfile +
bus master IDE interface.  If a controller provides other (better)
mechanism for error reporting, mapping those into categories described
here shouldn't be difficult.

 In the following sections, two recovery actions - reset and
reconfiguring transport - are mentioned.  These are described in
further detail in [1-2].


[1-1-1] HSM (host state machine) violation error

 This error is indicated when the STATUS value doesn't match the HSM
requirement while issuing or executing any ATA/ATAPI command.

ex) ATA_STATUS doesn't contain !BSY && DRDY && !DRQ while trying to
    issue a command.

ex) !BSY && !DRQ during PIO data transfer.

ex) DRQ on command completion.

 In this case, HSM is violated and not much information regarding the
error can be acquired from STATUS or ERROR register.  IOW, this error
can be anything - software error, faulty device, controller or cable.

 As HSM is violated, reset is necessary to bring it back to known
state.  Reconfiguring transport for lower speed might be a good idea
too as transmission errors do cause this behavior.
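
 As an illustration only, the three examples above can be written as a
STATUS check.  This is a minimal sketch and not libata code: the
hsm_phase enum and hsm_violated() are hypothetical names made up for
this example.  Only the STATUS bit values (BSY 0x80, DRDY 0x40, DRQ
0x08) come from the ATA/ATAPI standard.  The sketch deliberately
ignores the ERR bit, which a real state machine would look at first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ATA STATUS register bits, as defined by the ATA/ATAPI standard. */
#define ATA_BUSY 0x80u
#define ATA_DRDY 0x40u
#define ATA_DRQ  0x08u

/* Hypothetical HSM phases, used only for this illustration. */
enum hsm_phase {
    HSM_ISSUE,    /* about to issue a command */
    HSM_PIO_DATA, /* in the middle of a PIO data transfer */
    HSM_COMPLETE  /* command completion */
};

/* Returns true if STATUS violates what the HSM expects in this phase. */
bool hsm_violated(enum hsm_phase phase, uint8_t status)
{
    switch (phase) {
    case HSM_ISSUE:
        /* Issuing requires !BSY && DRDY && !DRQ. */
        return (status & (ATA_BUSY | ATA_DRDY | ATA_DRQ)) != ATA_DRDY;
    case HSM_PIO_DATA:
        /* !BSY && !DRQ while more data is expected is a violation. */
        return (status & (ATA_BUSY | ATA_DRQ)) == 0;
    case HSM_COMPLETE:
        /* DRQ must not be set on command completion. */
        return (status & ATA_DRQ) != 0;
    }
    return true; /* unknown phase: treat as a violation */
}
```

 A caller that gets true back would go on to reset as described in
[1-2-1].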


[1-1-2] ATA command error (non-NCQ)

 This error is indicated by set ERR bit on ATA command completion.
STATUS and ERROR registers indicate what kind of error has occurred.
Interpretation of STATUS and ERROR may differ depending on command.

 This type of errors can be further categorized.

 a. CRC error during transmission

    This is indicated by ICRC bit in the ERROR register.  Reset is not
    necessary as HSM is not violated but reconfiguring transport speed
    would help.

 b. Media errors

    This is indicated by UNC bit in the ERROR register.  ATA devices
    report UNC error only after a certain number of retries cannot
    recover the data, so there's nothing much else to do other than
    notifying upper layer.  Note that READ and WRITE commands report
    CHS or LBA of the first failed sector.  This could be used to
    successfully complete the sectors in the request preceding that
    address, although it's doubtful whether it would actually help.

 c. Media changed / media change requested error

    Is there any SATA device with removable media?

 d. Other errors

    This can be invalid command or parameter indicated by ABRT ERROR
    bit or some other error condition.  Report to upper layer.
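
 The a-d split above could be sketched as a small classifier.  This is
only one possible mapping and does not settle the TODO below: the
enum, the function name and the order in which bits are tested are
assumptions of this example.  The bit values (ERR 0x01 in STATUS;
ICRC 0x80, UNC 0x40, MC 0x20, MCR 0x08 in ERROR) are the standard
ones.  ABRT and everything else fall through to "other".

```c
#include <assert.h>
#include <stdint.h>

#define ATA_ERR  0x01u /* STATUS: error */
#define ATA_ICRC 0x80u /* ERROR: interface CRC error */
#define ATA_UNC  0x40u /* ERROR: uncorrectable media error */
#define ATA_MC   0x20u /* ERROR: media changed */
#define ATA_MCR  0x08u /* ERROR: media change requested */

/* Hypothetical categories matching items a-d above. */
enum ata_err_class {
    ERR_NONE,         /* ERR bit not set */
    ERR_BUS_CRC,      /* a. CRC error during transmission */
    ERR_MEDIA,        /* b. media error */
    ERR_MEDIA_CHANGE, /* c. media changed / change requested */
    ERR_OTHER         /* d. ABRT or anything else */
};

/* Classify a completed non-NCQ ATA command from STATUS and ERROR. */
enum ata_err_class classify_ata_error(uint8_t status, uint8_t error)
{
    if (!(status & ATA_ERR))
        return ERR_NONE;
    /* ICRC is tested first because ABRT is often set alongside it. */
    if (error & ATA_ICRC)
        return ERR_BUS_CRC;
    if (error & ATA_UNC)
        return ERR_MEDIA;
    if (error & (ATA_MC | ATA_MCR))
        return ERR_MEDIA_CHANGE;
    return ERR_OTHER;
}
```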

*TODO* Describe how STATUS and ERROR bits can be mapped to error
       categories.

*QUESTION* Do we have to ignore command-specific 'not applicable' bits
           when interpreting register values?


[1-1-3] ATA command timeout (non-NCQ)

 ATA command timeout occurs if an ATA command fails to complete in some
specified time.  When timeout occurs, HSM could be in any valid or
invalid state.  To bring the device to known state and make it forget
about the command, resetting is necessary.  The timed out command can
be retried.

 Timeouts can also be caused by transmission errors.  Reconfiguring
transport might help.


[1-1-4] ATAPI command error

 ATAPI command error is indicated by set CHK bit (ERR bit) in the
STATUS register on ATAPI command completion.  CHK bit indicates SAM
CHECK CONDITION status.  Sense data is needed to determine why the
error occurred and what actions to take.  As ATAPI doesn't do
autosensing, explicit REQUEST SENSE command should be issued to the
device.  Once sense data is acquired, the error can be handled
similarly to other SCSI errors.
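
 For reference, the explicit REQUEST SENSE could be built roughly as
follows.  This is a sketch under stated assumptions: the function
names and the 12-byte packet size are choices of this example (some
devices use 16-byte packets).  The opcode (03h), the allocation
length in byte 4 and the sense key in the low nibble of byte 2 of
fixed-format sense data come from the SCSI command set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATAPI_CDB_LEN    12    /* assumed packet size for this sketch */
#define SCSI_REQ_SENSE   0x03u /* REQUEST SENSE opcode */
#define SENSE_BUFFER_LEN 18u   /* usual fixed-format sense length */

/* Fill 'cdb' with a REQUEST SENSE asking for 'alloc_len' bytes. */
void build_request_sense(uint8_t cdb[ATAPI_CDB_LEN], uint8_t alloc_len)
{
    assert(cdb != NULL);
    memset(cdb, 0, ATAPI_CDB_LEN);
    cdb[0] = SCSI_REQ_SENSE;
    cdb[4] = alloc_len; /* allocation length */
}

/* Extract the sense key from fixed-format sense data. */
uint8_t sense_key(const uint8_t *sense)
{
    assert(sense != NULL);
    return sense[2] & 0x0fu;
}
```

 The packet would then be sent with the PACKET command like any other
ATAPI command, and the returned sense key decides the follow-up
action.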


[1-1-5] ATAPI command timeout

 ATAPI command timeout occurs if an ATAPI command fails to complete in
some specified time.  It can be handled in the same way as ATA command
timeouts described in [1-1-3].


[1-1-6] NCQ command error

 NCQ command error is indicated by cleared BSY and set ERR bit during
NCQ command phase (one or more NCQ commands outstanding).  Although
STATUS and ERROR registers will contain valid values describing the
error, READ LOG EXT is required to clear the error condition,
determine which command has failed and acquire more information.

 READ LOG EXT Log Page 10h reports which tag has failed and TF
register values describing the error.  With this information the
failed command can be handled as a normal ATA command error as in
[1-1-2] and all other in-flight commands should be retried.  Note that
this retry should not be counted - it's likely that commands retried
this way would have completed normally without the failed command.

 If READ LOG EXT Log Page 10h fails or reports NQ, we're thoroughly
screwed.  This condition should be treated as a HSM violation.
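
 A rough sketch of decoding the log page follows.  The layout used
here is the one from the SATA specification as far as I know it: byte
0 carries NQ in bit 7 and the failed tag in bits 4:0, bytes 2 and 3
are the Status and Error values, and all 512 bytes sum to zero.  The
struct and function names are hypothetical, and treating a bad
checksum as a failure is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATA_SECT_SIZE 512

/* Hypothetical container for what EH needs from log page 10h. */
struct ncq_error_info {
    uint8_t tag;    /* tag of the failed command */
    uint8_t status; /* STATUS value of the failed command */
    uint8_t error;  /* ERROR value of the failed command */
};

/*
 * Decode a log page 10h buffer.  Returns 0 on success, -1 if the page
 * is unusable (bad checksum or NQ set), in which case the caller
 * should fall back to HSM violation handling.
 */
int parse_log_10h(const uint8_t buf[ATA_SECT_SIZE],
                  struct ncq_error_info *out)
{
    uint8_t csum = 0;
    size_t i;

    assert(buf != NULL && out != NULL);

    for (i = 0; i < ATA_SECT_SIZE; i++)
        csum = (uint8_t)(csum + buf[i]);
    if (csum != 0)
        return -1;
    if (buf[0] & 0x80u) /* NQ: the error was not for a queued command */
        return -1;

    out->tag = buf[0] & 0x1fu;
    out->status = buf[2];
    out->error = buf[3];
    return 0;
}
```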


[1-1-7] NCQ command timeout

 NCQ command timeout occurs if an NCQ command fails to complete in some
specified time.  Sane recovery action seems to be waiting for all
other commands to finish and then taking the same action as for ATA
command timeout [1-1-3].


[1-1-8] Other errors

 a. (PCI) bus error

    This is indicated by Error bit in BMDMA Status register.  The
    following is an excerpt from Jeff's mail regarding this error.

    "PCI bus errors should be handled by resetting the host controller
    (if possible), and then retrying the command [NOTE: better
    suggestions welcome]"

 b. Other controller specific errors

    Most errors should fit into one of the categories described above
    and be handled accordingly.


[1-1-9] Non-error exceptions

 *TODO* Write about PM and hot plugging.


[1-2] EH recovery actions

 This section discusses two important recovery actions mentioned
previously - resetting device/HBA and reconfiguring transport speed.


[1-2-1] Reset

 During EH, resetting is necessary in the following cases.

 - HSM is in unknown or invalid state
 - HBA is in unknown or invalid state
 - EH needs to make HBA/device forget about in-flight commands
 - HBA/device behaves weirdly.

 Resetting during EH might be a good idea regardless of error
condition to improve EH robustness.

 HBA resetting is implementation specific and even controllers
complying to taskfile/BMDMA PCI IDE interface are likely to have
implementation-specific ways to reset whole HBA.  So, this probably
should be addressed by specific drivers.

 OTOH, ATA/ATAPI standard describes in detail ways to reset ATA/ATAPI
devices.

 a. PATA hardware reset

    This is hardware initiated device reset signalled with asserted
    RESET- signal.  In PATA, there is no way to initiate hardware
    reset from software.

 b. Software reset

    This is achieved by turning CONTROL SRST bit on for at least 5us.
    Both PATA and SATA support it but, in case of SATA, this may
    require controller-specific support as the second Register FIS to
    clear SRST should be transmitted while BSY bit is still set.  Note
    that on PATA, this resets both master and slave devices on a
    channel.
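
 The SRST pulse itself might look like the sketch below.  It is
written against hypothetical accessors (struct ctl_ops) because how
the Device Control register is reached is controller specific.  Only
the SRST bit value (04h) and the 5us hold time are taken from the
text and the standard.  The fake_* helpers are a recording stand-in
so the sketch can be tried without hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATA_SRST 0x04u /* Device Control register: software reset */

/* Hypothetical accessors a driver would supply for its controller. */
struct ctl_ops {
    void (*write_ctl)(void *ctx, uint8_t val); /* write Device Control */
    void (*delay_us)(void *ctx, unsigned int us);
};

/*
 * Pulse SRST: set it, hold it for at least 5us, then clear it.
 * 'ctl' is the Device Control value the driver normally uses.
 * The caller still has to wait for BSY to clear afterwards.
 */
void ata_pulse_srst(const struct ctl_ops *ops, void *ctx, uint8_t ctl)
{
    assert(ops != NULL && ops->write_ctl != NULL && ops->delay_us != NULL);
    ops->write_ctl(ctx, (uint8_t)(ctl | ATA_SRST));
    ops->delay_us(ctx, 5);
    ops->write_ctl(ctx, (uint8_t)(ctl & ~ATA_SRST));
}

/* Recording stand-in for a port, for trying the sketch off-hardware. */
struct fake_port {
    uint8_t writes[4];
    unsigned int nr_writes;
    unsigned int delayed_us;
};

void fake_write_ctl(void *ctx, uint8_t val)
{
    struct fake_port *p = ctx;

    if (p->nr_writes < 4)
        p->writes[p->nr_writes++] = val;
}

void fake_delay_us(void *ctx, unsigned int us)
{
    struct fake_port *p = ctx;

    p->delayed_us += us;
}
```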

 c. ATAPI DEVICE RESET command

    This is very similar to software reset except that reset can be
    restricted to the selected device without affecting the other
    device sharing the cable.

 d. SATA phy reset

    This is the preferred way of resetting a SATA device.  In effect
    it's identical to PATA hardware reset.  Note that this can be done
    with the standard SCR Control register.  As such, it's usually
    easier to implement than software reset too.
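
 In terms of SCR Control values, a phy reset is a matter of writing
DET = 1 and later DET = 0.  The two helpers below only compute the
values to write and keep the other fields (SPD, IPM) untouched.  The
helper names are made up for this sketch, and the register access and
the wait between the two writes (the SATA spec asks for at least 1ms,
if I read it correctly) are left to the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SCR_CONTROL_DET_MASK 0x0000000fu /* SControl bits 3:0 */
#define SCR_CONTROL_DET_INIT 0x00000001u /* DET = 1: start interface init */

/* SControl value that starts the phy reset (DET = 1). */
uint32_t scontrol_begin_reset(uint32_t scontrol)
{
    return (scontrol & ~SCR_CONTROL_DET_MASK) | SCR_CONTROL_DET_INIT;
}

/* SControl value that releases the phy reset (DET = 0). */
uint32_t scontrol_end_reset(uint32_t scontrol)
{
    return scontrol & ~SCR_CONTROL_DET_MASK;
}
```

 For example, starting from 0x300 the driver would write 0x301, wait,
and then write 0x300 again.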

 Although above reset methods are standard, different HBA
implementations may have different requirements for resetting devices.
For standard BMDMA implementation, BMDMA state is the only context and
stopping active DMA transaction suffices.  For other types of HBAs,
there are different requirements to put them in consistent state.

 One more thing to consider when resetting devices is that resetting
clears certain configuration parameters and they need to be set to
their previous or newly adjusted values after reset.

 The affected parameters are:

 - CHS set up with INITIALIZE DEVICE PARAMETERS (seldom used)
 - Parameters set with SET FEATURES including transfer mode setting.
 - Block count set with SET MULTIPLE MODE
 - Other parameters (SET MAX, MEDIA LOCK...)

 ATA/ATAPI standard specifies that some parameters should be kept
across hardware reset or software reset, but doesn't strictly specify
all of them.  IMHO, always reconfiguring needed parameters after reset
would be a good idea for robustness.

 Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / IDENTIFY
PACKET DEVICE is issued after a hardware reset and the result is used
for further operation.  *QUESTION* Would this be necessary?  If so,
revalidation mechanism needs to be implemented.


[1-2-2] Reconfigure transport

 For both PATA and SATA, a lot of corners are cut for cheap
connectors, cables or controllers and it's quite common to see high
transmission error rate.  This can be mitigated by lowering
transmission speed.

 The following scheme can be applied.  (from Jeff's comment)

 If more than $N (3?) transmission errors happen in 15 minutes,

- if SATA, decrease SATA PHY speed.  if speed cannot be decreased,
- decrease UDMA xfer speed.  if at UDMA0, switch to PIO4,
- decrease PIO xfer speed.  if at PIO3, complain, but continue
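
 The scheme above could be sketched as follows.  Everything named
here (struct err_history, struct link_speed, the function names) is
hypothetical, $N is fixed at 3 for the sake of the example, and time
is plain seconds supplied by the caller.  "More than $N errors in 15
minutes" is taken to mean that the newest error and the one $N errors
before it are at most 15 minutes apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ERR_WINDOW_SEC (15 * 60)
#define ERR_THRESHOLD  3                   /* the "$N (3?)" above */
#define HIST_LEN       (ERR_THRESHOLD + 1)

/* Ring of the last HIST_LEN transmission error timestamps. */
struct err_history {
    long stamp[HIST_LEN];
    unsigned int n; /* total number of errors recorded */
};

/* Record an error at 'now'; true means the speed should be lowered. */
bool record_xmit_error(struct err_history *h, long now)
{
    long oldest;

    assert(h != NULL);
    h->stamp[h->n % HIST_LEN] = now;
    h->n++;
    if (h->n < HIST_LEN)
        return false;
    /* The slot after the newest holds the oldest remembered error. */
    oldest = h->stamp[h->n % HIST_LEN];
    return now - oldest <= ERR_WINDOW_SEC;
}

enum xfer_kind { XFER_PIO, XFER_UDMA };

struct link_speed {
    int sata_gen;        /* SATA PHY generation, 0 for PATA */
    enum xfer_kind kind; /* current transfer class */
    int mode;            /* UDMA or PIO mode number */
};

enum speed_down_result { SPEED_LOWERED, SPEED_AT_FLOOR };

/* Take one step down the ladder described above. */
enum speed_down_result speed_down(struct link_speed *s)
{
    assert(s != NULL);
    if (s->sata_gen > 1) {          /* SATA PHY speed first */
        s->sata_gen--;
        return SPEED_LOWERED;
    }
    if (s->kind == XFER_UDMA) {     /* then UDMA, UDMA0 -> PIO4 */
        if (s->mode > 0) {
            s->mode--;
        } else {
            s->kind = XFER_PIO;
            s->mode = 4;
        }
        return SPEED_LOWERED;
    }
    if (s->mode > 3) {              /* then PIO, down to PIO3 */
        s->mode--;
        return SPEED_LOWERED;
    }
    return SPEED_AT_FLOOR;          /* complain, but continue */
}
```

 Whether the history should be cleared after a speed change is left
open here.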


* Re: [RFC] ATA/ATAPI exceptions doc
  2005-08-27  5:35 [RFC] ATA/ATAPI exceptions doc Tejun Heo
@ 2005-09-07  8:02 ` Jeff Garzik
  2005-09-07 12:27   ` Tejun Heo
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff Garzik @ 2005-09-07  8:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: albertcc, bzolnier, linux-ide

Tejun Heo wrote:
>  Hello, ATA people.
> 
>  This is the first section of libata EH doc.  This section tries to
> describe ATA/ATAPI errors and exceptions in driver-neutral way and is
> intended to be used as reference when implementing new libata EH.
> 
>  The second section will be about current libata EH implementation and
> the last will be how to implement new libata EH.
> 
>  Thanks.
> 
> libata EH
> ======================================
> 
>  This document first discusses what ATA/ATAPI error conditions exist
> and how they should be handled.  Then, we move on to how libata
> currently handles them and how it can be improved.  Where 'current'
> represents ALL head of libata-dev-2.6 git tree as of 2005-08-26,
> commit ab9b494f6aeab24eda2e6462e2fe73789c288e73.  References are made
> to SCSI EH document.  Please read SCSI EH document first.
> 
>  A lot of EH ideas are from Jeff Garzik and others in the following
> and other discussion threads on linux-ide.
> 
>  http://marc.theaimsgroup.com/?l=linux-ide&m=112451335416913&w=2
> 
> 
> [1] ATA/ATAPI errors and exceptions
> 
>  This section tries to identify what error/exception conditions exist
> for ATA/ATAPI devices and describe how they should be handled in
> implementation-neutral way.
> 
>  The term 'error' is used to describe conditions where either an
> explicit error condition is reported from device or a command has
> timed out.
> 
>  The term 'exception' is either used to describe exceptional
> conditions which are not errors (say, power or hotplug events), or to
> describe both errors and non-error exceptional conditions.  Where
> explicit distinction between error and exception is necessary, the
> term 'non-error exception' is used.
> 
>  The following categories of exceptions exist for ATA/ATAPI devices.
> 
>  - HSM violation error
>  - ATA command error (non-NCQ)
>  - ATA command timeout (non-NCQ)
>  - ATAPI command error
>  - ATAPI command timeout
>  - NCQ command error
>  - NCQ command timeout
>  - other errors
>  - non-error exceptions


I would list the categories in this way:

- HSM violation, if driver or hardware is out of spec
- ATA/ATAPI device error (device populates Error register, or ABRT)
- ATAPI device check condition error
- ATA device error, during NCQ operations
- ATA bus error (usually indicated via command timeout, but some
   hardware includes error register bits specifically for these
   conditions)
	* includes DMA errors
	* includes SATA PHY errors
- PCI bus error (or whatever bus your host<->device path uses)
- Late successful completion.  Indicated via command timeout, where a
   final check of the hardware indicates the command actually did
   complete successfully.
- Unknown error.  Indicated via command timeout, where one cannot
   discern why the command timed out.
- Hotplug and power management exceptions.




> [1-1-2] ATA command error (non-NCQ)
> 
>  This error is indicated by set ERR bit on ATA command completion.
> STATUS and ERROR registers indicate what kind of error has occurred.
> Interpretation of STATUS and ERROR may differ depending on command.
> 
>  This type of errors can be further categorized.
> 
>  a. CRC error during transmission
> 
>     This is indicated by ICRC bit in the ERROR register.  Reset is not
>     necessary as HSM is not violated but reconfiguring transport speed
>     would help.

note this is a "bus" not "device" error


>  b. Media errors
> 
>     This is indicated by UNC bit in the ERROR register.  ATA devices
>     report UNC error only after certain number of retries cannot
>     recover the data, so there's nothing much else to do other than
>     notifying upper layer.  Note that READ and WRITE commands report
>     CHS or LBA of the first failed sector.  This could be used to
>     complete successfully sectors in the request preceding the address
>     although it's doubtful if it would actually help.

Long term, yes, we should use available ATA information to partially 
complete the SCSI request, up to the point where the data transfer failed.


>  c. Media changed / media change requested error
> 
>     Is there any SATA device with removable media?

compact flash and cdrom


>  d. Other errors
> 
>     This can be invalid command or parameter indicated by ABRT ERROR
>     bit or some other error condition.  Report to upper layer.
> 
> *TODO* Describe how STATUS and ERROR bits can be mapped to error
>        categories.
> 
> *QUESTION* Do we have to ignore command-specific 'not applicable' bits
>            when interpreting register values?

Not sure how to answer this.  Which register values?  What is the entity 
doing the interpreting?


> [1-1-3] ATA command timeout (non-NCQ)
> 
>  ATA command timeout occurs if an ATA command fails to complete in some
> specified time.  When timeout occurs, HSM could be in any valid or
> invalid state.  To bring the device to known state and make it forget
> about the command, resetting is necessary.  The timed out command can
> be retried.
> 
>  Timeouts can also be caused by transmission errors.  Reconfiguring
> transport might help.

Note that, by design, when a DMA error occurs some hardware will simply 
not send an interrupt.  They rely on the OS driver to notice the lack of 
response, and from there, read the hardware registers to determine if a 
DMA error occurred.


> [1-2] EH recovery actions
> 
>  This section discusses two important recovery actions mentioned
> previously - resetting device/HBA and reconfiguring transport speed.
> 
> 
> [1-2-1] Reset
> 
>  During EH, resetting is necessary in the following cases.
> 
>  - HSM is in unknown or invalid state
>  - HBA is in unknown or invalid state
>  - EH needs to make HBA/device forget about in-flight commands
>  - HBA/device behaves weirdly.
> 
>  Resetting during EH might be a good idea regardless of error
> condition to improve EH robustness.

Note that a lot of vendor driver interrupt handlers do the following, 
after processing an interrupt:

	tmp = read(SError)
	write(tmp, SError)

At the very least we should do that on error.
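
The read-then-write-back works because SError bits are cleared by
writing 1 to them.  A slightly fuller sketch of the same idea, with
hypothetical accessors (struct scr_ops) standing in for whatever the
controller provides, and a small software model of such a register so
it can be tried without hardware:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SError accessors supplied by a controller driver. */
struct scr_ops {
    uint32_t (*read)(void *ctx);
    void (*write)(void *ctx, uint32_t val);
};

/* Read SError and write the same value back to clear what was set. */
uint32_t clear_serror(const struct scr_ops *ops, void *ctx)
{
    uint32_t tmp;

    assert(ops != NULL && ops->read != NULL && ops->write != NULL);
    tmp = ops->read(ctx);
    ops->write(ctx, tmp);
    return tmp; /* caller can still inspect what was pending */
}

/* Software model of a write-1-to-clear register, for experiments. */
uint32_t fake_serror_read(void *ctx)
{
    return *(uint32_t *)ctx;
}

void fake_serror_write(void *ctx, uint32_t val)
{
    *(uint32_t *)ctx &= ~val;
}
```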


>  HBA resetting is implementation specific and even controllers
> complying to taskfile/BMDMA PCI IDE interface are likely to have
> implementation-specific ways to reset whole HBA.  So, this probably
> should be addressed by specific drivers.

s/should/must/

Although for PATA controllers, sometimes the best you can do is SRST, 
which implies that specific drivers can use a common reset facility.


>  OTOH, ATA/ATAPI standard describes in detail ways to reset ATA/ATAPI
> devices.
> 
>  a. PATA hardware reset
> 
>     This is hardware initiated device reset signalled with asserted
>     RESET- signal.  In PATA, there is no way to initiate hardware
>     reset from software.

Some PATA hardware provides registers that allow the OS driver to 
directly tweak the RESET- signal.


>  b. Software reset
> 
>     This is achieved by turning CONTROL SRST bit on for at least 5us.
>     Both PATA and SATA support it but, in case of SATA, this may
>     require controller-specific support as the second Register FIS to
>     clear SRST should be transmitted while BSY bit is still set.  Note
>     that on PATA, this resets both master and slave devices on a
>     channel.

ditto for EXECUTE DEVICE DIAGNOSTIC


>  c. ATAPI DEVICE RESET command
> 
>     This is very similar to software reset except that reset can be
>     restricted to the selected device without affecting the other
>     device sharing the cable.
> 
>  d. SATA phy reset
> 
>     This is the preferred way of resetting a SATA device.  In effect
>     it's identical to PATA hardware reset.  Note that this can be done
>     with the standard SCR Control register.  As such, it's usually
>     easier to implement than software reset too.
> 
>  Although above reset methods are standard, different HBA
> implementations may have different requirements for resetting devices.
> For standard BMDMA implementation, BMDMA state is the only context and
> stopping active DMA transaction suffices.  For other types of HBAs,
> there are different requirements to put them in consistent state.

I would definitely do an SRST after a DMA error.


>  One more thing to consider when resetting devices is that resetting
> clears certain configuration parameters and they need to be set to
> their previous or newly adjusted values after reset.

Yep.  Same problem coming back from power-off (resume).


>  Parameters affected are.
> 
>  - CHS set up with INITIALIZE DEVICE PARAMETERS (seldom used)
>  - Parameters set with SET FEATURES including transfer mode setting.
>  - Block count set with SET MULTIPLE MODE
>  - Other parameters (SET MAX, MEDIA LOCK...)
> 
>  ATA/ATAPI standard specifies that some parameters should be kept
> across hardware reset or software reset, but doesn't strictly specify
> all of them.  IMHO, always reconfiguring needed parameters after reset
> would be a good idea for robustness.

s/good idea/required/ :)


>  Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / IDENTIFY
> PACKET DEVICE is issued after a hardware reset and the result is used
> for further operation.  *QUESTION* Would this be necessary?  If so,
> revalidation mechanism needs to be implemented.

Any time features are turned on/off, etc., the identify-device page 
should be re-read.

	Jeff




* Re: [RFC] ATA/ATAPI exceptions doc
  2005-09-07  8:02 ` Jeff Garzik
@ 2005-09-07 12:27   ` Tejun Heo
  0 siblings, 0 replies; 3+ messages in thread
From: Tejun Heo @ 2005-09-07 12:27 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: albertcc, bzolnier, linux-ide


  Howdy, Jeff.

  I'll be moving to another city this weekend.  Then the big Korean
thanksgiving holiday is coming, so I won't have a lot of time for the
next two weeks.  I'll try to get as much done as possible, but I'll
surely be slow.

Jeff Garzik wrote:
> Tejun Heo wrote:
> 
>>  Hello, ATA people.
>>
>>  This is the first section of libata EH doc.  This section tries to
>> describe ATA/ATAPI errors and exceptions in driver-neutral way and is
>> intended to be used as reference when implementing new libata EH.
>>
>>  The second section will be about current libata EH implementation and
>> the last will be how to implement new libata EH.
>>
>>  Thanks.
>>
>> libata EH
>> ======================================
>>
>>  This document first discusses what ATA/ATAPI error conditions exist
>> and how they should be handled.  Then, we move on to how libata
>> currently handles them and how it can be improved.  Where 'current'
>> represents ALL head of libata-dev-2.6 git tree as of 2005-08-26,
>> commit ab9b494f6aeab24eda2e6462e2fe73789c288e73.  References are made
>> to SCSI EH document.  Please read SCSI EH document first.
>>
>>  A lot of EH ideas are from Jeff Garzik and others in the following
>> and other discussion threads on linux-ide.
>>
>>  http://marc.theaimsgroup.com/?l=linux-ide&m=112451335416913&w=2
>>
>>
>> [1] ATA/ATAPI errors and exceptions
>>
>>  This section tries to identify what error/exception conditions exist
>> for ATA/ATAPI devices and describe how they should be handled in
>> implementation-neutral way.
>>
>>  The term 'error' is used to describe conditions where either an
>> explicit error condition is reported from device or a command has
>> timed out.
>>
>>  The term 'exception' is either used to describe exceptional
>> conditions which are not errors (say, power or hotplug events), or to
>> describe both errors and non-error exceptional conditions.  Where
>> explicit distinction between error and exception is necessary, the
>> term 'non-error exception' is used.
>>
>>  The following categories of exceptions exist for ATA/ATAPI devices.
>>
>>  - HSM violation error
>>  - ATA command error (non-NCQ)
>>  - ATA command timeout (non-NCQ)
>>  - ATAPI command error
>>  - ATAPI command timeout
>>  - NCQ command error
>>  - NCQ command timeout
>>  - other errors
>>  - non-error exceptions
> 
> 
> 
> I would list the categories in this way:
> 
> - HSM violation, if driver or hardware is out of spec
> - ATA/ATAPI device error (device populates Error register, or ABRT)
> - ATAPI device check condition error
> - ATA device error, during NCQ operations
> - ATA bus error (usually indicated via command timeout, but some
>   hardware includes error register bits specifically for these
>   conditions)
>     * includes DMA errors
>     * includes SATA PHY errors
> - PCI bus error (or whatever bus your host<->device path uses)
> - Late successful completion.  Indicated via command timeout, where a
>   final check of the hardware indicates the command actually did
>   complete successfully.
> - Unknown error.  Indicated via command timeout, where one cannot
>   discern why the command timed out.
> - Hotplug and power management exceptions.
> 

  Okay, I'll reorganize the document accordingly.

> 
> 
> 
>> [1-1-2] ATA command error (non-NCQ)
>>
>>  This error is indicated by set ERR bit on ATA command completion.
>> STATUS and ERROR registers indicate what kind of error has occurred.
>> Interpretation of STATUS and ERROR may differ depending on command.
>>
>>  This type of errors can be further categorized.
>>
>>  a. CRC error during transmission
>>
>>     This is indicated by ICRC bit in the ERROR register.  Reset is not
>>     necessary as HSM is not violated but reconfiguring transport speed
>>     would help.
> 
> 
> note this is a "bus" not "device" error
> 

  Meaning....?

  Note that for PATA devices, transfer speed is set per-device although 
they share a bus.

> 
>>  b. Media errors
>>
>>     This is indicated by UNC bit in the ERROR register.  ATA devices
>>     report UNC error only after certain number of retries cannot
>>     recover the data, so there's nothing much else to do other than
>>     notifying upper layer.  Note that READ and WRITE commands report
>>     CHS or LBA of the first failed sector.  This could be used to
>>     complete successfully sectors in the request preceding the address
>>     although it's doubtful if it would actually help.
> 
> 
> Long term, yes, we should use available ATA information to partially 
> complete the SCSI request, up to the point where the data transfer failed.
> 

  I'll take out the 'doubtful' sentence.

> 
>>  c. Media changed / media change requested error
>>
>>     Is there any SATA device with removable media?
> 
> 
> compact flash and cdrom
> 

  The question was misleading.  What I meant was 'is there any SATA 
device which makes use of removable media status notification feature 
set or removable media feature set currently?' as SCSI is handling 
cdrom's door locking and revalidation on media change and AFAIK there's 
no SATA compact flash yet.  But in the long term, yes, we do need to 
handle this.

> 
>>  d. Other errors
>>
>>     This can be invalid command or parameter indicated by ABRT ERROR
>>     bit or some other error condition.  Report to upper layer.
>>
>> *TODO* Describe how STATUS and ERROR bits can be mapped to error
>>        categories.
>>
>> *QUESTION* Do we have to ignore command-specific 'not applicable' bits
>>            when interpreting register values?
> 
> 
> Not sure how to answer this.  Which register values?  What is the entity 
> doing the interpreting?
> 

  The entity is the driver doing EH.  READ/WRITE commands use all the
bits in the error register but other commands have 'na' on some bits.
I think most devices report 0 for those bits but as the spec says 'na'
I just wanted to make sure.  For example, READ BUFFER's error output
description states that only ABRT bit is applicable.  This is mostly a
non-issue, IMHO.  Maybe stating in the document or a comment that we
don't consider them is enough.

> 
>> [1-1-3] ATA command timeout (non-NCQ)
>>
>>  ATA command timeout occurs if an ATA command fails to complete in some
>> specified time.  When timeout occurs, HSM could be in any valid or
>> invalid state.  To bring the device to known state and make it forget
>> about the command, resetting is necessary.  The timed out command can
>> be retried.
>>
>>  Timeouts can also be caused by transmission errors.  Reconfiguring
>> transport might help.
> 
> 
> Note that, by design, when a DMA error occurs some hardware will simply 
> not send an interrupt.  They rely on the OS driver to notice the lack of 
> response, and from there, read the hardware registers to determine if a 
> DMA error occurred.
> 

  I'll take this into account when I reorganize this document.

> 
>> [1-2] EH recovery actions
>>
>>  This section discusses two important recovery actions mentioned
>> previously - resetting device/HBA and reconfiguring transport speed.
>>
>>
>> [1-2-1] Reset
>>
>>  During EH, resetting is necessary in the following cases.
>>
>>  - HSM is in unknown or invalid state
>>  - HBA is in unknown or invalid state
>>  - EH needs to make HBA/device forget about in-flight commands
>>  - HBA/device behaves weirdly.
>>
>>  Resetting during EH might be a good idea regardless of error
>> condition to improve EH robustness.
> 
> 
> Note that a lot of vendor driver interrupt handlers do the following, 
> after processing an interrupt:
> 
>     tmp = read(SError)
>     write(tmp, SError)
> 
> At the very least we should do that on error.
> 

  I'll add that.

> 
>>  HBA resetting is implementation specific and even controllers
>> complying to taskfile/BMDMA PCI IDE interface are likely to have
>> implementation-specific ways to reset whole HBA.  So, this probably
>> should be addressed by specific drivers.
> 
> 
> s/should/must/
> 
> Although for PATA controllers, sometimes the best you can do is SRST, 
> which implies that specific drivers can use a common reset facility.
> 
> 
>>  OTOH, ATA/ATAPI standard describes in detail ways to reset ATA/ATAPI
>> devices.
>>
>>  a. PATA hardware reset
>>
>>     This is hardware initiated device reset signalled with asserted
>>     RESET- signal.  In PATA, there is no way to initiate hardware
>>     reset from software.
> 
> 
> Some PATA hardware provides registers that allow the OS driver to 
> directly tweak the RESET- signal.
> 
> 
>>  b. Software reset
>>
>>     This is achieved by turning CONTROL SRST bit on for at least 5us.
>>     Both PATA and SATA support it but, in case of SATA, this may
>>     require controller-specific support as the second Register FIS to
>>     clear SRST should be transmitted while BSY bit is still set.  Note
>>     that on PATA, this resets both master and slave devices on a
>>     channel.
> 
> 
> ditto for EXECUTE DEVICE DIAGNOSTIC
> 
> 
>>  c. ATAPI DEVICE RESET command
>>
>>     This is very similar to software reset except that reset can be
>>     restricted to the selected device without affecting the other
>>     device sharing the cable.
>>
>>  d. SATA phy reset
>>
>>     This is the preferred way of resetting a SATA device.  In effect
>>     it's identical to PATA hardware reset.  Note that this can be done
>>     with the standard SCR Control register.  As such, it's usually
>>     easier to implement than software reset too.
>>
>>  Although above reset methods are standard, different HBA
>> implementations may have different requirements for resetting devices.
>> For standard BMDMA implementation, BMDMA state is the only context and
>> stopping active DMA transaction suffices.  For other types of HBAs,
>> there are different requirements to put them in consistent state.
> 
> 
> I would definitely do an SRST after a DMA error.
> 
> 
>>  One more thing to consider when resetting devices is that resetting
>> clears certain configuration parameters and they need to be set to
>> their previous or newly adjusted values after reset.
> 
> 
> Yep.  Same problem coming back from power-off (resume).
> 
> 
>>  Parameters affected are.
>>
>>  - CHS set up with INITIALIZE DEVICE PARAMETERS (seldom used)
>>  - Parameters set with SET FEATURES including transfer mode setting.
>>  - Block count set with SET MULTIPLE MODE
>>  - Other parameters (SET MAX, MEDIA LOCK...)
>>
>>  ATA/ATAPI standard specifies that some parameters should be kept
>> across hardware reset or software reset, but doesn't strictly specify
>> all of them.  IMHO, always reconfiguring needed parameters after reset
>> would be a good idea for robustness.
> 
> 
> s/good idea/required/ :)
> 
> 
>>  Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / IDENTIFY
>> PACKET DEVICE is issued after a hardware reset and the result is used
>> for further operation.  *QUESTION* Would this be necessary?  If so,
>> revalidation mechanism needs to be implemented.
> 
> 
> Any time features are turned on/off, etc., the identify-device page 
> should be re-read.
> 

  If we do that, do you think it's necessary to implement a mechanism to
revalidate devices after re-reading ID when it no longer matches the
device we were handling before re-reading?  Say, when device ID, device
type or maximum sector changes?  I don't know to what extent we should
verify the re-read identify data.  What if supported transfer mode changes?

  Thanks.

-- 
tejun

