Re: Weird I/O errors with USB hard drive not remounting filesystem readonly

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
       [not found]     ` <20091120094641.GB15422@duck.suse.cz>
@ 2009-11-23  8:09       ` tmhikaru
  2009-11-23 10:54         ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: tmhikaru @ 2009-11-23  8:09 UTC (permalink / raw)
  To: Jan Kara, linux-kernel, linux-usb, Jens Axboe, linux-scsi,
	linux-ext4

[-- Attachment #1: Type: text/plain, Size: 1460 bytes --]

	Okay, finally had some time to dismantle the machine in question and
inserted the backup drive without the enclosure. Now that I was able to get
smartctl to give me information on what was going on, it seems there's
nothing wrong with the drive itself (no remapped sectors, nothing prefail or
fail going wrong, drive's smart status is good) but it's had a history of
406 (!!!) errors reported, the last five happening within the same second,
all of which seem (to my untrained eye) to imply a bad IDE cable. Which
means likely my USB enclosure is screwing up.

	To be sure, I did some extended testing with it hooked up: Tried
doing a full and incremental backup of my machine with no errors reported,
as well as an offline and then immediately afterwards, long test via
smartctl. Nothing changed in the output.

	I've attached the output of smartctl at various points in the tests:
the first being the initial immediate display of information at startup, the
second just after the long test, and the third just after the two backups
(Full and incremental) were tried. Nothing seems to shout at me in the
output, I'm frankly pretty sure something's wrong with the enclosure. If
it's the drive, it's not showing up here at least.

	I'd like your thoughts and suggestions on this. While I'm pretty
sure I know what the smartctl output means, I'm always capable of being
entirely wrong. Please let me know if I've misinterpreted the information.

Timothy McGrath

[-- Attachment #2: diskstats --]
[-- Type: text/plain, Size: 8579 bytes --]

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.8 family
Device Model:     ST3250823A
Serial Number:    5ND3CC5A
Firmware Version: 3.06
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov 23 00:23:54 2009 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  84) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   074   036   006    Pre-fail  Always       -       50613791
  3 Spin_Up_Time            0x0003   099   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   050   047   030    Pre-fail  Always       -       1443141758905
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       354
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       157
194 Temperature_Celsius     0x0022   027   052   000    Old_age   Always       -       27 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   051   046   000    Old_age   Always       -       760201
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 406 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 406 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 405 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 404 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 403 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 402 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:30.579  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


[-- Attachment #3: diskstats2 --]
[-- Type: text/plain, Size: 8684 bytes --]

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.8 family
Device Model:     ST3250823A
Serial Number:    5ND3CC5A
Firmware Version: 3.06
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov 23 02:37:01 2009 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  84) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   053   036   006    Pre-fail  Always       -       73989115
  3 Spin_Up_Time            0x0003   099   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   050   047   030    Pre-fail  Always       -       1443142340790
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       357
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       157
194 Temperature_Celsius     0x0022   040   052   000    Old_age   Always       -       40 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   050   046   000    Old_age   Always       -       33250523
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 406 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 406 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 405 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 404 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 403 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 402 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:30.579  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Completed without error       00%       356         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


[-- Attachment #4: diskstats3 --]
[-- Type: text/plain, Size: 8684 bytes --]

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.8 family
Device Model:     ST3250823A
Serial Number:    5ND3CC5A
Firmware Version: 3.06
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov 23 02:42:34 2009 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  84) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   051   036   006    Pre-fail  Always       -       72983290
  3 Spin_Up_Time            0x0003   099   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   050   047   030    Pre-fail  Always       -       1443142359378
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       357
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       157
194 Temperature_Celsius     0x0022   040   052   000    Old_age   Always       -       40 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   049   046   000    Old_age   Always       -       48607152
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 406 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 406 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 405 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 404 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 403 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:36.934  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

Error 402 occurred at disk power-on lifetime: 160 hours (6 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 03 08 00 00 00 e0 00      00:00:30.579  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.511  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:36.087  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.663  READ DMA
  c8 03 08 00 00 00 e0 00      00:00:35.240  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Completed without error       00%       356         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23  8:09       ` Weird I/O errors with USB hard drive not remounting filesystem readonly tmhikaru
@ 2009-11-23 10:54         ` Jan Kara
  2009-11-23 15:06           ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2009-11-23 10:54 UTC (permalink / raw)
  To: tmhikaru
  Cc: Jan Kara, linux-kernel, linux-usb, Jens Axboe, linux-scsi,
	linux-ext4

  Hi,

On Mon 23-11-09 03:09:19, tmhikaru@gmail.com wrote:
> 	Okay, finally had some time to dismantle the machine in question and
> inserted the backup drive without the enclosure. Now that I was able to get
> smartctl to give me information on what was going on, it seems there's
> nothing wrong with the drive itself (no remapped sectors, nothing prefail or
> fail going wrong, drive's smart status is good) but it's had a history of
> 406 (!!!) errors reported, the last five happening within the same second,
> all of which seem (to my untrained eye) to imply a bad IDE cable. Which
> means likely my USB enclosure is screwing up.
> 
> 	To be sure, I did some extended testing with it hooked up: Tried
> doing a full and incremental backup of my machine with no errors reported,
> as well as an offline and then immediately afterwards, long test via
> smartctl. Nothing changed in the output.
  Yeah, from what you write, it looks like USB enclosure is at fault (or it
could still be your USB controller but I doubt it).  It's still a bit
bothering that the error reported by the drive was not properly propagated
up to VFS. Either it's some block layer retry/ignore magic that I missed or
we ignore errors from block layer in some place.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 10:54         ` Jan Kara
@ 2009-11-23 15:06           ` Alan Stern
  2009-11-23 16:09             ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2009-11-23 15:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: tmhikaru, linux-kernel, linux-usb, Jens Axboe, linux-scsi,
	linux-ext4

On Mon, 23 Nov 2009, Jan Kara wrote:

>   Yeah, from what you write, it looks like USB enclosure is at fault (or it
> could still be your USB controller but I doubt it).  It's still a bit
> bothering that the error reported by the drive was not properly propagated
> up to VFS. Either it's some block layer retry/ignore magic that I missed or
> we ignore errors from block layer in some place.

Is there any interest in tracking this down?  It's not hard to find out
what low-level errors are being reported and to generate them on demand
with an emulated USB disk drive.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 15:06           ` Alan Stern
@ 2009-11-23 16:09             ` Jan Kara
  2009-11-23 18:20               ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2009-11-23 16:09 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jan Kara, tmhikaru, linux-kernel, linux-usb, Jens Axboe,
	linux-scsi, linux-ext4

On Mon 23-11-09 10:06:03, Alan Stern wrote:
> On Mon, 23 Nov 2009, Jan Kara wrote:
> 
> >   Yeah, from what you write, it looks like USB enclosure is at fault (or it
> > could still be your USB controller but I doubt it).  It's still a bit
> > bothering that the error reported by the drive was not properly propagated
> > up to VFS. Either it's some block layer retry/ignore magic that I missed or
> > we ignore errors from block layer in some place.
> 
> Is there any interest in tracking this down?  It's not hard to find out
> what low-level errors are being reported and to generate them on demand
> with an emulated USB disk drive.
  Well, if you could provide me with the patch, I could try to track down
why the errors aren't propagated... It would be interesting because if it's
not some retry logic in block layer, it's a bug in VFS ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 16:09             ` Jan Kara
@ 2009-11-23 18:20               ` Alan Stern
  2009-11-23 18:54                 ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2009-11-23 18:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: tmhikaru, linux-kernel, linux-usb, Jens Axboe, linux-scsi,
	linux-ext4

On Mon, 23 Nov 2009, Jan Kara wrote:

> On Mon 23-11-09 10:06:03, Alan Stern wrote:
> > On Mon, 23 Nov 2009, Jan Kara wrote:
> > 
> > >   Yeah, from what you write, it looks like USB enclosure is at fault (or it
> > > could still be your USB controller but I doubt it).  It's still a bit
> > > bothering that the error reported by the drive was not properly propagated
> > > up to VFS. Either it's some block layer retry/ignore magic that I missed or
> > > we ignore errors from block layer in some place.
> > 
> > Is there any interest in tracking this down?  It's not hard to find out
> > what low-level errors are being reported and to generate them on demand
> > with an emulated USB disk drive.
>   Well, if you could provide me with the patch, I could try to track down
> why the errors aren't propagated... It would be interesting because if it's
> not some retry logic in block layer, it's a bug in VFS ;).

I can't provide a patch without first knowing what the errors are.  The 
way to find out is to use usbmon.  See Documentation/usb/usbmon.txt for 
instructions.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 18:20               ` Alan Stern
@ 2009-11-23 18:54                 ` Jan Kara
  2009-11-23 19:50                   ` tmhikaru
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2009-11-23 18:54 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jan Kara, tmhikaru, linux-kernel, linux-usb, Jens Axboe,
	linux-scsi, linux-ext4

On Mon 23-11-09 13:20:26, Alan Stern wrote:
> On Mon, 23 Nov 2009, Jan Kara wrote:
> 
> > On Mon 23-11-09 10:06:03, Alan Stern wrote:
> > > On Mon, 23 Nov 2009, Jan Kara wrote:
> > > 
> > > >   Yeah, from what you write, it looks like USB enclosure is at fault (or it
> > > > could still be your USB controller but I doubt it).  It's still a bit
> > > > bothering that the error reported by the drive was not properly propagated
> > > > up to VFS. Either it's some block layer retry/ignore magic that I missed or
> > > > we ignore errors from block layer in some place.
> > > 
> > > Is there any interest in tracking this down?  It's not hard to find out
> > > what low-level errors are being reported and to generate them on demand
> > > with an emulated USB disk drive.
> >   Well, if you could provide me with the patch, I could try to track down
> > why the errors aren't propagated... It would be interesting because if it's
> > not some retry logic in block layer, it's a bug in VFS ;).
> 
> I can't provide a patch without first knowing what the errors are.  The 
> way to find out is to use usbmon.  See Documentation/usb/usbmon.txt for 
> instructions.
  Ah, OK. The problem manifests itself as an error in SATA communication
(which in fact somehow happens over USB, but I don't really know the
details of mass storage over USB) so debugging USB would become actual only
if we found out it's really some bug in an usb stack. But so far the most
probable is just an error somewhere between the USB controller in the
enclosure and the drive itself. BTW: Is the data transferred over USB
checksummed?

									Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 18:54                 ` Jan Kara
@ 2009-11-23 19:50                   ` tmhikaru
  2009-11-23 20:06                     ` tmhikaru
  0 siblings, 1 reply; 26+ messages in thread
From: tmhikaru @ 2009-11-23 19:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-usb, Jens Axboe, linux-scsi, linux-ext4

On Mon, Nov 23, 2009 at 07:54:50PM +0100, Jan Kara wrote:
> On Mon 23-11-09 13:20:26, Alan Stern wrote:
> > On Mon, 23 Nov 2009, Jan Kara wrote:
> > 
> > > On Mon 23-11-09 10:06:03, Alan Stern wrote:
> > > > On Mon, 23 Nov 2009, Jan Kara wrote:
> > > > 
> > > > >   Yeah, from what you write, it looks like USB enclosure is at fault (or it
> > > > > could still be your USB controller but I doubt it).  It's still a bit
> > > > > bothering that the error reported by the drive was not properly propagated
> > > > > up to VFS. Either it's some block layer retry/ignore magic that I missed or
> > > > > we ignore errors from block layer in some place.
> > > > 
> > > > Is there any interest in tracking this down?  It's not hard to find out
> > > > what low-level errors are being reported and to generate them on demand
> > > > with an emulated USB disk drive.
> > >   Well, if you could provide me with the patch, I could try to track down
> > > why the errors aren't propagated... It would be interesting because if it's
> > > not some retry logic in block layer, it's a bug in VFS ;).
> > 
> > I can't provide a patch without first knowing what the errors are.  The 
> > way to find out is to use usbmon.  See Documentation/usb/usbmon.txt for 
> > instructions.
>   Ah, OK. The problem manifests itself as an error in SATA communication
> (which in fact somehow happens over USB, but I don't really know the
> details of mass storage over USB) so debugging USB would become actual only
> if we found out it's really some bug in an usb stack. But so far the most
> probable is just an error somewhere between the USB controller in the
> enclosure and the drive itself. BTW: Is the data transferred over USB
> checksummed?
> 
> 									Honza
	Mind that this is an ide hard drive in a usb enclosure, not sata.
The error reported is bizzare. I don't know if I'll be able to get the
enclosure to cause the error on demand, but before I try supergluing it,
I'll try to do whatever I need to do with usbmon so we can get the error
that's actually happening so you can do whatever you need to do to track it
down :)

Tim McGrath

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 19:50                   ` tmhikaru
@ 2009-11-23 20:06                     ` tmhikaru
  2009-11-23 20:33                       ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: tmhikaru @ 2009-11-23 20:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-usb, Jens Axboe, linux-scsi, linux-ext4

On Mon, Nov 23, 2009 at 02:50:22PM -0500, tm@ wrote:
> On Mon, Nov 23, 2009 at 07:54:50PM +0100, Jan Kara wrote:
> > On Mon 23-11-09 13:20:26, Alan Stern wrote:
> > > On Mon, 23 Nov 2009, Jan Kara wrote:
> > > 
> > > > On Mon 23-11-09 10:06:03, Alan Stern wrote:
> > > > > On Mon, 23 Nov 2009, Jan Kara wrote:
> > > > > 
> > > > > >   Yeah, from what you write, it looks like USB enclosure is at fault (or it
> > > > > > could still be your USB controller but I doubt it).  It's still a bit
> > > > > > bothering that the error reported by the drive was not properly propagated
> > > > > > up to VFS. Either it's some block layer retry/ignore magic that I missed or
> > > > > > we ignore errors from block layer in some place.
> > > > > 
> > > > > Is there any interest in tracking this down?  It's not hard to find out
> > > > > what low-level errors are being reported and to generate them on demand
> > > > > with an emulated USB disk drive.
> > > >   Well, if you could provide me with the patch, I could try to track down
> > > > why the errors aren't propagated... It would be interesting because if it's
> > > > not some retry logic in block layer, it's a bug in VFS ;).
> > > 
> > > I can't provide a patch without first knowing what the errors are.  The 
> > > way to find out is to use usbmon.  See Documentation/usb/usbmon.txt for 
> > > instructions.
> >   Ah, OK. The problem manifests itself as an error in SATA communication
> > (which in fact somehow happens over USB, but I don't really know the
> > details of mass storage over USB) so debugging USB would become actual only
> > if we found out it's really some bug in an usb stack. But so far the most
> > probable is just an error somewhere between the USB controller in the
> > enclosure and the drive itself. BTW: Is the data transferred over USB
> > checksummed?
> > 
> > 									Honza
> 	Mind that this is an ide hard drive in a usb enclosure, not sata.
> The error reported is bizzare. I don't know if I'll be able to get the
> enclosure to cause the error on demand, but before I try supergluing it,
> I'll try to do whatever I need to do with usbmon so we can get the error
> that's actually happening so you can do whatever you need to do to track it
> down :)
> 
> Tim McGrath
Uhh, this could be a problem. CONFIG_USB_MON=y is set in my .config BUT,
although I followed the directions and debugfs is mounted,

root@roll:/home/tm# mount
/dev/root on / type ext4 (rw,relatime,barrier=1,data=ordered,data_err=abort)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
usbfs on /proc/bus/usb type usbfs (rw)
none_debugs on /sys/kernel/debug type debugfs (rw)

I receive this message when I test if usbmon's directory exists:

root@roll:/home/tm# ls /sys/kernel/debug/usbmon
ls: cannot access /sys/kernel/debug/usbmon: No such file or directory

According to the kernel documentation usbmon should be loaded already. Help?

Timothy McGrath

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 20:06                     ` tmhikaru
@ 2009-11-23 20:33                       ` Alan Stern
  2009-11-23 23:42                         ` tmhikaru
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2009-11-23 20:33 UTC (permalink / raw)
  To: tmhikaru
  Cc: Jan Kara, linux-kernel, linux-usb, Jens Axboe, linux-scsi,
	linux-ext4

On Mon, 23 Nov 2009 tmhikaru@gmail.com wrote:

> Uhh, this could be a problem. CONFIG_USB_MON=y is set in my .config BUT,
> although I followed the directions and debugfs is mounted,
> 
> root@roll:/home/tm# mount
> /dev/root on / type ext4 (rw,relatime,barrier=1,data=ordered,data_err=abort)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw)
> usbfs on /proc/bus/usb type usbfs (rw)
> none_debugs on /sys/kernel/debug type debugfs (rw)
> 
> I receive this message when I test if usbmon's directory exists:
> 
> root@roll:/home/tm# ls /sys/kernel/debug/usbmon
> ls: cannot access /sys/kernel/debug/usbmon: No such file or directory
> 
> According to the kernel documentation usbmon should be loaded already. Help?

The documentation is slightly out of date.  The usbmon directory has 
been moved down a level, to /sys/kernel/debug/usb/usbmon.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 20:33                       ` Alan Stern
@ 2009-11-23 23:42                         ` tmhikaru
  2009-11-24 17:16                           ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: tmhikaru @ 2009-11-23 23:42 UTC (permalink / raw)
  To: Alan Stern
  Cc: tmhikaru, Jan Kara, linux-kernel, linux-usb, Jens Axboe,
	linux-scsi, linux-ext4

On Mon, Nov 23, 2009 at 03:33:56PM -0500, Alan Stern wrote:
> On Mon, 23 Nov 2009 tmhikaru@gmail.com wrote:
> 
> > Uhh, this could be a problem. CONFIG_USB_MON=y is set in my .config BUT,
> > although I followed the directions and debugfs is mounted,
> > 
> > root@roll:/home/tm# mount
> > /dev/root on / type ext4 (rw,relatime,barrier=1,data=ordered,data_err=abort)
> > proc on /proc type proc (rw)
> > sysfs on /sys type sysfs (rw)
> > usbfs on /proc/bus/usb type usbfs (rw)
> > none_debugs on /sys/kernel/debug type debugfs (rw)
> > 
> > I receive this message when I test if usbmon's directory exists:
> > 
> > root@roll:/home/tm# ls /sys/kernel/debug/usbmon
> > ls: cannot access /sys/kernel/debug/usbmon: No such file or directory
> > 
> > According to the kernel documentation usbmon should be loaded already. Help?
> 
> The documentation is slightly out of date.  The usbmon directory has 
> been moved down a level, to /sys/kernel/debug/usb/usbmon.
> 
> Alan Stern

Thank you. I've gotten output of it screwing up, but it's a 62MB file. I
don't think I'm allowed to send attachments here, nor do I know what I'm
supposed to be looking for in this output. So, instead I'm putting the
(compressed) file up on my home computer. If you've got suggestions for what
I should use instead, let me know.

http://hikaru.no-ip.info:3000/1.mon.out.xz

Although I don't think you'll need it, I've included the dmesg output of
what happened when I ran my backup script, just in case it helps at all.

Thanks for the help so far,
Tim McGrath

usb 1-6: new high speed USB device using ehci_hcd and address 2
usb 1-6: New USB device found, idVendor=0471, idProduct=0332
usb 1-6: New USB device strings: Mfr=2, Product=1, SerialNumber=0
usb 1-6: Product: Philips SPC 1000NC Webcam
usb 1-6: Manufacturer: Philips CE
usb 1-6: configuration #1 chosen from 1 choice
2:3:1: cannot get freq at ep 0x84
usbcore: registered new interface driver snd-usb-audio
usb 1-3: new high speed USB device using ehci_hcd and address 3
usb 1-3: New USB device found, idVendor=067b, idProduct=2506
usb 1-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-3: Product: Mass Storage Device
usb 1-3: Manufacturer: Prolific Technology Inc.
usb 1-3: SerialNumber: 0
usb 1-3: configuration #1 chosen from 1 choice
Initializing USB Mass Storage driver...
scsi0 : SCSI emulation for USB Mass Storage devices
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
usb-storage: device found at 3
usb-storage: waiting for device to settle before scanning
scsi 0:0:0:0: Direct-Access     ST325082 3A               3.06 PQ: 0 ANSI: 0
usb-storage: device scan complete
sd 0:0:0:0: [sda] 488397168 512-byte logical blocks: (250 GB/232 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 03 00 00 00
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 0:0:0:0: [sda] Assuming drive cache: write through
 sda: sda1
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
usb 1-6: USB disconnect, address 2
EXT4-fs (sda1): barriers enabled
kjournald2 starting: pid 9416, dev sda1:8, commit interval 5 seconds
EXT4-fs (sda1): internal journal on sda1:8
EXT4-fs (sda1): delayed allocation enabled
EXT4-fs: file extents enabled
EXT4-fs: mballoc enabled
EXT4-fs (sda1): mounted filesystem with ordered data mode
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
end_request: I/O error, dev sda, sector 44783
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
end_request: I/O error, dev sda, sector 51823

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-23 23:42                         ` tmhikaru
@ 2009-11-24 17:16                           ` Alan Stern
  2009-11-24 17:47                             ` Boaz Harrosh
  2009-11-24 18:00                             ` Jan Kara
  0 siblings, 2 replies; 26+ messages in thread
From: Alan Stern @ 2009-11-24 17:16 UTC (permalink / raw)
  To: tmhikaru
  Cc: Jan Kara, linux-kernel, linux-usb, Jens Axboe, linux-scsi,
	linux-ext4

On Mon, 23 Nov 2009 tmhikaru@gmail.com wrote:

> Thank you. I've gotten output of it screwing up, but it's a 62MB file. I
> don't think I'm allowed to send attachments here, nor do I know what I'm
> supposed to be looking for in this output. So, instead I'm putting the
> (compressed) file up on my home computer. If you've got suggestions for what
> I should use instead, let me know.
> 
> http://hikaru.no-ip.info:3000/1.mon.out.xz
> 
> Although I don't think you'll need it, I've included the dmesg output of
> what happened when I ran my backup script, just in case it helps at all.

Here's an annotated example of one of those hiccups:

f1aa1f00 2416018820 S Bo:1:003:1 -115 31 = 55534243 07050100 00100000 80000a28 000000ae ef000008 00000000 000000
f1aa1f00 2416018929 C Bo:1:003:1 0 31 >

The computer issued a READ command for 8 blocks (4096 bytes) starting
at block number 0x0000aeef = 44783.

d2588b00 2416019342 S Bi:1:003:2 -115 4096 <
d2588b00 2416019428 C Bi:1:003:2 -32 0
f1aa1f00 2416019435 S Co:1:003:0 s 02 01 0000 0082 0000 0
f1aa1f00 2416019554 C Co:1:003:0 0 0

The drive returned 0 bytes of data.

f1aa1f00 2416019560 S Bi:1:003:2 -115 13 <
f1aa1f00 2416019678 C Bi:1:003:2 0 13 = 55534253 07050100 00100000 00

And then it returned a status indicating no error but 4096 bytes
residue (i.e., incorrect or undelivered data).  This caused the
usb-storage driver to send the SCSI layer a result code of DID_ERROR
with no sense data.

> sd 0:0:0:0: [sda] Unhandled error code
> sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
> end_request: I/O error, dev sda, sector 44783

The DID_ERROR code caused the SCSI layer to display this error message.  

> sd 0:0:0:0: [sda] Unhandled error code
> sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
> end_request: I/O error, dev sda, sector 51823

I would have expected the READ to be retried, but in these two cases 
it wasn't.  The usbmon log contained five instances of this error 
sequence; the other three were retried successfully.  I don't know what 
the difference was.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 17:16                           ` Alan Stern
@ 2009-11-24 17:47                             ` Boaz Harrosh
  2009-11-24 19:28                               ` Alan Stern
  2009-11-24 18:00                             ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2009-11-24 17:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: tmhikaru, Jan Kara, linux-kernel, linux-usb, Jens Axboe,
	linux-scsi, linux-ext4

On 11/24/2009 07:16 PM, Alan Stern wrote:
> On Mon, 23 Nov 2009 tmhikaru@gmail.com wrote:
> 
>> Thank you. I've gotten output of it screwing up, but it's a 62MB file. I
>> don't think I'm allowed to send attachments here, nor do I know what I'm
>> supposed to be looking for in this output. So, instead I'm putting the
>> (compressed) file up on my home computer. If you've got suggestions for what
>> I should use instead, let me know.
>>
>> http://hikaru.no-ip.info:3000/1.mon.out.xz
>>
>> Although I don't think you'll need it, I've included the dmesg output of
>> what happened when I ran my backup script, just in case it helps at all.
> 
> Here's an annotated example of one of those hiccups:
> 
> f1aa1f00 2416018820 S Bo:1:003:1 -115 31 = 55534243 07050100 00100000 80000a28 000000ae ef000008 00000000 000000
> f1aa1f00 2416018929 C Bo:1:003:1 0 31 >
> 
> The computer issued a READ command for 8 blocks (4096 bytes) starting
> at block number 0x0000aeef = 44783.
> 
> d2588b00 2416019342 S Bi:1:003:2 -115 4096 <
> d2588b00 2416019428 C Bi:1:003:2 -32 0
> f1aa1f00 2416019435 S Co:1:003:0 s 02 01 0000 0082 0000 0
> f1aa1f00 2416019554 C Co:1:003:0 0 0
> 
> The drive returned 0 bytes of data.
> 
> f1aa1f00 2416019560 S Bi:1:003:2 -115 13 <
> f1aa1f00 2416019678 C Bi:1:003:2 0 13 = 55534253 07050100 00100000 00
> 
> And then it returned a status indicating no error but 4096 bytes
> residue (i.e., incorrect or undelivered data).  This caused the
> usb-storage driver to send the SCSI layer a result code of DID_ERROR
> with no sense data.
> 
>> sd 0:0:0:0: [sda] Unhandled error code
>> sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
>> end_request: I/O error, dev sda, sector 44783
> 
> The DID_ERROR code caused the SCSI layer to display this error message.  
> 
>> sd 0:0:0:0: [sda] Unhandled error code
>> sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
>> end_request: I/O error, dev sda, sector 51823
> 
> I would have expected the READ to be retried, but in these two cases 
> it wasn't.  The usbmon log contained five instances of this error 
> sequence; the other three were retried successfully.  I don't know what 
> the difference was.
> 

Perhaps the time it took to complete.

I have a very old IDE disk connected to a USB box here, and some bad sectors take ages
to return. One thing I wanted to investigate is why the complete Linux system is frozen
for minutes when I "cp" one of these bad-sectors and then every thing is back to normal.
It's just an inserted external box. Swap system and everything is on an another healthy HD.
As if the USB controller actually locks the PCI bus, or the interrupts are off for a long
while. (Or is it the BKL?)

Do you have time stamps on these?

> Alan Stern
> 

Boaz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 17:47                             ` Boaz Harrosh
@ 2009-11-24 19:28                               ` Alan Stern
  2009-11-24 19:56                                 ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2009-11-24 19:28 UTC (permalink / raw)
  To: Jan Kara, Boaz Harrosh
  Cc: tmhikaru, Kernel development list, USB list, Jens Axboe,
	SCSI development list, linux-ext4

On Tue, 24 Nov 2009, Boaz Harrosh wrote:

> > I would have expected the READ to be retried, but in these two cases 
> > it wasn't.  The usbmon log contained five instances of this error 
> > sequence; the other three were retried successfully.  I don't know what 
> > the difference was.
> > 
> 
> Perhaps the time it took to complete.
> 
> I have a very old IDE disk connected to a USB box here, and some bad sectors take ages
> to return. One thing I wanted to investigate is why the complete Linux system is frozen
> for minutes when I "cp" one of these bad-sectors and then every thing is back to normal.
> It's just an inserted external box. Swap system and everything is on an another healthy HD.
> As if the USB controller actually locks the PCI bus, or the interrupts are off for a long
> while. (Or is it the BKL?)
> 
> Do you have time stamps on these?

Yes.  Each of the unsuccessful reads, whether retried or not, lasted
less than one millisecond.  So that's probably not the reason.


On Tue, 24 Nov 2009, Jan Kara wrote:

>   My naive guess would be that those non-retried reads were actually
> readahead. That's not retried if I remember correctly. Later, when we
> really needed the data, we sent another read request...

That would be my guess too.  I don't know how to verify it, though.

If you're interested in pursuing this farther, I can show you how to 
generate equivalent errors on demand using an emulated USB drive.  
At this point it's not clear how much more one could learn by doing 
this, however.

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 19:28                               ` Alan Stern
@ 2009-11-24 19:56                                 ` Jan Kara
  2009-11-24 20:13                                   ` Alan Stern
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2009-11-24 19:56 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jan Kara, Boaz Harrosh, tmhikaru, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Tue 24-11-09 14:28:35, Alan Stern wrote:
> On Tue, 24 Nov 2009, Jan Kara wrote:
> 
> >   My naive guess would be that those non-retried reads were actually
> > readahead. That's not retried if I remember correctly. Later, when we
> > really needed the data, we sent another read request...
> 
> That would be my guess too.  I don't know how to verify it, though.
> 
> If you're interested in pursuing this farther, I can show you how to 
> generate equivalent errors on demand using an emulated USB drive.  
> At this point it's not clear how much more one could learn by doing 
> this, however.
  After digging in block layer code, it's as we suspected:
In case of host error DID_ERROR (which is our case), scsi request is
retried iff it is not a FAILFAST request which is set if bio is doing
readahead... So this is explained and everything behaves as it should.
Thanks everybody involved :).

								Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 19:56                                 ` Jan Kara
@ 2009-11-24 20:13                                   ` Alan Stern
  2009-11-24 20:39                                     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Stern @ 2009-11-24 20:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Boaz Harrosh, tmhikaru, Kernel development list, USB list,
	Jens Axboe, SCSI development list, linux-ext4

On Tue, 24 Nov 2009, Jan Kara wrote:

>   After digging in block layer code, it's as we suspected:
> In case of host error DID_ERROR (which is our case), scsi request is
> retried iff it is not a FAILFAST request which is set if bio is doing
> readahead... So this is explained and everything behaves as it should.
> Thanks everybody involved :).

Okay, very good.  There remains the question of the disturbing error
messages in the system log.  Should they be supressed for FAILFAST
requests?

Alan Stern


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 20:13                                   ` Alan Stern
@ 2009-11-24 20:39                                     ` Jan Kara
  2009-11-24 21:50                                       ` tmhikaru
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2009-11-24 20:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jan Kara, Boaz Harrosh, tmhikaru, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Tue 24-11-09 15:13:01, Alan Stern wrote:
> On Tue, 24 Nov 2009, Jan Kara wrote:
> 
> >   After digging in block layer code, it's as we suspected:
> > In case of host error DID_ERROR (which is our case), scsi request is
> > retried iff it is not a FAILFAST request which is set if bio is doing
> > readahead... So this is explained and everything behaves as it should.
> > Thanks everybody involved :).
> 
> Okay, very good.  There remains the question of the disturbing error
> messages in the system log.  Should they be supressed for FAILFAST
> requests?
  I think it's useful they are there because ultimately, something really
went wrong and you should better investigate. BTW, "end_request: I/O error"
messages are in the log even for requests where we retried and succeeded...

								Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 20:39                                     ` Jan Kara
@ 2009-11-24 21:50                                       ` tmhikaru
  2009-11-24 22:23                                         ` tmhikaru
  0 siblings, 1 reply; 26+ messages in thread
From: tmhikaru @ 2009-11-24 21:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alan Stern, Boaz Harrosh, tmhikaru, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Tue, Nov 24, 2009 at 09:39:44PM +0100, Jan Kara wrote:
> On Tue 24-11-09 15:13:01, Alan Stern wrote:
> > On Tue, 24 Nov 2009, Jan Kara wrote:
> > 
> > >   After digging in block layer code, it's as we suspected:
> > > In case of host error DID_ERROR (which is our case), scsi request is
> > > retried iff it is not a FAILFAST request which is set if bio is doing
> > > readahead... So this is explained and everything behaves as it should.
> > > Thanks everybody involved :).
> > 
> > Okay, very good.  There remains the question of the disturbing error
> > messages in the system log.  Should they be supressed for FAILFAST
> > requests?
>   I think it's useful they are there because ultimately, something really
> went wrong and you should better investigate. BTW, "end_request: I/O error"
> messages are in the log even for requests where we retried and succeeded...
> 
> 								Honza

While I agree it is useful information, I think that if the error messages
are going to be printed, you should *also* print that this is a NON FATAL
error and that it's going to be retried. It'd help diagnosing the path it's
following through the failure code IMHO as well as not making users
completely freak out like I did in my case. It is *not* particularly obvious
given the message printed to syslog what is going wrong or why.

	Just my opinion,
	Tim McGrath

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 21:50                                       ` tmhikaru
@ 2009-11-24 22:23                                         ` tmhikaru
  2009-11-25  8:42                                           ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: tmhikaru @ 2009-11-24 22:23 UTC (permalink / raw)
  To: tmhikaru
  Cc: Jan Kara, Alan Stern, Boaz Harrosh, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Tue, Nov 24, 2009 at 04:50:44PM -0500, tmhikaru@gmail.com wrote:
> On Tue, Nov 24, 2009 at 09:39:44PM +0100, Jan Kara wrote:
> > On Tue 24-11-09 15:13:01, Alan Stern wrote:
> > > On Tue, 24 Nov 2009, Jan Kara wrote:
> > > 
> > > >   After digging in block layer code, it's as we suspected:
> > > > In case of host error DID_ERROR (which is our case), scsi request is
> > > > retried iff it is not a FAILFAST request which is set if bio is doing
> > > > readahead... So this is explained and everything behaves as it should.
> > > > Thanks everybody involved :).
> > > 
> > > Okay, very good.  There remains the question of the disturbing error
> > > messages in the system log.  Should they be supressed for FAILFAST
> > > requests?
> >   I think it's useful they are there because ultimately, something really
> > went wrong and you should better investigate. BTW, "end_request: I/O error"
> > messages are in the log even for requests where we retried and succeeded...
> > 
> > 								Honza
> 
> While I agree it is useful information, I think that if the error messages
> are going to be printed, you should *also* print that this is a NON FATAL
> error and that it's going to be retried. It'd help diagnosing the path it's
> following through the failure code IMHO as well as not making users
> completely freak out like I did in my case. It is *not* particularly obvious
> given the message printed to syslog what is going wrong or why.
> 
> 	Just my opinion,
> 	Tim McGrath

	I should have asked since I'm here at the moment - do you need any
more information out of the buggy USB enclosure at the moment, or can I work
on trying to fix/replace it now?

Tim McGrath

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 22:23                                         ` tmhikaru
@ 2009-11-25  8:42                                           ` Jan Kara
  2009-11-25  9:37                                             ` tmhikaru
  2009-11-25 16:10                                             ` Alan Stern
  0 siblings, 2 replies; 26+ messages in thread
From: Jan Kara @ 2009-11-25  8:42 UTC (permalink / raw)
  To: tmhikaru
  Cc: Jan Kara, Alan Stern, Boaz Harrosh, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Tue 24-11-09 17:23:34, tmhikaru@gmail.com wrote:
> On Tue, Nov 24, 2009 at 04:50:44PM -0500, tmhikaru@gmail.com wrote:
> > On Tue, Nov 24, 2009 at 09:39:44PM +0100, Jan Kara wrote:
> > > On Tue 24-11-09 15:13:01, Alan Stern wrote:
> > > > On Tue, 24 Nov 2009, Jan Kara wrote:
> > > > 
> > > > >   After digging in block layer code, it's as we suspected:
> > > > > In case of host error DID_ERROR (which is our case), scsi request is
> > > > > retried iff it is not a FAILFAST request which is set if bio is doing
> > > > > readahead... So this is explained and everything behaves as it should.
> > > > > Thanks everybody involved :).
> > > > 
> > > > Okay, very good.  There remains the question of the disturbing error
> > > > messages in the system log.  Should they be supressed for FAILFAST
> > > > requests?
> > >   I think it's useful they are there because ultimately, something really
> > > went wrong and you should better investigate. BTW, "end_request: I/O error"
> > > messages are in the log even for requests where we retried and succeeded...
> > > 
> > > 								Honza
> > 
> > While I agree it is useful information, I think that if the error messages
> > are going to be printed, you should *also* print that this is a NON FATAL
> > error and that it's going to be retried. It'd help diagnosing the path it's
> > following through the failure code IMHO as well as not making users
> > completely freak out like I did in my case. It is *not* particularly obvious
> > given the message printed to syslog what is going wrong or why.
  Yeah, we might make it more obvious that read failed and whether or not
we are going to retry. Just technically it's not so simple because a
different layer prints messages about errors (generic block layer) and
different (scsi disk driver) decides what to do (retry, don't retry, ...).

> 	I should have asked since I'm here at the moment - do you need any
> more information out of the buggy USB enclosure at the moment, or can I work
> on trying to fix/replace it now?
  No, feel free to do anything with it :). Thanks for your help with
debugging this.

									Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-25  8:42                                           ` Jan Kara
@ 2009-11-25  9:37                                             ` tmhikaru
  2009-11-25 16:10                                             ` Alan Stern
  1 sibling, 0 replies; 26+ messages in thread
From: tmhikaru @ 2009-11-25  9:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: tmhikaru, Alan Stern, Boaz Harrosh, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Wed, Nov 25, 2009 at 09:42:41AM +0100, Jan Kara wrote:
>   No, feel free to do anything with it :). Thanks for your help with
> debugging this.

Thank *you* for giving me the initial reply. Without someone telling me they
were at least listening I may have given up entirely.

	Tim McGrath

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-25  8:42                                           ` Jan Kara
  2009-11-25  9:37                                             ` tmhikaru
@ 2009-11-25 16:10                                             ` Alan Stern
  2009-11-27  9:43                                               ` tmhikaru
  1 sibling, 1 reply; 26+ messages in thread
From: Alan Stern @ 2009-11-25 16:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: tmhikaru, Boaz Harrosh, Kernel development list, USB list,
	Jens Axboe, SCSI development list, linux-ext4

On Wed, 25 Nov 2009, Jan Kara wrote:

> > > > > Okay, very good.  There remains the question of the disturbing error
> > > > > messages in the system log.  Should they be supressed for FAILFAST
> > > > > requests?
> > > >   I think it's useful they are there because ultimately, something really
> > > > went wrong and you should better investigate. BTW, "end_request: I/O error"
> > > > messages are in the log even for requests where we retried and succeeded...

That isn't true.  Take a look at the dmesg log accompanying Tim's 
usbmon log.  Although there were 5 read errors in the usbmon log, there 
were only 2 I/O error messages in dmesg, corresponding to the 2 reads 
that weren't retried successfully.

Personally, I think it makes little sense to print error messages in 
the system log for commands where retries are disallowed.  Unless we go 
ahead and print error messages for _all_ failures, including those 
which are retried successfully.

Perhaps a good compromise would be to set the REQ_QUIET flag in 
req->cmd_flags for readaheads.  That would suppress the error messages 
coming from the SCSI core.

>   Yeah, we might make it more obvious that read failed and whether or not
> we are going to retry. Just technically it's not so simple because a
> different layer prints messages about errors (generic block layer) and
> different (scsi disk driver) decides what to do (retry, don't retry, ...).

Actually the retry decisions (or many of them) are made by the SCSI 
core, and that's also where some of those error messages come from.

> > 	I should have asked since I'm here at the moment - do you need any
> > more information out of the buggy USB enclosure at the moment, or can I work
> > on trying to fix/replace it now?
>   No, feel free to do anything with it :). Thanks for your help with
> debugging this.

To clarify, the enclosure isn't really very buggy.  It _should_ have
carried out the failed commands, or if it had a valid reason for not
doing so then it _should_ have reported the reason.  Regardless, the
errors that occurred were harmless because they went away when the
commands were retried.  (Although if they weren't harmless, you
wouldn't be able to tell just from reading the system log...)

Alan Stern

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-25 16:10                                             ` Alan Stern
@ 2009-11-27  9:43                                               ` tmhikaru
  2009-11-27 13:13                                                 ` Jan Kara
  2009-11-27 17:58                                                 ` Alan Stern
  0 siblings, 2 replies; 26+ messages in thread
From: tmhikaru @ 2009-11-27  9:43 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jan Kara, tmhikaru, Boaz Harrosh, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Wed, Nov 25, 2009 at 11:10:48AM -0500, Alan Stern wrote:
> On Wed, 25 Nov 2009, Jan Kara wrote:
> 
> > > > > > Okay, very good.  There remains the question of the disturbing error
> > > > > > messages in the system log.  Should they be supressed for FAILFAST
> > > > > > requests?
> > > > >   I think it's useful they are there because ultimately, something really
> > > > > went wrong and you should better investigate. BTW, "end_request: I/O error"
> > > > > messages are in the log even for requests where we retried and succeeded...
> 
> That isn't true.  Take a look at the dmesg log accompanying Tim's 
> usbmon log.  Although there were 5 read errors in the usbmon log, there 
> were only 2 I/O error messages in dmesg, corresponding to the 2 reads 
> that weren't retried successfully.
> 
> Personally, I think it makes little sense to print error messages in 
> the system log for commands where retries are disallowed.  Unless we go 
> ahead and print error messages for _all_ failures, including those 
> which are retried successfully.
> 
> Perhaps a good compromise would be to set the REQ_QUIET flag in 
> req->cmd_flags for readaheads.  That would suppress the error messages 
> coming from the SCSI core.
> 
> >   Yeah, we might make it more obvious that read failed and whether or not
> > we are going to retry. Just technically it's not so simple because a
> > different layer prints messages about errors (generic block layer) and
> > different (scsi disk driver) decides what to do (retry, don't retry, ...).
> 
> Actually the retry decisions (or many of them) are made by the SCSI 
> core, and that's also where some of those error messages come from.
> 
> > > 	I should have asked since I'm here at the moment - do you need any
> > > more information out of the buggy USB enclosure at the moment, or can I work
> > > on trying to fix/replace it now?
> >   No, feel free to do anything with it :). Thanks for your help with
> > debugging this.
> 
> To clarify, the enclosure isn't really very buggy.  It _should_ have
> carried out the failed commands, or if it had a valid reason for not
> doing so then it _should_ have reported the reason.  Regardless, the
> errors that occurred were harmless because they went away when the
> commands were retried.  (Although if they weren't harmless, you
> wouldn't be able to tell just from reading the system log...)
> 
> Alan Stern

Okay. Okay. Back up a moment here - Clarify a little. I have the filesystem
set to remount readonly on errors. I have not seen any filesystem
corruption or file corruption I could find. The filesystem *was* remounting
readonly under 2.6.31.5, but has not since .6 came out. (and I reformatted
and redid the entire backup under 2.6.31.6 without errors)

How do I know when it has generated an actual failure that was not
corrected?

How do I know when errors have been detected but they were corrected?

I'm guessing in the former, it'll remount ro, and in the latter it won't. Am
I correct?

I would like to save some money and not trash the usb enclosure... At the
same time, I don't want to use an enclosure that's trashing my data.

It is important to me to know exactly how the failure path operates. Please
explain to me what I will see happen. - Not knowing is driving me nuts.

Thank you,
Tim McGrath

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-27  9:43                                               ` tmhikaru
@ 2009-11-27 13:13                                                 ` Jan Kara
  2009-11-27 17:58                                                 ` Alan Stern
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2009-11-27 13:13 UTC (permalink / raw)
  To: tmhikaru
  Cc: Alan Stern, Jan Kara, Boaz Harrosh, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

On Fri 27-11-09 04:43:39, tmhikaru@gmail.com wrote:
> On Wed, Nov 25, 2009 at 11:10:48AM -0500, Alan Stern wrote:
> > On Wed, 25 Nov 2009, Jan Kara wrote:
> > 
> > > > > > > Okay, very good.  There remains the question of the disturbing error
> > > > > > > messages in the system log.  Should they be supressed for FAILFAST
> > > > > > > requests?
> > > > > >   I think it's useful they are there because ultimately, something really
> > > > > > went wrong and you should better investigate. BTW, "end_request: I/O error"
> > > > > > messages are in the log even for requests where we retried and succeeded...
> > 
> > That isn't true.  Take a look at the dmesg log accompanying Tim's 
> > usbmon log.  Although there were 5 read errors in the usbmon log, there 
> > were only 2 I/O error messages in dmesg, corresponding to the 2 reads 
> > that weren't retried successfully.
> > 
> > Personally, I think it makes little sense to print error messages in 
> > the system log for commands where retries are disallowed.  Unless we go 
> > ahead and print error messages for _all_ failures, including those 
> > which are retried successfully.
> > 
> > Perhaps a good compromise would be to set the REQ_QUIET flag in 
> > req->cmd_flags for readaheads.  That would suppress the error messages 
> > coming from the SCSI core.
> > 
> > >   Yeah, we might make it more obvious that read failed and whether or not
> > > we are going to retry. Just technically it's not so simple because a
> > > different layer prints messages about errors (generic block layer) and
> > > different (scsi disk driver) decides what to do (retry, don't retry, ...).
> > 
> > Actually the retry decisions (or many of them) are made by the SCSI 
> > core, and that's also where some of those error messages come from.
> > 
> > > > 	I should have asked since I'm here at the moment - do you need any
> > > > more information out of the buggy USB enclosure at the moment, or can I work
> > > > on trying to fix/replace it now?
> > >   No, feel free to do anything with it :). Thanks for your help with
> > > debugging this.
> > 
> > To clarify, the enclosure isn't really very buggy.  It _should_ have
> > carried out the failed commands, or if it had a valid reason for not
> > doing so then it _should_ have reported the reason.  Regardless, the
> > errors that occurred were harmless because they went away when the
> > commands were retried.  (Although if they weren't harmless, you
> > wouldn't be able to tell just from reading the system log...)
> > 
> > Alan Stern
> 
> Okay. Okay. Back up a moment here - Clarify a little. I have the filesystem
> set to remount readonly on errors. I have not seen any filesystem
> corruption or file corruption I could find. The filesystem *was* remounting
> readonly under 2.6.31.5, but has not since .6 came out. (and I reformatted
> and redid the entire backup under 2.6.31.6 without errors)
> 
> How do I know when it has generated an actual failure that was not
> corrected?
> 
> How do I know when errors have been detected but they were corrected?
> 
> I'm guessing in the former, it'll remount ro, and in the latter it won't. Am
> I correct?
  So if just the file data cannot be read / written, you see messages
about IO errors in the log (as you see now), some application might get EIO
error but the filesystem will not be remounted. It gets remounted only if
some filesystem metadata cannot be read / written.
  It's true that retrying commands usually succeeds but not always and thus
I find it possible that eventually it fails several times in a row so that
kernel just gives up and discards data block to write because it thinks it
has no way of writing it. So I personally would ask for a warranty exchange
of the enclosure...

									Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-27  9:43                                               ` tmhikaru
  2009-11-27 13:13                                                 ` Jan Kara
@ 2009-11-27 17:58                                                 ` Alan Stern
  2009-11-29  4:30                                                   ` tmhikaru
  1 sibling, 1 reply; 26+ messages in thread
From: Alan Stern @ 2009-11-27 17:58 UTC (permalink / raw)
  To: tmhikaru
  Cc: Jan Kara, Boaz Harrosh, Kernel development list, USB list,
	Jens Axboe, SCSI development list, linux-ext4

On Fri, 27 Nov 2009 tmhikaru@gmail.com wrote:

> It is important to me to know exactly how the failure path operates. Please
> explain to me what I will see happen. - Not knowing is driving me nuts.

It goes like this.  The computer sends a lot of READ commands to the
drive.  (For all we know the same thing might happen with WRITEs, but
you didn't do any writing in the test data you sent.)  Every now and
then the drive fails to carry out a READ, for no apparent reason.

Normally the computer then retries, and the READ succeeds the second
time.  But of course, success isn't guaranteed.  When the retry does
succeed, no error messages are printed in the log and everything
continues normally.

Occasionally the computer does not retry the READ -- in circumstances
where it doesn't really need the data (optimistic readahead).  When 
this happens, the failed READ does cause an error message to appear.  
The same thing would happen if the attempted retries were to fail.

Whether or not this would result in lost or corrupted data, or
remounted readonly filesystems, depends on the kind of data being read.

Alan Stern

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-27 17:58                                                 ` Alan Stern
@ 2009-11-29  4:30                                                   ` tmhikaru
  0 siblings, 0 replies; 26+ messages in thread
From: tmhikaru @ 2009-11-29  4:30 UTC (permalink / raw)
  To: Alan Stern
  Cc: tmhikaru, Jan Kara, Boaz Harrosh, Kernel development list,
	USB list, Jens Axboe, SCSI development list, linux-ext4

	Thank you for the time and effort you had to spend on this, all of
you, I really appreciate all the hard work spent diagnosing my hardware
issues as well as explaining what the various error messages were coming
from.


	Hope to see you next time,
	Tim McGrath

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Weird I/O errors with USB hard drive not remounting filesystem readonly
  2009-11-24 17:16                           ` Alan Stern
  2009-11-24 17:47                             ` Boaz Harrosh
@ 2009-11-24 18:00                             ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2009-11-24 18:00 UTC (permalink / raw)
  To: Alan Stern
  Cc: tmhikaru, Jan Kara, linux-kernel, linux-usb, Jens Axboe,
	linux-scsi, linux-ext4

On Tue 24-11-09 12:16:17, Alan Stern wrote:
> On Mon, 23 Nov 2009 tmhikaru@gmail.com wrote:
> 
> > Thank you. I've gotten output of it screwing up, but it's a 62MB file. I
> > don't think I'm allowed to send attachments here, nor do I know what I'm
> > supposed to be looking for in this output. So, instead I'm putting the
> > (compressed) file up on my home computer. If you've got suggestions for what
> > I should use instead, let me know.
> > 
> > http://hikaru.no-ip.info:3000/1.mon.out.xz
> > 
> > Although I don't think you'll need it, I've included the dmesg output of
> > what happened when I ran my backup script, just in case it helps at all.
> 
> Here's an annotated example of one of those hiccups:
> 
> f1aa1f00 2416018820 S Bo:1:003:1 -115 31 = 55534243 07050100 00100000 80000a28 000000ae ef000008 00000000 000000
> f1aa1f00 2416018929 C Bo:1:003:1 0 31 >
> 
> The computer issued a READ command for 8 blocks (4096 bytes) starting
> at block number 0x0000aeef = 44783.
> 
> d2588b00 2416019342 S Bi:1:003:2 -115 4096 <
> d2588b00 2416019428 C Bi:1:003:2 -32 0
> f1aa1f00 2416019435 S Co:1:003:0 s 02 01 0000 0082 0000 0
> f1aa1f00 2416019554 C Co:1:003:0 0 0
> 
> The drive returned 0 bytes of data.
> 
> f1aa1f00 2416019560 S Bi:1:003:2 -115 13 <
> f1aa1f00 2416019678 C Bi:1:003:2 0 13 = 55534253 07050100 00100000 00
> 
> And then it returned a status indicating no error but 4096 bytes
> residue (i.e., incorrect or undelivered data).  This caused the
> usb-storage driver to send the SCSI layer a result code of DID_ERROR
> with no sense data.
> 
> > sd 0:0:0:0: [sda] Unhandled error code
> > sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
> > end_request: I/O error, dev sda, sector 44783
> 
> The DID_ERROR code caused the SCSI layer to display this error message.  
> 
> > sd 0:0:0:0: [sda] Unhandled error code
> > sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
> > end_request: I/O error, dev sda, sector 51823
> 
> I would have expected the READ to be retried, but in these two cases 
> it wasn't.  The usbmon log contained five instances of this error 
> sequence; the other three were retried successfully.  I don't know what 
> the difference was.
  My naive guess would be that those non-retried reads were actually
readahead. That's not retried if I remember correctly. Later, when we
really needed the data, we sent another read request...

								Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2009-11-29  4:31 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20091113050907.GA4965@roll>
     [not found] ` <20091119160700.GC2943@atrey.karlin.mff.cuni.cz>
     [not found]   ` <20091120082359.GA29538@roll>
     [not found]     ` <20091120094641.GB15422@duck.suse.cz>
2009-11-23  8:09       ` Weird I/O errors with USB hard drive not remounting filesystem readonly tmhikaru
2009-11-23 10:54         ` Jan Kara
2009-11-23 15:06           ` Alan Stern
2009-11-23 16:09             ` Jan Kara
2009-11-23 18:20               ` Alan Stern
2009-11-23 18:54                 ` Jan Kara
2009-11-23 19:50                   ` tmhikaru
2009-11-23 20:06                     ` tmhikaru
2009-11-23 20:33                       ` Alan Stern
2009-11-23 23:42                         ` tmhikaru
2009-11-24 17:16                           ` Alan Stern
2009-11-24 17:47                             ` Boaz Harrosh
2009-11-24 19:28                               ` Alan Stern
2009-11-24 19:56                                 ` Jan Kara
2009-11-24 20:13                                   ` Alan Stern
2009-11-24 20:39                                     ` Jan Kara
2009-11-24 21:50                                       ` tmhikaru
2009-11-24 22:23                                         ` tmhikaru
2009-11-25  8:42                                           ` Jan Kara
2009-11-25  9:37                                             ` tmhikaru
2009-11-25 16:10                                             ` Alan Stern
2009-11-27  9:43                                               ` tmhikaru
2009-11-27 13:13                                                 ` Jan Kara
2009-11-27 17:58                                                 ` Alan Stern
2009-11-29  4:30                                                   ` tmhikaru
2009-11-24 18:00                             ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox