* raid1 issue after disk failure: both disks of the array are still active
@ 2012-09-13 10:01 Niccolò Belli
2012-09-13 10:34 ` Robin Hill
0 siblings, 1 reply; 27+ messages in thread
From: Niccolò Belli @ 2012-09-13 10:01 UTC (permalink / raw)
To: linux-raid
Hi,
I have a raid1 array with two disks, distro is Squeeze amd64. /dev/sda
is slowly dying, here is a snippet of "smartctl -a /dev/sda":
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always  - 2
198 Offline_Uncorrectable  0x0030 100 100 000 Old_age Offline - 1
The bad sector is in the second half-MB of the first partition
(/dev/sda1). In fact, running "dd if=/dev/sda1 of=/dev/null bs=524228
count=1 skip=1" produces the following dd output and /var/log/syslog
entries:
root@asterisk:~# dd if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1
0+1 records in
0+1 records out
430140 bytes (430 kB) copied, 11.7265 s, 36.7 kB/s
Sep 12 22:15:02 asterisk kernel: [ 8921.561978] dd: sending ioctl
80306d02 to a partition!
Sep 12 22:15:02 asterisk kernel: [ 8921.561986] dd: sending ioctl
80306d02 to a partition!
Sep 12 22:15:03 asterisk kernel: [ 8922.529099] ata3.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:03 asterisk kernel: [ 8922.531774] ata3.00: BMDMA stat 0x44
Sep 12 22:15:03 asterisk kernel: [ 8922.533547] ata3.00: failed command:
READ DMA
Sep 12 22:15:03 asterisk kernel: [ 8922.535313] ata3.00: cmd
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:03 asterisk kernel: [ 8922.535316] res
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:03 asterisk kernel: [ 8922.538891] ata3.00: status: { DRDY
ERR }
Sep 12 22:15:03 asterisk kernel: [ 8922.540675] ata3.00: error: { UNC }
Sep 12 22:15:04 asterisk kernel: [ 8923.508206] ata3.00: configured for
UDMA/133
Sep 12 22:15:04 asterisk kernel: [ 8923.508220] ata3: EH complete
Sep 12 22:15:05 asterisk kernel: [ 8924.469512] ata3.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:05 asterisk kernel: [ 8924.472323] ata3.00: BMDMA stat 0x44
Sep 12 22:15:05 asterisk kernel: [ 8924.475260] ata3.00: failed command:
READ DMA
Sep 12 22:15:05 asterisk kernel: [ 8924.477023] ata3.00: cmd
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:05 asterisk kernel: [ 8924.477025] res
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:05 asterisk kernel: [ 8924.480595] ata3.00: status: { DRDY
ERR }
Sep 12 22:15:05 asterisk kernel: [ 8924.482370] ata3.00: error: { UNC }
Sep 12 22:15:06 asterisk kernel: [ 8925.452209] ata3.00: configured for
UDMA/133
Sep 12 22:15:06 asterisk kernel: [ 8925.452224] ata3: EH complete
Sep 12 22:15:07 asterisk kernel: [ 8926.418504] ata3.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:07 asterisk kernel: [ 8926.420741] ata3.00: BMDMA stat 0x44
Sep 12 22:15:07 asterisk kernel: [ 8926.422486] ata3.00: failed command:
READ DMA
Sep 12 22:15:07 asterisk kernel: [ 8926.424279] ata3.00: cmd
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:07 asterisk kernel: [ 8926.424281] res
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:07 asterisk kernel: [ 8926.427861] ata3.00: status: { DRDY
ERR }
Sep 12 22:15:07 asterisk kernel: [ 8926.429660] ata3.00: error: { UNC }
Sep 12 22:15:08 asterisk kernel: [ 8927.396270] ata3.00: configured for
UDMA/133
Sep 12 22:15:08 asterisk kernel: [ 8927.396285] ata3: EH complete
Sep 12 22:15:09 asterisk kernel: [ 8928.359173] ata3.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:09 asterisk kernel: [ 8928.361647] ata3.00: BMDMA stat 0x44
Sep 12 22:15:09 asterisk kernel: [ 8928.364273] ata3.00: failed command:
READ DMA
Sep 12 22:15:09 asterisk kernel: [ 8928.366028] ata3.00: cmd
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:09 asterisk kernel: [ 8928.366030] res
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:09 asterisk kernel: [ 8928.369643] ata3.00: status: { DRDY
ERR }
Sep 12 22:15:09 asterisk kernel: [ 8928.371420] ata3.00: error: { UNC }
Sep 12 22:15:10 asterisk kernel: [ 8929.340218] ata3.00: configured for
UDMA/133
Sep 12 22:15:10 asterisk kernel: [ 8929.340233] ata3: EH complete
Sep 12 22:15:11 asterisk kernel: [ 8930.332648] ata3.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:11 asterisk kernel: [ 8930.334453] ata3.00: BMDMA stat 0x44
Sep 12 22:15:11 asterisk kernel: [ 8930.336245] ata3.00: failed command:
READ DMA
Sep 12 22:15:11 asterisk kernel: [ 8930.337995] ata3.00: cmd
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:11 asterisk kernel: [ 8930.337998] res
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:11 asterisk kernel: [ 8930.341583] ata3.00: status: { DRDY
ERR }
Sep 12 22:15:11 asterisk kernel: [ 8930.343360] ata3.00: error: { UNC }
Sep 12 22:15:12 asterisk kernel: [ 8931.344205] ata3.00: configured for
UDMA/133
Sep 12 22:15:12 asterisk kernel: [ 8931.344220] ata3: EH complete
Sep 12 22:15:13 asterisk kernel: [ 8932.306376] ata3.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 12 22:15:13 asterisk kernel: [ 8932.308201] ata3.00: BMDMA stat 0x44
Sep 12 22:15:13 asterisk kernel: [ 8932.309948] ata3.00: failed command:
READ DMA
Sep 12 22:15:13 asterisk kernel: [ 8932.311695] ata3.00: cmd
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:13 asterisk kernel: [ 8932.311697] res
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:13 asterisk kernel: [ 8932.315262] ata3.00: status: { DRDY
ERR }
Sep 12 22:15:13 asterisk kernel: [ 8932.317070] ata3.00: error: { UNC }
Sep 12 22:15:14 asterisk kernel: [ 8933.284204] ata3.00: configured for
UDMA/133
Sep 12 22:15:14 asterisk kernel: [ 8933.284234] sd 2:0:0:0: [sda]
Unhandled sense code
Sep 12 22:15:14 asterisk kernel: [ 8933.284237] sd 2:0:0:0: [sda]
Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 12 22:15:14 asterisk kernel: [ 8933.284241] sd 2:0:0:0: [sda] Sense
Key : Medium Error [current] [descriptor]
Sep 12 22:15:14 asterisk kernel: [ 8933.284246] Descriptor sense data
with sense descriptors (in hex):
Sep 12 22:15:14 asterisk kernel: [ 8933.284248] 72 03 11 04 00
00 00 0c 00 0a 80 00 00 00 00 00
Sep 12 22:15:14 asterisk kernel: [ 8933.284256] 00 00 0f 48
Sep 12 22:15:14 asterisk kernel: [ 8933.284260] sd 2:0:0:0: [sda] Add.
Sense: Unrecovered read error - auto reallocate failed
Sep 12 22:15:14 asterisk kernel: [ 8933.284267] sd 2:0:0:0: [sda] CDB:
Read(10): 28 00 00 00 0f 48 00 00 08 00
Sep 12 22:15:14 asterisk kernel: [ 8933.284274] end_request: I/O error,
dev sda, sector 3912
Sep 12 22:15:14 asterisk kernel: [ 8933.286065] Buffer I/O error on
device sda1, logical block 233
Sep 12 22:15:14 asterisk kernel: [ 8933.287889] ata3: EH complete
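The numbers in the log above are consistent with each other, which can be
checked with a little shell arithmetic. This is only a sketch: the
4096-byte buffer block size is an assumption (the failed reads are 8
sectors long, "dma 4096"), the 512-byte sector size is what smartctl
reports for this drive, and the partition start it derives is inferred
rather than stated anywhere in the thread.

```shell
#!/bin/sh
# Illustrative cross-check of the figures in the kernel log and dd output.
bad_sector=3912     # "end_request: I/O error, dev sda, sector 3912"
bad_block=233       # "Buffer I/O error on device sda1, logical block 233"
sector_size=512     # bytes, as reported by smartctl for this drive
block_size=4096     # bytes, ASSUMED buffer block size
dd_bs=524228        # bs= value passed to dd
dd_skip=1           # skip= value passed to dd

sectors_per_block=$((block_size / sector_size))
part_rel_sector=$((bad_block * sectors_per_block))   # offset inside sda1
part_start=$((bad_sector - part_rel_sector))         # inferred start of sda1
bad_byte=$((part_rel_sector * sector_size))          # byte offset inside sda1
read_start=$((dd_bs * dd_skip))                      # where dd began reading
bytes_before_error=$((bad_byte - read_start))

echo "sda1 appears to start at sector $part_start"
echo "bad block begins at byte $bad_byte of sda1"
echo "dd could copy $bytes_before_error bytes before hitting it"
```

The last figure matches the 430140 bytes that dd reported copying before
it stopped.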
*Why doesn't it fail the first hard disk of the array!!??*
root@asterisk:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
949236 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
311619448 blocks super 1.2 [2/2] [UU]
unused devices: <none>
root@asterisk:~# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Jun 15 22:45:13 2012
Raid Level : raid1
Array Size : 311619448 (297.18 GiB 319.10 GB)
Used Dev Size : 311619448 (297.18 GiB 319.10 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Wed Sep 12 22:07:58 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : asterisk:0 (local to host asterisk)
UUID : cea0c4c3:181e2ee3:e4d1f3c0:1008ea62
Events : 68
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
As you can see, the firmware of the hard disk reports a read error and
Linux still doesn't fail the drive: this is the best way to corrupt data.
As far as I know, it should fail the bad drive, or at least try to resync
it, allowing the firmware to reallocate the bad sectors on write.
I really want to understand how raid1 is expected to work, I simply
cannot trust something like this. I'd like to take advantage of the
failure to learn something about linux's raid1 behavior.
Thanks,
Niccolò
More info about the failed disk:
root@asterisk:~# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-2-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F1 DT
Device Model: SAMSUNG HD322HJ
Serial Number: S17AJDWQ402689
LU WWN Device Id: 5 0000f0 003046298
Firmware Version: 1AC01110
User Capacity: 320,072,933,376 bytes [320 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Wed Sep 12 22:27:56 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection:
Disabled.
Self-test execution status: ( 114) The previous self-test completed
having
the read element of the test
failed.
Total time to complete Offline
data collection: ( 3888) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection
on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 66) minutes.
Conveyance self-test routine
recommended polling time: ( 8) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control
supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 099   099   051    Pre-fail Always  -           428
  3 Spin_Up_Time            0x0007 094   094   011    Pre-fail Always  -           2810
  4 Start_Stop_Count        0x0032 099   099   000    Old_age  Always  -           1077
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 253   253   051    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0025 100   100   015    Pre-fail Offline -           9666
  9 Power_On_Hours          0x0032 098   098   000    Old_age  Always  -           8915
 10 Spin_Retry_Count        0x0033 100   100   051    Pre-fail Always  -           0
 11 Calibration_Retry_Count 0x0012 100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           1077
 13 Read_Soft_Error_Rate    0x000e 099   099   000    Old_age  Always  -           400
183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0033 100   100   099    Pre-fail Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           400
188 Command_Timeout         0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 063   055   000    Old_age  Always  -           37 (Min/Max 28/45)
194 Temperature_Celsius     0x0022 063   054   000    Old_age  Always  -           37 (Min/Max 28/46)
195 Hardware_ECC_Recovered  0x001a 100   100   000    Old_age  Always  -           355155576
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           2
198 Offline_Uncorrectable   0x0030 100   100   000    Old_age  Offline -           1
199 UDMA_CRC_Error_Count    0x003e 100   100   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x000a 100   100   000    Old_age  Always  -           0
201 Soft_Read_Error_Rate    0x000a 096   096   000    Old_age  Always  -           361
SMART Error Log Version: 1
ATA Error Count: 173 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 173 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 0f 00 e0 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 0f 00 e0 08 18d+09:02:03.824 READ DMA
ec 00 00 00 00 00 a0 08 18d+09:02:03.814 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:02:03.814 SET FEATURES [Set transfer
mode]
ec 00 00 00 00 00 a0 08 18d+09:02:02.824 IDENTIFY DEVICE
Error 172 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 0f 00 e0 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 0f 00 e0 08 18d+09:02:01.814 READ DMA
ec 00 00 00 00 00 a0 08 18d+09:02:01.804 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:02:01.804 SET FEATURES [Set transfer
mode]
ec 00 00 00 00 00 a0 08 18d+09:02:00.854 IDENTIFY DEVICE
Error 171 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 0f 00 e0 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 0f 00 e0 08 18d+09:01:59.874 READ DMA
ec 00 00 00 00 00 a0 08 18d+09:01:59.864 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:01:59.864 SET FEATURES [Set transfer
mode]
ec 00 00 00 00 00 a0 08 18d+09:01:58.904 IDENTIFY DEVICE
Error 170 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 0f 00 e0 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 0f 00 e0 08 18d+09:01:57.924 READ DMA
ec 00 00 00 00 00 a0 08 18d+09:01:57.924 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:01:57.924 SET FEATURES [Set transfer
mode]
ec 00 00 00 00 00 a0 08 18d+09:01:56.964 IDENTIFY DEVICE
Error 169 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 0f 00 e0 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 0f 00 e0 08 18d+09:01:55.984 READ DMA
ec 00 00 00 00 00 a0 08 18d+09:01:55.974 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 08 18d+09:01:55.974 SET FEATURES [Set transfer
mode]
ec 00 00 00 00 00 a0 08 18d+09:01:55.014 IDENTIFY DEVICE
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision
number = 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 20% 8895 3912
# 2 Short offline Aborted by host 20% 8871 -
# 3 Short offline Aborted by host 20% 8847 -
# 4 Short offline Aborted by host 20% 8823 -
# 5 Extended offline Aborted by host 90% 8800 -
# 6 Short offline Aborted by host 20% 8799 -
# 7 Short offline Aborted by host 20% 8775 -
# 8 Short offline Aborted by host 20% 8751 -
# 9 Short offline Aborted by host 20% 8727 -
#10 Short offline Aborted by host 20% 8703 -
#11 Short offline Aborted by host 20% 8679 -
#12 Short offline Aborted by host 20% 8655 -
#13 Extended offline Aborted by host 90% 8632 -
#14 Short offline Aborted by host 20% 8631 -
#15 Short offline Aborted by host 20% 8607 -
#16 Short offline Aborted by host 20% 8583 -
#17 Short offline Aborted by host 20% 8559 -
#18 Short offline Aborted by host 20% 8535 -
#19 Short offline Aborted by host 20% 8511 -
#20 Short offline Aborted by host 20% 8487 -
#21 Extended offline Aborted by host 90% 8464 -
Note: selective self-test log revision number (0) not 1 implies that no
selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever
been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
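Each SMART error log entry above encodes the failing address in the SN,
CL, CH and DH registers. Assuming 28-bit LBA addressing (which is what
the READ DMA command, c8, uses), the address is the low nibble of DH
followed by CH, CL and SN. A minimal sketch of that decoding, using the
register values from the log:

```shell
#!/bin/sh
# Register values copied from the SMART error log: SN=48 CL=0f CH=00 DH=e0
sn=0x48
cl=0x0f
ch=0x00
dh=0xe0

# LBA28: bits 27..24 come from the low nibble of DH, then CH, CL, SN.
lba=$(( (($dh & 0x0f) << 24) | ($ch << 16) | ($cl << 8) | $sn ))

printf 'LBA = 0x%08x = %d\n' "$lba" "$lba"
```

This gives the same sector, 3912, that the kernel names in its
"end_request: I/O error" line and that the short self-test records as
LBA_of_first_error.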
--
http://www.linuxsystems.it
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-13 10:01 raid1 issue after disk failure: both disks of the array are still active Niccolò Belli
@ 2012-09-13 10:34 ` Robin Hill
2012-09-13 10:46 ` Niccolò Belli
2012-09-13 17:02 ` Chris Murphy
0 siblings, 2 replies; 27+ messages in thread
From: Robin Hill @ 2012-09-13 10:34 UTC (permalink / raw)
To: Niccolò Belli; +Cc: linux-raid
On Thu Sep 13, 2012 at 12:01:59PM +0200, Niccolò Belli wrote:
> Hi,
> I have a raid1 array with two disks, distro is Squeeze amd64. /dev/sda
> is slowly dying, here is a snippet of "smartctl -a /dev/sda":
>
> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always
> - 2
> 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline
> - 1
>
> The bad sector is in the second half-MB of the disk, in fact with "dd
> if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1" I get this output in
> /var/log/syslog:
>
> root@asterisk:~# dd if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1
> 0+1 records in
> 0+1 records out
> 430140 bytes (430 kB) copied, 11.7265 s, 36.7 kB/s
>
<- snip dmesg output ->
>
> *Why doesn't it fail the first hard disk of the array!!??*
>
Has anything actually attempted to read from that part of the array?
Even if so, it may just have happened to read from the working disk
anyway. md can only detect the error when it tries to read/write that
sector of that disk.
Your best bet now is to do an array check:
    echo check > /sys/block/md0/md/sync_action
This will force a read of all disks in the array. This should trigger
the read error, causing an attempt to re-write the faulty block, in turn
causing the drive to remap the bad sector (assuming the re-write fails).
This should also be scheduled to run regularly for all arrays in order
to pick up this sort of issue before it causes major problems during
a rebuild.
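A minimal sketch of wrapping that sysfs write in a small function, so the
same call can request a check, a repair, or stop a running scrub. The
function name md_scrub and its optional third argument (an alternative
sysfs root, useful for dry runs) are made up for illustration; only the
sync_action file and the values written to it come from md itself.

```shell
#!/bin/sh
# md_scrub ARRAY ACTION [SYSFS_ROOT]   (hypothetical helper)
# Writes ACTION (check, repair or idle) to the array's sync_action file.
md_scrub() {
    md=$1
    action=$2
    root=${3:-/sys}
    node="$root/block/$md/md/sync_action"

    # Only pass through the actions discussed in this thread.
    case "$action" in
        check|repair|idle) ;;
        *) echo "unsupported action: $action" >&2; return 2 ;;
    esac

    [ -w "$node" ] || { echo "cannot write $node" >&2; return 1; }
    printf '%s\n' "$action" > "$node"
}

# Example, as root:
#   md_scrub md0 check && cat /proc/mdstat
```

Progress of the scrub shows up in /proc/mdstat while it runs.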
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-13 10:34 ` Robin Hill
@ 2012-09-13 10:46 ` Niccolò Belli
[not found] ` <5051BBC3.4050805@websitemanagers.com.au>
[not found] ` <CABYL=TpKD2B0vwTrHH=iFK3PcMWueEsi84ACRbBQkDXuiWG3kw@mail.gmail.com>
2012-09-13 17:02 ` Chris Murphy
1 sibling, 2 replies; 27+ messages in thread
From: Niccolò Belli @ 2012-09-13 10:46 UTC (permalink / raw)
To: linux-raid
On 13/09/2012 12:34, Robin Hill wrote:
> Has anything actually attempted to read from that part of the array?
> Even if so, it may just have happened to read from the working disk
> anyway. md can only detect the error when it tries to read/write that
> sector of that disk.
I forced a read with "dd if=/dev/md0 of=/dev/null bs=524228 count=1
skip=1", I even get errors in syslog!
> Your best bet now is to do an array check:
> echo check> /sys/block/md0/md/sync_action
>
> This will force a read of all disks in the array. This should trigger
> the read error, causing an attempt to re-write the faulty block, in turn
> causing the drive remap the bad sector (assuming the re-write fails).
> This should also be scheduled to run regularly for all arrays in order
> to pick up these sort of issues before they cause major problems during
> a rebuild.
/etc/init.d/mdadm should do exactly this kind of thing (distro is
Debian Squeeze). I have this in cron.d:
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
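For reference, the schedule in that cron entry reads as follows: "57 0 *
* 0" fires at 00:57 every Sunday, and the "date +%d" test lets checkarray
run only when the day of the month is 7 or less, i.e. on the first Sunday
of the month (the "\%" is cron's escaping of "%"). A minimal sketch of
the day test, where is_first_week is a hypothetical helper name and the
sample days are arbitrary:

```shell
#!/bin/sh
# is_first_week DAY_OF_MONTH   (hypothetical helper)
# Succeeds only for days 1..7, mirroring the cron entry's date test.
is_first_week() {
    [ "$1" -le 7 ]
}

# "date +%d" prints two digits, so the samples keep the leading zero.
for dom in 02 07 08 30; do
    if is_first_week "$dom"; then
        echo "day $dom: checkarray would run"
    else
        echo "day $dom: skipped"
    fi
done
```

So on a system with this entry, a scrub is attempted at most once a
month, and a sector that went bad since the last run stays unnoticed by
md until the next one.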
Unfortunately it seems it didn't work :(
Shouldn't a dd if=/dev/md0 be enough to trigger the read error?
Thanks,
Niccolò
--
http://www.linuxsystems.it
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-13 10:34 ` Robin Hill
2012-09-13 10:46 ` Niccolò Belli
@ 2012-09-13 17:02 ` Chris Murphy
2012-09-13 17:39 ` Roberto Spadim
2012-09-14 7:16 ` Mikael Abrahamsson
1 sibling, 2 replies; 27+ messages in thread
From: Chris Murphy @ 2012-09-13 17:02 UTC (permalink / raw)
To: Linux RAID
On Sep 13, 2012, at 4:34 AM, Robin Hill wrote:
>
> Your best bet now is to do an array check:
> echo check > /sys/block/md0/md/sync_action
>
> This will force a read of all disks in the array. This should trigger
> the read error, causing an attempt to re-write the faulty block, in turn
> causing the drive remap the bad sector (assuming the re-write fails).
"check" records errors; no action is taken by the md driver to correct them, although the disk firmware itself may try reallocation. So far, that appears not to be the case.
"repair" causes the md driver to write correct data (from copy or reconstructed from parity), which should force the disk firmware to reallocate the affected LBAs from bad physical sectors to good ones.
It seems in this case "repair" is indicated.
Chris Murphy
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-13 17:02 ` Chris Murphy
@ 2012-09-13 17:39 ` Roberto Spadim
2012-09-13 20:13 ` Chris Murphy
2012-09-14 7:16 ` Mikael Abrahamsson
1 sibling, 1 reply; 27+ messages in thread
From: Roberto Spadim @ 2012-09-13 17:39 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux RAID
> "check" records errors, no action is taken by the md driver to correct it, although the disk firmware itself may try reallocation. So far, that appears to not be the case.
>
> "repair" causes the md driver to write correct data (from copy or reconstructed from parity), which should force the disk firmware to reallocate the affected LBAs from bad physical sectors to good ones.
>
> It seems in this case "repair" is indicated.
>
Or replace the bad disk =)
--
Roberto Spadim
Spadim Technology / SPAEmpresarial
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-13 17:39 ` Roberto Spadim
@ 2012-09-13 20:13 ` Chris Murphy
0 siblings, 0 replies; 27+ messages in thread
From: Chris Murphy @ 2012-09-13 20:13 UTC (permalink / raw)
To: Linux RAID
On Sep 13, 2012, at 11:39 AM, Roberto Spadim wrote:
>> "check" records errors, no action is taken by the md driver to correct it, although the disk firmware itself may try reallocation. So far, that appears to not be the case.
>>
>> "repair" causes the md driver to write correct data (from copy or reconstructed from parity), which should force the disk firmware to reallocate the affected LBAs from bad physical sectors to good ones.
>>
>> It seems in this case "repair" is indicated.
>>
> Or replace the bad disk =)
Yes, or replace the disk. But from the SMART info provided, only a few sectors are affected. None of the attribute values have even budged so far.
Chris Murphy
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-13 17:02 ` Chris Murphy
2012-09-13 17:39 ` Roberto Spadim
@ 2012-09-14 7:16 ` Mikael Abrahamsson
2012-09-14 7:45 ` Niccolò Belli
2012-09-14 8:13 ` NeilBrown
1 sibling, 2 replies; 27+ messages in thread
From: Mikael Abrahamsson @ 2012-09-14 7:16 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux RAID
On Thu, 13 Sep 2012, Chris Murphy wrote:
> "check" records errors, no action is taken by the md driver to correct
> it, although the disk firmware itself may try reallocation. So far, that
> appears to not be the case.
>
> "repair" causes the md driver to write correct data (from copy or
> reconstructed from parity), which should force the disk firmware to
> reallocate the affected LBAs from bad physical sectors to good ones.
>
> It seems in this case "repair" is indicated.
I was under the impression that "check" would check if all data blocks and
parity are correct, and record if there is a parity mismatch. This would
then be corrected by using "repair" at a later time.
I was also under the impression that if there was a read error on a drive
during "check", that read error would be corrected using parity because
it's obviously a hard error, not a logical error.
Could you (or someone else) please confirm that my impression is wrong and
if there indeed is a hard read error using "check", this will not be
corrected? I would be interested in knowing why this decision was taken to
have this behaviour, as I feel that if there is a hard read error, this
should always be corrected using parity.
--
Mikael Abrahamsson email: swmike@swm.pp.se
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-14 7:16 ` Mikael Abrahamsson
@ 2012-09-14 7:45 ` Niccolò Belli
2012-09-14 18:04 ` Chris Murphy
2012-09-14 8:13 ` NeilBrown
1 sibling, 1 reply; 27+ messages in thread
From: Niccolò Belli @ 2012-09-14 7:45 UTC (permalink / raw)
To: linux-raid
I would also like to know if raid1 will *surely* use data from the
other disk to rewrite the broken sector after a CHECK. I mean, it did
nothing even after a read error on md0 with a "failed command: READ DMA"
in dmesg (possibly because the read succeeded after a few retries?). I
read that when raid1 is in doubt there is a 50%-50% chance it uses data
from the good disk; wouldn't it be better to fail the broken disk and
then re-add it to the array?
Cheers,
Niccolò
--
http://www.linuxsystems.it
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-14 7:45 ` Niccolò Belli
@ 2012-09-14 18:04 ` Chris Murphy
2012-09-14 18:27 ` Robin Hill
0 siblings, 1 reply; 27+ messages in thread
From: Chris Murphy @ 2012-09-14 18:04 UTC (permalink / raw)
To: Linux RAID
On Sep 14, 2012, at 1:45 AM, Niccolò Belli wrote:
> I also would like to know if the raid1 will *surely* use data from the other disk to write on the broken sector after a CHECK.
Not according to documentation. In normal operation, and for a repair, what you describe is correct. But not for check.
> I mean, i did nothing even after a read error on md0 with a "failed command: READ DMA" in dmesg (possibly because after a few reads it succeeded reading?).
Possibly. Possibly the disk firmware finally was able to relocate that sector's data. Possibly its ECC thinks it has reconstructed the data on that sector, but in fact the data is corrupt.
> I read that when raid1 is in doubt there is a 50%-50% chance it uses data from the good disk, wouldn't be better to fail the broken disk and then re-add it to the array?
I don't know what this means.
But I think there's a misunderstanding about disk behavior. A disk's reliability is not always a binary condition. Most often it's a continuum, because sector problems are masked by the disk's ECC and go entirely unreported to the kernel, and thus to md. This includes the case where the disk ECC detects an error and thinks it has corrected it, but actually returns bogus (corrupt) data rather than a read error, as well as the case where the disk ECC does not detect an error at all, but the data is in fact corrupt.
The md driver has no practical choice but to trust the data the disk returns, absent an error. So I'm confused by what you mean by "when raid1 is in doubt" and what you mean by this "50/50 chance" part.
When the disk ECC detects an error and fails to correct it, only then will it report a read error to the kernel, and md will then get that data from elsewhere. There is no good reason for md to mark a 99.99% correctly performing disk as faulty. If it did, you would unnecessarily abandon those 99.99% useful sectors and, in so doing, significantly reduce redundancy.
There's a reason why there are check and repair functions, rather than wholesale discarding of an otherwise valuable disk with a handful of bad sectors (or even one) and the ensuing loss of redundancy. Check is read-only, so it will be faster than repair, which is a good reason to use check frequently and repair less frequently, unless check warrants it. And there's good reason to include some periodic smartd testing as well, since there are parts of the disk that md check/repair can't test *and* because the disk ECC masks problems, whereas SMART should report them.
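As a rough illustration of watching what SMART reports, the raw value of
an attribute such as 197 (Current_Pending_Sector) can be pulled out of
the attribute table with awk. This is a sketch only: smart_raw is a
hypothetical helper name, and the sample lines are the two attributes
quoted at the start of this thread, written one per line as "smartctl -A"
prints them.

```shell
#!/bin/sh
# smart_raw ID   (hypothetical helper)
# Reads a SMART attribute table on stdin and prints the RAW_VALUE
# (10th column) of the attribute whose ID# matches ID.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Sample input taken from the SMART output earlier in the thread.
sample='197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 1'

pending=$(printf '%s\n' "$sample" | smart_raw 197)
echo "pending sectors: $pending"

# On a live system (as root) the input would come from the drive:
#   smartctl -A /dev/sda | smart_raw 197
```

A value that keeps growing between runs is the kind of trend that a
check or repair pass alone would not show.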
Chris Murphy
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-14 18:04 ` Chris Murphy
@ 2012-09-14 18:27 ` Robin Hill
2012-09-14 18:53 ` Chris Murphy
0 siblings, 1 reply; 27+ messages in thread
From: Robin Hill @ 2012-09-14 18:27 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux RAID
On Fri Sep 14, 2012 at 12:04:56 -0600, Chris Murphy wrote:
>
> On Sep 14, 2012, at 1:45 AM, Niccolò Belli wrote:
>
> > I also would like to know if the raid1 will *surely* use data from
> > the other disk to write on the broken sector after a CHECK.
>
> Not according to documentation. In normal operation, and for a repair,
> what you describe is correct. But not for check.
>
Maybe you need to reread the documentation. The md manual page says:
Requesting a scrub will cause md to read every block on every device
in the array, and check that the data is consistent. For RAID1 and
RAID10, this means checking that the copies are identical. For
RAID4, RAID5, RAID6 this means checking that the parity block is (or
blocks are) correct.
If a read error is detected during this process, the normal
read-error handling causes correct data to be found from other
devices and to be written back to the faulty device. In many cases
this will effectively fix the bad block.
So a check will repair cases where the data cannot be read at all, but
will not repair cases where the data is returned but does not match the
data on the other mirror(s).
> > I read that when raid1 is in doubt there is a 50%-50% chance it uses
> > data from the good disk, wouldn't be better to fail the broken disk
> > and then re-add it to the array?
>
> I don't know what this means.
>
I assume he's referring to cases where the data is read successfully but
does not match the data on the mirror(s). In this case a repair will
cause one copy to overwrite the others, which may or may not be the
correct copy (md has no way of knowing for a mirrored pair).
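Whether a check found copies that read fine but disagree can be seen afterwards in the array's mismatch_cnt file. A minimal sketch, assuming the usual sysfs layout; md_mismatch_report is a hypothetical helper name, and the count is taken to be in sectors as the md documentation describes it.

```shell
# Report what the last check/repair pass counted as mismatched.
# $1: md sysfs directory, e.g. /sys/block/md0/md (assumed path for md0)
md_mismatch_report() {
    count=$(cat "$1/mismatch_cnt") || return 1
    if [ "$count" -eq 0 ]; then
        echo "mirrors consistent"
    else
        # Non-zero after a check means the copies differ but were left alone.
        echo "$count sectors differ between mirrors"
    fi
}
```

A non-zero value after a check only says that the copies differ; it does not say which copy is the right one.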
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-14 18:27 ` Robin Hill
@ 2012-09-14 18:53 ` Chris Murphy
2012-09-15 19:05 ` Niccolò Belli
0 siblings, 1 reply; 27+ messages in thread
From: Chris Murphy @ 2012-09-14 18:53 UTC (permalink / raw)
To: Linux RAID
On Sep 14, 2012, at 12:27 PM, Robin Hill wrote:
> On Fri Sep 14, 2012 at 12:04:56 -0600, Chris Murphy wrote:
>
>>
>> On Sep 14, 2012, at 1:45 AM, Niccolò Belli wrote:
>>
>>> I also would like to know if the raid1 will *surely* use data from
>>> the other disk to write on the broken sector after a CHECK.
>>
>> Not according to documentation. In normal operation, and for a repair,
>> what you describe is correct. But not for check.
>>
> Maybe you need to reread the documentation.
Probably. It's densely packed.
> So a check will repair cases where the data cannot be read at all, but
> will not repair cases where the data is returned but does not match the
> data on the other mirror(s).
Yes, I now see the distinction between a disk read error and an array block mismatch.
>
>>> I read that when raid1 is in doubt there is a 50%-50% chance it uses
>>> data from the good disk, wouldn't be better to fail the broken disk
>>> and then re-add it to the array?
>>
>> I don't know what this means.
>>
> I assume he's referring to cases where the data is read successfully but
> does not match the data on the mirror(s). In this case a repair will
> cause one copy to overwrite the others, which may or may not be the
> correct copy (md has no way of knowing for a mirrored pair).
I understand the ambiguity. It seems ill-advised to arbitrarily replace what could be valid data. Some clarification on what repair does on a RAID1/RAID10 block mismatch would be useful, since it may be that repair shouldn't be used at all: rather, use check, find out which file(s) are affected by the mismatch, and restore those files from backup.
This statement from the documentation is confusing to me: "For RAID1/RAID10, all but one block are overwritten with the content of that one block."
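One way to read that sentence is as a plain copy operation. The toy model below uses ordinary files to stand in for each mirror's copy of one block; repair_copies is a made-up name, and treating the first argument as the source is an arbitrary choice for illustration, not a statement about which copy md picks.

```shell
# Toy model of "all but one block are overwritten with that one block".
# $1: file standing in for the copy used as the source
# $2..: files standing in for the other mirrors' copies
repair_copies() {
    src=$1
    shift
    for copy in "$@"; do
        # Only rewrite a copy that actually differs from the source.
        cmp -s "$src" "$copy" || cp "$src" "$copy"
    done
}
```

Nothing in this model checks whether the source copy is the correct one, which is exactly the ambiguity being discussed.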
Chris Murphy
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-14 18:53 ` Chris Murphy
@ 2012-09-15 19:05 ` Niccolò Belli
2012-09-15 19:41 ` Robin Hill
0 siblings, 1 reply; 27+ messages in thread
From: Niccolò Belli @ 2012-09-15 19:05 UTC (permalink / raw)
To: linux-raid
CHECK didn't help me, so I ran "echo repair > /sys/block/md0/md/sync_action".
REPAIR didn't work either :(
Here is syslog of REPAIR:
Sep 15 19:34:10 asterisk mdadm[2117]: RebuildStarted event detected on
md device /dev/md/0
Sep 15 19:34:10 asterisk kernel: [258470.152296] md: requested-resync of
RAID array md0
Sep 15 19:34:10 asterisk kernel: [258470.152301] md: minimum
_guaranteed_ speed: 1000 KB/sec/disk.
Sep 15 19:34:10 asterisk kernel: [258470.152304] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
requested-resync.
Sep 15 19:34:10 asterisk kernel: [258470.152310] md: using 128k window,
over a total of 311619448k.
Sep 15 19:34:11 asterisk kernel: [258471.165653] ata3.00: exception
Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 15 19:34:11 asterisk kernel: [258471.167468] ata3.00: BMDMA stat 0x44
Sep 15 19:34:11 asterisk kernel: [258471.169912] ata3.00: failed
command: READ DMA EXT
Sep 15 19:34:11 asterisk kernel: [258471.172769] ata3.00: cmd
25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
Sep 15 19:34:11 asterisk kernel: [258471.172771] res
51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 15 19:34:11 asterisk kernel: [258471.176753] ata3.00: status: { DRDY
ERR }
Sep 15 19:34:11 asterisk kernel: [258471.178605] ata3.00: error: { UNC }
Sep 15 19:34:12 asterisk kernel: [258472.148217] ata3.00: configured for
UDMA/133
Sep 15 19:34:12 asterisk kernel: [258472.148232] ata3: EH complete
Sep 15 19:34:13 asterisk kernel: [258473.131054] ata3.00: exception
Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 15 19:34:13 asterisk kernel: [258473.132881] ata3.00: BMDMA stat 0x44
Sep 15 19:34:13 asterisk kernel: [258473.134639] ata3.00: failed
command: READ DMA EXT
Sep 15 19:34:13 asterisk kernel: [258473.136413] ata3.00: cmd
25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
Sep 15 19:34:13 asterisk kernel: [258473.136415] res
51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 15 19:34:13 asterisk kernel: [258473.141768] ata3.00: status: { DRDY
ERR }
Sep 15 19:34:13 asterisk kernel: [258473.144049] ata3.00: error: { UNC }
Sep 15 19:34:14 asterisk kernel: [258474.112209] ata3.00: configured for
UDMA/133
Sep 15 19:34:14 asterisk kernel: [258474.112224] ata3: EH complete
Sep 15 19:34:15 asterisk kernel: [258475.071642] ata3.00: exception
Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 15 19:34:15 asterisk kernel: [258475.073476] ata3.00: BMDMA stat 0x44
Sep 15 19:34:15 asterisk kernel: [258475.075240] ata3.00: failed
command: READ DMA EXT
Sep 15 19:34:15 asterisk kernel: [258475.077027] ata3.00: cmd
25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
Sep 15 19:34:15 asterisk kernel: [258475.077029] res
51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 15 19:34:15 asterisk kernel: [258475.080720] ata3.00: status: { DRDY
ERR }
Sep 15 19:34:15 asterisk kernel: [258475.083512] ata3.00: error: { UNC }
Sep 15 19:34:16 asterisk kernel: [258476.100935] ata3.00: configured for
UDMA/133
Sep 15 19:34:16 asterisk kernel: [258476.100960] ata3: EH complete
Sep 15 19:41:29 asterisk asterisk[3492]: rc_avpair_new: unknown
attribute 1490026597
Sep 15 19:41:46 asterisk asterisk[3492]: rc_avpair_new: unknown
attribute 1490026597
Sep 15 19:41:52 asterisk asterisk[3492]: rc_avpair_new: unknown
attribute 1490026597
Sep 15 19:42:52 asterisk asterisk[3492]: rc_avpair_new: unknown
attribute 1490026597
Sep 15 19:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2
Currently unreadable (pending) sectors
Sep 15 19:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline
uncorrectable sectors
Sep 15 19:50:51 asterisk mdadm[2117]: Rebuild26 event detected on md
device /dev/md/0
Sep 15 20:07:31 asterisk mdadm[2117]: Rebuild53 event detected on md
device /dev/md/0
Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2
Currently unreadable (pending) sectors
Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline
uncorrectable sectors
Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT],
Temperature changed +4 Celsius to 42 Celsius (Min/Max 30/46)
Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], SMART
Usage Attribute: 201 Soft_Read_Error_Rate changed from 99 to 100
Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
Sep 15 20:24:11 asterisk mdadm[2117]: Rebuild75 event detected on md
device /dev/md/0
Sep 15 20:40:51 asterisk mdadm[2117]: Rebuild93 event detected on md
device /dev/md/0
Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2
Currently unreadable (pending) sectors
Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline
uncorrectable sectors
Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
Sep 15 20:47:24 asterisk kernel: [262863.781068] md: md0:
requested-resync done.
Sep 15 20:47:24 asterisk mdadm[2117]: RebuildFinished event detected on
md device /dev/md/0
I still get:
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline           Completed: read failure  90%        8985             3912
and
197 Current_Pending_Sector  0x0012  100  100  000  Old_age  Always   -  2
198 Offline_Uncorrectable   0x0030  100  100  000  Old_age  Offline  -  1
How is it possible? The next thing I will try is manually failing /dev/sda
and filling it with zeros. I would like to do a *low level format* but I
couldn't find a utility for my disk :(
Disk is:
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F1 DT
Device Model: SAMSUNG HD322HJ
Serial Number: S17AJDWQ402689
LU WWN Device Id: 5 0000f0 003046298
Firmware Version: 1AC01110
User Capacity: 320,072,933,376 bytes [320 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Sat Sep 15 21:02:36 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
root@asterisk:~# smartctl -a /dev/sda -P show
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-2-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Drive found in smartmontools Database. Drive identity strings:
MODEL: SAMSUNG HD322HJ
FIRMWARE: 1AC01110
match smartmontools Drive Database entry:
MODEL REGEXP: SAMSUNG HD(083G|16[12]G|25[12]H|32[12]H|50[12]I|642J|75[23]L|10[23]U)J
FIRMWARE REGEXP: .*
MODEL FAMILY: SAMSUNG SpinPoint F1 DT
ATTRIBUTE OPTIONS: None preset; no -v options are required.
Thanks,
Niccolò
--
http://www.linuxsystems.it
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-15 19:05 ` Niccolò Belli
@ 2012-09-15 19:41 ` Robin Hill
2012-09-15 22:06 ` Niccolò Belli
2012-09-16 10:42 ` Niccolò Belli
0 siblings, 2 replies; 27+ messages in thread
From: Robin Hill @ 2012-09-15 19:41 UTC (permalink / raw)
To: Niccolò Belli; +Cc: linux-raid
On Sat Sep 15, 2012 at 09:05:25 +0200, Niccolò Belli wrote:
> CHECK didn't help me, so I ran
> "echo repair > /sys/block/md0/md/sync_action". REPAIR didn't work either :(
>
Didn't work for what you were wanting anyway. It may well have worked
for its intended purpose.
> Here is syslog of REPAIR:
>
> Sep 15 19:34:10 asterisk mdadm[2117]: RebuildStarted event detected on
> md device /dev/md/0
> Sep 15 19:34:10 asterisk kernel: [258470.152296] md: requested-resync of
> RAID array md0
> Sep 15 19:34:10 asterisk kernel: [258470.152301] md: minimum
> _guaranteed_ speed: 1000 KB/sec/disk.
> Sep 15 19:34:10 asterisk kernel: [258470.152304] md: using maximum
> available idle IO bandwidth (but not more than 200000 KB/sec) for
> requested-resync.
> Sep 15 19:34:10 asterisk kernel: [258470.152310] md: using 128k window,
> over a total of 311619448k.
> Sep 15 19:34:11 asterisk kernel: [258471.165653] ata3.00: exception
> Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Sep 15 19:34:11 asterisk kernel: [258471.167468] ata3.00: BMDMA stat 0x44
> Sep 15 19:34:11 asterisk kernel: [258471.169912] ata3.00: failed
> command: READ DMA EXT
> Sep 15 19:34:11 asterisk kernel: [258471.172769] ata3.00: cmd
> 25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
> Sep 15 19:34:11 asterisk kernel: [258471.172771] res
> 51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
> Sep 15 19:34:11 asterisk kernel: [258471.176753] ata3.00: status: { DRDY
> ERR }
> Sep 15 19:34:11 asterisk kernel: [258471.178605] ata3.00: error: { UNC }
> Sep 15 19:34:12 asterisk kernel: [258472.148217] ata3.00: configured for
> UDMA/133
> Sep 15 19:34:12 asterisk kernel: [258472.148232] ata3: EH complete
> Sep 15 19:34:13 asterisk kernel: [258473.131054] ata3.00: exception
> Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Sep 15 19:34:13 asterisk kernel: [258473.132881] ata3.00: BMDMA stat 0x44
> Sep 15 19:34:13 asterisk kernel: [258473.134639] ata3.00: failed
> command: READ DMA EXT
> Sep 15 19:34:13 asterisk kernel: [258473.136413] ata3.00: cmd
> 25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
> Sep 15 19:34:13 asterisk kernel: [258473.136415] res
> 51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
> Sep 15 19:34:13 asterisk kernel: [258473.141768] ata3.00: status: { DRDY
> ERR }
> Sep 15 19:34:13 asterisk kernel: [258473.144049] ata3.00: error: { UNC }
> Sep 15 19:34:14 asterisk kernel: [258474.112209] ata3.00: configured for
> UDMA/133
> Sep 15 19:34:14 asterisk kernel: [258474.112224] ata3: EH complete
> Sep 15 19:34:15 asterisk kernel: [258475.071642] ata3.00: exception
> Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Sep 15 19:34:15 asterisk kernel: [258475.073476] ata3.00: BMDMA stat 0x44
> Sep 15 19:34:15 asterisk kernel: [258475.075240] ata3.00: failed
> command: READ DMA EXT
> Sep 15 19:34:15 asterisk kernel: [258475.077027] ata3.00: cmd
> 25/00:00:00:15:00/00:04:00:00:00/e0 tag 0 dma 524288 in
> Sep 15 19:34:15 asterisk kernel: [258475.077029] res
> 51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
> Sep 15 19:34:15 asterisk kernel: [258475.080720] ata3.00: status: { DRDY
> ERR }
> Sep 15 19:34:15 asterisk kernel: [258475.083512] ata3.00: error: { UNC }
> Sep 15 19:34:16 asterisk kernel: [258476.100935] ata3.00: configured for
> UDMA/133
> Sep 15 19:34:16 asterisk kernel: [258476.100960] ata3: EH complete
> Sep 15 19:41:29 asterisk asterisk[3492]: rc_avpair_new: unknown
> attribute 1490026597
> Sep 15 19:41:46 asterisk asterisk[3492]: rc_avpair_new: unknown
> attribute 1490026597
> Sep 15 19:41:52 asterisk asterisk[3492]: rc_avpair_new: unknown
> attribute 1490026597
> Sep 15 19:42:52 asterisk asterisk[3492]: rc_avpair_new: unknown
> attribute 1490026597
> Sep 15 19:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2
> Currently unreadable (pending) sectors
> Sep 15 19:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline
> uncorrectable sectors
> Sep 15 19:50:51 asterisk mdadm[2117]: Rebuild26 event detected on md
> device /dev/md/0
> Sep 15 20:07:31 asterisk mdadm[2117]: Rebuild53 event detected on md
> device /dev/md/0
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2
> Currently unreadable (pending) sectors
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline
> uncorrectable sectors
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT],
> Temperature changed +4 Celsius to 42 Celsius (Min/Max 30/46)
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sda [SAT], SMART
> Usage Attribute: 201 Soft_Read_Error_Rate changed from 99 to 100
> Sep 15 20:16:34 asterisk smartd[2581]: Device: /dev/sdb [SAT], SMART
> Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
> Sep 15 20:24:11 asterisk mdadm[2117]: Rebuild75 event detected on md
> device /dev/md/0
> Sep 15 20:40:51 asterisk mdadm[2117]: Rebuild93 event detected on md
> device /dev/md/0
> Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 2
> Currently unreadable (pending) sectors
> Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], 1 Offline
> uncorrectable sectors
> Sep 15 20:46:34 asterisk smartd[2581]: Device: /dev/sda [SAT], SMART
> Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
> Sep 15 20:47:24 asterisk kernel: [262863.781068] md: md0:
> requested-resync done.
> Sep 15 20:47:24 asterisk mdadm[2117]: RebuildFinished event detected on
> md device /dev/md/0
>
>
Okay, so the drive logs an exception at 19:34:11, then completes its
error handling at 19:34:16.
If md hasn't failed the drive then either:
- md didn't get a read error
- md got a success message when re-writing the block
- there's a bug in md and it's not handled the error at all
My guess would be on one of the first two (I'm not sure what's logged if
md gets a read error and does a re-write).
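It can help to know which sector the drive is complaining about. Assuming the "res" line follows libata's usual taskfile layout (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), the failing LBA can be rebuilt from the LBA bytes. This is a sketch under that assumption; ata_res_lba is a made-up helper name.

```shell
# Rebuild the LBA from the taskfile printed after "res" in an ata error.
# $1: e.g. 51/40:00:90:17:00/40:00:00:00:00/e0
# The body runs in a subshell so the IFS change does not leak out.
ata_res_lba() (
    IFS='/:'
    set -- $1
    # Fields after splitting: 1 status, 2 error, 3 nsect,
    # 4 lbal, 5 lbam, 6 lbah, 7 hob_feature, 8 hob_nsect,
    # 9 hob_lbal, 10 hob_lbam, 11 hob_lbah, 12 device.
    # For LBA28 commands bits 24-27 live in the device byte; that nibble
    # is ignored here, which is fine while it is zero as in these logs.
    echo $(( (0x${11} << 40) | (0x${10} << 32) | (0x$9 << 24) |
             (0x$6 << 16) | (0x$5 << 8) | 0x$4 ))
)
```

With the line from the repair log, "ata_res_lba 51/40:00:90:17:00/40:00:00:00:00/e0" prints 6032. The res line from the first dd test in this thread (51/40:00:48:0f:00/40:00:00:00:00/e0) gives 3912, the same value the SMART self-test log shows as LBA_of_first_error.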
>
> I still get:
>
> Num Test_Description Status Remaining
> LifeTime(hours) LBA_of_first_error
> # 1 Offline Completed: read failure 90% 8985
> 3912
>
> and
>
> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always
> - 2
> 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age
> Offline - 1
>
>
> How is it possible? Next thing I will try is manually failing /dev/sda
> and filling it with zeros. I would like to do a *low level format* but I
> didn't find the utility for my disk :(
>
I'm pretty sure there's no such thing as a *low level format* for any
modern disk (or not one that does anything more than writing a known
pattern to the disk). The low-level information is far too precisely
laid out for the disk heads to be able to write.
Writing zeros is certainly what I'd do in this situation - I've done it
for several drives in the past where they've had offline uncorrectable
sectors flagged.
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-15 19:41 ` Robin Hill
@ 2012-09-15 22:06 ` Niccolò Belli
2012-09-16 10:18 ` Robin Hill
2012-09-16 10:42 ` Niccolò Belli
1 sibling, 1 reply; 27+ messages in thread
From: Niccolò Belli @ 2012-09-15 22:06 UTC (permalink / raw)
To: linux-raid
On 15/09/2012 21:41, Robin Hill wrote:
> If md hasn't failed the drive then either:
> - md didn't get a read error
> - md got a success message when re-writing the block
> - there's a bug in md and it's not handled the error at all
It seems it's case one. While manually verifying the checksums with
for i in $(seq 50); do
  dd if=/dev/sda1 of=sda${i} bs=100000 count=50 skip=$((($i-1)*50+10)) > /dev/null 2> /dev/null
  dd if=/dev/sdb1 of=sdb${i} bs=100000 count=50 skip=$((($i-1)*50+10)) > /dev/null 2> /dev/null
  md5sum sda${i}; md5sum sdb${i}; echo
done
I get this in syslog:
Sep 15 23:50:09 asterisk kernel: [273828.407914] scsi_verify_blk_ioctl:
30 callbacks suppressed
Sep 15 23:50:09 asterisk kernel: [273828.407920] dd: sending ioctl
80306d02 to a partition!
Sep 15 23:50:09 asterisk kernel: [273828.407925] dd: sending ioctl
80306d02 to a partition!
Sep 15 23:50:10 asterisk kernel: [273829.422247] ata3.00: exception
Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 15 23:50:10 asterisk kernel: [273829.424071] ata3.00: BMDMA stat 0x44
Sep 15 23:50:10 asterisk kernel: [273829.425855] ata3.00: failed
command: READ DMA
Sep 15 23:50:10 asterisk kernel: [273829.427625] ata3.00: cmd
c8/00:00:68:17:00/00:00:00:00:00/e0 tag 0 dma 131072 in
Sep 15 23:50:10 asterisk kernel: [273829.427627] res
51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 15 23:50:10 asterisk kernel: [273829.431184] ata3.00: status: { DRDY
ERR }
Sep 15 23:50:10 asterisk kernel: [273829.432992] ata3.00: error: { UNC }
Sep 15 23:50:11 asterisk kernel: [273830.404203] ata3.00: configured for
UDMA/133
Sep 15 23:50:11 asterisk kernel: [273830.404217] ata3: EH complete
but this is the output of the command:
b7d4e3c3bb461a1aa6619c22ef11d072 sda1
b7d4e3c3bb461a1aa6619c22ef11d072 sdb1
8649ae5a732bc808f228677b27a1e9b6 sda2
8649ae5a732bc808f228677b27a1e9b6 sdb2
8649ae5a732bc808f228677b27a1e9b6 sda3
8649ae5a732bc808f228677b27a1e9b6 sdb3
8649ae5a732bc808f228677b27a1e9b6 sda4
8649ae5a732bc808f228677b27a1e9b6 sdb4
8649ae5a732bc808f228677b27a1e9b6 sda5
8649ae5a732bc808f228677b27a1e9b6 sdb5
8649ae5a732bc808f228677b27a1e9b6 sda6
8649ae5a732bc808f228677b27a1e9b6 sdb6
8649ae5a732bc808f228677b27a1e9b6 sda7
8649ae5a732bc808f228677b27a1e9b6 sdb7
f2fb77841db5dd577449cfeee07c4108 sda8
f2fb77841db5dd577449cfeee07c4108 sdb8
e311789a1fabd3758694c35c74e20612 sda9
e311789a1fabd3758694c35c74e20612 sdb9
8649ae5a732bc808f228677b27a1e9b6 sda10
8649ae5a732bc808f228677b27a1e9b6 sdb10
8649ae5a732bc808f228677b27a1e9b6 sda11
8649ae5a732bc808f228677b27a1e9b6 sdb11
8649ae5a732bc808f228677b27a1e9b6 sda12
8649ae5a732bc808f228677b27a1e9b6 sdb12
8649ae5a732bc808f228677b27a1e9b6 sda13
8649ae5a732bc808f228677b27a1e9b6 sdb13
8649ae5a732bc808f228677b27a1e9b6 sda14
8649ae5a732bc808f228677b27a1e9b6 sdb14
8649ae5a732bc808f228677b27a1e9b6 sda15
8649ae5a732bc808f228677b27a1e9b6 sdb15
8649ae5a732bc808f228677b27a1e9b6 sda16
8649ae5a732bc808f228677b27a1e9b6 sdb16
8649ae5a732bc808f228677b27a1e9b6 sda17
8649ae5a732bc808f228677b27a1e9b6 sdb17
8649ae5a732bc808f228677b27a1e9b6 sda18
8649ae5a732bc808f228677b27a1e9b6 sdb18
8649ae5a732bc808f228677b27a1e9b6 sda19
8649ae5a732bc808f228677b27a1e9b6 sdb19
8649ae5a732bc808f228677b27a1e9b6 sda20
8649ae5a732bc808f228677b27a1e9b6 sdb20
8649ae5a732bc808f228677b27a1e9b6 sda21
8649ae5a732bc808f228677b27a1e9b6 sdb21
8649ae5a732bc808f228677b27a1e9b6 sda22
8649ae5a732bc808f228677b27a1e9b6 sdb22
8649ae5a732bc808f228677b27a1e9b6 sda23
8649ae5a732bc808f228677b27a1e9b6 sdb23
8649ae5a732bc808f228677b27a1e9b6 sda24
8649ae5a732bc808f228677b27a1e9b6 sdb24
8649ae5a732bc808f228677b27a1e9b6 sda25
8649ae5a732bc808f228677b27a1e9b6 sdb25
8649ae5a732bc808f228677b27a1e9b6 sda26
8649ae5a732bc808f228677b27a1e9b6 sdb26
4531da1579310425e2d3343846f5b16d sda27
4531da1579310425e2d3343846f5b16d sdb27
3721bf34547dc2967741bf6bfbd76670 sda28
3721bf34547dc2967741bf6bfbd76670 sdb28
14a2be518f90d3060b3438ac75d91e7e sda29
14a2be518f90d3060b3438ac75d91e7e sdb29
36fb275af7608d0aff8c7b454168f8c3 sda30
36fb275af7608d0aff8c7b454168f8c3 sdb30
2026b2cf40470f059d264b2c78f3a989 sda31
2026b2cf40470f059d264b2c78f3a989 sdb31
36f825d926a6195c70efabd0a045fce0 sda32
36f825d926a6195c70efabd0a045fce0 sdb32
44be6fdd8adb83f1328d6fa21e72a5f9 sda33
44be6fdd8adb83f1328d6fa21e72a5f9 sdb33
90a771705992c1ba15c17a30520b0b56 sda34
90a771705992c1ba15c17a30520b0b56 sdb34
c37584adcad03dc74b0ea9e431fd78e3 sda35
c37584adcad03dc74b0ea9e431fd78e3 sdb35
f044f24e528316cf5a40e894e7d84c36 sda36
f044f24e528316cf5a40e894e7d84c36 sdb36
4447d6a338fdac8cf179dde83deb7f43 sda37
4447d6a338fdac8cf179dde83deb7f43 sdb37
b4115994e66cb739dc49fedcaf5649eb sda38
b4115994e66cb739dc49fedcaf5649eb sdb38
65c9226105cbba0fd7dbefb9bedac940 sda39
65c9226105cbba0fd7dbefb9bedac940 sdb39
e05366f8be4b66595c2aadbb133c6b4c sda40
e05366f8be4b66595c2aadbb133c6b4c sdb40
afc039520def52590a5fd289b423545a sda41
afc039520def52590a5fd289b423545a sdb41
6d47c3b1265afc3dbbd832d8088501c4 sda42
6d47c3b1265afc3dbbd832d8088501c4 sdb42
749140fe9a80f20dd5449976db66ce0f sda43
749140fe9a80f20dd5449976db66ce0f sdb43
41bd354c1cca819dd4a8d19b8c1a637e sda44
41bd354c1cca819dd4a8d19b8c1a637e sdb44
b2fc15b0147853d76a7c5fe87820d26b sda45
b2fc15b0147853d76a7c5fe87820d26b sdb45
a9b3ac7ac3556950887959dea3b6ae3c sda46
a9b3ac7ac3556950887959dea3b6ae3c sdb46
3daf2ee98c1d3d24f779234f6f7d58d6 sda47
3daf2ee98c1d3d24f779234f6f7d58d6 sdb47
31fe58f24393d199b63102a45b8b44c3 sda48
31fe58f24393d199b63102a45b8b44c3 sdb48
43e0657b350cd60efdf1ca0c8324f85c sda49
43e0657b350cd60efdf1ca0c8324f85c sdb49
94f883b45084b72cd9269a4821b2d509 sda50
94f883b45084b72cd9269a4821b2d509 sdb50
*BUT* if I start reading from the start of the partition (+0 instead of +10
in skip=) I get a mismatch, on both md0 and md1 (which is supposed to
be OK)!!!
root@asterisk:~# i=1; dd if=/dev/sda1 of=sda${i} bs=100000 count=50
skip=$((($i-1)*50+0)) > /dev/null 2> /dev/null; dd if=/dev/sdb1
of=sdb${i} bs=100000 count=50 skip=$((($i-1)*50+0)) > /dev/null 2>
/dev/null; md5sum sda${i}; md5sum sdb${i}
9f9f11ffeb0aed0abc8097417b293f41 sda1
394efde218ad700774bfcb3c43255529 sdb1
root@asterisk:~# i=1; dd if=/dev/sda2 of=sda${i} bs=100000 count=50
skip=$((($i-1)*50+0)) > /dev/null 2> /dev/null; dd if=/dev/sdb2
of=sdb${i} bs=100000 count=50 skip=$((($i-1)*50+0)) > /dev/null 2>
/dev/null; md5sum sda${i}; md5sum sdb${i}
8cb0b6fa2bf7f0f88a2a2a91598429d4 sda1
732c42e14b8e78930d08cdb4f1c49a40 sdb1
Shouldn't raid1 match even at the very beginning of the partition?
On 15/09/2012 22:40, Roberto Spadim wrote:
> today disks aren't expensive, why not change the disk and be happy?
Because I got the problem after a power failure; the disk *should* be OK, I
think.
Cheers,
Niccolò
--
http://www.linuxsystems.it
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-15 22:06 ` Niccolò Belli
@ 2012-09-16 10:18 ` Robin Hill
0 siblings, 0 replies; 27+ messages in thread
From: Robin Hill @ 2012-09-16 10:18 UTC (permalink / raw)
To: Niccolò Belli; +Cc: linux-raid
On Sun Sep 16, 2012 at 12:06:48 +0200, Niccolò Belli wrote:
> On 15/09/2012 21:41, Robin Hill wrote:
> > If md hasn't failed the drive then either:
> > - md didn't get a read error
> > - md got a success message when re-writing the block
> > - there's a bug in md and it's not handled the error at all
>
> It seems it's case one, while manually verifying the checksums with
>
> for i in $(seq 50); do dd if=/dev/sda1 of=sda${i} bs=100000 count=50
> skip=$((($i-1)*50+10)) > /dev/null 2> /dev/null; dd if=/dev/sdb1
> of=sdb${i} bs=100000 count=50 skip=$((($i-1)*50+10)) > /dev/null 2>
> /dev/null; md5sum sda${i}; md5sum sdb${i}; echo; done
>
> I get this in syslog:
>
> Sep 15 23:50:09 asterisk kernel: [273828.407914] scsi_verify_blk_ioctl:
> 30 callbacks suppressed
> Sep 15 23:50:09 asterisk kernel: [273828.407920] dd: sending ioctl
> 80306d02 to a partition!
> Sep 15 23:50:09 asterisk kernel: [273828.407925] dd: sending ioctl
> 80306d02 to a partition!
> Sep 15 23:50:10 asterisk kernel: [273829.422247] ata3.00: exception
> Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Sep 15 23:50:10 asterisk kernel: [273829.424071] ata3.00: BMDMA stat 0x44
> Sep 15 23:50:10 asterisk kernel: [273829.425855] ata3.00: failed
> command: READ DMA
> Sep 15 23:50:10 asterisk kernel: [273829.427625] ata3.00: cmd
> c8/00:00:68:17:00/00:00:00:00:00/e0 tag 0 dma 131072 in
> Sep 15 23:50:10 asterisk kernel: [273829.427627] res
> 51/40:00:90:17:00/40:00:00:00:00/e0 Emask 0x9 (media error)
> Sep 15 23:50:10 asterisk kernel: [273829.431184] ata3.00: status: { DRDY
> ERR }
> Sep 15 23:50:10 asterisk kernel: [273829.432992] ata3.00: error: { UNC }
> Sep 15 23:50:11 asterisk kernel: [273830.404203] ata3.00: configured for
> UDMA/133
> Sep 15 23:50:11 asterisk kernel: [273830.404217] ata3: EH complete
>
>
>
> but this is the output of the command:
>
>
> b7d4e3c3bb461a1aa6619c22ef11d072 sda1
> b7d4e3c3bb461a1aa6619c22ef11d072 sdb1
>
<- snip sets of identical checksums ->
>
> 94f883b45084b72cd9269a4821b2d509 sda50
> 94f883b45084b72cd9269a4821b2d509 sdb50
>
Okay, so it looks like the drive is managing to return the correct data
eventually (or it's returning some default value which has also been
written to the other mirror now).
> *BUT* if I start reading from the start of the partition (+0 instead of +10
> in skip=) I get a mismatch, on both md0 and md1 (which is supposed to
> be OK)!!!
>
> root@asterisk:~# i=1; dd if=/dev/sda1 of=sda${i} bs=100000 count=50
> skip=$((($i-1)*50+0)) > /dev/null 2> /dev/null; dd if=/dev/sdb1
> of=sdb${i} bs=100000 count=50 skip=$((($i-1)*50+0)) > /dev/null 2>
> /dev/null; md5sum sda${i}; md5sum sdb${i}
> 9f9f11ffeb0aed0abc8097417b293f41 sda1
> 394efde218ad700774bfcb3c43255529 sdb1
> root@asterisk:~# i=1; dd if=/dev/sda2 of=sda${i} bs=100000 count=50
> skip=$((($i-1)*50+0)) > /dev/null 2> /dev/null; dd if=/dev/sdb2
> of=sdb${i} bs=100000 count=50 skip=$((($i-1)*50+0)) > /dev/null 2>
> /dev/null; md5sum sda${i}; md5sum sdb${i}
> 8cb0b6fa2bf7f0f88a2a2a91598429d4 sda1
> 732c42e14b8e78930d08cdb4f1c49a40 sdb1
>
> Shouldn't raid1 match even at the very beginning of the partition?
>
No, the start of the partition will contain the md superblock (for 1.1
and 1.2 metadata formats), which will be slightly different for the two
devices.
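A comparison that starts past the metadata avoids this false alarm. The sketch below checksums the same sector range on two devices (or plain files); region_sum and same_region are made-up names, and the right starting sector is an assumption that depends on the array's metadata version and data offset.

```shell
# Checksum a range of 512-byte sectors from a device or file.
# $1: device or file, $2: first sector, $3: number of sectors
region_sum() {
    dd if="$1" bs=512 skip="$2" count="$3" 2>/dev/null | md5sum | cut -d' ' -f1
}

# Succeed only if both inputs hold identical data in the given range.
# $1, $2: the two devices or files, $3: first sector, $4: number of sectors
same_region() {
    [ "$(region_sum "$1" "$3" "$4")" = "$(region_sum "$2" "$3" "$4")" ]
}
```

"mdadm --examine /dev/sda1" reports the superblock version and, for 1.x metadata, the data offset, which gives a safe sector to start comparing from. Note that a checksum match proves nothing if dd itself hit a read error, so its exit status is worth checking in real use.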
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-15 19:41 ` Robin Hill
2012-09-15 22:06 ` Niccolò Belli
@ 2012-09-16 10:42 ` Niccolò Belli
2012-09-16 15:26 ` Chris Murphy
1 sibling, 1 reply; 27+ messages in thread
From: Niccolò Belli @ 2012-09-16 10:42 UTC (permalink / raw)
To: linux-raid
On 15/09/2012 21:41, Robin Hill wrote:
> Writing zeros is certainly what I'd do in this situation - I've done it
> for several drives in the past where they've had offline uncorrectable
> sectors flagged.
I just tried to write zeros, it didn't help: the disk doesn't reallocate
the bad sector :(
--
http://www.linuxsystems.it
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-16 10:42 ` Niccolò Belli
@ 2012-09-16 15:26 ` Chris Murphy
2012-09-16 15:31 ` Niccolò Belli
0 siblings, 1 reply; 27+ messages in thread
From: Chris Murphy @ 2012-09-16 15:26 UTC (permalink / raw)
To: Linux RAID
On Sep 16, 2012, at 4:42 AM, Niccolò Belli wrote:
> On 15/09/2012 21:41, Robin Hill wrote:
>> Writing zeros is certainly what I'd do in this situation - I've done it
>> for several drives in the past where they've had offline uncorrectable
>> sectors flagged.
>
> I just tried to write zeros, it didn't help: the disk doesn't reallocate the bad sector :(
Something isn't right. How did you write zeros?
I went through the archives and wasn't able to find the full smartctl -x results for this drive. Can you post them?
Does anyone know for sure whether the ATA Secure Erase command verifies its writes? That is, does it even have a way of knowing that there are bad sectors on a write, and of removing them from use? Or does write-read verification always occur on hard drives?
Chris Murphy
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-16 15:26 ` Chris Murphy
@ 2012-09-16 15:31 ` Niccolò Belli
2012-09-16 23:35 ` Niccolò Belli
0 siblings, 1 reply; 27+ messages in thread
From: Niccolò Belli @ 2012-09-16 15:31 UTC (permalink / raw)
To: linux-raid
On 16/09/2012 17:26, Chris Murphy wrote:
> Something isn't right. How did you write zeros?
dd if=/dev/zero of=/dev/sda
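A narrower variant of the same idea rewrites only one sector instead of the whole disk. This is a different technique from the full wipe above, shown only as a sketch: zero_sector is a made-up name, the sector number is assumed to be a whole-disk LBA (as SMART reports it), and the write destroys whatever was stored there.

```shell
# Overwrite a single 512-byte sector with zeros. DESTRUCTIVE.
# $1: whole-disk device (or a file, for testing), $2: sector number (LBA)
zero_sector() {
    # conv=notrunc keeps a regular file from being cut short after the
    # written sector; on a block device it makes no difference.
    dd if=/dev/zero of="$1" bs=512 seek="$2" count=1 conv=notrunc 2>/dev/null
}
```

Whether the drive then remaps the sector is up to its firmware; re-reading the attributes with "smartctl -A /dev/sda" afterwards shows whether the pending count changed.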
> I went through the archives and wasn't able to find the full smartctl -x results for this drive, can you post them?
root@asterisk:~# smartctl -x /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-2-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F1 DT
Device Model: SAMSUNG HD322HJ
Serial Number: S17AJDWQ402689
LU WWN Device Id: 5 0000f0 003046298
Firmware Version: 1AC01110
User Capacity: 320,072,933,376 bytes [320 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Sun Sep 16 17:29:50 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x06) Offline data collection activity
                                        was aborted by the device with a fatal error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 114) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                ( 3888) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:           ( 2) minutes.
Extended self-test routine
recommended polling time:          ( 66) minutes.
Conveyance self-test routine
recommended polling time:           ( 8) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 099 099 051 - 712
3 Spin_Up_Time POS--- 094 094 011 - 2810
4 Start_Stop_Count -O--CK 099 099 000 - 1077
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 253 253 051 - 0
8 Seek_Time_Performance P-S--K 100 100 015 - 9508
9 Power_On_Hours -O--CK 098 098 000 - 9006
10 Spin_Retry_Count PO--CK 100 100 051 - 0
11 Calibration_Retry_Count -O--C- 100 100 000 - 0
12 Power_Cycle_Count -O--CK 099 099 000 - 1077
13 Read_Soft_Error_Rate -OSR-- 099 099 000 - 654
183 Runtime_Bad_Block -O--CK 100 100 000 - 0
184 End-to-End_Error PO--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 908
188 Command_Timeout -O--CK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 063 055 000 - 37 (Min/Max 28/45)
194 Temperature_Celsius -O---K 063 054 000 - 37 (Min/Max 28/46)
195 Hardware_ECC_Recovered -O-RC- 100 100 000 - 988053162
196 Reallocated_Event_Count -O--CK 100 100 000 - 0
197 Current_Pending_Sector -O--C- 100 100 000 - 3
198 Offline_Uncorrectable ----CK 100 100 000 - 1
199 UDMA_CRC_Error_Count -OSRCK 100 100 000 - 0
200 Multi_Zone_Error_Rate -O-R-- 100 100 000 - 0
201 Soft_Read_Error_Rate -O-R-- 095 095 000 - 440
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 2 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 2 sectors [Extended self-test log]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]
SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 450 (device log contains only the most recent 8 errors)
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 450 [1] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:03:29.664 READ DMA
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:29.664 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 21d+23:03:29.654 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 21d+23:03:29.654 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:29.654 READ NATIVE MAX ADDRESS EXT
Error 449 [0] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:03:27.714 READ DMA
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:27.714 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 21d+23:03:27.714 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 21d+23:03:27.714 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:27.714 READ NATIVE MAX ADDRESS EXT
Error 448 [7] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:03:25.774 READ DMA
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:25.774 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 21d+23:03:25.774 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 21d+23:03:25.774 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:25.764 READ NATIVE MAX ADDRESS EXT
Error 447 [6] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:03:23.804 READ DMA
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:23.804 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 21d+23:03:23.794 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 21d+23:03:23.794 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:23.794 READ NATIVE MAX ADDRESS EXT
Error 446 [5] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:03:21.824 READ DMA
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:21.824 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 21d+23:03:21.814 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 21d+23:03:21.814 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:03:21.814 READ NATIVE MAX ADDRESS EXT
Error 445 [4] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:03:20.254 READ DMA
c8 00 00 00 08 00 00 00 00 0f 40 e0 08 21d+23:03:20.254 READ DMA
c8 00 00 00 08 00 00 00 00 0f 38 e0 08 21d+23:03:20.254 READ DMA
c8 00 00 00 08 00 00 00 00 0f 30 e0 08 21d+23:03:20.254 READ DMA
c8 00 00 00 08 00 00 00 00 0f 28 e0 08 21d+23:03:20.254 READ DMA
Error 444 [3] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:02:10.594 READ DMA
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:02:10.594 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 21d+23:02:10.594 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 21d+23:02:10.594 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:02:10.594 READ NATIVE MAX ADDRESS EXT
Error 443 [2] occurred at disk power-on lifetime: 9001 hours (375 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 00 00 0f 48 e0 00 Error: UNC at LBA = 0x00000f48 = 3912
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
c8 00 00 00 08 00 00 00 00 0f 48 e0 08 21d+23:02:08.654 READ DMA
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:02:08.654 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 08 21d+23:02:08.654 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 21d+23:02:08.654 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 08 21d+23:02:08.654 READ NATIVE MAX ADDRESS EXT
SMART Extended Self-test Log Version: 0 (2 sectors)
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed: read failure  20%        8991             3912
# 2  Offline           Aborted by host          90%        8985             -
# 3  Offline           Aborted by host          90%        8981             -
# 4  Offline           Aborted by host          90%        8981             -
# 5  Extended offline  Aborted by host          90%        8980             -
# 6  Extended offline  Aborted by host          90%        8980             -
# 7  Short offline     Aborted by host          20%        8980             -
# 8  Short offline     Aborted by host          20%        8980             -
# 9  Extended offline  Aborted by host          90%        8968             -
#10  Short offline     Aborted by host          20%        8967             -
#11  Short offline     Aborted by host          20%        8943             -
#12  Short offline     Aborted by host          20%        8919             -
#13  Short offline     Aborted by host          20%        8895             -
#14  Short offline     Aborted by host          20%        8871             -
#15  Short offline     Aborted by host          20%        8847             -
#16  Short offline     Aborted by host          20%        8823             -
#17  Extended offline  Aborted by host          90%        8800             -
#18  Short offline     Aborted by host          20%        8799             -
#19  Short offline     Aborted by host          20%        8775             -
#20  Short offline     Aborted by host          20%        8751             -
#21  Short offline     Aborted by host          20%        8727             -
Note: selective self-test log revision number (0) not 1 implies that no
selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever
been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 2
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 37 Celsius
Power Cycle Max Temperature: 46 Celsius
Lifetime Max Temperature: 46 Celsius
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: -4/72 Celsius
Min/Max Temperature Limit: -9/77 Celsius
Temperature History Size (Index): 128 (36)
Index Estimated Time Temperature Celsius
37 2012-09-16 15:22 37 ******************
... ..(126 skipped). .. ******************
36 2012-09-16 17:29 37 ******************
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 24 Device-to-host register FISes sent due to a COMRESET
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 32 Transition from drive PhyRdy to drive PhyNRdy
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0010 2 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 2 0 R_ERR response for host-to-device non-data FIS, non-CRC
--
http://www.linuxsystems.it
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-16 15:31 ` Niccolò Belli
@ 2012-09-16 23:35 ` Niccolò Belli
2012-09-17 0:00 ` Chris Murphy
0 siblings, 1 reply; 27+ messages in thread
From: Niccolò Belli @ 2012-09-16 23:35 UTC (permalink / raw)
To: linux-raid
I finally managed to reallocate the sectors!
I first tried sg3-utils, but got:
root@asterisk:~# sg_reassign --address=3912 /dev/sda
REASSIGN BLOCKS not supported
Then I read this on the smartmontools mailing list:
<<
Possibly what is happening is that because he is only writing a partial
block, the OS is first trying to read the original block so that it
can preserve the parts that won't be changing. When this operation fails,
it blocks the write that would trigger reallocation of the bad sector.
Writing using the OS blocksize (typically 4096 on linux systems) properly
aligned should work around that issue.
>>
so I tried with
dd if=/dev/zero of=/dev/sda bs=4096
and ta-daaa! :D
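For anyone who would rather not zero the whole disk: a minimal sketch of how the aligned-write idea could be narrowed to the one block that holds the bad sector. It is not what was run in this thread. It assumes 512-byte logical sectors (as smartctl reports for this drive), a 4096-byte OS block size, and that the LBA is counted from the start of the whole disk, as in the smartctl error log. The function names are made up for illustration. The script only prints the dd command; running that command destroys the data in that block, so on a RAID1 member the array would have to resync it from the other disk.

```shell
#!/bin/sh
# Sketch: map a 512-byte LBA to the 4096-byte block that contains it,
# then print (not run) a dd command that overwrites only that block.
# Assumptions: 512-byte logical sectors, 4096-byte OS block size,
# LBA counted from the start of the whole disk.

SECTOR_SIZE=512
BLOCK_SIZE=4096

# Print the index of the 4096-byte block containing the given LBA.
lba_to_block() {
    echo $(( $1 * SECTOR_SIZE / BLOCK_SIZE ))
}

# Print a dd command that zeroes exactly one aligned block.
# Review it by hand before running it as root.
print_zero_cmd() {
    dev=$1
    lba=$2
    echo "dd if=/dev/zero of=$dev bs=$BLOCK_SIZE seek=$(lba_to_block "$lba") count=1 oflag=direct"
}

# LBA 3912 is the failing address from the smartctl error log.
print_zero_cmd /dev/sda 3912
```

oflag=direct is a GNU dd option that bypasses the page cache, so the kernel should have no reason to read the old block contents before writing.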
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - *0*
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 1
The pending sectors are gone!
Running smartctl -t offline /dev/sda then cleared Offline_Uncorrectable too:
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - *0*
For the sake of Google:
Reallocated_Event_Count
This counts the remap events the drive has performed (the number of
remapped sectors itself is Reallocated_Sector_Ct). We're hoping to get
the hard disk to increase this number!
Current_Pending_Sector
The number of sectors that the drive thinks are dodgy. Bear in mind
sometimes drives change their mind about whether a sector is bad or not,
so this number can go down without a reallocation occurring.
Offline_Uncorrectable
This is the number of sectors that the drive has attempted to
correct itself, but failed. Running the command:
smartctl -t offline /dev/hda
should cause the drive to test the sectors and attempt to fix them.
Not all drives support this though.
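To watch these counters before and after a rewrite, something like the following sketch may help. It assumes the classic 10-column "smartctl -A" attribute table (ID in column 1, name in column 2, raw value in column 10). The brief table printed by "smartctl -x" has a different column count and would need a different field number. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: pull the raw values of the reallocation-related attributes out of
# "smartctl -A" output read from stdin.
# Assumption: 10-column attribute table; column 10 is the raw value.

smart_raw_values() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

# Usage on a live system (needs root):
#   smartctl -A /dev/sda | smart_raw_values
```

Comparing the output from before and after the dd run shows whether the pending count dropped and whether the reallocation counters moved.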
Thanks for helping!
Niccolò
--
http://www.linuxsystems.it
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-16 23:35 ` Niccolò Belli
@ 2012-09-17 0:00 ` Chris Murphy
2012-09-17 0:03 ` Niccolò Belli
0 siblings, 1 reply; 27+ messages in thread
From: Chris Murphy @ 2012-09-17 0:00 UTC (permalink / raw)
To: Linux RAID
On Sep 16, 2012, at 5:35 PM, Niccolò Belli wrote:
>
> then I read this on the smartmontools mailing list:
>
> <<
> Possibly what is happening is that because he is only writing a partial
> block, the OS is first trying to read the original block so that it
> can preserve the parts that won't be changing. When this operation fails,
> it blocks the write that would trigger reallocation of the bad sector.
> Writing using the OS blocksize (typically 4096 on linux systems) properly
> aligned should work around that issue.
> >>
>
> so I tried with
>
> dd if=/dev/zero of=/dev/sda bs=4096
>
> and ta-daaa!
Useful info. Obviously Secure Erasing a disk takes a while and is overkill for just one sector. But I'm still curious whether anyone knows for sure if ATA Secure Erase will remove bad sectors from use.
Chris Murphy
* Re: raid1 issue after disk failure: both disks of the array are still active
2012-09-14 7:16 ` Mikael Abrahamsson
2012-09-14 7:45 ` Niccolò Belli
@ 2012-09-14 8:13 ` NeilBrown
1 sibling, 0 replies; 27+ messages in thread
From: NeilBrown @ 2012-09-14 8:13 UTC (permalink / raw)
To: Mikael Abrahamsson; +Cc: Chris Murphy, Linux RAID
On Fri, 14 Sep 2012 09:16:20 +0200 (CEST) Mikael Abrahamsson
<swmike@swm.pp.se> wrote:
> On Thu, 13 Sep 2012, Chris Murphy wrote:
>
> > "check" records errors, no action is taken by the md driver to correct
> > it, although the disk firmware itself may try reallocation. So far, that
> > appears to not be the case.
> >
> > "repair" causes the md driver to write correct data (from copy or
> > reconstructed from parity), which should force the disk firmware to
> > reallocate the affected LBAs from bad physical sectors to good ones.
> >
> > It seems in this case "repair" is indicated.
>
> I was under the impression that "check" would check if all data blocks and
> parity are correct, and record if there is a parity mismatch. This would
> then be corrected by using "repair" at a later time.
>
> I was also under the impression that if there was a read error on a drive
> during "check", that read error would be corrected using parity because
> it's obviously a hard error, not a logical error.
Both of your impressions are correct.
NeilBrown
>
> Could you (or someone else) please confirm that my impression is wrong and
> if there indeed is a hard read error using "check", this will not be
> corrected? I would be interested in knowing why this decision was taken to
> have this behaviour, as I feel that if there is a hard read error, this
> should always be corrected using parity.
>
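The check/repair distinction discussed in this message can be exercised through md's sysfs files. A minimal sketch, assuming an array named md0 (a placeholder) whose sysfs directory holds the kernel's sync_action and mismatch_cnt files; the helper name is made up for illustration. The helper only reads state, and the commands that start a scrub are left as comments.

```shell
#!/bin/sh
# Sketch: report the scrub state of an md array from its sysfs directory.
# Assumption: the directory (e.g. /sys/block/md0/md) contains the
# sync_action and mismatch_cnt files exposed by the md driver.

md_scrub_status() {
    md_dir=$1
    echo "sync_action=$(cat "$md_dir/sync_action") mismatch_cnt=$(cat "$md_dir/mismatch_cnt")"
}

# On a live system (needs root; md0 is a placeholder array name):
#   echo check > /sys/block/md0/md/sync_action    # read everything, count mismatches
#   md_scrub_status /sys/block/md0/md             # poll until sync_action is idle
#   echo repair > /sys/block/md0/md/sync_action   # also rewrite mismatched blocks
```

As confirmed above, a read error hit during "check" is already corrected from the other copy or from parity; "repair" is only needed for mismatches between copies.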