* RAID Issues - RAID10 working but with errors
From: Adam Goryachev @ 2020-04-02 2:28 UTC
To: linux-raid
Hi all,
I've got a fairly old system which has been working reliably for a long
time, and I've become somewhat lazy in maintaining it, but since it's
still in production, and rather important, I figured it was time for a
checkup while I'm basically confined.
I have run a raid check, and these are the logs from the system:
[11243683.671268] md: data-check of RAID array md1
[11243683.671305] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[11243683.671339] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[11243683.671417] md: using 128k window, over a total of 3867693056k.
[11244378.142926] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[11244378.142986] ata8.00: irq_stat 0x40000008
[11244378.143018] ata8.00: failed command: READ FPDMA QUEUED
[11244378.143057] ata8.00: cmd 60/00:b8:00:93:5c/0a:00:0a:00:00/40 tag 23 ncq dma 1310720 in
         res 41/40:00:12:97:5c/00:00:0a:00:00/40 Emask 0x409 (media error) <F>
[11244378.143166] ata8.00: status: { DRDY ERR }
[11244378.143196] ata8.00: error: { UNC }
[11244378.150605] ata8.00: configured for UDMA/133
[11244378.150706] sd 7:0:0:0: [sdf] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11244378.150783] sd 7:0:0:0: [sdf] tag#23 Sense Key : Medium Error [current]
[11244378.150821] sd 7:0:0:0: [sdf] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[11244378.150881] sd 7:0:0:0: [sdf] tag#23 CDB: Read(10) 28 00 0a 5c 93 00 00 0a 00 00
[11244378.150936] blk_update_request: I/O error, dev sdf, sector 173840146
[11244378.150990] ata8: EH complete
[11245290.555454] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[11245290.555513] ata8.00: irq_stat 0x40000008
[11245290.555545] ata8.00: failed command: READ FPDMA QUEUED
[11245290.555584] ata8.00: cmd 60/00:88:80:35:e0/0a:00:14:00:00/40 tag 17 ncq dma 1310720 in
         res 41/40:00:b2:3b:e0/00:00:14:00:00/40 Emask 0x409 (media error) <F>
[11245290.555693] ata8.00: status: { DRDY ERR }
[11245290.555723] ata8.00: error: { UNC }
[11245290.563543] ata8.00: configured for UDMA/133
[11245290.563643] sd 7:0:0:0: [sdf] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245290.563703] sd 7:0:0:0: [sdf] tag#17 Sense Key : Medium Error [current]
[11245290.563741] sd 7:0:0:0: [sdf] tag#17 Add. Sense: Unrecovered read error - auto reallocate failed
[11245290.563802] sd 7:0:0:0: [sdf] tag#17 CDB: Read(10) 28 00 14 e0 35 80 00 0a 00 00
[11245290.563857] blk_update_request: I/O error, dev sdf, sector 350239666
[11245290.563915] ata8: EH complete
[11245297.098980] ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[11245297.099039] ata6.00: irq_stat 0x40000008
[11245297.099072] ata6.00: failed command: READ FPDMA QUEUED
[11245297.099110] ata6.00: cmd 60/00:70:00:fa:ee/02:00:14:00:00/40 tag 14 ncq dma 262144 in
         res 41/40:00:38:fa:ee/00:00:14:00:00/40 Emask 0x409 (media error) <F>
[11245297.099219] ata6.00: status: { DRDY ERR }
[11245297.099249] ata6.00: error: { UNC }
[11245297.108981] ata6.00: configured for UDMA/133
[11245297.109064] sd 5:0:0:0: [sdd] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245297.109124] sd 5:0:0:0: [sdd] tag#14 Sense Key : Medium Error [current]
[11245297.109161] sd 5:0:0:0: [sdd] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
[11245297.109222] sd 5:0:0:0: [sdd] tag#14 CDB: Read(10) 28 00 14 ee fa 00 00 02 00 00
[11245297.109276] blk_update_request: I/O error, dev sdd, sector 351205944
[11245297.109333] ata6: EH complete
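As a cross-check on the log above, the Read(10) CDB bytes can be decoded to confirm that each reported failing sector lies inside the request that failed. This is only a minimal sketch: the function names are made up for illustration, and it assumes 512-byte logical sectors (which is what smartctl reports for these drives).

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB given as space-separated hex bytes.

    Returns (starting LBA, transfer length in blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: starting LBA
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, count


def sector_in_request(cdb_hex, failed_sector):
    """True if the failed sector falls inside the range read by the CDB."""
    lba, count = decode_read10(cdb_hex)
    return lba <= failed_sector < lba + count


# First sdf error from the log: CDB and the sector from blk_update_request.
lba, count = decode_read10("28 00 0a 5c 93 00 00 0a 00 00")
print(lba, count, count * 512)  # start LBA, blocks, bytes (assuming 512 B sectors)
print(sector_in_request("28 00 0a 5c 93 00 00 0a 00 00", 173840146))
```

For the first sdf error this gives a 2560-block read starting at LBA 173839104, i.e. 1310720 bytes, which matches the "dma 1310720" in the ata8.00 cmd line, and the failing sector 173840146 is inside that range.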
The rest of the log is included below; it repeats the same kind of errors
for both sdd and sdf, and no other drives are mentioned.
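To check that only sdd and sdf are affected without reading the whole log by eye, the I/O error lines can be tallied per device. The following is a rough sketch; the regular expression is an assumption based on the "I/O error, dev ..., sector ..." lines shown above, and the sample list simply repeats three of those lines.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   blk_update_request: I/O error, dev sdf, sector 173840146
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def tally_io_errors(lines):
    """Count block-layer I/O errors per device name."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[11244378.150936] blk_update_request: I/O error, dev sdf, sector 173840146",
    "[11245290.563857] blk_update_request: I/O error, dev sdf, sector 350239666",
    "[11245297.109276] blk_update_request: I/O error, dev sdd, sector 351205944",
    "[11245297.109333] ata6: EH complete",
]
print(tally_io_errors(sample))
```

In practice the same function could be fed the lines of the saved kernel log instead of the sample list.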
Is there a way to determine whether this is an HDD error (i.e., two drives
that are both failing), a cabling issue (affecting just these two drives),
or some strange driver/motherboard issue?
I notice in the output below MD is showing a number of bad blocks on the
drives, and logs suggest that the drives have run out of "spare" space
to re-allocate these to.
Is there some test I could run on the drive itself to narrow down the
issue?
Should I replace one of these drives with the spare?
Is there some method to check the status of the spare to ensure it is
working properly?
Any other suggestions?
##################################################################
Kernel: 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6 (2018-10-08) x86_64 GNU/Linux
All RAID drives are attached to this controller (lspci output):
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA AHCI Controller (rev 05) (prog-if 01 [AHCI 1.0])
Subsystem: Intel Corporation Server Board S1200BTS
Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 31
I/O ports at 5090 [size=8]
I/O ports at 5080 [size=4]
I/O ports at 5070 [size=8]
I/O ports at 5060 [size=4]
I/O ports at 5020 [size=32]
Memory at c1c40000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [70] Power Management version 3
Capabilities: [a8] SATA HBA v1.0
Capabilities: [b0] PCI Advanced Features
Kernel driver in use: ahci
Kernel modules: ahci
##################################################################
cat /proc/mdstat
Personalities : [raid1] [raid10] [linear] [multipath] [raid0] [raid6] [raid5] [raid4]
md1 : active raid10 sdd2[2] sde2[3](S) sdb2[0] sdf2[4] sdc2[1]
3867693056 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
bitmap: 7/29 pages [28KB], 65536KB chunk
##################################################################
mdadm --misc --examine /dev/sdd2
/dev/sdd2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : da434497:622f4dc4:5d3861c0:4cf322fd
Name : san2.websitemanagers.com.au:1
Creation Time : Tue Mar 1 01:04:23 2016
Raid Level : raid10
Raid Devices : 4
Avail Dev Size : 3867693232 (1844.26 GiB 1980.26 GB)
Array Size : 3867693056 (3688.52 GiB 3960.52 GB)
Used Dev Size : 3867693056 (1844.26 GiB 1980.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=176 sectors
State : active
Device UUID : 56c20ce1:c61ec674:e69e2a29:3ced20c7
Internal Bitmap : 8 sectors from superblock
Update Time : Thu Apr 2 12:58:32 2020
Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
Checksum : 1ad84fb - correct
Events : 443256
Layout : near=2
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
##################################################################
mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Tue Mar 1 01:04:23 2016
Raid Level : raid10
Array Size : 3867693056 (3688.52 GiB 3960.52 GB)
Used Dev Size : 1933846528 (1844.26 GiB 1980.26 GB)
Raid Devices : 4
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Apr 2 12:55:22 2020
State : active
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1
Layout : near=2
Chunk Size : 512K
Name : san2.websitemanagers.com.au:1
UUID : da434497:622f4dc4:5d3861c0:4cf322fd
Events : 443256
Number Major Minor RaidDevice State
0 8 18 0 active sync set-A /dev/sdb2
1 8 34 1 active sync set-B /dev/sdc2
2 8 50 2 active sync set-A /dev/sdd2
4 8 82 3 active sync set-B /dev/sdf2
3 8 66 - spare /dev/sde2
##################################################################
smartctl -x /dev/sdd
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4
Device Model: WDC WD2003FYYS-02W0B0
Serial Number: WD-WMAY00922575
LU WWN Device Id: 5 0014ee 0ad395cea
Firmware Version: 01.01D01
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Thu Apr 2 12:56:11 2020 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Disabled
APM level is: 254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test
routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (30180) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 307) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 23
3 Spin_Up_Time POS--K 253 253 021 - 8583
4 Start_Stop_Count -O--CK 100 100 000 - 77
5 Reallocated_Sector_Ct PO--CK 184 184 140 - 126
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 017 017 000 - 61089
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 67
192 Power-Off_Retract_Count -O--CK 200 200 000 - 48
193 Load_Cycle_Count -O--CK 200 200 000 - 28
194 Temperature_Celsius -O---K 118 105 000 - 34
196 Reallocated_Event_Count -O--CK 095 095 000 - 105
197 Current_Pending_Sector -O--CK 200 200 000 - 21
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 SATA NCQ Queued Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb5 GPL,SL VS 1 Device vendor specific log
0xb6 GPL VS 1 Device vendor specific log
0xb7 GPL,SL VS 1 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 24 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 765 (device log contains only the most recent 24 errors)
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
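The 49.710-day wrap mentioned above is what one would get from a 32-bit millisecond counter, which is the assumption made in this short sketch. It also shows how a Powered_Up_Time string from the error entries could be turned into milliseconds; the helper name is hypothetical.

```python
import re

MS_PER_DAY = 24 * 60 * 60 * 1000

# A 32-bit millisecond counter overflows after 2**32 ms.
wrap_days = 2 ** 32 / MS_PER_DAY
print(f"{wrap_days:.3f} days")

STAMP = re.compile(r"(\d+)d\+(\d+):(\d+):(\d+)\.(\d+)")


def powered_up_ms(stamp):
    """Convert a 'DDd+hh:mm:SS.sss' timestamp to integer milliseconds."""
    match = STAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    days, hours, minutes, seconds, millis = (int(g) for g in match.groups())
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds * 1000 + millis


print(powered_up_ms("30d+15:33:32.976"))
```

Because of the wrap, these timestamps only order events within one counter period; the "power-on lifetime" hours on each error entry are the better guide to when an error happened.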
Error 765 [20] occurred at disk power-on lifetime: 61077 hours (2544
days + 21 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 36 8e d1 98 40 00 Error: UNC at LBA =
0x368ed198 = 915329432
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 00 80 00 78 00 00 36 8f 43 80 40 08 30d+15:33:32.976 READ FPDMA
QUEUED
60 00 80 00 70 00 00 36 8f 43 00 40 08 30d+15:33:32.976 READ FPDMA
QUEUED
60 00 80 00 68 00 00 36 8f 42 80 40 08 30d+15:33:32.976 READ FPDMA
QUEUED
60 00 80 00 60 00 00 36 8f 42 00 40 08 30d+15:33:32.976 READ FPDMA
QUEUED
60 00 80 00 58 00 00 36 8f 41 80 40 08 30d+15:33:32.976 READ FPDMA
QUEUED
Error 764 [19] occurred at disk power-on lifetime: 61077 hours (2544
days + 21 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 26 cd e5 56 40 00 Error: UNC at LBA =
0x26cde556 = 651027798
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 0a 00 00 b0 00 00 26 ce 35 00 40 08 30d+15:08:38.213 READ FPDMA
QUEUED
60 0a 00 00 a8 00 00 26 ce 2b 00 40 08 30d+15:08:38.203 READ FPDMA
QUEUED
60 0a 00 00 a0 00 00 26 ce 21 00 40 08 30d+15:08:38.194 READ FPDMA
QUEUED
60 0a 00 00 98 00 00 26 ce 17 00 40 08 30d+15:08:38.185 READ FPDMA
QUEUED
60 09 00 00 90 00 00 26 ce 0e 00 40 08 30d+15:08:38.176 READ FPDMA
QUEUED
Error 763 [18] occurred at disk power-on lifetime: 61077 hours (2544
days + 21 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 09 f0 00 00 23 5a 13 92 40 00 Error: UNC at LBA =
0x235a1392 = 593105810
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 0a 00 00 a0 00 00 23 5a 92 00 40 08 30d+15:03:34.673 READ FPDMA
QUEUED
60 00 80 00 98 00 00 23 5a 91 80 40 08 30d+15:03:34.673 READ FPDMA
QUEUED
60 00 80 00 90 00 00 23 5a 91 00 40 08 30d+15:03:34.095 READ FPDMA
QUEUED
60 00 80 00 88 00 00 23 5a 90 80 40 08 30d+15:03:34.095 READ FPDMA
QUEUED
60 00 80 00 80 00 00 23 5a 90 00 40 08 30d+15:03:34.095 READ FPDMA
QUEUED
Error 762 [17] occurred at disk power-on lifetime: 61076 hours (2544
days + 20 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 06 00 00 00 1b 67 f3 60 40 00 Error: UNC at LBA =
0x1b67f360 = 459797344
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 09 00 00 18 00 00 1b 68 77 00 40 08 30d+14:51:09.346 READ FPDMA
QUEUED
60 0a 00 00 10 00 00 1b 68 6d 00 40 08 30d+14:51:09.346 READ FPDMA
QUEUED
60 08 80 00 08 00 00 1b 68 64 80 40 08 30d+14:51:09.335 READ FPDMA
QUEUED
60 05 80 00 00 00 00 1b 68 5f 00 40 08 30d+14:51:09.327 READ FPDMA
QUEUED
60 09 00 00 f0 00 00 1b 68 56 00 40 08 30d+14:51:09.326 READ FPDMA
QUEUED
Error 761 [16] occurred at disk power-on lifetime: 61076 hours (2544
days + 20 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 02 00 00 00 14 ee fa 38 40 00 Error: UNC at LBA =
0x14eefa38 = 351205944
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 0a 00 00 60 00 00 14 ef 38 80 40 08 30d+14:41:21.903 READ FPDMA
QUEUED
60 0a 00 00 58 00 00 14 ef 2e 80 40 08 30d+14:41:21.903 READ FPDMA
QUEUED
60 08 00 00 50 00 00 14 ef 26 80 40 08 30d+14:41:21.901 READ FPDMA
QUEUED
60 0a 00 00 48 00 00 14 ef 1c 80 40 08 30d+14:41:21.892 READ FPDMA
QUEUED
60 00 80 00 40 00 00 14 ef 1c 00 40 08 30d+14:41:21.886 READ FPDMA
QUEUED
Error 760 [15] occurred at disk power-on lifetime: 60311 hours (2512
days + 23 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 2b e2 10 8a 40 00 Error: UNC at LBA =
0x2be2108a = 736235658
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 0a 00 00 b0 00 00 2b e2 0a 00 40 08 48d+10:17:23.966 READ FPDMA
QUEUED
60 0a 00 00 a8 00 00 2b e2 00 00 40 08 48d+10:17:23.957 READ FPDMA
QUEUED
60 0a 00 00 a0 00 00 2b e1 f6 00 40 08 48d+10:17:23.947 READ FPDMA
QUEUED
60 0a 00 00 98 00 00 2b e1 ec 00 40 08 48d+10:17:23.938 READ FPDMA
QUEUED
60 0a 00 00 90 00 00 2b e1 e2 00 40 08 48d+10:17:23.928 READ FPDMA
QUEUED
Error 759 [14] occurred at disk power-on lifetime: 60311 hours (2512
days + 23 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 18 00 00 23 5a 13 59 40 00 Error: UNC at LBA =
0x235a1359 = 593105753
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 00 18 00 20 00 00 23 5a 13 50 40 08 48d+10:05:03.788 READ FPDMA
QUEUED
60 00 08 00 18 00 00 23 5a 13 38 40 08 48d+10:05:02.381 READ FPDMA
QUEUED
60 03 00 00 10 00 00 23 5a 10 00 40 08 48d+10:05:01.879 READ FPDMA
QUEUED
60 0a 00 00 08 00 00 23 5a 06 00 40 08 48d+10:05:01.869 READ FPDMA
QUEUED
60 0a 00 00 00 00 00 23 59 fc 00 40 08 48d+10:05:01.860 READ FPDMA
QUEUED
Error 758 [13] occurred at disk power-on lifetime: 60311 hours (2512
days + 23 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 22 26 06 1a 40 00 Error: UNC at LBA =
0x2226061a = 572917274
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 0a 00 00 78 00 00 22 25 ff 80 40 08 48d+10:03:15.973 READ FPDMA
QUEUED
60 0a 00 00 70 00 00 22 25 f5 80 40 08 48d+10:03:15.964 READ FPDMA
QUEUED
60 0a 00 00 68 00 00 22 25 eb 80 40 08 48d+10:03:15.955 READ FPDMA
QUEUED
60 0a 00 00 60 00 00 22 25 e1 80 40 08 48d+10:03:15.946 READ FPDMA
QUEUED
60 0a 00 00 58 00 00 22 25 d7 80 40 08 48d+10:03:15.928 READ FPDMA
QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 34 Celsius
Power Cycle Min/Max Temperature: 31/42 Celsius
Lifetime Min/Max Temperature: 31/47 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (265)
Index Estimated Time Temperature Celsius
266 2020-04-02 04:59 34 ***************
... ..(280 skipped). .. ***************
69 2020-04-02 09:40 34 ***************
70 2020-04-02 09:41 35 ****************
... ..(129 skipped). .. ****************
200 2020-04-02 11:51 35 ****************
201 2020-04-02 11:52 34 ***************
... ..( 63 skipped). .. ***************
265 2020-04-02 12:56 34 ***************
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x8000 4 11280431 Vendor specific
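One rough way to separate a media problem from a cabling/link problem is to compare the link-level counters (UDMA_CRC_Error_Count, and the ICRC / CRC entries in the Phy event counters) with the media-level ones (reallocated and pending sectors, UNC errors in the drive's own error log). The sketch below is only a heuristic under that assumption, not a definitive diagnosis; the function and its labels are invented for illustration, and the numbers are the ones smartctl reported for sdd above.

```python
def classify(attrs, phy_crc_events=0, unc_log_errors=0):
    """Very rough heuristic: 'media', 'link', 'both' or 'inconclusive'.

    attrs maps SMART attribute names to their raw values.
    phy_crc_events is the sum of the CRC-related Phy event counters.
    unc_log_errors is the number of UNC errors in the drive's error log.
    """
    link = attrs.get("UDMA_CRC_Error_Count", 0) + phy_crc_events
    media = (
        attrs.get("Reallocated_Sector_Ct", 0)
        + attrs.get("Current_Pending_Sector", 0)
        + attrs.get("Offline_Uncorrectable", 0)
        + unc_log_errors
    )
    if link and media:
        return "both"
    if link:
        return "link"
    if media:
        return "media"
    return "inconclusive"


sdd = {
    "Reallocated_Sector_Ct": 126,
    "Current_Pending_Sector": 21,
    "Offline_Uncorrectable": 0,
    "UDMA_CRC_Error_Count": 0,
}
# Phy counters 0x0001 (ICRC) and 0x000b (CRC) are both 0 for sdd.
print(classify(sdd, phy_crc_events=0, unc_log_errors=765))
```

A drive whose counters are all zero comes out as "inconclusive" here, so a surface scan or a long self-test (smartctl -t long) would still be needed to say more.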
##################################################################
smartctl -x /dev/sdf
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4
Device Model: WDC WD2003FYYS-02W0B0
Serial Number: WD-WMAY00611922
LU WWN Device Id: 5 0014ee 0ad2d9d92
Firmware Version: 01.01D01
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Thu Apr 2 12:57:00 2020 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Disabled
APM level is: 254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test
routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (28500) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 290) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 253 253 021 - 7350
4 Start_Stop_Count -O--CK 100 100 000 - 73
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 051 051 000 - 36231
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 64
192 Power-Off_Retract_Count -O--CK 200 200 000 - 46
193 Load_Cycle_Count -O--CK 200 200 000 - 26
194 Temperature_Celsius -O---K 118 094 000 - 34
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 SATA NCQ Queued Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb5 GPL,SL VS 1 Device vendor specific log
0xb6 GPL VS 1 Device vendor specific log
0xb7 GPL,SL VS 1 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 24 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 79 (device log contains only the most recent 24 errors)
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 79 [6] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 2b cc 8c e9 40 00 Error: UNC at LBA =
0x2bcc8ce9 = 734825705
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 0a 00 00 58 00 00 2b cc 86 00 40 08 30d+15:17:33.518 READ FPDMA
QUEUED
60 0a 00 00 50 00 00 2b cc 7c 00 40 08 30d+15:17:33.518 READ FPDMA
QUEUED
60 0a 00 00 48 00 00 2b cc 72 00 40 08 30d+15:17:33.518 READ FPDMA
QUEUED
60 0a 00 00 40 00 00 2b cc 68 00 40 08 30d+15:17:33.518 READ FPDMA
QUEUED
60 0a 00 00 38 00 00 2b cc 5e 00 40 08 30d+15:17:33.518 READ FPDMA
QUEUED
Error 78 [5] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 2a 25 4e 10 40 00 Error: UNC at LBA =
0x2a254e10 = 707087888
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 09 80 00 c8 00 00 2a 25 9a 00 40 08 30d+15:15:04.767 READ FPDMA
QUEUED
60 06 00 00 c0 00 00 2a 25 94 00 40 08 30d+15:15:04.750 READ FPDMA
QUEUED
60 0a 00 00 b8 00 00 2a 25 8a 00 40 08 30d+15:15:04.741 READ FPDMA
QUEUED
60 0a 00 00 b0 00 00 2a 25 80 00 40 08 30d+15:15:04.732 READ FPDMA
QUEUED
60 0a 00 00 a8 00 00 2a 25 76 00 40 08 30d+15:15:04.732 READ FPDMA
QUEUED
Error 77 [4] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 1e 5a 3e 86 40 00 Error: UNC at LBA =
0x1e5a3e86 = 509230726
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 00 80 00 40 00 00 1e 5a b5 00 40 08 30d+14:57:14.678 READ FPDMA
QUEUED
60 00 80 00 38 00 00 1e 5a b4 80 40 08 30d+14:57:14.678 READ FPDMA
QUEUED
60 00 80 00 30 00 00 1e 5a b4 00 40 08 30d+14:57:14.678 READ FPDMA
QUEUED
60 00 80 00 28 00 00 1e 5a b3 80 40 08 30d+14:57:14.677 READ FPDMA
QUEUED
60 00 80 00 20 00 00 1e 5a b3 00 40 08 30d+14:57:14.677 READ FPDMA
QUEUED
Error 76 [3] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
04 -- 51 00 00 00 00 ae ee 79 1f ae 00
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
ea 00 00 00 00 00 00 00 00 00 00 e0 08 30d+14:56:32.314 FLUSH CACHE EXT
61 0a 00 00 d8 00 00 1e 0c 77 80 40 08 30d+14:56:32.308 WRITE FPDMA
QUEUED
61 00 08 00 d0 00 00 02 54 38 10 40 08 30d+14:56:32.308 WRITE FPDMA
QUEUED
ea 00 00 00 00 00 00 00 00 00 00 e0 08 30d+14:56:32.261 FLUSH CACHE EXT
60 0a 00 00 08 00 00 1e 0c 81 80 40 08 30d+14:56:32.181 READ FPDMA
QUEUED
Error 75 [2] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 1e 0c 77 d3 40 00 Error: UNC at LBA =
0x1e0c77d3 = 504133587
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 08 80 00 c8 00 00 1e 0c 9f 80 40 08 30d+14:56:28.883 READ FPDMA
QUEUED
60 0a 00 00 c0 00 00 1e 0c 95 80 40 08 30d+14:56:28.883 READ FPDMA
QUEUED
60 0a 00 00 b8 00 00 1e 0c 8b 80 40 08 30d+14:56:28.883 READ FPDMA
QUEUED
60 0a 00 00 b0 00 00 1e 0c 81 80 40 08 30d+14:56:28.883 READ FPDMA
QUEUED
60 0a 00 00 a8 00 00 1e 0c 77 80 40 08 30d+14:56:28.883 READ FPDMA
QUEUED
Error 74 [1] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 1d e4 17 e7 40 00 Error: UNC at LBA =
0x1de417e7 = 501487591
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 02 00 00 a8 00 00 1d e4 52 00 40 08 30d+14:56:10.974 READ FPDMA
QUEUED
60 0a 00 00 a0 00 00 1d e4 48 00 40 08 30d+14:56:10.974 READ FPDMA
QUEUED
60 0a 00 00 98 00 00 1d e4 3e 00 40 08 30d+14:56:10.974 READ FPDMA
QUEUED
60 0a 00 00 90 00 00 1d e4 34 00 40 08 30d+14:56:10.974 READ FPDMA
QUEUED
60 0a 00 00 88 00 00 1d e4 2a 00 40 08 30d+14:56:10.974 READ FPDMA
QUEUED
Error 73 [0] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 1d c0 73 99 40 00 Error: UNC at LBA =
0x1dc07399 = 499151769
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 0a 00 00 78 00 00 1d c0 c2 80 40 08 30d+14:55:55.699 READ FPDMA
QUEUED
60 0a 00 00 70 00 00 1d c0 b8 80 40 08 30d+14:55:55.693 READ FPDMA
QUEUED
60 00 80 00 68 00 00 1d c0 b8 00 40 08 30d+14:55:55.684 READ FPDMA
QUEUED
60 02 00 00 60 00 00 1d c0 b6 00 40 08 30d+14:55:55.627 READ FPDMA
QUEUED
60 00 80 00 58 00 00 1d c0 b5 80 40 08 30d+14:55:55.627 READ FPDMA
QUEUED
Error 72 [23] occurred at disk power-on lifetime: 36219 hours (1509 days
+ 3 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 1d 23 fc 01 40 00 Error: UNC at LBA =
0x1d23fc01 = 488897537
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 06 80 00 b0 00 00 1d 24 12 00 40 08 30d+14:54:51.370 READ FPDMA
QUEUED
60 0a 00 00 a8 00 00 1d 24 08 00 40 08 30d+14:54:51.370 READ FPDMA
QUEUED
60 0a 00 00 a0 00 00 1d 23 fe 00 40 08 30d+14:54:51.370 READ FPDMA
QUEUED
60 0a 00 00 98 00 00 1d 23 f4 00 40 08 30d+14:54:51.369 READ FPDMA
QUEUED
60 0a 00 00 90 00 00 1d 23 ea 00 40 08 30d+14:54:51.369 READ FPDMA
QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 34 Celsius
Power Cycle Min/Max Temperature: 32/43 Celsius
Lifetime Min/Max Temperature: 32/58 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (407)
Index Estimated Time Temperature Celsius
408 2020-04-02 05:00 35 ****************
... ..( 59 skipped). .. ****************
468 2020-04-02 06:00 35 ****************
469 2020-04-02 06:01 34 ***************
... ..(219 skipped). .. ***************
211 2020-04-02 09:41 34 ***************
212 2020-04-02 09:42 35 ****************
... ..(194 skipped). .. ****************
407 2020-04-02 12:57 35 ****************
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x8000 4 11280539 Vendor specific
##################################################################
System logs:
[11245659.844193] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11245659.844253] ata8.00: irq_stat 0x40000008
[11245659.844286] ata8.00: failed command: READ FPDMA QUEUED
[11245659.844325] ata8.00: cmd 60/00:f0:80:33:15/0a:00:19:00:00/40 tag
30 ncq dma 1310720 in
res 41/40:00:a7:3b:15/00:00:19:00:00/40
Emask 0x409 (media error) <F>
[11245659.844434] ata8.00: status: { DRDY ERR }
[11245659.844464] ata8.00: error: { UNC }
[11245659.851322] ata8.00: configured for UDMA/133
[11245659.851424] sd 7:0:0:0: [sdf] tag#30 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245659.851484] sd 7:0:0:0: [sdf] tag#30 Sense Key : Medium Error
[current]
[11245659.851521] sd 7:0:0:0: [sdf] tag#30 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245659.851581] sd 7:0:0:0: [sdf] tag#30 CDB: Read(10) 28 00 19 15 33
80 00 0a 00 00
[11245659.851636] blk_update_request: I/O error, dev sdf, sector 420821927
[11245659.851684] ata8: EH complete
[11245662.543958] ata8.00: exception Emask 0x0 SAct 0xc0000 SErr 0x0
action 0x0
[11245662.544000] ata8.00: irq_stat 0x40000008
[11245662.544032] ata8.00: failed command: READ FPDMA QUEUED
[11245662.544071] ata8.00: cmd 60/80:90:00:ee:15/08:00:19:00:00/40 tag
18 ncq dma 1114112 in
res 41/40:00:5b:f3:15/00:00:19:00:00/40
Emask 0x409 (media error) <F>
[11245662.544179] ata8.00: status: { DRDY ERR }
[11245662.544209] ata8.00: error: { UNC }
[11245662.551701] ata8.00: configured for UDMA/133
[11245662.551766] sd 7:0:0:0: [sdf] tag#18 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245662.551826] sd 7:0:0:0: [sdf] tag#18 Sense Key : Medium Error
[current]
[11245662.551864] sd 7:0:0:0: [sdf] tag#18 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245662.551924] sd 7:0:0:0: [sdf] tag#18 CDB: Read(10) 28 00 19 15 ee
00 00 08 80 00
[11245662.551979] blk_update_request: I/O error, dev sdf, sector 420868955
[11245662.552031] ata8: EH complete
[11245666.003705] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11245666.003763] ata8.00: irq_stat 0x40000008
[11245666.003795] ata8.00: failed command: READ FPDMA QUEUED
[11245666.003834] ata8.00: cmd 60/00:d0:80:e2:16/0a:00:19:00:00/40 tag
26 ncq dma 1310720 in
res 41/40:00:ae:e6:16/00:00:19:00:00/40
Emask 0x409 (media error) <F>
[11245666.007915] ata8.00: status: { DRDY ERR }
[11245666.007947] ata8.00: error: { UNC }
[11245666.015145] ata8.00: configured for UDMA/133
[11245666.015245] sd 7:0:0:0: [sdf] tag#26 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245666.015305] sd 7:0:0:0: [sdf] tag#26 Sense Key : Medium Error
[current]
[11245666.015343] sd 7:0:0:0: [sdf] tag#26 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245666.015415] sd 7:0:0:0: [sdf] tag#26 CDB: Read(10) 28 00 19 16 e2
80 00 0a 00 00
[11245666.015471] blk_update_request: I/O error, dev sdf, sector 420931246
[11245666.015523] ata8: EH complete
[11245667.895545] ata8.00: exception Emask 0x0 SAct 0x39fc0000 SErr 0x0
action 0x0
[11245667.895602] ata8.00: irq_stat 0x40000008
[11245667.895635] ata8.00: failed command: READ FPDMA QUEUED
[11245667.895673] ata8.00: cmd 60/00:e8:80:f6:16/0a:00:19:00:00/40 tag
29 ncq dma 1310720 in
res 41/40:00:b8:fa:16/00:00:19:00:00/40
Emask 0x409 (media error) <F>
[11245667.895782] ata8.00: status: { DRDY ERR }
[11245667.895813] ata8.00: error: { UNC }
[11245667.903392] ata8.00: configured for UDMA/133
[11245667.903468] sd 7:0:0:0: [sdf] tag#29 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245667.903527] sd 7:0:0:0: [sdf] tag#29 Sense Key : Medium Error
[current]
[11245667.903565] sd 7:0:0:0: [sdf] tag#29 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245667.903625] sd 7:0:0:0: [sdf] tag#29 CDB: Read(10) 28 00 19 16 f6
80 00 0a 00 00
[11245667.903680] blk_update_request: I/O error, dev sdf, sector 420936376
[11245667.903729] ata8: EH complete
[11245670.779337] ata8.00: exception Emask 0x0 SAct 0x80000 SErr 0x0
action 0x0
[11245670.779380] ata8.00: irq_stat 0x40000008
[11245670.779412] ata8.00: failed command: READ FPDMA QUEUED
[11245670.779451] ata8.00: cmd 60/00:98:00:96:17/06:00:19:00:00/40 tag
19 ncq dma 786432 in
res 41/40:00:03:99:17/00:00:19:00:00/40
Emask 0x409 (media error) <F>
[11245670.779560] ata8.00: status: { DRDY ERR }
[11245670.779590] ata8.00: error: { UNC }
[11245670.787087] ata8.00: configured for UDMA/133
[11245670.787148] sd 7:0:0:0: [sdf] tag#19 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245670.787208] sd 7:0:0:0: [sdf] tag#19 Sense Key : Medium Error
[current]
[11245670.787246] sd 7:0:0:0: [sdf] tag#19 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245670.787306] sd 7:0:0:0: [sdf] tag#19 CDB: Read(10) 28 00 19 17 96
00 00 06 00 00
[11245670.787361] blk_update_request: I/O error, dev sdf, sector 420976899
[11245670.787413] ata8: EH complete
[11245673.595136] ata8.00: exception Emask 0x0 SAct 0x3fff8000 SErr 0x0
action 0x0
[11245673.595196] ata8.00: irq_stat 0x40000008
[11245673.595228] ata8.00: failed command: READ FPDMA QUEUED
[11245673.595267] ata8.00: cmd 60/00:78:80:ee:17/0a:00:19:00:00/40 tag
15 ncq dma 1310720 in
res 41/40:00:c2:ef:17/00:00:19:00:00/40
Emask 0x409 (media error) <F>
[11245673.595376] ata8.00: status: { DRDY ERR }
[11245673.595406] ata8.00: error: { UNC }
[11245673.603723] ata8.00: configured for UDMA/133
[11245673.603799] sd 7:0:0:0: [sdf] tag#15 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245673.603859] sd 7:0:0:0: [sdf] tag#15 Sense Key : Medium Error
[current]
[11245673.603897] sd 7:0:0:0: [sdf] tag#15 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245673.603957] sd 7:0:0:0: [sdf] tag#15 CDB: Read(10) 28 00 19 17 ee
80 00 0a 00 00
[11245673.604012] blk_update_request: I/O error, dev sdf, sector 420999106
[11245673.604077] ata8: EH complete
[11245675.542997] ata8.00: exception Emask 0x0 SAct 0x7e SErr 0x0 action 0x0
[11245675.543039] ata8.00: irq_stat 0x40000008
[11245675.543071] ata8.00: failed command: READ FPDMA QUEUED
[11245675.543109] ata8.00: cmd 60/80:30:00:48:18/09:00:19:00:00/40 tag 6
ncq dma 1245184 in
res 41/40:00:f2:50:18/00:00:19:00:00/40
Emask 0x409 (media error) <F>
[11245675.543217] ata8.00: status: { DRDY ERR }
[11245675.543248] ata8.00: error: { UNC }
[11245675.550967] ata8.00: configured for UDMA/133
[11245675.551042] sd 7:0:0:0: [sdf] tag#6 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[11245675.551102] sd 7:0:0:0: [sdf] tag#6 Sense Key : Medium Error
[current]
[11245675.551140] sd 7:0:0:0: [sdf] tag#6 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245675.551199] sd 7:0:0:0: [sdf] tag#6 CDB: Read(10) 28 00 19 18 48
00 00 09 80 00
[11245675.551255] blk_update_request: I/O error, dev sdf, sector 421023986
[11245675.551303] ata8: EH complete
[11245885.111468] ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11245885.111528] ata6.00: irq_stat 0x40000008
[11245885.111560] ata6.00: failed command: READ FPDMA QUEUED
[11245885.111598] ata6.00: cmd 60/00:20:00:ee:67/06:00:1b:00:00/40 tag 4
ncq dma 786432 in
res 41/40:00:60:f3:67/00:00:1b:00:00/40
Emask 0x409 (media error) <F>
[11245885.111707] ata6.00: status: { DRDY ERR }
[11245885.111737] ata6.00: error: { UNC }
[11245885.121706] ata6.00: configured for UDMA/133
[11245885.121806] sd 5:0:0:0: [sdd] tag#4 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[11245885.121865] sd 5:0:0:0: [sdd] tag#4 Sense Key : Medium Error
[current]
[11245885.121903] sd 5:0:0:0: [sdd] tag#4 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245885.121961] sd 5:0:0:0: [sdd] tag#4 CDB: Read(10) 28 00 1b 67 ee
00 00 06 00 00
[11245885.122016] blk_update_request: I/O error, dev sdd, sector 459797344
[11245885.122074] ata6: EH complete
[11245953.390426] ata8.00: exception Emask 0x0 SAct 0x7ffffc0f SErr 0x0
action 0x0
[11245953.390486] ata8.00: irq_stat 0x40000008
[11245953.390518] ata8.00: failed command: READ FPDMA QUEUED
[11245953.390557] ata8.00: cmd 60/00:50:80:7f:19/09:00:1c:00:00/40 tag
10 ncq dma 1179648 in
res 41/40:00:03:88:19/00:00:1c:00:00/40
Emask 0x409 (media error) <F>
[11245953.390665] ata8.00: status: { DRDY ERR }
[11245953.390695] ata8.00: error: { UNC }
[11245953.397630] ata8.00: configured for UDMA/133
[11245953.397713] sd 7:0:0:0: [sdf] tag#10 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11245953.397780] sd 7:0:0:0: [sdf] tag#10 Sense Key : Medium Error
[current]
[11245953.397817] sd 7:0:0:0: [sdf] tag#10 Add. Sense: Unrecovered read
error - auto reallocate failed
[11245953.397878] sd 7:0:0:0: [sdf] tag#10 CDB: Read(10) 28 00 1c 19 7f
80 00 09 00 00
[11245953.397932] blk_update_request: I/O error, dev sdf, sector 471435267
[11245953.397987] ata8: EH complete
[11246048.451449] ata8.00: exception Emask 0x0 SAct 0x780000 SErr 0x0
action 0x0
[11246048.451492] ata8.00: irq_stat 0x40000008
[11246048.451524] ata8.00: failed command: READ FPDMA QUEUED
[11246048.451563] ata8.00: cmd 60/00:98:00:f4:23/0a:00:1d:00:00/40 tag
19 ncq dma 1310720 in
res 41/40:00:01:fc:23/00:00:1d:00:00/40
Emask 0x409 (media error) <F>
[11246048.451670] ata8.00: status: { DRDY ERR }
[11246048.451700] ata8.00: error: { UNC }
[11246048.458682] ata8.00: configured for UDMA/133
[11246048.458749] sd 7:0:0:0: [sdf] tag#19 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11246048.458809] sd 7:0:0:0: [sdf] tag#19 Sense Key : Medium Error
[current]
[11246048.458847] sd 7:0:0:0: [sdf] tag#19 Add. Sense: Unrecovered read
error - auto reallocate failed
[11246048.458908] sd 7:0:0:0: [sdf] tag#19 CDB: Read(10) 28 00 1d 23 f4
00 00 0a 00 00
[11246048.458962] blk_update_request: I/O error, dev sdf, sector 488897537
[11246048.459015] ata8: EH complete
[11246112.786663] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11246112.786724] ata8.00: irq_stat 0x40000008
[11246112.786756] ata8.00: failed command: READ FPDMA QUEUED
[11246112.786795] ata8.00: cmd 60/00:88:80:73:c0/0a:00:1d:00:00/40 tag
17 ncq dma 1310720 in
res 41/40:00:99:73:c0/00:00:1d:00:00/40
Emask 0x409 (media error) <F>
[11246112.786904] ata8.00: status: { DRDY ERR }
[11246112.786934] ata8.00: error: { UNC }
[11246112.793935] ata8.00: configured for UDMA/133
[11246112.794026] sd 7:0:0:0: [sdf] tag#17 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11246112.794086] sd 7:0:0:0: [sdf] tag#17 Sense Key : Medium Error
[current]
[11246112.794124] sd 7:0:0:0: [sdf] tag#17 Add. Sense: Unrecovered read
error - auto reallocate failed
[11246112.794184] sd 7:0:0:0: [sdf] tag#17 CDB: Read(10) 28 00 1d c0 73
80 00 0a 00 00
[11246112.794239] blk_update_request: I/O error, dev sdf, sector 499151769
[11246112.794302] ata8: EH complete
[11246128.145504] ata8.00: exception Emask 0x0 SAct 0x3f8000 SErr 0x0
action 0x0
[11246128.145546] ata8.00: irq_stat 0x40000008
[11246128.145579] ata8.00: failed command: READ FPDMA QUEUED
[11246128.145617] ata8.00: cmd 60/00:78:00:16:e4/0a:00:1d:00:00/40 tag
15 ncq dma 1310720 in
res 41/40:00:e7:17:e4/00:00:1d:00:00/40
Emask 0x409 (media error) <F>
[11246128.145726] ata8.00: status: { DRDY ERR }
[11246128.145756] ata8.00: error: { UNC }
[11246128.157322] ata8.00: configured for UDMA/133
[11246128.157386] sd 7:0:0:0: [sdf] tag#15 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11246128.157445] sd 7:0:0:0: [sdf] tag#15 Sense Key : Medium Error
[current]
[11246128.157483] sd 7:0:0:0: [sdf] tag#15 Add. Sense: Unrecovered read
error - auto reallocate failed
[11246128.157543] sd 7:0:0:0: [sdf] tag#15 CDB: Read(10) 28 00 1d e4 16
00 00 0a 00 00
[11246128.157598] blk_update_request: I/O error, dev sdf, sector 501487591
[11246128.157655] ata8: EH complete
[11246147.488076] ata8.00: exception Emask 0x0 SAct 0x3e00000 SErr 0x0
action 0x0
[11246147.488136] ata8.00: irq_stat 0x40000008
[11246147.488168] ata8.00: failed command: READ FPDMA QUEUED
[11246147.488207] ata8.00: cmd 60/00:a8:80:77:0c/0a:00:1e:00:00/40 tag
21 ncq dma 1310720 in
res 41/40:00:d3:77:0c/00:00:1e:00:00/40
Emask 0x409 (media error) <F>
[11246147.488316] ata8.00: status: { DRDY ERR }
[11246147.488346] ata8.00: error: { UNC }
[11246147.496292] ata8.00: configured for UDMA/133
[11246147.496358] sd 7:0:0:0: [sdf] tag#21 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11246147.496418] sd 7:0:0:0: [sdf] tag#21 Sense Key : Medium Error
[current]
[11246147.496456] sd 7:0:0:0: [sdf] tag#21 Add. Sense: Unrecovered read
error - auto reallocate failed
[11246147.496517] sd 7:0:0:0: [sdf] tag#21 CDB: Read(10) 28 00 1e 0c 77
80 00 0a 00 00
[11246147.496572] blk_update_request: I/O error, dev sdf, sector 504133587
[11246147.496635] ata8: EH complete
[11246154.639292] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[11246154.639333] ata8.00: irq_stat 0x40000001
[11246154.639365] ata8.00: failed command: FLUSH CACHE EXT
[11246154.639404] ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 29
res 51/04:00:1f:79:ee/00:00:00:00:00/ae
Emask 0x1 (device error)
[11246154.639491] ata8.00: status: { DRDY ERR }
[11246154.639521] ata8.00: error: { ABRT }
[11246154.647149] ata8.00: configured for UDMA/133
[11246154.647192] ata8: EH complete
[11246191.856803] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11246191.856862] ata8.00: irq_stat 0x40000008
[11246191.856894] ata8.00: failed command: READ FPDMA QUEUED
[11246191.856932] ata8.00: cmd 60/00:48:00:3b:5a/0a:00:1e:00:00/40 tag 9
ncq dma 1310720 in
res 41/40:00:86:3e:5a/00:00:1e:00:00/40
Emask 0x409 (media error) <F>
[11246191.857042] ata8.00: status: { DRDY ERR }
[11246191.857072] ata8.00: error: { UNC }
[11246191.864899] ata8.00: configured for UDMA/133
[11246191.864990] sd 7:0:0:0: [sdf] tag#9 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[11246191.865050] sd 7:0:0:0: [sdf] tag#9 Sense Key : Medium Error
[current]
[11246191.865088] sd 7:0:0:0: [sdf] tag#9 Add. Sense: Unrecovered read
error - auto reallocate failed
[11246191.865147] sd 7:0:0:0: [sdf] tag#9 CDB: Read(10) 28 00 1e 5a 3b
00 00 0a 00 00
[11246191.865203] blk_update_request: I/O error, dev sdf, sector 509230726
[11246191.865265] ata8: EH complete
[11246632.028233] ata6.00: exception Emask 0x0 SAct 0x7f1fffff SErr 0x0
action 0x0
[11246632.028293] ata6.00: irq_stat 0x40000008
[11246632.028326] ata6.00: failed command: READ FPDMA QUEUED
[11246632.028365] ata6.00: cmd 60/f0:c0:90:13:5a/09:00:23:00:00/40 tag
24 ncq dma 1302528 in
res 41/40:00:92:13:5a/00:00:23:00:00/40
Emask 0x409 (media error) <F>
[11246632.028474] ata6.00: status: { DRDY ERR }
[11246632.028504] ata6.00: error: { UNC }
[11246632.038079] ata6.00: configured for UDMA/133
[11246632.038171] sd 5:0:0:0: [sdd] tag#24 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11246632.038230] sd 5:0:0:0: [sdd] tag#24 Sense Key : Medium Error
[current]
[11246632.038268] sd 5:0:0:0: [sdd] tag#24 Add. Sense: Unrecovered read
error - auto reallocate failed
[11246632.038328] sd 5:0:0:0: [sdd] tag#24 CDB: Read(10) 28 00 23 5a 13
90 00 09 f0 00
[11246632.038383] blk_update_request: I/O error, dev sdd, sector 593105810
[11246632.038439] ata6: EH complete
[11246934.977824] ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11246934.977884] ata6.00: irq_stat 0x40000008
[11246934.977916] ata6.00: failed command: READ FPDMA QUEUED
[11246934.977955] ata6.00: cmd 60/00:c0:00:de:cd/0a:00:26:00:00/40 tag
24 ncq dma 1310720 in
res 41/40:00:56:e5:cd/00:00:26:00:00/40
Emask 0x409 (media error) <F>
[11246934.978064] ata6.00: status: { DRDY ERR }
[11246934.978094] ata6.00: error: { UNC }
[11246934.986281] ata6.00: configured for UDMA/133
[11246934.986390] sd 5:0:0:0: [sdd] tag#24 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11246934.986450] sd 5:0:0:0: [sdd] tag#24 Sense Key : Medium Error
[current]
[11246934.986488] sd 5:0:0:0: [sdd] tag#24 Add. Sense: Unrecovered read
error - auto reallocate failed
[11246934.986549] sd 5:0:0:0: [sdd] tag#24 CDB: Read(10) 28 00 26 cd de
00 00 0a 00 00
[11246934.986603] blk_update_request: I/O error, dev sdd, sector 651027798
[11246934.986656] ata6: EH complete
[11247262.925561] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11247262.925621] ata8.00: irq_stat 0x40000008
[11247262.925653] ata8.00: failed command: READ FPDMA QUEUED
[11247262.925692] ata8.00: cmd 60/00:d8:80:46:25/0a:00:2a:00:00/40 tag
27 ncq dma 1310720 in
res 41/40:00:10:4e:25/00:00:2a:00:00/40
Emask 0x409 (media error) <F>
[11247262.925801] ata8.00: status: { DRDY ERR }
[11247262.925831] ata8.00: error: { UNC }
[11247262.934244] ata8.00: configured for UDMA/133
[11247262.934345] sd 7:0:0:0: [sdf] tag#27 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11247262.934405] sd 7:0:0:0: [sdf] tag#27 Sense Key : Medium Error
[current]
[11247262.934443] sd 7:0:0:0: [sdf] tag#27 Add. Sense: Unrecovered read
error - auto reallocate failed
[11247262.934503] sd 7:0:0:0: [sdf] tag#27 CDB: Read(10) 28 00 2a 25 46
80 00 0a 00 00
[11247262.934558] blk_update_request: I/O error, dev sdf, sector 707087888
[11247262.934608] ata8: EH complete
[11247412.132945] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11247412.133004] ata8.00: irq_stat 0x40000008
[11247412.133036] ata8.00: failed command: READ FPDMA QUEUED
[11247412.133075] ata8.00: cmd 60/00:58:00:86:cc/0a:00:2b:00:00/40 tag
11 ncq dma 1310720 in
res 41/40:00:e9:8c:cc/00:00:2b:00:00/40
Emask 0x409 (media error) <F>
[11247412.133184] ata8.00: status: { DRDY ERR }
[11247412.133214] ata8.00: error: { UNC }
[11247412.140402] ata8.00: configured for UDMA/133
[11247412.140495] sd 7:0:0:0: [sdf] tag#11 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11247412.140556] sd 7:0:0:0: [sdf] tag#11 Sense Key : Medium Error
[current]
[11247412.140594] sd 7:0:0:0: [sdf] tag#11 Add. Sense: Unrecovered read
error - auto reallocate failed
[11247412.140655] sd 7:0:0:0: [sdf] tag#11 CDB: Read(10) 28 00 2b cc 86
00 00 0a 00 00
[11247412.140710] blk_update_request: I/O error, dev sdf, sector 734825705
[11247412.140768] ata8: EH complete
[11248431.854940] ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
action 0x0
[11248431.855000] ata6.00: irq_stat 0x40000008
[11248431.855032] ata6.00: failed command: READ FPDMA QUEUED
[11248431.855071] ata6.00: cmd 60/00:80:00:cb:8e/0a:00:36:00:00/40 tag
16 ncq dma 1310720 in
res 41/40:00:98:d1:8e/00:00:36:00:00/40
Emask 0x409 (media error) <F>
[11248431.855179] ata6.00: status: { DRDY ERR }
[11248431.855209] ata6.00: error: { UNC }
[11248431.864561] ata6.00: configured for UDMA/133
[11248431.864659] sd 5:0:0:0: [sdd] tag#16 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11248431.864719] sd 5:0:0:0: [sdd] tag#16 Sense Key : Medium Error
[current]
[11248431.864757] sd 5:0:0:0: [sdd] tag#16 Add. Sense: Unrecovered read
error - auto reallocate failed
[11248431.864817] sd 5:0:0:0: [sdd] tag#16 CDB: Read(10) 28 00 36 8e cb
00 00 0a 00 00
[11248431.864872] blk_update_request: I/O error, dev sdd, sector 915329432
[11248431.864929] ata6: EH complete
[11269366.179625] md: md1: data-check done.
[11269366.356954] RAID10 conf printout:
[11269366.356957] --- wd:4 rd:4
[11269366.356959] disk 0, wo:0, o:1, dev:sdb2
[11269366.356960] disk 1, wo:0, o:1, dev:sdc2
[11269366.356962] disk 2, wo:0, o:1, dev:sdd2
[11269366.356963] disk 3, wo:0, o:1, dev:sdf2
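With this many error blocks it can help to tally the failing sectors per device rather than read the log by eye. The following is a minimal Python sketch, not part of the original report; it assumes the kernel log has been saved as text and that the block-layer lines keep the `blk_update_request: I/O error, dev ..., sector ...` form shown above. The function name `failed_sectors` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the block-layer error lines as they appear in the log above.
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def failed_sectors(log_text):
    """Return {device: [sector, ...]} for every block-layer I/O error line."""
    by_dev = defaultdict(list)
    for dev, sector in IO_ERROR.findall(log_text):
        by_dev[dev].append(int(sector))
    return dict(by_dev)


# Three lines copied from the log above.
sample = """\
[11245659.851636] blk_update_request: I/O error, dev sdf, sector 420821927
[11245885.122016] blk_update_request: I/O error, dev sdd, sector 459797344
[11246048.458962] blk_update_request: I/O error, dev sdf, sector 488897537
"""

for dev, sectors in sorted(failed_sectors(sample).items()):
    print(dev, len(sectors), sectors)
```

Feeding the full log instead of the three sample lines would give one list per member disk, which makes it easier to see whether the errors cluster on one drive.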
* Re: RAID Issues - RAID10 working but with errors
2020-04-02 2:28 RAID Issues - RAID10 working but with errors Adam Goryachev
@ 2020-04-02 8:49 ` Reindl Harald
2020-04-02 14:26 ` John Stoffel
2020-04-02 9:19 ` Wolfgang Denk
1 sibling, 1 reply; 7+ messages in thread
From: Reindl Harald @ 2020-04-02 8:49 UTC (permalink / raw)
To: Adam Goryachev, linux-raid
Am 02.04.20 um 04:28 schrieb Adam Goryachev:
> Is there a method to determine if this is a HDD error (ie, 2 drives that
> have errors) or a cabling issue (with just these two drives) or some
> strange driver/motherboard issue?
just start with the cheapest option: cables
if nothing changes, swap the ports the disks are connected to; you will
find out it's the motherboard when a drive that was completely fine
before suddenly shows the same issue
> I notice in the output below MD is showing a number of bad blocks on the
> drives, and logs suggest that the drives have run out of "spare" space
> to re-allocate these to.
from the moment any software stack tells you about bad blocks on
drives: run fast and replace fast
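One rough way to triage the "drive or cable?" question from the kernel log alone is sketched below in Python. It rests on an assumption that is not stated in this thread: a nonzero `SErr` value or an `ICRC` error flag hints at a link problem, while `UNC` with `SErr 0x0` hints at the media. The function name `classify` and its labels are made up for illustration, and the result is a hint, not a diagnosis.

```python
import re

# libata exception header, e.g. "ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0"
EXCEPTION = re.compile(
    r"(ata\d+)(?:\.\d+)?: exception Emask \S+ SAct \S+ SErr (0x[0-9a-fA-F]+)"
)
# libata error flags, e.g. "ata8.00: error: { UNC }"
ERROR_FLAGS = re.compile(r"(ata\d+)(?:\.\d+)?: error: \{ ([A-Z ]+?) \}")


def classify(log_text):
    """Heuristically label each ATA port as 'media', 'link' or 'other'."""
    verdict = {}
    for port, serr in EXCEPTION.findall(log_text):
        if int(serr, 16) != 0:  # assumption: nonzero SError suggests link trouble
            verdict[port] = "link"
    for port, flags in ERROR_FLAGS.findall(log_text):
        names = flags.split()
        if "ICRC" in names:  # interface CRC error
            verdict[port] = "link"
        elif "UNC" in names:  # uncorrectable read error
            verdict.setdefault(port, "media")
        else:
            verdict.setdefault(port, "other")
    return verdict


# Lines copied from the logs in this thread.
sample = """\
[11245659.844193] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
[11245659.844464] ata8.00: error: { UNC }
[11245885.111468] ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0
[11245885.111737] ata6.00: error: { UNC }
"""

print(classify(sample))
```

On these excerpts both ports come out as "media". That does not rule out cabling, so swapping cables first is still a cheap test.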
* Re: RAID Issues - RAID10 working but with errors
2020-04-02 8:49 ` Reindl Harald
@ 2020-04-02 14:26 ` John Stoffel
0 siblings, 0 replies; 7+ messages in thread
From: John Stoffel @ 2020-04-02 14:26 UTC (permalink / raw)
To: Reindl Harald; +Cc: Adam Goryachev, linux-raid
>>>>> "Reindl" == Reindl Harald <h.reindl@thelounge.net> writes:
Reindl> Am 02.04.20 um 04:28 schrieb Adam Goryachev:
>> Is there a method to determine if this is a HDD error (ie, 2 drives that
>> have errors) or a cabling issue (with just these two drives) or some
>> strange driver/motherboard issue?
Reindl> just start with the cheapest option: cables
Reindl> if nothing changes, swap the ports the disks are connected to; you will
Reindl> find out it's the motherboard when a drive that was completely fine
Reindl> before suddenly shows the same issue
>> I notice in the output below MD is showing a number of bad blocks on the
>> drives, and logs suggest that the drives have run out of "spare" space
>> to re-allocate these to.
Reindl> from the moment any software stack tells you about bad blocks on
Reindl> drives: run fast and replace fast
Hear, hear! When the drive is saying it's not happy... go get some
more now. I wouldn't even question the bad blocks you're seeing; that
means the drive is dying.
John
* Re: RAID Issues - RAID10 working but with errors
2020-04-02 2:28 RAID Issues - RAID10 working but with errors Adam Goryachev
2020-04-02 8:49 ` Reindl Harald
@ 2020-04-02 9:19 ` Wolfgang Denk
2020-04-02 11:20 ` Phil Turmel
1 sibling, 1 reply; 7+ messages in thread
From: Wolfgang Denk @ 2020-04-02 9:19 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid
Dear Adam,
In message <d934f662-9fde-370b-bb4b-b92bd1730c96@websitemanagers.com.au> you wrote:
>
> smartctl -x /dev/sdd
...
> Model Family: Western Digital RE4
> Device Model: WDC WD2003FYYS-02W0B0
> Serial Number: WD-WMAY00922575
...
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 23
> 3 Spin_Up_Time POS--K 253 253 021 - 8583
> 4 Start_Stop_Count -O--CK 100 100 000 - 77
> 5 Reallocated_Sector_Ct PO--CK 184 184 140 - 126
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> 9 Power_On_Hours -O--CK 017 017 000 - 61089
> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 67
> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 48
> 193 Load_Cycle_Count -O--CK 200 200 000 - 28
> 194 Temperature_Celsius -O---K 118 105 000 - 34
> 196 Reallocated_Event_Count -O--CK 095 095 000 - 105
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 197 Current_Pending_Sector -O--CK 200 200 000 - 21
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This disk has a pretty high count of reallocated sectors (126), plus
21 pending sectors and other errors. I recommend replacing it ASAP.
It is not worth further investigation - this drive has reached EOL.
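The check behind the marked rows can be sketched in a few lines of Python. This is an illustration only: it assumes the attribute table is available as text in the column layout quoted above (optionally with a leading `>` quote marker) and it watches just the four attribute IDs highlighted in this message. The name `flag_attributes` is hypothetical.

```python
# Attribute IDs highlighted above: 5, 196, 197 and 200.
WATCHED_IDS = {5, 196, 197, 200}


def flag_attributes(table_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.lstrip("> ").split()
        # Expect: ID, name, flags, value, worst, thresh, fail, raw
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[-1]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the /dev/sdd table quoted above.
sdd_rows = """\
> 5 Reallocated_Sector_Ct PO--CK 184 184 140 - 126
> 196 Reallocated_Event_Count -O--CK 095 095 000 - 105
> 197 Current_Pending_Sector -O--CK 200 200 000 - 21
> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
"""

for name, raw in flag_attributes(sdd_rows).items():
    print(f"{name}: {raw}")
```

The sketch only looks at the raw values. The normalized VALUE and THRESH columns are deliberately ignored here, since none of the rows above has crossed its threshold yet.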
> smartctl -x /dev/sdf
...
> Model Family: Western Digital RE4
> Device Model: WDC WD2003FYYS-02W0B0
> Serial Number: WD-WMAY00611922
...
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
> 3 Spin_Up_Time POS--K 253 253 021 - 7350
> 4 Start_Stop_Count -O--CK 100 100 000 - 73
> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> 9 Power_On_Hours -O--CK 051 051 000 - 36231
> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 64
> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 46
> 193 Load_Cycle_Count -O--CK 200 200 000 - 26
> 194 Temperature_Celsius -O---K 118 094 000 - 34
> 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
> 40 -- 51 0a 00 00 00 2a 25 4e 10 40 00 Error: UNC at LBA =
> 0x2a254e10 = 707087888
...
> 40 -- 51 0a 00 00 00 1e 5a 3e 86 40 00 Error: UNC at LBA =
> 0x1e5a3e86 = 509230726
...
> 40 -- 51 0a 00 00 00 1e 0c 77 d3 40 00 Error: UNC at LBA =
> 0x1e0c77d3 = 504133587
...
> 40 -- 51 0a 00 00 00 1d e4 17 e7 40 00 Error: UNC at LBA =
> 0x1de417e7 = 501487591
...
> 40 -- 51 0a 00 00 00 1d c0 73 99 40 00 Error: UNC at LBA =
> 0x1dc07399 = 499151769
...
> 40 -- 51 0a 00 00 00 1d 23 fc 01 40 00 Error: UNC at LBA =
> 0x1d23fc01 = 488897537
This disk has also logged a number of errors (uncorrectable reads,
UNC), though it does not look as bad as the first one. Still, with
errors like these I would replace it as well.
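The LBAs in the drive's own error log can be cross-checked against the sectors the kernel reported. A small Python sketch of that check follows. It assumes both excerpts are available as text in the formats quoted in this thread; the name `unc_lbas` is made up for illustration.

```python
import re

# "Error: UNC at LBA = 0x... = ..." may be wrapped and quoted with ">" as above.
UNC_LBA = re.compile(r"UNC at LBA =\s*>?\s*(0x[0-9a-fA-F]+) = (\d+)")
KERNEL_SECTOR = re.compile(r"I/O error, dev \w+, sector (\d+)")


def unc_lbas(smart_text):
    """Return the UNC LBAs as integers, checking that hex and decimal agree."""
    lbas = []
    for hex_lba, dec_lba in UNC_LBA.findall(smart_text):
        value = int(hex_lba, 16)
        if value != int(dec_lba):
            raise ValueError(f"{hex_lba} does not equal {dec_lba}")
        lbas.append(value)
    return lbas


# Two entries from the /dev/sdf error log quoted above.
smart_excerpt = """\
> 40 -- 51 0a 00 00 00 2a 25 4e 10 40 00 Error: UNC at LBA =
> 0x2a254e10 = 707087888
> 40 -- 51 0a 00 00 00 1d 23 fc 01 40 00 Error: UNC at LBA =
> 0x1d23fc01 = 488897537
"""

# Two lines from the kernel log in the original report.
kernel_excerpt = """\
[11246048.458962] blk_update_request: I/O error, dev sdf, sector 488897537
[11247262.934558] blk_update_request: I/O error, dev sdf, sector 707087888
"""

kernel_sectors = {int(s) for s in KERNEL_SECTOR.findall(kernel_excerpt)}
common = sorted(set(unc_lbas(smart_excerpt)) & kernel_sectors)
print(common)
```

Both LBAs show up on each side, which suggests the drive and the kernel are describing the same read failures rather than two unrelated problems.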
Best regards,
Wolfgang Denk
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
"A complex system that works is invariably found to have evolved from
a simple system that worked." - John Gall, _Systemantics_
* Re: RAID Issues - RAID10 working but with errors
2020-04-02 9:19 ` Wolfgang Denk
@ 2020-04-02 11:20 ` Phil Turmel
2020-04-02 13:31 ` Adam Goryachev
0 siblings, 1 reply; 7+ messages in thread
From: Phil Turmel @ 2020-04-02 11:20 UTC (permalink / raw)
To: Wolfgang Denk, Adam Goryachev; +Cc: linux-raid
On 4/2/20 5:19 AM, Wolfgang Denk wrote:
> Dear Adam,
>
> In message <d934f662-9fde-370b-bb4b-b92bd1730c96@websitemanagers.com.au> you wrote:
>>
>> smartctl -x /dev/sdd
> ...
>> Model Family: Western Digital RE4
>> Device Model: WDC WD2003FYYS-02W0B0
>> Serial Number: WD-WMAY00922575
> ...
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
>> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 23
>> 3 Spin_Up_Time POS--K 253 253 021 - 8583
>> 4 Start_Stop_Count -O--CK 100 100 000 - 77
>> 5 Reallocated_Sector_Ct PO--CK 184 184 140 - 126
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
>> 9 Power_On_Hours -O--CK 017 017 000 - 61089
>> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
>> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
>> 12 Power_Cycle_Count -O--CK 100 100 000 - 67
>> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 48
>> 193 Load_Cycle_Count -O--CK 200 200 000 - 28
>> 194 Temperature_Celsius -O---K 118 105 000 - 34
>> 196 Reallocated_Event_Count -O--CK 095 095 000 - 105
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 197 Current_Pending_Sector -O--CK 200 200 000 - 21
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
>> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
>> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This disk has a pretty high count of reallocated sectors, plus a lot
> of other errors. I recommend to replace it ASAP. It is not worth
Concur. Old and worn out. Personally, I replace when reallocations are
in the 10 to 20 range. Once you get past that, they seem to start
coming much faster.
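That rule of thumb is easy to script. A minimal Python sketch, assuming the whitespace-separated attribute rows shown in this thread (ID, name, flags, value, worst, thresh, fail, raw value); the limit of 10 is just the lower end of my personal range, not a vendor threshold, and the function names are hypothetical.

```python
REALLOC_LIMIT = 10  # assumption: lower end of the "10 to 20" range above


def parse_attributes(text: str) -> dict:
    """Return {attribute_name: raw_value} from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.lstrip("> ").split()
        # Expect: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        if len(fields) >= 8 and fields[0].isdigit() and fields[7].isdigit():
            attrs[fields[1]] = int(fields[7])
    return attrs


def should_replace(attrs: dict, limit: int = REALLOC_LIMIT) -> bool:
    """Flag a drive once its reallocated sector count reaches the limit."""
    return attrs.get("Reallocated_Sector_Ct", 0) >= limit


SDD = """\
>> 5 Reallocated_Sector_Ct PO--CK 184 184 140 - 126
>> 197 Current_Pending_Sector -O--CK 200 200 000 - 21
"""
SDF = """\
>> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
>> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
"""

print(should_replace(parse_attributes(SDD)))  # True
print(should_replace(parse_attributes(SDF)))  # False
```

The sample rows are copied from the sdd and sdf reports quoted in this message.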
>> smartctl -x /dev/sdf
> ...
>> Model Family: Western Digital RE4
>> Device Model: WDC WD2003FYYS-02W0B0
>> Serial Number: WD-WMAY00611922
> ...
>> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
>> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
>> 3 Spin_Up_Time POS--K 253 253 021 - 7350
>> 4 Start_Stop_Count -O--CK 100 100 000 - 73
>> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
>> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
>> 9 Power_On_Hours -O--CK 051 051 000 - 36231
>> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
>> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
>> 12 Power_Cycle_Count -O--CK 100 100 000 - 64
>> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 46
>> 193 Load_Cycle_Count -O--CK 200 200 000 - 26
>> 194 Temperature_Celsius -O---K 118 094 000 - 34
>> 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
>> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
>> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
>> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
>> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ...
>> 40 -- 51 0a 00 00 00 2a 25 4e 10 40 00 Error: UNC at LBA = 0x2a254e10 = 707087888
> ...
>> 40 -- 51 0a 00 00 00 1e 5a 3e 86 40 00 Error: UNC at LBA = 0x1e5a3e86 = 509230726
> ...
>> 40 -- 51 0a 00 00 00 1e 0c 77 d3 40 00 Error: UNC at LBA = 0x1e0c77d3 = 504133587
> ...
>> 40 -- 51 0a 00 00 00 1d e4 17 e7 40 00 Error: UNC at LBA = 0x1de417e7 = 501487591
> ...
>> 40 -- 51 0a 00 00 00 1d c0 73 99 40 00 Error: UNC at LBA = 0x1dc07399 = 499151769
> ...
>> 40 -- 51 0a 00 00 00 1d 23 fc 01 40 00 Error: UNC at LBA = 0x1d23fc01 = 488897537
>
> This disk also has stored a number of errors, but it does not look
> as bad as the first one. However, there are errors. I would
> replace it as well.
Disagree. No reallocations by the drive, just bad block log entries.
This drive is fine. The bad block log mis-feature should be turned off
after failing this drive, zeroing its superblock, and adding back to the
array (so the bad blocks get reconstructed).
The bad block log mis-feature should never have been merged in its
current form--it simply prevents redundancy from ever working on problem
sectors, and cannot distinguish correctable communications problems from
true underlying uncorrectable sectors. Which should be left to the
drive, at least until it runs out of spare sectors. (And why would you
keep such a drive anyways?) Bad block logging in MD raid is *Dangerous*
*Junk*.
> Best regards,
>
> Wolfgang Denk
Regards,
Phil
* Re: RAID Issues - RAID10 working but with errors
2020-04-02 11:20 ` Phil Turmel
@ 2020-04-02 13:31 ` Adam Goryachev
2020-04-02 13:52 ` Phil Turmel
0 siblings, 1 reply; 7+ messages in thread
From: Adam Goryachev @ 2020-04-02 13:31 UTC (permalink / raw)
To: Phil Turmel, Wolfgang Denk; +Cc: linux-raid
On 2/4/20 22:20, Phil Turmel wrote:
> On 4/2/20 5:19 AM, Wolfgang Denk wrote:
>> Dear Adam,
>>
>> In message
>> <d934f662-9fde-370b-bb4b-b92bd1730c96@websitemanagers.com.au> you wrote:
>>>
>>> smartctl -x /dev/sdd
>> ...
>>> Model Family: Western Digital RE4
>>> Device Model: WDC WD2003FYYS-02W0B0
>>> Serial Number: WD-WMAY00922575
>> ...
>>> SMART Attributes Data Structure revision number: 16
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
>>> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 23
>>> 3 Spin_Up_Time POS--K 253 253 021 - 8583
>>> 4 Start_Stop_Count -O--CK 100 100 000 - 77
>>> 5 Reallocated_Sector_Ct PO--CK 184 184 140 - 126
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
>>> 9 Power_On_Hours -O--CK 017 017 000 - 61089
>>> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
>>> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
>>> 12 Power_Cycle_Count -O--CK 100 100 000 - 67
>>> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 48
>>> 193 Load_Cycle_Count -O--CK 200 200 000 - 28
>>> 194 Temperature_Celsius -O---K 118 105 000 - 34
>>> 196 Reallocated_Event_Count -O--CK 095 095 000 - 105
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> 197 Current_Pending_Sector -O--CK 200 200 000 - 21
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
>>> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
>>> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> This disk has a pretty high count of reallocated sectors, plus a lot
>> of other errors. I recommend to replace it ASAP. It is not worth
>
> Concur. Old and worn out. Personally, I replace when reallocations
> are in the 10 to 20 range. Once you get past that, they seem to start
> coming much faster.
>
Thank you, I'll check if the drive can be replaced by warranty, or else
check if I have a spare. Otherwise, I may be forced to buy a replacement.
>
>>> smartctl -x /dev/sdf
>> ...
>>> Model Family: Western Digital RE4
>>> Device Model: WDC WD2003FYYS-02W0B0
>>> Serial Number: WD-WMAY00611922
>> ...
>>> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
>>> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
>>> 3 Spin_Up_Time POS--K 253 253 021 - 7350
>>> 4 Start_Stop_Count -O--CK 100 100 000 - 73
>>> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
>>> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
>>> 9 Power_On_Hours -O--CK 051 051 000 - 36231
>>> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
>>> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
>>> 12 Power_Cycle_Count -O--CK 100 100 000 - 64
>>> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 46
>>> 193 Load_Cycle_Count -O--CK 200 200 000 - 26
>>> 194 Temperature_Celsius -O---K 118 094 000 - 34
>>> 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
>>> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
>>> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
>>> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
>>> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> ...
>>> 40 -- 51 0a 00 00 00 2a 25 4e 10 40 00 Error: UNC at LBA = 0x2a254e10 = 707087888
>> ...
>>> 40 -- 51 0a 00 00 00 1e 5a 3e 86 40 00 Error: UNC at LBA = 0x1e5a3e86 = 509230726
>> ...
>>> 40 -- 51 0a 00 00 00 1e 0c 77 d3 40 00 Error: UNC at LBA = 0x1e0c77d3 = 504133587
>> ...
>>> 40 -- 51 0a 00 00 00 1d e4 17 e7 40 00 Error: UNC at LBA = 0x1de417e7 = 501487591
>> ...
>>> 40 -- 51 0a 00 00 00 1d c0 73 99 40 00 Error: UNC at LBA = 0x1dc07399 = 499151769
>> ...
>>> 40 -- 51 0a 00 00 00 1d 23 fc 01 40 00 Error: UNC at LBA = 0x1d23fc01 = 488897537
>>
>> This disk also has stored a number of errors, but it does not look
>> as bad as the first one. However, there are errors. I would
>> replace it as well.
>
> Disagree. No reallocations by the drive, just bad block log entries.
> This drive is fine. The bad block log mis-feature should be turned
> off after failing this drive, zeroing its superblock, and adding back
> to the array (so the bad blocks get reconstructed).
>
> The bad block log mis-feature should never have been merged in its
> current form--it simply prevents redundancy from ever working on
> problem sectors, and cannot distinguish correctable communications
> problems from true underlying uncorrectable sectors. Which should be
> left to the drive, at least until it runs out of spare sectors. (And
> why would you keep such a drive anyways?) Bad block logging in MD raid
> is *Dangerous* *Junk*.
So I have a "spare" drive in the array, what steps should I take to
"fix" this? Here are the statistics on the spare drive. Maybe it is just
as bad as the other two anyway, and I should replace all three?
If I can, I assume I would run some commands on the spare to configure
it to not have any BBL, then add it back to the array, use it to replace
the existing bad drive?
I assume that the two "good" drives, together with the mdadm output,
suggest that I have a single "good" copy of all the data on the two good
drives:
Number Major Minor RaidDevice State
0 8 18 0 active sync set-A /dev/sdb2
1 8 34 1 active sync set-B /dev/sdc2
2 8 50 2 active sync set-A /dev/sdd2
4 8 82 3 active sync set-B /dev/sdf2
3 8 66 - spare /dev/sde2
(Good drives assumed to be sdb and sdc)
Equally, all data is real-time synced to another machine (DRBD), as well
as being backed up regularly, so I'm not super concerned about the data
content, but I do want to maximise uptime, and minimise risk to the data
as it really is rather important (understatement...).
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4
Device Model: WDC WD2003FYYS-02W0B0
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 253 253 021 - 7391
4 Start_Stop_Count -O--CK 100 100 000 - 72
5 Reallocated_Sector_Ct PO--CK 181 181 140 - 151
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 009 008 000 - 66691
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 62
192 Power-Off_Retract_Count -O--CK 200 200 000 - 47
193 Load_Cycle_Count -O--CK 200 200 000 - 24
194 Temperature_Celsius -O---K 116 103 000 - 36
196 Reallocated_Event_Count -O--CK 059 059 000 - 141
197 Current_Pending_Sector -O--CK 200 200 000 - 3
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
[......]
Error 646 [21] occurred at disk power-on lifetime: 1152 hours (48 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 02 c3 87 67 40 00 Error: UNC at LBA = 0x02c38767 = 46368615
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 a8 00 00 02 c3 82 00 40 08 31d+11:36:50.175 READ FPDMA QUEUED
60 0a 00 00 a0 00 00 02 c3 78 00 40 08 31d+11:36:49.708 READ FPDMA QUEUED
60 0a 00 00 98 00 00 02 c3 6e 00 40 08 31d+11:36:49.574 READ FPDMA QUEUED
60 0a 00 00 90 00 00 02 c3 64 00 40 08 31d+11:36:49.149 READ FPDMA QUEUED
60 0a 00 00 88 00 00 02 c3 5a 00 40 08 31d+11:36:49.141 READ FPDMA QUEUED
Error 645 [20] occurred at disk power-on lifetime: 890 hours (37 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 66 7c ad fb 40 00 Error: UNC at LBA = 0x667cadfb = 1719447035
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 58 00 00 66 7c ac 80 40 08 20d+13:07:16.625 READ FPDMA QUEUED
60 0a 00 00 50 00 00 66 7c a2 80 40 08 20d+13:07:16.603 READ FPDMA QUEUED
ea 00 00 00 00 00 00 00 00 00 00 e0 08 20d+13:07:16.474 FLUSH CACHE EXT
61 00 07 00 40 00 00 02 54 38 18 40 08 20d+13:07:16.474 WRITE FPDMA QUEUED
ea 00 00 00 00 00 00 00 00 00 00 e0 08 20d+13:07:16.462 FLUSH CACHE EXT
Error 644 [19] occurred at disk power-on lifetime: 890 hours (37 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 66 54 17 ee 40 00 Error: UNC at LBA = 0x665417ee = 1716787182
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 98 00 00 66 54 15 80 40 08 20d+13:06:57.140 READ FPDMA QUEUED
61 07 00 00 90 00 00 66 54 0e 80 40 08 20d+13:06:57.140 WRITE FPDMA QUEUED
61 00 08 00 88 00 00 02 54 38 10 40 08 20d+13:06:57.140 WRITE FPDMA QUEUED
ea 00 00 00 00 00 00 00 00 00 00 e0 08 20d+13:06:56.685 FLUSH CACHE EXT
ef 00 10 00 02 00 00 00 00 00 00 a0 08 20d+13:06:56.684 SET FEATURES [Enable SATA feature]
Error 643 [18] occurred at disk power-on lifetime: 890 hours (37 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 66 54 0e dc 40 00 Error: UNC at LBA = 0x66540edc = 1716784860
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 e0 00 00 66 54 0b 80 40 08 20d+13:06:54.853 READ FPDMA QUEUED
60 08 00 00 d8 00 00 66 54 03 80 40 08 20d+13:06:54.845 READ FPDMA QUEUED
60 08 00 00 d0 00 00 66 53 fb 80 40 08 20d+13:06:54.837 READ FPDMA QUEUED
60 08 00 00 c8 00 00 66 53 f3 80 40 08 20d+13:06:54.829 READ FPDMA QUEUED
60 09 00 00 c0 00 00 66 53 ea 80 40 08 20d+13:06:54.820 READ FPDMA QUEUED
Error 642 [17] occurred at disk power-on lifetime: 890 hours (37 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 66 53 a2 fd 40 00 Error: UNC at LBA = 0x6653a2fd = 1716757245
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 a0 00 00 66 53 9b 80 40 08 20d+13:06:52.760 READ FPDMA QUEUED
60 09 80 00 98 00 00 66 53 92 00 40 08 20d+13:06:52.751 READ FPDMA QUEUED
60 0a 00 00 90 00 00 66 53 88 00 40 08 20d+13:06:52.740 READ FPDMA QUEUED
60 0a 00 00 88 00 00 66 53 7e 00 40 08 20d+13:06:52.730 READ FPDMA QUEUED
60 09 80 00 80 00 00 66 53 74 80 40 08 20d+13:06:52.721 READ FPDMA QUEUED
Error 641 [16] occurred at disk power-on lifetime: 889 hours (37 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 3f f5 31 61 40 00 Error: UNC at LBA = 0x3ff53161 = 1073033569
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 40 00 00 3f f5 2d 00 40 08 20d+11:59:33.169 READ FPDMA QUEUED
60 0a 00 00 38 00 00 3f f5 23 00 40 08 20d+11:59:33.159 READ FPDMA QUEUED
60 08 80 00 30 00 00 3f f5 1a 80 40 08 20d+11:59:33.151 READ FPDMA QUEUED
60 0a 00 00 28 00 00 3f f5 10 80 40 08 20d+11:59:33.142 READ FPDMA QUEUED
60 0a 00 00 20 00 00 3f f5 06 80 40 08 20d+11:59:33.132 READ FPDMA QUEUED
Error 640 [15] occurred at disk power-on lifetime: 889 hours (37 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 3e ee 42 1a 40 00 Error: UNC at LBA = 0x3eee421a = 1055801882
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 70 00 00 3e ee 41 00 40 08 20d+11:57:50.591 READ FPDMA QUEUED
60 0a 00 00 68 00 00 3e ee 37 00 40 08 20d+11:57:50.565 READ FPDMA QUEUED
60 0a 00 00 60 00 00 3e ee 2d 00 40 08 20d+11:57:50.556 READ FPDMA QUEUED
60 0a 00 00 58 00 00 3e ee 23 00 40 08 20d+11:57:50.547 READ FPDMA QUEUED
60 0a 00 00 50 00 00 3e ee 19 00 40 08 20d+11:57:50.525 READ FPDMA QUEUED
Error 639 [14] occurred at disk power-on lifetime: 889 hours (37 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 0a 00 00 00 3e ec be 4c 40 00 Error: UNC at LBA = 0x3eecbe4c = 1055702604
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 0a 00 00 a8 00 00 3e ec ba 80 40 08 20d+11:57:44.704 READ FPDMA QUEUED
60 08 80 00 a0 00 00 3e ec b2 00 40 08 20d+11:57:44.696 READ FPDMA QUEUED
60 0a 00 00 98 00 00 3e ec a8 00 40 08 20d+11:57:44.671 READ FPDMA QUEUED
ea 00 00 00 00 00 00 00 00 00 00 e0 08 20d+11:57:44.664 FLUSH CACHE EXT
60 0a 00 00 80 00 00 3e ec 9e 00 40 08 20d+11:57:44.659 READ FPDMA QUEUED
Note: parts were omitted; hopefully I've included the relevant/important
parts. The other two drives show 0 errors (as far as I can tell):
sdb:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 73
3 Spin_Up_Time POS--K 253 253 021 - 9008
4 Start_Stop_Count -O--CK 100 100 000 - 78
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 022 022 000 - 57426
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 65
192 Power-Off_Retract_Count -O--CK 200 200 000 - 46
193 Load_Cycle_Count -O--CK 200 200 000 - 31
194 Temperature_Celsius -O---K 105 095 000 - 47
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 6
sdc:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 42
3 Spin_Up_Time POS--K 253 253 021 - 8441
4 Start_Stop_Count -O--CK 100 100 000 - 69
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 010 010 000 - 65784
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 67
192 Power-Off_Retract_Count -O--CK 200 200 000 - 48
193 Load_Cycle_Count -O--CK 200 200 000 - 20
194 Temperature_Celsius -O---K 120 105 000 - 32
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
>
>> Best regards,
>>
>> Wolfgang Denk
>
> Regards,
>
> Phil
>
* Re: RAID Issues - RAID10 working but with errors
2020-04-02 13:31 ` Adam Goryachev
@ 2020-04-02 13:52 ` Phil Turmel
0 siblings, 0 replies; 7+ messages in thread
From: Phil Turmel @ 2020-04-02 13:52 UTC (permalink / raw)
To: Adam Goryachev, Wolfgang Denk; +Cc: linux-raid
On 4/2/20 9:31 AM, Adam Goryachev wrote:
>
> On 2/4/20 22:20, Phil Turmel wrote:
>> Concur. Old and worn out. Personally, I replace when reallocations
>> are in the 10 to 20 range. Once you get past that, they seem to start
>> coming much faster.
>>
> Thank you, I'll check if the drive can be replaced by warranty, or else
> check if I have a spare. Otherwise, I may be forced to buy a replacement.
I'll be astonished if you can get a warranty replacement for a drive
that has 60 *thousand* hours of uptime.
> So I have a "spare" drive in the array, what steps should I take to
> "fix" this? Here are the statistics on the spare drive. Maybe it is just
> as bad as the other two anyway, and I should replace all three?
>
> If I can, I assume I would run some commands on the spare to configure
> it to not have any BBL, then add it back to the array, use it to replace
> the existing bad drive?
Use the --replace operation of modern mdadm/kernel to get that failing
drive out right away. It appears you won't be able to remove the bad
block misfeature until all devices in the array have an empty log.
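A minimal sketch of what that looks like, written as a Python dry run that only prints the commands. /dev/md1 is the array from your kernel log and /dev/sdd2 the failing member from your mdadm output; /dev/sdg2 is a hypothetical partition on a new drive. The --with target has to be a spare in the array already, hence the --add first.

```python
import shlex


def replace_plan(array: str, failing: str, replacement: str) -> list:
    """Return the mdadm invocations for a hot replace, as argv lists.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        # The replacement must be a spare before --with can use it.
        ["mdadm", array, "--add", replacement],
        # Copy onto the spare while the old member stays in service.
        ["mdadm", array, "--replace", failing, "--with", replacement],
    ]


# /dev/sdg2 is an assumed name for the new drive's partition.
plan = replace_plan("/dev/md1", "/dev/sdd2", "/dev/sdg2")
for argv in plan:
    print(shlex.join(argv))
```

Watch /proc/mdstat while the copy runs; the old member is only marked faulty once the replacement is fully in sync.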
> Equally, all data is real-time synced to another machine (DRBD), as well
> as being backed up regularly, so I'm not super concerned about the data
> content, but I do want to maximise uptime, and minimise risk to the data
> as it really is rather important (understatement...).
Understood. --replace is your friend.
> smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital RE4
> Device Model: WDC WD2003FYYS-02W0B0
>
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
> 3 Spin_Up_Time POS--K 253 253 021 - 7391
> 4 Start_Stop_Count -O--CK 100 100 000 - 72
> 5 Reallocated_Sector_Ct PO--CK 181 181 140 - 151
> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> 9 Power_On_Hours -O--CK 009 008 000 - 66691
> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 62
> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 47
> 193 Load_Cycle_Count -O--CK 200 200 000 - 24
> 194 Temperature_Celsius -O---K 116 103 000 - 36
> 196 Reallocated_Event_Count -O--CK 059 059 000 - 141
> 197 Current_Pending_Sector -O--CK 200 200 000 - 3
> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
> ||||||_ K auto-keep
> |||||__ C event count
> ||||___ R error rate
> |||____ S speed/performance
> ||_____ O updated online
> |______ P prefailure warning
Bleh. Replace this one, too.
> sdb:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 73
> 3 Spin_Up_Time POS--K 253 253 021 - 9008
> 4 Start_Stop_Count -O--CK 100 100 000 - 78
> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> 9 Power_On_Hours -O--CK 022 022 000 - 57426
> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 65
> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 46
> 193 Load_Cycle_Count -O--CK 200 200 000 - 31
> 194 Temperature_Celsius -O---K 105 095 000 - 47
> 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 6
>
> sdc:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 42
> 3 Spin_Up_Time POS--K 253 253 021 - 8441
> 4 Start_Stop_Count -O--CK 100 100 000 - 69
> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> 9 Power_On_Hours -O--CK 010 010 000 - 65784
> 10 Spin_Retry_Count -O--CK 100 253 000 - 0
> 11 Calibration_Retry_Count -O--CK 100 253 000 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 67
> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 48
> 193 Load_Cycle_Count -O--CK 200 200 000 - 20
> 194 Temperature_Celsius -O---K 120 105 000 - 32
> 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 2
These two are in astonishingly good condition for their age.
When you've replaced the two bad drives and returned to having a hot
spare, use --replace again on any drives that still have entries in
their bad block logs. The freed-up drive can then have its superblock
zeroed and be added back as the spare. Rinse and repeat.
All of the above can be done on the fly, assuming you have hot-swap bays
for new drives.
When all drives are good, with empty bad block lists, stop the array and
immediately re-assemble with --update=no-bbl.
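Roughly, as a Python dry run that only builds the command lines. The device names in the example are assumptions (sdg2 and sdh2 stand in for whatever the new drives end up being called), and anything stacked on top of md1 (DRBD, filesystems) has to be stopped before the array can be.

```python
import shlex


def recycle_as_spare(array: str, device: str) -> list:
    """Commands to turn a replaced-out member back into a clean spare."""
    return [
        ["mdadm", array, "--remove", device],     # drop the old, now-faulty slot
        ["mdadm", "--zero-superblock", device],   # discards its bad block log too
        ["mdadm", array, "--add", device],        # returns as a fresh spare
    ]


def no_bbl_reassemble(array: str, members: list) -> list:
    """Commands to check each log by hand, then reassemble without a BBL."""
    plan = [["mdadm", "--examine-badblocks", m] for m in members]
    plan.append(["mdadm", "--stop", array])
    plan.append(["mdadm", "--assemble", array, "--update=no-bbl", *members])
    return plan


# Assumed final membership; only sdb2 and sdc2 come from the report above.
members = ["/dev/sdb2", "/dev/sdc2", "/dev/sdg2", "/dev/sdh2"]
for argv in recycle_as_spare("/dev/md1", "/dev/sdf2") + no_bbl_reassemble("/dev/md1", members):
    print(shlex.join(argv))
```

The --examine-badblocks lines are there to be read by a human: only go on to the stop and assemble steps if every member reports an empty list.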
Phil
Thread overview: 7+ messages
2020-04-02 2:28 RAID Issues - RAID10 working but with errors Adam Goryachev
2020-04-02 8:49 ` Reindl Harald
2020-04-02 14:26 ` John Stoffel
2020-04-02 9:19 ` Wolfgang Denk
2020-04-02 11:20 ` Phil Turmel
2020-04-02 13:31 ` Adam Goryachev
2020-04-02 13:52 ` Phil Turmel