* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-07 21:58 Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-07 21:58 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil,
After further checking, I found there was no problem with the swap partition.
Cheers,
Gavin
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
--- On Thu, 7/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: neilb@suse.de
> Cc: linux-raid@vger.kernel.org
> Date: Thursday, 7 April, 2011, 18:07
[...]
> Somewhere along the way, I seemed to have lost my swap
> partition!
[...]
* RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-08 1:32 Gavin Flower
2011-04-08 9:34 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 1:32 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil,
My original email may have been eaten: it did not appear on the list, nor did I get an error message back. So perhaps there was a problem with the attached files.
I will resend the attachments one at a time in separate emails.
Cheers,
Gavin
[begin original]
Hi Neil,
Your help (or anybody else's) would be greatly appreciated, yet again!
This morning, I noticed my system was extremely unresponsive, and that there were clicking sounds coming from one of my 5 hard drives. There was also excessive disk I/O even for trivial things like bringing up a directory window, and lots of ata3 errors were being reported to the system log. These symptoms occurred mostly during a RAID check process.
Somewhere along the way, I seemed to have lost my swap partition!
So I did some extensive investigations, which took most of the day. My notes were created in OpenDocument format using LibreOffice, but I have converted them to txt format for inclusion here; I can supply the .odt file if requested.
I have included 2 files:
my notes: raid-notes-20110407a.txt
selected log entries: messages-gcf-20110407-ATA
If there are some additional diagnostics that might prove useful, please let me know.
Cheers,
Gavin
[end original]
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-08 1:34 Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 1:34 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 446 bytes --]
my notes: raid-notes-20110407a.txt
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
--- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: neilb@suse.de
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 13:32
[...]
[-- Attachment #2: raid-notes-20110407a.txt --]
[-- Type: text/plain, Size: 22633 bytes --]
Note that the check on md1 took almost 2 hours!
# grep md1 /var/log/messages
Apr 4 08:25:38 saturn kernel: [ 3.203058] md: md1 stopped.
Apr 4 08:25:38 saturn kernel: [ 3.221821] md/raid:md1: device sda2 operational as raid disk 0
Apr 4 08:25:38 saturn kernel: [ 3.223099] md/raid:md1: device sdc2 operational as raid disk 4
Apr 4 08:25:38 saturn kernel: [ 3.224364] md/raid:md1: device sdd2 operational as raid disk 3
Apr 4 08:25:38 saturn kernel: [ 3.225589] md/raid:md1: device sde2 operational as raid disk 2
Apr 4 08:25:38 saturn kernel: [ 3.226806] md/raid:md1: device sdb2 operational as raid disk 1
Apr 4 08:25:38 saturn kernel: [ 3.229256] md/raid:md1: allocated 5334kB
Apr 4 08:25:38 saturn kernel: [ 3.230500] md/raid:md1: raid level 6 active with 5 out of 5 devices, algorithm 2
Apr 4 08:25:38 saturn kernel: [ 3.232503] md1: detected capacity change from 0 to 314571227136
Apr 4 08:25:38 saturn kernel: [ 3.234559] dracut: mdadm: /dev/md1 has been started with 5 drives.
Apr 4 08:25:38 saturn kernel: [ 3.236257] md1: detected capacity change from 0 to 314571227136
Apr 4 08:25:38 saturn kernel: [ 3.237425] md1: unknown partition table
Apr 4 08:25:38 saturn kernel: [ 9.892068] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: (null)
Apr 5 07:05:28 saturn kernel: [65356.926079] Modules linked in: tcp_lp powernow_k8 freq_table mperf fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_TCPMSS ipt_LOG xt_limit bridge stp llc rmd160 crypto_null camellia lzo lzo_compress cast6 cast5 deflate zlib_deflate cts ctr gcm ccm serpent blowfish twofish_x86_64 twofish_common ecb xcbc cbc sha256_generic sha512_generic des_generic cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key bluetooth rfkill nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_amd kvm usblp r8169 edac_core atl1e uvcvideo mii snd_hda_codec_atihdmi edac_mce_amd shpchp videodev v4l2_compat_ioctl32 asus_atk0110 serio_raw snd_hda_codec_via snd_usb_audio snd_usbmidi_lib joydev snd_hda_intel i2c_piix4 k10temp snd_hda_codec snd_hw
Apr 7 07:54:01 saturn kernel: [207546.188800] md: data-check of RAID array md1
Apr 7 07:54:01 saturn kernel: [207546.188868] md: delaying data-check of md0 until md1 has finished (they share one or more physical units)
Apr 7 07:54:01 saturn kernel: [207546.190517] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
Apr 7 07:54:01 saturn kernel: [207546.190523] md: delaying data-check of md0 until md1 has finished (they share one or more physical units)
Apr 7 08:42:08 saturn kernel: [210414.109856] md/raid:md1: read error corrected (8 sectors at 17195800 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109869] md/raid:md1: read error corrected (8 sectors at 17195808 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109872] md/raid:md1: read error corrected (8 sectors at 17195816 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109875] md/raid:md1: read error corrected (8 sectors at 17195824 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109877] md/raid:md1: read error corrected (8 sectors at 17195832 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109880] md/raid:md1: read error corrected (8 sectors at 17195840 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109883] md/raid:md1: read error corrected (8 sectors at 17195848 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109891] md/raid:md1: read error corrected (8 sectors at 17195856 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109894] md/raid:md1: read error corrected (8 sectors at 17195864 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109897] md/raid:md1: read error corrected (8 sectors at 17195872 on sdc2)
Apr 7 08:54:39 saturn kernel: [211161.824066] md/raid:md1: read error corrected (8 sectors at 137014528 on sdc2)
Apr 7 09:51:47 saturn kernel: [214581.140560] md: md1: data-check done.
#
# date ; cat /proc/mdstat
Thu Apr 7 10:31:24 NZST 2011
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sda4[0] sdc4[6] sdd4[3] sdb4[5] sde4[1]
1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
[==========>..........] check = 54.1% (201068416/371581952) finish=32.6min speed=87129K/sec
bitmap: 2/3 pages [8KB], 65536KB chunk
md1 : active raid6 sda2[0] sdc2[4] sdd2[3] sde2[2] sdb2[1]
307198464 blocks level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[2] sde3[1]
10751808 blocks level 6, 64k chunk, algorithm 2 [5/5] [UUUUU]
unused devices: <none>
#
From root@localhost6.localdomain6 Thu Apr 7 11:12:41 2011
Return-Path: <root@localhost6.localdomain6>
Date: Thu, 7 Apr 2011 11:12:40 +1200
From: Anacron <root@localhost6.localdomain6>
To: root@localhost6.localdomain6
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.weekly' on saturn
Status: R
/etc/cron.weekly/99-raid-check:
WARNING: mismatch_cnt is not 0 on /dev/md2
WARNING: mismatch_cnt is not 0 on /dev/md0
# cat /sys/block/md0/md/mismatch_cnt
128
# cat /sys/block/md1/md/mismatch_cnt
0
# cat /sys/block/md2/md/mismatch_cnt
28904
#
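A quick way to pull all three counts at once (a one-line sketch, assuming the arrays are md0, md1 and md2 as above):
# for f in /sys/block/md[0-2]/md/mismatch_cnt; do echo "$f: $(cat $f)"; done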
# e2fsck -f -n /dev/md2
e2fsck 1.41.12 (17-May-2010)
Warning! /dev/md2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix? no
Inode 20186332 was part of the orphaned inode list. IGNORED.
Inode 20317506 was part of the orphaned inode list. IGNORED.
Inode 20317552 was part of the orphaned inode list. IGNORED.
Inode 20317955 was part of the orphaned inode list. IGNORED.
Inode 20447237 was part of the orphaned inode list. IGNORED.
Inode 20447245 was part of the orphaned inode list. IGNORED.
Inode 20447287 was part of the orphaned inode list. IGNORED.
Inode 20447296 was part of the orphaned inode list. IGNORED.
Inode 20447302 was part of the orphaned inode list. IGNORED.
Inode 20447311 was part of the orphaned inode list. IGNORED.
Inode 20447353 was part of the orphaned inode list. IGNORED.
Inode 20447360 was part of the orphaned inode list. IGNORED.
Inode 21500787 was part of the orphaned inode list. IGNORED.
Inode 21628913 was part of the orphaned inode list. IGNORED.
Inode 22158808 was part of the orphaned inode list. IGNORED.
Inode 22158811 was part of the orphaned inode list. IGNORED.
Inode 22158840 was part of the orphaned inode list. IGNORED.
Inode 22158842 was part of the orphaned inode list. IGNORED.
Inode 22158846 was part of the orphaned inode list. IGNORED.
Inode 25952949 was part of the orphaned inode list. IGNORED.
Inode 25953424 was part of the orphaned inode list. IGNORED.
Inode 25954542 was part of the orphaned inode list. IGNORED.
Deleted inode 45088771 has zero dtime. Fix? no
Inode 45088772 was part of the orphaned inode list. IGNORED.
Inode 45088773 was part of the orphaned inode list. IGNORED.
Inode 45088774 was part of the orphaned inode list. IGNORED.
Inode 45088775 was part of the orphaned inode list. IGNORED.
Inode 45088972 was part of the orphaned inode list. IGNORED.
Inode 45089022 was part of the orphaned inode list. IGNORED.
Inode 45089035 was part of the orphaned inode list. IGNORED.
Inode 45089037 was part of the orphaned inode list. IGNORED.
Inode 45089043 was part of the orphaned inode list. IGNORED.
Inode 45089044 was part of the orphaned inode list. IGNORED.
Inode 45089045 was part of the orphaned inode list. IGNORED.
Inode 45089057 was part of the orphaned inode list. IGNORED.
Inode 45089060 was part of the orphaned inode list. IGNORED.
Inode 45089062 was part of the orphaned inode list. IGNORED.
Inode 45089064 was part of the orphaned inode list. IGNORED.
Inode 45089067 was part of the orphaned inode list. IGNORED.
Inode 45089068 was part of the orphaned inode list. IGNORED.
Inode 45089070 was part of the orphaned inode list. IGNORED.
Inode 45089137 was part of the orphaned inode list. IGNORED.
Inode 45089150 was part of the orphaned inode list. IGNORED.
Inode 45089156 was part of the orphaned inode list. IGNORED.
Inode 45089190 was part of the orphaned inode list. IGNORED.
Inode 45089204 was part of the orphaned inode list. IGNORED.
Inode 45089205 was part of the orphaned inode list. IGNORED.
Inode 45089207 was part of the orphaned inode list. IGNORED.
Inode 45089213 was part of the orphaned inode list. IGNORED.
Inode 45089218 was part of the orphaned inode list. IGNORED.
Inode 45089238 was part of the orphaned inode list. IGNORED.
Inode 45089249 was part of the orphaned inode list. IGNORED.
Inode 45089257 was part of the orphaned inode list. IGNORED.
Inode 45089264 was part of the orphaned inode list. IGNORED.
Inode 45089282 was part of the orphaned inode list. IGNORED.
Inode 45089284 was part of the orphaned inode list. IGNORED.
Inode 45089286 was part of the orphaned inode list. IGNORED.
Inode 45089291 was part of the orphaned inode list. IGNORED.
Inode 45089297 was part of the orphaned inode list. IGNORED.
Inode 45089298 was part of the orphaned inode list. IGNORED.
Inode 45089305 was part of the orphaned inode list. IGNORED.
Inode 45089307 was part of the orphaned inode list. IGNORED.
Inode 45089319 was part of the orphaned inode list. IGNORED.
Inode 45089320 was part of the orphaned inode list. IGNORED.
Inode 63705919 was part of the orphaned inode list. IGNORED.
Inode 65938687 was part of the orphaned inode list. IGNORED.
Inode 65939256 was part of the orphaned inode list. IGNORED.
Inode 65939355 was part of the orphaned inode list. IGNORED.
Inode 65939368 was part of the orphaned inode list. IGNORED.
Inode 66191686 was part of the orphaned inode list. IGNORED.
Inode 66191689 was part of the orphaned inode list. IGNORED.
Inode 66191738 was part of the orphaned inode list. IGNORED.
Inode 66191741 was part of the orphaned inode list. IGNORED.
Inode 66191747 was part of the orphaned inode list. IGNORED.
Inode 66197970 was part of the orphaned inode list. IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(2393344--2393372) -(2393792--2393809) -(2470272--2470336) -(2502016--2502080) +(7831552--7841252) +(7841792--7864319) -(79795252--79795253) -(79823488--79823615) -(79824000--79824123) -(79824640--79825142) -79826344 -79898014 -79923101 -(79923154--79923165) -(80123296--80123311) -(80152298--80152301) -80291729 -80291732 -80291759 -80847380 -(80847438--80847441) -80847502 -80847555 -80847736 -80874645 -80874664 -80875873 -(80875914--80875920) -80875927 -80875960 -(80876002--80876004) -80876048 -(80876052--80876056) -(80876600--80876601) -(80876639--80876641) -81330516 -(81334527--81334528) -81334535 -(81821915--81821947) -(81822170--81822204) -(81894559--81894562) -81923317 -81925743 -81925934 -(81925951--81925952) -(81926003--81926004) -(81956735--81957638) -(82971732--82971733) -(82971902--82971903) -(82971917--82971918) -(82971947--82971948) -(82971972--82971991) -85992203 -86516481 -87626360 -88613273 -104083592 -(104083946--104083948) -104083957 -104084073 -104084084 -104084487 -104137397 -104138111 -104236430 -(104236580--104236596) -(104236598--104236610) -(104301814--104301815) -(104301822--104301828) -104343080 -(105686863--105686864) -105686916 -(115903040--115903065) +(115903516--115903541) -134259847 -134284245 -134284593 -(134284674--134284675) -134285473 -(170994896--170994901) -170994959 -170995027 -(180397545--180397547) -(255167322--255167805) -(263756512--263756516) -(263764800--263764807) -(263779568--263779592) -(263782498--263782533) -(264798344--264798348) -(264804016--264804023) -(264804064--264804074) -(264804968--264804973) -(264809216--264809359)
Fix? no
Free blocks count wrong for group #239 (539, counted=32768).
Fix? no
Free blocks count wrong for group #2446 (23057, counted=23053).
Fix? no
Free blocks count wrong (256921638, counted=256646017).
Fix? no
Inode bitmap differences: -20186332 -20317506 -20317552 -20317955 -20447237 -20447245 -20447287 -20447296 -20447302 -20447311 -20447353 -20447360 -21500787 -21628913 -22158808 -22158811 -22158840 -22158842 -22158846 -25952949 -25953424 -25954542 -(45088771--45088775) -45088972 -45089022 -45089035 -45089037 -(45089043--45089045) -45089057 -45089060 -45089062 -45089064 -(45089067--45089068) -45089070 -45089137 -45089150 -45089156 -45089190 -(45089204--45089205) -45089207 -45089213 -45089218 -45089238 -45089249 -45089257 -45089264 -45089282 -45089284 -45089286 -45089291 -(45089297--45089298) -45089305 -45089307 -(45089319--45089320) -63705919 -65938687 -65939256 -65939355 -65939368 -66191686 -66191689 -66191738 -66191741 -66191747 -66197970
Fix? no
Directories count wrong for group #2624 (735, counted=734).
Fix? no
Directories count wrong for group #2640 (735, counted=734).
Fix? no
Directories count wrong for group #2704 (541, counted=540).
Fix? no
Free inodes count wrong (68295781, counted=68268234).
Fix? no
/dev/md2: ********** WARNING: Filesystem still has errors **********
/dev/md2: 1377179/69672960 files (0.4% non-contiguous), 21764826/278686464 blocks
#
# e2fsck -f -n /dev/sda4
e2fsck 1.41.12 (17-May-2010)
e2fsck: Device or resource busy while trying to open /dev/sda4
Filesystem mounted or opened exclusively by another program?
# mdadm --detail /dev/md2
/dev/md2:
Version : 1.1
Creation Time : Wed Nov 24 08:27:42 2010
Raid Level : raid6
Array Size : 1114745856 (1063.10 GiB 1141.50 GB)
Used Dev Size : 371581952 (354.37 GiB 380.50 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Apr 7 12:11:59 2011
State : active
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : localhost.localdomain:2
UUID : a511e656:a742a2f2:f4917939:2d333c7e
Events : 38609
Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 8 68 1 active sync /dev/sde4
5 8 20 2 active sync /dev/sdb4
3 8 52 3 active sync /dev/sdd4
6 8 36 4 active sync /dev/sdc4
#
note absence of /dev/md0 (swap)!!!
# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md2 1097254408 70799328 970717788 7% /
tmpfs 4097108 824 4096284 1% /dev/shm
/dev/sda1 1032088 128772 850888 14% /boot
/dev/md1 302377920 72501428 214516572 26% /data
#
# mdadm -Evs
ARRAY /dev/md1 level=raid6 num-devices=5 UUID=6f1176ae:a0ad6cac:bfe78010:bc810f04
devices=/dev/sde2,/dev/sdc2,/dev/sdd2,/dev/sdb2,/dev/sda2
ARRAY /dev/md0 level=raid6 num-devices=5 UUID=3b76ac20:8253f696:bfe78010:bc810f04
devices=/dev/sde3,/dev/sdc3,/dev/sdd3,/dev/sdb3,/dev/sda3
ARRAY /dev/md/2 level=raid6 metadata=1.1 num-devices=5 UUID=a511e656:a742a2f2:f4917939:2d333c7e name=localhost.localdomain:2
devices=/dev/sde4,/dev/sdc4,/dev/sdd4,/dev/sdb4,/dev/sda4
#
# fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0000ca3a
Device Boot Start End Blocks Id System
/dev/sda1 * 63 2097214 1048576 83 Linux
/dev/sda2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sda3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sda4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000566c1
Device Boot Start End Blocks Id System
/dev/sdb1 63 2097214 1048576 83 Linux
/dev/sdb2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sdb3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sdb4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0000af79
Device Boot Start End Blocks Id System
/dev/sdd1 * 63 2097214 1048576 83 Linux
/dev/sdd2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sdd3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sdd4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00081ccd
Device Boot Start End Blocks Id System
/dev/sdc1 * 63 2097214 1048576 83 Linux
/dev/sdc2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sdc3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sdc4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sde: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00081ccd
Device Boot Start End Blocks Id System
/dev/sde1 * 63 2097214 1048576 83 Linux
/dev/sde2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sde3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sde4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/md0: 11.0 GB, 11009851392 bytes
2 heads, 4 sectors/track, 2687952 cylinders, total 21503616 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 196608 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/md1: 314.6 GB, 314571227136 bytes
2 heads, 4 sectors/track, 76799616 cylinders, total 614396928 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
Disk identifier: 0x00000000
Disk /dev/md1 doesn't contain a valid partition table
Disk /dev/md2: 1141.5 GB, 1141499756544 bytes
2 heads, 4 sectors/track, 278686464 cylinders, total 2229491712 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
Disk identifier: 0x00000000
Disk /dev/md2 doesn't contain a valid partition table
#
# dmraid -b
/dev/sde: 976773168 total, "6VM2FE64"
/dev/sdc: 976773168 total, "5VMJ3RJE"
/dev/sdd: 976773168 total, "6VM2AM98"
/dev/sdb: 976773168 total, "6VM2H5W7"
/dev/sda: 976773168 total, "5VM1VNM9"
#
I ran badblocks for each drive concurrently; note that the run for sda took about an hour longer than the others, but it was sdc that reported a bad block.
# badblocks -s -v /dev/sda
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
# badblocks -s -v /dev/sdb
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
# badblocks -s -v /dev/sdc
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): 236817152one, 58:43 elapsed
done
Pass completed, 1 bad blocks found.
# badblocks -s -v /dev/sdd
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
# badblocks -s -v /dev/sde
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
#
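For reference, a sketch of how such concurrent runs could be launched from a single shell (the drive list and log file names here are illustrative, not what I actually typed):
# for d in sda sdb sdc sdd sde; do badblocks -s -v /dev/$d > /root/badblocks-$d.log 2>&1 & done; wait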
Selected lines from the smartctl output:
# smartctl -a /dev/sda
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 5VM1VNM9
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 17
# smartctl -a /dev/sdb
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 6VM2H5W7
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 42
# smartctl -a /dev/sdc
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 5VMJ3RJE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
# smartctl -a /dev/sdd
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 6VM2AM98
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
# smartctl -a /dev/sde
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 6VM2FE64
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 79
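The same attribute can be pulled for all five drives in one pass (sketch):
# for d in /dev/sd[a-e]; do echo "== $d"; smartctl -a $d | grep -E 'Serial Number|Reallocated_Sector_Ct'; done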
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-08 2:01 Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 2:01 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil,
Looks like the log file was simply too big.
Here are the initial and ending lines:
Cheers,
Gavin
output of:
grep -i ATA /var/log/messages
Apr 4 00:46:18 saturn kernel: [58150.946089] pata_atiixp 0000:00:14.1: PCI INT A disabled
Apr 4 00:46:19 saturn kernel: [58151.620996] pata_atiixp 0000:00:14.1: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Apr 4 00:46:19 saturn kernel: [58151.776364] ata6.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.776367] ata6.00: ACPI cmd ef/03:46:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.776370] ata6.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Apr 4 00:46:19 saturn kernel: [58151.792475] ata5.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.792478] ata5.00: ACPI cmd ef/03:42:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.792481] ata5.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Apr 4 00:46:19 saturn kernel: [58151.814455] ata5.00: configured for UDMA/33
Apr 4 00:46:19 saturn kernel: [58151.850339] ata6.00: configured for UDMA/100
Apr 4 00:46:19 saturn kernel: [58151.864031] ata3: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864035] ata4: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864038] ata3: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58151.864040] ata4: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58151.864059] ata2: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864061] ata1: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864063] ata2: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58151.864065] ata1: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58152.019042] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.019046] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.019070] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.019079] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.021363] ata3.00: configured for UDMA/133
Apr 4 00:46:19 saturn kernel: [58152.085139] ata4.00: configured for UDMA/133
Apr 4 00:46:19 saturn kernel: [58152.085152] ata1.00: configured for UDMA/133
Apr 4 00:46:19 saturn kernel: [58152.085165] ata2.00: configured for UDMA/133
[...]
Apr 7 14:41:58 saturn kernel: [231943.624749] ata3: hard resetting link
Apr 7 14:42:05 saturn kernel: [231950.625059] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 7 14:42:05 saturn kernel: [231950.635608] ata3.00: configured for UDMA/33
Apr 7 14:42:05 saturn kernel: [231950.635617] ata3: EH complete
Apr 7 14:42:05 saturn kernel: [231950.654531] ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x90a00 action 0xe frozen
Apr 7 14:42:05 saturn kernel: [231950.654535] ata3.00: irq_stat 0x01400000, PHY RDY changed
Apr 7 14:42:05 saturn kernel: [231950.654538] ata3: SError: { Persist HostInt PHYRdyChg 10B8B }
Apr 7 14:42:05 saturn kernel: [231950.654541] ata3.00: failed command: READ FPDMA QUEUED
Apr 7 14:42:05 saturn kernel: [231950.654546] ata3.00: cmd 60/80:00:f0:21:3b/00:00:1c:00:00/40 tag 0 ncq 65536 in
Apr 7 14:42:05 saturn kernel: [231950.654547] res 40/00:00:f0:21:3b/00:00:1c:00:00/40 Emask 0x50 (ATA bus error)
Apr 7 14:42:05 saturn kernel: [231950.654550] ata3.00: status: { DRDY }
Apr 7 14:42:05 saturn kernel: [231950.654554] ata3: hard resetting link
Apr 7 14:42:12 saturn kernel: [231957.654285] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 7 14:42:12 saturn kernel: [231957.666115] ata3.00: configured for UDMA/33
Apr 7 14:42:12 saturn kernel: [231957.666123] ata3: EH complete
Apr 7 14:42:12 saturn kernel: [231957.756013] ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x90a00 action 0xe frozen
Apr 7 14:42:12 saturn kernel: [231957.756016] ata3.00: irq_stat 0x01400000, PHY RDY changed
Apr 7 14:42:12 saturn kernel: [231957.756020] ata3: SError: { Persist HostInt PHYRdyChg 10B8B }
Apr 7 14:42:12 saturn kernel: [231957.756023] ata3.00: failed command: READ FPDMA QUEUED
Apr 7 14:42:12 saturn kernel: [231957.756028] ata3.00: cmd 60/80:00:f0:24:3b/00:00:1c:00:00/40 tag 0 ncq 65536 in
Apr 7 14:42:12 saturn kernel: [231957.756029] res 40/00:00:f0:24:3b/00:00:1c:00:00/40 Emask 0x50 (ATA bus error)
Apr 7 14:42:12 saturn kernel: [231957.756032] ata3.00: status: { DRDY }
Apr 7 14:42:12 saturn kernel: [231957.756037] ata3: hard resetting link
Apr 7 14:42:16 saturn kernel: [231961.389026] ata3: softreset failed (device not ready)
Apr 7 14:42:16 saturn kernel: [231961.389032] ata3: applying SB600 PMP SRST workaround and retrying
Apr 7 14:42:16 saturn kernel: [231961.544030] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 7 14:42:16 saturn kernel: [231961.546323] ata3.00: configured for UDMA/33
Apr 7 14:42:16 saturn kernel: [231961.546331] ata3: EH complete
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
--- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: neilb@suse.de
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 13:32
[...]
> My original email may have been eaten: as it did not appear
> on the list, nor did I get an error message back. So
> perhaps there was a problem with the attached files.
>
> I will resend the attachments one at a time in separate
> emails.
[...]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 1:32 RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive Gavin Flower
@ 2011-04-08 9:34 ` NeilBrown
2011-04-08 9:59 ` Gavin Flower
0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2011-04-08 9:34 UTC (permalink / raw)
To: Gavin Flower; +Cc: linux-raid
On Thu, 7 Apr 2011 18:32:04 -0700 (PDT) Gavin Flower <gavinflower@yahoo.com>
wrote:
> Hi Neil,
>
> My original email may have been eaten: as it did not appear on the list, nor did I get an error message back. So perhaps there was a problem with the attached files.
>
> I will resend the attachments one at a time in separate emails.
>
>
> Cheers,
> Gavin
>
> [begin original]
> Hi Neil,
>
> Your help (or anybody else's) would be greatly appreciated, yet again
Hi Gavin,
it isn't clear to me what help you want.
Obviously there is some sort of hardware issue - possibly a drive, possibly a
bus problem - I really don't know.
Apart from that things look normal.
What exactly did you want explained?
NeilBrown
>
> This morning, I noticed my system was extremely unresponsive, and that there were clicking sounds coming from one of my 5 hard drives. Also that there was excessive disk I/O even for trivial things like bring up a directory window, and lots of ata3 errors being reported to the system log. These symptoms were mostly during a raid check process.
>
> Somewhere along the way, I seemed to have lost my swap partition!
>
> So I did some extensive investigations, which took most of the day. My notes were created in OpenDocument format using LibreOffice, but I have converted them to txt format for the include - but I can supply the ,odt file if requested.
>
> I Have included 2 files:
> my notes: raid-notes-20110407a.txt
> selected log entries: messages-gcf-20110407-ATA
>
> If there are some additional diagnostics that might prove useful, please let me know.
>
>
> Cheers,
> Gavin
> [end original]
> --
> All Adults share the Responsibility
> to help Raise Today's Children,
> for they are Tomorrow's Society!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 9:34 ` NeilBrown
@ 2011-04-08 9:59 ` Gavin Flower
2011-04-08 11:50 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 9:59 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
--- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
> From: NeilBrown <neilb@suse.de>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 21:34
> On Thu, 7 Apr 2011 18:32:04 -0700
> (PDT) Gavin Flower <gavinflower@yahoo.com>
> wrote:
>
> > Hi Neil,
> >
> > My original email may have been eaten: as it did not
> appear on the list, nor did I get an error message
> back. So perhaps there was a problem with the attached
> files.
> >
> > I will resend the attachments one at a time in
> separate emails.
> >
> >
> > Cheers,
> > Gavin
> >
> > [begin original]
> > Hi Neil,
> >
> > Your help (or anybody else's) would be greatly
> appreciated, yet again
>
> Hi Gavin,
> it isn't clear to me what help you want.
>
> Obviously there is some sort of hardware issue - possible a
> drive, possibly a
> bus problem - I really don't know.
>
> Apart from that things look normal.
>
> What exactly did you want explained?
>
> NeilBrown
I guess I was surprised that the RAID system appeared normal and that it did not register any errors. I was hoping to get an idea as to which drive was problematic.
I get the feeling, from your reply, that this is not specifically a RAID problem, that it just happens to affect a RAID array.
I had thought that the RAID system should have been able to give me better diagnostics, but possibly I am being (inadvertently) unreasonable!
Not sure what the significance of this mismatch is, and what I should do about it.
# cat /sys/block/md2/md/mismatch_cnt
28904
#
Thanks,
Gavin
> >
> > This morning, I noticed my system was extremely
> unresponsive, and that there were clicking sounds coming
> from one of my 5 hard drives. Also that there was
> excessive disk I/O even for trivial things like bring up a
> directory window, and lots of ata3 errors being reported to
> the system log. These symptoms were mostly during a
> raid check process.
> >
> > Somewhere along the way, I seemed to have lost my swap
> partition!
> >
> > So I did some extensive investigations, which took
> most of the day. My notes were created in OpenDocument
> format using LibreOffice, but I have converted them to txt
> format for the include - but I can supply the ,odt file if
> requested.
> >
> > I Have included 2 files:
> >
> my notes: raid-notes-20110407a.txt
> > selected log entries:
> messages-gcf-20110407-ATA
> >
> > If there are some additional diagnostics that might
> prove useful, please let me know.
> >
> >
> > Cheers,
> > Gavin
> > [end original]
> > --
> > All Adults share the Responsibility
> > to help Raise Today's Children,
> > for they are Tomorrow's Society!
> > --
> > To unsubscribe from this list: send the line
> "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 9:59 ` Gavin Flower
@ 2011-04-08 11:50 ` NeilBrown
2011-04-11 6:50 ` Gavin Flower
2011-04-12 21:30 ` Gavin Flower
0 siblings, 2 replies; 28+ messages in thread
From: NeilBrown @ 2011-04-08 11:50 UTC (permalink / raw)
To: Gavin Flower; +Cc: linux-raid
On Fri, 8 Apr 2011 02:59:52 -0700 (PDT) Gavin Flower <gavinflower@yahoo.com>
wrote:
>
> --- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
>
> > From: NeilBrown <neilb@suse.de>
> > Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> > To: "Gavin Flower" <gavinflower@yahoo.com>
> > Cc: linux-raid@vger.kernel.org
> > Date: Friday, 8 April, 2011, 21:34
> > On Thu, 7 Apr 2011 18:32:04 -0700
> > (PDT) Gavin Flower <gavinflower@yahoo.com>
> > wrote:
> >
> > > Hi Neil,
> > >
> > > My original email may have been eaten: as it did not
> > appear on the list, nor did I get an error message
> > back. So perhaps there was a problem with the attached
> > files.
> > >
> > > I will resend the attachments one at a time in
> > separate emails.
> > >
> > >
> > > Cheers,
> > > Gavin
> > >
> > > [begin original]
> > > Hi Neil,
> > >
> > > Your help (or anybody else's) would be greatly
> > appreciated, yet again
> >
> > Hi Gavin,
> > it isn't clear to me what help you want.
> >
> > Obviously there is some sort of hardware issue - possible a
> > drive, possibly a
> > bus problem - I really don't know.
> >
> > Apart from that things look normal.
> >
> > What exactly did you want explained?
> >
> > NeilBrown
>
> I guess I was surprised that the RAID system appeared normal and that it did not register any errors. I was hoping to get an idea as to which drive was problematic.
sdc2 was reporting read errors. md/raid6 computed the data from the other
devices and wrote it back to sdc2. This appeared to work so md/raid6 assumed
everything was fine again. It reported this:
Apr 7 08:42:08 saturn kernel: [210414.109880] md/raid:md1: read error corrected (8 sectors at 17195840 on sdc2)
but didn't fail anything.
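If you want to see at a glance which member is taking those hits, the corrected-error lines are easy to pull from the same log you grepped earlier, e.g.
grep 'read error corrected' /var/log/messages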
>
> I get the feeling, from your reply, that this is not specifically a RAID problem, that it just happens to affect a RAID array.
No, it was clearly a disk-drive problem.
e.g.
Apr 7 14:42:12 saturn kernel: [231957.756023] ata3.00: failed command: READ FPDMA QUEUED
a READ command sent to an 'ata' device failed, i.e. a disk error.
>
> I had thought that the RAID system should have been able to give me better diagnostics, but possibly I am being (inadvertently) unreasonable!
Well.... it did tell you that it got a read error and corrected it.
>
> Not sure what the significance of this mismatch is, and what I should do about it.
> # cat /sys/block/md2/md/mismatch_cnt
> 28904
> #
I'm not sure if read errors end up counting as mismatches. They seem to for
raid1. The raid6 code is more complex and I don't feel like decoding it
right now.
In terms of "what to do about it" - the first thing must be to fix sdc.
Maybe there is a loose cable or a broken cable. Maybe the device needs to be
replaced.
Once you have resolved that and are fairly sure your drives are all working,
echo check > /sys/block/md2/md/sync_action
once that finishes mismatch_cnt should ideally be zero. If it isn't, try
echo repair > /sys/block/md2/md/sync_action
but only do that if you are confident that your devices are good.
This will result in the same mismatch_cnt. However a subsequent 'check'
should then show zero.
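Spelled out as a sequence (md2 assumed, and only once you are confident the drives are good):
  echo check > /sys/block/md2/md/sync_action
  cat /proc/mdstat                             # wait for the check to finish
  cat /sys/block/md2/md/mismatch_cnt           # ideally 0
  echo repair > /sys/block/md2/md/sync_action  # only if the count is non-zero
  echo check > /sys/block/md2/md/sync_action   # after the repair completes
  cat /sys/block/md2/md/mismatch_cnt           # should now be 0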
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 11:50 ` NeilBrown
@ 2011-04-11 6:50 ` Gavin Flower
2011-04-12 21:30 ` Gavin Flower
1 sibling, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-11 6:50 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
--- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
> From: NeilBrown <neilb@suse.de>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 23:50
> On Fri, 8 Apr 2011 02:59:52 -0700
> (PDT) Gavin Flower <gavinflower@yahoo.com>
> wrote:
>
> >
> > --- On Fri, 8/4/11, NeilBrown <neilb@suse.de>
> wrote:
> >
> > > From: NeilBrown <neilb@suse.de>
> > > Subject: Re: RAID6 data-check took almost 2
> hours, clicking sounds, system unresponsive
[...]
> > > Obviously there is some sort of hardware issue -
> possible a
> > > drive, possibly a
> > > bus problem - I really don't know.
> > >
> > > Apart from that things look normal.
> > >
> > > What exactly did you want explained?
> > >
> > > NeilBrown
> >
> > I guess I was surprised that the RAID system appeared
> normal and that it did not register any errors. I was
> hoping to get an idea as to which drive was problematic.
>
> sdc2 was reporting read error. md/raid6 computed the
> data from the other
> devices and wrote it back to sdc2. This appeared to
> work so md/raid6 assumed
> everything was fine again. It reported this:
>
> Apr 7 08:42:08 saturn kernel: [210414.109880]
> md/raid:md1: read error corrected (8 sectors at 17195840 on
> sdc2)
>
> but didn't fail anything.
>
>
> >
> > I get the feeling, from your reply, that this is not
> specifically a RAID problem, that it just happens to affect
> a RAID array.
>
> No, it was clearly a disk-drive problem.
> e.g.
> Apr 7 14:42:12 saturn kernel: [231957.756023]
> ata3.00: failed command: READ FPDMA QUEUED
>
> a READ command sent to a n 'ata' device failed. i.e.
> disk error.
>
> >
> > I had thought that the RAID system should have been
> able to give me better diagnostics, but possibly I am being
> (inadvertently) unreasonable!
>
> Well.... it did tell you that it got a read error and
> corrected it.
>
>
> >
> > Not sure what the significance of this mismatch is,
> and what I should do about it.
> > # cat /sys/block/md2/md/mismatch_cnt
> > 28904
> > #
>
> I'm not sure if read errors end up counting as
> mismatches.. They seem to for
> raid1. The raid6 code is more complex and I don't
> feel like decoding it
> right now.
>
> In terms of "what to do about it" - the first thing must be
> to fix sdc.
> Maybe there is a loose cable or a broken cable. Maybe
> the device needs to be
> replaced.
>
> Once you have resolved that and are fairly sure yours
> drives are all working,
> echo check >
> /sys/block/md2/md/sync_action
>
> once that finishes mismatch_cnt should ideally be
> zero. If it isn't, try
> echo repair >
> /sys/block/md2/md/sync_action
>
> but only do that if you are confident that your devices are
> good.
> This will result in the same mismatch_cnt. However a
> subsequent 'check'
> should then show zero.
>
> NeilBrown
Thanks,
I followed your suggestions and all 'appears' to be fine now.
Reality was a wee bit more dramatic than I would have liked!
The machine refused to boot this morning, complaining about disk errors. Fortunately, I had arranged for a hardware-capable friend to come around. He adjusted the cable on the offending drive and I ran fsck twice (lots of alarming messages the first time). On rebooting, the system came up, but a video driver problem prevented the desktop from working. Fortunately I was able to log in from another machine and apply your suggested remedy. After the repair, I rebooted and was able to get into my desktop; subsequent checks revealed the mismatch counts to be all zero (I checked the failed RAID array and the other 2).
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 11:50 ` NeilBrown
2011-04-11 6:50 ` Gavin Flower
@ 2011-04-12 21:30 ` Gavin Flower
2011-04-13 10:57 ` John Robinson
1 sibling, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-12 21:30 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
--- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
[...]
> No, it was clearly a disk-drive problem.
> e.g.
> Apr 7 14:42:12 saturn kernel: [231957.756023]
> ata3.00: failed command: READ FPDMA QUEUED
>
> a READ command sent to a n 'ata' device failed. i.e.
> disk error.
[...]
Hi Neil,
I think it is either a drive or cable problem.
However, I was wondering if /proc/mdstat could list drives in a more consistent manner. The C drive has dropped out and affected all 3 RAID partitions. A quick look at /proc/mdstat suggests that md2 & md1 have the same drive drop out [UUUU_], but a different drive for md0 [UU_UU]. In fact, the list of drives (...sda4[0] sdc4[6](F)...) is not consistent with the [UUUU_] representation even for the same mdN!
# date ; cat /proc/mdstat
Wed Apr 13 08:40:09 NZST 2011
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sda4[0] sdc4[6](F) sdd4[3] sdb4[5] sde4[1]
1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
bitmap: 3/3 pages [12KB], 65536KB chunk
md1 : active raid6 sda2[0] sdc2[5](F) sdd2[3] sde2[2] sdb2[1]
307198464 blocks level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
unused devices: <none>
#
Regards,
Gavin
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-12 21:30 ` Gavin Flower
@ 2011-04-13 10:57 ` John Robinson
2011-04-13 11:13 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: John Robinson @ 2011-04-13 10:57 UTC (permalink / raw)
To: Gavin Flower; +Cc: NeilBrown, linux-raid
On 12/04/2011 22:30, Gavin Flower wrote:
> --- On Fri, 8/4/11, NeilBrown<neilb@suse.de> wrote:
> [...]
>> No, it was clearly a disk-drive problem.
>> e.g.
>> Apr 7 14:42:12 saturn kernel: [231957.756023]
>> ata3.00: failed command: READ FPDMA QUEUED
>>
>> a READ command sent to a n 'ata' device failed. i.e.
>> disk error.
> [...]
>
> Hi Neil,
>
> I think it is either a drive or cable problem.
>
> However, I was wondering if /proc/mdstat could list drives in a more consistent manner. The C drive has dropped out and affected all 3 RAID partitions. A quick look at /proc/mdstat suggests that md2& md1 have the same drive drop out [UUUU_], but a different drive for md0 [UU_UU]. In fact, the list of drives (...sda4[0] sdc4[6](F)...) is not consistent with the [UUUU_] representation even for the same mdN!
>
> # date ; cat /proc/mdstat
> Wed Apr 13 08:40:09 NZST 2011
> Personalities : [raid6] [raid5] [raid4]
>
> md2 : active raid6 sda4[0] sdc4[6](F) sdd4[3] sdb4[5] sde4[1]
> 1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
This looks correct: sorting the first line into md slot order we have:
md2 : active raid6 sda4[0] sde4[1] sdd4[3] sdb4[5] sdc4[6](F)
which is UUUU_
> md1 : active raid6 sda2[0] sdc2[5](F) sdd2[3] sde2[2] sdb2[1]
> 307198464 blocks level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
Similarly:
md1 : active raid6 sda2[0] sdb2[1] sde2[2] sdd2[3] sdc2[5](F)
which is UUUU_
> md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
> 10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
This one I don't get:
md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4] sdc3[5](F)
which ought to be UUUU_ again...
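(For what it's worth, that re-sorting can be done mechanically; a rough one-liner, md0 assumed:
grep '^md0' /proc/mdstat | tr ' ' '\n' | grep '\[' | sort -t'[' -k2 -n
which lists the members in slot order.)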
Perhaps `mdadm -D /dev/md[0-2]` would make things clearer...
Cheers,
John.
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 10:57 ` John Robinson
@ 2011-04-13 11:13 ` NeilBrown
2011-04-13 11:58 ` John Robinson
0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2011-04-13 11:13 UTC (permalink / raw)
To: John Robinson; +Cc: Gavin Flower, linux-raid
On Wed, 13 Apr 2011 11:57:24 +0100 John Robinson
<john.robinson@anonymous.org.uk> wrote:
> On 12/04/2011 22:30, Gavin Flower wrote:
> > --- On Fri, 8/4/11, NeilBrown<neilb@suse.de> wrote:
> > [...]
> >> No, it was clearly a disk-drive problem.
> >> e.g.
> >> Apr 7 14:42:12 saturn kernel: [231957.756023]
> >> ata3.00: failed command: READ FPDMA QUEUED
> >>
> >> a READ command sent to a n 'ata' device failed. i.e.
> >> disk error.
> > [...]
> >
> > Hi Neil,
> >
> > I think it is either a drive or cable problem.
> >
> > However, I was wondering if /proc/mdstat could list drives in a more consistent manner. The C drive has dropped out and affected all 3 RAID partitions. A quick look at /proc/mdstat suggests that md2& md1 have the same drive drop out [UUUU_], but a different drive for md0 [UU_UU]. In fact, the list of drives (...sda4[0] sdc4[6](F)...) is not consistent with the [UUUU_] representation even for the same mdN!
> >
> > # date ; cat /proc/mdstat
> > Wed Apr 13 08:40:09 NZST 2011
> > Personalities : [raid6] [raid5] [raid4]
> >
> > md2 : active raid6 sda4[0] sdc4[6](F) sdd4[3] sdb4[5] sde4[1]
> > 1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
>
> This looks correct: sorting the first line into md slot order we have:
> md2 : active raid6 sda4[0] sde4[1] sdd4[3] sdb4[5] sdc4[6](F)
> which is UUUU_
>
> > md1 : active raid6 sda2[0] sdc2[5](F) sdd2[3] sde2[2] sdb2[1]
> > 307198464 blocks level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
>
> Similarly:
> md1 : active raid6 sda2[0] sdb2[1] sde2[2] sdd2[3] sdc2[5](F)
> which is UUUU_
>
> > md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
> > 10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
>
> This one I don't get:
> md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4] sdc3[5](F)
> which ought to be UUUU_ again...
>
> Perhaps `mdadm -D /dev/md[0-2]` would make things clearer...
>
This is actually more horrible than you imagine.
The number [] is not the role of the device in the raid. Rather it is an
arbitrarily assigned slot number with no real meaning.
The original 0.90 metadata format has two numbers for each device.
These are in mdp_disk_t defined in include/linux/raid/md_p.h
They are 'number' which is the slot number and so is defined for spare
devices as well as active devices.
And there is the 'raid_disk' number which is the role that the device
plays in the array and is well defined for active devices and
meaningless for spares.
mdstat always showed the 'number'.
However the 0.90 format keeps 'number' and 'raid_disk' the same for active
devices (so why have two different numbers - who knows).
So people reasonably jumped to the technically wrong conclusion that the
number inside [] was the role number.
In 1.x, I keep the slot 'number' the same for the life of a device, but change
the role - from 'spare' to an active role to 'failed' - because this makes
sense.
However that means that the number in [] definitely isn't the role number any
more. It might be when the array is created, but it is not certain to stay
that way.
As the current number is pretty much useless, I should probably change it to
the role number, or an arbitrarily assigned larger number for spares.
This would be an incompatible change, but I very much doubt anyone uses the
numbers for what they actually are, so I doubt that would really matter.
It has just never really got high on my list of priorities....
Lesson: Ignore the number in [] - it doesn't mean anything useful.
NeilBrown
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 11:13 ` NeilBrown
@ 2011-04-13 11:58 ` John Robinson
2011-04-13 20:30 ` Gavin Flower
0 siblings, 1 reply; 28+ messages in thread
From: John Robinson @ 2011-04-13 11:58 UTC (permalink / raw)
To: NeilBrown; +Cc: Gavin Flower, linux-raid
On 13/04/2011 12:13, NeilBrown wrote:
> On Wed, 13 Apr 2011 11:57:24 +0100 John Robinson
> <john.robinson@anonymous.org.uk> wrote:
>> On 12/04/2011 22:30, Gavin Flower wrote:
[...]
>>> md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
>>> 10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
>>
>> This one I don't get:
>> md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4] sdc3[5](F)
>> which ought to be UUUU_ again...
>>
>> Perhaps `mdadm -D /dev/md[0-2]` would make things clearer...
>
> This is actually more horrible than you imagine.
It isn't really, I was asking for the mdadm -D output precisely to get
the list of role and slot numbers, having noticed there was no slot 2 in
Gavin's setup...
[...]
> As the current number is pretty much useless, I should probably change it to
> the slot number, or an arbitrarily assigned larger number for spares.
> This would be an incompatible change, but I very much doubt anyone uses the
> numbers for what they actually are, so I doubt that would really matter.
>
> It has just never really got high on my list of priorities....
>
> Lesson: Ignore the number in [] - it doesn't mean anything useful.
It's not useless, it reflects the order in which devices were added to
the array.
Suggestion: Don't change the number in /proc/mdstat, just sort the
devices by role (i.e. the same order as the UUUU_) instead of device
node, and show spares at the end (as per your arbitrarily-assigned
larger number, which this way you never have to display).
Cheers,
John.
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 11:58 ` John Robinson
@ 2011-04-13 20:30 ` Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-13 20:30 UTC (permalink / raw)
To: NeilBrown, John Robinson; +Cc: linux-raid
--- On Wed, 13/4/11, John Robinson <john.robinson@anonymous.org.uk> wrote:
> From: John Robinson <john.robinson@anonymous.org.uk>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "NeilBrown" <neilb@suse.de>
> Cc: "Gavin Flower" <gavinflower@yahoo.com>, linux-raid@vger.kernel.org
> Date: Wednesday, 13 April, 2011, 23:58
> On 13/04/2011 12:13, NeilBrown
> wrote:
> > On Wed, 13 Apr 2011 11:57:24 +0100 John Robinson
> > <john.robinson@anonymous.org.uk>
> wrote:
> >> On 12/04/2011 22:30, Gavin Flower wrote:
> [...]
> >>> md0 : active raid6 sda3[0] sdb3[4] sdd3[3]
> sdc3[5](F) sde3[1]
> >>> 10751808
> blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
> >>
> >> This one I don't get:
> >> md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4]
> sdc3[5](F)
> >> which ought to be UUUU_ again...
> >>
> >> Perhaps `mdadm -D /dev/md[0-2]` would make things
> clearer...
> >
> > This is actually more horrible than you imagine.
>
> It isn't really, I was asking for the mdadm -D output
> precisely to get the list of role and slot numbers, having
> noticed there was no slot 2 in Gavin's setup...
>
> [...]
> > As the current number is pretty much useless, I should
> probably change it to
> > the slot number, or an arbitrarily assigned larger
> number for spares.
> > This would be an incompatible change, but I very much
> doubt anyone uses the
> > numbers for what they actually are, so I doubt that
> would really matter.
> >
> > It has just never really got high on my list of
> priorities....
> >
> > Lesson: Ignore the number in [] - it doesn't
> mean anything useful.
>
> It's not useless, it reflects the order in which devices
> were added to the array.
>
> Suggestion: Don't change the number in /proc/mdstat, just
> sort the devices by role (i.e. the same order as the UUUU_)
> instead of device node, and show spares at the end (as per
> your arbitrarily-assigned larger number, which this way you
> never have to display).
>
> Cheers,
>
> John.
>
The first time I saw this kind of thing, I was very worried, thinking I had 2 bad drives - until I looked more closely, a few hours later. I am sure I am not the only one to react that way initially.
From a user perspective, I think that the list of drives and the [UUUU_] string should be ordered in alphanumeric order of the logical drive names. Also, modify the '[...]' string to indicate spares (not having spares, I am not sure what it does now).
e.g.
/dev/sda /dev/sdb /dev/sdc[F] /dev/sdd /dev/sde[S] /dev/sdf.
would be reflected by:
[aaFaSa]
I am not sure what the 'U' stands for. Marking active disks with 'a' seems to make more sense to me, and the mix of upper and lower case would make it easier to see what is active and what is not. Similarly, from a user perspective, the RAID entries would be better sorted by name.
There are probably lots of technical reasons this can't be done, but users don't care! :-) We just want it to look pretty, be easy to understand, and not be scary.
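(A toy sketch of building the proposed string from that hypothetical drive list, just to make the mapping concrete:)
# hypothetical states for sda..sdf, per the example above
states="active active failed active spare active"
out="["
for s in $states; do
  case $s in
    active) out="${out}a" ;;
    failed) out="${out}F" ;;
    spare)  out="${out}S" ;;
  esac
done
echo "${out}]"   # prints [aaFaSa]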
Just my 2 pennies worth...
When I put my developer hat on, I tend to feel that users rate 'looking pretty' and 'not being scary' as more important than 'being easy to understand' - or perhaps I am just too cynical.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-13 22:24 Gavin Flower
2011-04-13 22:28 ` Mathias Burén
2011-04-13 23:09 ` NeilBrown
0 siblings, 2 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-13 22:24 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
--- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
[...]
> This morning, I noticed my system was extremely
> unresponsive, and that there were clicking sounds coming
> from one of my 5 hard drives.
[...]
Hi Neil,
When I do
badblocks -s -v /dev/sdc
I hear clicking sounds from the hard drive, and notice lots and lots of log messages such as:
ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
ata3: irq_stat 0x00400000, PHY RDY changed
ata3: SError: { Persist PHYRdyChg 10B8B }
ata3: hard resetting link
ata3: softreset failed (device not ready)
ata3: applying SB600 PMP SRST workaround and retrying
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata3.00: configured for UDMA/33
ata3: EH complete
So I assume that the clicking corresponds to the hard reset, but I'm not certain of that. Initially, I thought it might be some kind of disk head problems. Note that smart reports no bad blocks.
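(For reference, this is roughly how I reproduce it, watching the kernel log in a second terminal - a sketch, not the exact invocation:)
badblocks -s -v /dev/sdc                  # read-only scan, as above
tail -f /var/log/messages | grep ata3     # in another terminal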
Regards,
Gavin
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 22:24 Gavin Flower
@ 2011-04-13 22:28 ` Mathias Burén
2011-04-14 0:15 ` Gavin Flower
2011-04-13 23:09 ` NeilBrown
1 sibling, 1 reply; 28+ messages in thread
From: Mathias Burén @ 2011-04-13 22:28 UTC (permalink / raw)
To: Gavin Flower; +Cc: neilb, linux-raid
On 13 April 2011 23:24, Gavin Flower <gavinflower@yahoo.com> wrote:
>
> --- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
>
>> From: Gavin Flower <gavinflower@yahoo.com>
>> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> [...]
>> This morning, I noticed my system was extremely
>> unresponsive, and that there were clicking sounds coming
>> from one of my 5 hard drives.
> [...]
>
> Hi Neil,
>
> When I do
> badblocks -s -v /dev/sdc
> I hear clicking sounds from the hard drive, and notice lots and lots of log messages such as:
> ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
> ata3: irq_stat 0x00400000, PHY RDY changed
> ata3: SError: { Persist PHYRdyChg 10B8B }
> ata3: hard resetting link
> ata3: softreset failed (device not ready)
> ata3: applying SB600 PMP SRST workaround and retrying
> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata3.00: configured for UDMA/33
> ata3: EH complete
>
> So I assume that the clicking corresponds to the hard reset, but I'm not certain of that. Initially, I thought it might be some kind of disk head problems. Note that smart reports no bad blocks.
>
>
> Regards,
> Gavin
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Perhaps you could post the full smartctl -a output?
Regards,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 22:24 Gavin Flower
2011-04-13 22:28 ` Mathias Burén
@ 2011-04-13 23:09 ` NeilBrown
1 sibling, 0 replies; 28+ messages in thread
From: NeilBrown @ 2011-04-13 23:09 UTC (permalink / raw)
To: Gavin Flower; +Cc: linux-raid
On Wed, 13 Apr 2011 15:24:17 -0700 (PDT) Gavin Flower <gavinflower@yahoo.com>
wrote:
>
> --- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
>
> > From: Gavin Flower <gavinflower@yahoo.com>
> > Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> [...]
> > This morning, I noticed my system was extremely
> > unresponsive, and that there were clicking sounds coming
> > from one of my 5 hard drives.
> [...]
>
> Hi Neil,
>
> When I do
> badblocks -s -v /dev/sdc
> I hear clicking sounds from the hard drive, and notice lots and lots of log messages such as:
> ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
> ata3: irq_stat 0x00400000, PHY RDY changed
> ata3: SError: { Persist PHYRdyChg 10B8B }
> ata3: hard resetting link
> ata3: softreset failed (device not ready)
> ata3: applying SB600 PMP SRST workaround and retrying
> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata3.00: configured for UDMA/33
> ata3: EH complete
>
> So I assume that the clicking corresponds to the hard reset, but I'm not certain of that. Initially, I thought it might be some kind of disk head problems. Note that smart reports no bad blocks.
>
>
> Regards,
> Gavin
>
This is completely outside my area of expertise.
My approach to such issues is to replace bits until the issue goes away, and
the last bit I replaced goes in the bin (after suitable double-checks) or
back to the supplier.
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 22:28 ` Mathias Burén
@ 2011-04-14 0:15 ` Gavin Flower
2011-04-14 4:08 ` Roman Mamedov
2011-04-14 13:16 ` Phil Turmel
0 siblings, 2 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-14 0:15 UTC (permalink / raw)
To: Mathias Burén; +Cc: neilb, linux-raid
--- On Thu, 14/4/11, Mathias Burén <mathias.buren@gmail.com> wrote:
> From: Mathias Burén <mathias.buren@gmail.com>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: neilb@suse.de, linux-raid@vger.kernel.org
> Date: Thursday, 14 April, 2011, 10:28
> On 13 April 2011 23:24, Gavin Flower
> <gavinflower@yahoo.com>
> wrote:
> >
> > --- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com>
> wrote:
> >
> >> From: Gavin Flower <gavinflower@yahoo.com>
> >> Subject: RAID6 data-check took almost 2 hours,
> clicking sounds, system unresponsive
> > [...]
> >> This morning, I noticed my system was extremely
> >> unresponsive, and that there were clicking sounds
> coming
> >> from one of my 5 hard drives.
> > [...]
> >
> > Hi Neil,
> >
> > When I do
> > badblocks -s -v /dev/sdc
> > I hear clicking sounds from the hard drive, and notice
> lots and lots of log messages such as:
> > ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200
> action 0xe frozen
> > ata3: irq_stat 0x00400000, PHY RDY changed
> > ata3: SError: { Persist PHYRdyChg 10B8B }
> > ata3: hard resetting link
> > ata3: softreset failed (device not ready)
> > ata3: applying SB600 PMP SRST workaround and retrying
> > ata3: SATA link up 1.5 Gbps (SStatus 113 SControl
> 310)
> > ata3.00: configured for UDMA/33
> > ata3: EH complete
> >
> > So I assume that the clicking corresponds to the hard
> reset, but I'm not certain of that. Initially, I thought
> it might be some kind of disk head problems. Note that
> smart reports no bad blocks.
> >
> >
> > Regards,
> > Gavin
> >
> > --
> > To unsubscribe from this list: send the line
> "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
> Perhaps you could post the full smartctl -a output?
>
> Regards,
> Mathias
>
Hi Mathias,
I was commenting on the clicking sound more than asking for help! However, I am happy to oblige; the output follows below.
I am happy to provide additional diagnostics and log messages, should they be of use.
Regards,
Gavin
# smartctl -a /dev/sdc
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 5VMJ3RJE
Firmware Version: CC38
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Apr 14 12:08:18 2011 NZST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 85) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 87918991
3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 085 085 020 Old_age Always - 16014
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 20251386
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2940
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 093 093 020 Old_age Always - 7999
183 Runtime_Bad_Block 0x0032 076 076 000 Old_age Always - 24
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 055 045 Old_age Always - 33 (Min/Max 17/33)
194 Temperature_Celsius 0x0022 033 045 000 Old_age Always - 33 (0 16 0 0)
195 Hardware_ECC_Recovered 0x001a 031 026 000 Old_age Always - 87918991
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 225696236445405
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 134453215
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 846601860
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
#
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 0:15 ` Gavin Flower
@ 2011-04-14 4:08 ` Roman Mamedov
2011-04-14 13:16 ` Phil Turmel
1 sibling, 0 replies; 28+ messages in thread
From: Roman Mamedov @ 2011-04-14 4:08 UTC (permalink / raw)
To: Gavin Flower; +Cc: Mathias Burén, neilb, linux-raid
[-- Attachment #1: Type: text/plain, Size: 626 bytes --]
On Wed, 13 Apr 2011 17:15:42 -0700 (PDT)
Gavin Flower <gavinflower@yahoo.com> wrote:
> Note that smart reports no bad blocks.
It reports 24 bad blocks:
183 Runtime_Bad_Block       0x0032   076   076   000    Old_age   Always       -       24
Try running a long SMART self test on the drive (smartctl -t long /dev/sdX).
Also for a bit of common sense - why not just try the disk on another PC. If it
produces clicking sounds and "frozen" errors when running "badblocks" even
there, then what else do you expect it to do, jump out of the PC and say "i'm
broken please replace me"? :)
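(I.e. something like this - the second command shows the result once the
test has finished; smartctl prints the expected duration when you start it:)
smartctl -t long /dev/sdc       # start the extended self-test
smartctl -l selftest /dev/sdc   # check the self-test log afterwards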
--
With respect,
Roman
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 0:15 ` Gavin Flower
2011-04-14 4:08 ` Roman Mamedov
@ 2011-04-14 13:16 ` Phil Turmel
2011-04-14 21:12 ` Gavin Flower
1 sibling, 1 reply; 28+ messages in thread
From: Phil Turmel @ 2011-04-14 13:16 UTC (permalink / raw)
To: Gavin Flower; +Cc: Mathias Burén, neilb, linux-raid
Hi Gavin,
I think you might want to investigate your *power supply* ...
On 04/13/2011 08:15 PM, Gavin Flower wrote:
[snip /]
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 87918991
> 3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0
> 4 Start_Stop_Count 0x0032 085 085 020 Old_age Always - 16014
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
> 7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 20251386
> 9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2940
> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
> 12 Power_Cycle_Count 0x0032 093 093 020 Old_age Always - 7999
SMOKING GUN ^^^^
I suspect your power supply is good enough to slowly spin up your drives and get them talking, but when you ask them to work hard, especially when writing, the PS voltage dips enough to reset the drive.
Look up all the power consumption specs for all of your components, and add up the *peak* current requirements. Make sure your PS can handle it.
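(A back-of-the-envelope sketch of that budget - the per-drive figures below are placeholders, so take the real numbers from each datasheet:)
drives=5
amps_5v=2.0      # assumed peak +5V current per drive (placeholder)
amps_12v=2.5     # assumed +12V spin-up current per drive (placeholder)
awk -v n=$drives -v a5=$amps_5v -v a12=$amps_12v 'BEGIN {
  printf "+5V rail:  %.1f A peak (drives only)\n", n*a5
  printf "+12V rail: %.1f A peak (drives only)\n", n*a12
}'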
HTH,
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 13:16 ` Phil Turmel
@ 2011-04-14 21:12 ` Gavin Flower
2011-04-14 22:23 ` Phil Turmel
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-14 21:12 UTC (permalink / raw)
To: Phil Turmel; +Cc: Mathias Burén, neilb, linux-raid
--- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
> From: Phil Turmel <philip@turmel.org>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
> Date: Friday, 15 April, 2011, 1:16
> Hi Gavin,
>
> I think you might want to investigate your *power supply*
> ...
>
> On 04/13/2011 08:15 PM, Gavin Flower wrote:
>
> [snip /]
>
> > SMART Attributes Data Structure revision number: 10
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail Always       -       87918991
> >   3 Spin_Up_Time            0x0003   099   097   000    Pre-fail Always       -       0
> >   4 Start_Stop_Count        0x0032   085   085   020    Old_age  Always       -       16014
> >   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always       -       0
> >   7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail Always       -       20251386
> >   9 Power_On_Hours          0x0032   097   097   000    Old_age  Always       -       2940
> >  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always       -       0
> >  12 Power_Cycle_Count       0x0032   093   093   020    Old_age  Always       -       7999
>
> SMOKING GUN ^^^^
>
> I suspect your power supply is good enough to slowly spin
> up your drives and get them talking, but when you ask them
> to work hard, especially when writing, the PS voltage dips
> enough to reset the drive.
>
> Look up all the power consumption specs for all of your
> components, and add up the *peak* current
> requirements. Make sure your PS can handle it.
>
> HTH,
>
> Phil
>
Hi Phil,
I was under the impression that I had an adequate power supply, so I checked all 5 drives. In fact, I made a table comparing all the SMART entries; the differences I thought were significant follow below. I have the full comparison table, and the original SMART output, in an OpenDocument file, which I will attach to a separate email (in case it gets blocked or dropped, or some such).
Note that Power_Cycle_Count is anomalous only for /dev/sdc, so would this suggest cable problems?
I am not sure what to make of the other discrepancies.
Note that sda, sdb, sdd, & sde were bought and put in at the same time, while sdc was only obtained and inserted recently.
ID#  Attribute                 sda     sdb     sdc     sdd     sde
  4  Start_Stop_Count          720     716     16021   65535   713
  5  Reallocated_Sector_Ct     17      42      0       1       79
  9  Power_On_Hours            12505   12500   2960    12405   12475
 12  Power_Cycle_Count         720     716     7999    719     713
188  Command_Timeout           1040    1       1       0       4
189  High_Fly_Writes           1       0       0       0       0
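(A rough sketch of one way to pull the same attributes from each drive for comparison - the attribute IDs match the rows above, and smartctl's column layout may vary between versions:)
for d in sda sdb sdc sdd sde; do
  echo "== /dev/$d =="
  smartctl -A /dev/$d | awk '$1 ~ /^(4|5|9|12|188|189)$/ { print $2, $10 }'
done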
Only /dev/sda has any errors logged; the 6th error occurred at disk power-on lifetime 12416 hours (517 days + 8 hours).
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 26 52 c2 0c
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 a8 97 51 c2 4c 00 00:07:58.408 READ FPDMA QUEUED
60 00 00 3f 52 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
60 00 00 3f 53 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
60 00 28 3f 54 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
60 00 18 67 54 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-14 21:14 Gavin Flower
2011-04-14 21:19 ` Mathias Burén
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-14 21:14 UTC (permalink / raw)
To: Phil Turmel; +Cc: Mathias Burén, neilb, linux-raid
[-- Attachment #1: Type: text/plain, Size: 561 bytes --]
--- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
> From: Phil Turmel <philip@turmel.org>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
> Date: Friday, 15 April, 2011, 1:16
> Hi Gavin,
>
> I think you might want to investigate your *power supply*
[...]
Attaching OpenDocument file with full details of smart output and comparison table.
Cheers,
Gavin
[-- Attachment #2: raid-notes-20110415-smart.odt --]
[-- Type: application/vnd.oasis.opendocument.text, Size: 18683 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 21:14 Gavin Flower
@ 2011-04-14 21:19 ` Mathias Burén
2011-04-14 23:15 ` John Robinson
0 siblings, 1 reply; 28+ messages in thread
From: Mathias Burén @ 2011-04-14 21:19 UTC (permalink / raw)
To: Gavin Flower; +Cc: Phil Turmel, neilb, linux-raid
On 14 April 2011 22:14, Gavin Flower <gavinflower@yahoo.com> wrote:
> --- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
>
>> From: Phil Turmel <philip@turmel.org>
>> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
>> To: "Gavin Flower" <gavinflower@yahoo.com>
>> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
>> Date: Friday, 15 April, 2011, 1:16
>> Hi Gavin,
>>
>> I think you might want to investigate your *power supply*
> [...]
>
> Attaching OpenDocument file with full details of smart output and comparison table.
>
>
> Cheers,
> Gavin
sda has a value higher than 0 on reported uncorrected sectors. That's
enough for me to replace a drive. (heck, even if I see 1 reallocated
sector I'd RMA it ASAP).
Regards,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 21:12 ` Gavin Flower
@ 2011-04-14 22:23 ` Phil Turmel
2011-04-28 20:03 ` Gavin Flower
0 siblings, 1 reply; 28+ messages in thread
From: Phil Turmel @ 2011-04-14 22:23 UTC (permalink / raw)
To: Gavin Flower; +Cc: Mathias Burén, neilb, linux-raid
On 04/14/2011 05:12 PM, Gavin Flower wrote:
>
> Hi Phil,
>
> I was under the impression that I had an adequate power supply, so I checked all 5 drives. In fact I made a table to compare all the smart entries. The differences I thought were significant follow later. I have the full comparison table, and the original smart output, in an OpenDocument file - which I will attach to a separate email (in case it gets blocked/dropped or some such).
>
> Note that Power_Cycle_Count is anomalous only for /dev/sdc, so would this suggest cable problems?
No two drives are perfectly identical, so when the drive's power rail is only slightly overloaded, the least tolerant drive chokes as the voltage declines (we're talking tens of milliseconds, here). As soon as it chokes, the extra load disappears, and the power supply recovers. The other drives carry on. The drive that choked resets (*Click*) in time for the block driver to try again, and the cycle repeats.
As a test, borrow another power supply and hook just that one drive to it. If the problem continues, the drive is toast. If the problem goes away, look for a better power supply. Note: for the Barracuda with the problem, the detailed spec says the 5V load spikes on activity, not the 12V load. So make sure the current capacity of the power supply meets your needs for both 5V & 12V (plus your motherboard). Also check if the power supply has multiple regulators for drive power, and if you need to re-arrange the connectors to spread the load evenly amongst them.
As another test, you can swap all your cables around. If the problem is in the cables, the problem will follow the cables to the drive you moved them to.
> I am not sure what to make of the other discrepancies.
>
> Note that sda, sdb, sdd, & sde were bought and put in at the same time, while sdc was only obtained and inserted recently.
So sdc came from a different manufacturing batch, which is likely to have slightly different tolerances.
HTH,
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 21:19 ` Mathias Burén
@ 2011-04-14 23:15 ` John Robinson
0 siblings, 0 replies; 28+ messages in thread
From: John Robinson @ 2011-04-14 23:15 UTC (permalink / raw)
To: Mathias Burén; +Cc: Gavin Flower, linux-raid
On 14/04/2011 22:19, Mathias Burén wrote:
> On 14 April 2011 22:14, Gavin Flower<gavinflower@yahoo.com> wrote:
[...]
>> Attaching OpenDocument file with full details of smart output and comparison table.
[...]
> sda has a value higher than 0 on reported uncorrected sectors. That's
> enough for me to replace a drive. (heck, even if I see 1 reallocated
> sector I'd RMA it ASAP).
I've had a look at some of my drives' SMART output, and only my Samsung
drives have it - one showing 57 (with zero reallocated, pending or
offline sectors) and one showing 331 (zero reallocated, one pending and
zero offline sectors). Neither drive has ever given a read error, and I
do run a weekly check on my arrays which has never reported any mismatches.
Googling for this field doesn't give any indication that it is relevant
to determining whether a drive is failing, but if someone with more
SMART expertise can comment I'll be quite happy to be corrected...
Having said that, all of Gavin's drives apart from sdc show non-zero
reallocated sector counts, and that field definitely is one to follow
when considering replacing drives.
Cheers,
John.
(Just started long self-tests on all my local drives)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 22:23 ` Phil Turmel
@ 2011-04-28 20:03 ` Gavin Flower
2011-04-28 20:11 ` Roman Mamedov
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-28 20:03 UTC (permalink / raw)
To: Phil Turmel; +Cc: Mathias Burén, neilb, linux-raid
--- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
> From: Phil Turmel <philip@turmel.org>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
> Date: Friday, 15 April, 2011, 10:23
> On 04/14/2011 05:12 PM, Gavin Flower
> wrote:
>
> >
> > Hi Phil,
> >
> > I was under the impression that I had an adequate
> power supply, so I checked all 5 drives.
[...]
> >
> > Note that Power_Cycle_Count is anomalous only for
> /dev/sdc, so would this suggest cable problems?
>
> No two drives are perfectly identical, so when the drive's
> power rail is only slightly overloaded, the least tolerant
> drive chokes as the voltage declines (we're talking tens of
> milliseconds, here). As soon as it chokes, the extra
> load disappears, and the power supply recovers. The
> other drives carry on. The drive that choked resets
> (*Click*) in time for the block driver to try again, and the
> cycle repeats.
>
> As a test, borrow another power supply and hook just that
> one drive to it. If the problem continues, the drive
> is toast. If the problem goes away, look for a better
> power supply. Note: for the Barracuda with the
> problem, the detailed spec says the 5V load spikes on
> activity, not the 12V load. So make sure the current
> capacity of the power supply meets your needs for both 5V
> & 12V (plus your motherboard). Also check if the
> power supply has multiple regulators for drive power, and if
> you need to re-arrange the connectors to spread the load
> evenly amongst them.
>
> As another test, you can swap all your cables around.
> If the problem is in the cables, the problem will follow the
> cables to the drive you moved them to.
>
> > I am not sure what to make of the other
> discrepancies.
> >
> > Note that sda, sdb, sdd, & sde were bought and put
> in at the same time, while sdc was only obtained and
> inserted recently.
>
> So sdc came from a different manufacturing batch, which is
> likely to have slightly different tolerances.
>
> HTH,
>
> Phil
Thanks Phil,
A few days ago, I noticed that 2 of my 3 RAID arrays were down to 4 out of 5 drives - /dev/sdc had been dropped out, the one which made clicking sounds when I ran badblocks.
A couple of days ago, my friend Mario brought over his oscilloscope and a volt meter. The 5 volt rail was showing about 4.7 volts, whereas typically it should be 5.2 - 5.4 (from memory of what he said), and the voltage looked shaky on the oscilloscope. The old power supply was rated at 400 Watts.
Mario suggested that power supplies rated above 500 Watts tend to be of significantly better quality; he and others also said that power supplies tend to lose the ability to deliver their rated maximum as they age. So, while 400 Watts seemed nominally adequate for my system, I started looking for ones that were at least 500 Watts. I also looked at other features, such as reliability and the ability to support at least 5 SATA drives without using adapters.
I was in the process of checking out various power supplies when my development machine ('saturn') refused to complete the boot process due to RAID problems.
There were many power supplies that would have met my requirements, but I told Mario that I was prepared to pay a bit extra if there was real benefit, as I saw no point in being penny wise and pound foolish, as they say in England. If the time Mario and I (let alone the others who advised me) had spent on this problem were costed, it would have come to more than double the price of the power supply, so I figured paying a bit extra was a good investment. The one Mario obtained for me was the one in stock that met my needs without being too expensive: a Cooler Master Extreme Power Plus 700W, with reasonably robust specifications (MTBF > 100,000 hours, about 11 years; 80% efficiency at typical load).
Reassembling the 2 degraded RAID-6 arrays went okay, and all 3 RAID arrays are now complete.
The system has been running for over 16 hours now with no apparent problems. I ran badblocks on all 5 disks concurrently: no clicking sounds were heard, nor were any errors reported. The 'ata' errors previously seen in the system log are also absent.
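(The concurrent badblocks run was roughly along these lines - a sketch with placeholder log file names, not the exact invocation:)
for d in sda sdb sdc sdd sde; do
  badblocks -s -v /dev/$d > /tmp/badblocks-$d.log 2>&1 &
done
wait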
I very much appreciate the help provided to me by the people on this list.
Regards,
Gavin
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-28 20:03 ` Gavin Flower
@ 2011-04-28 20:11 ` Roman Mamedov
2011-04-28 22:11 ` Phil Turmel
0 siblings, 1 reply; 28+ messages in thread
From: Roman Mamedov @ 2011-04-28 20:11 UTC (permalink / raw)
To: Gavin Flower; +Cc: Phil Turmel, Mathias Burén, neilb, linux-raid
[-- Attachment #1: Type: text/plain, Size: 628 bytes --]
On Thu, 28 Apr 2011 13:03:39 -0700 (PDT)
Gavin Flower <gavinflower@yahoo.com> wrote:
> A couple of days ago, my friend Mario brought over his oscilloscope and a
> volt meter. The 5 volt rail was showing about 4.7 volts, typically it
> should be 5.2 - 5.4 (from memory of what he said), and the voltage looked
> shaky on the oscilloscope. The old power supply rated at 400 Watts.
4.7V is fine; those voltages have a tolerance of +/-10%. And if you have a
deviation there, it is actually better for it to be on the low side than an
equivalent amount on the high side. Reason: no heat increase.
--
With respect,
Roman
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-28 20:11 ` Roman Mamedov
@ 2011-04-28 22:11 ` Phil Turmel
2011-04-28 22:40 ` Phil Turmel
0 siblings, 1 reply; 28+ messages in thread
From: Phil Turmel @ 2011-04-28 22:11 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Gavin Flower, Mathias Burén, neilb, linux-raid
On 04/28/2011 04:11 PM, Roman Mamedov wrote:
> On Thu, 28 Apr 2011 13:03:39 -0700 (PDT)
> Gavin Flower <gavinflower@yahoo.com> wrote:
>
>> A couple of days ago, my friend Mario brought over his oscilloscope and a
>> volt meter. The 5 volt rail was showing about 4.7 volts, typically it
>> should be 5.2 - 5.4 (from memory of what he said), and the voltage looked
>> shaky on the oscilloscope. The old power supply rated at 400 Watts.
>
> 4.7V is fine, those voltages have a tolerance of +/-10%. And if you have a
> deviation there, it is actually better for it to be to the lower side, than
> equivalent to the higher. Reason: no heat increase.
Uh. Close, but no.
Quoting the Seagate manual for that drive:
> 2.7.3 Voltage tolerance
>
> Voltage tolerance (including noise):
> 5V +10% / -7.5%
> 12V +10% / -10.0%
So he was somewhere around 75mV above the low spec, and it was "shaky". If his meter was rounding up to 4.7 from 4.65, he could have been as close as 25mV to the low spec.
Based on the manual, the best noise tolerance will be at 5.0625V, the middle of the spec.
There might be big datacenter engineers that'll trade some noise margin for heat dissipation savings. For that drive's active power consumption, a 5% voltage reduction (half of his noise margin) will save him, at most, 28W (5 drives * 6.19W * 0.95^2). That's a savings of $14 per year for 24/7/365 usage in my household.
I'd spend the $14 for the safety margin on *my* data. (And I do.)
Regards,
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-28 22:11 ` Phil Turmel
@ 2011-04-28 22:40 ` Phil Turmel
0 siblings, 0 replies; 28+ messages in thread
From: Phil Turmel @ 2011-04-28 22:40 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Gavin Flower, Mathias Burén, neilb, linux-raid
Whoops! Hasty. See below.
On 04/28/2011 06:11 PM, Phil Turmel wrote:
> On 04/28/2011 04:11 PM, Roman Mamedov wrote:
>> On Thu, 28 Apr 2011 13:03:39 -0700 (PDT)
>> Gavin Flower <gavinflower@yahoo.com> wrote:
>>
>>> A couple of days ago, my friend Mario brought over his oscilloscope and a
>>> volt meter. The 5 volt rail was showing about 4.7 volts, typically it
>>> should be 5.2 - 5.4 (from memory of what he said), and the voltage looked
>>> shaky on the oscilloscope. The old power supply rated at 400 Watts.
>>
>> 4.7V is fine, those voltages have a tolerance of +/-10%. And if you have a
>> deviation there, it is actually better for it to be to the lower side, than
>> equivalent to the higher. Reason: no heat increase.
>
> Uh. Close, but no.
>
> Quoting the Seagate manual for that drive:
>
>> 2.7.3 Voltage tolerance
>>
>> Voltage tolerance (including noise):
>> 5V +10% / -7.5%
>> 12V +10% / -10.0%
>
> So he was somewhere around 75mV above the low spec, and it was "shaky". If his meter was rounding up to 4.7 from 4.65, he could have been as close as 25mV to the low spec.
>
> Based on the manual, the best noise tolerance will be at 5.0625V, the middle of the spec.
>
> There might be big datacenter engineers that'll trade some noise margin for heat dissipation savings. For that drive's active power consumption, a 5% voltage reduction (half of his noise margin) will save him, at most, 28W (5 drives * 6.19W * 0.95^2). That's a savings of $14 per year for 24/7/365 usage in my household.
Sorry. 28W is what they'll *consume* @ -5%. The *savings* is 3W (5 drives * 6.19W * (1-0.95^2)), at most. $1.50/year.
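(Quick check of both figures, assuming power scales with the square of the voltage:)
awk 'BEGIN { n = 5; w = 6.19
  printf "consumption at -5%%: %.1f W\n", n * w * 0.95^2
  printf "savings vs nominal:  %.1f W\n", n * w * (1 - 0.95^2)
}'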
> I'd spend the $14 for the safety margin on *my* data. (And I do.)
I'd still spend $14, if that's what it was.
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2011-04-28 22:40 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-08 1:32 RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive Gavin Flower
2011-04-08 9:34 ` NeilBrown
2011-04-08 9:59 ` Gavin Flower
2011-04-08 11:50 ` NeilBrown
2011-04-11 6:50 ` Gavin Flower
2011-04-12 21:30 ` Gavin Flower
2011-04-13 10:57 ` John Robinson
2011-04-13 11:13 ` NeilBrown
2011-04-13 11:58 ` John Robinson
2011-04-13 20:30 ` Gavin Flower
-- strict thread matches above, loose matches on Subject: below --
2011-04-14 21:14 Gavin Flower
2011-04-14 21:19 ` Mathias Burén
2011-04-14 23:15 ` John Robinson
2011-04-13 22:24 Gavin Flower
2011-04-13 22:28 ` Mathias Burén
2011-04-14 0:15 ` Gavin Flower
2011-04-14 4:08 ` Roman Mamedov
2011-04-14 13:16 ` Phil Turmel
2011-04-14 21:12 ` Gavin Flower
2011-04-14 22:23 ` Phil Turmel
2011-04-28 20:03 ` Gavin Flower
2011-04-28 20:11 ` Roman Mamedov
2011-04-28 22:11 ` Phil Turmel
2011-04-28 22:40 ` Phil Turmel
2011-04-13 23:09 ` NeilBrown
2011-04-08 2:01 Gavin Flower
2011-04-08 1:34 Gavin Flower
2011-04-07 21:58 Gavin Flower
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).