* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-07 21:58 Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-07 21:58 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil,
After further checking, I found there was no problem with the swap partition.
Cheers,
Gavin
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
--- On Thu, 7/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: neilb@suse.de
> Cc: linux-raid@vger.kernel.org
> Date: Thursday, 7 April, 2011, 18:07
[...]
> Somewhere along the way, I seemed to have lost my swap
> partition!
[...]
* RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-08 1:32 Gavin Flower
2011-04-08 9:34 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 1:32 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil,
My original email may have been eaten: it did not appear on the list, nor did I get an error message back. So perhaps there was a problem with the attached files.
I will resend the attachments one at a time in separate emails.
Cheers,
Gavin
[begin original]
Hi Neil,
Your help (or anybody else's) would be greatly appreciated, yet again!
This morning, I noticed my system was extremely unresponsive, and that there were clicking sounds coming from one of my 5 hard drives. There was also excessive disk I/O even for trivial things like bringing up a directory window, and lots of ata3 errors were being reported to the system log. These symptoms occurred mostly during a RAID check process.
Somewhere along the way, I seemed to have lost my swap partition!
So I did some extensive investigations, which took most of the day. My notes were created in OpenDocument format using LibreOffice, but I have converted them to txt format for inclusion here; I can supply the .odt file if requested.
I have included 2 files:
my notes: raid-notes-20110407a.txt
selected log entries: messages-gcf-20110407-ATA
If there are some additional diagnostics that might prove useful, please let me know.
Cheers,
Gavin
[end original]
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-08 1:34 Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 1:34 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 446 bytes --]
my notes: raid-notes-20110407a.txt
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
--- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: neilb@suse.de
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 13:32
[...]
[-- Attachment #2: raid-notes-20110407a.txt --]
[-- Type: text/plain, Size: 22633 bytes --]
Note that the check on md1 took almost 2 hours!
# grep md1 /var/log/messages
Apr 4 08:25:38 saturn kernel: [ 3.203058] md: md1 stopped.
Apr 4 08:25:38 saturn kernel: [ 3.221821] md/raid:md1: device sda2 operational as raid disk 0
Apr 4 08:25:38 saturn kernel: [ 3.223099] md/raid:md1: device sdc2 operational as raid disk 4
Apr 4 08:25:38 saturn kernel: [ 3.224364] md/raid:md1: device sdd2 operational as raid disk 3
Apr 4 08:25:38 saturn kernel: [ 3.225589] md/raid:md1: device sde2 operational as raid disk 2
Apr 4 08:25:38 saturn kernel: [ 3.226806] md/raid:md1: device sdb2 operational as raid disk 1
Apr 4 08:25:38 saturn kernel: [ 3.229256] md/raid:md1: allocated 5334kB
Apr 4 08:25:38 saturn kernel: [ 3.230500] md/raid:md1: raid level 6 active with 5 out of 5 devices, algorithm 2
Apr 4 08:25:38 saturn kernel: [ 3.232503] md1: detected capacity change from 0 to 314571227136
Apr 4 08:25:38 saturn kernel: [ 3.234559] dracut: mdadm: /dev/md1 has been started with 5 drives.
Apr 4 08:25:38 saturn kernel: [ 3.236257] md1: detected capacity change from 0 to 314571227136
Apr 4 08:25:38 saturn kernel: [ 3.237425] md1: unknown partition table
Apr 4 08:25:38 saturn kernel: [ 9.892068] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: (null)
Apr 5 07:05:28 saturn kernel: [65356.926079] Modules linked in: tcp_lp powernow_k8 freq_table mperf fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_TCPMSS ipt_LOG xt_limit bridge stp llc rmd160 crypto_null camellia lzo lzo_compress cast6 cast5 deflate zlib_deflate cts ctr gcm ccm serpent blowfish twofish_x86_64 twofish_common ecb xcbc cbc sha256_generic sha512_generic des_generic cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key bluetooth rfkill nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_amd kvm usblp r8169 edac_core atl1e uvcvideo mii snd_hda_codec_atihdmi edac_mce_amd shpchp videodev v4l2_compat_ioctl32 asus_atk0110 serio_raw snd_hda_codec_via snd_usb_audio snd_usbmidi_lib joydev snd_hda_intel i2c_piix4 k10temp snd_hda_codec snd_hw
Apr 7 07:54:01 saturn kernel: [207546.188800] md: data-check of RAID array md1
Apr 7 07:54:01 saturn kernel: [207546.188868] md: delaying data-check of md0 until md1 has finished (they share one or more physical units)
Apr 7 07:54:01 saturn kernel: [207546.190517] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
Apr 7 07:54:01 saturn kernel: [207546.190523] md: delaying data-check of md0 until md1 has finished (they share one or more physical units)
Apr 7 08:42:08 saturn kernel: [210414.109856] md/raid:md1: read error corrected (8 sectors at 17195800 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109869] md/raid:md1: read error corrected (8 sectors at 17195808 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109872] md/raid:md1: read error corrected (8 sectors at 17195816 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109875] md/raid:md1: read error corrected (8 sectors at 17195824 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109877] md/raid:md1: read error corrected (8 sectors at 17195832 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109880] md/raid:md1: read error corrected (8 sectors at 17195840 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109883] md/raid:md1: read error corrected (8 sectors at 17195848 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109891] md/raid:md1: read error corrected (8 sectors at 17195856 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109894] md/raid:md1: read error corrected (8 sectors at 17195864 on sdc2)
Apr 7 08:42:08 saturn kernel: [210414.109897] md/raid:md1: read error corrected (8 sectors at 17195872 on sdc2)
Apr 7 08:54:39 saturn kernel: [211161.824066] md/raid:md1: read error corrected (8 sectors at 137014528 on sdc2)
Apr 7 09:51:47 saturn kernel: [214581.140560] md: md1: data-check done.
#
# date ; cat /proc/mdstat
Thu Apr 7 10:31:24 NZST 2011
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sda4[0] sdc4[6] sdd4[3] sdb4[5] sde4[1]
1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
[==========>..........] check = 54.1% (201068416/371581952) finish=32.6min speed=87129K/sec
bitmap: 2/3 pages [8KB], 65536KB chunk
md1 : active raid6 sda2[0] sdc2[4] sdd2[3] sde2[2] sdb2[1]
307198464 blocks level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[2] sde3[1]
10751808 blocks level 6, 64k chunk, algorithm 2 [5/5] [UUUUU]
unused devices: <none>
#
From root@localhost6.localdomain6 Thu Apr 7 11:12:41 2011
Return-Path: <root@localhost6.localdomain6>
Date: Thu, 7 Apr 2011 11:12:40 +1200
From: Anacron <root@localhost6.localdomain6>
To: root@localhost6.localdomain6
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.weekly' on saturn
Status: R
/etc/cron.weekly/99-raid-check:
WARNING: mismatch_cnt is not 0 on /dev/md2
WARNING: mismatch_cnt is not 0 on /dev/md0
# cat /sys/block/md0/md/mismatch_cnt
128
# cat /sys/block/md1/md/mismatch_cnt
0
# cat /sys/block/md2/md/mismatch_cnt
28904
#
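A quick way to pull all three counts at once (a one-line sketch, assuming the arrays are md0, md1 and md2 as above):
# for f in /sys/block/md[0-2]/md/mismatch_cnt; do echo "$f: $(cat $f)"; done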
# e2fsck -f -n /dev/md2
e2fsck 1.41.12 (17-May-2010)
Warning! /dev/md2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix? no
Inode 20186332 was part of the orphaned inode list. IGNORED.
Inode 20317506 was part of the orphaned inode list. IGNORED.
Inode 20317552 was part of the orphaned inode list. IGNORED.
Inode 20317955 was part of the orphaned inode list. IGNORED.
Inode 20447237 was part of the orphaned inode list. IGNORED.
Inode 20447245 was part of the orphaned inode list. IGNORED.
Inode 20447287 was part of the orphaned inode list. IGNORED.
Inode 20447296 was part of the orphaned inode list. IGNORED.
Inode 20447302 was part of the orphaned inode list. IGNORED.
Inode 20447311 was part of the orphaned inode list. IGNORED.
Inode 20447353 was part of the orphaned inode list. IGNORED.
Inode 20447360 was part of the orphaned inode list. IGNORED.
Inode 21500787 was part of the orphaned inode list. IGNORED.
Inode 21628913 was part of the orphaned inode list. IGNORED.
Inode 22158808 was part of the orphaned inode list. IGNORED.
Inode 22158811 was part of the orphaned inode list. IGNORED.
Inode 22158840 was part of the orphaned inode list. IGNORED.
Inode 22158842 was part of the orphaned inode list. IGNORED.
Inode 22158846 was part of the orphaned inode list. IGNORED.
Inode 25952949 was part of the orphaned inode list. IGNORED.
Inode 25953424 was part of the orphaned inode list. IGNORED.
Inode 25954542 was part of the orphaned inode list. IGNORED.
Deleted inode 45088771 has zero dtime. Fix? no
Inode 45088772 was part of the orphaned inode list. IGNORED.
Inode 45088773 was part of the orphaned inode list. IGNORED.
Inode 45088774 was part of the orphaned inode list. IGNORED.
Inode 45088775 was part of the orphaned inode list. IGNORED.
Inode 45088972 was part of the orphaned inode list. IGNORED.
Inode 45089022 was part of the orphaned inode list. IGNORED.
Inode 45089035 was part of the orphaned inode list. IGNORED.
Inode 45089037 was part of the orphaned inode list. IGNORED.
Inode 45089043 was part of the orphaned inode list. IGNORED.
Inode 45089044 was part of the orphaned inode list. IGNORED.
Inode 45089045 was part of the orphaned inode list. IGNORED.
Inode 45089057 was part of the orphaned inode list. IGNORED.
Inode 45089060 was part of the orphaned inode list. IGNORED.
Inode 45089062 was part of the orphaned inode list. IGNORED.
Inode 45089064 was part of the orphaned inode list. IGNORED.
Inode 45089067 was part of the orphaned inode list. IGNORED.
Inode 45089068 was part of the orphaned inode list. IGNORED.
Inode 45089070 was part of the orphaned inode list. IGNORED.
Inode 45089137 was part of the orphaned inode list. IGNORED.
Inode 45089150 was part of the orphaned inode list. IGNORED.
Inode 45089156 was part of the orphaned inode list. IGNORED.
Inode 45089190 was part of the orphaned inode list. IGNORED.
Inode 45089204 was part of the orphaned inode list. IGNORED.
Inode 45089205 was part of the orphaned inode list. IGNORED.
Inode 45089207 was part of the orphaned inode list. IGNORED.
Inode 45089213 was part of the orphaned inode list. IGNORED.
Inode 45089218 was part of the orphaned inode list. IGNORED.
Inode 45089238 was part of the orphaned inode list. IGNORED.
Inode 45089249 was part of the orphaned inode list. IGNORED.
Inode 45089257 was part of the orphaned inode list. IGNORED.
Inode 45089264 was part of the orphaned inode list. IGNORED.
Inode 45089282 was part of the orphaned inode list. IGNORED.
Inode 45089284 was part of the orphaned inode list. IGNORED.
Inode 45089286 was part of the orphaned inode list. IGNORED.
Inode 45089291 was part of the orphaned inode list. IGNORED.
Inode 45089297 was part of the orphaned inode list. IGNORED.
Inode 45089298 was part of the orphaned inode list. IGNORED.
Inode 45089305 was part of the orphaned inode list. IGNORED.
Inode 45089307 was part of the orphaned inode list. IGNORED.
Inode 45089319 was part of the orphaned inode list. IGNORED.
Inode 45089320 was part of the orphaned inode list. IGNORED.
Inode 63705919 was part of the orphaned inode list. IGNORED.
Inode 65938687 was part of the orphaned inode list. IGNORED.
Inode 65939256 was part of the orphaned inode list. IGNORED.
Inode 65939355 was part of the orphaned inode list. IGNORED.
Inode 65939368 was part of the orphaned inode list. IGNORED.
Inode 66191686 was part of the orphaned inode list. IGNORED.
Inode 66191689 was part of the orphaned inode list. IGNORED.
Inode 66191738 was part of the orphaned inode list. IGNORED.
Inode 66191741 was part of the orphaned inode list. IGNORED.
Inode 66191747 was part of the orphaned inode list. IGNORED.
Inode 66197970 was part of the orphaned inode list. IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(2393344--2393372) -(2393792--2393809) -(2470272--2470336) -(2502016--2502080) +(7831552--7841252) +(7841792--7864319) -(79795252--79795253) -(79823488--79823615) -(79824000--79824123) -(79824640--79825142) -79826344 -79898014 -79923101 -(79923154--79923165) -(80123296--80123311) -(80152298--80152301) -80291729 -80291732 -80291759 -80847380 -(80847438--80847441) -80847502 -80847555 -80847736 -80874645 -80874664 -80875873 -(80875914--80875920) -80875927 -80875960 -(80876002--80876004) -80876048 -(80876052--80876056) -(80876600--80876601) -(80876639--80876641) -81330516 -(81334527--81334528) -81334535 -(81821915--81821947) -(81822170--81822204) -(81894559--81894562) -81923317 -81925743 -81925934 -(81925951--81925952) -(81926003--81926004) -(81956735--81957638) -(82971732--82971733) -(82971902--82971903) -(82971917--82971918) -(82971947--82971948) -(82971972--82971991) -85992203 -86516481 -87626360 -88613273 -104083592 -(104083946--104083948) -104083957 -104084073 -104084084 -104084487 -104137397 -104138111 -104236430 -(104236580--104236596) -(104236598--104236610) -(104301814--104301815) -(104301822--104301828) -104343080 -(105686863--105686864) -105686916 -(115903040--115903065) +(115903516--115903541) -134259847 -134284245 -134284593 -(134284674--134284675) -134285473 -(170994896--170994901) -170994959 -170995027 -(180397545--180397547) -(255167322--255167805) -(263756512--263756516) -(263764800--263764807) -(263779568--263779592) -(263782498--263782533) -(264798344--264798348) -(264804016--264804023) -(264804064--264804074) -(264804968--264804973) -(264809216--264809359)
Fix? no
Free blocks count wrong for group #239 (539, counted=32768).
Fix? no
Free blocks count wrong for group #2446 (23057, counted=23053).
Fix? no
Free blocks count wrong (256921638, counted=256646017).
Fix? no
Inode bitmap differences: -20186332 -20317506 -20317552 -20317955 -20447237 -20447245 -20447287 -20447296 -20447302 -20447311 -20447353 -20447360 -21500787 -21628913 -22158808 -22158811 -22158840 -22158842 -22158846 -25952949 -25953424 -25954542 -(45088771--45088775) -45088972 -45089022 -45089035 -45089037 -(45089043--45089045) -45089057 -45089060 -45089062 -45089064 -(45089067--45089068) -45089070 -45089137 -45089150 -45089156 -45089190 -(45089204--45089205) -45089207 -45089213 -45089218 -45089238 -45089249 -45089257 -45089264 -45089282 -45089284 -45089286 -45089291 -(45089297--45089298) -45089305 -45089307 -(45089319--45089320) -63705919 -65938687 -65939256 -65939355 -65939368 -66191686 -66191689 -66191738 -66191741 -66191747 -66197970
Fix? no
Directories count wrong for group #2624 (735, counted=734).
Fix? no
Directories count wrong for group #2640 (735, counted=734).
Fix? no
Directories count wrong for group #2704 (541, counted=540).
Fix? no
Free inodes count wrong (68295781, counted=68268234).
Fix? no
/dev/md2: ********** WARNING: Filesystem still has errors **********
/dev/md2: 1377179/69672960 files (0.4% non-contiguous), 21764826/278686464 blocks
#
# e2fsck -f -n /dev/sda4
e2fsck 1.41.12 (17-May-2010)
e2fsck: Device or resource busy while trying to open /dev/sda4
Filesystem mounted or opened exclusively by another program?
# mdadm --detail /dev/md2
/dev/md2:
Version : 1.1
Creation Time : Wed Nov 24 08:27:42 2010
Raid Level : raid6
Array Size : 1114745856 (1063.10 GiB 1141.50 GB)
Used Dev Size : 371581952 (354.37 GiB 380.50 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Apr 7 12:11:59 2011
State : active
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : localhost.localdomain:2
UUID : a511e656:a742a2f2:f4917939:2d333c7e
Events : 38609
Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 8 68 1 active sync /dev/sde4
5 8 20 2 active sync /dev/sdb4
3 8 52 3 active sync /dev/sdd4
6 8 36 4 active sync /dev/sdc4
#
note absence of /dev/md0 (swap)!!!
# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md2 1097254408 70799328 970717788 7% /
tmpfs 4097108 824 4096284 1% /dev/shm
/dev/sda1 1032088 128772 850888 14% /boot
/dev/md1 302377920 72501428 214516572 26% /data
#
# mdadm -Evs
ARRAY /dev/md1 level=raid6 num-devices=5 UUID=6f1176ae:a0ad6cac:bfe78010:bc810f04
devices=/dev/sde2,/dev/sdc2,/dev/sdd2,/dev/sdb2,/dev/sda2
ARRAY /dev/md0 level=raid6 num-devices=5 UUID=3b76ac20:8253f696:bfe78010:bc810f04
devices=/dev/sde3,/dev/sdc3,/dev/sdd3,/dev/sdb3,/dev/sda3
ARRAY /dev/md/2 level=raid6 metadata=1.1 num-devices=5 UUID=a511e656:a742a2f2:f4917939:2d333c7e name=localhost.localdomain:2
devices=/dev/sde4,/dev/sdc4,/dev/sdd4,/dev/sdb4,/dev/sda4
#
# fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0000ca3a
Device Boot Start End Blocks Id System
/dev/sda1 * 63 2097214 1048576 83 Linux
/dev/sda2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sda3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sda4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000566c1
Device Boot Start End Blocks Id System
/dev/sdb1 63 2097214 1048576 83 Linux
/dev/sdb2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sdb3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sdb4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0000af79
Device Boot Start End Blocks Id System
/dev/sdd1 * 63 2097214 1048576 83 Linux
/dev/sdd2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sdd3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sdd4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00081ccd
Device Boot Start End Blocks Id System
/dev/sdc1 * 63 2097214 1048576 83 Linux
/dev/sdc2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sdc3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sdc4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/sde: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00081ccd
Device Boot Start End Blocks Id System
/dev/sde1 * 63 2097214 1048576 83 Linux
/dev/sde2 2097215 206897214 102400000 fd Linux raid autodetect
/dev/sde3 206897215 214065214 3584000 fd Linux raid autodetect
/dev/sde4 214066125 957233024 371583450 fd Linux raid autodetect
Disk /dev/md0: 11.0 GB, 11009851392 bytes
2 heads, 4 sectors/track, 2687952 cylinders, total 21503616 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 196608 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/md1: 314.6 GB, 314571227136 bytes
2 heads, 4 sectors/track, 76799616 cylinders, total 614396928 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
Disk identifier: 0x00000000
Disk /dev/md1 doesn't contain a valid partition table
Disk /dev/md2: 1141.5 GB, 1141499756544 bytes
2 heads, 4 sectors/track, 278686464 cylinders, total 2229491712 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
Disk identifier: 0x00000000
Disk /dev/md2 doesn't contain a valid partition table
#
# dmraid -b
/dev/sde: 976773168 total, "6VM2FE64"
/dev/sdc: 976773168 total, "5VMJ3RJE"
/dev/sdd: 976773168 total, "6VM2AM98"
/dev/sdb: 976773168 total, "6VM2H5W7"
/dev/sda: 976773168 total, "5VM1VNM9"
#
I ran badblocks for each drive concurrently; note that the run for sda took about an hour longer than the others, but it was sdc that reported a bad block.
# badblocks -s -v /dev/sda
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
# badblocks -s -v /dev/sdb
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
# badblocks -s -v /dev/sdc
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): 236817152one, 58:43 elapsed
done
Pass completed, 1 bad blocks found.
# badblocks -s -v /dev/sdd
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
# badblocks -s -v /dev/sde
Checking blocks 0 to 488386583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
#
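For reference, a sketch of how such concurrent runs could be launched from a single shell (the drive list and log file names here are illustrative, not what I actually typed):
# for d in sda sdb sdc sdd sde; do badblocks -s -v /dev/$d > /root/badblocks-$d.log 2>&1 & done; wait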
Selected lines from the smartctl output:
# smartctl -a /dev/sda
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 5VM1VNM9
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 17
# smartctl -a /dev/sdb
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 6VM2H5W7
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 42
# smartctl -a /dev/sdc
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 5VMJ3RJE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
# smartctl -a /dev/sdd
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 6VM2AM98
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
# smartctl -a /dev/sde
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 6VM2FE64
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 79
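The same attribute can be pulled for all five drives in one pass (sketch):
# for d in /dev/sd[a-e]; do echo "== $d"; smartctl -a $d | grep -E 'Serial Number|Reallocated_Sector_Ct'; done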
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-08 2:01 Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 2:01 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil,
Looks like the log file was simply too big.
Here are the initial and ending lines:
Cheers,
Gavin
output of:
grep -i ATA /var/log/messages
Apr 4 00:46:18 saturn kernel: [58150.946089] pata_atiixp 0000:00:14.1: PCI INT A disabled
Apr 4 00:46:19 saturn kernel: [58151.620996] pata_atiixp 0000:00:14.1: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Apr 4 00:46:19 saturn kernel: [58151.776364] ata6.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.776367] ata6.00: ACPI cmd ef/03:46:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.776370] ata6.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Apr 4 00:46:19 saturn kernel: [58151.792475] ata5.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.792478] ata5.00: ACPI cmd ef/03:42:00:00:00:a0 (SET FEATURES) filtered out
Apr 4 00:46:19 saturn kernel: [58151.792481] ata5.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Apr 4 00:46:19 saturn kernel: [58151.814455] ata5.00: configured for UDMA/33
Apr 4 00:46:19 saturn kernel: [58151.850339] ata6.00: configured for UDMA/100
Apr 4 00:46:19 saturn kernel: [58151.864031] ata3: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864035] ata4: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864038] ata3: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58151.864040] ata4: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58151.864059] ata2: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864061] ata1: softreset failed (device not ready)
Apr 4 00:46:19 saturn kernel: [58151.864063] ata2: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58151.864065] ata1: applying SB600 PMP SRST workaround and retrying
Apr 4 00:46:19 saturn kernel: [58152.019042] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.019046] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.019070] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.019079] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 4 00:46:19 saturn kernel: [58152.021363] ata3.00: configured for UDMA/133
Apr 4 00:46:19 saturn kernel: [58152.085139] ata4.00: configured for UDMA/133
Apr 4 00:46:19 saturn kernel: [58152.085152] ata1.00: configured for UDMA/133
Apr 4 00:46:19 saturn kernel: [58152.085165] ata2.00: configured for UDMA/133
[...]
Apr 7 14:41:58 saturn kernel: [231943.624749] ata3: hard resetting link
Apr 7 14:42:05 saturn kernel: [231950.625059] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 7 14:42:05 saturn kernel: [231950.635608] ata3.00: configured for UDMA/33
Apr 7 14:42:05 saturn kernel: [231950.635617] ata3: EH complete
Apr 7 14:42:05 saturn kernel: [231950.654531] ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x90a00 action 0xe frozen
Apr 7 14:42:05 saturn kernel: [231950.654535] ata3.00: irq_stat 0x01400000, PHY RDY changed
Apr 7 14:42:05 saturn kernel: [231950.654538] ata3: SError: { Persist HostInt PHYRdyChg 10B8B }
Apr 7 14:42:05 saturn kernel: [231950.654541] ata3.00: failed command: READ FPDMA QUEUED
Apr 7 14:42:05 saturn kernel: [231950.654546] ata3.00: cmd 60/80:00:f0:21:3b/00:00:1c:00:00/40 tag 0 ncq 65536 in
Apr 7 14:42:05 saturn kernel: [231950.654547] res 40/00:00:f0:21:3b/00:00:1c:00:00/40 Emask 0x50 (ATA bus error)
Apr 7 14:42:05 saturn kernel: [231950.654550] ata3.00: status: { DRDY }
Apr 7 14:42:05 saturn kernel: [231950.654554] ata3: hard resetting link
Apr 7 14:42:12 saturn kernel: [231957.654285] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 7 14:42:12 saturn kernel: [231957.666115] ata3.00: configured for UDMA/33
Apr 7 14:42:12 saturn kernel: [231957.666123] ata3: EH complete
Apr 7 14:42:12 saturn kernel: [231957.756013] ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x90a00 action 0xe frozen
Apr 7 14:42:12 saturn kernel: [231957.756016] ata3.00: irq_stat 0x01400000, PHY RDY changed
Apr 7 14:42:12 saturn kernel: [231957.756020] ata3: SError: { Persist HostInt PHYRdyChg 10B8B }
Apr 7 14:42:12 saturn kernel: [231957.756023] ata3.00: failed command: READ FPDMA QUEUED
Apr 7 14:42:12 saturn kernel: [231957.756028] ata3.00: cmd 60/80:00:f0:24:3b/00:00:1c:00:00/40 tag 0 ncq 65536 in
Apr 7 14:42:12 saturn kernel: [231957.756029] res 40/00:00:f0:24:3b/00:00:1c:00:00/40 Emask 0x50 (ATA bus error)
Apr 7 14:42:12 saturn kernel: [231957.756032] ata3.00: status: { DRDY }
Apr 7 14:42:12 saturn kernel: [231957.756037] ata3: hard resetting link
Apr 7 14:42:16 saturn kernel: [231961.389026] ata3: softreset failed (device not ready)
Apr 7 14:42:16 saturn kernel: [231961.389032] ata3: applying SB600 PMP SRST workaround and retrying
Apr 7 14:42:16 saturn kernel: [231961.544030] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 7 14:42:16 saturn kernel: [231961.546323] ata3.00: configured for UDMA/33
Apr 7 14:42:16 saturn kernel: [231961.546331] ata3: EH complete
--
All Adults share the Responsibility
to help Raise Today's Children,
for they are Tomorrow's Society!
--- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: neilb@suse.de
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 13:32
[...]
> My original email may have been eaten: as it did not appear
> on the list, nor did I get an error message back. So
> perhaps there was a problem with the attached files.
>
> I will resend the attachments one at a time in separate
> emails.
[...]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 1:32 RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive Gavin Flower
@ 2011-04-08 9:34 ` NeilBrown
2011-04-08 9:59 ` Gavin Flower
0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2011-04-08 9:34 UTC (permalink / raw)
To: Gavin Flower; +Cc: linux-raid
On Thu, 7 Apr 2011 18:32:04 -0700 (PDT) Gavin Flower <gavinflower@yahoo.com>
wrote:
> Hi Neil,
>
> My original email may have been eaten: as it did not appear on the list, nor did I get an error message back. So perhaps there was a problem with the attached files.
>
> I will resend the attachments one at a time in separate emails.
>
>
> Cheers,
> Gavin
>
> [begin original]
> Hi Neil,
>
> Your help (or anybody else's) would be greatly appreciated, yet again
Hi Gavin,
it isn't clear to me what help you want.
Obviously there is some sort of hardware issue - possibly a drive, possibly a
bus problem - I really don't know.
Apart from that things look normal.
What exactly did you want explained?
NeilBrown
>
> This morning, I noticed my system was extremely unresponsive, and that there were clicking sounds coming from one of my 5 hard drives. Also that there was excessive disk I/O even for trivial things like bring up a directory window, and lots of ata3 errors being reported to the system log. These symptoms were mostly during a raid check process.
>
> Somewhere along the way, I seemed to have lost my swap partition!
>
> So I did some extensive investigations, which took most of the day. My notes were created in OpenDocument format using LibreOffice, but I have converted them to txt format for the include - but I can supply the ,odt file if requested.
>
> I Have included 2 files:
> my notes: raid-notes-20110407a.txt
> selected log entries: messages-gcf-20110407-ATA
>
> If there are some additional diagnostics that might prove useful, please let me know.
>
>
> Cheers,
> Gavin
> [end original]
> --
> All Adults share the Responsibility
> to help Raise Today's Children,
> for they are Tomorrow's Society!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 9:34 ` NeilBrown
@ 2011-04-08 9:59 ` Gavin Flower
2011-04-08 11:50 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-08 9:59 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
--- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
> From: NeilBrown <neilb@suse.de>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 21:34
> On Thu, 7 Apr 2011 18:32:04 -0700
> (PDT) Gavin Flower <gavinflower@yahoo.com>
> wrote:
>
> > Hi Neil,
> >
> > My original email may have been eaten: as it did not
> appear on the list, nor did I get an error message
> back. So perhaps there was a problem with the attached
> files.
> >
> > I will resend the attachments one at a time in
> separate emails.
> >
> >
> > Cheers,
> > Gavin
> >
> > [begin original]
> > Hi Neil,
> >
> > Your help (or anybody else's) would be greatly
> appreciated, yet again
>
> Hi Gavin,
> it isn't clear to me what help you want.
>
> Obviously there is some sort of hardware issue - possible a
> drive, possibly a
> bus problem - I really don't know.
>
> Apart from that things look normal.
>
> What exactly did you want explained?
>
> NeilBrown
I guess I was surprised that the RAID system appeared normal and that it did not register any errors. I was hoping to get an idea as to which drive was problematic.
I get the feeling, from your reply, that this is not specifically a RAID problem, that it just happens to affect a RAID array.
I had thought that the RAID system should have been able to give me better diagnostics, but possibly I am being (inadvertently) unreasonable!
Not sure what the significance of this mismatch is, and what I should do about it.
# cat /sys/block/md2/md/mismatch_cnt
28904
#
Thanks,
Gavin
> >
> > This morning, I noticed my system was extremely
> unresponsive, and that there were clicking sounds coming
> from one of my 5 hard drives. Also that there was
> excessive disk I/O even for trivial things like bring up a
> directory window, and lots of ata3 errors being reported to
> the system log. These symptoms were mostly during a
> raid check process.
> >
> > Somewhere along the way, I seemed to have lost my swap
> partition!
> >
> > So I did some extensive investigations, which took
> most of the day. My notes were created in OpenDocument
> format using LibreOffice, but I have converted them to txt
> format for the include - but I can supply the ,odt file if
> requested.
> >
> > I Have included 2 files:
> >
> my notes: raid-notes-20110407a.txt
> > selected log entries:
> messages-gcf-20110407-ATA
> >
> > If there are some additional diagnostics that might
> prove useful, please let me know.
> >
> >
> > Cheers,
> > Gavin
> > [end original]
> > --
> > All Adults share the Responsibility
> > to help Raise Today's Children,
> > for they are Tomorrow's Society!
> > --
> > To unsubscribe from this list: send the line
> "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 9:59 ` Gavin Flower
@ 2011-04-08 11:50 ` NeilBrown
2011-04-11 6:50 ` Gavin Flower
2011-04-12 21:30 ` Gavin Flower
0 siblings, 2 replies; 28+ messages in thread
From: NeilBrown @ 2011-04-08 11:50 UTC (permalink / raw)
To: Gavin Flower; +Cc: linux-raid
On Fri, 8 Apr 2011 02:59:52 -0700 (PDT) Gavin Flower <gavinflower@yahoo.com>
wrote:
>
> --- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
>
> > From: NeilBrown <neilb@suse.de>
> > Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> > To: "Gavin Flower" <gavinflower@yahoo.com>
> > Cc: linux-raid@vger.kernel.org
> > Date: Friday, 8 April, 2011, 21:34
> > On Thu, 7 Apr 2011 18:32:04 -0700
> > (PDT) Gavin Flower <gavinflower@yahoo.com>
> > wrote:
> >
> > > Hi Neil,
> > >
> > > My original email may have been eaten: as it did not
> > appear on the list, nor did I get an error message
> > back. So perhaps there was a problem with the attached
> > files.
> > >
> > > I will resend the attachments one at a time in
> > separate emails.
> > >
> > >
> > > Cheers,
> > > Gavin
> > >
> > > [begin original]
> > > Hi Neil,
> > >
> > > Your help (or anybody else's) would be greatly
> > appreciated, yet again
> >
> > Hi Gavin,
> > it isn't clear to me what help you want.
> >
> > Obviously there is some sort of hardware issue - possible a
> > drive, possibly a
> > bus problem - I really don't know.
> >
> > Apart from that things look normal.
> >
> > What exactly did you want explained?
> >
> > NeilBrown
>
> I guess I was surprised that the RAID system appeared normal and that it did not register any errors. I was hoping to get an idea as to which drive was problematic.
sdc2 was reporting read errors. md/raid6 computed the data from the other
devices and wrote it back to sdc2. This appeared to work so md/raid6 assumed
everything was fine again. It reported this:
Apr 7 08:42:08 saturn kernel: [210414.109880] md/raid:md1: read error corrected (8 sectors at 17195840 on sdc2)
but didn't fail anything.
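If you want to see at a glance which member is taking those hits, the corrected-error lines are easy to pull from the same log you grepped earlier, e.g.
grep 'read error corrected' /var/log/messages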
>
> I get the feeling, from your reply, that this is not specifically a RAID problem, that it just happens to affect a RAID array.
No, it was clearly a disk-drive problem.
e.g.
Apr 7 14:42:12 saturn kernel: [231957.756023] ata3.00: failed command: READ FPDMA QUEUED
a READ command sent to an 'ata' device failed, i.e. a disk error.
>
> I had thought that the RAID system should have been able to give me better diagnostics, but possibly I am being (inadvertently) unreasonable!
Well.... it did tell you that it got a read error and corrected it.
>
> Not sure what the significance of this mismatch is, and what I should do about it.
> # cat /sys/block/md2/md/mismatch_cnt
> 28904
> #
I'm not sure if read errors end up counting as mismatches. They seem to for
raid1. The raid6 code is more complex and I don't feel like decoding it
right now.
In terms of "what to do about it" - the first thing must be to fix sdc.
Maybe there is a loose cable or a broken cable. Maybe the device needs to be
replaced.
Once you have resolved that and are fairly sure your drives are all working,
echo check > /sys/block/md2/md/sync_action
once that finishes mismatch_cnt should ideally be zero. If it isn't, try
echo repair > /sys/block/md2/md/sync_action
but only do that if you are confident that your devices are good.
This will result in the same mismatch_cnt. However a subsequent 'check'
should then show zero.
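Spelled out as a sequence (md2 assumed, and only once you are confident the drives are good):
  echo check > /sys/block/md2/md/sync_action
  cat /proc/mdstat                             # wait for the check to finish
  cat /sys/block/md2/md/mismatch_cnt           # ideally 0
  echo repair > /sys/block/md2/md/sync_action  # only if the count is non-zero
  echo check > /sys/block/md2/md/sync_action   # after the repair completes
  cat /sys/block/md2/md/mismatch_cnt           # should now be 0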
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 11:50 ` NeilBrown
@ 2011-04-11 6:50 ` Gavin Flower
2011-04-12 21:30 ` Gavin Flower
1 sibling, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-11 6:50 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
--- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
> From: NeilBrown <neilb@suse.de>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 23:50
> On Fri, 8 Apr 2011 02:59:52 -0700
> (PDT) Gavin Flower <gavinflower@yahoo.com>
> wrote:
>
> >
> > --- On Fri, 8/4/11, NeilBrown <neilb@suse.de>
> wrote:
> >
> > > From: NeilBrown <neilb@suse.de>
> > > Subject: Re: RAID6 data-check took almost 2
> hours, clicking sounds, system unresponsive
[...]
> > > Obviously there is some sort of hardware issue -
> possible a
> > > drive, possibly a
> > > bus problem - I really don't know.
> > >
> > > Apart from that things look normal.
> > >
> > > What exactly did you want explained?
> > >
> > > NeilBrown
> >
> > I guess I was surprised that the RAID system appeared
> normal and that it did not register any errors. I was
> hoping to get an idea as to which drive was problematic.
>
> sdc2 was reporting read error. md/raid6 computed the
> data from the other
> devices and wrote it back to sdc2. This appeared to
> work so md/raid6 assumed
> everything was fine again. It reported this:
>
> Apr 7 08:42:08 saturn kernel: [210414.109880]
> md/raid:md1: read error corrected (8 sectors at 17195840 on
> sdc2)
>
> but didn't fail anything.
>
>
> >
> > I get the feeling, from your reply, that this is not
> specifically a RAID problem, that it just happens to affect
> a RAID array.
>
> No, it was clearly a disk-drive problem.
> e.g.
> Apr 7 14:42:12 saturn kernel: [231957.756023]
> ata3.00: failed command: READ FPDMA QUEUED
>
> a READ command sent to a n 'ata' device failed. i.e.
> disk error.
>
> >
> > I had thought that the RAID system should have been
> able to give me better diagnostics, but possibly I am being
> (inadvertently) unreasonable!
>
> Well.... it did tell you that it got a read error and
> corrected it.
>
>
> >
> > Not sure what the significance of this mismatch is,
> and what I should do about it.
> > # cat /sys/block/md2/md/mismatch_cnt
> > 28904
> > #
>
> I'm not sure if read errors end up counting as
> mismatches.. They seem to for
> raid1. The raid6 code is more complex and I don't
> feel like decoding it
> right now.
>
> In terms of "what to do about it" - the first thing must be
> to fix sdc.
> Maybe there is a loose cable or a broken cable. Maybe
> the device needs to be
> replaced.
>
> Once you have resolved that and are fairly sure yours
> drives are all working,
> echo check >
> /sys/block/md2/md/sync_action
>
> once that finishes mismatch_cnt should ideally be
> zero. If it isn't, try
> echo repair >
> /sys/block/md2/md/sync_action
>
> but only do that if you are confident that your devices are
> good.
> This will result in the same mismatch_cnt. However a
> subsequent 'check'
> should then show zero.
>
> NeilBrown
Thanks,
I followed your suggestions and all 'appears' to be fine now.
Reality was a wee bit more dramatic than I would have liked!
The machine refused to boot this morning, complaining about disk errors. Fortunately, I had arranged for a hardware-capable friend to come around. He adjusted the cable on the offending drive and I ran fsck twice (lots of alarming messages the first time). On rebooting, the system came up, but a video driver problem prevented the desktop from working. Fortunately I was able to log in from another machine and apply your suggested remedy. After the repair, I rebooted and was able to get into my desktop; subsequent checks revealed the mismatch counts to be all zero (I checked the failed RAID array and the other 2).
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-08 11:50 ` NeilBrown
2011-04-11 6:50 ` Gavin Flower
@ 2011-04-12 21:30 ` Gavin Flower
2011-04-13 10:57 ` John Robinson
1 sibling, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-12 21:30 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
--- On Fri, 8/4/11, NeilBrown <neilb@suse.de> wrote:
[...]
> No, it was clearly a disk-drive problem.
> e.g.
> Apr 7 14:42:12 saturn kernel: [231957.756023]
> ata3.00: failed command: READ FPDMA QUEUED
>
> a READ command sent to a n 'ata' device failed. i.e.
> disk error.
[...]
Hi Neil,
I think it is either a drive or cable problem.
However, I was wondering if /proc/mdstat could list drives in a more consistent manner. The C drive has dropped out and affected all 3 RAID partitions. A quick look at /proc/mdstat suggests that md2 & md1 have the same drive drop out [UUUU_], but a different drive for md0 [UU_UU]. In fact, the list of drives (...sda4[0] sdc4[6](F)...) is not consistent with the [UUUU_] representation even for the same mdN!
# date ; cat /proc/mdstat
Wed Apr 13 08:40:09 NZST 2011
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sda4[0] sdc4[6](F) sdd4[3] sdb4[5] sde4[1]
1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
bitmap: 3/3 pages [12KB], 65536KB chunk
md1 : active raid6 sda2[0] sdc2[5](F) sdd2[3] sde2[2] sdb2[1]
307198464 blocks level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
unused devices: <none>
#
Regards,
Gavin
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-12 21:30 ` Gavin Flower
@ 2011-04-13 10:57 ` John Robinson
2011-04-13 11:13 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: John Robinson @ 2011-04-13 10:57 UTC (permalink / raw)
To: Gavin Flower; +Cc: NeilBrown, linux-raid
On 12/04/2011 22:30, Gavin Flower wrote:
> --- On Fri, 8/4/11, NeilBrown<neilb@suse.de> wrote:
> [...]
>> No, it was clearly a disk-drive problem.
>> e.g.
>> Apr 7 14:42:12 saturn kernel: [231957.756023]
>> ata3.00: failed command: READ FPDMA QUEUED
>>
>> a READ command sent to a n 'ata' device failed. i.e.
>> disk error.
> [...]
>
> Hi Neil,
>
> I think it is either a drive or cable problem.
>
> However, I was wondering if /proc/mdstat could list drives in a more consistent manner. The C drive has dropped out and affected all 3 RAID partitions. A quick look at /proc/mdstat suggests that md2& md1 have the same drive drop out [UUUU_], but a different drive for md0 [UU_UU]. In fact, the list of drives (...sda4[0] sdc4[6](F)...) is not consistent with the [UUUU_] representation even for the same mdN!
>
> # date ; cat /proc/mdstat
> Wed Apr 13 08:40:09 NZST 2011
> Personalities : [raid6] [raid5] [raid4]
>
> md2 : active raid6 sda4[0] sdc4[6](F) sdd4[3] sdb4[5] sde4[1]
> 1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
This looks correct: sorting the first line into md slot order we have:
md2 : active raid6 sda4[0] sde4[1] sdd4[3] sdb4[5] sdc4[6](F)
which is UUUU_
> md1 : active raid6 sda2[0] sdc2[5](F) sdd2[3] sde2[2] sdb2[1]
> 307198464 blocks level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
Similarly:
md1 : active raid6 sda2[0] sdb2[1] sde2[2] sdd2[3] sdc2[5](F)
which is UUUU_
> md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
> 10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
This one I don't get:
md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4] sdc3[5](F)
which ought to be UUUU_ again...
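(For what it's worth, that re-sorting can be done mechanically; a rough one-liner, md0 assumed:
grep '^md0' /proc/mdstat | tr ' ' '\n' | grep '\[' | sort -t'[' -k2 -n
which lists the members in slot order.)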
Perhaps `mdadm -D /dev/md[0-2]` would make things clearer...
Cheers,
John.
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 10:57 ` John Robinson
@ 2011-04-13 11:13 ` NeilBrown
2011-04-13 11:58 ` John Robinson
0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2011-04-13 11:13 UTC (permalink / raw)
To: John Robinson; +Cc: Gavin Flower, linux-raid
On Wed, 13 Apr 2011 11:57:24 +0100 John Robinson
<john.robinson@anonymous.org.uk> wrote:
> On 12/04/2011 22:30, Gavin Flower wrote:
> > --- On Fri, 8/4/11, NeilBrown<neilb@suse.de> wrote:
> > [...]
> >> No, it was clearly a disk-drive problem.
> >> e.g.
> >> Apr 7 14:42:12 saturn kernel: [231957.756023]
> >> ata3.00: failed command: READ FPDMA QUEUED
> >>
> >> a READ command sent to a n 'ata' device failed. i.e.
> >> disk error.
> > [...]
> >
> > Hi Neil,
> >
> > I think it is either a drive or cable problem.
> >
> > However, I was wondering if /proc/mdstat could list drives in a more consistent manner. The C drive has dropped out and affected all 3 RAID partitions. A quick look at /proc/mdstat suggests that md2& md1 have the same drive drop out [UUUU_], but a different drive for md0 [UU_UU]. In fact, the list of drives (...sda4[0] sdc4[6](F)...) is not consistent with the [UUUU_] representation even for the same mdN!
> >
> > # date ; cat /proc/mdstat
> > Wed Apr 13 08:40:09 NZST 2011
> > Personalities : [raid6] [raid5] [raid4]
> >
> > md2 : active raid6 sda4[0] sdc4[6](F) sdd4[3] sdb4[5] sde4[1]
> > 1114745856 blocks super 1.1 level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
>
> This looks correct: sorting the first line into md slot order we have:
> md2 : active raid6 sda4[0] sde4[1] sdd4[3] sdb4[5] sdc4[6](F)
> which is UUUU_
>
> > md1 : active raid6 sda2[0] sdc2[5](F) sdd2[3] sde2[2] sdb2[1]
> > 307198464 blocks level 6, 512k chunk, algorithm 2 [5/4] [UUUU_]
>
> Similarly:
> md1 : active raid6 sda2[0] sdb2[1] sde2[2] sdd2[3] sdc2[5](F)
> which is UUUU_
>
> > md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
> > 10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
>
> This one I don't get:
> md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4] sdc3[5](F)
> which ought to be UUUU_ again...
>
> Perhaps `mdadm -D /dev/md[0-2]` would make things clearer...
>
This is actually more horrible than you imagine.
The number [] is not the role of the device in the raid. Rather it is an
arbitrarily assigned slot number with no real meaning.
The original 0.90 metadata format has two numbers for each device.
These are in mdp_disk_t defined in include/linux/raid/md_p.h
They are 'number' which is the slot number and so is defined for spare
devices as well as active devices.
And there is the 'raid_disk' number which is the role that the device
plays in the array and is well defined for active devices and
meaningless for spares.
mdstat always showed the 'number'.
However the 0.90 format keeps 'number' and 'raid_disk' the same for active
devices (so why have two different numbers - who knows).
So people reasonably jumped to the technically wrong conclusion that the
number inside [] was the role number.
In 1.x, I keep the slot 'number' the same for the life of a device, but change
the role - from 'spare' to an active role to 'failed' - because this makes
sense.
However that means that the number in [] definitely isn't the role number any
more. It might be when the array is created, but it is not certain to stay
that way.
As the current number is pretty much useless, I should probably change it to
the role number, or an arbitrarily assigned larger number for spares.
This would be an incompatible change, but I very much doubt anyone uses the
numbers for what they actually are, so I doubt that would really matter.
It has just never really got high on my list of priorities....
Lesson: Ignore the number in [] - it doesn't mean anything useful.
NeilBrown
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 11:13 ` NeilBrown
@ 2011-04-13 11:58 ` John Robinson
2011-04-13 20:30 ` Gavin Flower
0 siblings, 1 reply; 28+ messages in thread
From: John Robinson @ 2011-04-13 11:58 UTC (permalink / raw)
To: NeilBrown; +Cc: Gavin Flower, linux-raid
On 13/04/2011 12:13, NeilBrown wrote:
> On Wed, 13 Apr 2011 11:57:24 +0100 John Robinson
> <john.robinson@anonymous.org.uk> wrote:
>> On 12/04/2011 22:30, Gavin Flower wrote:
[...]
>>> md0 : active raid6 sda3[0] sdb3[4] sdd3[3] sdc3[5](F) sde3[1]
>>> 10751808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
>>
>> This one I don't get:
>> md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4] sdc3[5](F)
>> which ought to be UUUU_ again...
>>
>> Perhaps `mdadm -D /dev/md[0-2]` would make things clearer...
>
> This is actually more horrible than you imagine.
It isn't really, I was asking for the mdadm -D output precisely to get
the list of role and slot numbers, having noticed there was no slot 2 in
Gavin's setup...
[...]
> As the current number is pretty much useless, I should probably change it to
> the slot number, or an arbitrarily assigned larger number for spares.
> This would be an incompatible change, but I very much doubt anyone uses the
> numbers for what they actually are, so I doubt that would really matter.
>
> It has just never really got high on my list of priorities....
>
> Lesson: Ignore the number in [] - it doesn't mean anything useful.
It's not useless, it reflects the order in which devices were added to
the array.
Suggestion: Don't change the number in /proc/mdstat, just sort the
devices by role (i.e. the same order as the UUUU_) instead of device
node, and show spares at the end (as per your arbitrarily-assigned
larger number, which this way you never have to display).
Cheers,
John.
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 11:58 ` John Robinson
@ 2011-04-13 20:30 ` Gavin Flower
0 siblings, 0 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-13 20:30 UTC (permalink / raw)
To: NeilBrown, John Robinson; +Cc: linux-raid
--- On Wed, 13/4/11, John Robinson <john.robinson@anonymous.org.uk> wrote:
> From: John Robinson <john.robinson@anonymous.org.uk>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "NeilBrown" <neilb@suse.de>
> Cc: "Gavin Flower" <gavinflower@yahoo.com>, linux-raid@vger.kernel.org
> Date: Wednesday, 13 April, 2011, 23:58
> On 13/04/2011 12:13, NeilBrown
> wrote:
> > On Wed, 13 Apr 2011 11:57:24 +0100 John Robinson
> > <john.robinson@anonymous.org.uk>
> wrote:
> >> On 12/04/2011 22:30, Gavin Flower wrote:
> [...]
> >>> md0 : active raid6 sda3[0] sdb3[4] sdd3[3]
> sdc3[5](F) sde3[1]
> >>> 10751808
> blocks level 6, 64k chunk, algorithm 2 [5/4] [UU_UU]
> >>
> >> This one I don't get:
> >> md0 : active raid6 sda3[0] sde3[1] sdd3[3] sdb3[4]
> sdc3[5](F)
> >> which ought to be UUUU_ again...
> >>
> >> Perhaps `mdadm -D /dev/md[0-2]` would make things
> clearer...
> >
> > This is actually more horrible than you imagine.
>
> It isn't really, I was asking for the mdadm -D output
> precisely to get the list of role and slot numbers, having
> noticed there was no slot 2 in Gavin's setup...
>
> [...]
> > As the current number is pretty much useless, I should
> probably change it to
> > the slot number, or an arbitrarily assigned larger
> number for spares.
> > This would be an incompatible change, but I very much
> doubt anyone uses the
> > numbers for what they actually are, so I doubt that
> would really matter.
> >
> > It has just never really got high on my list of
> priorities....
> >
> > Lesson: Ignore the number in [] - it doesn't
> mean anything useful.
>
> It's not useless, it reflects the order in which devices
> were added to the array.
>
> Suggestion: Don't change the number in /proc/mdstat, just
> sort the devices by role (i.e. the same order as the UUUU_)
> instead of device node, and show spares at the end (as per
> your arbitrarily-assigned larger number, which this way you
> never have to display).
>
> Cheers,
>
> John.
>
The first time I saw this kind of thing, I was very worried, thinking I had 2 bad drives - until I looked more closely, a few hours later. I am sure I am not the only one to react that way initially.
From a user perspective, I think that the list of drives and the [UUUU_] string should be ordered in alphanumeric order of the logical drive names. Also, modify the '[...]' string to indicate spares (not having spares, I am not sure what it does now).
e.g.
/dev/sda /dev/sdb /dev/sdc[F] /dev/sdd /dev/sde[S] /dev/sdf.
would be reflected by:
[aaFaSa]
I am not sure what the 'U' stands for. Marking active disks with 'a' seems to make more sense to me, and the mix of upper and lower case would make it easier to see what is active and what is not. Similarly, from a user perspective, the RAID entries would be better sorted by name.
There are probably lots of technical reasons this can't be done, but users don't care! :-) We just want it to look pretty, be easy to understand, and not be scary.
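(A toy sketch of building the proposed string from that hypothetical drive list, just to make the mapping concrete:)
# hypothetical states for sda..sdf, per the example above
states="active active failed active spare active"
out="["
for s in $states; do
  case $s in
    active) out="${out}a" ;;
    failed) out="${out}F" ;;
    spare)  out="${out}S" ;;
  esac
done
echo "${out}]"   # prints [aaFaSa]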
Just my 2 pennies worth...
When I put my developer hat on, I tend to feel that users rate 'looking pretty' and 'not being scary' as more important than 'being easy to understand' - or perhaps I am just too cynical.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-13 22:24 Gavin Flower
2011-04-13 22:28 ` Mathias Burén
2011-04-13 23:09 ` NeilBrown
0 siblings, 2 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-13 22:24 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
--- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
> From: Gavin Flower <gavinflower@yahoo.com>
> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
[...]
> This morning, I noticed my system was extremely
> unresponsive, and that there were clicking sounds coming
> from one of my 5 hard drives.
[...]
Hi Neil,
When I do
badblocks -s -v /dev/sdc
I hear clicking sounds from the hard drive, and notice lots and lots of log messages such as:
ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
ata3: irq_stat 0x00400000, PHY RDY changed
ata3: SError: { Persist PHYRdyChg 10B8B }
ata3: hard resetting link
ata3: softreset failed (device not ready)
ata3: applying SB600 PMP SRST workaround and retrying
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata3.00: configured for UDMA/33
ata3: EH complete
So I assume that the clicking corresponds to the hard reset, but I'm not certain of that. Initially, I thought it might be some kind of disk head problems. Note that smart reports no bad blocks.
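(For reference, this is roughly how I reproduce it, watching the kernel log in a second terminal - a sketch, not the exact invocation:)
badblocks -s -v /dev/sdc                  # read-only scan, as above
tail -f /var/log/messages | grep ata3     # in another terminal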
Regards,
Gavin
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 22:24 Gavin Flower
@ 2011-04-13 22:28 ` Mathias Burén
2011-04-14 0:15 ` Gavin Flower
2011-04-13 23:09 ` NeilBrown
1 sibling, 1 reply; 28+ messages in thread
From: Mathias Burén @ 2011-04-13 22:28 UTC (permalink / raw)
To: Gavin Flower; +Cc: neilb, linux-raid
On 13 April 2011 23:24, Gavin Flower <gavinflower@yahoo.com> wrote:
>
> --- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
>
>> From: Gavin Flower <gavinflower@yahoo.com>
>> Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> [...]
>> This morning, I noticed my system was extremely
>> unresponsive, and that there were clicking sounds coming
>> from one of my 5 hard drives.
> [...]
>
> Hi Neil,
>
> When I do
> badblocks -s -v /dev/sdc
> I hear clicking sounds from the hard drive, and notice lots and lots of log messages such as:
> ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
> ata3: irq_stat 0x00400000, PHY RDY changed
> ata3: SError: { Persist PHYRdyChg 10B8B }
> ata3: hard resetting link
> ata3: softreset failed (device not ready)
> ata3: applying SB600 PMP SRST workaround and retrying
> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata3.00: configured for UDMA/33
> ata3: EH complete
>
> So I assume that the clicking corresponds to the hard reset, but I'm not certain of that. Initially, I thought it might be some kind of disk head problems. Note that smart reports no bad blocks.
>
>
> Regards,
> Gavin
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Perhaps you could post the full smartctl -a output?
Regards,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 22:24 Gavin Flower
2011-04-13 22:28 ` Mathias Burén
@ 2011-04-13 23:09 ` NeilBrown
1 sibling, 0 replies; 28+ messages in thread
From: NeilBrown @ 2011-04-13 23:09 UTC (permalink / raw)
To: Gavin Flower; +Cc: linux-raid
On Wed, 13 Apr 2011 15:24:17 -0700 (PDT) Gavin Flower <gavinflower@yahoo.com>
wrote:
>
> --- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com> wrote:
>
> > From: Gavin Flower <gavinflower@yahoo.com>
> > Subject: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> [...]
> > This morning, I noticed my system was extremely
> > unresponsive, and that there were clicking sounds coming
> > from one of my 5 hard drives.
> [...]
>
> Hi Neil,
>
> When I do
> badblocks -s -v /dev/sdc
> I hear clicking sounds from the hard drive, and notice lots and lots of log messages such as:
> ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
> ata3: irq_stat 0x00400000, PHY RDY changed
> ata3: SError: { Persist PHYRdyChg 10B8B }
> ata3: hard resetting link
> ata3: softreset failed (device not ready)
> ata3: applying SB600 PMP SRST workaround and retrying
> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata3.00: configured for UDMA/33
> ata3: EH complete
>
> So I assume that the clicking corresponds to the hard reset, but I'm not certain of that. Initially, I thought it might be some kind of disk head problems. Note that smart reports no bad blocks.
>
>
> Regards,
> Gavin
>
This is completely outside my area of expertise.
My approach to such issues is to replace bits until the issue goes away, and
the last bit I replaced goes in the bin (after suitable double-checks) or
back to the supplier.
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-13 22:28 ` Mathias Burén
@ 2011-04-14 0:15 ` Gavin Flower
2011-04-14 4:08 ` Roman Mamedov
2011-04-14 13:16 ` Phil Turmel
0 siblings, 2 replies; 28+ messages in thread
From: Gavin Flower @ 2011-04-14 0:15 UTC (permalink / raw)
To: Mathias Burén; +Cc: neilb, linux-raid
--- On Thu, 14/4/11, Mathias Burén <mathias.buren@gmail.com> wrote:
> From: Mathias Burén <mathias.buren@gmail.com>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: neilb@suse.de, linux-raid@vger.kernel.org
> Date: Thursday, 14 April, 2011, 10:28
> On 13 April 2011 23:24, Gavin Flower
> <gavinflower@yahoo.com>
> wrote:
> >
> > --- On Fri, 8/4/11, Gavin Flower <gavinflower@yahoo.com>
> wrote:
> >
> >> From: Gavin Flower <gavinflower@yahoo.com>
> >> Subject: RAID6 data-check took almost 2 hours,
> clicking sounds, system unresponsive
> > [...]
> >> This morning, I noticed my system was extremely
> >> unresponsive, and that there were clicking sounds
> coming
> >> from one of my 5 hard drives.
> > [...]
> >
> > Hi Neil,
> >
> > When I do
> > badblocks -s -v /dev/sdc
> > I hear clicking sounds from the hard drive, and notice
> lots and lots of log messages such as:
> > ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200
> action 0xe frozen
> > ata3: irq_stat 0x00400000, PHY RDY changed
> > ata3: SError: { Persist PHYRdyChg 10B8B }
> > ata3: hard resetting link
> > ata3: softreset failed (device not ready)
> > ata3: applying SB600 PMP SRST workaround and retrying
> > ata3: SATA link up 1.5 Gbps (SStatus 113 SControl
> 310)
> > ata3.00: configured for UDMA/33
> > ata3: EH complete
> >
> > So I assume that the clicking corresponds to the hard
> reset, but I'm not certain of that. Initially, I thought
> it might be some kind of disk head problems. Note that
> smart reports no bad blocks.
> >
> >
> > Regards,
> > Gavin
> >
> > --
> > To unsubscribe from this list: send the line
> "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
> Perhaps you could post the full smartctl -a output?
>
> Regards,
> Mathias
>
Hi Mathias,
I was commenting on the clicking sound more than asking for help! However, I am happy to oblige; the output follows below.
I am happy to provide additional diagnostics and log messages, should they be of use.
Regards,
Gavin
# smartctl -a /dev/sdc
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST3500418AS
Serial Number: 5VMJ3RJE
Firmware Version: CC38
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Apr 14 12:08:18 2011 NZST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 85) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 87918991
3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 085 085 020 Old_age Always - 16014
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 20251386
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2940
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 093 093 020 Old_age Always - 7999
183 Runtime_Bad_Block 0x0032 076 076 000 Old_age Always - 24
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 055 045 Old_age Always - 33 (Min/Max 17/33)
194 Temperature_Celsius 0x0022 033 045 000 Old_age Always - 33 (0 16 0 0)
195 Hardware_ECC_Recovered 0x001a 031 026 000 Old_age Always - 87918991
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 225696236445405
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 134453215
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 846601860
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
#
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 0:15 ` Gavin Flower
@ 2011-04-14 4:08 ` Roman Mamedov
2011-04-14 13:16 ` Phil Turmel
1 sibling, 0 replies; 28+ messages in thread
From: Roman Mamedov @ 2011-04-14 4:08 UTC (permalink / raw)
To: Gavin Flower; +Cc: Mathias Burén, neilb, linux-raid
[-- Attachment #1: Type: text/plain, Size: 626 bytes --]
On Wed, 13 Apr 2011 17:15:42 -0700 (PDT)
Gavin Flower <gavinflower@yahoo.com> wrote:
> Note that smart reports no bad blocks.
It reports 24 bad blocks:
183 Runtime_Bad_Block       0x0032   076   076   000    Old_age   Always       -       24
Try running a long SMART self test on the drive (smartctl -t long /dev/sdX).
Also for a bit of common sense - why not just try the disk on another PC. If it
produces clicking sounds and "frozen" errors when running "badblocks" even
there, then what else do you expect it to do, jump out of the PC and say "i'm
broken please replace me"? :)
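(I.e. something like this - the second command shows the result once the
test has finished; smartctl prints the expected duration when you start it:)
smartctl -t long /dev/sdc       # start the extended self-test
smartctl -l selftest /dev/sdc   # check the self-test log afterwards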
--
With respect,
Roman
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 0:15 ` Gavin Flower
2011-04-14 4:08 ` Roman Mamedov
@ 2011-04-14 13:16 ` Phil Turmel
2011-04-14 21:12 ` Gavin Flower
1 sibling, 1 reply; 28+ messages in thread
From: Phil Turmel @ 2011-04-14 13:16 UTC (permalink / raw)
To: Gavin Flower; +Cc: Mathias Burén, neilb, linux-raid
Hi Gavin,
I think you might want to investigate your *power supply* ...
On 04/13/2011 08:15 PM, Gavin Flower wrote:
[snip /]
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 87918991
> 3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0
> 4 Start_Stop_Count 0x0032 085 085 020 Old_age Always - 16014
> 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
> 7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 20251386
> 9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2940
> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
> 12 Power_Cycle_Count 0x0032 093 093 020 Old_age Always - 7999
SMOKING GUN ^^^^
I suspect your power supply is good enough to slowly spin up your drives and get them talking, but when you ask them to work hard, especially when writing, the PS voltage dips enough to reset the drive.
Look up all the power consumption specs for all of your components, and add up the *peak* current requirements. Make sure your PS can handle it.
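(A back-of-the-envelope sketch of that budget - the per-drive figures below are placeholders, so take the real numbers from each datasheet:)
drives=5
amps_5v=2.0      # assumed peak +5V current per drive (placeholder)
amps_12v=2.5     # assumed +12V spin-up current per drive (placeholder)
awk -v n=$drives -v a5=$amps_5v -v a12=$amps_12v 'BEGIN {
  printf "+5V rail:  %.1f A peak (drives only)\n", n*a5
  printf "+12V rail: %.1f A peak (drives only)\n", n*a12
}'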
HTH,
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 13:16 ` Phil Turmel
@ 2011-04-14 21:12 ` Gavin Flower
2011-04-14 22:23 ` Phil Turmel
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-14 21:12 UTC (permalink / raw)
To: Phil Turmel; +Cc: Mathias Burén, neilb, linux-raid
--- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
> From: Phil Turmel <philip@turmel.org>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
> Date: Friday, 15 April, 2011, 1:16
> Hi Gavin,
>
> I think you might want to investigate your *power supply*
> ...
>
> On 04/13/2011 08:15 PM, Gavin Flower wrote:
>
> [snip /]
>
> > SMART Attributes Data Structure revision number: 10
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail Always       -       87918991
> >   3 Spin_Up_Time            0x0003   099   097   000    Pre-fail Always       -       0
> >   4 Start_Stop_Count        0x0032   085   085   020    Old_age  Always       -       16014
> >   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always       -       0
> >   7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail Always       -       20251386
> >   9 Power_On_Hours          0x0032   097   097   000    Old_age  Always       -       2940
> >  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always       -       0
> >  12 Power_Cycle_Count       0x0032   093   093   020    Old_age  Always       -       7999
>
> SMOKING GUN ^^^^
>
> I suspect your power supply is good enough to slowly spin
> up your drives and get them talking, but when you ask them
> to work hard, especially when writing, the PS voltage dips
> enough to reset the drive.
>
> Look up all the power consumption specs for all of your
> components, and add up the *peak* current
> requirements. Make sure your PS can handle it.
>
> HTH,
>
> Phil
>
Hi Phil,
I was under the impression that I had an adequate power supply, so I checked all 5 drives. In fact, I made a table comparing all the SMART entries; the differences I thought were significant follow below. I have the full comparison table, and the original SMART output, in an OpenDocument file, which I will attach to a separate email (in case it gets blocked or dropped, or some such).
Note that Power_Cycle_Count is anomalous only for /dev/sdc, so would this suggest cable problems?
I am not sure what to make of the other discrepancies.
Note that sda, sdb, sdd, & sde were bought and put in at the same time, while sdc was only obtained and inserted recently.
ID#  Attribute                 sda     sdb     sdc     sdd     sde
  4  Start_Stop_Count          720     716     16021   65535   713
  5  Reallocated_Sector_Ct     17      42      0       1       79
  9  Power_On_Hours            12505   12500   2960    12405   12475
 12  Power_Cycle_Count         720     716     7999    719     713
188  Command_Timeout           1040    1       1       0       4
189  High_Fly_Writes           1       0       0       0       0
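(A rough sketch of one way to pull the same attributes from each drive for comparison - the attribute IDs match the rows above, and smartctl's column layout may vary between versions:)
for d in sda sdb sdc sdd sde; do
  echo "== /dev/$d =="
  smartctl -A /dev/$d | awk '$1 ~ /^(4|5|9|12|188|189)$/ { print $2, $10 }'
done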
Only /dev/sda has any errors logged; the 6th error occurred at disk power-on lifetime 12416 hours (517 days + 8 hours).
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 26 52 c2 0c
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 a8 97 51 c2 4c 00 00:07:58.408 READ FPDMA QUEUED
60 00 00 3f 52 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
60 00 00 3f 53 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
60 00 28 3f 54 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
60 00 18 67 54 c2 4c 00 00:07:58.407 READ FPDMA QUEUED
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
@ 2011-04-14 21:14 Gavin Flower
2011-04-14 21:19 ` Mathias Burén
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-14 21:14 UTC (permalink / raw)
To: Phil Turmel; +Cc: Mathias Burén, neilb, linux-raid
[-- Attachment #1: Type: text/plain, Size: 561 bytes --]
--- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
> From: Phil Turmel <philip@turmel.org>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
> Date: Friday, 15 April, 2011, 1:16
> Hi Gavin,
>
> I think you might want to investigate your *power supply*
[...]
Attaching OpenDocument file with full details of smart output and comparison table.
Cheers,
Gavin
[-- Attachment #2: raid-notes-20110415-smart.odt --]
[-- Type: application/vnd.oasis.opendocument.text, Size: 18683 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 21:14 Gavin Flower
@ 2011-04-14 21:19 ` Mathias Burén
2011-04-14 23:15 ` John Robinson
0 siblings, 1 reply; 28+ messages in thread
From: Mathias Burén @ 2011-04-14 21:19 UTC (permalink / raw)
To: Gavin Flower; +Cc: Phil Turmel, neilb, linux-raid
On 14 April 2011 22:14, Gavin Flower <gavinflower@yahoo.com> wrote:
> --- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
>
>> From: Phil Turmel <philip@turmel.org>
>> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
>> To: "Gavin Flower" <gavinflower@yahoo.com>
>> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
>> Date: Friday, 15 April, 2011, 1:16
>> Hi Gavin,
>>
>> I think you might want to investigate your *power supply*
> [...]
>
> Attaching OpenDocument file with full details of smart output and comparison table.
>
>
> Cheers,
> Gavin
sda has a value higher than 0 on reported uncorrected sectors. That's
enough for me to replace a drive. (heck, even if I see 1 reallocated
sector I'd RMA it ASAP).
Regards,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 21:12 ` Gavin Flower
@ 2011-04-14 22:23 ` Phil Turmel
2011-04-28 20:03 ` Gavin Flower
0 siblings, 1 reply; 28+ messages in thread
From: Phil Turmel @ 2011-04-14 22:23 UTC (permalink / raw)
To: Gavin Flower; +Cc: Mathias Burén, neilb, linux-raid
On 04/14/2011 05:12 PM, Gavin Flower wrote:
>
> Hi Phil,
>
> I was under the impression that I had an adequate power supply, so I checked all 5 drives. In fact I made a table to compare all the smart entries. The differences I thought were significant follow later. I have the full comparison table, and the original smart output, in an OpenDocument file - which I will attach to a separate email (in case it gets blocked/dropped or some such).
>
> Note that Power_Cycle_Count is anomalous only for /dev/sdc, so would this suggest cable problems?
No two drives are perfectly identical, so when the drive's power rail is only slightly overloaded, the least tolerant drive chokes as the voltage declines (we're talking tens of milliseconds, here). As soon as it chokes, the extra load disappears, and the power supply recovers. The other drives carry on. The drive that choked resets (*Click*) in time for the block driver to try again, and the cycle repeats.
As a test, borrow another power supply and hook just that one drive to it. If the problem continues, the drive is toast. If the problem goes away, look for a better power supply. Note: for the Barracuda with the problem, the detailed spec says the 5V load spikes on activity, not the 12V load. So make sure the current capacity of the power supply meets your needs for both 5V & 12V (plus your motherboard). Also check if the power supply has multiple regulators for drive power, and if you need to re-arrange the connectors to spread the load evenly amongst them.
As another test, you can swap all your cables around. If the problem is in the cables, the problem will follow the cables to the drive you moved them to.
> I am not sure what to make of the other discrepancies.
>
> Note that sda, sdb, sdd, & sde were bought and put in at the same time, while sdc was only obtained and inserted recently.
So sdc came from a different manufacturing batch, which is likely to have slightly different tolerances.
HTH,
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 21:19 ` Mathias Burén
@ 2011-04-14 23:15 ` John Robinson
0 siblings, 0 replies; 28+ messages in thread
From: John Robinson @ 2011-04-14 23:15 UTC (permalink / raw)
To: Mathias Burén; +Cc: Gavin Flower, linux-raid
On 14/04/2011 22:19, Mathias Burén wrote:
> On 14 April 2011 22:14, Gavin Flower<gavinflower@yahoo.com> wrote:
[...]
>> Attaching OpenDocument file with full details of smart output and comparison table.
[...]
> sda has a value higher than 0 on reported uncorrected sectors. That's
> enough for me to replace a drive. (heck, even if I see 1 reallocated
> sector I'd RMA it ASAP).
I've had a look at some of my drives' SMART output, and only my Samsung
drives have it - one showing 57 (with zero reallocated, pending or
offline sectors) and one showing 331 (zero reallocated, one pending and
zero offline sectors). Neither drive has ever given a read error, and I
do run a weekly check on my arrays which has never reported any mismatches.
Googling for this field doesn't give any indication that it is relevant
to determining whether a drive is failing, but if someone with more
SMART expertise can comment I'll be quite happy to be corrected...
Having said that, all of Gavin's drives apart from sdc show non-zero
reallocated sector counts, and that field definitely is one to follow
when considering replacing drives.
Cheers,
John.
(Just started long self-tests on all my local drives)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-14 22:23 ` Phil Turmel
@ 2011-04-28 20:03 ` Gavin Flower
2011-04-28 20:11 ` Roman Mamedov
0 siblings, 1 reply; 28+ messages in thread
From: Gavin Flower @ 2011-04-28 20:03 UTC (permalink / raw)
To: Phil Turmel; +Cc: Mathias Burén, neilb, linux-raid
--- On Fri, 15/4/11, Phil Turmel <philip@turmel.org> wrote:
> From: Phil Turmel <philip@turmel.org>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@yahoo.com>
> Cc: "Mathias Burén" <mathias.buren@gmail.com>, neilb@suse.de, linux-raid@vger.kernel.org
> Date: Friday, 15 April, 2011, 10:23
> On 04/14/2011 05:12 PM, Gavin Flower
> wrote:
>
> >
> > Hi Phil,
> >
> > I was under the impression that I had an adequate
> power supply, so I checked all 5 drives.
[...]
> >
> > Note that Power_Cycle_Count is anomalous only for
> /dev/sdc, so would this suggest cable problems?
>
> No two drives are perfectly identical, so when the drive's
> power rail is only slightly overloaded, the least tolerant
> drive chokes as the voltage declines (we're talking tens of
> milliseconds, here). As soon as it chokes, the extra
> load disappears, and the power supply recovers. The
> other drives carry on. The drive that choked resets
> (*Click*) in time for the block driver to try again, and the
> cycle repeats.
>
> As a test, borrow another power supply and hook just that
> one drive to it. If the problem continues, the drive
> is toast. If the problem goes away, look for a better
> power supply. Note: for the Barracuda with the
> problem, the detailed spec says the 5V load spikes on
> activity, not the 12V load. So make sure the current
> capacity of the power supply meets your needs for both 5V
> & 12V (plus your motherboard). Also check if the
> power supply has multiple regulators for drive power, and if
> you need to re-arrange the connectors to spread the load
> evenly amongst them.
>
> As another test, you can swap all your cables around.
> If the problem is in the cables, the problem will follow the
> cables to the drive you moved them to.
>
> > I am not sure what to make of the other
> discrepancies.
> >
> > Note that sda, sdb, sdd, & sde were bought and put
> in at the same time, while sdc was only obtained and
> inserted recently.
>
> So sdc came from a different manufacturing batch, which is
> likely to have slightly different tolerances.
>
> HTH,
>
> Phil
Thanks Phil,
A few days ago, I noticed that 2 of my 3 RAID arrays were down to 4 out of 5 drives - /dev/sdc had been dropped out, the one which made clicking sounds when I ran badblocks.
A couple of days ago, my friend Mario brought over his oscilloscope and a volt meter. The 5 volt rail was showing about 4.7 volts, whereas typically it should be 5.2 - 5.4 (from memory of what he said), and the voltage looked shaky on the oscilloscope. The old power supply was rated at 400 Watts.
Mario suggested that power supplies rated above 500 Watts tend to be of significantly better quality; he and others also said that power supplies tend to lose the ability to deliver their rated maximum as they age. So, while 400 Watts seemed nominally adequate for my system, I started looking for ones that were at least 500 Watts. I also looked at other features, such as reliability and the ability to support at least 5 SATA drives without using adapters.
I was in the process of checking out various power supplies when my development machine ('saturn') refused to complete the boot process due to RAID problems.
There were many power supplies that would have met my requirements, but I told Mario that I was prepared to pay a bit extra if there was real benefit, as I saw no point in being penny wise and pound foolish, as they say in England. If the time Mario and I (let alone the others who advised me) had spent on this problem were costed, it would have come to more than double the price of the power supply, so I figured paying a bit extra was a good investment. The one Mario obtained for me was the one in stock that met my needs without being too expensive: a Cooler Master Extreme Power Plus 700W, with reasonably robust specifications (MTBF > 100,000 hours, about 11 years; 80% efficiency at typical load).
Reassembling the 2 degraded RAID-6 arrays went okay, and all 3 RAID arrays are now complete.
The system has been running for over 16 hours now with no apparent problems. I ran badblocks on all 5 disks concurrently: no clicking sounds were heard, nor were any errors reported. The 'ata' errors previously seen in the system log are also absent.
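(The concurrent badblocks run was roughly along these lines - a sketch with placeholder log file names, not the exact invocation:)
for d in sda sdb sdc sdd sde; do
  badblocks -s -v /dev/$d > /tmp/badblocks-$d.log 2>&1 &
done
wait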
I very much appreciate the help provided to me by the people on this list.
Regards,
Gavin
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-28 20:03 ` Gavin Flower
@ 2011-04-28 20:11 ` Roman Mamedov
2011-04-28 22:11 ` Phil Turmel
0 siblings, 1 reply; 28+ messages in thread
From: Roman Mamedov @ 2011-04-28 20:11 UTC (permalink / raw)
To: Gavin Flower; +Cc: Phil Turmel, Mathias Burén, neilb, linux-raid
[-- Attachment #1: Type: text/plain, Size: 628 bytes --]
On Thu, 28 Apr 2011 13:03:39 -0700 (PDT)
Gavin Flower <gavinflower@yahoo.com> wrote:
> A couple of days ago, my friend Mario brought over his oscilloscope and a
> volt meter. The 5 volt rail was showing about 4.7 volts, typically it
> should be 5.2 - 5.4 (from memory of what he said), and the voltage looked
> shaky on the oscilloscope. The old power supply rated at 400 Watts.
4.7V is fine; those voltages have a tolerance of +/-10%. And if you have a
deviation there, it is actually better for it to be on the low side than an
equivalent amount on the high side. Reason: no heat increase.
--
With respect,
Roman
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-28 20:11 ` Roman Mamedov
@ 2011-04-28 22:11 ` Phil Turmel
2011-04-28 22:40 ` Phil Turmel
0 siblings, 1 reply; 28+ messages in thread
From: Phil Turmel @ 2011-04-28 22:11 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Gavin Flower, Mathias Burén, neilb, linux-raid
On 04/28/2011 04:11 PM, Roman Mamedov wrote:
> On Thu, 28 Apr 2011 13:03:39 -0700 (PDT)
> Gavin Flower <gavinflower@yahoo.com> wrote:
>
>> A couple of days ago, my friend Mario brought over his oscilloscope and a
>> volt meter. The 5 volt rail was showing about 4.7 volts, typically it
>> should be 5.2 - 5.4 (from memory of what he said), and the voltage looked
>> shaky on the oscilloscope. The old power supply rated at 400 Watts.
>
> 4.7V is fine, those voltages have a tolerance of +/-10%. And if you have a
> deviation there, it is actually better for it to be to the lower side, than
> equivalent to the higher. Reason: no heat increase.
Uh. Close, but no.
Quoting the Seagate manual for that drive:
> 2.7.3 Voltage tolerance
>
> Voltage tolerance (including noise):
> 5V +10% / -7.5%
> 12V +10% / -10.0%
So he was somewhere around 75mV above the low spec, and it was "shaky". If his meter was rounding up to 4.7 from 4.65, he could have been as close as 25mV to the low spec.
Based on the manual, the best noise tolerance will be at 5.0625V, the middle of the spec.
There might be big datacenter engineers that'll trade some noise margin for heat dissipation savings. For that drive's active power consumption, a 5% voltage reduction (half of his noise margin) will save him, at most, 28W (5 drives * 6.19W * 0.95^2). That's a savings of $14 per year for 24/7/365 usage in my household.
I'd spend the $14 for the safety margin on *my* data. (And I do.)
Regards,
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
2011-04-28 22:11 ` Phil Turmel
@ 2011-04-28 22:40 ` Phil Turmel
0 siblings, 0 replies; 28+ messages in thread
From: Phil Turmel @ 2011-04-28 22:40 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Gavin Flower, Mathias Burén, neilb, linux-raid
Whoops! Hasty. See below.
On 04/28/2011 06:11 PM, Phil Turmel wrote:
> On 04/28/2011 04:11 PM, Roman Mamedov wrote:
>> On Thu, 28 Apr 2011 13:03:39 -0700 (PDT)
>> Gavin Flower <gavinflower@yahoo.com> wrote:
>>
>>> A couple of days ago, my friend Mario brought over his oscilloscope and a
>>> volt meter. The 5 volt rail was showing about 4.7 volts, typically it
>>> should be 5.2 - 5.4 (from memory of what he said), and the voltage looked
>>> shaky on the oscilloscope. The old power supply rated at 400 Watts.
>>
>> 4.7V is fine, those voltages have a tolerance of +/-10%. And if you have a
>> deviation there, it is actually better for it to be to the lower side, than
>> equivalent to the higher. Reason: no heat increase.
>
> Uh. Close, but no.
>
> Quoting the Seagate manual for that drive:
>
>> 2.7.3 Voltage tolerance
>>
>> Voltage tolerance (including noise):
>> 5V +10% / -7.5%
>> 12V +10% / -10.0%
>
> So he was somewhere around 75mV above the low spec, and it was "shaky". If his meter was rounding up to 4.7 from 4.65, he could have been as close as 25mV to the low spec.
>
> Based on the manual, the best noise tolerance will be at 5.0625V, the middle of the spec.
>
> There might be big datacenter engineers that'll trade some noise margin for heat dissipation savings. For that drive's active power consumption, a 5% voltage reduction (half of his noise margin) will save him, at most, 28W (5 drives * 6.19W * 0.95^2). That's a savings of $14 per year for 24/7/365 usage in my household.
Sorry. 28W is what they'll *consume* @ -5%. The *savings* is 3W (5 drives * 6.19W * (1-0.95^2)), at most. $1.50/year.
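(Quick check of both figures, assuming power scales with the square of the voltage:)
awk 'BEGIN { n = 5; w = 6.19
  printf "consumption at -5%%: %.1f W\n", n * w * 0.95^2
  printf "savings vs nominal:  %.1f W\n", n * w * (1 - 0.95^2)
}'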
> I'd spend the $14 for the safety margin on *my* data. (And I do.)
I'd still spend $14, if that's what it was.
Phil
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2011-04-28 22:40 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-08 1:32 RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive Gavin Flower
2011-04-08 9:34 ` NeilBrown
2011-04-08 9:59 ` Gavin Flower
2011-04-08 11:50 ` NeilBrown
2011-04-11 6:50 ` Gavin Flower
2011-04-12 21:30 ` Gavin Flower
2011-04-13 10:57 ` John Robinson
2011-04-13 11:13 ` NeilBrown
2011-04-13 11:58 ` John Robinson
2011-04-13 20:30 ` Gavin Flower
-- strict thread matches above, loose matches on Subject: below --
2011-04-14 21:14 Gavin Flower
2011-04-14 21:19 ` Mathias Burén
2011-04-14 23:15 ` John Robinson
2011-04-13 22:24 Gavin Flower
2011-04-13 22:28 ` Mathias Burén
2011-04-14 0:15 ` Gavin Flower
2011-04-14 4:08 ` Roman Mamedov
2011-04-14 13:16 ` Phil Turmel
2011-04-14 21:12 ` Gavin Flower
2011-04-14 22:23 ` Phil Turmel
2011-04-28 20:03 ` Gavin Flower
2011-04-28 20:11 ` Roman Mamedov
2011-04-28 22:11 ` Phil Turmel
2011-04-28 22:40 ` Phil Turmel
2011-04-13 23:09 ` NeilBrown
2011-04-08 2:01 Gavin Flower
2011-04-08 1:34 Gavin Flower
2011-04-07 21:58 Gavin Flower
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).