linux-raid.vger.kernel.org archive mirror
* Request for assistance
@ 2016-07-06  0:13 o1bigtenor
  2016-07-06  1:55 ` Adam Goryachev
  2016-07-06  7:39 ` keld
  0 siblings, 2 replies; 10+ messages in thread
From: o1bigtenor @ 2016-07-06  0:13 UTC (permalink / raw)
  To: Linux-RAID

Greetings

Running a RAID 10 array with 4 x 3 TB drives. Have a UPS, but this area
gets significant lightning and also brownout (rural power) events.

Found the array read-only this morning. Thought that rebooting the
system might correct things - - - it did not, as the array did not
load.

Commands used, followed by the system response:

mdadm --detail /dev/md0
   mdadm:  md device /dev/md0 does not appear to be active.

cat /proc/mdstat
   md0  : inactive sdc1[5](S) sdf1[8](S) sde1[7](S) sdb1[4](S)

mdadm -E /dev/sdb1
                       sdc1
                       sde1
                       sdf1

Everything is the same except for two items:

sde and sdf have an Update Time of July 04 05:50:46
                          events 64841
                          array state of AAAA

sdb and sdc have an Update Time of July 05 01:57:38
                           events 64844
                           array state of AAA.
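
For reference, a quick way to pull out just the fields being compared here
from each member's superblock (a sketch only, assuming the members are
/dev/sd[bcef]1 as listed in /proc/mdstat above):

  for d in /dev/sd[bcef]1; do
      echo "== $d"
      mdadm -E "$d" | grep -E 'Update Time|Events|Array State'
  done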



Do I just re-create the array?

TIA

Dee


* Re: Request for assistance
  2016-07-06  0:13 Request for assistance o1bigtenor
@ 2016-07-06  1:55 ` Adam Goryachev
  2016-07-06 12:14   ` o1bigtenor
  2016-07-06  7:39 ` keld
  1 sibling, 1 reply; 10+ messages in thread
From: Adam Goryachev @ 2016-07-06  1:55 UTC (permalink / raw)
  To: o1bigtenor, Linux-RAID

On 06/07/16 10:13, o1bigtenor wrote:
> Greetings
>
> Running a Raid 10 array with 4 - 3 TB drives. Have a UPS but this area
> gets significant lightning and also brownout (rural power) events.
>
> Found the array was read only this morning. Thought that rebooting the
> system might correct things - - - it did not as the array did not
> load.
>
> commands used followed by system response
>
> mdadm --detail /dev/md0
>     mdadm:  md device /dev/md0 does not appear to be active.
>
> cat /proc/mdstat
>     md0  : inactive sdc1[5](S) sdf1[8](S) sde1[7](S) sdb1[4](S)
>
> mdadm -E /dev/sdb1
>                         sdc1
>                         sde1
>                         sdf1
>
> everything is the same except for 2 items
>
> sde and sdf have uptime listed from July 04 05:50:46
>                            events 64841
>                            array state of AAAA
>
> sdb and sdc have uptime listed from July 05 01:57:38
>                             events 64844
>                             array state of AAA.
>
>
>
> Do I just re-create the array?
>
No, not if you value your data. Only re-create the array if you are told 
to by someone (knowledgeable) on the list.

In your case, I think you should stop the array.
mdadm --stop /dev/md0
Make sure there is nothing listed in /proc/mdstat
Then try to assemble the array, but force the events to match:
mdadm --assemble /dev/md0 --force /dev/sd[bcef]1

If that doesn't work, then include the output from dmesg as well as 
/proc/mdstat and any commandline output generated.
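
Put together, the sequence described above might look like this (a sketch
only; it assumes the members really are /dev/sd[bcef]1 and that nothing
from the array is mounted or otherwise in use):

  mdadm --stop /dev/md0
  cat /proc/mdstat            # md0 should no longer be listed
  mdadm --assemble /dev/md0 --force /dev/sd[bcef]1
  cat /proc/mdstat            # confirm md0 is active with all members
  dmesg | tail -n 50          # capture kernel messages if assembly fails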

You might also want to examine why the two drives dropped; checking the
kernel logs from around the time of the failure should help.
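
For example (a sketch; the exact log files depend on the distribution, and
on a systemd system the journal can be searched instead of the syslog files):

  grep -iE 'md/raid10|md0|ata[0-9]+|I/O error' /var/log/syslog /var/log/kern.log
  journalctl -k | grep -iE 'md0|ata[0-9]+|sd[bcef]'    # current boot only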

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: Request for assistance
  2016-07-06  0:13 Request for assistance o1bigtenor
  2016-07-06  1:55 ` Adam Goryachev
@ 2016-07-06  7:39 ` keld
  2016-07-06 12:15   ` o1bigtenor
  1 sibling, 1 reply; 10+ messages in thread
From: keld @ 2016-07-06  7:39 UTC (permalink / raw)
  To: o1bigtenor; +Cc: Linux-RAID

What operating system and version are you running?

Best regards
keld

On Tue, Jul 05, 2016 at 07:13:23PM -0500, o1bigtenor wrote:
> Greetings
> 
> Running a Raid 10 array with 4 - 3 TB drives. Have a UPS but this area
> gets significant lightning and also brownout (rural power) events.
> 
> Found the array was read only this morning. Thought that rebooting the
> system might correct things - - - it did not as the array did not
> load.
> 
> commands used followed by system response
> 
> mdadm --detail /dev/md0
>    mdadm:  md device /dev/md0 does not appear to be active.
> 
> cat /proc/mdstat
>    md0  : inactive sdc1[5](S) sdf1[8](S) sde1[7](S) sdb1[4](S)
> 
> mdadm -E /dev/sdb1
>                        sdc1
>                        sde1
>                        sdf1
> 
> everything is the same except for 2 items
> 
> sde and sdf have uptime listed from July 04 05:50:46
>                           events 64841
>                           array state of AAAA
> 
> sdb and sdc have uptime listed from July 05 01:57:38
>                            events 64844
>                            array state of AAA.
> 
> 
> 
> Do I just re-create the array?
> 
> TIA
> 
> Dee


* Re: Request for assistance
  2016-07-06  1:55 ` Adam Goryachev
@ 2016-07-06 12:14   ` o1bigtenor
  2016-07-06 12:51     ` Wols Lists
  0 siblings, 1 reply; 10+ messages in thread
From: o1bigtenor @ 2016-07-06 12:14 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Linux-RAID

On Tue, Jul 5, 2016 at 8:55 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> On 06/07/16 10:13, o1bigtenor wrote:
>>
>> Greetings
>>
>> Running a Raid 10 array with 4 - 3 TB drives. Have a UPS but this area
>> gets significant lightning and also brownout (rural power) events.
>>
snip
>>
>> Do I just re-create the array?
>>
> No, not if you value your data. Only re-create the array if you are told to
> by someone (knowledgeable) on the list.
>
> In your case, I think you should stop the array.
> mdadm --stop /dev/md0
> Make sure there is nothing listed in /proc/mdstat
> Then try to assemble the array, but force the events to match:
> mdadm --assemble /dev/md0 --force /dev/sd[bcef]1
>
> If that doesn't work, then include the output from dmesg as well as
> /proc/mdstat and any commandline output generated.
>
> You might also want to examine why two drives dropped, referring to logs or
> similar might assist.
>
mdadm --stop /dev/md0
cat /proc/mdstat
    indicated no md (can't remember the exact response, but it said
nothing was there)
mdadm --assemble /dev/md0 --force /dev/sd[bcef]1 gave:

mdadm: forcing event count in /dev/sde1(2) from 64841 to 64844
mdadm: forcing event count in /dev/sdf1(3) from 64841 to 64844
mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sdf1
mdadm: Marking array /dev/md0 as 'clean'
mdadm: /dev/md0 has been started with 4 drives
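
After a forced assemble like this, it is usually worth letting md re-verify
the mirrors - a sketch, with the device name taken from above:

  cat /proc/mdstat                              # all four members should show [UUUU]
  mdadm --detail /dev/md0                       # State should be clean (or resyncing)
  echo check > /sys/block/md0/md/sync_action    # optional read-back scrub
  cat /sys/block/md0/md/mismatch_cnt            # inspect once the check finishes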

So my array is back up - - - thank you very much for your assistance!!!

What does the 'clearing FAULTY flag . . ' mean?

Regards

Dee


* Re: Request for assistance
  2016-07-06  7:39 ` keld
@ 2016-07-06 12:15   ` o1bigtenor
  0 siblings, 0 replies; 10+ messages in thread
From: o1bigtenor @ 2016-07-06 12:15 UTC (permalink / raw)
  To: keld; +Cc: Linux-RAID

On Wed, Jul 6, 2016 at 2:39 AM,  <keld@keldix.com> wrote:
> What operating system and version are you  running?
>

Running Debian testing.

Thanks for the assistance.

Dee


* Re: Request for assistance
  2016-07-06 12:14   ` o1bigtenor
@ 2016-07-06 12:51     ` Wols Lists
  2016-07-06 18:28       ` o1bigtenor
  0 siblings, 1 reply; 10+ messages in thread
From: Wols Lists @ 2016-07-06 12:51 UTC (permalink / raw)
  To: o1bigtenor, Adam Goryachev; +Cc: Linux-RAID

On 06/07/16 13:14, o1bigtenor wrote:
> On Tue, Jul 5, 2016 at 8:55 PM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>> On 06/07/16 10:13, o1bigtenor wrote:
>>>
>>> Greetings
>>>
>>> Running a Raid 10 array with 4 - 3 TB drives. Have a UPS but this area
>>> gets significant lightning and also brownout (rural power) events.
>>>
> snip
>>>
>>> Do I just re-create the array?
>>>
>> No, not if you value your data. Only re-create the array if you are told to
>> by someone (knowledgeable) on the list.
>>
>> In your case, I think you should stop the array.
>> mdadm --stop /dev/md0
>> Make sure there is nothing listed in /proc/mdstat
>> Then try to assemble the array, but force the events to match:
>> mdadm --assemble /dev/md0 --force /dev/sd[bcef]1
>>
>> If that doesn't work, then include the output from dmesg as well as
>> /proc/mdstat and any commandline output generated.
>>
>> You might also want to examine why two drives dropped, referring to logs or
>> similar might assist.
>>
> mdadm --stop /dev/md0
> cat /proc/mdstat
>     indicated no md (can't remember the exact response but it said
> nothing there)
> mdadm --assemble /dev/md0 --force /dev/sd[bcef]1 to
> 
> mdadm :forcing event count in /dev/sde1(2) from 64841 to 64844
> mdadm :forcing event count in /dev/sdf1(3) from 64841 to 64844
> mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sdf1
> mdadm: Marking array /dev/md0 as 'clean'
> mdadm: /dev/md0 has been started with 4 drives
> 
> So my array is back up - - - thank you very much for your assistance!!!
> 
But why did they drop ... are you using desktop drives? I use Seagate
Barracudas - NOT a particularly good idea. You should be using WD Red,
Seagate NAS, or similar.

"smartctl -x /dev/sdx" will give you an idea of what's going on. Search
the list for "timeout error" for an idea of the grief you'll get if
you're using desktop drives ...

If smartctl says SMART is disabled, enable it. When I do, my drive comes
back (using the -x option again) saying "SCT Error Recovery not
supported". This is a no-no for a decent raid drive. The other acronyms
for the same feature are TLER and ERC - either way, you can control how
quickly the drive reports an error back to the OS. Which is why you need
proper raid drives (the manufacturers have downgraded the firmware on
desktop drives :-( ).
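
A sketch of the checks being described, using a hypothetical /dev/sdX:

  smartctl -l scterc /dev/sdX            # show the current ERC (TLER) settings
  smartctl -l scterc,70,70 /dev/sdX      # set 7.0 s read/write ERC, if supported
  # For drives that cannot do ERC at all, the usual workaround discussed on
  # this list is to raise the kernel's per-device command timeout instead:
  echo 180 > /sys/block/sdX/device/timeout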

You need to fix the WHY or it could easily happen again. And this could
well be why ... (if you've had a problem on a desktop drive, it WILL
happen again, and data loss is quite likely ... even if you recover the
bulk of the drive).

Cheers,
Wol



* Re: Request for assistance
  2016-07-06 12:51     ` Wols Lists
@ 2016-07-06 18:28       ` o1bigtenor
  2016-07-06 21:31         ` Wols Lists
  2016-07-07  2:05         ` Brad Campbell
  0 siblings, 2 replies; 10+ messages in thread
From: o1bigtenor @ 2016-07-06 18:28 UTC (permalink / raw)
  To: Wols Lists; +Cc: Adam Goryachev, Linux-RAID

On Wed, Jul 6, 2016 at 7:51 AM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 06/07/16 13:14, o1bigtenor wrote:
>> On Tue, Jul 5, 2016 at 8:55 PM, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>> On 06/07/16 10:13, o1bigtenor wrote:
>>>>
>>>> Greetings
>>>>
>>>> Running a Raid 10 array with 4 - 3 TB drives. Have a UPS but this area
>>>> gets significant lightning and also brownout (rural power) events.
>>>>
>> snip
snip
>>
>> So my array is back up - - - thank you very much for your assistance!!!
>>
> But why did they drop ... are you using desktop drives? I use Seagate
> Barracudas - NOT a particularly good idea. You should be using WD Red,
> Seagate NAS, or similar.

Sorry - - - this system has 4 x 1 TB WD Red drives.
>
> "smartctl -x /dev/sdx" will give you an idea of what's going on. Search
> the list for "timeout error" for an idea of the grief you'll get if
> you're using desktop drives ...
>
> If smartctl says smart is disabled, enable it. When I do, my drive comes
> back (using the -x option again) saying "SCT Error Recovery not
> supported". This is a no-no for a decent raid drive. I think the other
> acronyms are ETL or TLS - either way you can control how the drive
> reports an error back to the OS. Which is why you need proper raid
> drives (the manufacturers have downgraded the firmware on desktop drives :-(
>
> You need to fix the WHY or it could easily happen again. And this could
> well be why ... (if you've had a problem on a desktop drive, it WILL
> happen again, and data loss is quite likely ... even if you recover the
> bulk of the drive).

My best understanding as to the why is - - dirty power - - - fixing that means
going off-grid. Expensive, and not happening any time soon, although I would
really like that.

As I do not understand the error messages in smartctl, I have added the
following (maybe someone could explain what they mean):

smartctl -x /dev/sdf
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.1.0-2-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD10EFRX-68FYTN0
Serial Number:    WD-WCC4J4XV62F4
LU WWN Device Id: 5 0014ee 20cd9d7d1
Firmware Version: 82.00A82
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul  6 13:21:25 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (13320) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 152) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   139   139   021    -    4050
  4 Start_Stop_Count        -O--CK   100   100   000    -    23
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   100   099   000    -    423
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    6
192 Power-Off_Retract_Count -O--CK   200   200   000    -    1
193 Load_Cycle_Count        -O--CK   198   198   000    -    8922
194 Temperature_Celsius     -O---K   115   107   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 1
CR     = Command Register
FEATR  = Features Register
COUNT  = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
LH     = LBA High (was: Cylinder High) Register    ]   LBA
LM     = LBA Mid (was: Cylinder Low) Register      ] Register
LL     = LBA Low (was: Sector Number) Register     ]
DV     = Device (was: Device/Head) Register
DC     = Device Control Register
ER     = Error register
ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 395 hours (16 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 18 11 28 00 40 00  Error: IDNF at LBA = 0x18112800 = 403777536

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 51 78 00 e0 00 00 18 06 38 00 40 08  5d+03:01:34.882  WRITE FPDMA QUEUED
  61 50 00 00 d8 00 00 18 05 e8 00 40 08  5d+03:01:34.882  WRITE FPDMA QUEUED
  61 50 00 00 d0 00 00 18 05 98 00 40 08  5d+03:01:34.882  WRITE FPDMA QUEUED
  61 50 00 00 c8 00 00 18 05 48 00 40 08  5d+03:01:34.882  WRITE FPDMA QUEUED
  61 50 00 00 c0 00 00 18 04 f8 00 40 08  5d+03:01:34.882  WRITE FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     21/28 Celsius
Lifetime    Min/Max Temperature:     20/36 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (237)

Index    Estimated Time   Temperature Celsius
 238    2016-07-06 05:24    26  *******
 ...    ..( 34 skipped).    ..  *******
 273    2016-07-06 05:59    26  *******
 274    2016-07-06 06:00    27  ********
 ...    ..(  8 skipped).    ..  ********
 283    2016-07-06 06:09    27  ********
 284    2016-07-06 06:10    26  *******
 ...    ..(  3 skipped).    ..  *******
 288    2016-07-06 06:14    26  *******
 289    2016-07-06 06:15    27  ********
 ...    ..( 42 skipped).    ..  ********
 332    2016-07-06 06:58    27  ********
 333    2016-07-06 06:59    28  *********
 ...    ..( 18 skipped).    ..  *********
 352    2016-07-06 07:18    28  *********
 353    2016-07-06 07:19    29  **********
 ...    ..(  3 skipped).    ..  **********
 357    2016-07-06 07:23    29  **********
 358    2016-07-06 07:24    28  *********
 ...    ..( 29 skipped).    ..  *********
 388    2016-07-06 07:54    28  *********
 389    2016-07-06 07:55    29  **********
 390    2016-07-06 07:56    28  *********
 391    2016-07-06 07:57    28  *********
 392    2016-07-06 07:58    29  **********
 393    2016-07-06 07:59    28  *********
 394    2016-07-06 08:00    28  *********
 395    2016-07-06 08:01    29  **********
 ...    ..(  4 skipped).    ..  **********
 400    2016-07-06 08:06    29  **********
 401    2016-07-06 08:07     ?  -
 402    2016-07-06 08:08    21  **
 403    2016-07-06 08:09    21  **
 404    2016-07-06 08:10    21  **
 405    2016-07-06 08:11    22  ***
 406    2016-07-06 08:12    22  ***
 407    2016-07-06 08:13    22  ***
 408    2016-07-06 08:14    24  *****
 409    2016-07-06 08:15    24  *****
 410    2016-07-06 08:16    23  ****
 411    2016-07-06 08:17    23  ****
 412    2016-07-06 08:18    23  ****
 413    2016-07-06 08:19    24  *****
 ...    ..(  2 skipped).    ..  *****
 416    2016-07-06 08:22    24  *****
 417    2016-07-06 08:23    25  ******
 ...    ..(  3 skipped).    ..  ******
 421    2016-07-06 08:27    25  ******
 422    2016-07-06 08:28    26  *******
 ...    ..( 60 skipped).    ..  *******
   5    2016-07-06 09:29    26  *******
   6    2016-07-06 09:30    27  ********
 ...    ..(106 skipped).    ..  ********
 113    2016-07-06 11:17    27  ********
 114    2016-07-06 11:18    26  *******
 ...    ..(113 skipped).    ..  *******
 228    2016-07-06 13:12    26  *******
 229    2016-07-06 13:13    27  ********
 ...    ..(  4 skipped).    ..  ********
 234    2016-07-06 13:18    27  ********
 235    2016-07-06 13:19    26  *******
 236    2016-07-06 13:20    26  *******
 237    2016-07-06 13:21    26  *******

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x008  4                6  Lifetime Power-On Resets
  1  0x010  4              423  Power-on Hours
  1  0x018  6       2044877667  Logical Sectors Written
  1  0x020  6          2397939  Number of Write Commands
  1  0x028  6       1961443492  Logical Sectors Read
  1  0x030  6          9792433  Number of Read Commands
  3  =====  =                =  == Rotating Media Statistics (rev 1) ==
  3  0x008  4             2800  Spindle Motor Power-on Hours
  3  0x010  4             1582  Head Flying Hours
  3  0x018  4             8924  Head Load Events
  3  0x020  4              200~ Number of Reallocated Logical Sectors
  3  0x028  4                0  Read Recovery Attempts
  3  0x030  4                0  Number of Mechanical Start Failures
  4  =====  =                =  == General Errors Statistics (rev 1) ==
  4  0x008  4                1  Number of Reported Uncorrectable Errors
  4  0x010  4                0  Resets Between Cmd Acceptance and Completion
  5  =====  =                =  == Temperature Statistics (rev 1) ==
  5  0x008  1               28  Current Temperature
  5  0x010  1               27  Average Short Term Temperature
  5  0x018  1               26  Average Long Term Temperature
  5  0x020  1               36  Highest Temperature
  5  0x028  1               20  Lowest Temperature
  5  0x030  1               33  Highest Average Short Term Temperature
  5  0x038  1               22  Lowest Average Short Term Temperature
  5  0x040  1               27  Highest Average Long Term Temperature
  5  0x048  1               25  Lowest Average Long Term Temperature
  5  0x050  4                0  Time in Over-Temperature
  5  0x058  1               60  Specified Maximum Operating Temperature
  5  0x060  4                0  Time in Under-Temperature
  5  0x068  1                0  Specified Minimum Operating Temperature
  6  =====  =                =  == Transport Statistics (rev 1) ==
  6  0x008  4               96  Number of Hardware Resets
  6  0x010  4               45  Number of ASR Events
  6  0x018  4                0  Number of Interface CRC Errors
                              |_ ~ normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            8  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           14  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        24888  Vendor specific
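
Since the log above shows that no self-tests have ever been run (and one
IDNF write error), a long self-test would be the obvious next step - a
sketch, using the same device as above:

  smartctl -t long /dev/sdf       # runs inside the drive; ~152 minutes per the output above
  smartctl -l selftest /dev/sdf   # read the result once it has finished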


* Re: Request for assistance
  2016-07-06 18:28       ` o1bigtenor
@ 2016-07-06 21:31         ` Wols Lists
  2016-07-07  2:05         ` Brad Campbell
  1 sibling, 0 replies; 10+ messages in thread
From: Wols Lists @ 2016-07-06 21:31 UTC (permalink / raw)
  To: o1bigtenor; +Cc: Adam Goryachev, Linux-RAID

On 06/07/16 19:28, o1bigtenor wrote:
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)

As soon as you said WD Red, that told me the drives are good. The SCT
output says the drives will wait at most 7 seconds before reporting a
problem, which is what you want (my Barracudas can't do that - a problem
waiting to happen).

I'll let someone who knows more comment on the rest of the output, but
that SCT stuff tells us your problem is not the usual one of someone
using the wrong drives.

Cheers,
Wol


* Re: Request for assistance
  2016-07-06 18:28       ` o1bigtenor
  2016-07-06 21:31         ` Wols Lists
@ 2016-07-07  2:05         ` Brad Campbell
  2016-07-07  3:28           ` o1bigtenor
  1 sibling, 1 reply; 10+ messages in thread
From: Brad Campbell @ 2016-07-07  2:05 UTC (permalink / raw)
  To: o1bigtenor; +Cc: Linux-RAID

On 07/07/16 02:28, o1bigtenor wrote:

> My best understanding as to the why is - - dirty power - - - fixing that means
> going off-grid. Expensive and not happening any time soon although I would
> really like that.
>

Get a UPS.
Get a UPS.
Get a UPS.
Get a UPS.

I've got some nice full on-line double conversion units, but they are 
noisy and less efficient. In my experience, a second hand APC SmartUPS 
will sort enough of the most revolting power to keep things running 
smoothly, and they are CHEAP. Despite owning several expensive UPS 
units, all my stuff is behind a couple of second hand SmartUPS.

My last purchase saw me pick up 5 decent line interactive UPS units for 
about $25 each as a job lot. New batteries for one were less than $100 
(same brand as the UPS comes with) from the local wholesaler. I get 4-5 
years out of a set of batteries.

If you put any value on your time, one blip on a RAID and the associated
recovery pays for the UPS. Cheap insurance.
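
One related point: a UPS only protects the array if the host also shuts
down cleanly before the battery is exhausted. A minimal sketch for a
USB-connected APC unit using apcupsd (directive names from memory, so treat
them as an assumption and check the apcupsd documentation):

  # /etc/apcupsd/apcupsd.conf (excerpt)
  UPSCABLE usb
  UPSTYPE usb
  DEVICE
  BATTERYLEVEL 20    # begin shutdown when 20% charge remains
  MINUTES 5          # ...or when 5 minutes of runtime remain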

Regards,
Brad


* Re: Request for assistance
  2016-07-07  2:05         ` Brad Campbell
@ 2016-07-07  3:28           ` o1bigtenor
  0 siblings, 0 replies; 10+ messages in thread
From: o1bigtenor @ 2016-07-07  3:28 UTC (permalink / raw)
  To: Brad Campbell; +Cc: Linux-RAID

On Wed, Jul 6, 2016 at 9:05 PM, Brad Campbell <lists2009@fnarfbargle.com> wrote:
> On 07/07/16 02:28, o1bigtenor wrote:
>
>> My best understanding as to the why is - - dirty power - - - fixing that
>> means
>> going off-grid. Expensive and not happening any time soon although I would
>> really like that.
>>
>
> Get a UPS.
> Get a UPS.
> Get a UPS.
> Get a UPS.

Hmmmmmmmmmmm - - - got one. Working on getting a bigger one set up, as
maybe the first one isn't big enough. Have also found out that voltage
spikes destroy surge protectors, and not necessarily all at once - - they
degrade with each 'use'. It is frustrating to have such 'dirty' power. Even
better is that the CSA standards for voltage are so sloppy that electronics
die early (and often) when you are in rural country.
>
> I've got some nice full on-line double conversion units, but they are noisy
> and less efficient. In my experience, a second hand APC SmartUPS will sort
> enough of the most revolting power to keep things running smoothly, and they
> are CHEAP. Despite owning several expensive UPS units, all my stuff is
> behind a couple of second hand SmartUPS.
>
> My last purchase saw me pick up 5 decent line interactive UPS units for
> about $25 each as a job lot. New batteries for one were less than $100 (same
> brand as the UPS comes with) from the local wholesaler. I get 4-5 years out
> of a set of batteries.
>
> If you had to budget for time, one blip on a RAID and the associated
> recovery pays for the UPS. Cheap insurance.
>
Working on more. Haven't found too many of those 'reasonable' UPSes,
though.

Regards

Dee

