Linux RAID subsystem development
From: rene.feistle@posteo.de
To: Phil Turmel <philip@turmel.org>
Cc: linux-raid@vger.kernel.org
Subject: Re: mdadm reshaping stuck problem
Date: Sun, 03 Dec 2017 15:59:27 +0100
Message-ID: <83ee590a545d4de7938b535b952729b7@posteo.de>
In-Reply-To: <fed720e2-fd55-6458-2d68-9b87c96254fc@turmel.org>

[-- Attachment #1: Type: text/plain, Size: 5411 bytes --]

Hello Phil,

Thanks for your fast reply.

I ran your commands. The results are attached to this email and are also
on pastebin here:

https://pastebin.com/EVpLfmAe
https://pastebin.com/ZMBYB5CW


The drive names have changed because I removed one drive that was not
part of the raid. I had a copy of all my data on that drive, so I am now
trying to recover the data from it. The chances are good because I only
overwrote its partition table.



On 03.12.2017 15:17, Phil Turmel wrote:
> Hi Rene,
> 
> On 12/03/2017 07:47 AM, rene.feistle@posteo.de wrote:
>> Hello,
>> 
>> after hours and hours of googling and trying out things, I gave up on
>> this. This email is my last hope of getting my data back.
> 
> I'm worried for you -- "trying out things" can be dangerous.
> 
>> I have 4*4TB drives installed and want to create a raid 5 with them.
>> 
>> So what I did is create an array of 3 disks (raid 5), copy the data from
>> the 4th drive (I don't have more space available) to the raid and then I
>> wanted to add the last drive to the raid.
> 
> Ok.
> 
>> I made a mistake here. I accidentally grew the raid to 4 disks with
>> 
>> sudo mdadm --grow --raid-devices=4 /dev/md0 --backup-file=/tmp/md0.bak
>> 
>> BEFORE adding the last drive as a hot spare. Mdadm immediately started a
>> reshape and says that it failed - because it consists of 4 drives but
>> only 3 drives are available.
> 
> Adding the fourth drive at this point should have enabled the reshape to
> resume.
> 
>> I thought okay, let him complete the reshape and everything will be
>> okay. But no - the reshape is stuck at 34.3%.
>> 
>> What I have tried:
>> 
>> - Reboot (about 100 times)
>> - increase stripe cache size up to 32768
>> 
>> mdadm --assemble --invalid-backup --backup-file=/root/mdadm0.bak
>> /dev/md0 /dev/sdc1 /dev/sde1 /dev/sdf1
>> 
>> And some other things.
> 
> We will probably need you to detail "some other things".
> 
>> The raid is not mountable. When I try to mount it, the mount command
>> just hangs and nothing happens. That means that I had to edit my fstab
>> with a rescue cd because it would never boot again.
>> That also means that I have no access to my data.
>>
>> When I shut down or reboot the computer, it also hangs at shutdown; I can
>> only hard reset it.
>> 
>> cat /proc/mdstat:
>> 
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [r$
>> md0 : active raid5 sdc1[0] sdf1[3] sde1[1]
>>       7813771264 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU__]
>>       [======>..............]  reshape = 34.3% (1340465664/3906885632) finish=3$
>>       bitmap: 3/30 pages [12KB], 65536KB chunk
>> 
>> unused devices: <none>
> 
> Note the "UU__".  That means at some point your three-drive array lost a
> drive, and the reshape is showing another missing drive.  A
> doubly-degraded array cannot run.
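
As an illustration of the "UU__" reading, here is a minimal shell sketch
that counts the "_" slots in an mdstat status line and applies the RAID 5
rule that at most one member may be missing. The function names
count_missing_slots and raid5_can_run are made up for this example; they
are not part of mdadm.

```shell
# Count missing member slots in an mdstat status line such as
# "... [4/3] [UU__]": each "_" is a slot with no working device.
# (Hypothetical helper, not an mdadm feature.)
count_missing_slots() {
    # Keep only the last bracketed group made of U and _ characters.
    flags=$(printf '%s\n' "$1" | sed -n 's/.*\[\([U_][U_]*\)\].*/\1/p')
    # Delete every character except "_" and count what is left.
    printf '%s' "$flags" | tr -cd '_' | wc -c | tr -d ' '
}

# RAID 5 tolerates at most one missing member.
raid5_can_run() {
    [ "$(count_missing_slots "$1")" -le 1 ]
}
```

For the status line quoted above, count_missing_slots reports two missing
slots, so raid5_can_run returns a non-zero exit status.
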
> 
>> mdadm --detail /dev/md0
>> 
>> 
>> /dev/md0:
>>         Version : 1.2
>>   Creation Time : Fri Dec  1 02:10:06 2017
>>      Raid Level : raid5
>>      Array Size : 7813771264 (7451.79 GiB 8001.30 GB)
>>   Used Dev Size : 3906885632 (3725.90 GiB 4000.65 GB)
>>    Raid Devices : 4
>>   Total Devices : 3
>>     Persistence : Superblock is persistent
>> 
>>   Intent Bitmap : Internal
>> 
>>     Update Time : Sun Dec  3 13:34:43 2017
>>           State : active, FAILED, reshaping
>>  Active Devices : 2
>> Working Devices : 3
>>  Failed Devices : 0
>>   Spare Devices : 1
>> 
>>          Layout : left-symmetric
>>      Chunk Size : 512K
>> 
>>  Reshape Status : 34% complete
>>   Delta Devices : 1, (3->4)
>> 
>>            Name : nas-server:0  (local to host nas-server)
>>            UUID : e410e68d:76460b65:69c056c0:d2645d55
>>          Events : 28155
>> 
>>     Number   Major   Minor   RaidDevice State
>>        0       8       33        0      active sync   /dev/sdc1
>>        1       8       65        1      active sync   /dev/sde1
>>        3       8       81        2      spare rebuilding   /dev/sdf1
>>        6       0        0        6      removed
> 
> Note the "spare rebuilding" on sdf1.  That means at some point sdf1 was
> ejected from your array and you --added it back.  A supposition
> buttressed by its slot number displayed in mdstat.  sdf1 was already a
> critical device, so --add destroyed important data on it.
> 
>> Any help is appreciated, I'm lost.
> 
> With the current status of the array, doubly-degraded with a reshape
> quite far along, I am not optimistic for you.  However, you have not
> provided all the information that might be helpful here.  Please supply
> the output (cat'd to a file, not copied from a narrow terminal, please)
> of these commands:
> 
> for x in /dev/sd[cef]1 ; do echo $x ; mdadm -E $x ; done
> 
> for x in /dev/sd[cef] ; do echo $x ; smartctl -iA -l scterc $x ; done
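
One possible way to get the output "cat'd to a file" is to redirect the
whole loop rather than copy from the terminal. The sketch below wraps the
two loops in a small helper; collect_to_file is a hypothetical name, and
the file names mdadm.txt and smart.txt are simply the ones used for the
attachments to this mail.

```shell
# Run a command once per device and write everything to one file,
# so long lines are not wrapped or cut by a narrow terminal.
# Usage: collect_to_file OUTFILE COMMAND DEVICE...
# (Hypothetical helper for illustration.)
collect_to_file() {
    outfile=$1
    cmd=$2
    shift 2
    for dev in "$@"; do
        echo "$dev"
        # $cmd is left unquoted on purpose so "mdadm -E" splits into words.
        $cmd "$dev"
    done > "$outfile" 2>&1
}

# Intended use (needs root and the real devices):
#   collect_to_file mdadm.txt "mdadm -E" /dev/sd[cef]1
#   collect_to_file smart.txt "smartctl -iA -l scterc" /dev/sd[cef]
```

Redirecting after "done" captures the output of every iteration, including
error messages, in a single file.
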
> 
> Please make sure your mailer is in plain text mode with line wrap
> disabled to ensure the content isn't corrupted when you paste it into
> your reply.
> 
> Regards,
> 
> Phil

[-- Attachment #2: mdadm.txt --]
[-- Type: text/plain, Size: 3204 bytes --]

/dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x5
     Array UUID : e410e68d:76460b65:69c056c0:d2645d55
           Name : nas-server:0  (local to host nas-server)
  Creation Time : Fri Dec  1 02:10:06 2017
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7813771264 (3725.90 GiB 4000.65 GB)
     Array Size : 11720656896 (11177.69 GiB 12001.95 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : active
    Device UUID : f4490b9f:475aca83:2ff93d65:a4fabf8c

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 4021168128 (3834.88 GiB 4117.68 GB)
  Delta Devices : 1 (3->4)

    Update Time : Sun Dec  3 13:34:43 2017
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : d1b15eed - correct
         Events : 28155

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x5
     Array UUID : e410e68d:76460b65:69c056c0:d2645d55
           Name : nas-server:0  (local to host nas-server)
  Creation Time : Fri Dec  1 02:10:06 2017
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7813771264 (3725.90 GiB 4000.65 GB)
     Array Size : 11720656896 (11177.69 GiB 12001.95 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : active
    Device UUID : eb679765:5771cc1a:651f5a86:e166424b

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 4021168128 (3834.88 GiB 4117.68 GB)
  Delta Devices : 1 (3->4)

    Update Time : Sun Dec  3 13:34:43 2017
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : ede2668 - correct
         Events : 28155

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x7
     Array UUID : e410e68d:76460b65:69c056c0:d2645d55
           Name : nas-server:0  (local to host nas-server)
  Creation Time : Fri Dec  1 02:10:06 2017
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7813771264 (3725.90 GiB 4000.65 GB)
     Array Size : 11720656896 (11177.69 GiB 12001.95 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
Recovery Offset : 2680909424 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : active
    Device UUID : 2c5c510d:3ddd9cb3:85782829:cdce1b89

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 4021168128 (3834.88 GiB 4117.68 GB)
  Delta Devices : 1 (3->4)

    Update Time : Sun Dec  3 13:34:43 2017
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : cfdbb60f - correct
         Events : 28155

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)
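
The sizes and the reshape position in the superblocks above can be
cross-checked with a little arithmetic. This is only a sketch: it assumes
that "Reshape pos'n" is counted in KiB of array data across the new
4-device layout, which is an interpretation on my part, although it does
reproduce the 34.3% shown in /proc/mdstat.

```shell
# Values copied from the mdadm output in this thread (all in KiB).
used_dev_size=3906885632     # "Used Dev Size" from mdadm --detail
reshape_pos=4021168128       # "Reshape pos'n" from mdadm -E
new_devices=4                # target of "Delta Devices : 1 (3->4)"

# RAID 5 spends one member's worth of space on parity.
new_data_members=$((new_devices - 1))
new_array_size=$((used_dev_size * new_data_members))

# Assumption: the reshape position is counted in array KiB over the
# new layout, so dividing by the data members gives per-device progress.
per_device_pos=$((reshape_pos / new_data_members))
tenths_of_percent=$((per_device_pos * 1000 / used_dev_size))

echo "new array size: $new_array_size KiB"
echo "per-device position: $per_device_pos KiB"
echo "progress: $((tenths_of_percent / 10)).$((tenths_of_percent % 10))%"
```

The computed array size, 11720656896 KiB, equals the "Array Size" that
mdadm -E reports for the 4-device layout, and the progress comes out at
34.3%.
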

[-- Attachment #3: smart.txt --]
[-- Type: text/plain, Size: 9933 bytes --]

/dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-40-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN008-2DR166
Serial Number:    ZDH2M2PV
LU WWN Device Id: 5 000c50 0a5690986
Firmware Version: SC60
User Capacity:    4.000.787.030.016 bytes [4,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec  3 15:48:15 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   073   064   044    Pre-fail  Always       -       19890494
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   045    Pre-fail  Always       -       21251399
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       134 (224 223 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   054   040    Old_age   Always       -       33 (Min/Max 33/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       53
194 Temperature_Celsius     0x0022   033   046   000    Old_age   Always       -       33 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       132 (215 163 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       28126761649
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       13670857188

SCT Error Recovery Control:
           Read:     70 (7,0 seconds)
          Write:     70 (7,0 seconds)

/dev/sdd
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-40-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN008-2DR166
Serial Number:    ZDH2GRNF
LU WWN Device Id: 5 000c50 0a556eef6
Firmware Version: SC60
User Capacity:    4.000.787.030.016 bytes [4,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec  3 15:48:15 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   066   044    Pre-fail  Always       -       83554824
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   045    Pre-fail  Always       -       16531420
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       134 (30 205 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   058   040    Old_age   Always       -       33 (Min/Max 33/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       79
194 Temperature_Celsius     0x0022   033   042   000    Old_age   Always       -       33 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       132 (251 231 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       16036671217
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       13248987983

SCT Error Recovery Control:
           Read:     70 (7,0 seconds)
          Write:     70 (7,0 seconds)

/dev/sde
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-40-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    Z3018XTT
LU WWN Device Id: 5 000c50 065b12345
Firmware Version: CC54
User Capacity:    4.000.787.030.016 bytes [4,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec  3 15:48:15 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   099   006    Pre-fail  Always       -       56920152
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2179
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       4457875
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       18477
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       136
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   051   045    Old_age   Always       -       34 (Min/Max 33/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   085   085   000    Old_age   Always       -       31842
194 Temperature_Celsius     0x0022   034   049   000    Old_age   Always       -       34 (0 10 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       5936h+59m+15.445s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       34663048643
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       14331594669

SCT Error Recovery Control command not supported
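
The smartctl output above reports SCT Error Recovery Control timeouts of
70, i.e. 7.0 seconds, for the two ST4000VN008 drives, and reports the
command as not supported on the ST4000DM000. A rough sketch for pulling
that one fact out of saved output follows; erc_read_timeout is a
hypothetical helper name, and it only recognises the two message forms
that appear in this mail.

```shell
# Read "smartctl -l scterc" text on stdin and summarise the read timeout.
# The number before the parenthesis is in tenths of a second (70 = 7.0 s).
# (Hypothetical helper; handles only the output forms seen in this thread.)
erc_read_timeout() {
    awk '
        /SCT Error Recovery Control command not supported/ { unsupported = 1 }
        $1 == "Read:" && $2 ~ /^[0-9]+$/ { deci = $2; found = 1 }
        END {
            if (found)            printf "%d.%d seconds\n", int(deci / 10), deci % 10
            else if (unsupported) print "not supported"
            else                  print "unknown"
        }'
}
```

Parsing the raw integer rather than the text in parentheses avoids the
locale-dependent decimal comma ("7,0 seconds") seen in this output.
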

