linux-raid.vger.kernel.org archive mirror
* RAID 5 reshape stalled at 77.5% - next steps??
From: George Rapp @ 2017-01-28 23:01 UTC (permalink / raw)
  To: Linux-RAID; +Cc: Matthew Krumwiede

Hello linux-raid team. I have a reshape operation that is stuck and
refuses to respond to commands. I'm wondering what my options are to
safely get it moving again.

Background: I added two new partitions to a RAID 5 array, using a
backup-file on a separate device:

# mdadm --add /dev/md4 /dev/sdb4 /dev/sdd4
mdadm: added /dev/sdb4
mdadm: added /dev/sdd4

# mdadm --grow --raid-devices=10 \
    --backup-file=/home/gwr/c/md4_backup__2017-01-25 /dev/md4
mdadm: Need to backup 32256K of critical section..

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
[...]
md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10] sdi4[8] sdl4[9] sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
      13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9] [UUUUUUUUU_]
      [>....................] reshape = 0.8% (16715456/1922131968) finish=965.4min speed=32892K/sec

The reshape proceeded normally until it hit 77.5%, where it has been
stuck for the last couple of days:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10](F) sdi4[8] sdl4[9] sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
      13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9] [UUUU_UUUU_]
      [===============>.....] reshape = 77.5% (1490403328/1922131968) finish=2544246.9min speed=2K/sec
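
For reference, the two counters in the reshape line can be turned into an
exact percentage and a remaining-block count. This is a minimal sketch that
assumes mdstat-style text on stdin, with the counters in 1 KiB blocks;
reshape_progress is a hypothetical helper name, not an mdadm command.

```shell
# Hypothetical helper: extract "(done/total)" from mdstat-style text.
# Prints: done total percent remaining
reshape_progress() {
    awk 'match($0, /\([0-9]+\/[0-9]+\)/) {
        # Strip the surrounding parentheses, then split on the slash.
        split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
        printf "%d %d %.2f %d\n", n[1], n[2], 100 * n[1] / n[2], n[2] - n[1]
        exit    # only the first counter pair is of interest
    }'
}
```

On a live system this could be fed with "cat /proc/mdstat | reshape_progress";
comparing two runs a few minutes apart shows whether the first number is
still moving at all.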

The backup file was last accessed at about the time I started the reshape:
-rw-------. 1 root root  33034240 Jan 25 11:52 md4_backup__2017-01-25

I tried to idle the RAID reshape, but the "echo" command just hung:

# cd /sys/block/md4/md
# echo idle > sync_action

I can get some data from the files in this directory, though:

# cat reshape_direction
forwards
# cat reshape_position
26825379840
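
Since reads from this directory still return while writes hang, a read-only
snapshot of the reshape state can be collected in one pass. The sketch below
is an assumption-laden illustration: md_snapshot is a hypothetical helper,
and sync_action, sync_completed and sync_speed are standard md sysfs
attributes that were not shown in this post and might block on a wedged
array. It never writes to sysfs.

```shell
# Hypothetical helper: print name=value for a few md sysfs attributes.
# $1 is the md sysfs directory; the default matches the array in this thread.
md_snapshot() {
    dir=${1:-/sys/block/md4/md}
    for f in sync_action sync_completed sync_speed \
             reshape_position reshape_direction; do
        # Skip attributes that do not exist or are not readable.
        if [ -r "$dir/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
        fi
    done
}
```

Running it twice and comparing the output is a cheap way to confirm that
reshape_position is not advancing.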

I tried to pull mdadm data about this array to add to this post, but that
command also hung:

# mdadm --misc --examine /dev/md4

The server CPU load is pegged, with md4_raid5 as the top CPU hog.

What are my safe alternatives here? Can I safely reboot without corrupting
the reshape? How can I get the reshape unstuck?

-- 
George Rapp  (Pataskala, OH) Home: george.rapp -- at -- gmail.com
LinkedIn profile: https://www.linkedin.com/in/georgerapp
Phone: +1 740 936 RAPP (740 936 7277)

* Re: RAID 5 reshape stalled at 77.5% - next steps??
From: Roman Mamedov @ 2017-01-28 23:15 UTC (permalink / raw)
  To: George Rapp; +Cc: Linux-RAID, Matthew Krumwiede

On Sat, 28 Jan 2017 18:01:30 -0500
George Rapp <george.rapp@gmail.com> wrote:

> The reshape proceeded normally until it hit 77.5%, where it has been
> stuck for the last couple of days:
> 
> # cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10](F) sdi4[8] sdl4[9] sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
>       13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9] [UUUU_UUUU_]
>       [===============>.....] reshape = 77.5% (1490403328/1922131968) finish=2544246.9min speed=2K/sec

It shows you have a failed device (sdg4), but you don't mention it. Please
post the output of mdadm --detail /dev/md4 and the relevant part of dmesg.
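
The failed member is the one carrying the (F) flag in the mdstat device
list. A minimal sketch for spotting such members automatically, assuming
mdstat-style text on stdin; failed_members is a hypothetical helper name,
not an mdadm command.

```shell
# Hypothetical helper: list members flagged (F) in mdstat-style text.
failed_members() {
    awk '{
        for (i = 1; i <= NF; i++)
            # A failed member looks like "sdg4[10](F)".
            if ($i ~ /^[a-z0-9]+\[[0-9]+\]\(F\)$/) {
                name = $i
                sub(/\[.*/, "", name)   # keep only the device name
                print name
            }
    }'
}
```

For example, "failed_members < /proc/mdstat" would print sdg4 for the array
state quoted above.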

-- 
With respect,
Roman

* Re: RAID 5 reshape stalled at 77.5% - next steps??
From: George Rapp @ 2017-01-28 23:29 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Linux-RAID, Matthew Krumwiede

On Sat, Jan 28, 2017 at 6:15 PM, Roman Mamedov <rm@romanrm.net> wrote:
> On Sat, 28 Jan 2017 18:01:30 -0500
> George Rapp <george.rapp@gmail.com> wrote:
>
>> The reshape proceeded normally until it hit 77.5%, where it has been
>> stuck for the last couple of days:
>>
>> # cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10](F) sdi4[8] sdl4[9] sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
>>       13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9] [UUUU_UUUU_]
>>       [===============>.....] reshape = 77.5% (1490403328/1922131968) finish=2544246.9min speed=2K/sec
>
> It shows you have a failed device (sdg4), but you don't mention it. Please
> post the output of mdadm --detail /dev/md4 and the relevant part of dmesg.

Roman -

Good catch. I didn't notice that.

# mdadm --detail /dev/md4
/dev/md4:
        Version : 1.1
  Creation Time : Thu Feb 17 14:54:06 2011
     Raid Level : raid5
     Array Size : 13454923776 (12831.62 GiB 13777.84 GB)
  Used Dev Size : 1922131968 (1833.09 GiB 1968.26 GB)
   Raid Devices : 10
  Total Devices : 10
    Persistence : Superblock is persistent

    Update Time : Thu Jan 26 08:06:56 2017
          State : active, FAILED, reshaping
 Active Devices : 8
Working Devices : 9
 Failed Devices : 1
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Reshape Status : 77% complete
  Delta Devices : 2, (8->10)

           Name : localhost.localdomain:4
           UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
         Events : 3957775

    Number   Major   Minor   RaidDevice State
       0       8       68        0      active sync   /dev/sde4
       1       8       84        1      active sync   /dev/sdf4
       2       8      116        2      active sync   /dev/sdh4
       9       8      180        3      active sync   /dev/sdl4
      10       8      100        4      faulty   /dev/sdg4
      13       8       52        4      spare rebuilding   /dev/sdd4
      11       8      164        5      active sync   /dev/sdk4
       8       8      132        6      active sync   /dev/sdi4
       7       8      148        7      active sync   /dev/sdj4
      12       8       20        8      active sync   /dev/sdb4
      18       0        0       18      removed

Relevant dmesg output:

[128702.154193] md: super_written gets error=-5
[128702.154197] md/raid:md4: Disk failure on sdg4, disabling device.
                md/raid:md4: Operation continuing on 9 devices.
[128702.154205] md: super_written gets error=-5
[128702.254561] mvsas 0000:03:00.0: Phy2 : No sig fis
[128703.151620] md: md4: reshape interrupted.
[128706.343757] sas: sas_form_port: phy2 belongs to port2 already(1)!

Attempting to re-add /dev/sdg4 to the array fails on a busy device:

# mdadm --manage /dev/md4 --re-add /dev/sdg4
mdadm: Cannot open /dev/sdg4: Device or resource busy

To free up /dev/sdg4, I tried to stop the array. Not surprisingly,
this command hung as well:

# mdadm --stop /dev/md4


-- 
George Rapp  (Pataskala, OH) Home: george.rapp -- at -- gmail.com
LinkedIn profile: https://www.linkedin.com/in/georgerapp
Phone: +1 740 936 RAPP (740 936 7277)

* Re: RAID 5 reshape stalled at 77.5% - next steps??
From: Roman Mamedov @ 2017-01-28 23:33 UTC (permalink / raw)
  To: George Rapp; +Cc: Linux-RAID, Matthew Krumwiede

On Sat, 28 Jan 2017 18:29:32 -0500
George Rapp <george.rapp@gmail.com> wrote:

> Attempting to re-add /dev/sdg4 to the array fails on a busy device:
> 
> # mdadm --manage /dev/md4 --re-add /dev/sdg4
> mdadm: Cannot open /dev/sdg4: Device or resource busy

You need to remove it first:

  mdadm --remove /dev/md4 /dev/sdg4

or 

  mdadm --remove /dev/md4 faulty

But honestly I am not sure if simply removing and re-adding will bring your
reshape back to its working order at this point.

Also, you should figure out why it failed in the first place. Check SMART,
and check dmesg further back rather than only the last few lines. Maybe the
disk needs a replacement, not just a blind re-add.
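
One way to look further back is to filter the whole kernel log for the
affected disk and for md messages. This is only a sketch: disk_events is a
hypothetical helper, the keyword list is an assumption based on the dmesg
lines posted earlier in this thread, and the partition-stripping step
assumes sdXN-style device names.

```shell
# Hypothetical helper: keep kernel log lines about one disk or about md.
# $1 is the array member (e.g. sdg4); the log is read from stdin.
disk_events() {
    member=$1
    # Strip the trailing partition number: sdg4 -> sdg.
    disk=$(printf '%s\n' "$member" | sed 's/[0-9]*$//')
    grep -E "${disk}|md/raid|super_written|reshape"
}
```

Typical use would be "dmesg | disk_events sdg4". For the SMART side,
"smartctl -a /dev/sdg" from smartmontools prints the drive's health status
and error log.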

-- 
With respect,
Roman

* Re: RAID 5 reshape stalled at 77.5% - next steps??
From: George Rapp @ 2017-01-28 23:58 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Linux-RAID, Matthew Krumwiede

On Sat, Jan 28, 2017 at 6:33 PM, Roman Mamedov <rm@romanrm.net> wrote:
> On Sat, 28 Jan 2017 18:29:32 -0500
> George Rapp <george.rapp@gmail.com> wrote:
>
>> Attempting to re-add /dev/sdg4 to the array fails on a busy device:
>>
>> # mdadm --manage /dev/md4 --re-add /dev/sdg4
>> mdadm: Cannot open /dev/sdg4: Device or resource busy
>
> You need to remove it first:
>
>   mdadm --remove /dev/md4 /dev/sdg4
>
> or
>
>   mdadm --remove /dev/md4 faulty
>
> But honestly I am not sure if simply removing and re-adding will bring your
> reshape back to its working order at this point.
>
> Also, you should figure out why it failed in the first place. Check SMART,
> and check dmesg further back rather than only the last few lines. Maybe the
> disk needs a replacement, not just a blind re-add.

Perhaps not surprisingly, the --remove command also hung.

/dev/sdg4 apparently suffered an uncorrectable read error. Entire
dmesg output (2172 lines) is at
https://app.box.com/s/7brp7c53a51zw4ez5to0m12oc5hxeq92 for your
reference.

Since none of the mdadm commands will respond, I'm thinking we need to
reboot the machine at this point to do any more diagnostics.
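
Before rebooting, it may be worth recording which commands are stuck.
Commands that hang like this on Linux are often in uninterruptible sleep
(state D), though nothing in this thread confirms that here. A minimal
sketch, assuming "ps -eo pid,stat,comm" style input with a header line;
dstate_tasks is a hypothetical helper name.

```shell
# Hypothetical helper: print "pid command" for tasks whose state starts with D.
# Expects the output of: ps -eo pid,stat,comm
dstate_tasks() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}
```

Usage would be "ps -eo pid,stat,comm | dstate_tasks"; saving that list
together with the dmesg output gives the list something concrete to look at
after the reboot.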

Thanks for your quick replies!

-- 
George Rapp  (Pataskala, OH) Home: george.rapp -- at -- gmail.com
LinkedIn profile: https://www.linkedin.com/in/georgerapp
Phone: +1 740 936 RAPP (740 936 7277)
