* RAID 5 reshape stalled at 77.5% - next steps??
@ 2017-01-28 23:01 George Rapp
2017-01-28 23:15 ` Roman Mamedov
0 siblings, 1 reply; 5+ messages in thread
From: George Rapp @ 2017-01-28 23:01 UTC (permalink / raw)
To: Linux-RAID; +Cc: Matthew Krumwiede
Hello linux-raid team. I have a reshape operation that is stuck and
refuses to respond to commands. I'm wondering what my options are to
safely get it moving again.
Background: I added two new partitions to a RAID 5 array, using a
backup file on a separate device:
# mdadm --add /dev/md4 /dev/sdb4 /dev/sdd4
mdadm: added /dev/sdb4
mdadm: added /dev/sdd4
# mdadm --grow --raid-devices=10 \
    --backup-file=/home/gwr/c/md4_backup__2017-01-25 /dev/md4
mdadm: Need to backup 32256K of critical section..
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
[...]
md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10] sdi4[8] sdl4[9]
sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9]
[UUUUUUUUU_]
[>....................] reshape = 0.8% (16715456/1922131968)
finish=965.4min speed=32892K/sec
The reshape proceeded normally until it hit 77.5%, where it has been
stuck for the last couple of days:
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10](F) sdi4[8] sdl4[9]
sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9]
[UUUU_UUUU_]
[===============>.....] reshape = 77.5% (1490403328/1922131968)
finish=2544246.9min speed=2K/sec
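[Editor's sketch] The percentage on the reshape line can be recomputed from the two counters in parentheses, which helps when comparing snapshots taken hours apart to confirm that progress has really stopped. The helper name reshape_percent is hypothetical, and this version rounds, whereas the kernel's own figure may be truncated, so the last digit can differ.

```shell
# Print the reshape percentage computed from the "(done/total)" counters
# found in /proc/mdstat-style text read on stdin.
reshape_percent() {
    awk '
        /reshape *=/ {
            # Pull the "(done/total)" pair out of the progress line.
            if (match($0, /\([0-9]+\/[0-9]+\)/)) {
                pair = substr($0, RSTART + 1, RLENGTH - 2)
                split(pair, n, "/")
                printf "%.1f\n", 100 * n[1] / n[2]
            }
        }
    '
}

# Usage on a live system (not executed here):
#   reshape_percent < /proc/mdstat
```

Fed the stalled line above, the counters 1490403328/1922131968 work out to roughly 77.5%.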
The backup file was last modified at about the time I started the reshape:
-rw-------. 1 root root 33034240 Jan 25 11:52 md4_backup__2017-01-25
I tried to idle the RAID reshape, but the "echo" command just hung:
# cd /sys/block/md4/md
# echo idle > sync_action
I can get some data from the files in this directory, though:
# cat reshape_direction
forwards
# cat reshape_position
26825379840
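[Editor's sketch] The reshape_position value can be cross-checked against the per-device counter in /proc/mdstat. This assumes, without confirmation from the thread, that reshape_position counts 512-byte sectors across the array's data space and that the new 10-device RAID 5 layout has 9 data disks per stripe. The function name is hypothetical.

```shell
# Convert an array-wide reshape_position (ASSUMED to be in 512-byte sectors)
# into a per-device offset in KiB, given the data-disk count of the new layout.
reshape_pos_to_dev_kib() {
    pos_sectors=$1
    data_disks=$2
    # sectors per device = position / data disks; 2 sectors per KiB
    echo $(( pos_sectors / data_disks / 2 ))
}

# RAID 5 over 10 devices: 9 data disks per stripe (assumption stated above).
reshape_pos_to_dev_kib 26825379840 9
```

Under those assumptions the result is 1490298880 KiB per device, a whole multiple of the 512K chunk and about 77.5% of the 1922131968 KiB device size, which is consistent with the figure mdstat reports.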
I tried to pull mdadm data about this array to add to this post, but that
command also hung:
# mdadm --misc --examine /dev/md4
The server CPU load is pegged, with md4_raid5 as the top CPU hog.
What are my safe alternatives here? Can I safely reboot without corrupting
the reshape? How can I get the reshape unstuck?
--
George Rapp (Pataskala, OH) Home: george.rapp -- at -- gmail.com
LinkedIn profile: https://www.linkedin.com/in/georgerapp
Phone: +1 740 936 RAPP (740 936 7277)
* Re: RAID 5 reshape stalled at 77.5% - next steps??
2017-01-28 23:01 RAID 5 reshape stalled at 77.5% - next steps?? George Rapp
@ 2017-01-28 23:15 ` Roman Mamedov
2017-01-28 23:29 ` George Rapp
0 siblings, 1 reply; 5+ messages in thread
From: Roman Mamedov @ 2017-01-28 23:15 UTC (permalink / raw)
To: George Rapp; +Cc: Linux-RAID, Matthew Krumwiede
On Sat, 28 Jan 2017 18:01:30 -0500
George Rapp <george.rapp@gmail.com> wrote:
> The reshape proceeded normally until it hit 77.5%, where it has been
> stuck for the last couple of days:
>
> # cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10](F) sdi4[8] sdl4[9]
> sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
>
> 13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9]
> [UUUU_UUUU_]
> [===============>.....] reshape = 77.5% (1490403328/1922131968)
> finish=2544246.9min speed=2K/sec
It shows you have a failed device (sdg4), but you don't mention anything about
that. Please post the output of mdadm --detail /dev/md4 and what you have in dmesg.
--
With respect,
Roman
* Re: RAID 5 reshape stalled at 77.5% - next steps??
2017-01-28 23:15 ` Roman Mamedov
@ 2017-01-28 23:29 ` George Rapp
2017-01-28 23:33 ` Roman Mamedov
0 siblings, 1 reply; 5+ messages in thread
From: George Rapp @ 2017-01-28 23:29 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Linux-RAID, Matthew Krumwiede
On Sat, Jan 28, 2017 at 6:15 PM, Roman Mamedov <rm@romanrm.net> wrote:
> On Sat, 28 Jan 2017 18:01:30 -0500
> George Rapp <george.rapp@gmail.com> wrote:
>
>> The reshape proceeded normally until it hit 77.5%, where it has been
>> stuck for the last couple of days:
>>
>> # cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md4 : active raid5 sdd4[13](R) sdb4[12] sdg4[10](F) sdi4[8] sdl4[9]
>> sdf4[1] sdj4[7] sdh4[2] sde4[0] sdk4[11]
>>
>> 13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2 [10/9]
>> [UUUU_UUUU_]
>> [===============>.....] reshape = 77.5% (1490403328/1922131968)
>> finish=2544246.9min speed=2K/sec
>
> It shows you have a failed device (sdg4) but you don't mention anything about
> that? Post your mdadm --detail /dev/md4, and what do you have in dmesg.
Roman -
Good catch. I didn't notice that.
# mdadm --detail /dev/md4
/dev/md4:
Version : 1.1
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Array Size : 13454923776 (12831.62 GiB 13777.84 GB)
Used Dev Size : 1922131968 (1833.09 GiB 1968.26 GB)
Raid Devices : 10
Total Devices : 10
Persistence : Superblock is persistent
Update Time : Thu Jan 26 08:06:56 2017
State : active, FAILED, reshaping
Active Devices : 8
Working Devices : 9
Failed Devices : 1
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Reshape Status : 77% complete
Delta Devices : 2, (8->10)
Name : localhost.localdomain:4
UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Events : 3957775
Number Major Minor RaidDevice State
0 8 68 0 active sync /dev/sde4
1 8 84 1 active sync /dev/sdf4
2 8 116 2 active sync /dev/sdh4
9 8 180 3 active sync /dev/sdl4
10 8 100 4 faulty /dev/sdg4
13 8 52 4 spare rebuilding /dev/sdd4
11 8 164 5 active sync /dev/sdk4
8 8 132 6 active sync /dev/sdi4
7 8 148 7 active sync /dev/sdj4
12 8 20 8 active sync /dev/sdb4
18 0 0 18 removed
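[Editor's sketch] With ten members in the table it is easy to overlook the one that is not "active sync", as happened with the first /proc/mdstat paste. A small filter over the device table can list members by state. The helper name list_by_state is hypothetical, and it assumes the column layout shown above (four numeric columns, state words, then a /dev path).

```shell
# List member devices whose state column contains the given word,
# reading "mdadm --detail" output on stdin.
list_by_state() {
    want=$1
    awk -v want="$want" '
        # Device rows begin with a number and end with a /dev path;
        # "removed" rows have no path and are skipped.
        $1 ~ /^[0-9]+$/ && $NF ~ /^\/dev\// {
            for (i = 5; i < NF; i++) {
                if ($i == want) { print $NF; break }
            }
        }
    '
}

# Usage on a live system (not executed here):
#   mdadm --detail /dev/md4 | list_by_state faulty
```

Against the table above, asking for "faulty" should yield /dev/sdg4 and asking for "rebuilding" should yield /dev/sdd4.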
Relevant dmesg output:
[128702.154193] md: super_written gets error=-5
[128702.154197] md/raid:md4: Disk failure on sdg4, disabling device.
md/raid:md4: Operation continuing on 9 devices.
[128702.154205] md: super_written gets error=-5
[128702.254561] mvsas 0000:03:00.0: Phy2 : No sig fis
[128703.151620] md: md4: reshape interrupted.
[128706.343757] sas: sas_form_port: phy2 belongs to port2 already(1)!
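[Editor's sketch] In a long kernel log, the md failure lines can be pulled out mechanically. The filter below matches only the "Disk failure on ..." message format seen in the excerpt above; other kernel versions may word it differently, and the function name failed_members is hypothetical.

```shell
# Print "<array> <member>" for every md disk-failure message on stdin.
failed_members() {
    # Matches lines like:
    #   md/raid:md4: Disk failure on sdg4, disabling device.
    sed -n 's/.*md\/raid:\([^:]*\): Disk failure on \([^,]*\),.*/\1 \2/p'
}

# Usage on a live system (not executed here):
#   dmesg | failed_members
```

For the excerpt above this would print "md4 sdg4".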
Attempting to re-add /dev/sdg4 to the array fails on a busy device:
# mdadm --manage /dev/md4 --re-add /dev/sdg4
mdadm: Cannot open /dev/sdg4: Device or resource busy
To free up /dev/sdg4, I tried to stop the array. Not surprisingly,
this command hung as well:
# mdadm --stop /dev/md4
* Re: RAID 5 reshape stalled at 77.5% - next steps??
2017-01-28 23:29 ` George Rapp
@ 2017-01-28 23:33 ` Roman Mamedov
2017-01-28 23:58 ` George Rapp
0 siblings, 1 reply; 5+ messages in thread
From: Roman Mamedov @ 2017-01-28 23:33 UTC (permalink / raw)
To: George Rapp; +Cc: Linux-RAID, Matthew Krumwiede
On Sat, 28 Jan 2017 18:29:32 -0500
George Rapp <george.rapp@gmail.com> wrote:
> Attempting to re-add /dev/sdg4 to the array fails on a busy device:
>
> # mdadm --manage /dev/md4 --re-add /dev/sdg4
> mdadm: Cannot open /dev/sdg4: Device or resource busy
You need to remove it first
mdadm --remove /dev/md4 /dev/sdg4
or
mdadm --remove /dev/md4 faulty
But honestly I am not sure if simply removing and re-adding will bring your
reshape back to its working order at this point.
Also, you should figure out why it failed in the first place. Check
SMART, and check dmesg further back rather than only a few lines. Maybe the disk
needs to be replaced, not just blindly re-added.
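[Editor's sketch] The checks suggested here can be written down as a dry run that only prints the commands for the disk behind a failed partition, so nothing touches a possibly wedged system until the output has been reviewed. The helper names are hypothetical, the partition-to-disk mapping assumes sdX-style names, and smartctl (from smartmontools) is assumed to be installed.

```shell
# Map a partition name to its whole disk: sdg4 -> sdg.
# Assumes sdX-style names; names like nvme0n1p4 need different handling.
whole_disk() {
    printf '%s\n' "$1" | sed 's/[0-9]*$//'
}

# Print (do not run) the health checks for the disk behind a partition.
print_disk_checks() {
    disk=$(whole_disk "$1")
    echo "smartctl -H /dev/$disk"   # overall SMART health verdict
    echo "smartctl -A /dev/$disk"   # vendor attributes (reallocated sectors etc.)
    echo "dmesg | grep -i $disk"    # earlier kernel messages mentioning the disk
}

print_disk_checks sdg4
```

Printing rather than executing is a deliberate choice here, since several commands in this thread hung when run against the stuck array.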
--
With respect,
Roman
* Re: RAID 5 reshape stalled at 77.5% - next steps??
2017-01-28 23:33 ` Roman Mamedov
@ 2017-01-28 23:58 ` George Rapp
0 siblings, 0 replies; 5+ messages in thread
From: George Rapp @ 2017-01-28 23:58 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Linux-RAID, Matthew Krumwiede
On Sat, Jan 28, 2017 at 6:33 PM, Roman Mamedov <rm@romanrm.net> wrote:
> On Sat, 28 Jan 2017 18:29:32 -0500
> George Rapp <george.rapp@gmail.com> wrote:
>
>> Attempting to re-add /dev/sdg4 to the array fails on a busy device:
>>
>> # mdadm --manage /dev/md4 --re-add /dev/sdg4
>> mdadm: Cannot open /dev/sdg4: Device or resource busy
>
> You need to remove it first
>
> mdadm --remove /dev/md4 /dev/sdg4
>
> or
>
> mdadm --remove /dev/md4 faulty
>
> But honestly I am not sure if simply removing and re-adding will bring your
> reshape back to its working order at this point.
>
> Also you should figure out why did it fail in the first place. Check
> SMART, check dmesg further back rather than a few lines only. Maybe the disk
> needs a replacement, not just a blind re-add.
Perhaps not surprisingly, the --remove command also hung.
/dev/sdg4 apparently suffered an uncorrectable read error. Entire
dmesg output (2172 lines) is at
https://app.box.com/s/7brp7c53a51zw4ez5to0m12oc5hxeq92 for your
reference.
Since none of the mdadm commands will respond, I'm thinking we need to
reboot the machine at this point to do any more diagnostics.
Thanks for your quick replies!