From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Shaver
Subject: Kernel deadlock during mdadm reshape
Date: Tue, 26 Jul 2016 22:18:48 -0400
Message-ID:
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Sender: linux-raid-owner@vger.kernel.org
To: mdraid
List-Id: linux-raid.ids

I am experiencing the exact same problem reported in this thread:
http://www.spinics.net/lists/raid/msg52235.html

Also reported here:
https://forums.gentoo.org/viewtopic-t-1043706.html

And here:
https://bbs.archlinux.org/viewtopic.php?id=212108

I have a raid5 array of 2TB disks that is currently stuck at 94% of an mdadm reshape, following a grow operation from 4 disks to 5. In my case, I did have a drive drop out of the array during the reshape. The PC has been rebooted many times now in an attempt to restart the process, but no matter what I do, the array immediately locks up upon assembly. The md127_raid5 kernel thread immediately spikes to near 100% CPU, md127_reshape immediately deadlocks, and udev follows shortly after. At that point, any attempt to mount or otherwise interact with the array causes processes to hang.

I have been trying to recover for about three weeks now and am starting to run out of ideas for what to try next. What I have tried thus far:

1. Disabled all manner of intrusive security enforcement (SELinux)
2. Attempted to assemble with '--freeze-reshape', but to no effect (rough commands after this list)
3. Attempted to assemble with '--invalid-backup', but to no effect
4. Changed the min and max throughput values for the array reshape, but to no effect
5. Ran extended SMART tests against all drives (all pass; the faulty drive has issues with going to sleep)
6. Booted live recovery CDs with a variety of kernel versions (as far back as 3.6.10 and as far forward as 4.6.3)
7. Compiled the latest mdadm
8. Disabled udev
9. Tried killing the md127_raid5 process before it could spike, but to no effect
10. Tried killing the md127_reshape process before it could deadlock, but to no effect
11. Swapped the drives out into a different physical PC

Nothing I do seems to have any effect. The issue reproduces exactly the same under all scenarios.
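For reference, the freeze-reshape, invalid-backup and throughput attempts above were along these lines (reconstructed from memory, so the exact values may differ slightly):

> mdadm --assemble /dev/md127 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 --freeze-reshape
> mdadm --assemble /dev/md127 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 --invalid-backup --backup-file=/home/user/grow_md127.bak
> echo 1000 > /proc/sys/dev/raid/speed_limit_min
> echo 500000 > /proc/sys/dev/raid/speed_limit_max

Below are the original grow commands and the state the array ends up in immediately after assembly: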
> mdadm --add /dev/md127 /dev/sdf1
> mdadm --grow /dev/md127 --raid-devices=5 --backup-file=/home/user/grow_md127.bak

> cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdd1[1] sde1[5] sda1[4] sdf1[2]
      5860147200 blocks super 1.2 level 5, 128k chunk, algorithm 2 [5/4] [_UUUU]
      [==================>..]  reshape = 94.3% (1842696832/1953382400) finish=99999.99min speed=0K/sec
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

> ps aux | grep md127
root 3568 98.4 0.0 0 0 ? R 21:35 1:16 [md127_raid5]
root 3569 0.0 0.0 0 0 ? D 21:35 0:00 [md127_reshape]

> ps aux | grep md | grep D
root 3569 0.0 0.0 0 0 ? D 21:35 0:00 [md127_reshape]
root 3570 0.0 0.0 0 0 ? D 21:35 0:00 [systemd-udevd]

> cat /proc/3569/stack
[] raid5_get_active_stripe+0x310/0x6f0 [raid456]
[] reshape_request+0x2fb/0x940 [raid456]
[] raid5_sync_request+0x326/0x3a0 [raid456]
[] md_do_sync+0x88c/0xe50
[] md_thread+0x139/0x150
[] kthread+0xd8/0xf0
[] ret_from_fork+0x22/0x40
[] 0xffffffffffffffff

> cat /proc/3570/stack
[] __lock_page+0xc8/0xe0
[] truncate_inode_pages_range+0x46d/0x880
[] truncate_inode_pages+0x15/0x20
[] kill_bdev+0x2f/0x40
[] __blkdev_put+0x85/0x290
[] blkdev_put+0x4c/0x110
[] blkdev_close+0x25/0x30
[] __fput+0xdf/0x1f0
[] ____fput+0xe/0x10
[] task_work_run+0x7f/0xa0
[] do_exit+0x2d8/0xb60
[] do_group_exit+0x47/0xb0
[] get_signal+0x291/0x610
[] do_signal+0x37/0x710
[] exit_to_usermode_loop+0x8c/0xd0
[] syscall_return_slowpath+0xa1/0xb0
[] entry_SYSCALL_64_fastpath+0xa2/0xa4
[] 0xffffffffffffffff

> cat /proc/3568/stack
[] 0xffffffffffffffff
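Since md127_raid5 stays runnable, /proc/<pid>/stack shows nothing useful for it. If it would help, I can try to capture where it is spinning and where the D-state tasks are blocked via sysrq, along these lines (assuming sysrq is enabled on the box):

> echo 1 > /proc/sys/kernel/sysrq
> echo l > /proc/sysrq-trigger    (backtraces of all active CPUs, should catch the spinning raid5 thread)
> echo w > /proc/sysrq-trigger    (stacks of all blocked/uninterruptible tasks)
> dmesg

and post the dmesg output. Meanwhile, stopping and re-assembling the array always ends the same way: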
> mdadm -S /dev/md127
(hangs)
> reboot

> mdadm --assemble /dev/md127 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 --verbose --backup-file=/home/user/grow_md127.bak
mdadm: /dev/sda1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sdf1 is identified as a member of /dev/md127, slot 2.
mdadm: /dev/md127 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /home/user/grow_md127.bak
mdadm: too-old timestamp on backup-metadata on device-4
mdadm: If you think it is should be safe, try 'export MDADM_GROW_ALLOW_OLD=1'
mdadm: added /dev/sdc1 to /dev/md127 as 0 (possibly out of date)
mdadm: added /dev/sdf1 to /dev/md127 as 2
mdadm: added /dev/sda1 to /dev/md127 as 3
mdadm: added /dev/sde1 to /dev/md127 as 4
mdadm: added /dev/sdd1 to /dev/md127 as 1
mdadm: /dev/md127 has been started with 4 drives (out of 5).

> cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdd1[1] sde1[5] sda1[4] sdf1[2]
      5860147200 blocks super 1.2 level 5, 128k chunk, algorithm 2 [5/4] [_UUUU]
      [==================>..]  reshape = 94.3% (1842696832/1953382400) finish=99999.99min speed=0K/sec
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

> mdadm -S /dev/md127
(hangs)
> reboot

> export MDADM_GROW_ALLOW_OLD=1
> mdadm --assemble /dev/md127 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 --verbose --backup-file=/home/user/grow_md127.bak
mdadm: looking for devices for /dev/md127
mdadm: /dev/sda1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sdf1 is identified as a member of /dev/md127, slot 2.
mdadm: /dev/md127 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /home/user/grow_md127.bak
mdadm: accepting backup with timestamp 1467397557 for array with timestamp 1469583355
mdadm: backup-metadata found on device-4 but is not needed
mdadm: added /dev/sdc1 to /dev/md127 as 0 (possibly out of date)
mdadm: added /dev/sdf1 to /dev/md127 as 2
mdadm: added /dev/sda1 to /dev/md127 as 3
mdadm: added /dev/sde1 to /dev/md127 as 4
mdadm: added /dev/sdd1 to /dev/md127 as 1
mdadm: /dev/md127 has been started with 4 drives (out of 5).

> cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdd1[1] sde1[5] sda1[4] sdf1[2]
      5860147200 blocks super 1.2 level 5, 128k chunk, algorithm 2 [5/4] [_UUUU]
      [==================>..]  reshape = 94.3% (1842696832/1953382400) finish=99999.99min speed=0K/sec
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

> mdadm -D /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sun May 18 16:54:52 2014
     Raid Level : raid5
     Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
  Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
   Raid Devices : 5
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Jul 26 21:53:57 2016
          State : clean, degraded, reshaping
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

 Reshape Status : 94% complete
  Delta Devices : 1, (4->5)

           Name : rza.eth0.net:0  (local to host rza.eth0.net)
           UUID : 9d5d1606:414b51f8:b5173999:7239c63f
         Events : 345137

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       49        1      active sync   /dev/sdd1
       2       8       81        2      active sync   /dev/sdf1
       4       8        1        3      active sync   /dev/sda1
       5       8       65        4      active sync   /dev/sde1

Looking for pointers on where to look next, if anyone has suggestions. I am starting to step through the code and debug the kernel, but this is out of my depth. A couple of specific questions:

1. Am I correct in my understanding that the code behind the md127_raid5 and md127_reshape processes runs entirely in kernel space, and that mdadm merely manages those kernel threads? In other words, if I want to debug the deadlock, should I be looking at the kernel side of Linux RAID (md/raid456)?

2. Does md_reshape require md_raid5 to be running, and vice versa? Would it be possible to force mdadm to start only one of the two?

Thanks for any tips or suggestions!

Michael