From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jes Sorensen
Subject: Re: raid5 lockups post ca64cae96037de16e4af92678814f5d4bf0c1c65
Date: Thu, 14 Mar 2013 08:35:05 +0100
Message-ID:
References: <20130305080010.6285b435@notabene.brown>
	<20130306131804.0b39752a@notabene.brown>
	<20130312093231.72c54735@notabene.brown>
	<20130312123224.62018981@notabene.brown>
	<20130313103513.350f24f7@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain
Return-path:
In-Reply-To: <20130313103513.350f24f7@notabene.brown> (NeilBrown's message of
	"Wed, 13 Mar 2013 10:35:13 +1100")
Sender: linux-raid-owner@vger.kernel.org
To: NeilBrown
Cc: linux-raid@vger.kernel.org, Shaohua Li, Eryu Guan
List-Id: linux-raid.ids

NeilBrown writes:
> On Tue, 12 Mar 2013 14:45:44 +0100 Jes Sorensen wrote:
>
>> NeilBrown writes:
>> > On Tue, 12 Mar 2013 09:32:31 +1100 NeilBrown wrote:
>> >
>> >> On Wed, 06 Mar 2013 10:31:55 +0100 Jes Sorensen wrote:
>> >>
>> >> > I am attaching the test script I am running too. It was written by
>> >> > Eryu Guan.
>> >>
>> >> Thanks for that. I've tried using it but haven't managed to trigger a
>> >> BUG yet. What size are the loop files? I mostly use fairly small ones,
>> >> but maybe it needs to be bigger to trigger the problem.
>> >
>> > Shortly after I wrote that I got a bug-on! It hasn't happened again
>> > though.
>> >
>> > This was using code without that latest patch I sent. The bug was
>> >     BUG_ON(s->uptodate != disks);
>> > in the check_state_compute_result case of handle_parity_checks5(), which
>> > is probably the same cause as your most recent BUG.
>> >
>> > I've revised my thinking a bit and am now running with this patch, which
>> > I think should fix a problem that probably caused the symptoms we have
>> > seen.
>> >
>> > If you could run your tests for a while too and see whether it will
>> > still crash for you, I'd really appreciate it.
>>
>> Hi Neil,
>>
>> Sorry, I can't verify the line numbers of my old test since I managed to
>> mess up my git tree in the process :(
>>
>> However, running with this new patch I have just hit another, different
>> case. It looks like a deadlock.
>
> Your test setup is clearly different from mine. I've been running all night
> without a single hiccup.
>
>> This is basically running ca64cae96037de16e4af92678814f5d4bf0c1c65 with
>> your patch applied on top, and nothing else.
>>
>> If you want me to try a more up-to-date Linus tree, please let me know.
>>
>> Cheers,
>> Jes
>>
>> [17635.205927] INFO: task mkfs.ext4:20060 blocked for more than 120 seconds.
>> [17635.213543] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [17635.222291] mkfs.ext4       D ffff880236814100     0 20060  20026 0x00000080
>> [17635.230199]  ffff8801bc8bbb98 0000000000000082 ffff88022f0be540 ffff8801bc8bbfd8
>> [17635.238518]  ffff8801bc8bbfd8 ffff8801bc8bbfd8 ffff88022d47b2a0 ffff88022f0be540
>> [17635.246837]  ffff8801cea1f430 000000000001d5f0 ffff8801c7f4f430 ffff88022169a400
>> [17635.255161] Call Trace:
>> [17635.257891]  [] schedule+0x29/0x70
>> [17635.263433]  [] make_request+0x6da/0x6f0 [raid456]
>> [17635.270525]  [] ? wake_up_bit+0x40/0x40
>> [17635.276560]  [] md_make_request+0xc3/0x200
>> [17635.282884]  [] ? mempool_alloc_slab+0x15/0x20
>> [17635.289586]  [] generic_make_request+0xc2/0x110
>> [17635.296393]  [] submit_bio+0x79/0x160
>> [17635.302232]  [] ? bio_alloc_bioset+0x65/0x120
>> [17635.308844]  [] blkdev_issue_discard+0x184/0x240
>> [17635.315748]  [] blkdev_ioctl+0x3b6/0x810
>> [17635.321877]  [] block_ioctl+0x41/0x50
>> [17635.327714]  [] do_vfs_ioctl+0x99/0x580
>> [17635.333745]  [] ? inode_has_perm.isra.30.constprop.60+0x2a/0x30
>> [17635.342103]  [] ? file_has_perm+0x97/0xb0
>> [17635.348329]  [] sys_ioctl+0x91/0xb0
>> [17635.353972]  [] ? __audit_syscall_exit+0x3ec/0x450
>> [17635.361070]  [] system_call_fastpath+0x16/0x1b
>
> There is a small race in the exclusion between discard and recovery.
> This patch on top should fix it (I hope).
> Thanks for testing.
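
As an illustration of the window being described (this is not the md/raid5
code; recovery_active, discard_racy and discard_safe are names invented for
the sketch): the race has the classic check-then-act shape, where the
discard path tests a condition, drops the exclusion, and only then acts, so
recovery can begin in between. Re-testing once the lock is actually held
closes the window. Build with gcc -pthread.

/* Illustration only -- NOT the md/raid5 code.  All names are invented. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static bool recovery_active;	/* stand-in for "recovery owns this range" */

/* Racy shape: the condition is tested, the lock dropped, and the work
 * done afterwards, so recovery can start inside that window. */
static void discard_racy(void)
{
	bool clear;

	pthread_mutex_lock(&lock);
	clear = !recovery_active;	/* check ... */
	pthread_mutex_unlock(&lock);

	if (clear)			/* ... act: recovery may have started */
		puts("discard issued (recovery may have started meanwhile)");
}

/* Closed window: take the exclusion first, then (re)check and wait. */
static void discard_safe(void)
{
	pthread_mutex_lock(&lock);
	while (recovery_active)
		pthread_cond_wait(&done, &lock);
	puts("discard issued with recovery excluded");
	pthread_mutex_unlock(&lock);
}

static void *recovery(void *arg)
{
	pthread_mutex_lock(&lock);
	recovery_active = true;
	/* ... resync of the overlapping region would run here ... */
	recovery_active = false;
	pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);
	return arg;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, recovery, NULL);
	discard_racy();		/* may overlap recovery */
	discard_safe();		/* cannot */
	pthread_join(t, NULL);
	return 0;
}

The real raid5 paths are of course far more involved; the sketch only shows
why the condition has to be re-tested once the exclusion is actually held.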

OK, I spent most of yesterday running tests on this. With this additional
patch applied I haven't been able to reproduce the hang so far - without it
I could trigger it in about an hour - so I suspect it solves the problem.

Thanks!
Jes
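
P.S. For context, the BUG_ON(s->uptodate != disks) quoted above expresses
the invariant that a parity check is only meaningful once every member of
the stripe, data and parity alike, is up to date in memory. A toy,
self-contained model of that invariant follows; it is not the kernel code,
and DISKS, CHUNK and struct stripe are invented here.

/* Toy model of the invariant -- NOT drivers/md/raid5.c. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DISKS 4			/* 3 data members + 1 parity (invented) */
#define CHUNK 8			/* bytes per toy chunk */

struct stripe {
	unsigned char dev[DISKS][CHUNK];
	int uptodate;		/* members currently valid in memory */
};

/* Return 1 if parity (last member) equals the XOR of the data members. */
static int check_parity(const struct stripe *sh)
{
	unsigned char x[CHUNK] = { 0 };

	/* Toy equivalent of BUG_ON(s->uptodate != disks). */
	assert(sh->uptodate == DISKS);

	for (int d = 0; d < DISKS - 1; d++)
		for (int i = 0; i < CHUNK; i++)
			x[i] ^= sh->dev[d][i];
	return memcmp(x, sh->dev[DISKS - 1], CHUNK) == 0;
}

int main(void)
{
	struct stripe sh = { .uptodate = DISKS };

	memcpy(sh.dev[0], "AAAAAAAA", CHUNK);
	memcpy(sh.dev[1], "BBBBBBBB", CHUNK);
	memcpy(sh.dev[2], "CCCCCCCC", CHUNK);
	for (int i = 0; i < CHUNK; i++)
		sh.dev[3][i] = sh.dev[0][i] ^ sh.dev[1][i] ^ sh.dev[2][i];

	printf("parity %s\n", check_parity(&sh) ? "ok" : "MISMATCH");
	return 0;
}

If check_parity() is ever reached with fewer members resident, the assert
fires, which is the userspace analogue of the BUG being chased above.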