* bug #917 - deadlock on log recovery
@ 2012-03-22 17:34 Kirill Malkin
  2012-03-30 16:06 ` Christoph Hellwig
  0 siblings, 1 reply; 3+ messages in thread
From: Kirill Malkin @ 2012-03-22 17:34 UTC
To: xfs, xfs-masters

Hi,

I am wondering if someone had a chance to look at the bug #917. I filed it a couple of weeks ago, but haven't seen any action. We are running into it quite a lot, and the only way out of it is to reboot the OS and drop the log. Below is another stack trace that is slightly different from the one I filed, but apparently it is the same bug.

Please let me know if you need any other input.

Thanks!
Kirill

[1185916.684850] mount         D ffff8808edc989c0     0  6978      1 0x00000000
[1185916.684853]  ffff8802433632f8 0000000000000086 0000000000000000 0000000000000000
[1185916.684856]  000000000000e488 ffff880243363fd8 ffff880443636280 ffff880c3cdd8180
[1185916.684860]  ffff880443636608 000000063b49a400 0000000111a61ff6 ffff88065848e488
[1185916.684863] Call Trace:
[1185916.684866]  [<ffffffff8150d44b>] ? dm_any_congested+0x6b/0x90
[1185916.684869]  [<ffffffff816766ed>] schedule_timeout+0x1dd/0x260
[1185916.684871]  [<ffffffff8150bfea>] ? dm_get_live_table+0x4a/0x60
[1185916.684874]  [<ffffffff8167759e>] __down+0x6e/0xb0
[1185916.684877]  [<ffffffff8135aad5>] ? _xfs_buf_find+0x145/0x280
[1185916.684879]  [<ffffffff8108318c>] down+0x4c/0x50
[1185916.684882]  [<ffffffff813599d0>] xfs_buf_lock+0x60/0xd0
[1185916.684884]  [<ffffffff8135aad5>] _xfs_buf_find+0x145/0x280
[1185916.684887]  [<ffffffff8135ac71>] xfs_buf_get+0x61/0x1c0
[1185916.684890]  [<ffffffff8134fe6b>] xfs_trans_get_buf+0x13b/0x1c0
[1185916.684895]  [<ffffffff8131cf94>] xfs_btree_get_buf_block+0x54/0x80
[1185916.684898]  [<ffffffff813206a4>] xfs_btree_split+0x114/0x6a0
[1185916.684900]  [<ffffffff8131e995>] ? xfs_btree_rshift+0x75/0x530
[1185916.684903]  [<ffffffff8131d89d>] ? xfs_btree_lshift+0x7d/0x5f0
[1185916.684906]  [<ffffffff81321151>] xfs_btree_make_block_unfull+0x151/0x190
[1185916.684909]  [<ffffffff8132152c>] xfs_btree_insrec+0x39c/0x5b0
[1185916.684911]  [<ffffffff8131dec7>] ? xfs_btree_lookup_get_block+0xb7/0xf0
[1185916.684915]  [<ffffffff8131be72>] ? xfs_btree_rec_addr+0x12/0x20
[1185916.684917]  [<ffffffff8131c0d8>] ? xfs_lookup_get_search_key+0x58/0x60
[1185916.684920]  [<ffffffff813217c6>] xfs_btree_insert+0x86/0x180
[1185916.684925]  [<ffffffff81306d01>] xfs_free_ag_extent+0x4f1/0x7a0
[1185916.684928]  [<ffffffff81308850>] xfs_alloc_fix_freelist+0x120/0x490
[1185916.684931]  [<ffffffff81342306>] ? xlog_regrant_write_log_space+0x1e6/0x590
[1185916.684934]  [<ffffffff81308c3c>] xfs_free_extent+0x7c/0xc0
[1185916.684938]  [<ffffffff81312aa5>] xfs_bmap_finish+0x165/0x1b0
[1185916.684942]  [<ffffffff81339065>] xfs_itruncate_finish+0x195/0x370
[1185916.684945]  [<ffffffff8135526e>] xfs_inactive+0x3be/0x4e0
[1185916.684948]  [<ffffffff8134f9f7>] ? xfs_trans_read_buf+0x217/0x410
[1185916.684951]  [<ffffffff813616bd>] xfs_fs_clear_inode+0x9d/0xe0
[1185916.684954]  [<ffffffff8114553e>] clear_inode+0x7e/0x100
[1185916.684957]  [<ffffffff81145cc6>] generic_delete_inode+0x186/0x1c0
[1185916.684959]  [<ffffffff81145d65>] generic_drop_inode+0x65/0x90
[1185916.684961]  [<ffffffff81144892>] iput+0x62/0x70
[1185916.684964]  [<ffffffff813471c9>] xlog_recover_process_one_iunlink+0x169/0x180
[1185916.684967]  [<ffffffff810830ca>] ? up+0x3a/0x50
[1185916.684969]  [<ffffffff81347287>] xlog_recover_process_iunlinks+0xa7/0x130
[1185916.684972]  [<ffffffff81347354>] xlog_recover_finish+0x44/0xd0
[1185916.684975]  [<ffffffff813403fc>] xfs_log_mount_finish+0x2c/0x40
[1185916.684978]  [<ffffffff8134b03a>] xfs_mountfs+0x48a/0x6f0
[1185916.684981]  [<ffffffff81356003>] ? kmem_zalloc+0x33/0x50
[1185916.684984]  [<ffffffff8134badb>] ? xfs_mru_cache_create+0x13b/0x170
[1185916.684987]  [<ffffffff813631b5>] xfs_fs_fill_super+0x245/0x3a0
[1185916.684990]  [<ffffffff8112e31c>] get_sb_bdev+0x17c/0x1e0
[1185916.684992]  [<ffffffff810f9a61>] ? kstrdup+0x41/0x70
[1185916.684995]  [<ffffffff81362f70>] ? xfs_fs_fill_super+0x0/0x3a0
[1185916.684998]  [<ffffffff813612f8>] xfs_fs_get_sb+0x18/0x20
[1185916.685000]  [<ffffffff8112cc9c>] vfs_kern_mount+0x5c/0xf0
[1185916.685002]  [<ffffffff8112cda3>] do_kern_mount+0x53/0x120
[1185916.685005]  [<ffffffff8114b80a>] do_mount+0x26a/0x8c0
[1185916.685008]  [<ffffffff8114bf1b>] sys_mount+0xbb/0xf0
[1185916.685011]  [<ffffffff8100c15b>] system_call_fastpath+0x16/0x1b

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
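Since this trace is said to differ slightly from the one filed in the bug, it may help to reduce each trace to its reliable frames before comparing them. In kernel call traces, frames prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the active call chain. The following is a minimal sketch, not an existing tool: the helper names `parse_frames` and `reliable_chain` are hypothetical, and the regular expression assumes the frame format shown in the trace above.

```python
import re

# One frame as printed in the trace above, for example
#   [<ffffffff813599d0>] xfs_buf_lock+0x60/0xd0
# or the "? "-prefixed form used for unverified stack addresses.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<func>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(trace_text):
    """Return (function_name, is_reliable) pairs in the order they appear."""
    return [
        (m.group("func"), m.group("guess") is None)
        for m in FRAME_RE.finditer(trace_text)
    ]


def reliable_chain(trace_text):
    """Keep only frames that are not marked with '?'."""
    return [func for func, reliable in parse_frames(trace_text) if reliable]


# Four frames copied from the trace in this message.
sample = """
[1185916.684877] [<ffffffff8135aad5>] ? _xfs_buf_find+0x145/0x280
[1185916.684879] [<ffffffff8108318c>] down+0x4c/0x50
[1185916.684882] [<ffffffff813599d0>] xfs_buf_lock+0x60/0xd0
[1185916.684884] [<ffffffff8135aad5>] _xfs_buf_find+0x145/0x280
"""

print(reliable_chain(sample))  # ['down', 'xfs_buf_lock', '_xfs_buf_find']
```

Running `reliable_chain` over two saved traces and diffing the resulting lists is one way to check whether they share the same blocking path; whether matching chains mean the same underlying bug is still a judgment call.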
* Re: bug #917 - deadlock on log recovery
  2012-03-22 17:34 bug #917 - deadlock on log recovery Kirill Malkin
@ 2012-03-30 16:06 ` Christoph Hellwig
  2012-03-30 16:44   ` Kirill Malkin
  0 siblings, 1 reply; 3+ messages in thread
From: Christoph Hellwig @ 2012-03-30 16:06 UTC
To: Kirill Malkin; +Cc: xfs-masters, xfs

On Thu, Mar 22, 2012 at 01:34:00PM -0400, Kirill Malkin wrote:
> Hi,
>
> I am wondering if someone had a chance to look at the bug #917. I
> filed it a couple of weeks ago, but haven't seen any action. We are
> running into it quite a lot, and the only way out of it is to reboot
> the OS and drop the log. Below is another stack trace that is slightly
> different from the one I filed, but apparently it is the same bug.
>
> Please let me know if you need any other input.

Can you reproduce this with a recent kernel? 2.6.32 is fairly old and a lot of things have changed in this area. I quickly looked over the trace and nothing obvious springs to mind.
* RE: bug #917 - deadlock on log recovery
  2012-03-30 16:06 ` Christoph Hellwig
@ 2012-03-30 16:44   ` Kirill Malkin
  0 siblings, 0 replies; 3+ messages in thread
From: Kirill Malkin @ 2012-03-30 16:44 UTC
To: Christoph Hellwig; +Cc: xfs-masters, xfs

Christoph -

Thank you for getting back to me.

The kernel I am using is not a vanilla kernel.org 2.6.32, but is part of the RHEL/CentOS 6 distribution, which has many bug fixes backported, at least up until 2.6.38 or so. Technically, it's their latest kernel.

The bug is very difficult to reproduce even on this kernel. It occurs while mounting a snapshot of a very large (40TB) filesystem that is in very active, continuous use. Once the filesystem snapshot is in that state, it is reproducible 100% of the time (i.e. on every mount), but it's not clear what pushes it there. Unfortunately, a kernel upgrade on that system is currently not possible.

Note that the lockup occurs during the trimming of the free list in xfs_alloc.c:xfs_alloc_fix_freelist when it's too long (look for the "Make the freelist shorter if it's too long" comment inside this function). Then, for some reason, the buffer gets double-locked inside xfs_btree_get_bufs, and the mount hangs forever. I suspect that we are not seeing this more frequently because free list trimming is not a typical occurrence during recovery.

I've looked through the patches to the XFS stack in the kernel.org git, and found virtually no changes to this particular area or references to something similar.

I can probably do more research into it, but would really appreciate some guidance:

- Would it help to obtain the metadata backup from that system?
- What could possibly cause a deadlock when the log recovery has really no concurrency?
- Would it help to debug this by somehow forcing free list trimming during the recovery?

Thanks again for your help.

Kirill
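On the question of how a deadlock can arise without concurrency: the trace shows the mount task sleeping in down() under xfs_buf_lock, which means the buffer lock is a semaphore, and a semaphore has no notion of an owner. A single thread that tries to take the same one twice will wait on itself. The sketch below illustrates only that general mechanism in user space; the `Buffer` class and `lock_buffer` helper are hypothetical stand-ins, not XFS code, and it makes no claim about which code path actually takes the lock twice here.

```python
import threading


class Buffer:
    """Toy stand-in for a locked metadata buffer (hypothetical, not XFS code)."""

    def __init__(self, blkno):
        self.blkno = blkno
        # Binary semaphore: not re-entrant, it does not track who holds it.
        self.sema = threading.Semaphore(1)


def lock_buffer(buf, timeout=0.1):
    """Try to take the buffer lock; give up after `timeout` seconds.

    The timeout exists only so this demo terminates. A plain blocking
    acquire would sleep forever on the second call, like the hung mount.
    """
    return buf.sema.acquire(timeout=timeout)


buf = Buffer(blkno=42)

first = lock_buffer(buf)   # lock is free, so this succeeds
second = lock_buffer(buf)  # same thread, same buffer: nobody can release it

print(first, second)  # True False
```

If the hang is of this kind, the useful question becomes where the first lock on that buffer was taken and why it was still held when the btree split asked for it again.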
end of thread

Thread overview: 3+ messages
2012-03-22 17:34 bug #917 - deadlock on log recovery Kirill Malkin
2012-03-30 16:06 ` Christoph Hellwig
2012-03-30 16:44   ` Kirill Malkin