* Re: VFS scalability git tree
From: Nick Piggin @ 2010-07-27  8:06 UTC
To: xfs; +Cc: linux-fsdevel

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> >
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> >
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
>
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
>
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.
>
> I found that performance is a little unstable, so I sync and echo 3 >
> drop_caches between each run. When it starts reclaiming memory, things
> get a bit more erratic (and XFS seemed to be almost livelocking for tens
> of seconds in inode reclaim).

So about this XFS livelock type thingy.
It looks like this, and happens periodically while running the above
fs_mark benchmark requiring reclaim of inodes:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
15  0   6900  31032    192 471852    0    0    28 183296 8520 46672  5 91  4  0
19  0   7044  22928    192 466712   96  144  1056 115586 8622 41695  3 96  1  0
19  0   7136  59884    192 471200  160   92  6768 34564  995  542  1 99  0  0
19  0   7244  17008    192 467860    0  104  2068 32953 1044  630  1 99  0  0
18  0   7244  43436    192 467324    0    0    12     0  817  405  0 100  0  0
18  0   7244  43684    192 467324    0    0     0     0  806  425  0 100  0  0
18  0   7244  43932    192 467324    0    0     0     0  808  403  0 100  0  0
18  0   7244  44924    192 467324    0    0     0     0  808  398  0 100  0  0
18  0   7244  45456    192 467324    0    0     0     0  809  409  0 100  0  0
18  0   7244  45472    192 467324    0    0     0     0  805  412  0 100  0  0
18  0   7244  46392    192 467324    0    0     0     0  807  401  0 100  0  0
18  0   7244  47012    192 467324    0    0     0     0  810  414  0 100  0  0
18  0   7244  47260    192 467324    0    0     0     0  806  396  0 100  0  0
18  0   7244  47752    192 467324    0    0     0     0  806  403  0 100  0  0
18  0   7244  48204    192 467324    0    0     0     0  810  409  0 100  0  0
18  0   7244  48608    192 467324    0    0     0     0  807  412  0 100  0  0
18  0   7244  48876    192 467324    0    0     0     0  805  406  0 100  0  0
18  0   7244  49000    192 467324    0    0     0     0  809  402  0 100  0  0
18  0   7244  49408    192 467324    0    0     0     0  807  396  0 100  0  0
18  0   7244  49908    192 467324    0    0     0     0  809  406  0 100  0  0
18  0   7244  50032    192 467324    0    0     0     0  805  404  0 100  0  0
18  0   7244  50032    192 467324    0    0     0     0  805  406  0 100  0  0
19  0   7244  73436    192 467324    0    0     0  6340  808  384  0 100  0  0
20  0   7244 490220    192 467324    0    0     0  8411  830  389  0 100  0  0
18  0   7244 620092    192 467324    0    0     0     4  809  435  0 100  0  0
18  0   7244 620344    192 467324    0    0     0     0  806  430  0 100  0  0
16  0   7244 682620    192 467324    0    0    44    80  890  326  0 100  0  0
12  0   7244 604464    192 479308   76    0 11716 73555 2242 14318  2 94  4  0
12  0   7244 556700    192 483488    0    0  4276 77680 6576 92285  1 97  2  0
17  0   7244 502508    192 485456    0    0  2092 98368 6308 91919  1 96  4  0
11  0   7244 416500    192 487116    0    0  1760 114844 7414 63025  2 96  2  0

Nothing much happening except 100% system time for seconds at a time
(length of time varies). This is on a ramdisk, so it isn't waiting
for IO. During this time, lots of things are contending on the lock:

    60.37%  fs_mark  [kernel.kallsyms]  [k] __write_lock_failed
     4.30%  kswapd0  [kernel.kallsyms]  [k] __write_lock_failed
     3.70%  fs_mark  [kernel.kallsyms]  [k] try_wait_for_completion
     3.59%  fs_mark  [kernel.kallsyms]  [k] _raw_write_lock
     3.46%  kswapd1  [kernel.kallsyms]  [k] __write_lock_failed
             |
             --- __write_lock_failed
                |
                |--99.92%-- xfs_inode_ag_walk
                |           xfs_inode_ag_iterator
                |           xfs_reclaim_inode_shrink
                |           shrink_slab
                |           shrink_zone
                |           balance_pgdat
                |           kswapd
                |           kthread
                |           kernel_thread_helper
                 --0.08%-- [...]
     3.02%  fs_mark  [kernel.kallsyms]  [k] _raw_spin_lock
     1.82%  fs_mark  [kernel.kallsyms]  [k] _xfs_buf_find
     1.16%  fs_mark  [kernel.kallsyms]  [k] memcpy
     0.86%  fs_mark  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.75%  fs_mark  [kernel.kallsyms]  [k] xfs_log_commit_cil
             |
             --- xfs_log_commit_cil
                 _xfs_trans_commit
                |
                |--60.00%-- xfs_remove
                |           xfs_vn_unlink
                |           vfs_unlink
                |           do_unlinkat
                |           sys_unlink

I'm not sure if there was a long-running read locker in there causing
all the write lockers to fail, or if they are just running into one
another. But anyway, I hacked the following patch which seemed to
improve that behaviour. I haven't run any throughput numbers on it yet,
but I could if you're interested (and it's not completely broken!)

Batch pag_ici_lock acquisition on the reclaim path, and also skip
inodes that appear to be busy to improve locking efficiency.
Index: source/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- source.orig/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:12:11.000000000 +1000
+++ source/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:58:59.000000000 +1000
@@ -87,6 +87,91 @@ xfs_inode_ag_lookup(
 	return ip;
 }
 
+#define RECLAIM_BATCH_SIZE	32
+STATIC int
+xfs_inode_ag_walk_reclaim(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	int			(*execute)(struct xfs_inode *ip,
+					   struct xfs_perag *pag, int flags),
+	int			flags,
+	int			tag,
+	int			exclusive,
+	int			*nr_to_scan)
+{
+	uint32_t	first_index;
+	int		last_error = 0;
+	int		skipped;
+	xfs_inode_t	*batch[RECLAIM_BATCH_SIZE];
+	int		batchnr;
+	int		i;
+
+	BUG_ON(!exclusive);
+
+restart:
+	skipped = 0;
+	first_index = 0;
+next_batch:
+	batchnr = 0;
+	/* fill the batch */
+	write_lock(&pag->pag_ici_lock);
+	do {
+		xfs_inode_t *ip;
+
+		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
+		if (!ip)
+			break;
+		if (!(flags & SYNC_WAIT) &&
+		    (!xfs_iflock_free(ip) ||
+		     __xfs_iflags_test(ip, XFS_IRECLAIM)))
+			continue;
+
+		/*
+		 * The radix tree lock here protects a thread in xfs_iget from
+		 * racing with us starting reclaim on the inode. Once we have
+		 * the XFS_IRECLAIM flag set it will not touch us.
+		 */
+		spin_lock(&ip->i_flags_lock);
+		ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
+			/* ignore as it is already under reclaim */
+			spin_unlock(&ip->i_flags_lock);
+			continue;
+		}
+		__xfs_iflags_set(ip, XFS_IRECLAIM);
+		spin_unlock(&ip->i_flags_lock);
+
+		batch[batchnr++] = ip;
+	} while ((*nr_to_scan)-- && batchnr < RECLAIM_BATCH_SIZE);
+	write_unlock(&pag->pag_ici_lock);
+
+	for (i = 0; i < batchnr; i++) {
+		int error = 0;
+		xfs_inode_t *ip = batch[i];
+
+		/* execute doesn't require pag->pag_ici_lock */
+		error = execute(ip, pag, flags);
+		if (error == EAGAIN) {
+			skipped++;
+			continue;
+		}
+		if (error)
+			last_error = error;
+
+		/* bail out if the filesystem is corrupted. */
+		if (error == EFSCORRUPTED)
+			break;
+	}
+	if (batchnr == RECLAIM_BATCH_SIZE)
+		goto next_batch;
+
+	if (0 && skipped) {
+		delay(1);
+		goto restart;
+	}
+	return last_error;
+}
+
 STATIC int
 xfs_inode_ag_walk(
 	struct xfs_mount	*mp,
@@ -113,6 +198,7 @@ restart:
 			write_lock(&pag->pag_ici_lock);
 		else
 			read_lock(&pag->pag_ici_lock);
+
 		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 		if (!ip) {
 			if (exclusive)
@@ -198,8 +284,12 @@ xfs_inode_ag_iterator(
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
 	ag = 0;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+		if (tag == XFS_ICI_RECLAIM_TAG)
+			error = xfs_inode_ag_walk_reclaim(mp, pag, execute,
+					flags, tag, exclusive, &nr);
+		else
+			error = xfs_inode_ag_walk(mp, pag, execute,
+					flags, tag, exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -789,23 +879,6 @@ xfs_reclaim_inode(
 {
 	int	error = 0;
 
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode. Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* ignore as it is already under reclaim */
-		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
-		return 0;
-	}
-	__xfs_iflags_set(ip, XFS_IRECLAIM);
-	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
Index: source/fs/xfs/xfs_inode.h
===================================================================
--- source.orig/fs/xfs/xfs_inode.h	2010-07-26 21:10:33.000000000 +1000
+++ source/fs/xfs/xfs_inode.h	2010-07-26 21:11:59.000000000 +1000
@@ -349,6 +349,11 @@ static inline int xfs_iflock_nowait(xfs_
 	return try_wait_for_completion(&ip->i_flush);
 }
 
+static inline int xfs_iflock_free(xfs_inode_t *ip)
+{
+	return completion_done(&ip->i_flush);
+}
+
 static inline void xfs_ifunlock(xfs_inode_t *ip)
 {
 	complete(&ip->i_flush);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree)
From: Nick Piggin @ 2010-07-27 11:36 UTC
To: xfs

On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > >
> > > > Branch vfs-scale-working
> > >
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > >
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> >
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> >
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> >
> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run. When it starts reclaiming memory, things
> > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > of seconds in inode reclaim).

On this same system, same setup (vanilla kernel with sha given below),
I have now twice reproduced a complete hang in XFS. I can give more
information, test patches or options etc if required.
setup.sh looks like this:

#!/bin/bash
modprobe rd rd_size=$[2*1024*1024]
dd if=/dev/zero of=/dev/ram0 bs=4K
mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt

The 'dd' is required to ensure the rd driver does not allocate pages
during IO (which can lead to out of memory deadlocks). Running just
involves changing into the mnt directory and

while true
do
sync
echo 3 > /proc/sys/vm/drop_caches
../dbench -c ../loadfiles/client.txt -t20 8
rm -rf clients
done

And wait for it to hang (happened in < 5 minutes here).

Sysrq of blocked tasks looks like this:

Linux version 2.6.35-rc5-00176-gcd5b8f8 (npiggin@amd) (gcc version 4.4.4 (Debian 4.4.4-7) ) #348 SMP Mon Jul 26 22:20:32 EST 2010
brd: module loaded
Enabling EXPERIMENTAL delayed logging feature - use at your own risk.
XFS mounting filesystem ram0
Ending clean XFS mount for filesystem: ram0
SysRq : Show Blocked State
  task                        PC stack   pid father
flush-1:0     D 00000000fffff8fd     0  2799      2 0x00000000
 ffff8800701ff690 0000000000000046 ffff880000000000 ffff8800701fffd8
 ffff8800071531f0 00000000000122c0 0000000000004000 ffff8800701fffd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800071531f0
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0
 [<ffffffff8123275b>] xfs_iomap_write_allocate+0x25b/0x3c0
 [<ffffffff81232abe>] xfs_iomap+0x1fe/0x270
 [<ffffffff8124aaf7>] xfs_map_blocks+0x37/0x40
 [<ffffffff8107b8e1>] ? find_lock_page+0x21/0x70
 [<ffffffff8124bdca>] xfs_page_state_convert+0x35a/0x690
 [<ffffffff8107c38a>] ? find_or_create_page+0x3a/0xa0
 [<ffffffff8124c3a6>] xfs_vm_writepage+0x76/0x110
 [<ffffffff8108eb00>] ? __dec_zone_page_state+0x30/0x40
 [<ffffffff81082c32>] __writepage+0x12/0x40
 [<ffffffff81083367>] write_cache_pages+0x1c7/0x3d0
 [<ffffffff81082c20>] ? __writepage+0x0/0x40
 [<ffffffff8108358f>] generic_writepages+0x1f/0x30
 [<ffffffff8124c2fc>] xfs_vm_writepages+0x4c/0x60
 [<ffffffff810835bc>] do_writepages+0x1c/0x40
 [<ffffffff810d606e>] writeback_single_inode+0xce/0x3b0
 [<ffffffff810d6774>] writeback_sb_inodes+0x174/0x260
 [<ffffffff810d701f>] writeback_inodes_wb+0x8f/0x180
 [<ffffffff810d733b>] wb_writeback+0x22b/0x290
 [<ffffffff810d7536>] wb_do_writeback+0x196/0x1a0
 [<ffffffff810d7583>] bdi_writeback_task+0x43/0x120
 [<ffffffff81050f46>] ? bit_waitqueue+0x16/0xe0
 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0
 [<ffffffff8108fd6c>] bdi_start_fn+0x6c/0xe0
 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0
 [<ffffffff81050bee>] kthread+0x8e/0xa0
 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10
 [<ffffffff81050b60>] ? kthread+0x0/0xa0
 [<ffffffff81003010>] ? kernel_thread_helper+0x0/0x10
xfssyncd/ram0 D 00000000fffff045     0  2807      2 0x00000000
 ffff880007635d00 0000000000000046 ffff880000000000 ffff880007635fd8
 ffff880007abd370 00000000000122c0 0000000000004000 ffff880007635fd8
 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abd370
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812572b7>] ? xfs_inode_ag_iterator+0x57/0xd0
 [<ffffffff81256c0a>] xfs_commit_dummy_trans+0x4a/0xe0
 [<ffffffff81257454>] xfs_sync_worker+0x74/0x80
 [<ffffffff81256b2a>] xfssyncd+0x13a/0x1d0
 [<ffffffff812569f0>] ? xfssyncd+0x0/0x1d0
 [<ffffffff81050bee>] kthread+0x8e/0xa0
 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10
 [<ffffffff81050b60>] ? kthread+0x0/0xa0
 [<ffffffff81003010>] ? kernel_thread_helper+0x0/0x10
dbench        D 00000000ffffefc6     0  2975   2974 0x00000000
 ffff88005ecd1ae8 0000000000000082 ffff880000000000 ffff88005ecd1fd8
 ffff8800079ce250 00000000000122c0 0000000000004000 ffff88005ecd1fd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800079ce250
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff81247920>] xfs_create+0x170/0x5d0
 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140
 [<ffffffff810d0724>] ? mntput_no_expire+0x24/0xe0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000fffff045     0  2976   2974 0x00000000
 ffff880060c11b18 0000000000000086 ffff880000000000 ffff880060c11fd8
 ffff880007668630 00000000000122c0 0000000000004000 ffff880060c11fd8
 0000000000004000 00000000000122c0 ffffffff81793020 ffff880007668630
Call Trace:
 [<ffffffff81236320>] xlog_grant_log_space+0x280/0x3d0
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812412c8>] xfs_rename+0x138/0x630
 [<ffffffff810c036e>] ? exec_permission+0x3e/0x70
 [<ffffffff81253111>] xfs_vn_rename+0x61/0x70
 [<ffffffff810c1b4e>] vfs_rename+0x41e/0x480
 [<ffffffff810c3bd6>] sys_renameat+0x236/0x270
 [<ffffffff8122551d>] ? xfs_dir2_sf_getdents+0x21d/0x390
 [<ffffffff810c6800>] ? filldir+0x0/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810b8a4a>] ? fput+0x1aa/0x220
 [<ffffffff810c3c26>] sys_rename+0x16/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed4     0  2977   2974 0x00000000
 ffff88000873fa88 0000000000000082 ffff88000873fac8 ffff88000873ffd8
 ffff880007669710 00000000000122c0 0000000000004000 ffff88000873ffd8
 0000000000004000 00000000000122c0 ffff88007f2e1790 ffff880007669710
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8123dbf6>] ? xfs_icsb_disable_counter+0x16/0xa0
 [<ffffffff812445bb>] ? _xfs_trans_bjoin+0x4b/0x60
 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0
 [<ffffffff8123dcb6>] ? xfs_icsb_balance_counter_locked+0x36/0xc0
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0
 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320
 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0
 [<ffffffff8122fa86>] xfs_iunlink+0x46/0x160
 [<ffffffff81253d21>] ? xfs_mark_inode_dirty_sync+0x21/0x30
 [<ffffffff81253dcf>] ? xfs_ichgtime+0x9f/0xc0
 [<ffffffff81245677>] xfs_droplink+0x57/0x70
 [<ffffffff8124751a>] xfs_remove+0x28a/0x370
 [<ffffffff81253443>] xfs_vn_unlink+0x43/0x90
 [<ffffffff810c161b>] vfs_unlink+0x8b/0x110
 [<ffffffff810c0e20>] ? lookup_hash+0x30/0x40
 [<ffffffff810c3db3>] do_unlinkat+0x183/0x1c0
 [<ffffffff810bb3f1>] ? sys_newstat+0x31/0x50
 [<ffffffff810c3e01>] sys_unlink+0x11/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefa8     0  2978   2974 0x00000000
 ffff880040da7c38 0000000000000082 ffff880000000000 ffff880040da7fd8
 ffff880007668090 00000000000122c0 0000000000004000 ffff880040da7fd8
 0000000000004000 00000000000122c0 ffff88012ff78b90 ffff880007668090
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0
 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20
 [<ffffffff810cdb94>] notify_change+0x104/0x2e0
 [<ffffffff810db270>] utimes_common+0xd0/0x1a0
 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40
 [<ffffffff810db416>] do_utimes+0xd6/0xf0
 [<ffffffff810db5ae>] sys_utime+0x1e/0x70
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefd3     0  2979   2974 0x00000000
 ffff8800072c7ae8 0000000000000082 ffff8800072c7b78 ffff8800072c7fd8
 ffff880007669170 00000000000122c0 0000000000004000 ffff8800072c7fd8
 0000000000004000 00000000000122c0 ffff88012ff785f0 ffff880007669170
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0
 [<ffffffff81247920>] xfs_create+0x170/0x5d0
 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed0     0  2980   2974 0x00000000
 ffff88003dd91688 0000000000000082 0000000000000000 ffff88003dd91fd8
 ffff880007abc290 00000000000122c0 0000000000004000 ffff88003dd91fd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff880007abc290
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0
 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320
 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0
 [<ffffffff812292d9>] xfs_ialloc_read_agi+0x29/0x90
 [<ffffffff8122957b>] xfs_ialloc_ag_select+0x12b/0x260
 [<ffffffff8122abc7>] xfs_dialloc+0x3d7/0x860
 [<ffffffff8124acc8>] ? __xfs_get_blocks+0x1c8/0x210
 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0
 [<ffffffff810ddb9e>] ? unmap_underlying_metadata+0xe/0x50
 [<ffffffff8122ef4d>] xfs_ialloc+0x5d/0x690
 [<ffffffff8124a031>] ? kmem_zone_alloc+0x91/0xe0
 [<ffffffff8124570d>] xfs_dir_ialloc+0x7d/0x320
 [<ffffffff81236552>] ? xfs_log_reserve+0xe2/0xf0
 [<ffffffff81247b83>] xfs_create+0x3d3/0x5d0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed0     0  2981   2974 0x00000000
 ffff88005b79f618 0000000000000086 ffff88005b79f598 ffff88005b79ffd8
 ffff880007abcdd0 00000000000122c0 0000000000004000 ffff88005b79ffd8
 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abcdd0
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8122cd44>] ? xfs_iext_bno_to_ext+0x84/0x160
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff81244a40>] xfs_trans_get_buf+0xc0/0xe0
 [<ffffffff8121ac3f>] xfs_da_do_buf+0x3df/0x6d0
 [<ffffffff8121b0c5>] xfs_da_get_buf+0x25/0x30
 [<ffffffff81220926>] ? xfs_dir2_data_init+0x46/0xe0
 [<ffffffff81220926>] xfs_dir2_data_init+0x46/0xe0
 [<ffffffff8121e829>] xfs_dir2_sf_to_block+0xb9/0x5a0
 [<ffffffff8105106a>] ? wake_up_bit+0x2a/0x40
 [<ffffffff81226a78>] xfs_dir2_sf_addname+0x418/0x5c0
 [<ffffffff8122f3fb>] ? xfs_ialloc+0x50b/0x690
 [<ffffffff8121e61c>] xfs_dir_createname+0x14c/0x1a0
 [<ffffffff81247bf9>] xfs_create+0x449/0x5d0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefbf     0  2982   2974 0x00000000
 ffff88005b7f9c38 0000000000000082 ffff880000000000 ffff88005b7f9fd8
 ffff880007698150 00000000000122c0 0000000000004000 ffff88005b7f9fd8
 0000000000004000 00000000000122c0 ffff88012ff79130 ffff880007698150
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0
 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20
 [<ffffffff810cdb94>] notify_change+0x104/0x2e0
 [<ffffffff810db270>] utimes_common+0xd0/0x1a0
 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40
 [<ffffffff810db416>] do_utimes+0xd6/0xf0
 [<ffffffff810db5ae>] sys_utime+0x1e/0x70
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
* Re: XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree)
From: Dave Chinner @ 2010-07-27 13:30 UTC
To: Nick Piggin; +Cc: xfs

On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On this same system, same setup (vanilla kernel with sha given below),
> I have now twice reproduced a complete hang in XFS. I can give more
> information, test patches or options etc if required.
>
> setup.sh looks like this:
> #!/bin/bash
> modprobe rd rd_size=$[2*1024*1024]
> dd if=/dev/zero of=/dev/ram0 bs=4K
> mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
>
> The 'dd' is required to ensure rd driver does not allocate pages
> during IO (which can lead to out of memory deadlocks). Running just
> involves changing into mnt directory and
>
> while true
> do
> sync
> echo 3 > /proc/sys/vm/drop_caches
> ../dbench -c ../loadfiles/client.txt -t20 8
> rm -rf clients
> done
>
> And wait for it to hang (happened in < 5 minutes here)
....
> Call Trace:
> [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0

It's waiting on log space to be freed up. Either there's an
accounting problem (possible), or you've got an xfslogd/xfsaild
spinning and not making progress completing log IOs or pushing the
tail of the log. I'll see if I can reproduce it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: XFS hang in xlog_grant_log_space
From: Dave Chinner @ 2010-07-27 14:58 UTC
To: Nick Piggin; +Cc: xfs

On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > On this same system, same setup (vanilla kernel with sha given below),
> > I have now twice reproduced a complete hang in XFS. I can give more
> > information, test patches or options etc if required.
> >
> > setup.sh looks like this:
> > #!/bin/bash
> > modprobe rd rd_size=$[2*1024*1024]
> > dd if=/dev/zero of=/dev/ram0 bs=4K
> > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
> >
> > The 'dd' is required to ensure rd driver does not allocate pages
> > during IO (which can lead to out of memory deadlocks). Running just
> > involves changing into mnt directory and
> >
> > while true
> > do
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> > ../dbench -c ../loadfiles/client.txt -t20 8
> > rm -rf clients
> > done
> >
> > And wait for it to hang (happened in < 5 minutes here)
> ....
> > Call Trace:
> > [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
>
> It's waiting on log space to be freed up. Either there's an
> accounting problem (possible), or you've got an xfslogd/xfsaild
> spinning and not making progress completing log IOs or pushing the
> tail of the log. I'll see if I can reproduce it.

Ok, I've just reproduced it.
From some tracing:

touch-3340 [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

The key part here is this:

	curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

This says the tail of the log is roughly 62MB behind the head, i.e.
the log is full and we are waiting for tail pushing to write the
item holding the tail in place to disk so it can then be moved
forward. That's better than an accounting problem, at least.

So what is holding the tail in place? The first item on the AIL
appears to be:

xfsaild/ram0-2997 [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock

A stale buffer. Given that the next objects show this trace:

xfsaild/ram0-2997 [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock
xfsaild/ram0-2997 [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find
xfsaild/ram0-2997 [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf

we see the next item on the AIL is an inode, but the trace is
followed by a release on the original buffer, which tells me the
inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the
inode buffer out. That results in xfs_inode_item_pushbuf() being
called, and that tries to lock the inode buffer to flush it.
xfs_buf_rele is called if the trylock on the buffer fails.

IOWs, this looks to be another problem with inode cluster freeing.

Ok, so we can't flush the buffer because it is locked.
Why is it locked? Well, that is unclear as yet. None of the blocked
processes should be holding an inode buffer locked, and a stale buffer
should be unlocked during transaction commit and not live longer than
the log IO that writes the transaction to disk. That is, it should not
get locked again before everything is freed up.

That's as much as I can get from post-mortem analysis - I need to
capture a trace that spans the lockup to catch what happens to the
buffer that we are hung on. That will have to wait until the
morning....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: XFS hang in xlog_grant_log_space 2010-07-27 14:58 ` XFS hang in xlog_grant_log_space Dave Chinner @ 2010-07-28 13:17 ` Dave Chinner 2010-07-29 14:05 ` Nick Piggin 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2010-07-28 13:17 UTC (permalink / raw) To: Nick Piggin; +Cc: xfs On Wed, Jul 28, 2010 at 12:58:09AM +1000, Dave Chinner wrote: > On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote: > > On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote: > > > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > > > On this same system, same setup (vanilla kernel with sha given below), > > > I have now twice reproduced a complete hang in XFS. I can give more > > > information, test patches or options etc if required. > > > > > > setup.sh looks like this: > > > #!/bin/bash > > > modprobe rd rd_size=$[2*1024*1024] > > > dd if=/dev/zero of=/dev/ram0 bs=4K > > > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0 > > > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt > > > > > > The 'dd' is required to ensure rd driver does not allocate pages > > > during IO (which can lead to out of memory deadlocks). Running just > > > involves changing into mnt directory and > > > > > > while true > > > do > > > sync > > > echo 3 > /proc/sys/vm/drop_caches > > > ../dbench -c ../loadfiles/client.txt -t20 8 > > > rm -rf clients > > > done > > > > > > And wait for it to hang (happened in < 5 minutes here) > > .... > > > Call Trace: > > > [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 > > > > It's waiting on log space to be freed up. Either there's an > > accounting problem (possible), or you've got an xfslogd/xfsaild > > spinning and not making progress completing log IOs or pushing the > > tail of the log. I'll see if I can reproduce it. > > Ok, I've just reproduced it. 
From some tracing: > > touch-3340 [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532 > > The key part here is this: > > curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532 > > This says the tail of the log is roughly 62MB behind the head. i.e. > the log is full and we are waiting for tail pushing to write the > item holding the tail in place to disk so it can then be moved > forward. That's better than an accounting problem, at least. > > So what is holding the tail in place? The first item on the AIL > appears to be: > > xfsaild/ram0-2997 [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock > > A stale buffer. Given that the next objects show this trace: > > xfsaild/ram0-2997 [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock > xfsaild/ram0-2997 [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find > xfsaild/ram0-2997 [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf > > we see the next item on the AIL is an inode but the trace is > followed by a release on the original buffer, which tells me the > inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the > inode buffer out. That results in xfs_inode_item_pushbuf() being > called, and that tries to lock the inode buffer to flush it. > xfs_buf_rele is called if the trylock on the buffer fails. > > IOWs, this looks to be another problem with inode cluster freeing. 
> > Ok, so we can't flush the buffer because it is locked. Why is it > locked? Well, that is unclear as yet. None of the blocked processes > should be holding an inode buffer locked, and a stale buffer should > be unlocked during transaction commit and not live longer than > the log IO that writes the transaction to disk. That is, it should > not get locked again before everything is freed up. > > That's as much as I can get from post-mortem analysis - I need to > capture a trace that spans the lockup to catch what happens > to the buffer that we are hung on. That will have to wait until the > morning.... Ok, so I got a trace of all the inode and buffer locking and log item operations, and the reason the hang has occurred can be seen here: dbench-3084 [007] 1877156.395784: xfs_buf_item_unlock_stale: dev 1:0 bno 0x80040 len 0x2000 hold 2 pincount 0 lock 0 flags |ASYNC|DONE|STALE|PAGE_CACHE recur 1 refcount 1 bliflags |STALE|INODE_ALLOC|STALE_INODE lidesc 0x(null) liflags IN_AIL The key point in this trace is that when we are unlocking a stale buffer during transaction commit - we don't actually unlock it. What we do is: /* * If the buf item is marked stale, then don't do anything. We'll * unlock the buffer and free the buf item when the buffer is unpinned * for the last time. */ if (bip->bli_flags & XFS_BLI_STALE) { trace_xfs_buf_item_unlock_stale(bip); ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); if (!aborted) { atomic_dec(&bip->bli_refcount); return; } } But from the above trace it can be seen that the buffer pincount is zero. Hence it will never get an unpin callback, and hence never get unlocked. 
As a result, this process here is the one that is stuck on the buffer: dbench D 0000000000000007 0 3084 1 0x00000004 ffff880104e07608 0000000000000086 ffff880104e075b8 0000000000014000 ffff880104e07fd8 0000000000014000 ffff880104e07fd8 ffff880104c1d770 0000000000014000 0000000000014000 ffff880104e07fd8 0000000000014000 Call Trace: [<ffffffff817e59cd>] schedule_timeout+0x1ed/0x2c0 [<ffffffff810de9de>] ? ring_buffer_lock_reserve+0x9e/0x160 [<ffffffff817e690e>] __down+0x7e/0xc0 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290 [<ffffffff810a05d0>] down+0x40/0x50 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290 [<ffffffff812f314d>] xfs_buf_lock+0x4d/0x110 [<ffffffff812f41d5>] _xfs_buf_find+0x145/0x290 [<ffffffff812f4380>] xfs_buf_get+0x60/0x1c0 [<ffffffff812ea8f0>] xfs_trans_get_buf+0xe0/0x180 [<ffffffff812ccdab>] xfs_ialloc_inode_init+0xcb/0x1c0 [<ffffffff812cdaf9>] xfs_ialloc_ag_alloc+0x179/0x4a0 [<ffffffff812cdeff>] xfs_dialloc+0xdf/0x870 [<ffffffff8105ec88>] ? pvclock_clocksource_read+0x58/0xd0 [<ffffffff812d2105>] xfs_ialloc+0x65/0x6b0 [<ffffffff812eb032>] xfs_dir_ialloc+0x82/0x2d0 [<ffffffff8128ce83>] ? ftrace_raw_event_xfs_lock_class+0xd3/0xe0 [<ffffffff812ecad7>] ? xfs_create+0x1a7/0x690 [<ffffffff812ecd37>] xfs_create+0x407/0x690 [<ffffffff812f9f97>] xfs_vn_mknod+0xa7/0x1c0 [<ffffffff812fa0e0>] xfs_vn_create+0x10/0x20 [<ffffffff8114ca6c>] vfs_create+0xac/0xd0 [<ffffffff8114d6ec>] do_last+0x51c/0x620 [<ffffffff8114f6d8>] do_filp_open+0x228/0x640 [<ffffffff812c1f18>] ? xfs_dir2_block_getdents+0x218/0x220 [<ffffffff8115a62a>] ? alloc_fd+0x10a/0x150 [<ffffffff8113f919>] do_sys_open+0x69/0x140 [<ffffffff8113fa30>] sys_open+0x20/0x30 [<ffffffff81035032>] system_call_fastpath+0x16/0x1b It is trying to allocate a new inode chunk on disk, which happens to be the one we just removed and staled. Now to find out why the pin count on the buffer is wrong. ..... Now it makes no sense - the buf item pin trace is there directly before the above unlock stale trace. 
The buf item pinning increments both the buffer pin count and the buf item refcount, neither of which are reflected in the unlock stale trace. From the trace analysis I did: process 83 84 85 86 87 ..... get trans_read stale format pin unlock committed unpin stale free trans_get init format pin item unlock get trans_read unlock get trans_read item unlock **** pincount goes from 1 to 0 here without any unpin traces **** get trans_read unlock get trans_read item unlock (repeat 1x) get trans_read unlock get trans_read stale format pin **** no committed/unlock/free trace for this transaction **** **** hold count goes from 2 to 1 without any rele traces **** **** is a new buffer allocated here without the staled buffer being committed? **** trans_get init format pin item unlock get trans_read unlock get trans_read item unlock (repeat 9x) get trans_read unlock get trans_read stale format pin **** pin count does not increment! **** buf lock from xfs_buf_find => hangs **** buffer that was found is locked, not pinned **** Something very strange is happening, and to make matters worse I cannot reproduce it with a debug kernel (ran for 3 hours without failing). Hence it smells like a race condition somewhere. I've reproduced it without delayed logging, so it is not directly related to that functionality. I've seen this warning: Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117 Which indicates we failed to mark an inode stale when freeing an inode cluster, but I think I've fixed that and the problem still shows up. It's possible the last version didn't fix it, but.... Now I've got the ag iterator rotor patch in place as well and possibly a different version of the cluster free fix to what I previously tested and it's now been running for almost half an hour. I can't say yet whether I've fixed the bug or just changed the timing enough to avoid it. I'll leave this test running overnight and redo individual patch testing tomorrow. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com
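Editorial note: the deferred-unlock rule in the xfs_buf_item_unlock() excerpt above can be reduced to a toy model that shows why a stale buffer whose pincount is already zero at commit time stays locked forever. This is an illustrative sketch with invented names, not XFS code:

```python
class StaleBuf:
    """Toy model of the XFS_BLI_STALE buffer-unlock protocol."""
    def __init__(self):
        self.locked = False
        self.pincount = 0

    def pin(self):
        self.pincount += 1

    def unpin(self):
        # The final unpin of a stale buffer is what actually unlocks it.
        assert self.pincount > 0
        self.pincount -= 1
        if self.pincount == 0:
            self.locked = False

    def unlock_stale(self):
        # Transaction-commit path for a stale buffer: do NOT unlock here;
        # rely on a future unpin to drop the lock.  Returns True only if
        # an unpin is still outstanding, i.e. someone will unlock us later.
        return self.pincount > 0

# Healthy case: pinned once, commit defers the unlock, unpin releases it.
b = StaleBuf(); b.locked = True; b.pin()
assert b.unlock_stale()
b.unpin()
assert not b.locked

# The traced case: pincount already 0 at unlock-stale time.  No unpin
# callback will ever arrive, so the buffer stays locked permanently.
b2 = StaleBuf(); b2.locked = True
assert not b2.unlock_stale()
assert b2.locked
```

In the traced hang that "someone will unlock us later" assumption was violated, so the next _xfs_buf_find() on the same inode chunk blocked on the buffer lock indefinitely.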
* Re: XFS hang in xlog_grant_log_space 2010-07-28 13:17 ` Dave Chinner @ 2010-07-29 14:05 ` Nick Piggin 2010-07-29 22:56 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Nick Piggin @ 2010-07-29 14:05 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Piggin, xfs On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote: > Something very strange is happening, and to make matters worse I > cannot reproduce it with a debug kernel (ran for 3 hours without > failing). Hence it smells like a race condition somewhere. > > I've reproduced it without delayed logging, so it is not directly > related to that functionality. > > I've seen this warning: > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117 > > Which indicates we failed to mark an inode stale when freeing an > inode cluster, but I think I've fixed that and the problem still > shows up. It's possible the last version didn't fix it, but.... I've seen that one a couple of times too. Keeps coming back each time you echo 3 > /proc/sys/vm/drop_caches :) > Now I've got the ag iterator rotor patch in place as well and > possibly a different version of the cluster free fix to what I > previously tested and it's now been running for almost half an hour. > I can't say yet whether I've fixed the bug or just changed the > timing enough to avoid it. I'll leave this test running overnight > and redo individual patch testing tomorrow. I reproduced it with fsstress now too. Any patches I could test for you just let me know. 
* Re: XFS hang in xlog_grant_log_space 2010-07-29 14:05 ` Nick Piggin @ 2010-07-29 22:56 ` Dave Chinner 2010-07-30 3:59 ` Nick Piggin 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2010-07-29 22:56 UTC (permalink / raw) To: Nick Piggin; +Cc: Nick Piggin, xfs On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote: > On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote: > > Something very strange is happening, and to make matters worse I > > cannot reproduce it with a debug kernel (ran for 3 hours without > > failing). Hence it smells like a race condition somewhere. > > > > I've reproduced it without delayed logging, so it is not directly > > related to that functionality. > > > > I've seen this warning: > > > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117 > > > > Which indicates we failed to mark an inode stale when freeing an > > inode cluster, but I think I've fixed that and the problem still > > shows up. It's possible the last version didn't fix it, but.... > > I've seen that one a couple of times too. Keeps coming back each > time you echo 3 > /proc/sys/vm/drop_caches :) Yup - it's an unflushable inode that is pinning the tail of the log, hence causing the log space hangs. > > Now I've got the ag iterator rotor patch in place as well and > > possibly a different version of the cluster free fix to what I > > previously tested and it's now been running for almost half an hour. > > I can't say yet whether I've fixed the bug or just changed the > > timing enough to avoid it. I'll leave this test running overnight > > and redo individual patch testing tomorrow. > > I reproduced it with fsstress now too. Any patches I could test > for you just let me know. You should see them in a few minutes ;) Cheers, Dave. 
-- Dave Chinner david@fromorbit.com
* Re: XFS hang in xlog_grant_log_space 2010-07-29 22:56 ` Dave Chinner @ 2010-07-30 3:59 ` Nick Piggin 0 siblings, 0 replies; 10+ messages in thread From: Nick Piggin @ 2010-07-30 3:59 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, xfs On Fri, Jul 30, 2010 at 08:56:58AM +1000, Dave Chinner wrote: > On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote: > > On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote: > > > Something very strange is happening, and to make matters worse I > > > cannot reproduce it with a debug kernel (ran for 3 hours without > > > failing). Hence it smells like a race condition somewhere. > > > > > > I've reproduced it without delayed logging, so it is not directly > > > related to that functionality. > > > > > > I've seen this warning: > > > > > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117 > > > > > > Which indicates we failed to mark an inode stale when freeing an > > > inode cluster, but I think I've fixed that and the problem still > > > shows up. It's possible the last version didn't fix it, but.... > > > > I've seen that one a couple of times too. Keeps coming back each > > time you echo 3 > /proc/sys/vm/drop_caches :) > > Yup - it's an unflushable inode that is pinning the tail of the log, > hence causing the log space hangs. > > > > Now I've got the ag iterator rotor patch in place as well and > > > possibly a different version of the cluster free fix to what I > > > previously tested and it's now been running for almost half an hour. > > > I can't say yet whether I've fixed the bug or just changed the > > > timing enough to avoid it. I'll leave this test running overnight > > > and redo individual patch testing tomorrow. > > > > I reproduced it with fsstress now too. Any patches I could test > > for you just let me know. > > You should see them in a few minutes ;) It's certainly not locking up like it used to... Thanks! 
* Re: VFS scalability git tree 2010-07-27 8:06 ` VFS scalability git tree Nick Piggin 2010-07-27 11:36 ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin @ 2010-07-28 12:57 ` Dave Chinner 2010-07-29 14:03 ` Nick Piggin 1 sibling, 1 reply; 10+ messages in thread From: Dave Chinner @ 2010-07-28 12:57 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, xfs On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > Branch vfs-scale-working > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > run the same fs_mark parallel create/unlink workload to show > > > scalability as I ran here: > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > of a real disk (I don't have easy access to a good disk setup ATM, but > > I guess we're more interested in code above the block layer anyway). > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > yours. > > > > I found that performance is a little unstable, so I sync and echo 3 > > > drop_caches between each run. When it starts reclaiming memory, things > > get a bit more erratic (and XFS seemed to be almost livelocking for tens > > of seconds in inode reclaim). > > So about this XFS livelock type thingy. It looks like this, and happens > periodically while running the above fs_mark benchmark requiring reclaim > of inodes: .... 
> Nothing much happening except 100% system time for seconds at a time > (length of time varies). This is on a ramdisk, so it isn't waiting > for IO. > > During this time, lots of things are contending on the lock: > > 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed > 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed > 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion > 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock > 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed > | > --- __write_lock_failed > | > |--99.92%-- xfs_inode_ag_walk > | xfs_inode_ag_iterator > | xfs_reclaim_inode_shrink > | shrink_slab > | shrink_zone > | balance_pgdat > | kswapd > | kthread > | kernel_thread_helper > --0.08%-- [...] > > 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock > 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find > 1.16% fs_mark [kernel.kallsyms] [k] memcpy > 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave > 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil > | > --- xfs_log_commit_cil > _xfs_trans_commit > | > |--60.00%-- xfs_remove > | xfs_vn_unlink > | vfs_unlink > | do_unlinkat > | sys_unlink > > I'm not sure if there was a long-running read locker in there causing > all the write lockers to fail, or if they are just running into one > another. The longest hold is in the inode cluster writeback (xfs_iflush_cluster), but if there is no IO then I don't see how that would be a problem. I suspect that it might be caused by having several CPUs all trying to run the shrinker at the same time and them all starting at the same AG and therefore lockstepping and getting nothing done because they are all scanning the same inodes. Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to avoid this lockstepping. I've attached a patch below to do this - can you give it a try? > But anyway, I hacked the following patch which seemed to > improve that behaviour. 
> I haven't run any throughput numbers on it yet, > but I could if you're interested (and it's not completely broken!) Batching is certainly something that I have been considering, but apart from the excessive scanning bug, the per-ag inode tree lookups have not featured prominently in any profiling I've done, so it hasn't been a high priority. Your patch looks like it will work fine, but I think it can be made a lot cleaner. I'll have a closer look at this once I get to the bottom of the dbench hang you are seeing.... Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: add an AG iterator start rotor From: Dave Chinner <dchinner@redhat.com> To avoid multiple CPUs executing inode cache shrinkers on the same AG all at the same time, make every shrinker call start on a different AG. This will mostly prevent concurrent shrinker calls from competing for the serialising pag_ici_lock and lock-stepping reclaim. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/linux-2.6/xfs_sync.c | 11 ++++++++++- fs/xfs/xfs_mount.h | 1 + 2 files changed, 11 insertions(+), 1 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c index dfcbd98..5322105 100644 --- a/fs/xfs/linux-2.6/xfs_sync.c +++ b/fs/xfs/linux-2.6/xfs_sync.c @@ -181,11 +181,14 @@ xfs_inode_ag_iterator( struct xfs_perag *pag; int error = 0; int last_error = 0; + xfs_agnumber_t start_ag; xfs_agnumber_t ag; int nr; + int looped = 0; nr = nr_to_scan ? 
*nr_to_scan : INT_MAX; - ag = 0; + start_ag = atomic_inc_return(&mp->m_agiter_rotor) % mp->m_sb.sb_agcount; + ag = start_ag; while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) { error = xfs_inode_ag_walk(mp, pag, execute, flags, tag, exclusive, &nr); @@ -197,6 +200,12 @@ xfs_inode_ag_iterator( } if (nr <= 0) break; + if (ag >= mp->m_sb.sb_agcount) { + looped = 1; + ag = 0; + } + if (ag >= start_ag && looped) + break; } if (nr_to_scan) *nr_to_scan = nr; diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 622da21..fae61bb 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -199,6 +199,7 @@ typedef struct xfs_mount { __int64_t m_update_flags; /* sb flags we need to update on the next remount,rw */ struct shrinker m_inode_shrink; /* inode reclaim shrinker */ + atomic_t m_agiter_rotor; /* ag iterator start rotor */ } xfs_mount_t; /* 
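For illustration, the behaviour of the start rotor in the patch above can be mimicked in a few lines (a sketch with invented names; the kernel version uses atomic_inc_return() on m_agiter_rotor):

```python
import itertools

AGCOUNT = 16   # agcount of the test filesystem

rotor = itertools.count(1)   # stands in for atomic_inc_return()

def shrinker_start_ag():
    """Pick the AG a shrinker invocation starts its walk from."""
    return next(rotor) % AGCOUNT

def walk_order(start, agcount=AGCOUNT):
    """AGs visited by one iterator call: start..end, then wrap to start-1."""
    return [(start + i) % agcount for i in range(agcount)]

# Eight "concurrent" shrinker calls now begin on eight different AGs,
# so they no longer lockstep over the same pag_ici_lock.
starts = [shrinker_start_ag() for _ in range(8)]
print(starts)                 # -> [1, 2, 3, 4, 5, 6, 7, 8]: all distinct
assert len(set(starts)) == 8
assert walk_order(14) == [14, 15] + list(range(14))
```

Each caller still visits every AG exactly once; only the starting point is staggered, which is all that is needed to stop concurrent reclaimers scanning the same inodes.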
* Re: VFS scalability git tree 2010-07-28 12:57 ` VFS scalability git tree Dave Chinner @ 2010-07-29 14:03 ` Nick Piggin 0 siblings, 0 replies; 10+ messages in thread From: Nick Piggin @ 2010-07-29 14:03 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel, Nick Piggin, xfs On Wed, Jul 28, 2010 at 10:57:17PM +1000, Dave Chinner wrote: > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > > > Branch vfs-scale-working > > > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > > run the same fs_mark parallel create/unlink workload to show > > > > scalability as I ran here: > > > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > > of a real disk (I don't have easy access to a good disk setup ATM, but > > > I guess we're more interested in code above the block layer anyway). > > > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > > yours. > > > > > > I found that performance is a little unstable, so I sync and echo 3 > > > > drop_caches between each run. When it starts reclaiming memory, things > > > get a bit more erratic (and XFS seemed to be almost livelocking for tens > > > of seconds in inode reclaim). > > > > So about this XFS livelock type thingy. It looks like this, and happens > > periodically while running the above fs_mark benchmark requiring reclaim > > of inodes: > .... 
> > > Nothing much happening except 100% system time for seconds at a time > > (length of time varies). This is on a ramdisk, so it isn't waiting > > for IO. > > > > During this time, lots of things are contending on the lock: > > > > 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed > > 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed > > 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion > > 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock > > 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed > > | > > --- __write_lock_failed > > | > > |--99.92%-- xfs_inode_ag_walk > > | xfs_inode_ag_iterator > > | xfs_reclaim_inode_shrink > > | shrink_slab > > | shrink_zone > > | balance_pgdat > > | kswapd > > | kthread > > | kernel_thread_helper > > --0.08%-- [...] > > > > 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock > > 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find > > 1.16% fs_mark [kernel.kallsyms] [k] memcpy > > 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave > > 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil > > | > > --- xfs_log_commit_cil > > _xfs_trans_commit > > | > > |--60.00%-- xfs_remove > > | xfs_vn_unlink > > | vfs_unlink > > | do_unlinkat > > | sys_unlink > > > > I'm not sure if there was a long-running read locker in there causing > > all the write lockers to fail, or if they are just running into one > > another. > > The longest hold is in the inode cluster writeback > (xfs_iflush_cluster), but if there is no IO then I don't see how > that would be a problem. No I wasn't suggesting there was, just that there could have been one that I didn't notice in profiles (ie. because it had taken read lock rather than spinning on it). > I suspect that it might be caused by having several CPUs > all trying to run the shrinker at the same time and them all > starting at the same AG and therefore lockstepping and getting > nothing done because they are all scanning the same inodes. 
I think that is the most likely answer, yes. > Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to > avoid this lockstepping. I've attached a patch below to do this > - can you give it a try? Cool yes I will. I could try it in combination with the batching patch too. Thanks. > > But anyway, I hacked the following patch which seemed to > > improve that behaviour. > > I haven't run any throughput numbers on it yet, > > but I could if you're interested (and it's not completely broken!) > > Batching is certainly something that I have been considering, but > apart from the excessive scanning bug, the per-ag inode tree lookups > have not featured prominently in any profiling I've done, so it > hasn't been a high priority. > > Your patch looks like it will work fine, but I think it can be made a > lot cleaner. I'll have a closer look at this once I get to the bottom of > the dbench hang you are seeing.... Well I'll see if I can measure any efficiency or lock contention improvements with it and report back. Might have to wait till the weekend. 
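Editorial note: the batching idea discussed above — looking up a batch of inodes per acquisition of the per-AG tree lock instead of relocking for every inode — can be sketched generically as follows (a toy model with invented names, not Nick's actual patch):

```python
from threading import Lock

class AgInodeWalker:
    """Toy model: walk an AG's inode set, taking up to `batch` inodes
    per lock acquisition rather than one, to cut lock traffic."""
    def __init__(self, inodes, batch=32):
        self.inodes = sorted(inodes)
        self.batch = batch
        self.lock = Lock()            # stands in for pag_ici_lock
        self.lock_acquisitions = 0

    def walk(self, execute):
        next_ino = 0
        while True:
            with self.lock:           # one lock round-trip per batch
                self.lock_acquisitions += 1
                batch = [i for i in self.inodes if i >= next_ino][:self.batch]
            if not batch:
                return
            for ino in batch:
                execute(ino)          # callback runs without the tree lock
            next_ino = batch[-1] + 1

w = AgInodeWalker(range(1000))
seen = []
w.walk(seen.append)
assert seen == list(range(1000))
print(w.lock_acquisitions)  # -> 33, versus ~1000 for one-inode-at-a-time
```

The walk result is identical to the unbatched version; the only difference is roughly a batch-size reduction in lock acquisitions, which is where the `__write_lock_failed` time in the earlier profile would be saved.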