* [PATCH 0/2] xfs: write back inodes during reclaim
@ 2011-04-07 6:19 Dave Chinner
From: Dave Chinner @ 2011-04-07  6:19 UTC
To: xfs; +Cc: linux-fsdevel
This series fixes an OOM problem where VFS-only dirty inodes, usually
the result of atime updates, accumulate on an XFS filesystem until
memory is exhausted.

The first patch fixes a deadlock that occurs when memory reclaim
triggers bdi-flusher writeback at a time when a new bdi-flusher
thread needs to be forked and no memory is available for the fork.

The second adds a bdi-flusher kick to XFS's inode cache shrinker so
that when memory is low the VFS starts writing back dirty inodes,
allowing them to be reclaimed as they are cleaned rather than
remaining dirty and pinning the inode cache in memory.
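
The deadlock the first patch breaks can be sketched as the following
call chain (reconstructed from the patch descriptions below; the
shrinker hook in the middle is the one added by the second patch):

    shrink_slab()                         /* reclaim, e.g. from a failed fork */
      -> xfs_reclaim_inode_shrink()
        -> writeback_inodes_sb_nr_if_idle()
          -> bdi_queue_work()             /* bdi looks idle: no flusher running yet */
            -> wait for BDI_pending to clear

BDI_pending is only cleared once the flusher thread has been forked,
but the fork is the allocation that failed and entered reclaim in the
first place, so the wait never ends.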
* [PATCH 1/2] bdi: mark the bdi flusher busy when being forked
From: Dave Chinner @ 2011-04-07  6:19 UTC
To: xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Recent attempts to use writeback_inodes_sb_nr_if_idle() in XFS from
memory reclaim context have caused deadlocks, because memory reclaim
can be entered from a failed allocation while forking a flusher
thread. The shrinker then attempts to trigger writeback; the bdi is
considered idle because writeback is not yet in progress, and the
shrinker deadlocks because bdi_queue_work() blocks waiting for the
BDI_pending bit to clear, which will never happen because that
requires the fork to complete.

To avoid this deadlock, consider writeback to be in progress if the
flusher thread is being created. This prevents reclaim from blocking
waiting for it to be forked and hence avoids the deadlock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b5ed541..64e2aba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -62,11 +62,14 @@ int nr_pdflush_threads;
  * @bdi: the device's backing_dev_info structure.
  *
  * Determine whether there is writeback waiting to be handled against a
- * backing device.
+ * backing device. If the flusher thread is being created, then writeback is
+ * in the process of being started, so report that writeback is not idle at
+ * this point in time.
  */
 int writeback_in_progress(struct backing_dev_info *bdi)
 {
-	return test_bit(BDI_writeback_running, &bdi->state);
+	return test_bit(BDI_writeback_running, &bdi->state) ||
+	       test_bit(BDI_pending, &bdi->state);
 }
 
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
-- 
1.7.2.3
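
For reference, the "if idle" entry point named above is built around
exactly this check. A rough sketch of that caller, reconstructed from
the descriptions in this thread rather than copied from the kernel
tree, looks like:

	/*
	 * Rough sketch of writeback_inodes_sb_nr_if_idle() (illustrative,
	 * not verbatim kernel source). Because the "idle" test is simply
	 * writeback_in_progress(), teaching that function about BDI_pending
	 * is enough to stop reclaim from queueing flusher work and then
	 * sleeping on a flusher thread that cannot be forked.
	 */
	int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
					   unsigned long nr)
	{
		if (!writeback_in_progress(sb->s_bdi)) {
			down_read(&sb->s_umount);
			writeback_inodes_sb_nr(sb, nr);	/* queues work and waits */
			up_read(&sb->s_umount);
			return 1;
		}
		return 0;
	}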
* Re: [PATCH 1/2] bdi: mark the bdi flusher busy when being forked
From: Christoph Hellwig @ 2011-04-11 18:34 UTC
To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Thu, Apr 07, 2011 at 04:19:55PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Recent attempts to use writeback_inodes_sb_nr_if_idle() in XFS from
> memory reclaim context have caused deadlocks, because memory reclaim
> can be entered from a failed allocation while forking a flusher
> thread. The shrinker then attempts to trigger writeback; the bdi is
> considered idle because writeback is not yet in progress, and the
> shrinker deadlocks because bdi_queue_work() blocks waiting for the
> BDI_pending bit to clear, which will never happen because that
> requires the fork to complete.
>
> To avoid this deadlock, consider writeback to be in progress if the
> flusher thread is being created. This prevents reclaim from blocking
> waiting for it to be forked and hence avoids the deadlock.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 1/2] bdi: mark the bdi flusher busy when being forked
From: Alex Elder @ 2011-04-13 19:29 UTC
To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Thu, 2011-04-07 at 16:19 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Recent attempts to use writeback_inodes_sb_nr_if_idle() in XFS from
> memory reclaim context have caused deadlocks, because memory reclaim
> can be entered from a failed allocation while forking a flusher
> thread. The shrinker then attempts to trigger writeback; the bdi is
> considered idle because writeback is not yet in progress, and the
> shrinker deadlocks because bdi_queue_work() blocks waiting for the
> BDI_pending bit to clear, which will never happen because that
> requires the fork to complete.
>
> To avoid this deadlock, consider writeback to be in progress if the
> flusher thread is being created. This prevents reclaim from blocking
> waiting for it to be forked and hence avoids the deadlock.

I don't believe it matters, but BDI_pending is also set while a
writeback flusher thread is being shut down. In any case, a handy
use of that flag bit.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
* [PATCH 2/2] xfs: kick inode writeback when low on memory
From: Dave Chinner @ 2011-04-07  6:19 UTC
To: xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When the inode cache shrinker runs, we may have lots of dirty inodes
queued up in the VFS dirty queues that have not been expired. The
typical case for this with XFS is atime updates. The result is that a
highly concurrent workload that copies files and then later reads
them (say, to verify checksums) dirties all the inodes again, even
when relatime is used.

In a constrained memory environment, this results in a large number
of dirty inodes using all of the available memory, and memory reclaim
being unable to free them because dirty inodes are considered active.
This problem was uncovered by Chris Mason during recent low memory
stress testing.

The fix is to trigger VFS level writeback from the XFS inode cache
shrinker if there isn't already writeback in progress. This ensures
that when we enter a low memory situation we start cleaning inodes
(via the flusher thread) on the filesystem immediately, thereby
making it more likely that we will be able to evict those dirty
inodes from the VFS in the near future.

The mechanism is not perfect - it only acts on the current
filesystem, so if all the dirty inodes are on a different filesystem
it won't help. However, it seems a valid assumption that the
filesystem with lots of dirty inodes is going to have its shrinker
called very soon after the memory shortage begins, so this shouldn't
be an issue.

The other flaw is that there is no guarantee that the flusher thread
will make progress fast enough to clean the dirty inodes so they can
be reclaimed in the near future. However, this mechanism does improve
the resilience of the filesystem under the test conditions - instead
of reliably triggering the OOM killer 20 minutes into the stress
test, it took more than 6 hours before it happened.

This small addition definitely improves the low memory resilience of
XFS on this type of workload, and best of all it has no impact on
performance when memory is not constrained.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 9ad9560..c240d46 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -1038,6 +1038,17 @@ xfs_reclaim_inode_shrink(
 	if (!(gfp_mask & __GFP_FS))
 		return -1;
 
+	/*
+	 * Make sure the VFS is cleaning inodes so they can be pruned
+	 * and marked for reclaim in the XFS inode cache. If we don't
+	 * do this the VFS can accumulate dirty inodes and we can OOM
+	 * before they are cleaned by the periodic VFS writeback.
+	 *
+	 * This takes VFS level locks, so we can only do this after
+	 * the __GFP_FS checks otherwise lockdep gets really unhappy.
+	 */
+	writeback_inodes_sb_nr_if_idle(mp->m_super, nr_to_scan);
+
 	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
 
 	/* terminate if we don't exhaust the scan */
-- 
1.7.2.3
* Re: [PATCH 2/2] xfs: kick inode writeback when low on memory
From: Christoph Hellwig @ 2011-04-11 18:36 UTC
To: Dave Chinner; +Cc: linux-fsdevel, xfs

How do you produce so many atime-dirty inodes? With relatime we
should have cut down on the requirement for those a lot.

Do you have traces that show if we're kicking off additional data
writeback this way too, or just pushing timestamp updates into
the AIL?

Either way the actual patch looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 2/2] xfs: kick inode writeback when low on memory
From: Dave Chinner @ 2011-04-11 21:14 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel, xfs

On Mon, Apr 11, 2011 at 02:36:53PM -0400, Christoph Hellwig wrote:
> How do you produce so many atime-dirty inodes? With relatime we
> should have cut down on the requirement for those a lot.

Copy a bunch of files, then md5sum them. The copy modifies c/mtime;
the md5sum then reads the files, relatime sees that mtime is younger
than atime, so it updates atime and dirties the inode. i.e.:

$ touch foo
$ stat foo
  File: `foo'
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: fe02h/65026d    Inode: 150756489   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    dave)   Gid: ( 1000/    dave)
Access: 2011-04-12 07:08:24.668542636 +1000
Modify: 2011-04-12 07:08:24.668542636 +1000
Change: 2011-04-12 07:08:24.668542636 +1000
$ cp README foo
cp: overwrite `foo'? y
$ stat foo
  File: `foo'
  Size: 17525           Blocks: 40         IO Block: 4096   regular file
Device: fe02h/65026d    Inode: 150756489   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    dave)   Gid: ( 1000/    dave)
Access: 2011-04-12 07:08:24.668542636 +1000
Modify: 2011-04-12 07:08:44.676108103 +1000
Change: 2011-04-12 07:08:44.676108103 +1000
$ md5sum foo
9eb709847626f3663ea66121f10a27d7  foo
$ stat foo
  File: `foo'
  Size: 17525           Blocks: 40         IO Block: 4096   regular file
Device: fe02h/65026d    Inode: 150756489   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    dave)   Gid: ( 1000/    dave)
Access: 2011-04-12 07:09:00.223770431 +1000
Modify: 2011-04-12 07:08:44.676108103 +1000
Change: 2011-04-12 07:08:44.676108103 +1000
$

> Do you have traces that show if we're kicking off additional data
> writeback this way too, or just pushing timestamp updates into
> the AIL?

For the test workload, it just pushes timestamp updates into the AIL,
as that is the only thing that is dirtying the inodes when the OOM
occurs. In other situations, I have no evidence either way, but I
have not noticed any performance changes from a high level.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
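
The relatime rule Dave is relying on can be summarised with a small
sketch (illustrative only - the helper name and exact form are
assumptions, not a copy of fs/inode.c): a read still updates atime
whenever the file has been modified since it was last read, or when
the atime is more than a day old, which is why the copy-then-checksum
workload dirties every inode.

	/* Illustrative sketch of the relatime decision, not verbatim kernel code. */
	static int relatime_atime_needs_update(struct inode *inode,
					       struct timespec now)
	{
		/* mtime or ctime newer than atime: the next read updates atime */
		if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
			return 1;
		if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
			return 1;
		/* atime older than a day: update it anyway */
		if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24 * 60 * 60)
			return 1;
		return 0;
	}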
* Re: [PATCH 2/2] xfs: kick inode writeback when low on memory
From: Alex Elder @ 2011-04-13 20:33 UTC
To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Thu, 2011-04-07 at 16:19 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When the inode cache shrinker runs, we may have lots of dirty inodes
> queued up in the VFS dirty queues that have not been expired. The
> typical case for this with XFS is atime updates. The result is that a
> highly concurrent workload that copies files and then later reads
> them (say, to verify checksums) dirties all the inodes again, even
> when relatime is used.
>
> In a constrained memory environment, this results in a large number
> of dirty inodes using all of the available memory, and memory reclaim
> being unable to free them because dirty inodes are considered active.
> This problem was uncovered by Chris Mason during recent low memory
> stress testing.
>
> The fix is to trigger VFS level writeback from the XFS inode cache
> shrinker if there isn't already writeback in progress. This ensures
> that when we enter a low memory situation we start cleaning inodes
> (via the flusher thread) on the filesystem immediately, thereby
> making it more likely that we will be able to evict those dirty
> inodes from the VFS in the near future.
>
> The mechanism is not perfect - it only acts on the current
> filesystem, so if all the dirty inodes are on a different filesystem
> it won't help. However, it seems a valid assumption that the
> filesystem with lots of dirty inodes is going to have its shrinker
> called very soon after the memory shortage begins, so this shouldn't
> be an issue.
>
> The other flaw is that there is no guarantee that the flusher thread
> will make progress fast enough to clean the dirty inodes so they can
> be reclaimed in the near future. However, this mechanism does improve
> the resilience of the filesystem under the test conditions - instead
> of reliably triggering the OOM killer 20 minutes into the stress
> test, it took more than 6 hours before it happened.
>
> This small addition definitely improves the low memory resilience of
> XFS on this type of workload, and best of all it has no impact on
> performance when memory is not constrained.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good to me.

Reviewed-by: Alex Elder <aelder@sgi.com>
* Re: [PATCH 2/2] xfs: kick inode writeback when low on memory
From: Dave Chinner @ 2011-04-14  5:08 UTC
To: Alex Elder; +Cc: linux-fsdevel, xfs

On Wed, Apr 13, 2011 at 03:33:42PM -0500, Alex Elder wrote:
> On Thu, 2011-04-07 at 16:19 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > When the inode cache shrinker runs, we may have lots of dirty inodes
> > queued up in the VFS dirty queues that have not been expired. The
> > typical case for this with XFS is atime updates. The result is that a
> > highly concurrent workload that copies files and then later reads
> > them (say, to verify checksums) dirties all the inodes again, even
> > when relatime is used.
> >
> > In a constrained memory environment, this results in a large number
> > of dirty inodes using all of the available memory, and memory reclaim
> > being unable to free them because dirty inodes are considered active.
> > This problem was uncovered by Chris Mason during recent low memory
> > stress testing.
> >
> > The fix is to trigger VFS level writeback from the XFS inode cache
> > shrinker if there isn't already writeback in progress. This ensures
> > that when we enter a low memory situation we start cleaning inodes
> > (via the flusher thread) on the filesystem immediately, thereby
> > making it more likely that we will be able to evict those dirty
> > inodes from the VFS in the near future.
> >
> > The mechanism is not perfect - it only acts on the current
> > filesystem, so if all the dirty inodes are on a different filesystem
> > it won't help. However, it seems a valid assumption that the
> > filesystem with lots of dirty inodes is going to have its shrinker
> > called very soon after the memory shortage begins, so this shouldn't
> > be an issue.
> >
> > The other flaw is that there is no guarantee that the flusher thread
> > will make progress fast enough to clean the dirty inodes so they can
> > be reclaimed in the near future. However, this mechanism does improve
> > the resilience of the filesystem under the test conditions - instead
> > of reliably triggering the OOM killer 20 minutes into the stress
> > test, it took more than 6 hours before it happened.
> >
> > This small addition definitely improves the low memory resilience of
> > XFS on this type of workload, and best of all it has no impact on
> > performance when memory is not constrained.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> Looks good to me.
>
> Reviewed-by: Alex Elder <aelder@sgi.com>

Unfortunately, we simply can't take the s_umount lock in reclaim
context, so further hackery is going to be required here - I think
that writeback_inodes_sb_nr_if_idle() needs to use trylocks. If the
s_umount lock is taken in write mode, then it's pretty certain that
the sb is busy....

[ 2226.939859] =================================
[ 2226.940026] [ INFO: inconsistent lock state ]
[ 2226.940026] 2.6.39-rc3-dgc+ #1162
[ 2226.940026] ---------------------------------
[ 2226.940026] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
[ 2226.940026] diff/23704 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 2226.940026]  (&type->s_umount_key#23){+++++?}, at: [<ffffffff81191bf0>] writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026] {RECLAIM_FS-ON-W} state was registered at:
[ 2226.940026]   [<ffffffff810c06a7>] mark_held_locks+0x67/0x90
[ 2226.940026]   [<ffffffff810c0796>] lockdep_trace_alloc+0xc6/0x100
[ 2226.940026]   [<ffffffff8115fec9>] kmem_cache_alloc+0x39/0x1e0
[ 2226.940026]   [<ffffffff814afce7>] kmem_zone_alloc+0x77/0xf0
[ 2226.940026]   [<ffffffff814afd7e>] kmem_zone_zalloc+0x1e/0x50
[ 2226.940026]   [<ffffffff814a5f51>] _xfs_trans_alloc+0x31/0x80
[ 2226.940026]   [<ffffffff814a1b74>] xfs_log_sbcount+0x84/0xf0
[ 2226.940026]   [<ffffffff814a26be>] xfs_unmountfs+0xde/0x1a0
[ 2226.940026]   [<ffffffff814bd466>] xfs_fs_put_super+0x46/0x80
[ 2226.940026]   [<ffffffff8116cb92>] generic_shutdown_super+0x72/0x100
[ 2226.940026]   [<ffffffff8116cc51>] kill_block_super+0x31/0x80
[ 2226.940026]   [<ffffffff8116d415>] deactivate_locked_super+0x45/0x60
[ 2226.940026]   [<ffffffff8116e10a>] deactivate_super+0x4a/0x70
[ 2226.940026]   [<ffffffff8118951c>] mntput_no_expire+0xec/0x140
[ 2226.940026]   [<ffffffff81189a08>] sys_umount+0x78/0x3c0
[ 2226.940026]   [<ffffffff81b76c82>] system_call_fastpath+0x16/0x1b
[ 2226.940026] irq event stamp: 2767751
[ 2226.940026] hardirqs last enabled at (2767751): [<ffffffff810ee0b6>] __call_rcu+0xa6/0x190
[ 2226.940026] hardirqs last disabled at (2767750): [<ffffffff810ee05a>] __call_rcu+0x4a/0x190
[ 2226.940026] softirqs last enabled at (2758484): [<ffffffff8108d1a3>] __do_softirq+0x143/0x220
[ 2226.940026] softirqs last disabled at (2758471): [<ffffffff81b77e9c>] call_softirq+0x1c/0x30
[ 2226.940026]
[ 2226.940026] other info that might help us debug this:
[ 2226.940026] 3 locks held by diff/23704:
[ 2226.940026]  #0:  (xfs_iolock_active){++++++}, at: [<ffffffff81487408>] xfs_ilock+0x138/0x190
[ 2226.940026]  #1:  (&mm->mmap_sem){++++++}, at: [<ffffffff81b7258b>] do_page_fault+0xeb/0x4f0
[ 2226.940026]  #2:  (shrinker_rwsem){++++..}, at: [<ffffffff8112cb6d>] shrink_slab+0x3d/0x1a0
[ 2226.940026]
[ 2226.940026] stack backtrace:
[ 2226.940026] Pid: 23704, comm: diff Not tainted 2.6.39-rc3-dgc+ #1162
[ 2226.940026] Call Trace:
[ 2226.940026]  [<ffffffff810bf5fa>] print_usage_bug+0x18a/0x190
[ 2226.940026]  [<ffffffff8104982f>] ? save_stack_trace+0x2f/0x50
[ 2226.940026]  [<ffffffff810bf770>] ? print_irq_inversion_bug+0x170/0x170
[ 2226.940026]  [<ffffffff810c055e>] mark_lock+0x35e/0x440
[ 2226.940026]  [<ffffffff810c1227>] __lock_acquire+0x447/0x14b0
[ 2226.940026]  [<ffffffff81065ed8>] ? pvclock_clocksource_read+0x58/0xd0
[ 2226.940026]  [<ffffffff814a84c8>] ? xfs_ail_push_all+0x78/0x80
[ 2226.940026]  [<ffffffff810650b9>] ? kvm_clock_read+0x19/0x20
[ 2226.940026]  [<ffffffff81042bc9>] ? sched_clock+0x9/0x10
[ 2226.940026]  [<ffffffff810aff15>] ? sched_clock_local+0x25/0x90
[ 2226.940026]  [<ffffffff810c2344>] lock_acquire+0xb4/0x140
[ 2226.940026]  [<ffffffff81191bf0>] ? writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026]  [<ffffffff81b76a16>] ? ftrace_call+0x5/0x2b
[ 2226.940026]  [<ffffffff81b6d731>] down_read+0x51/0xa0
[ 2226.940026]  [<ffffffff81191bf0>] ? writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026]  [<ffffffff81191bf0>] writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026]  [<ffffffff814bec18>] ? xfs_syncd_queue_reclaim+0x28/0xc0
[ 2226.940026]  [<ffffffff814c02e9>] xfs_reclaim_inode_shrink+0x99/0xc0
[ 2226.940026]  [<ffffffff8112cc67>] shrink_slab+0x137/0x1a0
[ 2226.940026]  [<ffffffff8112e40c>] do_try_to_free_pages+0x20c/0x440
[ 2226.940026]  [<ffffffff8112e7a2>] try_to_free_pages+0x92/0x130
[ 2226.940026]  [<ffffffff81124826>] __alloc_pages_nodemask+0x496/0x930
[ 2226.940026]  [<ffffffff810aff15>] ? sched_clock_local+0x25/0x90
[ 2226.940026]  [<ffffffff81b76a16>] ? ftrace_call+0x5/0x2b
[ 2226.940026]  [<ffffffff8115c169>] alloc_pages_vma+0x99/0x150
[ 2226.940026]  [<ffffffff811681b3>] do_huge_pmd_anonymous_page+0x143/0x380
[ 2226.940026]  [<ffffffff81b76a16>] ? ftrace_call+0x5/0x2b
[ 2226.940026]  [<ffffffff81141b26>] handle_mm_fault+0x136/0x290
[ 2226.940026]  [<ffffffff81b72601>] do_page_fault+0x161/0x4f0
[ 2226.940026]  [<ffffffff810b0038>] ? sched_clock_cpu+0xb8/0x110
[ 2226.940026]  [<ffffffff810c1116>] ? __lock_acquire+0x336/0x14b0
[ 2226.940026]  [<ffffffff811275b8>] ? __do_page_cache_readahead+0x208/0x2b0
[ 2226.940026]  [<ffffffff81065ed8>] ? pvclock_clocksource_read+0x58/0xd0
[ 2226.940026]  [<ffffffff816d527d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2226.940026]  [<ffffffff81b6f265>] page_fault+0x25/0x30
[ 2226.940026]  [<ffffffff8111bed4>] ? file_read_actor+0x114/0x1d0
[ 2226.940026]  [<ffffffff8111bde1>] ? file_read_actor+0x21/0x1d0
[ 2226.940026]  [<ffffffff8111dfad>] generic_file_aio_read+0x35d/0x7b0
[ 2226.940026]  [<ffffffff814b769e>] xfs_file_aio_read+0x15e/0x2e0
[ 2226.940026]  [<ffffffff8116a4d0>] ? do_sync_write+0x120/0x120
[ 2226.940026]  [<ffffffff8116a5aa>] do_sync_read+0xda/0x120
[ 2226.940026]  [<ffffffff8169aeee>] ? security_file_permission+0x8e/0x90
[ 2226.940026]  [<ffffffff8116acdd>] vfs_read+0xcd/0x180
[ 2226.940026]  [<ffffffff8116ae94>] sys_read+0x54/0xa0
[ 2226.940026]  [<ffffffff81b76c82>] system_call_fastpath+0x16/0x1b

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
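
A trylock-based variant along the lines Dave suggests might look
something like the following sketch (illustrative only; this is an
assumption about the shape of the fix, not the change that was
eventually merged):

	/*
	 * Sketch of the suggested trylock approach: if s_umount cannot be
	 * taken without blocking, the superblock is busy (e.g. unmounting),
	 * so skip the opportunistic writeback kick instead of letting
	 * reclaim sleep on the lock. Illustrative only.
	 */
	int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
					   unsigned long nr)
	{
		if (!writeback_in_progress(sb->s_bdi) &&
		    down_read_trylock(&sb->s_umount)) {
			writeback_inodes_sb_nr(sb, nr);
			up_read(&sb->s_umount);
			return 1;
		}
		return 0;
	}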
* Re: [PATCH 2/2] xfs: kick inode writeback when low on memory
From: Christoph Hellwig @ 2011-04-15  8:09 UTC
To: Dave Chinner; +Cc: linux-fsdevel, xfs, Alex Elder

On Thu, Apr 14, 2011 at 03:08:46PM +1000, Dave Chinner wrote:
> Unfortunately, we simply can't take the s_umount lock in reclaim
> context, so further hackery is going to be required here - I think
> that writeback_inodes_sb_nr_if_idle() needs to use trylocks. If the
> s_umount lock is taken in write mode, then it's pretty certain that
> the sb is busy....

http://thread.gmane.org/gmane.linux.file-systems/48373/focus=48628
* Re: [PATCH 0/2] xfs: write back inodes during reclaim
From: Yann Dupont @ 2011-04-15  7:23 UTC
To: xfs

On 07/04/2011 08:19, Dave Chinner wrote:
> This series fixes an OOM problem where VFS-only dirty inodes, usually
> the result of atime updates, accumulate on an XFS filesystem until
> memory is exhausted.
>
> The first patch fixes a deadlock that occurs when memory reclaim
> triggers bdi-flusher writeback at a time when a new bdi-flusher
> thread needs to be forked and no memory is available for the fork.
>
> The second adds a bdi-flusher kick to XFS's inode cache shrinker so
> that when memory is low the VFS starts writing back dirty inodes,
> allowing them to be reclaimed as they are cleaned rather than
> remaining dirty and pinning the inode cache in memory.

Hello, we've been hit for some time by a bug (OOM) which may be
related to this one. Our server hosts lots of Samba servers (in
linux-vserver - this is NOT a vanilla kernel) and is also an NFS
kernel server. The OOM generally happens after one month of uptime,
and last week we also hit the problem after one week.

For example, this one:

Feb 25 12:54:15 strathisla.u11.univ-nantes.prive kernel: [2743591.087102] Node 0 Normal free:8840kB min:12968kB low:16208kB high:19452kB active_anon:140168kB inactive_anon:21200kB active_file:1446724kB inactive_file:10741224kB unevictable:4172kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4172kB dirty:42924kB writeback:249420kB mapped:60296kB shmem:7028kB slab_reclaimable:758752kB slab_unreclaimable:136528kB kernel_stack:6784kB pagetables:8388kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Feb 25 12:57:21 strathisla.u11.univ-nantes.prive kernel: [2743777.877303] admind: page allocation failure. order:0, mode:0x4020
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877340] Pid: 10121, comm: admind Not tainted 2.6.32-5-vserver-amd64 #1
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877369] Call Trace:
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877392]  <IRQ>  [<ffffffff810c3f43>] ? __alloc_pages_nodemask+0x592/0x5f3
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877449]  [<ffffffff810f0d1e>] ? new_slab+0x5b/0x1ca
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877477]  [<ffffffff810f107d>] ? __slab_alloc+0x1f0/0x39b
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877507]  [<ffffffff812565c8>] ? __netdev_alloc_skb+0x29/0x45
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877537]  [<ffffffff810f1aaf>] ? __kmalloc_node_track_caller+0xbb/0x11b
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877568]  [<ffffffff812565c8>] ? __netdev_alloc_skb+0x29/0x45
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877598]  [<ffffffff812555f5>] ? __alloc_skb+0x69/0x15a
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877627]  [<ffffffff812565c8>] ? __netdev_alloc_skb+0x29/0x45
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877673]  [<ffffffffa00af52a>] ? bnx2_alloc_rx_skb+0x4c/0x1a3 [bnx2]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877706]  [<ffffffffa00b34fb>] ? bnx2_poll_work+0x4f3/0xa7e [bnx2]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877738]  [<ffffffffa00b3c47>] ? bnx2_poll+0x11b/0x229 [bnx2]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877768]  [<ffffffff8125c851>] ? net_rx_action+0xae/0x1c9
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877799]  [<ffffffff8105430b>] ? __do_softirq+0xdd/0x1a2
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877828]  [<ffffffff81011cac>] ? call_softirq+0x1c/0x30
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877857]  [<ffffffff8101322b>] ? do_softirq+0x3f/0x7c
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877885]  [<ffffffff8105417a>] ? irq_exit+0x36/0x76
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877912]  [<ffffffff81012922>] ? do_IRQ+0xa0/0xb6
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877939]  [<ffffffff810114d3>] ? ret_from_intr+0x0/0x11
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.877966]  <EOI>  [<ffffffffa02304cf>] ? xfs_reclaim_inode+0x0/0xe0 [xfs]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878019]  [<ffffffff8130a7c5>] ? _write_lock+0x7/0xf
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878058]  [<ffffffffa0230e3d>] ? xfs_inode_ag_walk+0x4e/0xef [xfs]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878098]  [<ffffffffa02304cf>] ? xfs_reclaim_inode+0x0/0xe0 [xfs]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878138]  [<ffffffffa0230f4f>] ? xfs_inode_ag_iterator+0x71/0xb2 [xfs]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878179]  [<ffffffffa02304cf>] ? xfs_reclaim_inode+0x0/0xe0 [xfs]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878219]  [<ffffffffa0230feb>] ? xfs_reclaim_inode_shrink+0x5b/0x10d [xfs]
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878265]  [<ffffffff810c8dd1>] ? shrink_slab+0xe0/0x153
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878294]  [<ffffffff810c9d2e>] ? try_to_free_pages+0x26a/0x38e
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878323]  [<ffffffff810c6ceb>] ? isolate_pages_global+0x0/0x20f
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878353]  [<ffffffff810c3d7e>] ? __alloc_pages_nodemask+0x3cd/0x5f3
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878383]  [<ffffffff810f0d05>] ? new_slab+0x42/0x1ca
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878411]  [<ffffffff810f107d>] ? __slab_alloc+0x1f0/0x39b
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878441]  [<ffffffff8110437f>] ? getname+0x23/0x1a0
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878468]  [<ffffffff8110437f>] ? getname+0x23/0x1a0
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878495]  [<ffffffff810f1558>] ? kmem_cache_alloc+0x7f/0xf0
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878524]  [<ffffffff8110437f>] ? getname+0x23/0x1a0
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878552]  [<ffffffff810f75b3>] ? do_sys_open+0x1d/0xfc
Feb 25 12:57:22 strathisla.u11.univ-nantes.prive kernel: [2743777.878580]  [<ffffffff81037623>] ? ia32_sysret+0x0/0x5

I saw this on 2.6.32 kernels; for the past two days we have been
testing a 2.6.38.2 kernel on the very same machine.

Some questions:

- What kernel versions are known to be impacted?
- What is the plan for inclusion in the kernel? Is this considered
  appropriate material for 2.6.38.4 and older stable kernels?
- Can mounting with noatime alleviate the problem?

Regards,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: [PATCH 0/2] xfs: write back inodes during reclaim
From: Dave Chinner @ 2011-04-15  7:54 UTC
To: Yann Dupont; +Cc: xfs

{In future, can you make sure you don't line-wrap stack traces? They
turn into an utter mess when being quoted if you wrap them.}

On Fri, Apr 15, 2011 at 09:23:50AM +0200, Yann Dupont wrote:
> On 07/04/2011 08:19, Dave Chinner wrote:
> > This series fixes an OOM problem where VFS-only dirty inodes, usually
> > the result of atime updates, accumulate on an XFS filesystem until
> > memory is exhausted.
> >
> > The first patch fixes a deadlock that occurs when memory reclaim
> > triggers bdi-flusher writeback at a time when a new bdi-flusher
> > thread needs to be forked and no memory is available for the fork.
> >
> > The second adds a bdi-flusher kick to XFS's inode cache shrinker so
> > that when memory is low the VFS starts writing back dirty inodes,
> > allowing them to be reclaimed as they are cleaned rather than
> > remaining dirty and pinning the inode cache in memory.
>
> Hello, we've been hit for some time by a bug (OOM) which may be
> related to this one. Our server hosts lots of Samba servers (in
> linux-vserver - this is NOT a vanilla kernel) and is also an NFS
> kernel server. The OOM generally happens after one month of uptime,
> and last week we also hit the problem after one week.
>
> For example, this one:

....

> [2743777.877340] Pid: 10121, comm: admind Not tainted 2.6.32-5-vserver-amd64 #1

Vserver. uggh.

> Call Trace:
>  <IRQ>  [<ffffffff810c3f43>] ? __alloc_pages_nodemask+0x592/0x5f3
>  [<ffffffff810f0d1e>] ? new_slab+0x5b/0x1ca
>  [<ffffffff810f107d>] ? __slab_alloc+0x1f0/0x39b
>  [<ffffffff812565c8>] ? __netdev_alloc_skb+0x29/0x45
>  [<ffffffff810f1aaf>] ? __kmalloc_node_track_caller+0xbb/0x11b
>  [<ffffffff812565c8>] ? __netdev_alloc_skb+0x29/0x45
>  [<ffffffff812555f5>] ? __alloc_skb+0x69/0x15a
>  [<ffffffff812565c8>] ? __netdev_alloc_skb+0x29/0x45
>  [<ffffffffa00af52a>] ? bnx2_alloc_rx_skb+0x4c/0x1a3 [bnx2]
>  [<ffffffffa00b34fb>] ? bnx2_poll_work+0x4f3/0xa7e [bnx2]
>  [<ffffffffa00b3c47>] ? bnx2_poll+0x11b/0x229 [bnx2]
>  [<ffffffff8125c851>] ? net_rx_action+0xae/0x1c9
>  [<ffffffff8105430b>] ? __do_softirq+0xdd/0x1a2
>  [<ffffffff81011cac>] ? call_softirq+0x1c/0x30
>  [<ffffffff8101322b>] ? do_softirq+0x3f/0x7c
>  [<ffffffff8105417a>] ? irq_exit+0x36/0x76
>  [<ffffffff81012922>] ? do_IRQ+0xa0/0xb6
>  [<ffffffff810114d3>] ? ret_from_intr+0x0/0x11
>  <EOI>  [<ffffffffa02304cf>] ? xfs_reclaim_inode+0x0/0xe0 [xfs]
>  [<ffffffff8130a7c5>] ? _write_lock+0x7/0xf
>  [<ffffffffa0230e3d>] ? xfs_inode_ag_walk+0x4e/0xef [xfs]
>  [<ffffffffa02304cf>] ? xfs_reclaim_inode+0x0/0xe0 [xfs]
>  [<ffffffffa0230f4f>] ? xfs_inode_ag_iterator+0x71/0xb2 [xfs]
>  [<ffffffffa02304cf>] ? xfs_reclaim_inode+0x0/0xe0 [xfs]
>  [<ffffffffa0230feb>] ? xfs_reclaim_inode_shrink+0x5b/0x10d [xfs]
>  [<ffffffff810c8dd1>] ? shrink_slab+0xe0/0x153
>  [<ffffffff810c9d2e>] ? try_to_free_pages+0x26a/0x38e
>  [<ffffffff810c6ceb>] ? isolate_pages_global+0x0/0x20f
>  [<ffffffff810c3d7e>] ? __alloc_pages_nodemask+0x3cd/0x5f3
>  [<ffffffff810f0d05>] ? new_slab+0x42/0x1ca
>  [<ffffffff810f107d>] ? __slab_alloc+0x1f0/0x39b
>  [<ffffffff8110437f>] ? getname+0x23/0x1a0
>  [<ffffffff8110437f>] ? getname+0x23/0x1a0
>  [<ffffffff810f1558>] ? kmem_cache_alloc+0x7f/0xf0
>  [<ffffffff8110437f>] ? getname+0x23/0x1a0
>  [<ffffffff810f75b3>] ? do_sys_open+0x1d/0xfc
>  [<ffffffff81037623>] ? ia32_sysret+0x0/0x5

This, I'd say, has nothing to do with XFS - the system has taken a
network interrupt and failed an allocation in the bnx2 NIC driver.
You chopped off the line that describes the actual allocation
parameters that failed, so I can't really say why it failed...

> Some questions:
>
> - What kernel versions are known to be impacted?

No idea. It was reported on a .38-rc kernel, and I don't have the
bandwidth to do a "which versions does it affect" search.

> - What is the plan for inclusion in the kernel? Is this considered
>   appropriate material for 2.6.38.4 and older stable kernels?

None right now - the patch is dead in the water at the moment because
of the lock inversion issues it causes. Even so, I doubt I'd be
backporting it to any stable kernel without anyone reporting that it
is the root cause of their OOM problems.

> - Can mounting with noatime alleviate the problem?

The problem that the patches I posted were supposed to fix, yes. The
problem you are reporting here, most likely not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/2] xfs: write back inodes during reclaim
From: Yann Dupont @ 2011-04-15  9:13 UTC
To: Dave Chinner; +Cc: xfs

On 15/04/2011 09:54, Dave Chinner wrote:
> {In future, can you make sure you don't line-wrap stack traces? They
> turn into an utter mess when being quoted if you wrap them.}

Whoops, sorry :( I just copied/pasted from our internal request
tracker, so the trace may have been mangled there.

> This, I'd say, has nothing to do with XFS - the system has taken a
> network interrupt and failed an allocation in the bnx2 NIC driver.
> You chopped off the line that describes the actual allocation
> parameters that failed, so I can't really say why it failed...

Once again, sorry. I'm not used to sending stack traces so often.
I'll try to send clean traces, but only if you think they may have
something to do with XFS. We have a lot of those traces, and this
particular one may not be the best one to describe the problem. My
apologies.

...

> None right now - the patch is dead in the water at the moment because
> of the lock inversion issues it causes. Even so, I doubt I'd be
> backporting it to any stable kernel without anyone reporting that it
> is the root cause of their OOM problems.

OK.

> > - Can mounting with noatime alleviate the problem?
>
> The problem that the patches I posted were supposed to fix, yes. The
> problem you are reporting here, most likely not.

OK, thanks for your quick answer.

Regards,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr