From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from aserp1040.oracle.com ([141.146.126.69]:19212 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932913AbbFJKl7 (ORCPT ); Wed, 10 Jun 2015 06:41:59 -0400 Date: Wed, 10 Jun 2015 18:41:29 +0800 From: Liu Bo To: fdmanana@kernel.org Cc: linux-btrfs@vger.kernel.org, ce3g8jdj@umail.furryterror.org, Filipe Manana Subject: Re: [PATCH] Btrfs: fix hang during inode eviction due to concurrent readahead Message-ID: <20150610104129.GD3561@localhost.localdomain> Reply-To: bo.li.liu@oracle.com References: <1432598142-4434-1-git-send-email-fdmanana@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <1432598142-4434-1-git-send-email-fdmanana@kernel.org> Sender: linux-btrfs-owner@vger.kernel.org List-ID: On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana > > Zygo Blaxell and other users have reported occasional hangs while an > inode is being evicted, leading to traces like the following: > > [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds. > [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2 > [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000 > [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001 > [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78 > [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123 > [ 5281.981396] Call Trace: > [ 5281.982066] [] schedule+0x74/0x83 > [ 5281.983341] [] wait_on_state+0xac/0xcd [btrfs] > [ 5281.985127] [] ? signal_pending_state+0x31/0x31 > [ 5281.986715] [] wait_extent_bit.constprop.32+0x7c/0xde [btrfs] > [ 5281.988680] [] lock_extent_bits+0x5d/0x88 [btrfs] > [ 5281.990200] [] btrfs_evict_inode+0x24e/0x5be [btrfs] > [ 5281.991781] [] evict+0xa0/0x148 > [ 5281.992735] [] iput+0x18f/0x1e5 > [ 5281.993796] [] do_unlinkat+0x15b/0x1fa > [ 5281.994806] [] ? ret_from_sys_call+0x1d/0x58 > [ 5281.996120] [] ? trace_hardirqs_on_caller+0x18f/0x1ab > [ 5281.997562] [] ? trace_hardirqs_on_thunk+0x3a/0x3f > [ 5281.998815] [] SyS_unlinkat+0x29/0x2b > [ 5281.999920] [] system_call_fastpath+0x12/0x17 > [ 5282.001299] 1 lock held by rm/20488: > [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [] mnt_want_write+0x24/0x4b > > This happens when we have readahead, which calls readpages(), happening > right before the inode eviction handler is invoked. So the reason is > essentially: > > 1) readpages() is called while a reference on the inode is held, so > eviction can not be triggered before readpages() returns. It also > locks one or more ranges in the inode's io_tree (which is done at > extent_io.c:__do_contiguous_readpages()); > > 2) readpages() submits several read bios, all with an end io callback > that runs extent_io.c:end_bio_extent_readpage() and that is executed > by other task when a bio finishes, corresponding to a work queue > (fs_info->end_io_workers) worker kthread. This callback unlocks > the ranges in the inode's io_tree that were previously locked in > step 1; > > 3) readpages() returns, the reference on the inode is dropped; > > 4) One or more of the read bios previously submitted are still not > complete (their end io callback was not yet invoked or has not > yet finished execution); > > 5) Inode eviction is triggered (through an unlink call for example). > The inode reference count was not incremented before submitting > the read bios, therefore this is possible; > > 6) The eviction handler starts executing and enters the loop that > iterates over all extent states in the inode's io_tree; One question here: btrfs_evict_inode()-> evict_inode_truncate_pages()-> truncate_inode_pages_final() truncate_inode_pages_final() finds all pages in the mapping and lock_page() on each page, if there is a readpage running by other tasks, it should be blocked on lock_page() until end_bio_extent_readpage() does a unlock_page(). How does eviction handler bypasses this and enters 'extent_state' loop? Am I missing something? Thanks, -liubo > > 7) The loop picks one extent state record and uses its ->start and > ->end fields, after releasing the inode's io_tree spinlock, to > call lock_extent_bits() and clear_extent_bit(). The call to lock > the range [state->start, state->end] blocks because the whole > range or a part of it was locked by the previous call to > readpages() and the corresponding end io callback, which unlocks > the range was not yet executed; > > 8) The end io callback for the read bio is executed and unlocks the > range [state->start, state->end] (or a superset of that range). > And at clear_extent_bit() the extent_state record state is used > as a second argument to split_state(), which sets state->start to > a larger value; > > 9) The task executing the eviction handler is woken up by the task > executing the bio's end io callback (through clear_state_bit) and > the eviction handler locks the range > [old value for state->start, state->end]. Shortly after, when > calling clear_extent_bit(), it unlocks the range > [new value for state->start, state->end], so it ends up unlocking > only part of the range that it locked, leaving an extent state > record in the io_tree that represents the unlocked subrange; > > 10) The eviction handler loop, in its next iteration, gets the > extent_state record for the subrange that it did not unlock in the > previous step and then tries to lock it, resulting in an hang. > > So fix this by not using the ->start and ->end fields of an existing > extent_state record. This is a simple solution, and an alternative > could be to bump the inode's reference count before submitting each > read bio and having it dropped in the bio's end io callback. But that > would be a more invasive/complex change and would not protect against > other possible places that are not holding a reference on the inode > as well. Something to consider in the future. > > Many thanks to Zygo Blaxell for reporting, in the mailing list, the > issue, a set of scripts to trigger it and testing this fix. > > Reported-by: Zygo Blaxell > Tested-by: Zygo Blaxell > Signed-off-by: Filipe Manana > --- > fs/btrfs/inode.c | 26 +++++++++++++++++++++----- > 1 file changed, 21 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0020b56..ef05800 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode) > } > write_unlock(&map_tree->lock); > > + /* > + * Keep looping until we have no more ranges in the io tree. > + * We can have ongoing bios started by readpages (called from readahead) > + * that didn't get their end io callbacks called yet or they are still > + * in progress ((extent_io.c:end_bio_extent_readpage()). This means some > + * ranges can still be locked and eviction started because before > + * submitting those bios, which are executed by a separate task (work > + * queue kthread), inode references (inode->i_count) were not taken > + * (which would be dropped in the end io callback of each bio). > + * Therefore here we effectively end up waiting for those bios and > + * anyone else holding locked ranges without having bumped the inode's > + * reference count - if we don't do it, when they access the inode's > + * io_tree to unlock a range it may be too late, leading to an > + * use-after-free issue. > + */ > spin_lock(&io_tree->lock); > while (!RB_EMPTY_ROOT(&io_tree->state)) { > struct extent_state *state; > struct extent_state *cached_state = NULL; > + u64 start; > + u64 end; > > node = rb_first(&io_tree->state); > state = rb_entry(node, struct extent_state, rb_node); > - atomic_inc(&state->refs); > + start = state->start; > + end = state->end; > spin_unlock(&io_tree->lock); > > - lock_extent_bits(io_tree, state->start, state->end, > - 0, &cached_state); > - clear_extent_bit(io_tree, state->start, state->end, > + lock_extent_bits(io_tree, start, end, 0, &cached_state); > + clear_extent_bit(io_tree, start, end, > EXTENT_LOCKED | EXTENT_DIRTY | > EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | > EXTENT_DEFRAG, 1, 1, > &cached_state, GFP_NOFS); > - free_extent_state(state); > > cond_resched(); > spin_lock(&io_tree->lock); > -- > 2.1.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html