From: Martin Steigerwald <Martin@lichtvoll.de>
To: bo.li.liu@oracle.com
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	"Chris Mason" <clm@fb.com>,
	miaox@cn.fujitsu.com, "Marc MERLIN" <marc@merlins.org>,
	Torbjørn <lists@skagestad.org>
Subject: Re: [PATCH] Btrfs: fix task hang under heavy compressed write
Date: Fri, 15 Aug 2014 19:51:58 +0200
Message-ID: <1811101.HnCmXdD9fm@merkaba>
In-Reply-To: <1880657.ZySjN2vcoV@merkaba>

On Thursday, 14 August 2014, 11:27:06, Martin Steigerwald wrote:
> On Wednesday, 13 August 2014, 23:20:46, Liu Bo wrote:
> > On Wed, Aug 13, 2014 at 01:54:40PM +0200, Martin Steigerwald wrote:
> > > On Tuesday, 12 August 2014, 15:44:59, Liu Bo wrote:
> > > > This has been reported and discussed for a long time, and this
> > > > hang occurs in both 3.15 and 3.16.
> > > 
> > > Liu, is this safe for testing yet?
> > 
> > Yes, I've confirmed that this hang doesn't occur after running my
> > tests for 2 days (usually it hangs within 2 hours).
> > 
> > But...
> > As Chris said in the thread, this is more of a workaround; there are
> > other potential issues that could lead to a similar deadlock.
> > 
> > I'm trying to write a real fix instead of a workaround.
> 
> Thanks, so this one goes together with the compressed write corruption
> fix? I would put them onto 3.16.1. With 3.17 I think I want to wait
> until rc2.

Okay, I am testing this patch and the compressed write corruption fix on
3.16.1 now. The v3 patch seems to be 3.17 material, as it doesn't apply
cleanly onto 3.16.1.

> > thanks,
> > -liubo
> > 
> > > Thanks,
> > > Martin
> > > 
> > > > Btrfs has now migrated to the kernel workqueue, but the
> > > > migration introduced this hang problem.
> > > > 
> > > > Btrfs has a kind of work that is queued in an ordered way, which
> > > > means that its ordered_func() must be processed in FIFO order, so
> > > > it usually looks like --
> > > > 
> > > > normal_work_helper(arg)
> > > >     work = container_of(arg, struct btrfs_work, normal_work);
> > > > 
> > > >     work->func() <---- (we name it work X)
> > > >     for ordered_work in wq->ordered_list
> > > >             ordered_work->ordered_func()
> > > >             ordered_work->ordered_free()
> > > > 
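
Side note, since the pseudocode above is quite terse: here is a minimal
userspace sketch of that ordered-work pattern. It is not the kernel
code; the names merely mirror the ones in fs/btrfs/async-thread.c, the
locking is left out, and the list handling is simplified.

#include <stdlib.h>

/* Simplified stand-ins for the kernel structures, not the real definitions. */
struct btrfs_work;
typedef void (*work_fn)(struct btrfs_work *);

struct btrfs_work {
        work_fn func;          /* the normal work */
        work_fn ordered_func;  /* must run in submission (FIFO) order */
        work_fn ordered_free;  /* frees the work item */
        struct btrfs_work *next;
};

/* Stands in for wq->ordered_list, kept in submission order. */
static struct btrfs_work *ordered_list;

/*
 * Mirrors the flow of normal_work_helper() above: run this work's own
 * func(), then drain the ordered list front to back, calling
 * ordered_func() and then ordered_free() on each entry.  Note that a
 * work item's memory goes back to the allocator in the middle of the
 * loop; that is the window the hang described below depends on.
 */
static void normal_work_helper(struct btrfs_work *work)
{
        work->func(work);

        while (ordered_list) {
                struct btrfs_work *w = ordered_list;

                ordered_list = w->next;
                w->ordered_func(w);
                w->ordered_free(w);   /* w may be freed and its address reused */
        }
}

static void noop(struct btrfs_work *w)    { (void)w; }
static void release(struct btrfs_work *w) { free(w); }

int main(void)
{
        struct btrfs_work *a = calloc(1, sizeof(*a));

        if (!a)
                return 1;
        a->func = noop;
        a->ordered_func = noop;
        a->ordered_free = release;
        ordered_list = a;

        /* Runs A->func(), then A->ordered_func() and A->ordered_free(). */
        normal_work_helper(a);
        return 0;
}
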
> > > > The hang is a rare case: first, when we look for free space, we
> > > > get an uncached block group, and then we go to read its free
> > > > space cache inode for free space information, so it will file a
> > > > readahead request:
> > > > 
> > > >     btrfs_readpages()
> > > >          for page that is not in page cache
> > > >                 __do_readpage()
> > > >                      submit_extent_page()
> > > >                            btrfs_submit_bio_hook()
> > > >                                  btrfs_bio_wq_end_io()
> > > >                                  submit_bio()
> > > >                                  end_workqueue_bio() <-- (the 1st endio)
> > > >                                       queue a work (named work Y) for
> > > >                                       the 2nd, the real, endio()
> > > > 
> > > > So the hang occurs when work Y's work_struct and work X's
> > > > work_struct happen to share the same address.
> > > > 
> > > > A bit more explanation,
> > > > 
> > > > A,B,C -- struct btrfs_work
> > > > arg   -- struct work_struct
> > > > 
> > > > kthread:
> > > > worker_thread()
> > > >     pick up a work_struct from @worklist
> > > >     process_one_work(arg)
> > > >         worker->current_work = arg;  <-- arg is A->normal_work
> > > >         worker->current_func(arg)
> > > >             normal_work_helper(arg)
> > > >                 A = container_of(arg, struct btrfs_work, normal_work);
> > > > 
> > > >                 A->func()
> > > >                 A->ordered_func()
> > > >                 A->ordered_free()  <-- A gets freed
> > > > 
> > > >                 B->ordered_func()
> > > >                     submit_compressed_extents()
> > > >                         find_free_extent()
> > > >                             load_free_space_inode()
> > > >                                 ...   <-- (the above readahead stack)
> > > >                                 end_workqueue_bio()
> > > >                                     btrfs_queue_work(work C)
> > > > 
> > > >                 B->ordered_free()
> > > > 
> > > > Since work A sits early in wq->ordered_list and there are more
> > > > ordered works queued after it, such as B->ordered_func(), A's
> > > > memory can be freed before normal_work_helper() returns. At that
> > > > point the kernel workqueue code in worker_thread() still has
> > > > worker->current_work pointing to work A->normal_work, i.e. to
> > > > arg's address.
> > > > 
> > > > Meanwhile, work C is allocated after work A is freed, and work
> > > > C->normal_work and work A->normal_work are likely to share the
> > > > same address (I confirmed this with ftrace output, so I'm not
> > > > just guessing; it's rare though).
> > > > 
> > > > When another kthread picks up work C->normal_work to process and
> > > > finds that our kthread appears to be processing it (see
> > > > find_worker_executing_work()), it treats work C as a collision
> > > > and skips it, which ends up with nobody processing work C.
> > > > 
> > > > So the situation is that our kthread is waiting forever on work C.
> > > > 
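
To restate the collision in code: as far as I understand it, the
workqueue decides whether a newly dequeued item is already being
executed by comparing the work_struct address and the callback, roughly
like the simplified check below. This only illustrates the test done
via find_worker_executing_work() and is not the real kernel code, but
since every btrfs_work uses normal_work_helper() as its callback, an
address collision alone is enough for work C to be mistaken for work A.

#include <stdbool.h>

/* Minimal stand-ins; these are not the real kernel definitions. */
struct work_struct { int pending; };
typedef void (*work_func_t)(struct work_struct *);

struct worker {
        struct work_struct *current_work;  /* address of the work being run */
        work_func_t current_func;          /* callback of that work */
};

/*
 * Rough model of the check: a work item counts as "already running on
 * this worker" if it has the same work_struct address and the same
 * callback.  Once work C is allocated at work A's old address, both
 * comparisons succeed, so C is skipped even though nobody will run it.
 */
bool deemed_already_running(const struct worker *worker,
                            const struct work_struct *work,
                            work_func_t func)
{
        return worker->current_work == work && worker->current_func == func;
}
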
> > > > The key point is that they shouldn't share the same address, so
> > > > this patch defers ->ordered_free() and does a batched free to
> > > > avoid that.
> > > > 
> > > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > > > ---
> > > > 
> > > >  fs/btrfs/async-thread.c | 12 ++++++++++--
> > > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> > > > index 5a201d8..2ac01b3 100644
> > > > --- a/fs/btrfs/async-thread.c
> > > > +++ b/fs/btrfs/async-thread.c
> > > > @@ -195,6 +195,7 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
> > > >          struct btrfs_work *work;
> > > >          spinlock_t *lock = &wq->list_lock;
> > > >          unsigned long flags;
> > > > +        LIST_HEAD(free_list);
> > > >  
> > > >          while (1) {
> > > >                  spin_lock_irqsave(lock, flags);
> > > > @@ -219,17 +220,24 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
> > > >  
> > > >                  /* now take the lock again and drop our item from the list */
> > > >                  spin_lock_irqsave(lock, flags);
> > > > -                list_del(&work->ordered_list);
> > > > +                list_move_tail(&work->ordered_list, &free_list);
> > > >                  spin_unlock_irqrestore(lock, flags);
> > > >  
> > > >                  /*
> > > >                   * we don't want to call the ordered free functions
> > > >                   * with the lock held though
> > > >                   */
> > > > +        }
> > > > +        spin_unlock_irqrestore(lock, flags);
> > > > +
> > > > +        while (!list_empty(&free_list)) {
> > > > +                work = list_entry(free_list.next, struct btrfs_work,
> > > > +                                  ordered_list);
> > > > +
> > > > +                list_del(&work->ordered_list);
> > > >                  work->ordered_free(work);
> > > >                  trace_btrfs_all_work_done(work);
> > > >          }
> > > > -        spin_unlock_irqrestore(lock, flags);
> > > >  }
> > > >  
> > > >  static void normal_work_helper(struct work_struct *arg)
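
If I read the patch right, the idea is: while the processing loop runs,
finished items are only moved onto a function-local free_list, and
their ordered_free() is called in one batch after the loop is done, so
no btrfs_work memory is handed back to the allocator while the worker
is still recorded as executing it. Below is a small userspace sketch of
that collect-then-free pattern; it uses a pthread mutex instead of the
kernel spinlock API and made-up names, so treat it as an illustration
rather than the kernel code.

#include <pthread.h>
#include <stddef.h>

/* Illustrative item type, not the kernel's struct btrfs_work. */
struct item {
        struct item *next;
        void (*process)(struct item *);   /* plays the role of ordered_func() */
        void (*release)(struct item *);   /* plays the role of ordered_free() */
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct item *ordered;   /* shared FIFO, protected by list_lock */

void run_ordered(void)
{
        struct item *free_list = NULL, **free_tail = &free_list;
        struct item *it;

        for (;;) {
                pthread_mutex_lock(&list_lock);
                it = ordered;                 /* peek at the head */
                pthread_mutex_unlock(&list_lock);
                if (!it)
                        break;

                it->process(it);

                /*
                 * Take the lock again and move the finished item onto a
                 * private free_list instead of releasing it right away.
                 */
                pthread_mutex_lock(&list_lock);
                ordered = it->next;
                it->next = NULL;
                *free_tail = it;
                free_tail = &it->next;
                pthread_mutex_unlock(&list_lock);
        }

        /*
         * Only after the whole loop is done is anything released, so no
         * item's memory (and hence no item's address) is handed back to
         * the allocator while this function is still running.
         */
        while ((it = free_list) != NULL) {
                free_list = it->next;
                it->release(it);
        }
}

With the frees deferred like this, work A's address stays occupied
until normal_work_helper() has returned, so a later work C can no
longer be allocated at the same address while worker->current_work
still points there.
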

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

Thread overview: 22+ messages
2014-08-12  7:44 [PATCH] Btrfs: fix task hang under heavy compressed write Liu Bo
2014-08-12 14:35 ` [PATCH v2] " Liu Bo
2014-08-12 14:57 ` [PATCH] " Chris Mason
2014-08-13  0:53   ` Qu Wenruo
2014-08-13 11:54 ` Martin Steigerwald
2014-08-13 13:27   ` Rich Freeman
2014-08-13 15:20   ` Liu Bo
2014-08-14  9:27     ` Martin Steigerwald
2014-08-15 17:51       ` Martin Steigerwald [this message]
2014-08-15 15:36 ` [PATCH v3] " Liu Bo
2014-08-15 16:05   ` Chris Mason
2014-08-16  7:28   ` Miao Xie
2014-08-18  7:32     ` Liu Bo
2014-08-25 14:58   ` Chris Mason
2014-08-25 15:19     ` Liu Bo
2014-08-26 10:20     ` Martin Steigerwald
2014-08-26 10:38       ` Liu Bo
2014-08-26 12:04         ` Martin Steigerwald
2014-08-26 13:02       ` Chris Mason
2014-08-26 13:20         ` Martin Steigerwald
2014-08-31 11:48           ` Martin Steigerwald
2014-08-31 15:40             ` Liu Bo
