All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Qu Wenruo <quwenruo@cn.fujitsu.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	xfs@oss.sgi.com, linux-mm@kvack.org,
	Michal Hocko <mhocko@kernel.org>, Ingo Molnar <mingo@kernel.org>
Subject: Re: Xfs lockdep warning with for-dave-for-4.6 branch
Date: Fri, 20 May 2016 10:17:14 +1000	[thread overview]
Message-ID: <20160520001714.GC26977@dastard> (raw)
In-Reply-To: <20160519081146.GS3193@twins.programming.kicks-ass.net>

On Thu, May 19, 2016 at 10:11:46AM +0200, Peter Zijlstra wrote:
> On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> > > 
> > > > The reason we don't have lock clases for the ilock is that we aren't
> > > > supposed to call memory reclaim with that lock held in exclusive
> > > > mode. This is because reclaim can run transactions, and that may
> > > > need to flush dirty inodes to make progress. Flushing dirty inode
> > > > requires taking the ilock in shared mode.
> > > > 
> > > > In the code path that was reported, we hold the ilock in /shared/
> > > > mode with no transaction context (we are doing a read-only
> > > > operation). This means we can run transactions in memory reclaim
> > > > because a) we can't deadlock on the inode we hold locks on, and b)
> > > > transaction reservations will be able to make progress as we don't
> > > > hold any locks it can block on.
> > > 
> > > Just to clarify; I read the above as that we cannot block on recursive
> > > shared locks, is this correct?
> > > 
> > > Because we can in fact block on down_read()+down_read() just fine, so if
> > > you're assuming that, then something's busted.
> > 
> > The transaction reservation path will run down_read_trylock() on the
> > inode, not down_read(). Hence if there are no pending writers, it
> > will happily take the lock twice and make progress, otherwise it
> > will skip the inode and there's no deadlock.  If there's a pending
> > writer, then we have another context that is already in a
> > transaction context and has already pushed the item, hence it is
> > only in the scope of the current push because IO hasn't completed
> > yet and removed it from the list.
> > 
> > > Otherwise, I'm not quite reading it right, which is, given the
> > > complexity of that stuff, entirely possible.
> > 
> > There's a maze of dark, grue-filled twisty passages here...
> 
> OK; I might need a bit more again.
> 
> So now the code does something like:
> 
> 	down_read(&i_lock);		-- lockdep marks lock as held
> 	kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
> 	  --> reclaim()
> 	     down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS
> 
> Right?

In the path that can deadlock the log, yes. It's actually way more
complex than the above, because the down_read_trylock(&i_lock) that
matters is run in a completely separate, async kthread that
xfs_trans_reserve() will block waiting for.

process context				xfsaild kthread(*)
---------------				------------------
down_read(&i_lock);		-- lockdep marks lock as held
kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
  --> reclaim()
     xfs_trans_reserve()
     ....
	  xfs_trans_push_ail()	---- called if no space in the log to kick the xfsaild into action
	  ....
       xlog_grant_head_wait()	---- blocks waiting for log space
       .....

					xfsaild_push()   ----- iterates AIL
					  grabs log item
					    lock log item
	>>>>>>>>>>>>>>>>>>>>>		      down_read_trylock(&i_lock);
					      format item into buffer
					      add to dirty buffer list
					  ....
					  submit dirty buffer list for IO
					    buffer IO started
					.....
					<async IO completion context>
					buffer callbacks
					  mark inode clean
					  remove inode from AIL
					  move tail of log forwards
					    wake grant head waiters
	<woken by log tail moving>
	<log space available>
	transaction reservation granted
     .....
     down_write(some other inode ilock)
     <modify some other inode>
     xfs_trans_commit
     .....

(*) xfsaild runs with PF_MEMALLOC context.

The problem is that if the ilock is held exclusively at GFP_KERNEL
time, the xfsaild cannot lock the inode to flush it, so if that
inode pins the tail of the log then we can't make space available
for xfs_trans_reserve and there is the deadlock.

Once xfs_trans_reserve completes, however, we'll take the ilock on
*some other inode*, and that's where the "it can't be the inode we
currently hold locked because we have references to it" and
henceit's safe to have a pattern like:

down_read(&i_lock);		-- lockdep marks lock as held
kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
  --> reclaim()
    down_write(&ilock)

because the lock within reclaim context is completely unrelated to
the lock we already hold.

Lockdep can't possibly know about this because the deadlock involves
locking contexts that *aren't doing anything wrong within their own
contexts*. It's only when you add the dependency of log space
reservation requirements needed to make forwards progress that
there's then an issue with locking and reclaim.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Qu Wenruo <quwenruo@cn.fujitsu.com>,
	xfs@oss.sgi.com, linux-mm@kvack.org,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: Xfs lockdep warning with for-dave-for-4.6 branch
Date: Fri, 20 May 2016 10:17:14 +1000	[thread overview]
Message-ID: <20160520001714.GC26977@dastard> (raw)
In-Reply-To: <20160519081146.GS3193@twins.programming.kicks-ass.net>

On Thu, May 19, 2016 at 10:11:46AM +0200, Peter Zijlstra wrote:
> On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> > > 
> > > > The reason we don't have lock clases for the ilock is that we aren't
> > > > supposed to call memory reclaim with that lock held in exclusive
> > > > mode. This is because reclaim can run transactions, and that may
> > > > need to flush dirty inodes to make progress. Flushing dirty inode
> > > > requires taking the ilock in shared mode.
> > > > 
> > > > In the code path that was reported, we hold the ilock in /shared/
> > > > mode with no transaction context (we are doing a read-only
> > > > operation). This means we can run transactions in memory reclaim
> > > > because a) we can't deadlock on the inode we hold locks on, and b)
> > > > transaction reservations will be able to make progress as we don't
> > > > hold any locks it can block on.
> > > 
> > > Just to clarify; I read the above as that we cannot block on recursive
> > > shared locks, is this correct?
> > > 
> > > Because we can in fact block on down_read()+down_read() just fine, so if
> > > you're assuming that, then something's busted.
> > 
> > The transaction reservation path will run down_read_trylock() on the
> > inode, not down_read(). Hence if there are no pending writers, it
> > will happily take the lock twice and make progress, otherwise it
> > will skip the inode and there's no deadlock.  If there's a pending
> > writer, then we have another context that is already in a
> > transaction context and has already pushed the item, hence it is
> > only in the scope of the current push because IO hasn't completed
> > yet and removed it from the list.
> > 
> > > Otherwise, I'm not quite reading it right, which is, given the
> > > complexity of that stuff, entirely possible.
> > 
> > There's a maze of dark, grue-filled twisty passages here...
> 
> OK; I might need a bit more again.
> 
> So now the code does something like:
> 
> 	down_read(&i_lock);		-- lockdep marks lock as held
> 	kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
> 	  --> reclaim()
> 	     down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS
> 
> Right?

In the path that can deadlock the log, yes. It's actually way more
complex than the above, because the down_read_trylock(&i_lock) that
matters is run in a completely separate, async kthread that
xfs_trans_reserve() will block waiting for.

process context				xfsaild kthread(*)
---------------				------------------
down_read(&i_lock);		-- lockdep marks lock as held
kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
  --> reclaim()
     xfs_trans_reserve()
     ....
	  xfs_trans_push_ail()	---- called if no space in the log to kick the xfsaild into action
	  ....
       xlog_grant_head_wait()	---- blocks waiting for log space
       .....

					xfsaild_push()   ----- iterates AIL
					  grabs log item
					    lock log item
	>>>>>>>>>>>>>>>>>>>>>		      down_read_trylock(&i_lock);
					      format item into buffer
					      add to dirty buffer list
					  ....
					  submit dirty buffer list for IO
					    buffer IO started
					.....
					<async IO completion context>
					buffer callbacks
					  mark inode clean
					  remove inode from AIL
					  move tail of log forwards
					    wake grant head waiters
	<woken by log tail moving>
	<log space available>
	transaction reservation granted
     .....
     down_write(some other inode ilock)
     <modify some other inode>
     xfs_trans_commit
     .....

(*) xfsaild runs with PF_MEMALLOC context.

The problem is that if the ilock is held exclusively at GFP_KERNEL
time, the xfsaild cannot lock the inode to flush it, so if that
inode pins the tail of the log then we can't make space available
for xfs_trans_reserve and there is the deadlock.

Once xfs_trans_reserve completes, however, we'll take the ilock on
*some other inode*, and that's where the "it can't be the inode we
currently hold locked because we have references to it" and
henceit's safe to have a pattern like:

down_read(&i_lock);		-- lockdep marks lock as held
kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
  --> reclaim()
    down_write(&ilock)

because the lock within reclaim context is completely unrelated to
the lock we already hold.

Lockdep can't possibly know about this because the deadlock involves
locking contexts that *aren't doing anything wrong within their own
contexts*. It's only when you add the dependency of log space
reservation requirements needed to make forwards progress that
there's then an issue with locking and reclaim.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2016-05-20  0:25 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-12  5:53 Xfs lockdep warning with for-dave-for-4.6 branch Qu Wenruo
2016-05-12  5:57 ` Darrick J. Wong
2016-05-12  8:03   ` Dave Chinner
2016-05-13 16:03     ` Michal Hocko
2016-05-13 16:03       ` Michal Hocko
2016-05-16 10:41       ` Peter Zijlstra
2016-05-16 10:41         ` Peter Zijlstra
2016-05-16 13:05         ` Michal Hocko
2016-05-16 13:05           ` Michal Hocko
2016-05-16 13:25           ` Peter Zijlstra
2016-05-16 13:25             ` Peter Zijlstra
2016-05-16 23:10             ` Dave Chinner
2016-05-16 23:10               ` Dave Chinner
2016-05-17 14:49               ` Peter Zijlstra
2016-05-17 14:49                 ` Peter Zijlstra
2016-05-17 22:35                 ` Dave Chinner
2016-05-17 22:35                   ` Dave Chinner
2016-05-18  7:20                   ` Peter Zijlstra
2016-05-18  7:20                     ` Peter Zijlstra
2016-05-18  8:25                     ` Michal Hocko
2016-05-18  8:25                       ` Michal Hocko
2016-05-18  9:49                       ` Peter Zijlstra
2016-05-18  9:49                         ` Peter Zijlstra
2016-05-18 11:31                         ` Michal Hocko
2016-05-18 11:31                           ` Michal Hocko
2016-05-19  8:11                   ` Peter Zijlstra
2016-05-19  8:11                     ` Peter Zijlstra
2016-05-20  0:17                     ` Dave Chinner [this message]
2016-05-20  0:17                       ` Dave Chinner
2016-06-01 13:17                       ` Michal Hocko
2016-06-01 13:17                         ` Michal Hocko
2016-06-01 18:16                         ` Peter Zijlstra
2016-06-01 18:16                           ` Peter Zijlstra
2016-06-02 14:50                           ` Michal Hocko
2016-06-02 14:50                             ` Michal Hocko
2016-06-02 15:11                             ` Peter Zijlstra
2016-06-02 15:11                               ` Peter Zijlstra
2016-06-02 15:46                               ` Michal Hocko
2016-06-02 15:46                                 ` Michal Hocko
2016-06-02 23:22                                 ` Dave Chinner
2016-06-02 23:22                                   ` Dave Chinner
2016-06-06 12:20                                   ` Michal Hocko
2016-06-06 12:20                                     ` Michal Hocko
2016-06-15  7:21                                     ` Dave Chinner
2016-06-15  7:21                                       ` Dave Chinner
2016-06-21 14:26                                       ` Michal Hocko
2016-06-21 14:26                                         ` Michal Hocko
2016-06-22  1:03                                         ` Dave Chinner
2016-06-22  1:03                                           ` Dave Chinner
2016-06-22 12:38                                           ` Michal Hocko
2016-06-22 12:38                                             ` Michal Hocko
2016-06-22 22:58                                             ` Dave Chinner
2016-06-22 22:58                                               ` Dave Chinner
2016-06-23 11:35                                               ` Michal Hocko
2016-06-23 11:35                                                 ` Michal Hocko
2016-10-06 13:04                           ` Michal Hocko
2016-10-06 13:04                             ` Michal Hocko
2016-10-17 13:49                             ` Michal Hocko
2016-10-17 13:49                               ` Michal Hocko
2016-10-19  0:33                             ` Dave Chinner
2016-10-19  0:33                               ` Dave Chinner
2016-10-19  5:30                               ` Dave Chinner
2016-10-19  5:30                                 ` Dave Chinner
2016-10-19  8:33                             ` Peter Zijlstra
2016-10-19  8:33                               ` Peter Zijlstra
2016-10-19 12:06                               ` Michal Hocko
2016-10-19 12:06                                 ` Michal Hocko
2016-10-19 21:49                                 ` Dave Chinner
2016-10-19 21:49                                   ` Dave Chinner
2016-10-20  7:15                                   ` Michal Hocko
2016-10-20  7:15                                     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160520001714.GC26977@dastard \
    --to=david@fromorbit.com \
    --cc=darrick.wong@oracle.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=quwenruo@cn.fujitsu.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.