All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: Liu Bo <bo.liu@linux.alibaba.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.cz
Subject: Re: [PATCH RFC] Ext4: fix deadlock on dirty pages between fault and writeback
Date: Fri, 7 Dec 2018 16:20:51 +1100	[thread overview]
Message-ID: <20181207052051.GB6311@dastard> (raw)
In-Reply-To: <20181205170656.GJ30615@quack2.suse.cz>

On Wed, Dec 05, 2018 at 06:06:56PM +0100, Jan Kara wrote:
> Added MM people to CC since this starts to be relevant for them.
> 
> On Fri 30-11-18 07:40:19, Dave Chinner wrote:
> > On Thu, Nov 29, 2018 at 02:00:02PM +0100, Jan Kara wrote:
> > > On Thu 29-11-18 23:02:53, Dave Chinner wrote:
> > > > As it is, this sort of lock vs reclaim inversion should be caught by
> > > > lockdep - allocations and reclaim contexts are recorded by lockdep
> > > > we get reports if we do lock A - alloc and then do reclaim - lock A.
> > > > We've always had problems with false positives from lockdep for
> > > > these situations where common XFS code can be called from GFP_KERNEL
> > > > valid contexts as well as reclaim or GFP_NOFS-only contexts, but I
> > > > don't recall ever seeing such a report for the writeback path....
> > > 
> > > I think for A == page lock, XFS may have the problem (and lockdep won't
> > > notice because it does not track page locks). There are some parts of
> > > kernel which do GFP_KERNEL allocations under page lock - pte_alloc_one() is
> > > one such function which allocates page tables with GFP_KERNEL and gets
> > > called with the faulted page locked. And I believe there are others.
> > 
> > Where in direct reclaim are we doing writeback to XFS?
> > 
> > It doesn't happen, and I've recently proposed we remove ->writepage
> > support from XFS altogether so that memory reclaim never, ever
> > tries to write pages to XFS filesystems, even from kswapd.
> 
> Direct reclaim will never do writeback but it may still wait for writeback
> that has been started by someone else. That is enough for the deadlock to
> happen. But from what you write below you seem to understand that so I just
> write this comment here so that others don't get confused.
> 
> > > So direct reclaim from pte_alloc_one() can wait for writeback on page B
> > > while holding lock on page A. And if B is just prepared (added to bio,
> > > under writeback, unlocked) but not submitted in xfs_writepages() and we
> > > block on lock_page(A), we have a deadlock.
> > 
> > Fundamentally, doing GFP_KERNEL allocations with a page lock
> > held violates any ordering rules we might have for multiple page
> > locking order. This is asking for random ABBA reclaim deadlocks to
> > occur, and it's not a filesystem bug - that's a bug in the page
> > table code. e.g if we are doing this in a filesystem/page cache
> > context, it's always in ascending page->index order for pages
> > referenced by the inode's mapping. Memory reclaim provides none of
> > these lock ordering guarantees.
> 
> So this is where I'd like MM people to tell their opinion. Reclaim code
> tries to avoid possible deadlocks on page lock by always doing trylock on
> the page. But as this example shows it is not enough once is blocks in
> wait_on_page_writeback().

I think it only does this in a "legacy memcg" case, according to the
comment in shrink_page_list. Which is, apparently, a hack around the
fact that memcgs didn't used to have dirty page throttling. AFAIA,
balance_dirty_pages() has had memcg-based throttling for some time
now, so that kinda points to stale reclaim algorithms, right?

> > > Generally deadlocks like these will be invisible to lockdep because it does
> > > not track either PageWriteback or PageLocked as a dependency.
> > 
> > And, because lockdep doesn't report it, it's not a bug that needs
> > fixing, eh?
> 
> The bug definitely needs fixing IMO. Real user hit it after all...

Sorry, I left off the <sarcasm> tag. I'm so used to people ignoring
locking problems until someone adds a lockdep tag to catch that
case....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2018-12-07  5:26 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-30  0:22 [PATCH RFC] Ext4: fix deadlock on dirty pages between fault and writeback Liu Bo
2018-10-31 18:49 ` Liu Bo
2018-11-27 11:42 ` Jan Kara
2018-11-28 20:11   ` Liu Bo
2018-11-29  8:52     ` Jan Kara
2018-11-29 12:02       ` Dave Chinner
2018-11-29 13:00         ` Jan Kara
2018-11-29 14:07           ` Zhengping Zhou
2018-11-29 15:58             ` Jan Kara
2018-11-29 20:40           ` Dave Chinner
2018-12-05 17:06             ` Jan Kara
2018-12-07  5:20               ` Dave Chinner [this message]
2018-12-07  7:16                 ` Michal Hocko
2018-12-07 11:20                   ` Michal Hocko
2018-12-07 18:51                     ` Liu Bo
2018-12-10 17:59                       ` Michal Hocko
2018-11-29 19:24       ` Liu Bo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181207052051.GB6311@dastard \
    --to=david@fromorbit.com \
    --cc=bo.liu@linux.alibaba.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mhocko@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.