From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: Liu Bo <bo.liu@linux.alibaba.com>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.cz
Subject: Re: [PATCH RFC] Ext4: fix deadlock on dirty pages between fault and writeback
Date: Fri, 7 Dec 2018 16:20:51 +1100 [thread overview]
Message-ID: <20181207052051.GB6311@dastard> (raw)
In-Reply-To: <20181205170656.GJ30615@quack2.suse.cz>
On Wed, Dec 05, 2018 at 06:06:56PM +0100, Jan Kara wrote:
> Added MM people to CC since this starts to be relevant for them.
>
> On Fri 30-11-18 07:40:19, Dave Chinner wrote:
> > On Thu, Nov 29, 2018 at 02:00:02PM +0100, Jan Kara wrote:
> > > On Thu 29-11-18 23:02:53, Dave Chinner wrote:
> > > > As it is, this sort of lock vs reclaim inversion should be caught by
> > > > lockdep - allocations and reclaim contexts are recorded by lockdep
> > > > we get reports if we do lock A - alloc and then do reclaim - lock A.
> > > > We've always had problems with false positives from lockdep for
> > > > these situations where common XFS code can be called from GFP_KERNEL
> > > > valid contexts as well as reclaim or GFP_NOFS-only contexts, but I
> > > > don't recall ever seeing such a report for the writeback path....
> > >
> > > I think for A == page lock, XFS may have the problem (and lockdep won't
> > > notice because it does not track page locks). There are some parts of
> > > kernel which do GFP_KERNEL allocations under page lock - pte_alloc_one() is
> > > one such function which allocates page tables with GFP_KERNEL and gets
> > > called with the faulted page locked. And I believe there are others.
> >
> > Where in direct reclaim are we doing writeback to XFS?
> >
> > It doesn't happen, and I've recently proposed we remove ->writepage
> > support from XFS altogether so that memory reclaim never, ever
> > tries to write pages to XFS filesystems, even from kswapd.
>
> Direct reclaim will never do writeback but it may still wait for writeback
> that has been started by someone else. That is enough for the deadlock to
> happen. But from what you write below you seem to understand that so I just
> write this comment here so that others don't get confused.
>
> > > So direct reclaim from pte_alloc_one() can wait for writeback on page B
> > > while holding lock on page A. And if B is just prepared (added to bio,
> > > under writeback, unlocked) but not submitted in xfs_writepages() and we
> > > block on lock_page(A), we have a deadlock.
> >
> > Fundamentally, doing GFP_KERNEL allocations with a page lock
> > held violates any ordering rules we might have for multiple page
> > locking order. This is asking for random ABBA reclaim deadlocks to
> > occur, and it's not a filesystem bug - that's a bug in the page
> > table code. e.g if we are doing this in a filesystem/page cache
> > context, it's always in ascending page->index order for pages
> > referenced by the inode's mapping. Memory reclaim provides none of
> > these lock ordering guarantees.
>
> So this is where I'd like MM people to tell their opinion. Reclaim code
> tries to avoid possible deadlocks on page lock by always doing trylock on
> the page. But as this example shows it is not enough once is blocks in
> wait_on_page_writeback().
I think it only does this in a "legacy memcg" case, according to the
comment in shrink_page_list. Which is, apparently, a hack around the
fact that memcgs didn't used to have dirty page throttling. AFAIA,
balance_dirty_pages() has had memcg-based throttling for some time
now, so that kinda points to stale reclaim algorithms, right?
> > > Generally deadlocks like these will be invisible to lockdep because it does
> > > not track either PageWriteback or PageLocked as a dependency.
> >
> > And, because lockdep doesn't report it, it's not a bug that needs
> > fixing, eh?
>
> The bug definitely needs fixing IMO. Real user hit it after all...
Sorry, I left off the <sarcasm> tag. I'm so used to people ignoring
locking problems until someone adds a lockdep tag to catch that
case....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2018-12-07 5:20 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com>
2018-11-27 11:42 ` [PATCH RFC] Ext4: fix deadlock on dirty pages between fault and writeback Jan Kara
2018-11-28 20:11 ` Liu Bo
2018-11-29 8:52 ` Jan Kara
2018-11-29 12:02 ` Dave Chinner
2018-11-29 13:00 ` Jan Kara
2018-11-29 14:07 ` Zhengping Zhou
2018-11-29 15:58 ` Jan Kara
2018-11-29 20:40 ` Dave Chinner
2018-12-05 17:06 ` Jan Kara
2018-12-07 5:20 ` Dave Chinner [this message]
2018-12-07 7:16 ` Michal Hocko
2018-12-07 11:20 ` Michal Hocko
2018-12-07 18:51 ` Liu Bo
2018-12-10 17:59 ` Michal Hocko
2018-11-29 19:24 ` Liu Bo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181207052051.GB6311@dastard \
--to=david@fromorbit.com \
--cc=bo.liu@linux.alibaba.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mhocko@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).