All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH] [RFC] writeback: fix range_cyclic writeback vs writepages deadlock
Date: Tue, 26 Jun 2018 08:37:34 +1000	[thread overview]
Message-ID: <20180625223734.GA19934@dastard> (raw)
In-Reply-To: <984063CF-92B5-40CF-AFA6-96398D8692BF@fb.com>

On Fri, Jun 22, 2018 at 11:14:20AM -0400, Chris Mason wrote:
> On 21 Jun 2018, at 23:09, Dave Chinner wrote:
> 
> >From: Dave Chinner <dchinner@redhat.com>
> >
> >We've recently seen a workload on XFS filesystems with a repeatable
> >deadlock between background writeback and a multi-process
> >application doing concurrent writes and fsyncs to a small range of a
> >file.
> >
> >
> >The underlying cause of the problem here is that range_cyclic
> >writeback is processing pages in descending index order as we
> >hold higher index pages in a structure controlled from above
> >write_cache_pages(). The write_cache_pages() caller needs to
> >be able to submit these pages for IO before write_cache_pages
> >restarts writeback at mapping index 0 to avoid wcp inverting the
> >page lock/writeback wait order.
> >
> >generic_writepages() is not susceptible to this bug as it has no
> >private context held across write_cache_pages() - filesystems using
> >this infrastructure always submit pages in ->writepage immediately
> >and so there is no problem with range_cyclic going back to mapping
> >index 0.
> >
> >However:
> >	mpage_writepages() has a private bio context,
> >	exofs_writepages() has page_collect
> >	fuse_writepages() has fuse_fill_wb_data
> >	nfs_writepages() has nfs_pageio_descriptor
> >	xfs_vm_writepages() has xfs_writepage_ctx
> >
> >All of these ->writepages implementations can hold pages under
> >writeback in their private structures until write_cache_pages()
> >returns, and hence they are all susceptible to this deadlock.
> >
> >Also worth noting is that ext4 has it's own bastardised version of
> >write_cache_pages() and so it /may/ have an equivalent deadlock. I
> >looked at the code long enough to understand that it has a similar
> >retry loop for range_cyclic writeback reaching the end of the file
> >and then promptly ran away before my eyes bled too much.  I'll leave
> >it for the ext4 developers to determine if their code is actually
> >has this deadlock and how to fix it if it has.
> >
> >There's a few ways I can see avoid this deadlock. There's probably
> >more, but these are the first I've though of:
> >
> >1. get rid of range_cyclic altogether
> >
> >2. range_cyclic always stops at EOF, and we start again from
> >writeback index 0 on the next call into write_cache_pages()
> >
> >2a. wcp also returns EAGAIN to ->writepages implementations to
> >indicate range cyclic has hit EOF. write-pages implementations can
> >then flush the current context and call wpc again to continue. i.e.
> >lift the retry into the ->writepages implementation
> 
> Btrfs has a variation of 2a in our bastardized copy of
> write_cache_pages(), it has a call to flush the bios we've built up
> before doing operations that might deadlock, including waiting for
> writeback, locking pages etc:
> 
>                         if (wbc->sync_mode != WB_SYNC_NONE) {
>                                 if (PageWriteback(page))
>                                         flush_write_bio(epd);
>                                 wait_on_page_writeback(page);
>                         }

Yeah, that's essentially what I was thinking, but it's complicated
by the fact we don't know what the private data contains in wcp.
Seems like a reasonable thing to do if you have all the ducks in a
row.

> I don't see any problems with the solution you picked instead, but
> if we run into more complex variations of this we can just pass a
> callback and a flushing cookie to generic_writepages().
> 
> Open question on the perf benefits of staring IO early while we wait
> for writeback on page X or letting our bio build bigger.

It's async background writeback. IO latency doesn't matter - it's
likely to get throttled anyway - so really the optimisations should
focus around building the most well formed bios we can....

> Open question #2, at least for btrfs we're building a single fat bio
> by adding pages to it along the way.  This is a small variation on
> plugging...we could just pull the last bio off the plug stack and
> stuff the pages in.  Then we'd get all the deadlock prevent implicit
> in plugging for free.

The problem is that there's a lot of context around the bio that the
filesystem has to maintain as well, and we can't put that on the
plug list. I'd prefer that we don't make a simple one-way IO
aggregation mechanism (the plug list) much more complicated by
allowing tasks to use it as a per-task "almost but not quite
submitted bio" stack without a lot more thought about it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2018-06-25 22:37 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-22  3:09 [PATCH] [RFC] writeback: fix range_cyclic writeback vs writepages deadlock Dave Chinner
2018-06-22 14:41 ` Brian Foster
2018-06-25 22:27   ` Dave Chinner
2018-06-26 11:28     ` Brian Foster
2018-06-22 15:14 ` Chris Mason
2018-06-22 15:14   ` Chris Mason
2018-06-25 22:37   ` Dave Chinner [this message]
2018-09-10 11:51 ` Jan Kara
2018-09-10 23:21   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180625223734.GA19934@dastard \
    --to=david@fromorbit.com \
    --cc=clm@fb.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.