From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 14/16] xfs: align writepages to large block sizes
Date: Thu, 15 Nov 2018 08:18:18 +1100
Message-ID: <20181114211818.GW19305@dastard>
In-Reply-To: <20181114141925.GA19257@bfoster>

On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > For data integrity purposes, we need to write back the entire
> > filesystem block when asked to sync a sub-block range of the file.
> > When the filesystem block size is larger than the page size, this
> > means we need to convert single page integrity writes into whole
> > block integrity writes. We do this by extending the writepage range
> > to filesystem block granularity and alignment.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/xfs_aops.c | 14 ++++++++++++++
> > 1 file changed, 14 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index f6ef9e0a7312..5334f16be166 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> > .io_type = XFS_IO_HOLE,
> > };
> > int ret;
> > + unsigned bsize = i_blocksize(mapping->host);
> >
> > /*
> > * Refuse to write pages out if we are called from reclaim context.
> > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> > if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > return 0;
> >
> > + /*
> > + * If the block size is larger than page size, extend the incoming write
> > + * request to fsb granularity and alignment. This is a requirement for
> > + * data integrity operations and it doesn't hurt for other write
> > + * operations, so do it unconditionally.
> > + */
> > + if (wbc->range_start)
> > + wbc->range_start = round_down(wbc->range_start, bsize);
> > + if (wbc->range_end != LLONG_MAX)
> > + wbc->range_end = round_up(wbc->range_end, bsize);
> > + if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > + wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > +
>
> This latter bit causes endless writeback loops in tests such as
> generic/475 (I think I reproduced it with xfs/141 as well). The

Yup, I've seen that, but haven't fixed it yet because I still
haven't climbed out of the dedupe/clone/copy file range data
corruption hole that fsx pulled the lid off.

Basically, I can't get back to working on bs > ps until I get the
stuff we actually support working correctly first...

> writeback infrastructure samples ->nr_to_write before and after
> ->writepages() calls to identify progress. Unconditionally bumping it to
> something larger than the original value can lead to an underflow in the
> writeback code that seems to throw things off. E.g., see the following
> wb tracepoints (w/ 4k block and page size):
>
> kworker/u8:13-189 [003] ...1 317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
> kworker/u8:13-189 [003] ...1 317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
>
> The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> the same basic calculation for 'wrote.'

Easy enough to fix, just stash the originals and restore them once
done.

>
> BTW, I haven't gone through the broader set, but just looking at this
> bit what's the purpose of rounding ->nr_to_write (which is a page count)
> to a block size in the first place?

fsync on a single page range.

We write that page, allocate the block (which spans 16 pages), and
then return from writeback leaving 15/16 pages on that block still
dirty in memory. Then we force the log, pushing the allocation and
metadata to disk. Crash.

On recovery, we expose 15/16 pages of stale data because we only
wrote one of the pages over the block during fsync.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com