From: Dave Chinner <david@fromorbit.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: do_sync() and XFSQA test 182 failures....
Date: Fri, 31 Oct 2008 11:12:49 +1100 [thread overview]
Message-ID: <20081031001249.GM4985@disturbed> (raw)
In-Reply-To: <20081030224625.GA18690@infradead.org>
On Thu, Oct 30, 2008 at 06:46:25PM -0400, Christoph Hellwig wrote:
> On Thu, Oct 30, 2008 at 07:50:20PM +1100, Dave Chinner wrote:
> > [*] Starting pdflush to sync data in the background when we are
> > about to start flushing ourselves is self-defeating. instead of
> > having a single thread doing optimal writeout patterns, we now
> > have two threads trying to sync the same filesystems and
> > competing with each other to write out dirty inodes. This
> > actually causes bugs in sync because pdflush is doing async
> > flushes. Hence if pdflush races and wins during the sync flush
> > part of the sync process, sync_inodes(1) will return before all
> > the data/metadata is on disk because it can't be found to be
> > waited on.
>
> Note that this is an XFS special. Every other filesystem (at least the
> major ones) rely on the VFS to write out data.
The race is based on which thread removes the remaining
dirty inodes from the sb_s_dirty list - I don't think that is XFS
specific.
> > Now the sync is _supposedly_ complete. But we still have a dirty
> > log and superblock thanks to delayed allocation that may have
> > occurred after the sync_supers() call. Hence we can immediately
> > see that we cannot *ever* do a proper sync of an XFS filesystem
> > in Linux without modifying do_sync() to do more callouts.
> >
> > Worse, XFS can also still have *dirty inodes* because sync_inodes(1)
> > will remove inodes from the dirty list in the async pass, but they
> > can get dirtied a short time later (if they had dirty data) when the
> > data I/O completes. Hence if the second sync pass completes before
> > the inode is dirtied again we'll miss flushing it. This will mean we
> > don't write inode size updates during sync. This is the same race
> > that pdflush running in the background can trigger.
>
> This is a common problem that would hit any filesystem trying to have
> some sort of ordered data or journaled data mode.
>
> > However, I have a problem - I'm an expert in XFS, not the other tens
> > of Linux filesystems so I can't begin to guess what the impact of
> > changing do_sync() would be on those many filesystems. How many
> > filesystems would such a change break? Indeed - how many are broken
> > right now by having dirty inodes and superblocks slip through
> > sync(1)?
>
> I would guess more are broken now then a change in order would break.
> Then again purely a change in order would still leave this code
> fragile as hell.
Right, which is why I want a custom ->do_sync method so I can
*guarantee* that sync works on XFS without fear breaking other
filesystems...
> > What are the alternatives? do_sync() operates above any particular
> > filesystem, so it's hard to provide a filesystem specific ->do_sync
> > method to avoid changing sync order for all filesystems. Do we
> > change do_sync() to completely sync a superblock at a time instead
> > of doing each operation across all superblocks before moving onto
> > the next operation? Is there any particular reason (e.g. performance, locking) for the current
> > method that would prevent changing to completely-sync-a-superblock
> > iteration algorithm so we can provide a custom ->do_sync method?
>
> Locking can't be the reason as we should never hold locks while
> returning from one of the VFS operations. I think it's performance
> or rather alledged performance as I think it doesn't really matter.
Ok.
> If it matters however there is an easy method to make it perform just
> as well with a proper callout - just spawn a thread for every filesystem
> to perform it in parallel.
Sure, that will be faster as long as the filesystems are on
separate devices.
As it is, once we have custom ->do_sync, XFS will probably grow
multiple threads per filesystem to sync AGs on independent spindles
in parallel.....
> > Are there any other ways that we can get a custom ->do_sync
> > method for XFS? I'd prefer a custom method so we don't have to
> > revalidate every linux filesystem, especially as XFS already has
> > everything it needs to provide it's own sync method (used for
> > freezing) and a test suite to validate it is working correctly.....
>
> I think having a method for this would be useful. And given that
> a proper sync should be exactly the same as a filesysytem freeze
> we should maybe use one method for both of those and use the chance
> to give filesystem better control over the freeze process?
Right - that's exactly where we should be going with this, I think.
I'd suggest two callouts, perhaps: ->sync_data and ->sync_metadata.
The freeze code can then still operate in two stages, and we can
also use then for separating data and inode writeback in pdflush....
FWIW, I mentioned doing this sort of thing here:
http://xfs.org/index.php/Improving_inode_Caching#Avoiding_the_Generic_pdflush_Code
I think I'll look at redoing do_sync() to provide a custom sync
method before trying to fix XFS....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2008-10-31 0:13 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-10-30 8:50 do_sync() and XFSQA test 182 failures Dave Chinner
2008-10-30 22:46 ` Christoph Hellwig
2008-10-31 0:12 ` Dave Chinner [this message]
2008-10-31 0:48 ` Dave Chinner
2008-10-31 20:37 ` Christoph Hellwig
2008-10-31 22:03 ` Dave Chinner
2008-10-31 22:19 ` Christoph Hellwig
2008-10-31 20:31 ` Christoph Hellwig
2008-10-31 21:54 ` Dave Chinner
2008-10-31 22:22 ` Christoph Hellwig
2008-10-31 4:02 ` Lachlan McIlroy
2008-10-31 13:06 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20081031001249.GM4985@disturbed \
--to=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.