async O_DIRECT vs. buffered synchronization (or lack thereof)

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Kent Overstreet <kent.overstreet@gmail.com>
To: linux-fsdevel@vger.kernel.org, Dave Chinner <david@fromorbit.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Theodore Ts'o <tytso@mit.edu>
Subject: async O_DIRECT vs. buffered synchronization (or lack thereof)
Date: Thu, 22 Oct 2015 15:42:45 -0800	[thread overview]
Message-ID: <20151022234245.GA20005@kmo-pixel> (raw)

While off reading code, I noticed something that didn't look quite right...

Look at generic_file_direct_write(), in mm/filemap.c. What the code there is
doing is:

 - dropping the range we're writing to from the page cache (writing it first if
   necessary), then
 - doing the write, then
 - invalidating that range in the pagecache again.

The second invalidate_inode_pages2() call is because we aren't holding any locks
that prevent _reads_ from pulling data into the pagecache, and we need to make
sure the pagecache is consistent with what's on disk - hence, we need to
invalidate the range we wrote _after_ our write completes, when new reads will
see the correct data.

But - if it's as async write, the write hasn't actually completed and the second
invalidate_inode_pages2() call is pointless (and it's not happening anywhere
else, either).

So - perhaps you're going to say "Why not just add a call to
invalidate_inode_pages2() in the correct spot, in the async completion path?"

Well, invalidate_inode_pages2() drops dirty pages on the floor - you need to to
call filemap_write_and_wait_range() first, and be holding something that
prevents pages from being redirtied - i_mutex. But we only hold i_mutex for IO
submission, it isn't held for the entire duration of the IO (that would suck).

This is particularly bad since O_DIRECT granularity is the fs block size - i.e.
potentially smaller than PAGE_SIZE. That buffered write that happens while the
AIO write is in flight could be going to the same page but different blocks -
but one of them is going to end up dropped/overwritten.

And another scary thing is the fact that O_DIRECT writes fall back to buffered
at the drop of a hat - writes to holes are the main case, but it can be anything
the filesystem didn't feel like implementing in the O_DIRECT case (I vaguelly
recall there being some weird btrfs case involving some weird extent situation
where it would fall back). So a user doesn't have to be intentionally doing
O_DIRECT and buffered IO at the same time.

Yet _another_ fun fact: I mentioned that for the filemap_write_and_wait_range();
invalidate_inode_pages2() sequence to work we have to be preventing pages from
being redirtied. Well, i_mutex does the job for buffered writes, but not
page faults - AFAICT page_mkwrite() would have to take i_mutex for this code to
not race with page faults, and the default page_mkwrite implementation
(filemap_page_mkwrite()) definitely does not.

It does _lock_ the page though, so if we had something that combined
filemap_write_and_wait_range() with invalidating pages, making sure to have the
page still locked when removing it from the page cache - that ought to work.

XFS does seem to attempt to get this right - its .page_mkwrite takes the inode
XFS_MMAPLOCK_SHARED lock, and the xfs truncate and fallocate code both take
XFS_MMAPLOCK_EXCL (truncate and (in particular) fcollapse also need to drop
ranges from the page cache, fcollapse is where I first noticed this particular
issue).

But AFAICT xfs's dio path does _not_ take the correct lock for this to work -
although if you look at xfs_file_dio_aio_write() they were clearly thinking
about page cache synchronization, so perhaps I'm missing something about how
xfs's locking works.

BTW, I'd love if someone could just show me that I was mistaken about some of
this and it does in fact work, a lot of this locking is damned subtle and it's
entirely possible I'm missing something crucial...

next             reply	other threads:[~2015-10-22 23:42 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-22 23:42 Kent Overstreet [this message]
2015-10-24 21:50 ` async O_DIRECT vs. buffered synchronization (or lack thereof) Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151022234245.GA20005@kmo-pixel \
    --to=kent.overstreet@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).