From: Kent Overstreet <kent.overstreet@gmail.com>
To: linux-fsdevel@vger.kernel.org, Dave Chinner <david@fromorbit.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
Theodore Ts'o <tytso@mit.edu>
Subject: async O_DIRECT vs. buffered synchronization (or lack thereof)
Date: Thu, 22 Oct 2015 15:42:45 -0800 [thread overview]
Message-ID: <20151022234245.GA20005@kmo-pixel> (raw)
While off reading code, I noticed something that didn't look quite right...
Look at generic_file_direct_write(), in mm/filemap.c. What the code there is
doing is:
- dropping the range we're writing to from the page cache (writing it first if
necessary), then
- doing the write, then
- invalidating that range in the pagecache again.
The second invalidate_inode_pages2() call is because we aren't holding any locks
that prevent _reads_ from pulling data into the pagecache, and we need to make
sure the pagecache is consistent with what's on disk - hence, we need to
invalidate the range we wrote _after_ our write completes, when new reads will
see the correct data.
But - if it's as async write, the write hasn't actually completed and the second
invalidate_inode_pages2() call is pointless (and it's not happening anywhere
else, either).
So - perhaps you're going to say "Why not just add a call to
invalidate_inode_pages2() in the correct spot, in the async completion path?"
Well, invalidate_inode_pages2() drops dirty pages on the floor - you need to to
call filemap_write_and_wait_range() first, and be holding something that
prevents pages from being redirtied - i_mutex. But we only hold i_mutex for IO
submission, it isn't held for the entire duration of the IO (that would suck).
This is particularly bad since O_DIRECT granularity is the fs block size - i.e.
potentially smaller than PAGE_SIZE. That buffered write that happens while the
AIO write is in flight could be going to the same page but different blocks -
but one of them is going to end up dropped/overwritten.
And another scary thing is the fact that O_DIRECT writes fall back to buffered
at the drop of a hat - writes to holes are the main case, but it can be anything
the filesystem didn't feel like implementing in the O_DIRECT case (I vaguelly
recall there being some weird btrfs case involving some weird extent situation
where it would fall back). So a user doesn't have to be intentionally doing
O_DIRECT and buffered IO at the same time.
Yet _another_ fun fact: I mentioned that for the filemap_write_and_wait_range();
invalidate_inode_pages2() sequence to work we have to be preventing pages from
being redirtied. Well, i_mutex does the job for buffered writes, but not
page faults - AFAICT page_mkwrite() would have to take i_mutex for this code to
not race with page faults, and the default page_mkwrite implementation
(filemap_page_mkwrite()) definitely does not.
It does _lock_ the page though, so if we had something that combined
filemap_write_and_wait_range() with invalidating pages, making sure to have the
page still locked when removing it from the page cache - that ought to work.
XFS does seem to attempt to get this right - its .page_mkwrite takes the inode
XFS_MMAPLOCK_SHARED lock, and the xfs truncate and fallocate code both take
XFS_MMAPLOCK_EXCL (truncate and (in particular) fcollapse also need to drop
ranges from the page cache, fcollapse is where I first noticed this particular
issue).
But AFAICT xfs's dio path does _not_ take the correct lock for this to work -
although if you look at xfs_file_dio_aio_write() they were clearly thinking
about page cache synchronization, so perhaps I'm missing something about how
xfs's locking works.
BTW, I'd love if someone could just show me that I was mistaken about some of
this and it does in fact work, a lot of this locking is damned subtle and it's
entirely possible I'm missing something crucial...
next reply other threads:[~2015-10-22 23:42 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-10-22 23:42 Kent Overstreet [this message]
2015-10-24 21:50 ` async O_DIRECT vs. buffered synchronization (or lack thereof) Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151022234245.GA20005@kmo-pixel \
--to=kent.overstreet@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.