linux-fsdevel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-fsdevel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Theodore Ts'o <tytso@mit.edu>
Subject: Re: async O_DIRECT vs. buffered synchronization (or lack thereof)
Date: Sun, 25 Oct 2015 08:50:14 +1100
Message-ID: <20151024215014.GF8773@dastard>
In-Reply-To: <20151022234245.GA20005@kmo-pixel>

On Thu, Oct 22, 2015 at 03:42:45PM -0800, Kent Overstreet wrote:
> While off reading code, I noticed something that didn't look quite right...
> 
> Look at generic_file_direct_write(), in mm/filemap.c. What the code there is
> doing is:
> 
>  - dropping the range we're writing to from the page cache (writing it first if
>    necessary), then
>  - doing the write, then
>  - invalidating that range in the pagecache again.
....

> Yet _another_ fun fact: I mentioned that for the filemap_write_and_wait_range();
> invalidate_inode_pages2() sequence to work we have to be preventing pages from
> being redirtied. Well, i_mutex does the job for buffered writes, but not
> page faults - AFAICT page_mkwrite() would have to take i_mutex for this code to
> not race with page faults, and the default page_mkwrite implementation
> (filemap_page_mkwrite()) definitely does not.
> 
> It does _lock_ the page though, so if we had something that combined
> filemap_write_and_wait_range() with invalidating pages, making sure to have the
> page still locked when removing it from the page cache - that ought to work.
> 
> XFS does seem to attempt to get this right - its .page_mkwrite takes the inode
> XFS_MMAPLOCK_SHARED lock, and the xfs truncate and fallocate code both take
> XFS_MMAPLOCK_EXCL (truncate and (in particular) fcollapse also need to drop
> ranges from the page cache, fcollapse is where I first noticed this particular
> issue).
> 
> But AFAICT xfs's dio path does _not_ take the correct lock for this to work -
> although if you look at xfs_file_dio_aio_write() they were clearly thinking
> about page cache synchronization, so perhaps I'm missing something about how
> xfs's locking works.

We can't take XFS_MMAPLOCK across direct IO submission/completion
because that creates a mmap_sem/XFS_MMAPLOCK lock order inversion:
the direct IO code calls get_user_pages(), which needs mmap_sem.

We can't put locks in the page fault path to solve this problem - I
created the XFS_MMAPLOCK locking to solve the "page faults race with
extent manipulation operations" knowing that it couldn't be used to
solve the DIO vs mmap race conditions.
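
The inversion described above can be sketched as a toy lock-order
checker. This is a userspace model, not kernel code: the names
order_graph, record_order, model_page_fault and
model_dio_holding_mmaplock are hypothetical, and the model assumes
only what the text states, namely that the fault path reaches
XFS_MMAPLOCK with mmap_sem already held, and that a direct IO path
holding XFS_MMAPLOCK would then need mmap_sem for get_user_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy lock-order checker; an illustration only, NOT kernel code. */
enum lock_id { MMAP_SEM, XFS_MMAPLOCK, NLOCKS };

struct order_graph {
	/* after[a][b] is true if some path took lock b while holding a */
	bool after[NLOCKS][NLOCKS];
};

static void graph_init(struct order_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Record that a path acquires `inner` while holding `outer`.
 * Returns true if the opposite order was recorded earlier,
 * i.e. the two paths form an ABBA inversion.
 */
static bool record_order(struct order_graph *g,
			 enum lock_id outer, enum lock_id inner)
{
	assert(outer < NLOCKS && inner < NLOCKS && outer != inner);
	g->after[outer][inner] = true;
	return g->after[inner][outer];
}

/* Page fault: mmap_sem is held, then .page_mkwrite takes XFS_MMAPLOCK. */
static bool model_page_fault(struct order_graph *g)
{
	return record_order(g, MMAP_SEM, XFS_MMAPLOCK);
}

/*
 * Hypothetical DIO path that holds XFS_MMAPLOCK across submission:
 * get_user_pages() then needs mmap_sem while XFS_MMAPLOCK is held.
 */
static bool model_dio_holding_mmaplock(struct order_graph *g)
{
	return record_order(g, XFS_MMAPLOCK, MMAP_SEM);
}
```

Calling graph_init(), then model_page_fault(), then
model_dio_holding_mmaplock() on the same graph makes the second call
report the inversion. The model only records acquisition order; it
does not distinguish shared from exclusive holders.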

There's a reason why XFS developers still say "if you mix
buffered/mmap IO on the same file as direct IO, you get to keep all
the corrupted bits" and then point users at the open(2) man page:

	Applications should avoid mixing O_DIRECT and normal I/O to
	the same file, and especially to overlapping byte regions in
	the same file.  Even when the filesystem correctly
	handles the coherency issues in this situation, overall I/O
	throughput is likely to be slower than using either mode
	alone.  Likewise, applications should avoid mixing mmap(2)
	of files with direct I/O to the same files.

We make a best-effort attempt to maintain coherency and prevent data
corruption, but we cannot guarantee coherency...
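
One way coherency can be lost is the page fault race from the quoted
mail: a fault redirties the page between the first invalidation and
the direct write. The following is a toy userspace model of that
interleaving, not kernel code. All names (toy_file, toy_race, and so
on) are hypothetical, the "page" is two bytes, and the model assumes
that invalidation simply refuses a dirty page, which simplifies the
real behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_BYTES 2	/* tiny "page" so the effect is easy to see */

/* Toy model of one file page; an illustration only, NOT kernel code. */
struct toy_file {
	char disk[PAGE_BYTES];	/* backing store */
	char cache[PAGE_BYTES];	/* the single page cache page */
	bool cached;
	bool dirty;
};

static void toy_init(struct toy_file *f, const char *initial)
{
	assert(strlen(initial) >= PAGE_BYTES);
	memcpy(f->disk, initial, PAGE_BYTES);
	memset(f->cache, 0, PAGE_BYTES);
	f->cached = false;
	f->dirty = false;
}

/* Write the whole page back to disk if it is dirty. */
static void toy_writeback(struct toy_file *f)
{
	if (f->cached && f->dirty) {
		memcpy(f->disk, f->cache, PAGE_BYTES);
		f->dirty = false;
	}
}

/* Drop a clean page; refuse a dirty one. Returns true if the page is gone. */
static bool toy_invalidate(struct toy_file *f)
{
	if (f->cached && f->dirty)
		return false;
	f->cached = false;
	return true;
}

/* mmap store: fault the page in from disk if needed, then dirty it. */
static void toy_mmap_store(struct toy_file *f, int off, char c)
{
	assert(off >= 0 && off < PAGE_BYTES);
	if (!f->cached) {
		memcpy(f->cache, f->disk, PAGE_BYTES);
		f->cached = true;
	}
	f->cache[off] = c;
	f->dirty = true;
}

/* Direct write: goes straight to disk, bypassing the cache. */
static void toy_direct_store(struct toy_file *f, int off, char c)
{
	assert(off >= 0 && off < PAGE_BYTES);
	f->disk[off] = c;
}

/* Runs the racy interleaving; returns true if the direct write survived. */
static bool toy_race(void)
{
	struct toy_file f;

	toy_init(&f, "ab");
	toy_writeback(&f);		/* flush the range ... */
	(void)toy_invalidate(&f);	/* ... and drop it from the cache */
	toy_mmap_store(&f, 0, 'X');	/* racing fault redirties the page */
	toy_direct_store(&f, 1, 'Y');	/* the direct write hits the disk */
	(void)toy_invalidate(&f);	/* second invalidate fails: page is dirty */
	toy_writeback(&f);		/* later writeback clobbers the disk */
	return f.disk[1] == 'Y';
}
```

In this model the mmap store and the direct write touch different
bytes, yet the direct write is still lost, because writeback operates
on the whole page.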

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 2+ messages
2015-10-22 23:42 async O_DIRECT vs. buffered synchronization (or lack thereof) Kent Overstreet
2015-10-24 21:50 ` Dave Chinner [this message]
