From: Jan Kara <jack@suse.cz>
To: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@suse.cz>, Zheng Liu <gnehzuil.liu@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Dmitry Monakhov <dmonakhov@openvz.org>,
	Dave Chinner <david@fromorbit.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
Date: Tue, 17 Dec 2013 15:01:06 +0100
Message-ID: <20131217140106.GA28330@quack.suse.cz>
In-Reply-To: <1387273422.2729.13.camel@menhir>

  Hi,

On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > Currently we check i_size in the buffered read path after we know the page
> > > > is uptodate.  But it could return some zero-filled pages to userspace
> > > > when we do some append dio writes.  We could use the following code
> > > > snippet to reproduce this problem on an ext2/3/4 filesystem.
> > > > 
> > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > the underlying fs has not had a chance to get its own locks and update
> > > the inode size.
> >   Hum, this isn't really about the page being uptodate or not, is it? It is
> > more about the fact that gfs2 updates i_size only from the gfs2_readpage()
> > function when obtaining the inode lock. Or do you know of other filesystems
> > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > ocfs2_file_aio_read() before calling generic_file_aio_read(); can't gfs2
> > do something similar?
> > 
> Well we could do that, but I don't think it makes sense really. The
> point of having the page cache is to put it "in front" of the expensive
> operations where possible (such as getting cluster locks). I don't think
> it makes sense to grab and drop a cluster lock on every call to
> generic_file_aio_read() when, for cached pages, we would not even need
> to call ->readpage.
  Well, but in a cluster filesystem I fail to see how you want to
synchronize reads from the pagecache on one node with a truncate / write on
another node. Either the modification has to be broadcast to all nodes that
have the file cached, or the read has to check whether the data cached in the
pagecache is still valid. How does gfs2 solve this?

> The problem appears to be that somehow we are getting a page returned
> when it should not be. So I suspect that the correct place to catch that
> is in ->readpage(s)
  There are two things which aren't ideal:
1) DIO writes and buffered reads can race, resulting in reads returning
zero-filled pages. Fixing this corner case would slow down the common case,
so we decided to put this race in the category of "don't do it when it
hurts".
2) A DIO read falls back to a buffered read on failure - this is there to
simplify the life of filesystems which don't support DIO at all or don't
handle some special cases like DIO from holes, non-block-aligned DIO, etc.
(see the sketch below).
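
To make 2) concrete, here is a minimal sketch of the fallback shape (this is
not the actual generic_file_aio_read() code; do_direct_read() and
do_buffered_read() are hypothetical stand-ins for the two paths):

static ssize_t sketch_read(struct file *filp, char __user *buf,
			   size_t count, loff_t *ppos)
{
	ssize_t done = do_direct_read(filp, buf, count, *ppos);

	if (done < 0 || (size_t)done == count)
		return done;	/* error, or the DIO covered everything */

	/*
	 * Short or unsupported direct read (hole, unaligned tail, fs
	 * without DIO support): finish through the page cache.  This
	 * buffered tail is where the race with an append DIO write
	 * can bite.
	 */
	return done + do_buffered_read(filp, buf + done, count - done,
				       *ppos + done);
}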

The combination of these two leads to the problem that a DIO read can return
a zero-filled page when racing with a DIO write. With enough luck this can
even lead to exposure of uninitialized data:
  DIO WRITE           BUFFERED READ
- allocates blocks
                      - sees blocks mapped so it reads data
- submits data write
- waits for IO completion
- extends i_size
                      - sees large i_size so hands over data to userspace
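
For reference, a reproducer of roughly this shape shows the symptom as
fully-zeroed blocks coming back on the read side (a sketch, not the test
program from the patch: the path, block size and iteration count are made up,
error handling is omitted; build with cc -pthread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	4096
#define NBLOCKS	1024

static const char *path = "/mnt/test/dio-race";	/* assumed test file */

static void *writer(void *arg)
{
	char *buf;
	int fd = open(path, O_WRONLY | O_DIRECT);

	posix_memalign((void **)&buf, BLKSZ, BLKSZ);
	memset(buf, 0xaa, BLKSZ);
	/* Append-style DIO writes extending i_size one block at a time. */
	for (int i = 0; i < NBLOCKS; i++)
		pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ);
	close(fd);
	return NULL;
}

static void *reader(void *arg)
{
	char *buf;
	int fd = open(path, O_RDONLY | O_DIRECT);

	posix_memalign((void **)&buf, BLKSZ, BLKSZ);
	for (int i = 0; i < NBLOCKS; i++) {
		ssize_t n = pread(fd, buf, BLKSZ, (off_t)i * BLKSZ);

		/* A full-length read that is all zeroes is the symptom. */
		if (n == BLKSZ && buf[0] == 0 &&
		    !memcmp(buf, buf + 1, BLKSZ - 1))
			fprintf(stderr, "zero-filled block at %d\n", i);
	}
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	close(open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644));
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}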

  Now, without proper serialization of reads & writes this is hard to avoid
(we plan range locks for that, but those are still blocked by some mm
changes). But checking i_size before reading the data at least makes the race
*much* less serious: you need two threads modifying i_size in parallel to the
running read, and since those threads are serialized on i_mutex, some of the
races become impossible; the ones returning zeros are still possible, but
harder to trigger.
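
The kind of check I mean is roughly this (a sketch only, not the actual
patch): sample i_size once and never hand out bytes beyond it, even if the
page itself is uptodate.

/* Sketch: clamp the copy against a single i_size sample. */
static size_t bytes_safe_to_copy(long long isize, long long pos,
				 size_t page_bytes)
{
	if (pos >= isize)
		return 0;			/* page lies beyond EOF */
	if (pos + (long long)page_bytes > isize)
		return (size_t)(isize - pos);	/* partial page at EOF */
	return page_bytes;
}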

Now we could check i_size in ->readpage / ->readpages, but conceptually that
seems like the wrong thing to do, since readpage only cares about buffers
being mapped. OTOH generic_file_aio_read() et al. already handle i_size
checks. And GFS2 seems to be the only fs that has problems with a stale
i_size (OCFS2 and NFS both revalidate it in their .aio_read callbacks). So
I'd be happier if we could handle this inside GFS2...
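
Something along these lines, i.e. a hypothetical .aio_read wrapper in the
filesystem (fs_revalidate_size() is a made-up name for whatever
lock-and-refresh operation is appropriate; this is not GFS2 or OCFS2 code):

static ssize_t myfs_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
				  unsigned long nr_segs, loff_t pos)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	int ret;

	/* Refresh i_size (e.g. take and drop the cluster lock) before
	 * the generic path starts trusting it. */
	ret = fs_revalidate_size(inode);
	if (ret)
		return ret;

	return generic_file_aio_read(iocb, iov, nr_segs, pos);
}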

> > > I suspect that the correct fix would be to implement ->launder_page to
> > > avoid the race that you've identified here, if I've understood it
> > > correctly,
> >   I don't understand how that would work. Can you elaborate a bit? Here the
> > problem is that i_size gets extended by a DIO write during a buffered read
> > (which is a fallback from a DIO read), so we return a zero-filled page
> > instead of either a data-filled page or a short read. I don't see where
> > launder_page() would come into the picture...
> > 
> > 								Honza
> 
> Ah, sorry. I was thinking the other way around wrt
> read/write :( However, I still don't think that generic_file_aio_read()
> is the right place to fix this. I note that XFS seems to pass the test
> and it also uses mpage_readpage and mpage_readpages as well as
> generic_file_aio_read() so maybe that is a clue as to where the answer
> lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
  It is a good question why XFS doesn't have the problem. Looking into
the code, XFS doesn't use i_mutex but rather its private rw-sem for
serialization (XFS_IOLOCK). Even reads hold the semaphore in shared mode,
and as the test program is written, DIO reads will hold it in exclusive
mode for a while as well. I guess that provides enough serialization that
the test program doesn't hit the race (although the race still seems
possible).
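
As a userspace analogy of what the IOLOCK buys (this is not XFS code, just a
pthread rwlock illustration of shared readers vs. an exclusive writer):

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t iolock = PTHREAD_RWLOCK_INITIALIZER;
static long long file_size;	/* stands in for i_size */

static void append_write(long long bytes)
{
	pthread_rwlock_wrlock(&iolock);	/* exclusive, like XFS_IOLOCK_EXCL */
	file_size += bytes;		/* write data, then extend "i_size" */
	pthread_rwlock_unlock(&iolock);
}

static long long read_size(void)
{
	long long sz;

	pthread_rwlock_rdlock(&iolock);	/* shared, like XFS_IOLOCK_SHARED */
	sz = file_size;			/* cannot change under us here */
	pthread_rwlock_unlock(&iolock);
	return sz;
}

int main(void)
{
	append_write(4096);
	printf("size now %lld\n", read_size());
	return 0;
}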

> The other question that we've not really answered is why it is useful to
> mix buffered and direct i/o to the same file at the same time anyway.
> While I agree that it ought to work, I'm not convinced that its
> enormously useful, which is maybe why it has not been spotted before,
  As I wrote above, mixing buffered and direct IO isn't really supported
(although we should fix the data exposure race). But DIO read vs DIO write
sounds sensible enough to support. And when DIO read falls back to buffered
read, we have problems...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

Thread overview: 21+ messages
2013-12-07 10:55 [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Zheng Liu
2013-12-16  9:37 ` Steven Whitehouse
2013-12-16 10:01   ` Jan Kara
2013-12-17  9:43     ` Steven Whitehouse
2013-12-17 11:16       ` Zheng Liu
2013-12-17 12:17         ` Steven Whitehouse
2013-12-17 16:41           ` Zheng Liu
2013-12-19 12:27             ` Steven Whitehouse
2013-12-19 22:44               ` Dave Chinner
2013-12-20  9:28                 ` Steven Whitehouse
2013-12-23  3:00                   ` Dave Chinner
2014-01-14 15:22                     ` Steven Whitehouse
2014-01-14 19:19                       ` Jan Kara
2014-01-15  7:19                         ` Zheng Liu
2014-01-16 15:35                           ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
2014-01-17 10:20                             ` Miklos Szeredi
2014-01-24 14:42                               ` Steven Whitehouse
2014-01-27 10:13                                 ` Jan Kara
2013-12-17 14:01       ` Jan Kara [this message]
2013-12-17 15:32         ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Steven Whitehouse
2013-12-17 16:39           ` Jan Kara
