From: Jan Kara <jack@suse.cz>
To: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@suse.cz>, Zheng Liu <gnehzuil.liu@gmail.com>,
linux-fsdevel@vger.kernel.org,
Christoph Hellwig <hch@infradead.org>,
Dmitry Monakhov <dmonakhov@openvz.org>,
Dave Chinner <david@fromorbit.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
Date: Tue, 17 Dec 2013 17:39:41 +0100
Message-ID: <20131217163941.GA3319@quack.suse.cz>
In-Reply-To: <1387294358.2729.82.camel@menhir>
Hi,
On Tue 17-12-13 15:32:38, Steven Whitehouse wrote:
> On Tue, 2013-12-17 at 15:01 +0100, Jan Kara wrote:
> > On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> > > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > > Hi,
> > > > >
> > > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > >
> > > > > > Currently we check i_size in the buffered read path after we know the page
> > > > > > is uptodate. But it could return some zero-filled pages to userspace
> > > > > > when we do some append dio writes. We could use the following code
> > > > > > snippet to reproduce this problem on an ext2/3/4 filesystem.
> > > > > >
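[ The snippet itself got trimmed from the quote above; roughly speaking it is
  of the following shape. This is only an illustration of the kind of
  reproducer being described, not the actual code from the patch: one process
  keeps appending to a file with O_DIRECT writes while another does O_DIRECT
  reads of the newly extended range and occasionally gets back zero-filled
  blocks. The file name "testfile" and the iteration counts are arbitrary. ]

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define BS 4096

	int main(void)
	{
		char *wbuf, *rbuf;
		int wfd, rfd, i;

		/* O_DIRECT buffers must be suitably aligned */
		posix_memalign((void **)&wbuf, BS, BS);
		posix_memalign((void **)&rbuf, BS, BS);
		memset(wbuf, 0xaa, BS);

		wfd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC |
			   O_APPEND | O_DIRECT, 0644);
		rfd = open("testfile", O_RDONLY | O_DIRECT);

		if (fork() == 0) {
			/* writer: append dio writes extending i_size */
			for (i = 0; i < 1024; i++)
				write(wfd, wbuf, BS);
			_exit(0);
		}

		/* reader: dio reads chasing the growing EOF */
		for (i = 0; i < 1024; i++) {
			ssize_t n = pread(rfd, rbuf, BS, (off_t)i * BS);
			if (n < BS) {		/* not extended yet, retry */
				i--;
				continue;
			}
			if (rbuf[0] != (char)0xaa)
				write(2, "zero-filled page\n", 17);
		}
		return 0;
	}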
> > > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > > the underlying fs has not had a chance to get its own locks and update
> > > > > the inode size.
> > > > Hum, this isn't really about the page being uptodate or not, is it? It is
> > > > more about the fact that gfs2 updates i_size only from the gfs2_readpage()
> > > > function when obtaining the inode lock. Or do you know of another filesystem
> > > > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > > ocfs2_file_aio_read() before calling generic_file_aio_read(), couldn't gfs2
> > > > do something similar?
> > > >
> > > Well we could do that, but I don't think it makes sense really. The
> > > point of having the page cache is to put it "in front" of the expensive
> > > operations where possible (such as getting cluster locks). I don't think
> > > it makes sense to grab and drop a cluster lock on every call to
> > > generic_file_aio_read() when, for cached pages, we would not even need
> > > to call ->readpage.
> > Well, but in a cluster filesystem I fail to see how you want to
> > synchronize a read from pagecache on one node with a truncate / write on another
> > node. Either the modification has to be broadcast to all nodes having the file
> > cached or the read has to check whether the data cached in pagecache is still
> > valid. How is gfs2 solving this?
> >
>
> The same way that every other page cache based filesystem does it. If
> there is a cached page that is uptodate, then it can be copied to the
> user, and the page lock is used to ensure that the state doesn't change
> under us.
If a page is uptodate, we don't ever take the page lock (see
do_generic_file_read())...
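To illustrate what I mean, the fast path looks roughly like this (a heavily
simplified paraphrase from memory, not the literal kernel source; readahead
and error handling are omitted, and the local variables are assumed to be set
up as in do_generic_file_read()):

	page = find_get_page(mapping, index);
	if (page && PageUptodate(page)) {
		/* no lock_page() anywhere on this path */
		isize = i_size_read(inode);
		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
		if (index > end_index)
			goto out;	/* read starts beyond EOF */
		/*
		 * 'nr' is computed from isize so that the zero-filled
		 * tail of the last page is not copied to userspace.
		 */
		ret = file_read_actor(desc, page, offset, nr);
	}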
> If there is no cached page, then the vfs has to ask the
> filesystem to provide one, and then the cluster locks (glocks) are
> taken. Those locks are then cached until another node, or memory
> pressure, results in them being dropped.
>
> GFS2's internal locking via the glock layer will invalidate all cached
> pages before it drops the glock to a mode other than shared or
> exclusive. It will also write back all dirty data when dropping an
> exclusive lock. For dio we use the "deferred" mode which is basically a
> shared mode that is incompatible with the normal shared mode - it allows
> shared caching of metadata, but no caching of data. GFS2 reverts to
> buffered I/O in order to deal with cases where we need to allocate
> blocks, including file extension.
OK, thanks for the explanation. Now I see how you avoid the races I was
thinking about.
> > > The problem appears to be that somehow we are getting a page returned
> > > when it should not be. So I suspect that the correct place to catch that
> > > is in ->readpage(s)
> > There are two things which aren't ideal:
> > 1) DIO writes and buffered reads can race, resulting in reads returning
> > zero-filled pages. Fixing this corner case would slow down the common case,
> > so we decided to put this race in the category of "don't do it when it
> > hurts".
>
> Well, if I alter the test program to use buffered reads, then it takes
> much longer to hit the race. With the original O_DIRECT reads, it
> is not a great surprise that not all the write i/o has actually hit disk
> at the time of the read, since as you say there is no synchronization
> between the reads and writes.
>
> So that makes sense to me.
>
> > 2) DIO read falls back to buffered read on failure - this is there to
> > simplify the life of filesystems which don't support DIO at all or do not
> > handle some special cases like DIO from holes, non-block-aligned DIO, etc.
> >
> > Combination of these two leads to the problem that DIO read can return
> > zero-filled page when racing with DIO write. With enough luck this can
> > even lead to exposure of uninitialized data:
> >   DIO WRITE                            BUFFERED READ
> >   - allocates blocks
> >                                        - sees blocks mapped so it reads data
> >   - submits data write
> >   - waits for IO completion
> >   - extends i_size
> >                                        - sees large i_size so hands over
> >                                          data to userspace
> >
> > Now without proper serialization of reads & writes this is hard to avoid
> > (we plan range locks for that but that's still blocked by some mm changes).
> > But checking i_size before reading data at least makes the race *much*
> > less serious (you have to have two threads modifying i_size in
> > parallel to the running read, and since these threads are synchronized on
> > i_mutex, some races become impossible; the ones returning 0s are still
> > possible but harder to trigger).
> >
> Well, I agree with pretty much all of that except the part about adding
> in the extra i_size test. It may make the race harder to hit, but it
> doesn't eliminate it. Instead it means that we'll have to add a bunch of
> extra code into the fast path just to ensure that it is uptodate.
>
> The reason that the i_size check already in
> do_generic_file_read() is ok is that there is a locked page at the time
> of the i_size_read call, as per the comment:
>
> /*
> * i_size must be checked after we know the page is Uptodate.
> *
> * Checking i_size after the check allows us to calculate
> * the correct value for "nr", which means the zero-filled
> * part of the page is not copied back to userspace (unless
> * another truncate extends the file - this is desired though).
> */
Note that the page is *not* locked at the time of the i_size check. And
I'm aware of the comment - however, that seems to talk about the case where
truncate() makes the file smaller. If we checked i_size only before having
the page uptodate, we could see the i_size value from before the truncate and
thus return to userspace data that has already been zeroed out by
block_truncate_page() or its equivalent. Actually, the race still seems to be
there, as we could read i_size before the truncate begins, and if we take a
long enough nap just after that, file_read_actor() could copy data already
zeroed out by block_truncate_page(). That just doesn't seem likely to happen
in practice.
So you are right that we might be better off holding the page lock over the
i_size check and the copy to userspace. However, that isn't really possible as
it could deadlock - consider an evil program that does
buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
read(fd, buf, 4096);
In that case file_read_actor() will hit a page fault when copying data to
userspace, and to satisfy the page fault we need the very same page lock
that we would hold over the i_size check. This is the reason why the write path
goes through the hoops of ->write_begin, ->write_end, and prefaulting
pages.
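Spelled out as a complete program, that evil case would look something like
the following - just an illustration with error checking omitted; it assumes
a file "somefile" that already exists and is at least one page long:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("somefile", O_RDWR);
		/* map the first page of the file */
		char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		/*
		 * Read the first page of the file into its own mapping:
		 * copying into 'buf' faults on the very page that the
		 * read path would be holding locked over the i_size
		 * check, so holding the page lock there would deadlock.
		 */
		read(fd, buf, 4096);
		return 0;
	}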
> > Now we could check i_size in ->readpage / ->readpages, but conceptually that
> > seems like the wrong thing to do since readpage only cares about buffers being
> > mapped. OTOH generic_file_aio_read() et al. already handle i_size checks.
> > And GFS2 seems to be the only fs that has problems with a stale i_size (OCFS2
> > and NFS both revalidate it in their .aio_read callbacks). So I'd be
> > happier if we could handle this inside GFS2...
> >
> I would argue that GFS2 is doing this the right way in this case though
> - it makes no sense to put the i_size check into the fast path, when it
> is only required in the case that there is no cached page (i.e. the slow
> path). In addition, putting it in the slow path (where the fs can take
> its own locks) also fixes the race, rather than just making it more
> difficult to hit.
I disagree here: the i_size check is necessary regardless of whether the page
is uptodate or not. For local filesystems, truncate can happen completely
independently of the read...
> So far as I can tell the VFS code was originally intended to work in
> exactly this way, however it is not that obvious that this is the case;
> maybe better documentation is required here.
Originally, VFS never thought about clustered filesystems, so i_size was
always uptodate. We only had to take care of truncate changing i_size
while a read was running. And to mitigate problems with that, we needed to
check i_size after the page is uptodate.
> If we did add in some GFS2 code to update i_size ahead of
> generic_file_aio_read(), then we could only do that by grabbing and
> dropping the glock. That is a relatively expensive operation, and it still
> doesn't fix the race, as we must drop the glock before calling
> generic_file_aio_read() since we must not hold glocks over the
> copy-to-user stage.
Is it that expensive even if you have the glock already cached? Because
if you have a cached page, you have to have a cached glock as well, if I
understand the GFS2 design right.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR