From: Steven Whitehouse <swhiteho@redhat.com>
To: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org,
Christoph Hellwig <hch@infradead.org>,
Dmitry Monakhov <dmonakhov@openvz.org>,
Dave Chinner <david@fromorbit.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
Date: Tue, 17 Dec 2013 12:17:44 +0000
Message-ID: <1387282664.2729.42.camel@menhir>
In-Reply-To: <20131217111626.GA7544@gmail.com>

Hi,
On Tue, 2013-12-17 at 19:16 +0800, Zheng Liu wrote:
> Hi Steven,
>
> On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> > Hi,
> >
> > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > Hi,
> > > >
> > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > >
> > > > > Currently we check i_size in the buffered read path after we know the
> > > > > page is uptodate. But it could return zero-filled pages to userspace
> > > > > when we do append dio writes. We could use the following code snippet
> > > > > to reproduce this problem on an ext2/3/4 filesystem.
> > > > >
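(The reproducer itself is trimmed from the quote above. Purely as an
illustration of the pattern being described, not the snippet from the
original patch, it amounts to something like the following: an appending
O_DIRECT writer racing with an O_DIRECT reader whose read falls back to
the buffered path. The file name and block size here are placeholders,
and a real reproducer would run this in a loop to hit the window.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLKSZ 4096

int main(void)
{
        void *wbuf, *rbuf;
        ssize_t n;
        int fd;

        fd = open("dio-test", O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&wbuf, BLKSZ, BLKSZ) ||
            posix_memalign(&rbuf, BLKSZ, BLKSZ))
                return 1;
        memset(wbuf, 'a', BLKSZ);

        if (fork() == 0) {              /* writer: append one block with dio */
                pwrite(fd, wbuf, BLKSZ, 0);
                _exit(0);
        }

        /* reader: a correct result is either a short read (0 bytes) or a
         * block of 'a's; a block of zeroes is the bug being reported */
        n = pread(fd, rbuf, BLKSZ, 0);
        if (n == BLKSZ && ((char *)rbuf)[0] == 0)
                printf("BUG: zero-filled data returned\n");

        wait(NULL);
        return 0;
}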
> > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > the underlying fs has not had a chance to get its own locks and update
> > > > the inode size.
> > > Hum, this isn't really about the page being uptodate or not, is it? It is
> > > more about the fact that gfs2 updates i_size only from the gfs2_readpage()
> > > function when obtaining the inode lock. Or do you know of other filesystems
> > > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > ocfs2_file_aio_read() before calling generic_file_aio_read(); can't gfs2
> > > do something similar?
> > >
> > Well, we could do that, but I don't think it makes sense really. The
> > point of having the page cache is to put it "in front" of the expensive
> > operations where possible (such as getting cluster locks). I don't think
> > it makes sense to grab and drop a cluster lock on every call to
> > generic_file_aio_read() when, for cached pages, we would not even need
> > to call ->readpage.
> >
> > The problem appears to be that somehow we are getting a page returned
> > when it should not be. So I suspect that the correct place to catch that
> > is in ->readpage(s).
> >
> > > > I suspect that the correct fix would be to implement ->launder_page to
> > > > avoid the race that you've identified here, if I've understood it
> > > > correctly,
> > > I don't understand how that would work. Can you elaborate a bit? Here the
> > > problem is that i_size gets extended by a DIO write during a buffered read
> > > (which is a fallback from a DIO read), so we return a zero-filled page
> > > instead of either a data-filled page or a short read. I don't see where
> > > launder_page() would come into the picture...
> > >
> > > Honza
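(To make the ordering described above concrete: the following is just a
deterministic toy model of the race, not kernel code; "disk", "page" and
"i_size" stand in for the real structures, and the three steps are simply
written out in the problematic order.)

#include <stdio.h>
#include <string.h>

#define PAGE_SZ 4096

static char disk[PAGE_SZ];      /* backing store for one block          */
static char page[PAGE_SZ];      /* page cache copy of that block        */
static long i_size;             /* inode size as seen by the read path  */

int main(void)
{
        /* 1. The DIO read falls back to a buffered read; ->readpage()
         *    fills the page against the old i_size (0 here), so the
         *    page stays zero-filled. */
        memset(page, 0, PAGE_SZ);

        /* 2. Meanwhile the append DIO write lands: the data goes
         *    straight to disk, bypassing the page cache, and i_size
         *    is extended. */
        memset(disk, 'a', PAGE_SZ);
        i_size = PAGE_SZ;

        /* 3. The buffered read path now samples i_size, concludes the
         *    page is inside the file, and copies it to userspace even
         *    though it never saw the new data. */
        if (i_size >= PAGE_SZ && page[0] != disk[0])
                printf("reader got a zero-filled page instead of data\n");

        return 0;
}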
> >
> > Ah, sorry. I was thinking the other way around wrt
> > read/write :( However, I still don't think that generic_file_aio_read()
> > is the right place to fix this. I note that XFS seems to pass the test,
> > and it also uses mpage_readpage and mpage_readpages as well as
> > generic_file_aio_read(), so maybe that is a clue as to where the answer
> > lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
>
> Yes, xfs can pass the test under direct IO. I suspect that the reason
> is that xfs uses ilock/iolock. But it doesn't mean that all file
> systems need to fix this issue by themselves, because the root cause is
> in the vfs layer.
>
I'm not convinced that this is the case though. The VFS is asking the fs
to provide a page via ->readpage(s), and therefore the fs should not be
providing a page (presumably marked uptodate in this case) where one
does not already exist; and if it does exist, then the content of the
page should be correct.

I don't see how this test can be done at a point before the fs has had a
chance to get its own locks and thus ensure that the inode size is
actually correct.
However, I'd like to understand what is going on at the ->readpage level,
and thus why this page gets generated incorrectly, and what actual i/o
gets generated (or not) in the problematic case.
> >
> > The other question that we've not really answered is why it is useful to
> > mix buffered and direct i/o to the same file at the same time anyway.
> > While I agree that it ought to work, I'm not convinced that it's
> > enormously useful, which is maybe why it has not been spotted before,
>
> Yes, mixing buffered reads and direct writes does seem to be useless. But
> in our production system, the issue appears with direct reads and writes.
> The application has a reader and a writer. The reader should read either
> nothing or the content that has been written by the writer. But now the
> reader gets a zero-filled page. Hence the application needs to prevent
> this issue itself, using a mutex or some other flag. That doesn't make
> sense to the userspace programmer.
>
I'm not sure I fully understand what you are saying here... if you see
the same issue using only O_DIRECT reads and writes, how are you getting
that?
> TBH, we found this issue two years ago but didn't take care of it.
> So currently the application just works around it as I said above. But it
> also issues an fsync(2) to flush the dirty pages in order to make sure
> the data has reached the disk. Obviously, the performance of this
> application is impacted by this. Now we know that we don't need to flush
> the dirty pages, but I think we'd better fix the underlying issue.
>
> Regards,
> - Zheng
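(For concreteness, the workaround described above presumably amounts to
something like the sketch below on the writer side; the helper name and
the way the reader is signalled afterwards are placeholders, not code
from the application in question.)

#include <sys/types.h>
#include <unistd.h>

/* Append a record and force it out before the reader is allowed to look. */
int append_record(int fd, const void *buf, size_t len)
{
        if (write(fd, buf, len) != (ssize_t)len)
                return -1;
        /* force the data out before the reader is signalled (via the
         * mutex/flag mentioned above); this fsync is the costly step */
        return fsync(fd);
}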
Calling fsync is a separate issue though, since that is required anyway
if you want to be sure that the data has actually reached the disk,
Steve.