From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org,
Christoph Hellwig <hch@infradead.org>,
Dmitry Monakhov <dmonakhov@openvz.org>,
Dave Chinner <david@fromorbit.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
Date: Tue, 17 Dec 2013 19:16:26 +0800 [thread overview]
Message-ID: <20131217111626.GA7544@gmail.com> (raw)
In-Reply-To: <1387273422.2729.13.camel@menhir>
Hi Steven,
On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > Hi,
> > >
> > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > >
> > > > Currently we check i_size in the buffered read path after we know the
> > > > page is uptodate. But it could return zero-filled pages to userspace
> > > > when we do append dio writes. We could use the following code
> > > > snippet to reproduce this problem on an ext2/3/4 filesystem.
> > > >
> > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > the underlying fs has not had a chance to get its own locks and update
> > > the inode size.
> > Hum, this isn't really about page being uptodate or not, is it? It is
> > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > function when obtaining inode lock. Or do you know about other filesystem
> > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > do something similar?
> >
> Well we could do that, but I don't think it makes sense really. The
> point of having the page cache is to put it "in front" of the expensive
> operations where possible (such as getting cluster locks). I don't think
> it makes sense to grab and drop a cluster lock on every call to
> generic_file_aio_read() when, for cached pages, we would not even need
> to call ->readpage.
>
> The problem appears to be that somehow we are getting a page returned
> when it should not be. So I suspect that the correct place to catch that
> is in ->readpage(s)
>
> > > I suspect that the correct fix would be to implement ->launder_page to
> > > avoid the race that you've identified here, if I've understood it
> > > correctly,
> > I don't understand how that would work. Can you elaborate a bit? Here the
> > problem is i_size gets extended by DIO write during buffered read (which is
> > a fallback from DIO read) so we return zero-filled page instead of either
> > data filled page or short read. I don't see where launder_page() would come
> > into the picture...
> >
> > Honza
>
> Ah, sorry. I was thinking the other way around wrt to
> read/write :( However, I still don't think that generic_file_aio_read()
> is the right place to fix this. I note that XFS seems to pass the test
> and it also uses mpage_readpage and mpage_readpages as well as
> generic_file_aio_read() so maybe that is a clue as to where the answer
> lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
Yes, xfs passes the test under direct IO. I suspect the reason is that
xfs uses its ilock/iolock. But that doesn't mean every file system
needs to fix this issue by itself, because the root cause is in the vfs
layer.
>
> The other question that we've not really answered is why it is useful to
> mix buffered and direct i/o to the same file at the same time anyway.
> While I agree that it ought to work, I'm not convinced that its
> enormously useful, which is maybe why it has not been spotted before,
Yes, mixing buffered reads and direct writes seems to be of little use.
But in our production system the issue shows up even with pure direct
reads/writes. The application has a reader and a writer. The reader
should see either nothing or the content that has been written by the
writer, but instead it can get a zero-filled page. Hence the
application has to guard against this issue with a mutex or some other
flag, which makes no sense to the userspace programmer.
TBH, we found this issue two years ago but didn't take care of it at
the time. So currently the application just works around it as
described above. It also issues fsync(2) to flush the dirty pages in
order to make sure the data has reached the disk, and that flush
obviously hurts the application's performance. Now we know the flush
isn't needed, but I still think we'd better fix the underlying bug.
Regards,
- Zheng
Thread overview: 21+ messages
2013-12-07 10:55 [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Zheng Liu
2013-12-16 9:37 ` Steven Whitehouse
2013-12-16 10:01 ` Jan Kara
2013-12-17 9:43 ` Steven Whitehouse
2013-12-17 11:16 ` Zheng Liu [this message]
2013-12-17 12:17 ` Steven Whitehouse
2013-12-17 16:41 ` Zheng Liu
2013-12-19 12:27 ` Steven Whitehouse
2013-12-19 22:44 ` Dave Chinner
2013-12-20 9:28 ` Steven Whitehouse
2013-12-23 3:00 ` Dave Chinner
2014-01-14 15:22 ` Steven Whitehouse
2014-01-14 19:19 ` Jan Kara
2014-01-15 7:19 ` Zheng Liu
2014-01-16 15:35 ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
2014-01-17 10:20 ` Miklos Szeredi
2014-01-24 14:42 ` Steven Whitehouse
2014-01-27 10:13 ` Jan Kara
2013-12-17 14:01 ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
2013-12-17 15:32 ` Steven Whitehouse
2013-12-17 16:39 ` Jan Kara