Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Dmitry Monakhov <dmonakhov@openvz.org>,
	Dave Chinner <david@fromorbit.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
Date: Wed, 18 Dec 2013 00:41:59 +0800	[thread overview]
Message-ID: <20131217164159.GA7331@gmail.com> (raw)
In-Reply-To: <1387282664.2729.42.camel@menhir>

On Tue, Dec 17, 2013 at 12:17:44PM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Tue, 2013-12-17 at 19:16 +0800, Zheng Liu wrote:
> > Hi Steven,
> > 
> > On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > > Hi,
> > > > > 
> > > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > > 
> > > > > > Currently we check i_size in buffered read path after we know the page
> > > > > > is update.  But it could return some zero-filled pages to the userspace
> > > > > > when we do some append dio writes.  We could use the following code
> > > > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > > > 
> > > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > > the underlying fs has not had a chance to get its own locks and update
> > > > > the inode size.
> > > >   Hum, this isn't really about page being uptodate or not, is it? It is
> > > > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > > > function when obtaining inode lock. Or do you know about other filesystem
> > > > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > > > do something similar?
> > > > 
> > > Well we could do that, but I don't think it makes sense really. The
> > > point of having the page cache is to put it "in front" of the expensive
> > > operations where possible (such as getting cluster locks). I don't think
> > > it makes sense to grab and drop a cluster lock on every call to
> > > generic_file_aio_read() when, for cached pages, we would not even need
> > > to call ->readpage.
> > > 
> > > The problem appears to be that somehow we are getting a page returned
> > > when it should not be. So I suspect that the correct place to catch that
> > > is in ->readpage(s)
> > > 
> > > > > I suspect that the correct fix would be to implement ->launder_page to
> > > > > avoid the race that you've identified here, if I've understood it
> > > > > correctly,
> > > >   I don't understand how that would work. Can you elaborate a bit? Here the
> > > > problem is i_size gets extended by DIO write during buffered read (which is
> > > > a fallback from DIO read) so we return zero-filled page instead of either
> > > > data filled page or short read. I don't see where launder_page() would come
> > > > into the picture...
> > > > 
> > > > 								Honza
> > > 
> > > Ah, sorry. I was thinking the other way around wrt to
> > > read/write :( However, I still don't think that generic_file_aio_read()
> > > is the right place to fix this. I note that XFS seems to pass the test
> > > and it also uses mpage_readpage and mpage_readpages as well as
> > > generic_file_aio_read() so maybe that is a clue as to where the answer
> > > lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
> > 
> > Yes, xfs can pass the test under direct IO.  I suspect that the reason
> > is that xfs uses ilock/iolock.  But it doesn't means that all file
> > systems need to fix this issue by themselves because the root cause is
> > in vfs layer.
> >
> I'm not convinced that this is the case though. The VFS is asking the fs
> to provide a page via ->readpage(s) and therefore the fs should not be
> providing a page (presumably marked uptodate in this case) where one
> does not already exist, and if it does exist, then the content of the
> page should be correct.

As far as I understand, ->readpage(s) should return zero-filled pages if
this page doesn't have a block mapping because the content should be '0'.

> 
> I don't see how this test can be done at a point before the fs has a
> chance to get its own locks, and thus ensuring that the inode size is
> actually correct.
> 
> However, I'd like to understand what is going on at ->readpage level,
> and thus why we get this page generated incorrectly, and what actual i/o
> gets generated (or not) in the problematic case.
> 
> > > 
> > > The other question that we've not really answered is why it is useful to
> > > mix buffered and direct i/o to the same file at the same time anyway.
> > > While I agree that it ought to work, I'm not convinced that its
> > > enormously useful, which is maybe why it has not been spotted before,
> > 
> > Yes, mixed buffered read and direct write seems to be useless.  But in
> > our product system, the issue will appear under direct read/write.  The
> > application has a reader and a writer.  The reader should read nothing
> > or the content that has been written by writer.  But now the reader gets
> > the zero-filled page.  Hence the application needs to prevent this issue
> > using a mutex or other flag.  It doesn't make sense to the userspace
> > programmer.
> > 
> I'm not sure I fully understand what you are saying here... if you see
> the same issue using only O_DIRECT reads and writes, how are you getting
> that?

Sorry, maybe I don't clarify the test program.  In commit log the test
program does direct read/write.  I found this problem because the app
developer asks me why his application doesn't work well undert direct
read/write.

> 
> > TBH, we have found this issue two year ago but we don't take care of it.
> > So currently the application just solves it as I said above.  But it will
> > issue a flush calling fsync(2) to flush the dirty page in order to make
> > sure the data has been flushed into the disk.  Obviously, the performance
> > of this application is impacted by this issue.  Now we have known that we
> > don't need to flush the dirty page but I think we'd better fix it.
> > 
> > Regards,
> >                                                 - Zheng
> 
> Calling fsync is a separate issue though, since that is required anyway
> if you want to be sure that the data has actually reached the disk,

Yes, calling fsync is another issue.  I just want to say that this issue
has impacted the application.  At least in our product system.

Thanks,
                                                - Zheng

next prev parent reply	other threads:[~2013-12-17 16:38 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-07 10:55 [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Zheng Liu
2013-12-16  9:37 ` Steven Whitehouse
2013-12-16 10:01   ` Jan Kara
2013-12-17  9:43     ` Steven Whitehouse
2013-12-17 11:16       ` Zheng Liu
2013-12-17 12:17         ` Steven Whitehouse
2013-12-17 16:41           ` Zheng Liu [this message]
2013-12-19 12:27             ` Steven Whitehouse
2013-12-19 22:44               ` Dave Chinner
2013-12-20  9:28                 ` Steven Whitehouse
2013-12-23  3:00                   ` Dave Chinner
2014-01-14 15:22                     ` Steven Whitehouse
2014-01-14 19:19                       ` Jan Kara
2014-01-15  7:19                         ` Zheng Liu
2014-01-16 15:35                           ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
2014-01-17 10:20                             ` Miklos Szeredi
2014-01-24 14:42                               ` Steven Whitehouse
2014-01-27 10:13                                 ` Jan Kara
2013-12-17 14:01       ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
2013-12-17 15:32         ` Steven Whitehouse
2013-12-17 16:39           ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131217164159.GA7331@gmail.com \
    --to=gnehzuil.liu@gmail.com \
    --cc=david@fromorbit.com \
    --cc=dmonakhov@openvz.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=swhiteho@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wenqing.lz@taobao.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.