All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>,
	linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Dmitry Monakhov <dmonakhov@openvz.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Zheng Liu <wenqing.lz@taobao.com>,
	Lukas Czerner <lczerner@redhat.com>
Subject: Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
Date: Wed, 15 Jan 2014 15:19:33 +0800	[thread overview]
Message-ID: <20140115071933.GA3449@gmail.com> (raw)
In-Reply-To: <20140114191901.GC27863@quack.suse.cz>

Hi all,

Sorry for the delay because of holiday and other issues.

On Tue, Jan 14, 2014 at 08:19:01PM +0100, Jan Kara wrote:
[...]
> > Sorry for the delay - I have been working on this and I've now figured
> > out whats really going on here. I'll describe the GFS2 case first, but
> > it seems that it is something that will affect other fs too, including
> > local fs most likely.
> > 
> > So we have one thread doing writes to a file, 1M at a time, using
> > O_DIRECT. Now with GFS2 that results in more or less a fallback to a
> > buffered write - there is one important difference however, and that is
> > that the direct i/o page invalidations still occur, just as if it were a
> > direct i/o write.
> > 
> > Thread two is doing direct i/o reads. When this results in calling
> > ->direct_IO, all is well. GFS2 does all the correct locking to ensure
> > that this is coherent with the buffered writes. There is a problem
> > though - in generic_file_aio_read() we have:
> > 
> >        /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
> >         if (filp->f_flags & O_DIRECT) {
> >                 loff_t size;
> >                 struct address_space *mapping;
> >                 struct inode *inode;
> > 
> >                 mapping = filp->f_mapping;
> >                 inode = mapping->host;
> >                 if (!count)
> >                         goto out; /* skip atime */
> >                 size = i_size_read(inode);
> >                 if (pos < size) {
> >                         retval = filemap_write_and_wait_range(mapping, pos,
> >                                         pos + iov_length(iov, nr_segs) - 1);
> >                         if (!retval) {
> >                                 retval = mapping->a_ops->direct_IO(READ, iocb,
> >                                                         iov, pos, nr_segs);
> >                         }
> > 
> > The important thing here is that it is checking the i_size with no
> > locking! This means that it can check i_size, find its the same as the
> > request read position (if the read gets in first before the next block
> > is written to the file) and then fall back to buffered read.
> > 
> > That buffered read then gets a page which is invalidated by the page
> > invalidation after the (DIO, but fallen back to buffered) write has
> > occurred. I added some trace_printk() calls to the read path and it is
> > fairly clear that is the path thats being taken when the failure occurs.
> > Also, it explains why, despite the disk having been used and reused many
> > times, the page that is returned from the failure is always 0, so its
> > not a page thats been read from disk. It also explains why I've only
> > ever seen this for the first page of the read, and the rest of the read
> > is ok.
>   Yes, race like this is something Zheng and I have been talking about from
> the beginning. Thanks for tracking down the details.
> 
> > Changing the test "if (pos < size) {" to "if (1) {" results in zero
> > failures from the test program, since it takes the ->direct_IO path each
> > time which then checks the i_size under the glock, in a race free
> > manner.
> > 
> > So far as I can tell, this can be just as much a problem for local
> > filesystems, since they may update i_size at any time too. I note that
> > XFS appears to wrap this bit of code under its iolock, so that probably
> > explains why XFS doesn't suffer from this issue.
> > 
> > So what is the correct fix... we have several options I think:
> > 
> >  a) Ignore it on the basis that its not important for normal workloads
>   Zheng has a real world usecase which hits this race and it does look
> rather normal. So don't like a) very much.

Yes, as far as I know, at least there are two applications that has met
this problem in our product system at Taobao.  So that is why I dig into
this bug and hope it can be fixed.

> 
> >  b) Simply remove the "if (pos < size) {" test and leave the fs to get
> > on with it (I know that leaves the btrfs test which also uses i_size
> > later on, but that doesn't appear to be an issue since its after we've
> > called ->direct_IO). Is it an issue that this may call an extra
> > filemap_write_and_wait_range() in some cases where it didn't before? I
> > suspect not since the extra calls would be for cases where the pages
> > were beyond the end of file anyway (according to i_size), so a no-op in
> > most cases.

I try this approach and I can confirm that it is ok for ext4.

Dave, I have send out a patch for xfstests to add a new testcase
according to this bug but I don't get any response [1].  Could you
please review it?  Thanks.

1. http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58428

Regards,
                                                - Zheng

  reply	other threads:[~2014-01-15  7:15 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-07 10:55 [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Zheng Liu
2013-12-16  9:37 ` Steven Whitehouse
2013-12-16 10:01   ` Jan Kara
2013-12-17  9:43     ` Steven Whitehouse
2013-12-17 11:16       ` Zheng Liu
2013-12-17 12:17         ` Steven Whitehouse
2013-12-17 16:41           ` Zheng Liu
2013-12-19 12:27             ` Steven Whitehouse
2013-12-19 22:44               ` Dave Chinner
2013-12-20  9:28                 ` Steven Whitehouse
2013-12-23  3:00                   ` Dave Chinner
2014-01-14 15:22                     ` Steven Whitehouse
2014-01-14 19:19                       ` Jan Kara
2014-01-15  7:19                         ` Zheng Liu [this message]
2014-01-16 15:35                           ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
2014-01-17 10:20                             ` Miklos Szeredi
2014-01-24 14:42                               ` Steven Whitehouse
2014-01-27 10:13                                 ` Jan Kara
2013-12-17 14:01       ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
2013-12-17 15:32         ` Steven Whitehouse
2013-12-17 16:39           ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140115071933.GA3449@gmail.com \
    --to=gnehzuil.liu@gmail.com \
    --cc=david@fromorbit.com \
    --cc=dmonakhov@openvz.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=lczerner@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=swhiteho@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wenqing.lz@taobao.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.